Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)

From:	Tim Ruehsen
Subject:	Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)
Date:	Tue, 15 Dec 2015 09:39:23 +0100
User-agent:	KMail/4.14.10 (Linux/4.3.0-1-amd64; KDE/4.14.14; x86_64; ; )

On Monday 14 December 2015 22:15:32 Tim Rühsen wrote:
> On Monday, 14 December 2015, 21:58:59, Eli Zaretskii wrote:
> > > From: Tim Rühsen <address@hidden>
> > > Date: Mon, 14 Dec 2015 20:22:41 +0100
> > > 
> > > > >  1. The functions that call 'iconv' (in iri.c) don't make a point of
> > > > >     flushing the last portion of the converted URL after 'iconv'
> > > >     returns successfully having converted the input string in its
> > > >     entirety.  IME, you need then to call 'iconv' one last time with
> > > >     either the 2nd or the 3rd argument set to NULL, otherwise
> > > >     sometimes the last converted character doesn't get output.  In my
> > > >     case, some URLs converted from CP1255 to UTF-8 lost their last
> > > >     character.  It sounds like no one has actually used this
> > > >     conversion in iri.c, except for trivially converting UTF-8 to
> > > >     itself.  Is that possible/reasonable?
> > > 
> > > Possibly.
> > > Could you please give an example string ? I would like to test it on
> > > GNU/Linux, BSD and Solaris to see if the output is always the same.
> > 
> > This is what gave me trouble:
> > 
> > https://he.wikipedia.org/wiki/%F9._%F9%F4%F8%E4
> > 
> > This is https://he.wikipedia.org/wiki/ש._שפרה that Andries was using
> > in his tests, but it's encoded in CP1255 (and hex-encoded after that).
> > Try converting it into UTF-8, and you will get the last character
> > chopped off after 'iconv' returns.  Or at least that's what happens
> > for me.
> > 
> > > > >  2. Wget assumes that the URL given on its command line is encoded in
> > > > >     the locale's encoding.  This is a good assumption when the user
> > > >     herself types the URL at the shell prompt, but not when the URL is
> > > >     copy-pasted from a browser's address bar.  In the latter case, the
> > > >     URL tends to be in UTF-8 (sometimes hex-encoded).  At least that's
> > > >     what I get from Firefox.  We don't seem to have in wget any
> > > >     facilities to specify a separate (3rd) encoding for the URLs on
> > > >     the command line, do we?
> > > 
> > > I stumbled upon this a while ago when thinking about the design of
> > > wget2.
> > > And wget2 already has a working --input-encoding option for such cases.
> > > AFAIK, nobody asked for such an option during the last years - so I
> > > assume this to be a somewhat 'expert' or 'fancy' option, at least a low
> > > priority one. It is an optional goodie.
> > 
> > IMO, it's a sorely missing feature, since copy/pasting URLs from a
> > browser is something people do very often.  I do it all the time,
> > because many times wget is much better in downloading large files than
> > a browser.
> 
> Arg, one step back please (my fault).
> What you are looking for is --local-encoding. That is the encoding of the
> URLs given on the command line.
> --input-encoding specifies the encoding of an (additional) input file and/or
> input from stdin.
> 
> wget converts your example correctly (with --local-encoding=cp1255):
> 
> converted 'https://he.wikipedia.org/wiki/%F9._%F9%F4%F8%E4' (CP1255) ->
> 'https://he.wikipedia.org/wiki/ש._שפר' (UTF-8)
> 
> Also wget2, which uses iconv() differently than wget:
> 
> 14.220742.058 converted 'https://he.wikipedia.org/wiki/�._����' (CP1255) ->
> 'https://he.wikipedia.org/wiki/ש._שפר' (utf-8)
> 14.220742.058 converted 'ש._שפר' (utf-8) -> '�._���' (CP1255)

I should not write posts while doing homework with the kids and playing with 
the dog :-(

You are right, ה is missing.

> IME, you need then to call 'iconv' one last time with
> either the 2nd or the 3rd argument set to NULL, otherwise
> sometimes the last converted character doesn't get output.

I'll give it a try.
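
For reference, a minimal sketch of the extra flush call Eli describes. This is
not the code in iri.c; convert_string() is a hypothetical helper, and the fixed
output buffer size is an assumption that is generous for UTF-8 targets only (a
real implementation would grow the buffer on E2BIG). The second iconv() call
passes NULL for the input arguments, which asks the converter to write out
anything it is still holding in its conversion state.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not taken from wget: convert the NUL-terminated
   string 'in' from encoding 'from' to encoding 'to'.
   Returns a malloc'ed string, or NULL on any error. */
static char *convert_string(const char *from, const char *to, const char *in)
{
    iconv_t cd = iconv_open(to, from);
    if (cd == (iconv_t)-1)
        return NULL;

    size_t inleft = strlen(in);
    size_t outsize = inleft * 4 + 16;   /* assumption: enough for UTF-8 output */
    char *out = malloc(outsize + 1);    /* +1 for the terminating NUL */
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *inp = (char *)in;             /* iconv() takes a non-const char ** */
    char *outp = out;
    size_t outleft = outsize;

    /* Step 1: convert the whole input string. */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) == (size_t)-1
        /* Step 2: flush. With a NULL input, iconv() emits whatever is
           still buffered in the conversion state (here, possibly the
           last character) and resets the state. */
        || iconv(cd, NULL, NULL, &outp, &outleft) == (size_t)-1) {
        free(out);
        iconv_close(cd);
        return NULL;
    }

    *outp = '\0';
    iconv_close(cd);
    return out;
}
```

Calling convert_string("CP1255", "UTF-8", "\xF9._\xF9\xF4\xF8\xE4") on the
unescaped path from the example URL should then give back all four Hebrew
letters, including the final ה. Whether the character is held back without the
flush presumably depends on the iconv implementation, which is why testing on
several platforms still makes sense.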

Tim
