Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)


From: Tim Rühsen
Subject: Re: [Bug-wget] URL encoding issues (Was: GNU wget 1.17.1 released)
Date: Mon, 14 Dec 2015 20:22:41 +0100
User-agent: KMail/4.14.10 (Linux/4.2.0-1-amd64; KDE/4.14.14; x86_64; ; )

On Monday, 14 December 2015, 18:33:38, Eli Zaretskii wrote:
> > Date: Sun, 13 Dec 2015 20:04:31 +0100
> > From: "Andries E. Brouwer" <address@hidden>
> > Cc: "Andries E. Brouwer" <address@hidden>, address@hidden
> > 
> > On Sun, Dec 13, 2015 at 08:01:27PM +0200, Eli Zaretskii wrote:
> > > If no one is going to pick up the gauntlet, I will sit down and do it
> > > myself, although I'm terribly busy with Emacs 25.1 release.
> > 
> > Good!
> 
> While working on this, I bumped into 2 related issues:
> 
>  1. The functions that call 'iconv' (in iri.c) don't make a point of
>     flushing the last portion of the converted URL after 'iconv'
>     returns successfully having converted the input string in its
>     entirety.  IME, you need then to call 'iconv' one last time with
>     either the 2nd or the 3rd argument set to NULL, otherwise
>     sometimes the last converted character doesn't get output.  In my
>     case, some URLs converted from CP1255 to UTF-8 lost their last
>     character.  It sounds like no one has actually used this
>     conversion in iri.c, except for trivially converting UTF-8 to
>     itself.  Is that possible/reasonable?

Possibly.
Could you please give an example string? I would like to test it on
GNU/Linux, BSD and Solaris to see if the output is always the same.
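
For reference, the flushing pattern described in point 1 could look roughly
like the sketch below. This is not the code in iri.c: convert_with_flush is a
made-up helper name, and the fixed-size output buffer is a simplification for
illustration. The relevant part is the second iconv() call with a NULL input,
which asks the converter to write out anything it is still holding back.

```c
#include <iconv.h>
#include <string.h>

/* Hypothetical helper (not wget's actual code): converts the NUL-terminated
   string `in` from `from_enc` to `to_enc` and stores a NUL-terminated result
   in `out`, which has room for `outsize` bytes.
   Returns 0 on success, -1 on any failure (including a too-small buffer). */
int convert_with_flush(const char *from_enc, const char *to_enc,
                       const char *in, char *out, size_t outsize)
{
    if (in == NULL || out == NULL || outsize == 0)
        return -1;

    iconv_t cd = iconv_open(to_enc, from_enc);
    if (cd == (iconv_t)-1)
        return -1;

    char *inp = (char *)in;        /* iconv() wants char **, input is not modified */
    size_t inleft = strlen(in);
    char *outp = out;
    size_t outleft = outsize - 1;  /* keep one byte for the terminating NUL */
    int rc = -1;

    /* First call converts the input; second call, with a NULL input,
       flushes whatever output the converter has not yet written. */
    if (iconv(cd, &inp, &inleft, &outp, &outleft) != (size_t)-1
        && iconv(cd, NULL, NULL, &outp, &outleft) != (size_t)-1) {
        *outp = '\0';
        rc = 0;
    }

    iconv_close(cd);
    return rc;
}
```

A call such as convert_with_flush("CP1255", "UTF-8", url, buf, sizeof buf)
would correspond to the case reported above, provided the platform's iconv
knows the CP1255 name.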


>  2. Wget assumes that the URL given on its command line is encoded in
>     the locale's encoding.  This is a good assumption when the user
>     herself types the URL at the shell prompt, but not when the URL is
>     copy-pasted from a browser's address bar.  In the latter case, the
>     URL tends to be in UTF-8 (sometimes hex-encoded).  At least that's
>     what I get from Firefox.  We don't seem to have in wget any
>     facilities to specify a separate (3rd) encoding for the URLs on
>     the command line, do we?

I stumbled upon this a while ago when thinking about the design of wget2, and
wget2 already has a working --input-encoding option for such cases.
AFAIK, nobody has asked for such an option in recent years, so I assume
it is a somewhat 'expert' or 'fancy' option, or at least a low-priority one.
It is an optional goodie.
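
To illustrate the copy-paste case from point 2: a URL taken from a browser is
often percent-encoded ("hex-encoded") UTF-8, whatever the locale says. The
sketch below only shows how one could check for that; it is neither what wget
does nor how wget2 implements --input-encoding. Both function names are
invented for the example.

```c
#include <ctype.h>
#include <stddef.h>

/* Value of one hexadecimal digit, or -1 if `c` is not a hex digit. */
static int hex_val(unsigned char c)
{
    if (c >= '0' && c <= '9') return c - '0';
    c = (unsigned char)tolower(c);
    if (c >= 'a' && c <= 'f') return c - 'a' + 10;
    return -1;
}

/* Hypothetical helper: decodes %XX sequences in place and returns the
   decoded length. Malformed escapes are copied through unchanged.
   Note that "%00" yields an embedded NUL, so use the returned length. */
size_t percent_decode(char *s)
{
    char *w = s;
    for (const char *r = s; *r; ) {
        int hi, lo;
        if (r[0] == '%' && (hi = hex_val((unsigned char)r[1])) >= 0
                        && (lo = hex_val((unsigned char)r[2])) >= 0) {
            *w++ = (char)(hi * 16 + lo);
            r += 3;
        } else {
            *w++ = *r++;
        }
    }
    *w = '\0';
    return (size_t)(w - s);
}

/* Hypothetical helper: returns 1 if the `n` bytes at `s` are well-formed
   UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF). */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    static const unsigned long min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    size_t i = 0;
    while (i < n) {
        unsigned char c = s[i];
        size_t need;               /* number of continuation bytes expected */
        unsigned long cp;
        if (c < 0x80) { i++; continue; }
        else if ((c & 0xE0) == 0xC0) { need = 1; cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { need = 2; cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { need = 3; cp = c & 0x07; }
        else return 0;             /* stray continuation or invalid lead byte */
        if (n - i <= need) return 0;   /* sequence is cut off */
        for (size_t k = 1; k <= need; k++) {
            if ((s[i + k] & 0xC0) != 0x80) return 0;
            cp = (cp << 6) | (s[i + k] & 0x3F);
        }
        if (cp < min_cp[need] || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return 0;
        i += need + 1;
    }
    return 1;
}
```

A URL whose decoded bytes pass is_valid_utf8 and contain non-ASCII bytes is
probably already UTF-8; one that fails would need an explicit input encoding,
which is the situation an option like --input-encoding addresses.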

Tim