Re: [Bug-wget] bad filenames (again)

From:	Tim Ruehsen
Subject:	Re: [Bug-wget] bad filenames (again)
Date:	Thu, 20 Aug 2015 10:47:35 +0200
User-agent:	KMail/4.14.2 (Linux/4.1.0-1-amd64; KDE/4.14.2; x86_64; ; )

On Wednesday 19 August 2015 17:38:39 Eli Zaretskii wrote:
> > Date: Wed, 19 Aug 2015 02:52:57 +0200
> > From: "Andries E. Brouwer" <address@hidden>
> > Cc: address@hidden
> > 
> > Look at the remote filename.
> > 
> > Assign a character set as follows:
> > - if the user specified a from-charset, use that
> > - if the name is printable ASCII (in 0x20-0x7e), take ASCII
> > - if the name is non-ASCII and valid UTF-8, take UTF-8
> > - otherwise take Unknown.
> 
> I think this is simpler and produces the same results:
>  - if the user specified a from-charset, use that
>  - otherwise assume UTF-8
> 
> > Determine a local character set as follows:
> > - if the user specified a to-charset, use that
> > - if the locale uses UTF-8, use that
> > - otherwise take ASCII
> 
> I suggest this instead:
>  - if the user specified a to-charset, use that
>  - otherwise, call nl_langinfo(CODESET) to find out the current
>    locale's encoding
> 
> > Convert the name from from-charset to to-charset:
> > - if the user asked for unmodified filenames, do nothing
> > - if the name is ASCII, do nothing
> > - if the name is UTF-8 and the locale uses UTF-8, do nothing
> > - convert from Unknown by hex-escaping the entire name
> > - convert to ASCII by hex-escaping the entire name
> > - otherwise invoke iconv(); upon failure, escape the illegal bytes
> 
> My suggestion:
>  - if the user asked for unmodified filenames, do nothing
>  - else invoke 'iconv' to convert from remote to local encoding
>  - if 'iconv' fails, convert to ASCII by hex-escaping
> 
> Hex-escaping only the bytes that fail 'iconv' is better than
> hex-escaping all of them, but it's more complex, and I'm not sure it's
> worth the hassle.  But if it can be implemented without undue trouble,
> I'm all for it, as it will make wget more user-friendly in those
> cases.
> 
> > Once we know what we want it is trivial to write the code,
> > but it may take a while to figure out what we want.
> > I think we should start applying the current patch.
> 
> Tim says he has some/most of that coded on a branch, so I think we
> should start by merging that branch, and then take it from there.

It is in branch 'tim/wget2'. Wget2 is a rewrite from scratch, so you can't just 
'click on the merge button' to merge.
Basically, I keep track of the charset of each URL input (command line, input 
file, stdin, downloaded+scanned). So when generating the filename, we know both 
the from-charset and the to-charset. When iconv fails here (e.g. Chinese input, 
ASCII output), escaping takes place.

Tim
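
For illustration, the scheme discussed in this message (take the to-charset from the locale unless the user gave one, convert, and hex-escape only the bytes that fail) can be sketched as follows. This is a minimal sketch and not code from wget or the 'tim/wget2' branch: it is written in Python, with Python's codecs standing in for iconv() and locale.getpreferredencoding() standing in for nl_langinfo(CODESET). The names convert_filename and hex_escape and the %XX escape format are assumptions made for the example, and the target charset is assumed to be stateless and ASCII-compatible.

```python
import locale
from typing import Optional


def hex_escape(data: bytes) -> bytes:
    """Render every byte as %XX (hypothetical escape format)."""
    return "".join(f"%{b:02X}" for b in data).encode("ascii")


def convert_filename(raw: bytes, from_charset: str = "utf-8",
                     to_charset: Optional[str] = None) -> bytes:
    """Convert a remote filename to the local charset.

    Bytes that cannot be decoded from from_charset, and characters that
    cannot be represented in to_charset, are hex-escaped individually;
    everything else is converted normally.
    """
    if to_charset is None:
        # Rough equivalent of nl_langinfo(CODESET) for the current locale.
        to_charset = locale.getpreferredencoding(False)

    # Undecodable bytes 0x80..0xFF become lone surrogates U+DC80..U+DCFF,
    # so they can be recognised and escaped below.
    text = raw.decode(from_charset, errors="surrogateescape")

    out = bytearray()
    for ch in text:
        code = ord(ch)
        if 0xDC80 <= code <= 0xDCFF:
            # Source byte was invalid in from_charset: escape that byte.
            out += hex_escape(bytes([code - 0xDC00]))
            continue
        try:
            out += ch.encode(to_charset)
        except UnicodeEncodeError:
            # Valid character with no representation in to_charset:
            # escape its original bytes.
            out += hex_escape(ch.encode(from_charset))
    return bytes(out)


if __name__ == "__main__":
    # Chinese input, ASCII output: the non-ASCII character is escaped.
    print(convert_filename("\u4e2d.html".encode("utf-8"), "utf-8", "ascii"))
```

Note that this sketch leaves a literal '%' in the name unescaped, so the escaping is not reversible as written; whether that matters depends on what the filenames are used for.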
