Re: [Bug-wget] GNU wget 1.15 released


From: Andries E. Brouwer
Subject: Re: [Bug-wget] GNU wget 1.15 released
Date: Sat, 25 Jan 2014 18:31:38 +0100
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Jan 22, 2014 at 06:50:55PM +0100, Giuseppe Scrivano wrote:
> Hello,
> 
> I am pleased to announce the new version of GNU wget.

Good!

> ftp://ftp.gnu.org/gnu/wget/wget-1.15.tar.gz

Testing shows that wget still by default creates unusable filenames
on a UTF-8 system when downloading files with UTF-8 filenames.
(It mistakenly considers the middle of certain UTF-8 symbols as "control"
and escapes them, which is terrible. Not escaping would be correct.)

Presently, bytes 0-31 and 127-159 are considered "control".
Since ASCII is a subset of almost every character set in use,
this is reasonable for 0-31 and 127.
Since more and more systems use UTF-8, this is definitely
unreasonable for 128-159. In UTF-8 these values occur only as
continuation bytes inside a multibyte character.
Escaping these internal bytes yields filenames that are no longer
valid UTF-8, difficult or impossible to handle on the local system.
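
To make the problem concrete, here is a small sketch (not wget code; the
function names are made up for illustration) that counts how many bytes of
a string fall in the 128-159 range. For the Cyrillic capital letter A
(U+0410, encoded as the bytes 0xD0 0x90) the second byte is 144, so it
would be treated as "control" and escaped on its own.

```c
#include <stddef.h>

/* Hypothetical helper: is this byte in the 128-159 range that is
   currently treated as "control"? */
int is_high_control_byte(unsigned char c)
{
  return c >= 128 && c <= 159;
}

/* Hypothetical helper: count the bytes of S that would be escaped
   as "high control" bytes under the current rule. */
size_t count_high_control_bytes(const char *s)
{
  size_t n = 0;
  for (; *s != '\0'; s++)
    if (is_high_control_byte((unsigned char) *s))
      n++;
  return n;
}
```

Calling count_high_control_bytes("\xD0\x90") reports one such byte, while
a plain ASCII name such as "abc" reports none.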

This means that one probably wants to split the concept "control"
into "control" and "highcontrol", say, in url.c:

...
#define D filechr_highcontrol
...
  D, D, D, D,  D, D, D, D,  D, D, D, D,  D, D, D, D, /* 128-143 */
  D, D, D, D,  D, D, D, D,  D, D, D, D,  D, D, D, D, /* 144-159 */
...
#undef D

where highcontrol is considered control unless LC_CTYPE
contains UTF-8 or UTF8 or utf-8 or utf8, in which case
highcontrol characters are ordinary.
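
The locale test described above could look roughly like the sketch below.
This is only an illustration under assumptions: the function names are
hypothetical and not part of wget, and it assumes the program has already
called setlocale (LC_ALL, "") so that setlocale (LC_CTYPE, NULL) returns
the name of the user's locale rather than "C".

```c
#include <ctype.h>
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper: return 1 if NEEDLE occurs in HAY,
   ignoring ASCII case, else 0. */
static int contains_nocase(const char *hay, const char *needle)
{
  size_t i, j;
  for (i = 0; hay[i] != '\0'; i++)
    {
      for (j = 0; needle[j] != '\0'; j++)
        if (hay[i + j] == '\0'
            || tolower((unsigned char) hay[i + j])
               != tolower((unsigned char) needle[j]))
          break;
      if (needle[j] == '\0')
        return 1;
    }
  return 0;
}

/* Return 1 if the locale name mentions UTF-8 in any of the
   spellings UTF-8, UTF8, utf-8, utf8 (or mixed case). */
int locale_name_is_utf8(const char *name)
{
  if (name == NULL)
    return 0;
  return contains_nocase(name, "utf-8") || contains_nocase(name, "utf8");
}

/* Assumption: the caller has already run setlocale (LC_ALL, "").
   Return 1 if "highcontrol" bytes should be left unescaped. */
int highcontrol_is_ordinary(void)
{
  return locale_name_is_utf8(setlocale(LC_CTYPE, NULL));
}
```

With this, "en_US.UTF-8" and "nl_NL.utf8" count as UTF-8 locales, while
"C" and "en_US.ISO-8859-1" do not.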

Andries


