
Re: [Bug-wget] bad filename


From: Andries E. Brouwer
Subject: Re: [Bug-wget] bad filename
Date: Thu, 24 Apr 2014 12:21:54 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Thu, Apr 24, 2014 at 09:56:15AM +0200, Tim Ruehsen wrote:
> On Wednesday 23 April 2014 15:32:47 Andries E. Brouwer wrote:
> > On Wed, Apr 23, 2014 at 02:43:21PM +0200, Tim Ruehsen wrote:
> > Wget has a serious problem. It creates by default illegal filenames.
> 
> I couldn't read that in your post before (I still can't). If Wget puts  
> "illegal" characters into filenames, that is a bug and has to be fixed.

Then let me clarify this point. Sorry for the length.

Under Unix and most filesystem types, a filename is just
a sequence of bytes without NUL or slash.
Pathnames are constructed from filenames with slash as separator.
So, anything goes.

That means that files with a filename that is not valid UTF-8
can be created. Of course, there is no way the kernel can know
how the user wants to interpret the bytes in a filename, and different
users on the same system may have different locales.

UTF-8 encodes integer values in a byte sequence. Character codes
have variable length, and the first byte defines the length.
Values 0-0x7f are encoded as themselves, in bits 0.......
Values 0x80-0x7ff are encoded in two bytes, 110..... 10......
Values 0x800-0xffff are encoded in three bytes, 1110.... 10...... 10......
Non-first bytes all look like 10......, that is, have a value in 0x80-0xbf.
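
As an illustration, the three patterns above can be written out as a small
C function. This is only a sketch: the name utf8_encode is made up for this
example and is not part of wget, and it covers only the values up to 0xffff
listed above, with no special handling of surrogates.

```c
#include <stddef.h>

/* Encode a value in 0..0xffff as UTF-8 using the three bit patterns
   described above. Writes at most 3 bytes to out and returns the
   number of bytes written, or 0 if the value is out of range.
   Hypothetical helper for illustration, not wget code. */
size_t utf8_encode(unsigned long v, unsigned char *out)
{
    if (v <= 0x7f) {
        out[0] = (unsigned char)v;                           /* 0....... */
        return 1;
    }
    if (v <= 0x7ff) {
        out[0] = (unsigned char)(0xc0 | (v >> 6));           /* 110..... */
        out[1] = (unsigned char)(0x80 | (v & 0x3f));         /* 10...... */
        return 2;
    }
    if (v <= 0xffff) {
        out[0] = (unsigned char)(0xe0 | (v >> 12));          /* 1110.... */
        out[1] = (unsigned char)(0x80 | ((v >> 6) & 0x3f));  /* 10...... */
        out[2] = (unsigned char)(0x80 | (v & 0x3f));         /* 10...... */
        return 3;
    }
    return 0;
}
```

For instance, the Hebrew letter with value 0x5e9 comes out as the two
bytes d7 a9, and the one with value 0x5d4 as d7 94.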

What wget does by default: bytes 0-0x1f and 0x7f-0x9f are considered
"control" and escaped. Escaping replaces a single byte by a sequence
of three bytes, namely %dd, where dd is the hex value of the byte.
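
A rough sketch of this kind of byte-wise escaping, to make the effect
concrete. It is not wget's actual code: the names is_ctrl and escape_ctrl
are invented here, the control ranges are taken as 0-0x1f and 0x7f-0x9f,
and upper-case hex digits are an assumption. The point is that each byte
is looked at in isolation, with no regard for the multi-byte character
it may belong to.

```c
#include <stddef.h>
#include <stdio.h>

/* "Control" bytes, taken here as 0-0x1f and 0x7f-0x9f.
   Assumption of this sketch; the exact table in wget may differ. */
static int is_ctrl(unsigned char c)
{
    return c <= 0x1f || (c >= 0x7f && c <= 0x9f);
}

/* Copy name[0..n) to out, replacing every control byte by %dd.
   out must have room for 3*n + 1 bytes. Returns the output length.
   Hypothetical helper for illustration, not wget code. */
size_t escape_ctrl(const unsigned char *name, size_t n, char *out)
{
    size_t i, o = 0;

    for (i = 0; i < n; i++) {
        if (is_ctrl(name[i]))
            o += (size_t)sprintf(out + o, "%%%02X", (unsigned)name[i]);
        else
            out[o++] = (char)name[i];   /* copied through unchanged */
    }
    out[o] = '\0';
    return o;
}
```

Fed the two bytes d7 94, this keeps d7 and turns 94 into the three
bytes 25 39 34.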

Now consider the example I gave, of a filename ש._שפרה.
The bytes occurring here are d7 a9 2e 5f d7 a9 d7 a4 d7 a8 d7 94 (hex).
The 0x94 at the end is considered control, and replaced
by %94, that is, by the three bytes 25 39 34 (hex).
The resulting string is d7 a9 2e 5f d7 a9 d7 a4 d7 a8 d7 25 39 34.
It parses as d7 a9 / 2e / 5f / d7 a9 / d7 a4 / d7 a8 / d7 ???
At the end there is d7 that announces a 2-byte sequence,
but no non-first byte follows and the parse fails.

This means that the filenames that wget creates
cannot be shown by ls and cannot be stored in a UTF-8 text file
and cannot be typed on a keyboard (on a UTF-8 system).
These names are not valid UTF-8 strings.

This behaviour of wget is very unfortunate, and people have been
complaining for many years, but so far nobody has taken the trouble
to fix it. People not bitten by it consider it low priority
and people bitten mostly live in China or Russia or other faraway
places, and mostly do not mail bug-wget. Still, I found quite a few
bug reports about this problem.

---

So much for the problem. Now for the fix.

By far the simplest fix is to change the default.
That is the 1-word change true -> false in
     opt.restrict_files_ctrl = true;

If people like this default (it is a bad default,
as I will argue below, but it is current practice),
there are many possible fixes. One is to scan the filename,
and if it is valid UTF-8, leave it unchanged.
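
A sketch of what such a scan could look like. The function names
is_valid_utf8 and should_escape_ctrl are hypothetical and not taken from
wget; the check accepts sequences of up to four bytes and rejects
overlong forms and surrogates, which is one reasonable definition of
"valid" but not the only one.

```c
#include <stddef.h>

/* Return 1 if s[0..n) parses as UTF-8, 0 otherwise.
   Hypothetical helper for illustration, not wget code. */
int is_valid_utf8(const unsigned char *s, size_t n)
{
    size_t i = 0;

    while (i < n) {
        unsigned char c = s[i];
        unsigned long v, min;
        size_t len, k;

        if (c < 0x80) {                     /* 0.......: one byte */
            i++;
            continue;
        } else if ((c & 0xe0) == 0xc0) {    /* 110.....: two bytes */
            len = 2; v = c & 0x1f; min = 0x80;
        } else if ((c & 0xf0) == 0xe0) {    /* 1110....: three bytes */
            len = 3; v = c & 0x0f; min = 0x800;
        } else if ((c & 0xf8) == 0xf0) {    /* 11110...: four bytes */
            len = 4; v = c & 0x07; min = 0x10000;
        } else {
            return 0;                       /* stray 10...... or bad first byte */
        }

        if (len > n - i)
            return 0;                       /* sequence cut off at the end */
        for (k = 1; k < len; k++) {
            if ((s[i + k] & 0xc0) != 0x80)
                return 0;                   /* expected a non-first byte */
            v = (v << 6) | (s[i + k] & 0x3f);
        }
        if (v < min || v > 0x10ffff || (v >= 0xd800 && v <= 0xdfff))
            return 0;                       /* overlong, too large, or surrogate */
        i += len;
    }
    return 1;
}

/* Policy suggested above: escape only names that are not valid UTF-8. */
int should_escape_ctrl(const unsigned char *name, size_t n)
{
    return !is_valid_utf8(name, n);
}
```

With this check the bytes d7 a9 2e 5f d7 a9 d7 a4 d7 a8 d7 94 pass and
would be left alone, while the escaped form ending in d7 25 39 34 fails
at the final d7. Note that bytes 0-0x1f are valid UTF-8 as well, so an
actual patch might still want to treat those separately.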

---

About data integrity.

Sometimes programs try to be helpful and change data for the user.
This is always very unfortunate. In the old days ftp had the default
"ascii" and did some conversion that destroyed all files one downloaded
(probably compressed archives), and one had to throw the downloaded file
away, and download again, this time not forgetting to add "binary".

Wget by default does not change the contents of files, but it changes
the names. These are data too. If one mirrors a website then the names
of the files also occur as links inside the files. The escaping that wget
does breaks the links to every renamed file, and one has to throw away
the mirror copy and mirror again, this time not forgetting to add
"--restrict-file-names=nocontrol".

The desirable state of affairs is that programs designed to
copy information do not modify it, unless explicitly asked.

Andries