Re: [Bug-wget] bad filename
From: Tim Ruehsen
Subject: Re: [Bug-wget] bad filename
Date: Thu, 24 Apr 2014 15:43:40 +0200
User-agent: KMail/4.11.5 (Linux/3.13-1-amd64; KDE/4.11.5; x86_64; ; )
On Thursday 24 April 2014 12:21:54 Andries E. Brouwer wrote:
> > I couldn't read that in your post before (I still can't). If Wget puts
> > "illegal" characters into filenames, that is a bug and has to be fixed.
>
> Then let me clarify this point. Sorry for the length.
Andries, first of all, thanks for your exhaustive and well-written explanation.
> What wget does by default: bytes 0-0x1f and 0x7f-0x9f are considered
> "control" and escaped....
In fact, I overlooked the overlap between Wget's 'control' characters and
UTF-8 continuation bytes, which is 0x80-0x9f.
So I simply missed this in your example:
> The bytes occurring here are d7 a9 2e 5f d7 a9 d7 a4 d7 a8 d7 94 (hex).
> The 0x94 at the end is considered control, and replaced...
> These names are not valid UTF-8 strings.
> This behaviour of wget is very unfortunate, and people have been
> complaining for many years, but so far nobody took the trouble
> of fixing this. People not bitten by it consider it low priority
> and people bitten mostly live in China or Russia or other faraway
> places, and mostly do not mail bug-wget. Still, I found quite a few
> bug reports about this problem.
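To make the quoted example concrete, here is a minimal sketch in Python (not Wget's actual C code). It assumes the control ranges 0x00-0x1f and 0x7f-0x9f and assumes that an escaped byte is written as %XX; the function names are made up for illustration.

```python
# Illustrative sketch only: mimics the described default, not Wget's source.

def is_wget_control(b: int) -> bool:
    """True for bytes treated as 'control' (assumed ranges: 0x00-0x1f, 0x7f-0x9f)."""
    return b <= 0x1F or 0x7F <= b <= 0x9F


def escape_control(name: bytes) -> bytes:
    """Replace every 'control' byte with %XX (assumed escape form)."""
    out = bytearray()
    for b in name:
        if is_wget_control(b):
            out += b"%%%02X" % b
        else:
            out.append(b)
    return bytes(out)


def is_valid_utf8(data: bytes) -> bool:
    """True if the byte string decodes as UTF-8."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# The byte sequence from the quoted example.
name = bytes.fromhex("d7a92e5fd7a9d7a4d7a8d794")
escaped = escape_control(name)

print(is_valid_utf8(name))     # the original name
print(escaped)                 # only the trailing 0x94 is escaped
print(is_valid_utf8(escaped))  # the lead byte 0xd7 has lost its continuation byte
```

Only the final 0x94 falls into the 0x80-0x9f overlap; the other continuation bytes (0xa9, 0xa4, 0xa8) lie above 0x9f and pass through. The escaped name therefore ends in a lone lead byte 0xd7 followed by ASCII, which is why it is no longer valid UTF-8.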
The only bug report I remember did not state a bug; it was more of a wish to
change Wget's default behavior. But maybe I had the same misunderstanding as
with your original post.
> By far the simplest fix is to change the default.
> That is the 1-word change true -> false in
> opt.restrict_files_ctrl = true;
>
> If people like this default (it is a bad default
> as I will argue below, but it is current practice)
> one can choose many fixes. One is to scan the filename,
> and if it is valid UTF-8 leave it unchanged.
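The second fix suggested in the quote (scan the filename and leave it unchanged if it is valid UTF-8) could look roughly like the following Python sketch. It is not a patch for Wget; the function names, the %XX escape form and the control ranges 0x00-0x1f and 0x7f-0x9f are assumptions made for illustration.

```python
# Illustrative sketch only: "leave valid UTF-8 alone, otherwise escape".

def is_wget_control(b: int) -> bool:
    """True for bytes treated as 'control' (assumed ranges: 0x00-0x1f, 0x7f-0x9f)."""
    return b <= 0x1F or 0x7F <= b <= 0x9F


def sanitize_filename(name: bytes) -> bytes:
    """Return the name unchanged if it is valid UTF-8, else escape 'control' bytes."""
    try:
        name.decode("utf-8")
        return name  # valid UTF-8: do not touch it
    except UnicodeDecodeError:
        pass
    # Not valid UTF-8: fall back to byte-wise escaping as %XX.
    return b"".join(
        b"%%%02X" % b if is_wget_control(b) else bytes([b]) for b in name
    )


hebrew = bytes.fromhex("d7a92e5fd7a9d7a4d7a8d794")
print(sanitize_filename(hebrew) == hebrew)  # valid UTF-8 is kept as-is
print(sanitize_filename(b"a\x94b"))         # a stray 0x94 is still escaped
```

Taken literally, this rule also leaves ASCII control characters such as 0x01 untouched, because they are valid UTF-8 too. Whether those should still be escaped is a separate decision that the quoted proposal does not settle.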
I just want to mention my concerns about a quick and dirty solution, so that
we think it through. (I am not the one to decide, and if it were my private
project, I would fix this bug immediately, no doubt.)
1. How do you know what filesystem you are writing to? If you assume the
user is not able to change the behavior, how should she be able to know about
filesystems? I am thinking of the FAT32 USB sticks flying around everywhere.
UTF-8 might be a problem there (see
http://en.wikipedia.org/wiki/Comparison_of_file_systems). I mention FAT32
only because it is pretty common. There might be other file systems with a
limited charset... A compile/configure time option could be one solution.
2. Backward compatibility. Since the current Wget behavior has existed for a
long time now, there are definitely many work-arounds (in the sense of
'relying on current behavior') in production. Changing the default might
break these scripts/programs and may cause some damage.
Of course we can say it is the admin's responsibility to check each software
update before rolling it out to production, but I guess the reality is different.
3. (Strictly another issue) If we touch the code, what about
--restrict-file-names=nocontrol,lowercase? Should we case-convert UTF-8?
My answer is yes (and that is what I did in the already mentioned Mget).
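To illustrate what case-converting UTF-8 means in practice, here is a small Python sketch comparing a byte-wise ASCII lowercase with a Unicode-aware one. It is not the code used in Wget or Mget; the function names are hypothetical, and falling back to ASCII-only lowercasing for names that are not valid UTF-8 is an assumption.

```python
# Illustrative sketch only: ASCII-only vs. Unicode-aware lowercasing.

def lowercase_ascii_only(name: bytes) -> bytes:
    """Lowercase A-Z only; bytes above 0x7f are left as they are."""
    return name.lower()


def lowercase_utf8(name: bytes) -> bytes:
    """Lowercase the name as Unicode text if it is valid UTF-8."""
    try:
        text = name.decode("utf-8")
    except UnicodeDecodeError:
        # Assumed fallback: not UTF-8, so only touch ASCII letters.
        return name.lower()
    return text.lower().encode("utf-8")


name = "ÄB.TXT".encode("utf-8")
print(lowercase_ascii_only(name).decode("utf-8"))  # 'Ä' stays uppercase
print(lowercase_utf8(name).decode("utf-8"))        # 'Ä' becomes 'ä'
```

The byte-wise variant never produces invalid UTF-8, but it leaves non-ASCII letters in their original case, so the resulting names are only partly lowercased.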
> Sometimes programs try to be helpful and change data for the user.
> This is always very unfortunate. In the old days ftp had the default
> "ascii" and did some conversion that destroyed all files one downloaded
> (probably compressed archives), and one had to throw the downloaded file
> away, and download again, this time not forgetting to add "binary".
Not only in the old days. It is still a problem, and I stumbled over it twice
within the last 6 months.
> The desirable state of affairs is that programs designed to
> copy information do not modify it, unless explicitly asked.
Yes, definitely. But changing historic defaults should be carefully thought
through.
Tim