bug-wget

Re: [Bug-wget] bad filenames (again)


From: Tim Ruehsen
Subject: Re: [Bug-wget] bad filenames (again)
Date: Fri, 07 Aug 2015 16:14:45 +0200
User-agent: KMail/4.14.2 (Linux/4.0.0-2-amd64; KDE/4.14.2; x86_64; ; )

Hi Andries,

as I already mentioned, changing the default behavior of wget is not a good 
idea.

But I started a wget2 branch that produces wget and wget2 executables.
wget2's default behavior is to keep filenames as they are.

I am not sure how it compiles and works on Windows (Cygwin could work).
If you dare to check it out: any feedback is highly welcome.

Regards, Tim

On Thursday 06 August 2015 23:40:45 Andries E. Brouwer wrote:
> Today I again downloaded a large tree with wget and got only unusable
> filenames. Fortunately I have the utility wgetfix that repairs the
> consequences of this bug (see
> http://www.win.tue.nl/~aeb/linux/misc/wget.html ), but nevertheless this
> wget bug should be fixed.
> 
> (Maybe it has been fixed already? I looked at this in detail last year,
> and there was some correspondence but I think nothing happened.
> Have not looked at the latest sources.)
> 
> What happens is that wget under certain circumstances escapes
> certain bytes in a filename. I think that this was always a mistake,
> but it did not occur very much and was defendable: filenames with
> embedded control characters are a pain.
> 
> Today the situation is just the opposite: when copying from a remote
> utf8 system to a local utf8 system correct and normal filenames
> are "escaped" to create illegal filenames that cannot be used
> and are worse than a pain: one can do little else than discard them.
> 
> What can the user do?
> 
> If she is on Windows, she is told to switch to Linux:
> > I can't help Windows users, but Wget is a power-user tool.
> > And a Windows power-user should be able to start a virtual
> > machine with Linux running to use tools like Wget.
> 
> If she is on Linux, the easiest is to discard all that was downloaded
> and start over again, this time with the option
> --restrict-file-names=nocontrol
> 
> If the user knows about wgetfix, that is an alternative.
> 
> One can also use curl instead of wget.
> 
> See also
> 
> http://savannah.gnu.org/bugs/?37564
> http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
> http://stackoverflow.com/questions/27054765/wget-japanese-characters
> http://askubuntu.com/questions/233882/how-to-download-link-with-unicode-using-wget
> http://www.win.tue.nl/~aeb/linux/misc/wget.html
> 
> Below I suggested an easy fix, and discussed some details.
> 
> Andries
> 
> On Wed, Apr 23, 2014 at 01:57:15PM +0200, Andries E. Brouwer wrote:
> > On Wed, Apr 23, 2014 at 12:59:43PM +0200, Darshit Shah wrote:
> > > On Tue, Apr 22, 2014 at 10:57 PM, Andries E. Brouwer wrote:
> > >> If I ask wget to download the wikipedia page
> > >> 
> > >> http://he.wikipedia.org/wiki/ש._שפרה
> > >> 
> > >> then I hope for a resulting file ש._שפרה.
> > >> Instead, wget gives me ש._שפר\327%94, where the \327
> > >> is an unpronounceable byte that cannot be typed.
> > >> (This is a UTF-8 system and the filename
> > >> that wget produces is not valid UTF-8.)
> > >> 
> > >> Maybe it would be better if wget by default used the original filename.
> > >> This name mangling is a vestige of old times, it seems to me.
> > > 
> > > This is a commonly reported grievance and as you correctly mention a
> > > vestige of old times. With UTF-8 supported filesystems, Wget should
> > > simply write the correct characters.
> > > 
> > > I sincerely hope this issue is resolved as fast as possible, but I
> > > know not how to. Those who understand i18n should work on this.
> > 
> > It is very easy to resolve the issue, but I don't know how backwards
> > compatible the wget developers want to be.
> > 
> > The easiest solution is to change the line (in init.c:defaults())
> > 
> >     opt.restrict_files_ctrl = true;
> > 
> > into
> > 
> >     opt.restrict_files_ctrl = false;
> > 
> > That is what I would like to see:
> > the default should be to preserve the name as-is,
> > and there should be options "escape_control" or so
> > to force the current default behaviour.
> > 
> > There are also more complicated solutions.
> > One can ask for LC_CTYPE or LANG or some such thing,
> > and try to find out whether the current system is UTF-8,
> > and only in that case set restrict_files_ctrl to false.
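
The locale check described above could look roughly like the following. This is only a sketch under assumptions: the helper names codeset_is_utf8() and locale_is_utf8() are hypothetical and do not exist in wget, and the check relies on the POSIX calls setlocale() and nl_langinfo(CODESET), so it says nothing about Windows.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <strings.h>

/* Hypothetical helper: true if a codeset name denotes UTF-8.
   Accepts the common spellings "UTF-8" and "utf8", case-insensitively. */
static bool codeset_is_utf8(const char *codeset)
{
    return strcasecmp(codeset, "UTF-8") == 0
        || strcasecmp(codeset, "UTF8") == 0;
}

/* Hypothetical helper: true if the user's LC_CTYPE locale uses UTF-8.
   setlocale(LC_CTYPE, "") picks the locale up from the environment
   (LC_ALL, LC_CTYPE, LANG); nl_langinfo(CODESET) reports its encoding. */
static bool locale_is_utf8(void)
{
    setlocale(LC_CTYPE, "");
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

With such a helper, the default could be chosen as opt.restrict_files_ctrl = !locale_is_utf8(); instead of an unconditional true. Whether that is backwards compatible enough is a separate question.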
> > 
> > I don't know anything about the Windows environment.
> > 
> > Andries
> > 
> > 
> > [Discussion:
> > 
> > There is a flag --restrict-file-names. The manual page says
> > "By default, Wget escapes the characters that are not valid or safe
> >  as part of file names on your operating system, as well as control
> >  characters that are typically unprintable."
> > 
> > Presently this is false: On a UTF-8 system Wget by default introduces
> > illegal characters. The option "nocontrol" is needed to preserve the
> > correct name.
> > 
> > The flag is handled in init.c:cmd_spec_restrict_file_names()
> > where opt.restrict_files_{os,case,ctrl,nonascii} are set.
> > Of interest is the restrict_files_ctrl flag.
> > Today init.c does by default:
> > 
> > #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
> >   opt.restrict_files_os = restrict_windows;
> > #else
> >   opt.restrict_files_os = restrict_unix;
> > #endif
> >   opt.restrict_files_ctrl = true;
> >   opt.restrict_files_nonascii = false;
> >   opt.restrict_files_case = restrict_no_case_restriction;
> > 
> > The value of these flags is used in url.c:append_uri_pathel
> > where FILE_CHAR_TEST (*p, mask) is used to decide what bytes
> > in the filename need quoting.
> > 
> > This is too simplistic an approach: quoting is introduced
> > in the middle of multibyte characters. So the current setup
> > is buggy and wrong. Basically the choice is between making
> > the unfortunately named "nocontrol" (it should be called
> > "preserve_name" or so) the default and adding more heuristics
> > to detect and solve the worst problems. For example,
> > UTF-8 is easy to detect, so if a filename is valid UTF-8
> > one can preserve it. Of course there are other multi-byte
> > character sets in widespread use in East Asia.]
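
To make the problem and the suggested heuristic concrete, here is a small sketch. It is not taken from url.c: all function names are hypothetical, and is_ctrl_byte() encodes the assumption that bytes 0x80-0x9f are treated as control characters, which is consistent with the \327%94 result quoted above (ה is the byte pair 0xD7 0x94 in UTF-8). escape_bytewise() shows how per-byte quoting cuts a multibyte character in half; is_valid_utf8() is the "easy to detect" check, and local_filename() keeps a name untouched when it passes.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Assumption: the byte values treated as "control" when
   restrict_files_ctrl is set (C0 controls, DEL, and 0x80-0x9f). */
static bool is_ctrl_byte(unsigned char c)
{
    return c < 0x20 || c == 0x7f || (c >= 0x80 && c <= 0x9f);
}

/* Byte-wise quoting: each "control" byte becomes %XX, all others are
   copied. out must have room for 3 * strlen(in) + 1 bytes. */
static void escape_bytewise(const char *in, char *out)
{
    static const char hex[] = "0123456789ABCDEF";
    for (const unsigned char *p = (const unsigned char *)in; *p; p++) {
        if (is_ctrl_byte(*p)) {
            *out++ = '%';
            *out++ = hex[*p >> 4];
            *out++ = hex[*p & 0x0f];
        } else {
            *out++ = (char)*p;
        }
    }
    *out = '\0';
}

/* True if s[0..len) is well-formed UTF-8. Overlong forms, surrogates
   and code points above U+10FFFF are rejected. */
static bool is_valid_utf8(const char *s, size_t len)
{
    const unsigned char *p = (const unsigned char *)s;
    size_t i = 0;
    while (i < len) {
        unsigned char c = p[i];
        size_t n;                           /* continuation bytes expected */
        unsigned char lo = 0x80, hi = 0xbf; /* range of the first of them */
        if (c < 0x80) { i++; continue; }    /* plain ASCII */
        if (c >= 0xc2 && c <= 0xdf) {
            n = 1;
        } else if (c >= 0xe0 && c <= 0xef) {
            n = 2;
            if (c == 0xe0) lo = 0xa0;       /* no overlong 3-byte forms */
            if (c == 0xed) hi = 0x9f;       /* no surrogates */
        } else if (c >= 0xf0 && c <= 0xf4) {
            n = 3;
            if (c == 0xf0) lo = 0x90;       /* no overlong 4-byte forms */
            if (c == 0xf4) hi = 0x8f;       /* nothing above U+10FFFF */
        } else {
            return false;                   /* stray continuation or bad lead */
        }
        if (len - i <= n) return false;     /* sequence is truncated */
        if (p[i + 1] < lo || p[i + 1] > hi) return false;
        for (size_t k = 2; k <= n; k++)
            if ((p[i + k] & 0xc0) != 0x80) return false;
        i += n + 1;
    }
    return true;
}

/* Hypothetical policy: preserve a name that is valid UTF-8 as-is,
   otherwise fall back to byte-wise quoting. out sized as above. */
static void local_filename(const char *in, char *out)
{
    if (is_valid_utf8(in, strlen(in)))
        strcpy(out, in);
    else
        escape_bytewise(in, out);
}
```

Feeding the two bytes of ה through escape_bytewise() leaves 0xD7 in place and turns 0x94 into the three characters %94, so the result is no longer valid UTF-8, while local_filename() returns the name unchanged. Note that this policy also keeps ASCII control characters in valid UTF-8 names, just as "nocontrol" does, and it does nothing for the other multi-byte character sets mentioned above.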



