Re: [Bug-wget] bad filenames (again)


From: Andries E. Brouwer
Subject: Re: [Bug-wget] bad filenames (again)
Date: Thu, 6 Aug 2015 23:40:45 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

Today I again downloaded a large tree with wget and got only unusable filenames.
Fortunately I have the utility wgetfix that repairs the consequences
of this bug (see http://www.win.tue.nl/~aeb/linux/misc/wget.html ),
but nevertheless this wget bug should be fixed.

(Maybe it has been fixed already? I looked at this in detail last year,
and there was some correspondence but I think nothing happened.
Have not looked at the latest sources.)

What happens is that wget, under certain circumstances, escapes
certain bytes in a filename. I think that this was always a mistake,
but it did not occur very often and was defensible: filenames with
embedded control characters are a pain.

Today the situation is just the opposite: when copying from a remote
UTF-8 system to a local UTF-8 system, correct and normal filenames
are "escaped" to create illegal filenames that cannot be used.
These are worse than a pain: one cannot do much else than discard them.

What can the user do?
If she is on Windows, she is told to switch to Linux:

> I can't help Windows users, but Wget is a power-user tool. 
> And a Windows power-user should be able to start a virtual 
> machine with Linux running to use tools like Wget. 

If she is on Linux, the easiest is to discard all that was downloaded
and start over again, this time with the option
--restrict-file-names=nocontrol

If the user knows about wgetfix, that is an alternative.

One can also use curl instead of wget.

See also

http://savannah.gnu.org/bugs/?37564
http://stackoverflow.com/questions/22010251/wget-unicode-filename-errors
http://stackoverflow.com/questions/27054765/wget-japanese-characters
http://askubuntu.com/questions/233882/how-to-download-link-with-unicode-using-wget
http://www.win.tue.nl/~aeb/linux/misc/wget.html

Below is my earlier message, in which I suggested an easy fix
and discussed some details.

Andries



On Wed, Apr 23, 2014 at 01:57:15PM +0200, Andries E. Brouwer wrote:
> On Wed, Apr 23, 2014 at 12:59:43PM +0200, Darshit Shah wrote:
> > On Tue, Apr 22, 2014 at 10:57 PM, Andries E. Brouwer wrote:
> >
> >> If I ask wget to download the wikipedia page
> >>
> >> http://he.wikipedia.org/wiki/ש._שפרה
> >>
> >> then I hope for a resulting file ש._שפרה.
> >> Instead, wget gives me ש._שפר\327%94, where the \327
> >> is an unpronounceable byte that cannot be typed.
> >> (This is a UTF-8 system and the filename
> >> that wget produces is not valid UTF-8.)
> >>
> >> Maybe it would be better if wget by default used the original filename.
> >> This name mangling is a vestige of old times, it seems to me.
> > 
> > This is a commonly reported grievance and as you correctly mention a
> > vestige of old times. With UTF-8 supported filesystems, Wget should
> > simply write the correct characters.
> > 
> > I sincerely hope this issue is resolved as fast as possible, but I
> > know not how to. Those who understand i18n should work on this.
> 
> It is very easy to resolve the issue, but I don't know how backwards
> compatible the wget developers want to be.
> 
> The easiest solution is to change the line (in init.c:defaults())
>       opt.restrict_files_ctrl = true;
> into
>       opt.restrict_files_ctrl = false;
> 
> That is what I would like to see:
> the default should be to preserve the name as-is,
> and there should be options "escape_control" or so
> to force the current default behaviour.
> 
> There are also more complicated solutions.
> One can ask for LC_CTYPE or LANG or some such thing,
> and try to find out whether the current system is UTF-8,
> and only in that case set restrict_files_ctrl to false.
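
A minimal sketch of such a locale check, assuming a POSIX system
where nl_langinfo(CODESET) is available. The helper names
codeset_is_utf8() and locale_is_utf8() are hypothetical and do not
exist in wget; the sketch only shows how the default could be derived.

```c
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>
#include <strings.h>

/* Hypothetical helper (not part of wget): does a codeset name, as
   returned by nl_langinfo(CODESET), denote UTF-8?  Some systems
   report "UTF-8", others "utf8", so compare case-insensitively. */
static bool codeset_is_utf8(const char *codeset)
{
    return codeset != NULL
        && (strcasecmp(codeset, "UTF-8") == 0
            || strcasecmp(codeset, "UTF8") == 0);
}

/* Hypothetical helper: true if the locale selected by LC_ALL,
   LC_CTYPE or LANG uses UTF-8 as its character encoding. */
static bool locale_is_utf8(void)
{
    if (setlocale(LC_CTYPE, "") == NULL)
        return false;               /* unknown locale: stay conservative */
    return codeset_is_utf8(nl_langinfo(CODESET));
}
```

With this, the default could become something like
opt.restrict_files_ctrl = !locale_is_utf8(); so that escaping stays
on wherever the local encoding is not known to be UTF-8.
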
> 
> I don't know anything about the Windows environment.
> 
> Andries
> 
> 
> [Discussion:
> 
> There is a flag --restrict-file-names. The manual page says
> "By default, Wget escapes the characters that are not valid or safe
>  as part of file names on your operating system, as well as control
>  characters that are typically unprintable."
> Presently this is false: On a UTF-8 system Wget by default introduces
> illegal characters. The option "nocontrol" is needed to preserve the
> correct name.
> 
> The flag is handled in init.c:cmd_spec_restrict_file_names()
> where opt.restrict_files_{os,case,ctrl,nonascii} are set.
> Of interest is the restrict_files_ctrl flag.
> Today init.c does by default:
> 
> #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
>   opt.restrict_files_os = restrict_windows;
> #else
>   opt.restrict_files_os = restrict_unix;
> #endif
>   opt.restrict_files_ctrl = true;
>   opt.restrict_files_nonascii = false;
>   opt.restrict_files_case = restrict_no_case_restriction;
> 
> The value of these flags is used in url.c:append_uri_pathel
> where FILE_CHAR_TEST (*p, mask) is used to decide what bytes
> in the filename need quoting.
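
To make the effect concrete, here is a small stand-alone sketch of
byte-wise quoting. It is not wget's code: escape_bytewise() and
is_control_byte() are hypothetical stand-ins, and the assumption is
that the "control" class covers the bytes 0x00-0x1f and 0x80-0x9f,
which is consistent with the \327%94 result reported above.

```c
#include <stdio.h>
#include <string.h>

/* Assumption: bytes treated as "control" are 0x00-0x1f and 0x80-0x9f. */
static int is_control_byte(unsigned char c)
{
    return c < 0x20 || (c >= 0x80 && c <= 0x9f);
}

/* Hypothetical stand-in for the quoting loop: copy src to dst and
   replace every control byte by %XX.  The decision is made per byte,
   with no knowledge of multibyte characters.
   dst must have room for 3 * strlen(src) + 1 bytes. */
static void escape_bytewise(const char *src, char *dst)
{
    const unsigned char *p;

    for (p = (const unsigned char *)src; *p != '\0'; p++) {
        if (is_control_byte(*p))
            dst += sprintf(dst, "%%%02X", (unsigned int)*p);
        else
            *dst++ = (char)*p;
    }
    *dst = '\0';
}
```

The Hebrew letter ה is the two bytes 0xd7 0x94 in UTF-8. The second
byte falls in the 0x80-0x9f range, so only that byte is quoted and
the lead byte 0xd7 (octal \327) is left dangling.
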
> 
> This is too simplistic an approach: quoting is introduced
> in the middle of multibyte characters. So the current setup
> is buggy and wrong. Basically the choice is between making
> the unfortunately named "nocontrol" (it should be called
> "preserve_name" or so) the default and adding more heuristics
> to detect and solve the worst problems. For example,
> UTF-8 is easy to detect, so if a filename is valid UTF-8
> one can preserve it. Of course there are other multi-byte
> character sets in widespread use in East Asia.]
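
As an illustration of how cheap the UTF-8 test is, here is a minimal
validity check. The function is_valid_utf8() is a hypothetical helper,
not something present in wget; a sketch of the heuristic, under the
assumption that filenames are NUL-terminated byte strings.

```c
#include <stdbool.h>

/* Hypothetical helper (not part of wget): return true if the
   NUL-terminated string s is well-formed UTF-8.  Rejects stray
   continuation bytes, truncated sequences, overlong encodings,
   surrogates and code points above U+10FFFF. */
static bool is_valid_utf8(const char *s)
{
    const unsigned char *p = (const unsigned char *)s;

    while (*p != '\0') {
        unsigned char c = *p++;
        int extra;          /* continuation bytes still expected */
        unsigned long min;  /* smallest code point for this length */
        unsigned long cp;   /* code point being assembled */

        if (c < 0x80)
            continue;                       /* plain ASCII */
        else if ((c & 0xE0) == 0xC0) { extra = 1; min = 0x80;    cp = c & 0x1F; }
        else if ((c & 0xF0) == 0xE0) { extra = 2; min = 0x800;   cp = c & 0x0F; }
        else if ((c & 0xF8) == 0xF0) { extra = 3; min = 0x10000; cp = c & 0x07; }
        else
            return false;                   /* stray continuation or bad lead byte */

        while (extra-- > 0) {
            if ((*p & 0xC0) != 0x80)
                return false;               /* also catches a premature NUL */
            cp = (cp << 6) | (unsigned long)(*p++ & 0x3F);
        }
        if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
            return false;
    }
    return true;
}
```

A name that passes this test could be preserved as-is. The mangled
name ש._שפר\327%94 fails it, because the lead byte \327 is followed
by '%' instead of a continuation byte.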


