bug-wget

Re: [Bug-wget] bad filename


From: Andries E. Brouwer
Subject: Re: [Bug-wget] bad filename
Date: Wed, 23 Apr 2014 13:57:15 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Apr 23, 2014 at 12:59:43PM +0200, Darshit Shah wrote:
> On Tue, Apr 22, 2014 at 10:57 PM, Andries E. Brouwer wrote:
>
>> If I ask wget to download the wikipedia page
>>
>> http://he.wikipedia.org/wiki/ש._שפרה
>>
>> then I hope for a resulting file ש._שפרה.
>> Instead, wget gives me ש._שפר\327%94, where the \327
>> is an unpronounceable byte that cannot be typed
>> (This is an UTF-8 system and the filename
>> that wget produces is not valid UTF-8.)
>>
>> Maybe it would be better if wget by default used the original filename.
>> This name mangling is a vestige of old times, it seems to me.
> 
> This is a commonly reported grievance and as you correctly mention a
> vestige of old times. With UTF-8 supported filesystems, Wget should
> simply write the correct characters.
> 
> I sincerely hope this issue is resolved as fast as possible, but I
> know not how to. Those who understand i18n should work on this.

It is very easy to resolve the issue, but I don't know how backwards
compatible the wget developers want to be.

The easiest solution is to change the line (in init.c:defaults())
        opt.restrict_files_ctrl = true;
into
        opt.restrict_files_ctrl = false;

That is what I would like to see:
the default should be to preserve the name as-is,
and there should be options "escape_control" or so
to force the current default behaviour.

There are also more complicated solutions.
One can look at LC_CTYPE or LANG or some such thing,
try to find out whether the current system is UTF-8,
and only in that case set restrict_files_ctrl to false.
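
A minimal sketch of that locale check, assuming a POSIX system
that provides nl_langinfo(). The helper names codeset_is_utf8 and
locale_is_utf8 are hypothetical and not part of wget:

```c
#include <langinfo.h>
#include <locale.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical helper: true if CODESET names UTF-8.  Both spellings
   are accepted because systems differ in how they report it. */
static bool
codeset_is_utf8 (const char *codeset)
{
  return codeset != NULL
         && (strcmp (codeset, "UTF-8") == 0
             || strcmp (codeset, "utf8") == 0);
}

/* Hypothetical helper: adopt the locale from the environment
   (LC_ALL, LC_CTYPE, LANG) and ask which codeset it uses. */
static bool
locale_is_utf8 (void)
{
  setlocale (LC_CTYPE, "");
  return codeset_is_utf8 (nl_langinfo (CODESET));
}
```

With such a helper the default could be written as
opt.restrict_files_ctrl = !locale_is_utf8 ();
but whether wget should do this is exactly the open question.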

I don't know anything about the Windows environment.

Andries


[Discussion:

There is a flag --restrict-file-names. The manual page says
"By default, Wget escapes the characters that are not valid or safe
 as part of file names on your operating system, as well as control
 characters that are typically unprintable."
Presently this is false: on a UTF-8 system Wget by default produces
file names that are not valid UTF-8. The option
--restrict-file-names=nocontrol is needed to preserve the
correct name.

The flag is handled in init.c:cmd_spec_restrict_file_names()
where opt.restrict_files_{os,case,ctrl,nonascii} are set.
Of interest is the restrict_files_ctrl flag.
Today init.c does by default:

#if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
  opt.restrict_files_os = restrict_windows;
#else
  opt.restrict_files_os = restrict_unix;
#endif
  opt.restrict_files_ctrl = true;
  opt.restrict_files_nonascii = false;
  opt.restrict_files_case = restrict_no_case_restriction;

The value of these flags is used in url.c:append_uri_pathel
where FILE_CHAR_TEST (*p, mask) is used to decide what bytes
in the filename need quoting.
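
The effect of such a byte-by-byte test can be sketched as follows.
This is not wget's code: the set of "control" bytes used here
(0x00-0x1f, 0x7f, 0x80-0x9f) is an assumption chosen to match the
output reported above, and the function names are hypothetical.

```c
#include <stdio.h>

/* Assumption for this sketch: these byte ranges count as control
   characters.  It mimics, but is not, wget's own table. */
static int
is_control_byte (unsigned char c)
{
  return c < 0x20 || c == 0x7f || (c >= 0x80 && c <= 0x9f);
}

/* Escape each control byte as %XX, looking at one byte at a time.
   OUT must have room for 3 * strlen (in) + 1 bytes. */
static void
escape_bytewise (const char *in, char *out)
{
  for (const unsigned char *p = (const unsigned char *) in; *p; p++)
    {
      if (is_control_byte (*p))
        out += sprintf (out, "%%%02X", (unsigned int) *p);
      else
        *out++ = (char) *p;
    }
  *out = '\0';
}
```

The Hebrew letter ה (U+05D4) is the two bytes 0xD7 0x94 in UTF-8.
The second byte falls in 0x80-0x9f, so it alone is replaced by
"%94" while the lead byte 0xD7 (octal \327) is left behind.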

This is too simplistic an approach: the test works byte by byte,
so quoting is introduced in the middle of multibyte characters.
So the current setup is buggy and wrong. Basically there are two
choices: make the unfortunately named "nocontrol" (it should be
called "preserve_name" or so) the default, or add more heuristics
to detect and solve the worst problems. For example, UTF-8 is
easy to detect, so if a filename is valid UTF-8 one can preserve
it. Of course there are other multi-byte character sets in
widespread use in East Asia.]
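
The remark that UTF-8 is easy to detect can be made concrete with
a small validator. This is only a sketch; is_valid_utf8 is a
hypothetical helper, not an existing wget function:

```c
#include <stdbool.h>
#include <stddef.h>

/* Hypothetical helper: true if the LEN bytes at S are well-formed
   UTF-8 (no overlong forms, no surrogates, nothing above U+10FFFF). */
static bool
is_valid_utf8 (const unsigned char *s, size_t len)
{
  size_t i = 0;
  while (i < len)
    {
      unsigned char c = s[i];
      size_t n;                 /* number of continuation bytes */
      unsigned long min, cp;    /* smallest legal value, code point */

      if (c < 0x80) { i++; continue; }          /* plain ASCII */
      else if ((c & 0xe0) == 0xc0) { n = 1; min = 0x80;    cp = c & 0x1f; }
      else if ((c & 0xf0) == 0xe0) { n = 2; min = 0x800;   cp = c & 0x0f; }
      else if ((c & 0xf8) == 0xf0) { n = 3; min = 0x10000; cp = c & 0x07; }
      else return false;        /* stray continuation or invalid lead */

      if (len - i <= n)
        return false;           /* sequence is cut off */
      for (size_t k = 1; k <= n; k++)
        {
          if ((s[i + k] & 0xc0) != 0x80)
            return false;       /* not a continuation byte */
          cp = (cp << 6) | (s[i + k] & 0x3f);
        }
      if (cp < min || cp > 0x10ffff || (cp >= 0xd800 && cp <= 0xdfff))
        return false;
      i += n + 1;
    }
  return true;
}
```

A name that passes this check could be written out unchanged,
and only names that fail would fall back to byte-wise quoting.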
