bug-wget

Re: [Bug-wget] bad filenames (again)


From: Eli Zaretskii
Subject: Re: [Bug-wget] bad filenames (again)
Date: Mon, 17 Aug 2015 05:39:34 +0300

> Date: Sun, 16 Aug 2015 22:21:20 +0200
> From: "Andries E. Brouwer" <address@hidden>
> Cc: "Andries E. Brouwer" <address@hidden>, address@hidden,
>         address@hidden
> 
> On Sun, Aug 16, 2015 at 05:43:50PM +0300, Eli Zaretskii wrote:
> 
> (i)
> 
> >> #if defined(WINDOWS) || defined(MSDOS) || defined(__CYGWIN__)
> >>   /* insert some test for Windows */
> >> #else
> >>  ... code that uses getenv to test LC_ALL, LC_CTYPE, LANG ...
> >> #endif
> 
> > I'm not sure this is the right way to fix this.  First, relying on
> > UTF-8 locale to be announced in the environment is less portable than
> > it could be: it's better to call 'setlocale' with the 2nd argument
> > NULL to glean the same information.  Then the ugly #ifdef above could
> > be dropped, and at least Cygwin will not be excluded from this
> > feature.
> 
> I left the wget behaviour for MSDOS / Windows / Cygwin unchanged
> because I do not know anything about these platforms. It is quite
> possible that the #ifdef is unneeded.
> 
> Are you saying that it in fact is needed when getenv() is used,
> but unneeded when setlocale() is used?

Yes.  These systems don't normally have the LC_* environment
variables, and their 'setlocale' (with the exception of Cygwin) does
not look at those variables.  But you _can_ obtain the current locale
on all supported systems by calling 'setlocale'.
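To illustrate, here is a minimal sketch of such a test.  The helper
names are hypothetical (they are not existing wget functions), and the
sketch assumes the program has already called setlocale (LC_CTYPE, "")
or setlocale (LC_ALL, "") at startup; otherwise the query just returns
"C".  It also assumes the codeset is spelled "UTF-8" or "utf8" after
the dot, which is the common convention but not something the C
standard guarantees.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Compare LEN bytes of A with the whole of B, ignoring ASCII case.
   Written by hand because strcasecmp is not part of standard C.  */
static int
ascii_equal_nocase (const char *a, size_t len, const char *b)
{
  size_t i;

  if (strlen (b) != len)
    return 0;
  for (i = 0; i < len; i++)
    {
      unsigned char ca = (unsigned char) a[i];
      unsigned char cb = (unsigned char) b[i];

      if (ca >= 'A' && ca <= 'Z')
        ca += 'a' - 'A';
      if (cb >= 'A' && cb <= 'Z')
        cb += 'a' - 'A';
      if (ca != cb)
        return 0;
    }
  return 1;
}

/* Hypothetical helper: return 1 if a locale name such as
   "ru_RU.UTF-8" or "en_US.utf8@euro" names a UTF-8 codeset.  */
int
locale_name_is_utf8 (const char *name)
{
  const char *codeset, *end;
  size_t len;

  if (name == NULL)
    return 0;
  codeset = strchr (name, '.');
  if (codeset == NULL)
    return 0;                   /* "C", "POSIX", "ru_RU": no codeset */
  codeset++;
  end = strchr (codeset, '@');  /* drop a modifier such as "@euro" */
  len = end ? (size_t) (end - codeset) : strlen (codeset);
  return (ascii_equal_nocase (codeset, len, "utf-8")
          || ascii_equal_nocase (codeset, len, "utf8"));
}

/* Query the current LC_CTYPE locale without changing it
   (2nd argument NULL), instead of reading the environment.  */
int
current_locale_is_utf8 (void)
{
  return locale_name_is_utf8 (setlocale (LC_CTYPE, NULL));
}
```

Only LC_CTYPE is queried, since what matters here is the codeset, not
the language.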

> And then what about LANG?

What about it?  You can test it in the environment, if you want, but
IMO it's unnecessary, since either 'setlocale' already looks at it, or
the variable is not relevant to the issue at hand.  (You need the
codeset, not the language.)

> > Moreover, even if the locale is not UTF-8, wget should attempt to
> > convert the file names to the current locale using iconv (which I
> > believe was what Tim suggested).  This will DTRT in much more cases
> > than the above UTF-8 centric approach, IMO.
> 
> Hmm. My own point of view is almost the opposite. In my life I have
> spent countless hours trying to repair the damage done by software
> that helpfully modified my data.
> I prefer my data as-is, unless I explicitly ask for conversion.

This is a philosophical question: are a Cyrillic file name encoded in
koi8-r and the same name encoded in UTF-8 "modified data", or the
same data expressed in different codesets?

Converting encoding as required by the locale is the expected
behavior.  Windows, for example, does that automatically (if
possible).

> The patch enlarges the number of cases where the original data
> is preserved. Yes, I am all in favour of enlarging that number of
> cases even further. This is only a first step. But in my eyes
> applying iconv would be a step back. It can be really tricky to
> decode the mojibake obtained by converting A to C, while
> the original really was in B.

If iconv succeeds in converting, you won't see any mojibake to begin
with.  If it fails, then yes, the conversion should be abandoned.

> What should happen when iconv() returns EILSEQ?

Turn on the restrict_files_highctrl option, like you do now.
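
A rough sketch of the "convert, and abandon on failure" logic, under
some assumptions: the function name is hypothetical (not an existing
wget function), the caller already knows the source and target
codesets, and a 4x output buffer is taken to be large enough for file
names, so running out of space is simply treated as a failure.  A NULL
return means the caller should keep the original bytes and apply the
existing fallback.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert NAME from codeset FROM to codeset TO.
   Returns a malloc'ed string, or NULL if the conversion is not
   possible (unknown codeset, EILSEQ, EINVAL, no memory, or an
   inexact conversion).  The caller frees the result.  */
char *
convert_file_name (const char *name, const char *from, const char *to)
{
  iconv_t cd;
  char *in = (char *) name;     /* iconv wants char **, input is not modified */
  size_t inleft = strlen (name);
  size_t size = inleft * 4 + 16;
  size_t outleft;
  char *buf, *out;

  cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return NULL;                /* this codeset pair is not supported */

  buf = malloc (size);
  if (buf == NULL)
    {
      iconv_close (cd);
      return NULL;
    }
  out = buf;
  outleft = size - 1;           /* keep room for the terminating NUL */

  /* Anything other than 0 is an error ((size_t) -1) or a count of
     inexact substitutions; either way, abandon the conversion.
     The second call flushes any pending shift state.  */
  if (iconv (cd, &in, &inleft, &out, &outleft) != 0
      || iconv (cd, NULL, NULL, &out, &outleft) != 0)
    {
      free (buf);
      iconv_close (cd);
      return NULL;
    }

  *out = '\0';
  iconv_close (cd);
  return buf;
}
```

On some systems iconv lives in a separate library, so linking may
need -liconv.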


