Re: [Bug-wget] bad filenames (again)

From:	Andries E. Brouwer
Subject:	Re: [Bug-wget] bad filenames (again)
Date:	Mon, 17 Aug 2015 12:59:05 +0200
User-agent:	Mutt/1.5.21 (2010-09-15)

On Mon, Aug 17, 2015 at 05:39:34AM +0300, Eli Zaretskii wrote:

(i) [about using setlocale]

> > > First, relying on UTF-8 locale to be announced in the environment
> > > is less portable than it could be: it's better to call 'setlocale'
> > > Then ... at least Cygwin will not be excluded from this feature.
> > 
> > I left the wget behaviour for MSDOS / Windows / Cygwin unchanged
> > because I do not know anything about these platforms.
> 
> These systems don't normally have the LC_* environment
> variables, and their 'setlocale' (with the exception of Cygwin) does
> not look at those variables.  But you _can_ obtain the current locale
> on all supported systems by calling 'setlocale'.

Good. Then perhaps using setlocale would be better.
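
For illustration only, a minimal sketch of such a setlocale-based
check. The helper names are hypothetical and not part of wget, and
the assumption that the codeset appears after a '.' in the returned
locale name (as in "en_US.UTF-8") is mine; it may not hold on every
platform.

```c
#include <ctype.h>
#include <locale.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: true if the codeset part of a locale name
   (the text between '.' and an optional '@modifier') spells UTF-8.
   "UTF-8", "utf8" and "Utf-8" are all accepted. */
static bool locale_name_is_utf8(const char *name)
{
  static const char want[] = "utf8";
  const char *p;
  size_t i = 0;

  if (name == NULL)
    return false;
  p = strchr(name, '.');
  if (p == NULL)
    return false;                 /* e.g. "C" or "POSIX": no codeset */
  for (p++; *p != '\0' && *p != '@'; p++) {
    if (*p == '-')
      continue;                   /* treat "UTF-8" and "UTF8" alike */
    if (want[i] == '\0' || tolower((unsigned char) *p) != want[i])
      return false;
    i++;
  }
  return want[i] == '\0';
}

/* Hypothetical helper: adopt the user's default LC_CTYPE locale and
   inspect the name that setlocale reports, instead of reading the
   LC_* environment variables directly. */
static bool current_locale_is_utf8(void)
{
  const char *name = setlocale(LC_CTYPE, "");

  if (name == NULL)               /* request failed: query what is active */
    name = setlocale(LC_CTYPE, NULL);
  return locale_name_is_utf8(name);
}
```

The name parsing is kept separate from the setlocale call so that it
can be checked against fixed strings.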

I will not make that change myself, since I do not feel confident
on the Windows platform.
After all, the goal is not to find out what locale we are in,
but to find out whether it might be a good idea to escape certain
bytes in a filename. The original author's code was based on the
idea that the system is using an ISO-8859-n character set.
On Windows I guess that FAT filesystems will use some code page,
and NTFS filesystems will use Unicode.
If that is correct, then perhaps it never makes sense
to do this escape of "high control bytes" on a Windows system.

[So, I conjecture that we could make Windows users happy
by replacing
  /* insert some test for Windows */
by
  return true;
(and updating the function name accordingly).]
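
To make the conjecture concrete, here is a rough sketch. The function
names and the utf8_locale parameter are hypothetical and are not taken
from the patch or from wget; the range 0x80-0x9f is what "high control
bytes" means under the ISO-8859-n assumption, and the _WIN32 /
__MSDOS__ tests are only my guess at a suitable platform check.

```c
#include <stdbool.h>

/* Hypothetical: may bytes 0x80-0x9f be written to a filename as-is?
   utf8_locale is whatever the caller found out about the locale. */
static bool high_bytes_are_safe(bool utf8_locale)
{
#if defined(_WIN32) || defined(__MSDOS__)
  /* Conjecture from this mail: with code pages (FAT) or Unicode
     (NTFS) the ISO-8859-n reasoning never applies. */
  (void) utf8_locale;
  return true;
#else
  return utf8_locale;
#endif
}

/* Hypothetical: should byte c be %-escaped as a "high control byte"? */
static bool should_escape_high_control(unsigned char c, bool keep_high)
{
  if (keep_high)
    return false;
  return c >= 0x80 && c <= 0x9f;  /* C1 control range in ISO-8859-n */
}
```

Under these assumptions a caller would pass
high_bytes_are_safe(...) as keep_high, so that on Windows nothing in
the 0x80-0x9f range is escaped.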

(ii) [about possibly using iconv]

>> How do you guess the original character set?

Since you pass silently over this point, it seems
there is no good way to involve iconv.

> This is a philosophical question: is a Cyrillic file name encoded in
> koi8-r and the same name encoded in UTF-8 a "modified data" or the
> same data expressed in different codesets.

Unix filenames are not necessarily in any particular character set.
They are sequences of bytes different from NUL and '/'.
A different sequence of bytes is a different filename.
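
In code, the byte-sequence view amounts to something like the sketch
below. The function names are hypothetical, and the special names "."
and ".." and any length limits are ignored here.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* True if the len bytes at name could serve as one Unix path
   component: non-empty, containing neither NUL nor '/'.
   No character set is consulted at any point. */
static bool is_valid_component(const void *name, size_t len)
{
  if (len == 0)
    return false;
  return memchr(name, '\0', len) == NULL
      && memchr(name, '/', len) == NULL;
}

/* Two names are the same file name only if they are the same bytes. */
static bool same_filename(const void *a, size_t alen,
                          const void *b, size_t blen)
{
  return alen == blen && memcmp(a, b, alen) == 0;
}
```

So the koi8-r byte 0xc1 and the UTF-8 bytes 0xd0 0xb0 are both valid
names, and they are different names, even though both can be read as
the same Cyrillic letter.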

Also, "the same name encoded in UTF-8" is an optimistic description.
Should the Unicode be NFC? Or NFD? MacOS has a third version.
Even if the filename had a well-defined and known character set,
conversion to UTF-8 is not uniquely defined.
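
A small example of the NFC/NFD point, using the name "café" (my own
choice of example; the helper hex_bytes is hypothetical). Both strings
below are valid UTF-8 for the same visible text, yet they are
different byte sequences.

```c
#include <stdio.h>
#include <string.h>

/* NFC: precomposed U+00E9 (e with acute) */
static const char cafe_nfc[] = "caf\xc3\xa9";
/* NFD: plain 'e' followed by U+0301 (combining acute accent) */
static const char cafe_nfd[] = "cafe\xcc\x81";

/* Write the bytes of s into out as space-separated hex,
   stopping early if out is too small. */
static void hex_bytes(const char *s, char *out, size_t outsz)
{
  size_t used = 0;

  if (outsz == 0)
    return;
  out[0] = '\0';
  for (; *s != '\0' && used + 4 <= outsz; s++)
    used += (size_t) snprintf(out + used, outsz - used, "%s%02x",
                              used ? " " : "",
                              (unsigned) (unsigned char) *s);
}
```

hex_bytes gives "63 61 66 c3 a9" for the first string and
"63 61 66 65 cc 81" for the second, and strcmp on the two is nonzero,
so to the filesystem these are two distinct filenames.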

So, it seems to me that one cannot use iconv unless
--remote-encoding and --local-encoding have been specified
by the user. And if that is the case, then perhaps iconv
is already invoked (in the iri code; I have not checked the details).
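
For reference, a minimal sketch of what an iconv conversion looks like
once both encodings are known. The function convert_name is
hypothetical and is not how wget's iri code does it (I have not
checked); the two encoding arguments stand for the values the user
would give with --remote-encoding and --local-encoding. Which encoding
names are accepted depends on the local iconv implementation.

```c
#include <iconv.h>
#include <stddef.h>

/* Hypothetical helper: convert inlen bytes at in from encoding
   `from` to encoding `to`, writing at most outsz bytes to out.
   Returns the number of bytes written, or -1 on any failure
   (unknown encoding, invalid input, output buffer too small). */
static long convert_name(const char *from, const char *to,
                         const char *in, size_t inlen,
                         char *out, size_t outsz)
{
  iconv_t cd = iconv_open(to, from);   /* note: target comes first */
  char *inp = (char *) in;             /* iconv does not modify the input */
  char *outp = out;
  size_t inleft = inlen, outleft = outsz;
  size_t rc;

  if (cd == (iconv_t) -1)
    return -1;
  rc = iconv(cd, &inp, &inleft, &outp, &outleft);
  if (rc != (size_t) -1)
    rc = iconv(cd, NULL, NULL, &outp, &outleft);  /* flush shift state */
  iconv_close(cd);
  if (rc == (size_t) -1)
    return -1;
  return (long) (outsz - outleft);
}
```

With from = "KOI8-R" and to = "UTF-8", the single byte 0xc1 would come
out as the two bytes 0xd0 0xb0, provided the local iconv knows both
names. Without a trustworthy `from` there is nothing sensible to pass
here, which is the point made above.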

Andries
