Re: [Bug-wget] bad filenames (again)

From:	Eli Zaretskii
Subject:	Re: [Bug-wget] bad filenames (again)
Date:	Mon, 17 Aug 2015 18:27:05 +0300

> Date: Mon, 17 Aug 2015 12:59:05 +0200
> From: "Andries E. Brouwer" <address@hidden>
> Cc: "Andries E. Brouwer" <address@hidden>, address@hidden,
>         address@hidden
> 
> On Mon, Aug 17, 2015 at 05:39:34AM +0300, Eli Zaretskii wrote:
> 
> (i) [about using setlocale]
> 
> > > > First, relying on UTF-8 locale to be announced in the environment
> > > > is less portable than it could be: it's better to call 'setlocale'
> > > > Then ... at least Cygwin will not be excluded from this feature.
> > > 
> > > I left the wget behaviour for MSDOS / Windows / Cygwin unchanged
> > > because I do not know anything about these platforms.
> > 
> > These systems don't normally have the LC_* environment
> > variables, and their 'setlocale' (with the exception of Cygwin) does
> > not look at those variables.  But you _can_ obtain the current locale
> > on all supported systems by calling 'setlocale'.
> 
> Good. Then perhaps using setlocale would be better.
> 
> I will not do so - do not feel confident on the Windows platform.

You don't need to -- do it on your OS, and the same will work
elsewhere.
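
Something along these lines should be enough; this is only a sketch,
and 'init_ctype_locale' is a name I made up for illustration, not an
existing wget function:

```c
#include <locale.h>
#include <stddef.h>

/* Hypothetical helper, not part of wget: adopt the user's default
   locale for character classification and return its name, or NULL
   if the locale could not be set.  */
const char *
init_ctype_locale (void)
{
  /* An empty string asks for the default locale of the environment
     (or of the system, where LC_* variables are not consulted).  */
  if (setlocale (LC_CTYPE, "") == NULL)
    return NULL;

  /* A NULL second argument only queries the current setting.  */
  return setlocale (LC_CTYPE, NULL);
}
```

Only LC_CTYPE is touched here, on the assumption that we care about
the character encoding and nothing else (number formats etc. stay in
the "C" locale).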

> After all, the goal is not to find out what locale we are in,
> but to find out whether it might be a good idea to escape certain
> bytes in a filename.

Indeed, what you want is the current locale's codeset; see below.

> On Windows I guess that FAT filesystems will use some code page,
> and NTFS filesystems will use Unicode.

Not exactly.  The functions that emulate Posix and accept file names
as "char *" strings cannot use Unicode on Windows, because using
Unicode means using wchar_t strings instead.  So, unless Someone™
changes wget to do that, at least on Windows, the Windows port will
still use the current system codepage, even on NTFS, because that's
what functions like 'fopen', 'open', 'stat', etc. assume.

> (ii) [about possibly using iconv]
> 
> >> How do you guess the original character set?
> 
> Since you pass silently over this point

No, I just missed that, sorry.

The answer is to call "nl_langinfo (CODESET)".  Windows doesn't have
'nl_langinfo', but it is easily emulated with more or less a single
API call, or we could use the Gnulib replacement (which already does
support Windows).
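
On a Posix system that could look roughly like the sketch below.  The
two function names are hypothetical, and the loose comparison in
'codeset_is_utf8' is my assumption about how spellings such as "utf8"
should be handled:

```c
#include <langinfo.h>
#include <locale.h>
#include <strings.h>

/* Hypothetical helper: return the codeset name of the current
   locale, e.g. "UTF-8" or "KOI8-R".  */
const char *
locale_codeset (void)
{
  /* nl_langinfo reports on the current locale, so switch from the
     default "C" locale to the user's one first.  */
  setlocale (LC_CTYPE, "");
  return nl_langinfo (CODESET);
}

/* Hypothetical helper: nonzero if NAME looks like a spelling of
   UTF-8.  Accepting "utf8" as well is an assumption.  */
int
codeset_is_utf8 (const char *name)
{
  return strcasecmp (name, "UTF-8") == 0
         || strcasecmp (name, "UTF8") == 0;
}
```

The caller would then test 'codeset_is_utf8 (locale_codeset ())' to
decide whether bytes above 0x7F in a file name can be left alone.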

> it seems there is no good way to involve iconv.

Actually, there's no problem, see above.  Many programs do it like
that already.
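
For the record, the usual pattern is sketched below.  'convert_name'
is a hypothetical helper, not existing wget code, and the buffer
sizing assumes the target is a byte-oriented codeset such as UTF-8:

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper: convert the NUL-terminated string IN from
   codeset FROM to codeset TO.  Returns a malloc'ed string that the
   caller must free, or NULL on any failure.  */
char *
convert_name (const char *in, const char *from, const char *to)
{
  iconv_t cd = iconv_open (to, from);
  if (cd == (iconv_t) -1)
    return NULL;                /* unknown codeset pair */

  size_t inleft = strlen (in);
  /* Assumption: 4 output bytes per input byte is enough, which holds
     for UTF-8 output.  If it is not, iconv fails with E2BIG and we
     return NULL rather than overrun the buffer.  */
  size_t outsize = 4 * inleft + 1;
  char *out = malloc (outsize);
  if (out == NULL)
    {
      iconv_close (cd);
      return NULL;
    }

  char *inp = (char *) in;      /* iconv does not modify the input */
  char *outp = out;
  size_t outleft = outsize - 1; /* keep room for the final NUL */

  size_t r = iconv (cd, &inp, &inleft, &outp, &outleft);
  if (r != (size_t) -1)
    /* Flush any pending shift sequence of a stateful codeset.  */
    r = iconv (cd, NULL, NULL, &outp, &outleft);
  iconv_close (cd);

  if (r == (size_t) -1)
    {
      free (out);
      return NULL;
    }
  *outp = '\0';
  return out;
}
```

With the codeset of the current locale as TO and the remote encoding
as FROM (e.g. "KOI8-R"), a NULL return tells the caller to fall back
to escaping the bytes.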

> > This is a philosophical question: is a Cyrillic file name encoded in
> > koi8-r and the same name encoded in UTF-8 a "modified data" or the
> > same data expressed in different codesets.
> 
> Unix filenames are not necessarily in any particular character set.
> They are sequences of bytes different from NUL and '/'.
> A different sequence of bytes is a different filename.

As long as you treat them as UTF-8 encoded strings, they are, for all
practical purposes, in the Unicode character set.  (Which, btw, raises
the question of what to do if the UTF-8 sequence encodes U+FFFD or is
simply invalid -- do we treat such sequences as control characters or
not?)
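
Whichever way we decide, telling these cases apart is cheap.  Here is
a sketch; the enum and the function name are invented for
illustration, and it does not settle the policy question:

```c
#include <stddef.h>

/* Hypothetical classification of a byte string.  */
enum utf8_status { UTF8_OK, UTF8_HAS_REPLACEMENT, UTF8_INVALID };

/* Hypothetical helper: check the LEN bytes at S.  Returns
   UTF8_INVALID for malformed, overlong, surrogate or out-of-range
   sequences, UTF8_HAS_REPLACEMENT if the string is well-formed but
   contains U+FFFD, and UTF8_OK otherwise.  */
enum utf8_status
classify_utf8 (const unsigned char *s, size_t len)
{
  enum utf8_status status = UTF8_OK;
  size_t i = 0;

  while (i < len)
    {
      unsigned char c = s[i];
      unsigned long cp, min;
      size_t n, k;              /* n = number of continuation bytes */

      if (c < 0x80)
        {
          i++;                  /* plain ASCII */
          continue;
        }
      else if ((c & 0xE0) == 0xC0) { n = 1; cp = c & 0x1F; min = 0x80; }
      else if ((c & 0xF0) == 0xE0) { n = 2; cp = c & 0x0F; min = 0x800; }
      else if ((c & 0xF8) == 0xF0) { n = 3; cp = c & 0x07; min = 0x10000; }
      else
        return UTF8_INVALID;    /* stray continuation or 0xF8..0xFF */

      if (len - i <= n)
        return UTF8_INVALID;    /* sequence is truncated */

      for (k = 1; k <= n; k++)
        {
          if ((s[i + k] & 0xC0) != 0x80)
            return UTF8_INVALID;
          cp = (cp << 6) | (s[i + k] & 0x3F);
        }

      /* Reject overlong forms, surrogates and values past U+10FFFF.  */
      if (cp < min || cp > 0x10FFFF || (cp >= 0xD800 && cp <= 0xDFFF))
        return UTF8_INVALID;
      if (cp == 0xFFFD)
        status = UTF8_HAS_REPLACEMENT;

      i += n + 1;
    }
  return status;
}
```

The code that decides about escaping could then handle the two
non-OK results differently, or lump them together.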

> Also, "the same name encoded in UTF-8" is an optimistic description.
> Should the Unicode be NFC? Or NFD? MacOS has a third version.

It doesn't matter, since any filesystem worth its sectors will DTRT
and any ls-like program will, too, and will show you a perfectly
legible file name.

> Even if the filename had a well-defined and known character set,
> conversion to UTF-8 is not uniquely defined.

Do whatever iconv does, and we will be fine.
