Re: [Bug-wget] Patch: Make url_file_name also convert remote path to loc
From: Tim Rühsen
Subject: Re: [Bug-wget] Patch: Make url_file_name also convert remote path to local encoded
Date: Wed, 15 Nov 2017 20:28:17 +0100
User-agent: KMail/5.2.3 (Linux/4.13.0-1-amd64; KDE/5.37.0; x86_64; ; )
On Monday, 13 November 2017 18:32:46 CET Eli Zaretskii wrote:
> > Cc: address@hidden, address@hidden
> > From: Tim Rühsen <address@hidden>
> > Date: Mon, 13 Nov 2017 16:36:39 +0100
> >
> > > I don't think it's a Gnulib issue. The problem is that on Windows,
> > > the implicit call at the beginning of Wget
> > >
> > > setlocale (LC_ALL, "C");
> >
> > Why is there an explicit call with "C"? There is an explicit call with
> > "".
>
> I said "implicit", not "explicit". Such an implicit call is made at
> the beginning of every C program, per ANSI C Standard. Right?
>
> The MSDN documentation says it clearly:
>
> At program startup, the equivalent of the following statement is executed:
>
> setlocale( LC_ALL, "C" );
>
> > From the man page:
> > "If locale is an empty string, "", each part of the locale that should
> > be modified is set according to the environment variables."
>
> The call with a locale of "" is only done in a build that has
> ENABLE_NLS defined. I was talking about a build which didn't define
> ENABLE_NLS.
>
> > > is not good enough to work in multibyte locales of the Far East,
> > > because the Windows runtime assumes a single-byte locale after that
> > > call. And since Wget happens to need to display text and create files
> > > with non-ASCII characters, it gets hit more than other programs.
> >
> > I (hopefully) can understand why this doesn't work. NTFS uses UTF-16 for
> > the filenames. If your environment specifies a single-byte encoding
> > (e.g. C) and we use at some point a multibyte encoding (e.g.
> > UTF-8), then any automatic conversion to UTF-16 filenames is likely to
> > fail. For me the question is: a) does wget have a bug (e.g. creating a
> > filename from a wrongly encoded name string), or b) does the Windows API
> > have a problem?
> >
> > > The proposed solution is to add a special call to setlocale which gets
> > > this right on Windows.
> >
> > Why can't we just convert the filename string into the correct encoding
> > and then create the file? What am I missing?
>
> I guess you are missing a short introduction to the Windows l10n/i18n
> mess. Let me try.
>
> First, the fact that NTFS uses UTF-16 is not really relevant. Wget
> uses 'char *' strings, not 'wchar_t *' strings, to store file names and
> to call C library functions that accept file names. So we cannot use the
> UTF-16 encoding of non-ASCII file names directly. Instead, we use the
> locale's codepage (the C library and the OS APIs then convert to
> UTF-16 before hitting the disk, but that's not important now).
>
> Next, creating and opening files is not the only problem: we also
> need to display these file names and URLs, and that also needs to use
> the encoding expected by the Windows console.
>
> Now, in any locale which uses single-byte encoding of non-ASCII
> characters, the C locale will support those characters, both for I/O
> and for functions like strcmp, strlen, strcoll, etc. But not in
> double-byte locales of the Far East: there, you must explicitly call
> setlocale with the correct codepage, to have the local character set
> supported. This support includes manipulating file names, calling C
> library functions to access files, and displaying non-ASCII text, such
> as file names and URLs, on the console.
>
> IOW, this is a Windows runtime subtlety that unfortunately needs to be
> fixed in the application code.
>
> (UTF-8 is not relevant at all here, because Windows doesn't support
> UTF-8 as the locale's codeset; if you try to call setlocale to set
> UTF-8 as the codeset, setlocale will simply fail. So if we have a
> UTF-8 encoded URL or file name inside wget, we must convert it to the
> current codepage by calling libiconv functions.)
>
> Does the above make sense? Let me know if I have to explain some
> more.
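
For reference, the conversion step mentioned in the parenthetical about UTF-8 above could look roughly like the sketch below. This is only an illustration under assumptions, not the code from the patch: the helper name convert_from_utf8 and its to_codeset parameter are hypothetical, the output buffer size is a guess that should be generous for the legacy codepages discussed here, and the second argument of iconv() is declared as 'const char **' on some systems, in which case the cast needs adjusting.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper (not wget's actual code): convert a UTF-8 string
   to the codeset named by to_codeset.  Returns a malloc'ed string that
   the caller must free, or NULL if the conversion is not possible. */
char *convert_from_utf8(const char *utf8, const char *to_codeset)
{
    iconv_t cd = iconv_open(to_codeset, "UTF-8");
    if (cd == (iconv_t) -1)
        return NULL;                      /* unknown codeset name */

    size_t in_left = strlen(utf8);
    size_t out_size = in_left * 4 + 1;    /* assumed to be large enough */
    char *out = malloc(out_size);
    if (out == NULL) {
        iconv_close(cd);
        return NULL;
    }

    char *in_ptr = (char *) utf8;         /* iconv() wants a non-const pointer */
    char *out_ptr = out;
    size_t out_left = out_size - 1;       /* keep room for the terminator */

    size_t rc = iconv(cd, &in_ptr, &in_left, &out_ptr, &out_left);
    if (rc != (size_t) -1)
        /* Flush any pending shift sequence of a stateful encoding. */
        rc = iconv(cd, NULL, NULL, &out_ptr, &out_left);
    iconv_close(cd);

    if (rc == (size_t) -1) {              /* invalid input or buffer too small */
        free(out);
        return NULL;
    }
    *out_ptr = '\0';
    return out;
}
```

A caller would pass the name of the current codepage as to_codeset; how that name is obtained on Windows is left out here on purpose, since it depends on the setlocale call discussed above.
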
Thank you, Eli.
I just wonder if we have the same problem on the Linux console as well.
I mean, *not* calling setlocale(LC_ALL, "") (when ENABLE_NLS is undefined)
would leave the program in the C locale, even if the console/environment
uses something else.
But no one has complained so far... so my question:
Did you test the patch, and does it work for you?
If yes, I am going to apply it.
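
To make the Linux question above easy to check, here is a minimal sketch. It is not taken from wget, and the helper names in_c_locale and adopt_environment_locale are made up for illustration. It relies only on standard behaviour: a program starts as if setlocale(LC_ALL, "C") had been called, and setlocale with a NULL locale argument just queries the current setting.

```c
#include <locale.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical helper: returns 1 while LC_CTYPE is still the "C"
   (or "POSIX") locale the program started with, 0 otherwise. */
int in_c_locale(void)
{
    /* A NULL locale argument queries the setting without changing it. */
    const char *cur = setlocale(LC_CTYPE, NULL);
    return cur != NULL
        && (strcmp(cur, "C") == 0 || strcmp(cur, "POSIX") == 0);
}

/* Hypothetical helper: switch to the locale named by the environment
   (LANG, LC_ALL, ...).  Returns the new locale name, or NULL if the
   environment names a locale that is not available on this system. */
const char *adopt_environment_locale(void)
{
    return setlocale(LC_ALL, "");
}
```

Calling in_c_locale() at the start of main() should return 1 regardless of LANG; what it returns after adopt_environment_locale() depends on the environment, which is exactly the difference between a build with and without ENABLE_NLS.
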
Regards, Tim