bug-wget

From: Yuxi Hao
Subject: Re: [Bug-wget] Patch: Make url_file_name also convert remote path to local encoded
Date: Tue, 14 Nov 2017 19:59:55 +0800

Dear Eli and Tim,

First, I should say that my last two patches address different problems.

To make that clear:

'Make url_file_name also convert remote path to local encoded' converts all 
characters taken from the URL (sent by the server, mostly UTF-8) to the locale 
encoding (GBK, for example), and only then appends them to the local path given 
by '-P'. Otherwise we would run iconv on a mixed-encoding string, and an error 
occurs. Right? :)
That patch is about iconv.
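
To illustrate the order of operations, here is a minimal sketch. It is not the code from the patch: convert_charset and build_local_name are hypothetical helpers, and the '/' separator and the 4x output bound are assumptions made for the example. It converts the remote part first and appends it to the prefix afterwards, so iconv never sees a mixed-encoding string.

```c
#include <iconv.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, not wget's code: convert `in` from charset
   `from_cs` to charset `to_cs`.  Returns a malloc'd NUL-terminated
   string, or NULL if the conversion fails (e.g. invalid input bytes). */
static char *
convert_charset (const char *in, const char *from_cs, const char *to_cs)
{
  iconv_t cd = iconv_open (to_cs, from_cs);
  if (cd == (iconv_t) -1)
    return NULL;

  size_t in_left = strlen (in);
  size_t cap = in_left * 4 + 16;   /* assumed generous output bound */
  char *out = malloc (cap);
  if (!out)
    {
      iconv_close (cd);
      return NULL;
    }

  char *src = (char *) in;         /* iconv does not modify the input */
  char *dst = out;
  size_t out_left = cap - 1;       /* keep one byte for the final NUL */
  size_t rc = iconv (cd, &src, &in_left, &dst, &out_left);
  if (rc != (size_t) -1)
    rc = iconv (cd, NULL, NULL, &dst, &out_left);  /* flush shift state */
  iconv_close (cd);

  if (rc == (size_t) -1)
    {
      free (out);
      return NULL;
    }
  *dst = '\0';
  return out;
}

/* Hypothetical helper: convert the remote (UTF-8) path to the local
   charset first, and only then append it to the local prefix, which is
   already in the local charset. */
static char *
build_local_name (const char *prefix, const char *remote_utf8,
                  const char *local_cs)
{
  char *conv = convert_charset (remote_utf8, "UTF-8", local_cs);
  if (!conv)
    return NULL;

  char *res = malloc (strlen (prefix) + 1 + strlen (conv) + 1);
  if (res)
    {
      strcpy (res, prefix);
      strcat (res, "/");
      strcat (res, conv);
    }
  free (conv);
  return res;
}
```

If the bytes handed to convert_charset are already GBK while the declared source charset is UTF-8, iconv rejects them and the helper returns NULL. That is the failure you get when the prefix and the remote part are concatenated before the conversion.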

'Fix printing mutibyte characters as unprintable characters on Windows' is the 
other one. It needs 'setlocale' to be called on Windows when 'ENABLE_NLS' is not 
defined, so that non-ASCII characters are displayed correctly in the console, 
as Eli said. :) Please refer to 
https://msdn.microsoft.com/en-us/library/x99tb11d.aspx.
That patch is about displaying text in the console.
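
A rough sketch of the idea follows. It is not the actual patch: in_startup_c_locale and init_console_locale are hypothetical names, and passing "" (the user's default locale) is an assumption about which locale to request.

```c
#include <locale.h>
#include <string.h>

/* Returns 1 while the program is still in the implicit "C" locale that
   every C program starts in.  Passing NULL only queries the locale. */
static int
in_startup_c_locale (void)
{
  const char *cur = setlocale (LC_ALL, NULL);
  return cur != NULL && strcmp (cur, "C") == 0;
}

/* Hypothetical init hook, called once early in main().  On Windows
   builds without ENABLE_NLS nothing else calls setlocale, so do it
   here; elsewhere just report the current locale unchanged. */
static const char *
init_console_locale (void)
{
#if defined _WIN32 && !defined ENABLE_NLS
  /* Assumption: "" selects the user's default locale and codepage. */
  return setlocale (LC_ALL, "");
#else
  return setlocale (LC_ALL, NULL);
#endif
}
```

Before the hook runs, in_startup_c_locale() returns 1, which is the state Eli describes in the quoted message. On the Windows path, the hook returns NULL if setlocale fails.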


Best Regards,
YX Hao


> -----Original Message-----
> From: Eli Zaretskii [mailto:address@hidden]
> Sent: November 14, 2017 0:33
> To: Tim Rühsen <address@hidden>
> Cc: address@hidden; address@hidden
> Subject: Re: [Bug-wget] Patch: Make url_file_name also convert remote path to
> local encoded
> 
> > Cc: address@hidden, address@hidden
> > From: Tim Rühsen <address@hidden>
> > Date: Mon, 13 Nov 2017 16:36:39 +0100
> >
> > > I don't think it's a Gnulib issue.  The problem is that on Windows,
> > > the implicit call at the beginning of Wget
> > >
> > >   setlocale (LC_ALL, "C");
> >
> > Why is there an explicit call with "C" ? There is an explicit call with "".
> 
> I said "implicit", not "explicit".  Such an implicit call is made at the
> beginning of every C program, per ANSI C Standard.  Right?
> 
> The MSDN documentation says it clearly:
> 
>   At program startup, the equivalent of the following statement is executed:
> 
>     setlocale( LC_ALL, "C" );
> 
> > From the man page:
> > "If locale is an empty string, "", each part of the locale that should
> > be modified is set according to the environment variables."
> 
> The call with a locale of "" is only done in a build that has ENABLE_NLS
> defined.  I was talking about a build which didn't define ENABLE_NLS.
> 
> > > is not good enough to work in multibyte locales of the Far East,
> > > because the Windows runtime assumes a single-byte locale after that
> > > call.  And since Wget happens to need to display text and create
> > > files with non-ASCII characters, it gets hit more than other programs.
> >
> > I (hopefully) can understand why this doesn't work. NTFS uses UTF-16
> > for the filenames. If your environment specifies a single-character
> > encoding (e.g. C) and we use at some point a multi-character encoding (e.g.
> > utf-8), then any automatic conversion to UTF-16 filenames are likely
> > to fail. For me the question is: a) does wget has a bug (e.g. creating
> > a filename with a wrong encoded name string or b) does the Windows API
> > has a problem.
> >
> > > The proposed solution is to add a special call to setlocale which
> > > gets this right on Windows.
> >
> > Why can't we just convert the filename string into the correct
> > encoding and then create the file ? What do I miss ?
> 
> I guess you are missing a short introduction to the Windows l10n/i18n mess.
> Let me try.
> 
> First, the fact that NTFS uses UTF-16 is not really relevant.  Wget uses
> 'char *' strings, not 'wchar *' strings to store file names and call C
> library functions that accept file names.  So we cannot use the UTF-16
> encoding of non-ASCII file names directly.  Instead, we use the locale's
> codepage (the C library and the OS APIs then convert to UTF-16 before
> hitting the disk, but that's not important now).
> 
> Next, creating and opening file names is not the only problem: we need also to
> display these file names and URLs, and that also needs to use the encoding
> expected by the Windows console.
> 
> Now, in any locale which uses single-byte encoding of non-ASCII characters,
> the C locale will support those characters, both for I/O and for functions
> like strcmp, strlen, strcoll, etc.  But not in double-byte locales of the
> Far East: there, you must explicitly call setlocale with the correct
> codepage, to have the local character set supported.  This support includes
> manipulating file names, calling C library functions to access files, and
> displaying non-ASCII text, such as file names and URLs, on the console.
> 
> IOW, this is a Windows runtime subtlety that unfortunately needs to be fixed
> in the application code.
> 
> (UTF-8 is not relevant at all here, because Windows doesn't support
> UTF-8 as the locale's codeset; if you try to call setlocale to set
> UTF-8 as the codeset, setlocale will simply fail.  So if we have a
> UTF-8 encoded URL or file name inside wget, we must convert it to the current
> codepage by calling libiconv functions.)
> 
> Does the above make sense?  Let me know if I have to explain some more.
