Re: [Bug-wget] bad filenames (again)

From: Eli Zaretskii
Subject: Re: [Bug-wget] bad filenames (again)
Date: Tue, 18 Aug 2015 21:32:16 +0300

> Date: Tue, 18 Aug 2015 19:51:58 +0200
> From: "Andries E. Brouwer" <address@hidden>
> Cc: "Andries E. Brouwer" <address@hidden>, address@hidden,
>         address@hidden
> On Tue, Aug 18, 2015 at 07:43:05PM +0300, Eli Zaretskii wrote:
> > > > If we convert the file names using iconv, Windows users will also be
> > > > happier, at least when the remote URL can be encoded in their system
> > > > codepage.
> > > 
> > > Windows does not differ from Unix - since the remote character set
> > > is unknown and not necessarily constant, a conversion is impossible.
> > 
> > Windows does differ from Unix, in that arbitrary byte sequences cannot
> > be used in file names.
> Of course. The code already tries to take care of that.

It does that badly.

> > See
> >
> >   https://msdn.microsoft.com/en-us/library/windows/desktop/aa365247%28v=vs.85%29.aspx
> >
> > for the gory details.
> Thanks for the reference!

You are welcome.

> > > I already indicated the 1-line change that fixes the Windows problems.
> > 
> > It doesn't, unfortunately.
> You are too brief. What is wrong with the change that changes
>     /* insert some test for Windows */
> into
>     return true;
> ?

It preserves the current behavior, whereby almost every non-ASCII URL
out there gets saved in a file name that is either inaccessible to
localized programs, or shows as illegible mojibake.

> That change only changes what wget does with bytes in the 128-159 range,
> and reading the gory details I fail to see any problem. Almost the opposite:
> "Use any character in the current code page for a name, including Unicode
> characters and characters in the extended character set (128–255)"

You need to read between the lines, as it's Microsoft speak.  First,
not every codepoint between 128 and 255 is valid in every codepage.
Second, Windows stores file names in UTF-16, so it attempts to convert
the byte stream into UTF-16 assuming the byte stream is in the current
codepage (which is incorrect in most cases, as we get UTF-8 instead).
The result is an utter mess.
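
Both points can be checked with a few lines of Python. This is only a
sketch: cp1252 is an assumed example codepage (none is named in this
thread), and is_valid_in is a hypothetical helper, not anything in wget.

```python
# The word from the demonstration, as it travels in a UTF-8 encoded URL.
name = "Сердце"
utf8_bytes = name.encode("utf-8")

# Second point: reading UTF-8 bytes as if they were text in a single-byte
# codepage produces mojibake. cp1252 is an assumption for illustration.
mojibake = utf8_bytes.decode("cp1252")
print(mojibake)

# Decoding with the charset the bytes were really written in recovers the name.
recovered = utf8_bytes.decode("utf-8")


def is_valid_in(codepage: str, byte_value: int) -> bool:
    """Return True if the single byte maps to a character in the codepage."""
    try:
        bytes([byte_value]).decode(codepage)
        return True
    except UnicodeDecodeError:
        return False


# First point: some byte values above 127 have no character assigned.
print(is_valid_in("cp1252", 0x81))
print(is_valid_in("cp1252", 0xE9))
```

The same bytes give different garbage under every other codepage, which
is why the remote charset has to be known before any conversion.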

> Thanks to your reference I now feel confident to make that 1-line change
> so that also Windows users are happy.

Do you still think that?  Then allow me a small demonstration:

  --2015-08-18 21:23:38--  
  Loaded CA certificate 'd:/usr/etc/ssl/ca-bundle.crt'
  Resolving ru.wikipedia.org (ru.wikipedia.org)...
  Connecting to ru.wikipedia.org (ru.wikipedia.org)||:443... 
  HTTP request sent, awaiting response... 404 Not Found
  2015-08-18 21:23:39 ERROR 404: Not Found.

  --2015-08-18 21:23:39--  
  Reusing existing connection to ru.wikipedia.org:443.
  HTTP request sent, awaiting response... 200 OK
  Length: unspecified [text/html]
  Saving to: '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡'

  ╫%80┬í╫%80┬╡╫%81Γ%8     [ <=>                  ] 180.32K   923KB/s   in 0.2s

  2015-08-18 21:23:40 (923 KB/s) - '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡' saved [184652]

Do you really think that '╫%80┬í╫%80┬╡╫%81Γ%82¼╫%80┬┤╫%81Γ%80á╫%80┬╡'
is a good way to express 'Сердце'?  Do you think someone will be able
to read and understand such a file name?  How would you go about
converting it back to what it should be?
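
For what it is worth, one chain of wrong conversions that happens to
reproduce this exact name can be sketched in Python. Everything here is
a guess made to fit the output, not a description of wget's code: the
sketch assumes the UTF-8 bytes were read as codepage 1255, encoded to
UTF-8 a second time, had bytes 128-159 escaped as %XX, and were then
displayed in codepage 437. The names garble and ungarble are
hypothetical.

```python
def garble(name: str, ansi_cp: str = "cp1255", console_cp: str = "cp437") -> str:
    """Apply the assumed chain of wrong conversions to a file name."""
    # UTF-8 bytes misread as text in the (assumed) system codepage.
    misread = name.encode("utf-8").decode(ansi_cp)
    # The misread text encoded to UTF-8 a second time.
    twice_encoded = misread.encode("utf-8")
    shown = []
    for byte in twice_encoded:
        if 0x80 <= byte <= 0x9F:
            # Bytes in the 128-159 range are percent-escaped.
            shown.append(f"%{byte:02X}")
        else:
            # Everything else is displayed in the (assumed) console codepage.
            shown.append(bytes([byte]).decode(console_cp))
    return "".join(shown)


def ungarble(shown: str, ansi_cp: str = "cp1255", console_cp: str = "cp437") -> str:
    """Undo garble(); works only if both codepages are guessed correctly."""
    raw = bytearray()
    i = 0
    while i < len(shown):
        if shown[i] == "%":
            # Turn a %XX escape back into the byte it stands for.
            raw.append(int(shown[i + 1:i + 3], 16))
            i += 3
        else:
            raw += shown[i].encode(console_cp)
            i += 1
    # Reverse the double encoding: UTF-8 -> ANSI bytes -> UTF-8 text.
    return raw.decode("utf-8").encode(ansi_cp).decode("utf-8")


print(garble("Сердце"))
print(ungarble(garble("Сердце")))
```

Getting the name back requires knowing both codepages and the escaping
rule, and a literal '%' in the original name would make the reversal
ambiguous. A user looking at the file in a directory listing has none of
that information.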

> (There are restrictions involving filenames that wget perhaps does not
> enforce: no LPT3, no final space or period, ... It might be useful to
> teach wget about such details.)

Indeed.  But that's a different issue, I think.
