
Re: [Bug-wget] bad filenames (again)


From: Andries E. Brouwer
Subject: Re: [Bug-wget] bad filenames (again)
Date: Wed, 19 Aug 2015 02:52:57 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Aug 19, 2015 at 01:43:51AM +0200, Ángel González wrote:

> And of course, there's the question of what to do if the filename we
> are trying to convert to utf-16 is not in fact valid utf-8.

My current understanding:

(i) there is a current patch that fixes most problems on Unix
and can be applied today

(ii) one also wants to fix Windows problems, and in the process
do something more general for Unix. We can discuss a future
patch that does something like:

Look at the remote filename.

Assign a character set as follows:
- if the user specified a from-charset, use that
- if the name is printable ASCII (in 0x20-0x7e), take ASCII
- if the name is non-ASCII and valid UTF-8, take UTF-8
- otherwise take Unknown.
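
A minimal sketch of the source-charset rules above, in Python for
brevity (wget itself is C). The function name and the string labels
are hypothetical. Two assumptions: DEL (0x7f) is treated as
non-printable, and a pure 7-bit name that contains control characters
falls through to Unknown, since it is neither printable ASCII nor
"non-ASCII".

```python
from typing import Optional


def guess_source_charset(name: bytes, from_charset: Optional[str] = None) -> str:
    """Pick a charset label for a remote filename given as raw bytes."""
    if from_charset:
        # An explicit user choice always wins.
        return from_charset
    if all(0x20 <= b <= 0x7E for b in name):
        # Printable ASCII only (DEL excluded, see assumption above).
        return "ASCII"
    if any(b >= 0x80 for b in name):
        try:
            name.decode("utf-8", errors="strict")
            return "UTF-8"
        except UnicodeDecodeError:
            pass
    # Invalid UTF-8, or 7-bit data with control characters.
    return "Unknown"
```

For example, the Latin-1 bytes b"\xc1ngel.txt" come out as Unknown
unless the user passes a from-charset.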

Determine a local character set as follows:
- if the user specified a to-charset, use that
- if the locale uses UTF-8, use that
- otherwise take ASCII
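
The local-charset choice could look roughly like this. Again only a
sketch: the function and parameter names are made up, and the locale
codeset is read with Python's locale.getpreferredencoding(), where C
code would more likely use nl_langinfo(CODESET).

```python
import locale
from typing import Optional


def pick_local_charset(to_charset: Optional[str] = None,
                       locale_codeset: Optional[str] = None) -> str:
    """Choose the charset for local filenames."""
    if to_charset:
        # An explicit user choice always wins.
        return to_charset
    if locale_codeset is None:
        # Ask the C library / environment for the locale's codeset.
        locale_codeset = locale.getpreferredencoding(False)
    # Accept spellings such as "UTF-8", "utf8" and "UTF_8".
    normalized = locale_codeset.replace("-", "").replace("_", "").upper()
    if normalized == "UTF8":
        return "UTF-8"
    return "ASCII"
```

The locale_codeset parameter exists only so that the rule can be
exercised without changing the process locale.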

Convert the name from from-charset to to-charset:
- if the user asked for unmodified filenames, do nothing
- if the name is ASCII, do nothing
- if the name is UTF-8 and the locale uses UTF-8, do nothing
- convert from Unknown by hex-escaping the entire name
- convert to ASCII by hex-escaping the entire name
- otherwise invoke iconv(); upon failure, escape the illegal bytes
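
The conversion step, sketched under several assumptions: Python's
codecs stand in for iconv(), the escape syntax is taken to be %XX,
"hex-escaping the entire name" is read literally as escaping every
byte, and the source charset is an ASCII-compatible byte encoding.
All names are hypothetical.

```python
def hex_escape(data: bytes) -> bytes:
    """Escape every byte as %XX (assumed escape syntax)."""
    return "".join("%%%02X" % b for b in data).encode("ascii")


def convert_name(name: bytes, src: str, dst: str,
                 keep_unmodified: bool = False) -> bytes:
    """Convert a filename from charset src to charset dst."""
    if keep_unmodified or src.upper() == "ASCII":
        return name
    if src.upper() == "UTF-8" and dst.upper() == "UTF-8":
        return name
    if src == "Unknown" or dst.upper() == "ASCII":
        return hex_escape(name)
    out = bytearray()
    # surrogateescape keeps undecodable input bytes as lone surrogates,
    # so they can be recovered and escaped individually below.
    for ch in name.decode(src, errors="surrogateescape"):
        try:
            out += ch.encode(dst)
        except UnicodeEncodeError:
            # Either an invalid input byte or a character that dst
            # cannot represent: escape the original source bytes.
            out += hex_escape(ch.encode(src, errors="surrogateescape"))
    return bytes(out)
```

With this reading, a UTF-8 name "€5" converted to ISO-8859-1 keeps
the "5" and escapes only the three bytes of the euro sign.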

See whether the resulting name can be used. On Unix all strings
(without NUL and '/') are ok. On Windows there are many restrictions,
so hex-escape any remaining problematic characters there.
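
A rough sketch of that final check. It covers only part of the
Windows rules: the reserved characters < > : " / \ | ? *, control
characters below 0x20, and a trailing dot or space. Reserved device
names such as CON or NUL are not handled. The function name and the
%XX syntax are assumptions.

```python
WINDOWS_RESERVED = set('<>:"/\\|?*')


def escape_for_filesystem(name: str, windows: bool) -> str:
    """Hex-escape characters that the target filesystem rejects."""
    out = []
    for ch in name:
        if windows:
            bad = ch in WINDOWS_RESERVED or ord(ch) < 0x20
        else:
            # On Unix only NUL and '/' are unusable in a name.
            bad = ch in ("\0", "/")
        out.append("%%%02X" % ord(ch) if bad else ch)
    # Windows does not allow names that end in a space or a dot.
    if windows and out and out[-1] in (" ", "."):
        out[-1] = "%%%02X" % ord(out[-1])
    return "".join(out)
```

So "a:b?.txt" stays as it is on Unix and becomes "a%3Ab%3F.txt" on
Windows.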

Since conversions to 8-bit character sets will often fail,
it is desirable to convince Windows to use Unicode as current codeset.
Maybe that requires a copy of the common fileio routines.

That is my view of the result of the present conversation.
Probably some refinements will be needed. Moreover, this
interacts with the IRI code, and that should be looked at.

Once we know what we want it is trivial to write the code,
but it may take a while to figure out what we want.
I think we should start applying the current patch.

Andries


