bug-wget

Re: [Bug-wget] bad filenames (again)


From: Andries E. Brouwer
Subject: Re: [Bug-wget] bad filenames (again)
Date: Wed, 19 Aug 2015 23:52:12 +0200
User-agent: Mutt/1.5.21 (2010-09-15)

On Wed, Aug 19, 2015 at 10:46:30PM +0300, Eli Zaretskii wrote:

> OK, then let me explain my line of reasoning.  Plain ASCII is valid
> UTF-8, and if converting with iconv assuming it's UTF-8 fails, you
> know it's not valid UTF-8.  So the last 3 possibilities in your
> suggestion boil down to "try converting as if it were UTF-8, and if
> that fails, you know it's Unknown".

Yes, although I would not invoke iconv to actually convert from UTF-8 to
UTF-8. Unicode is a complicated beast, and it is not certain that
conversion from UTF-8 to UTF-8 is the identity transformation.
(For example, implementations may prefer either NFC or NFD.
MacOS has its own NFD-like version for filenames.)
But you are right, one can use it as a test.
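
To illustrate the point, here is a minimal sketch in Python (not wget code) of
testing for valid UTF-8 without converting anything. The helper name
is_valid_utf8 is hypothetical, and Python's strict decoder stands in for
whatever validator one would really use. It also shows that the NFC and NFD
forms of the same character are different byte strings, both valid UTF-8,
which is why a UTF-8 to UTF-8 "conversion" need not be the identity.

```python
import unicodedata


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw is well-formed UTF-8. The input is only checked,
    never rewritten, so no normalization can sneak in."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same character (e with acute accent) in two normalization forms.
nfc = unicodedata.normalize("NFC", "\u00e9").encode("utf-8")  # precomposed
nfd = unicodedata.normalize("NFD", "\u00e9").encode("utf-8")  # e + combining accent

print(is_valid_utf8(b"plain-ascii.txt"))      # ASCII is valid UTF-8
print(is_valid_utf8(b"\xe9t\xe9"))            # Latin-1 bytes, not valid UTF-8
print(nfc, nfd, nfc == nfd)                   # both valid, but different bytes
```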

After finding out that the charset is unknown I want to hex-encode
the entire filename. On the other hand, if the appropriate thing
is to invoke iconv to convert from one charset to another, I want
to hex-encode only the failing bytes.
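
The two policies could be sketched as follows, again in Python rather than C.
This is only an illustration under assumptions: the %XX escape format, the
function names, and the "percent-escape" error handler name are all made up
here, and Python's codecs stand in for iconv.

```python
import codecs


def _percent_escape(exc):
    """Decode error handler: replace the undecodable bytes with %XX escapes
    and resume decoding right after them."""
    if not isinstance(exc, UnicodeDecodeError):
        raise exc
    bad = exc.object[exc.start:exc.end]
    return "".join(f"%{b:02X}" for b in bad), exc.end


# Hypothetical handler name, registered once at import time.
codecs.register_error("percent-escape", _percent_escape)


def escape_whole(raw: bytes) -> str:
    """Charset unknown: hex-encode every byte, ASCII bytes included."""
    return "".join(f"%{b:02X}" for b in raw)


def convert_escaping_failures(raw: bytes, from_charset: str) -> str:
    """Charset known (or assumed): convert what can be converted and
    hex-encode only the bytes the decoder rejects."""
    return raw.decode(from_charset, errors="percent-escape")


print(escape_whole(b"a\xff"))
print(convert_escaping_failures(b"caf\xc3\xa9-\xff.txt", "utf-8"))
```

In the second call only the stray 0xFF byte is escaped; the rest of the name
is converted normally.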

The reason for this difference is that (a) if there is reason to
expect that conversion should be possible, for example because the
user specified the from-charset as GB18030, and it fails, then it
often fails only in a few isolated places where Microsoft extensions
are used, and it is more user-friendly to do the conversion where
possible; but (b) if nothing is known, then the character set can be
a multibyte one like SJIS where ASCII bytes occur as second halves
of symbols, and not escaping such ASCII bytes is confusing
and sometimes leads to strange problems.
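
A small Python sketch of case (b), assuming the same made-up %XX escape
format: in Shift_JIS the katakana letter SO (U+30BD) is encoded as the bytes
0x83 0x5C, and 0x5C on its own is the ASCII backslash. Escaping only the
non-ASCII bytes leaves a stray backslash in the name, while escaping
everything does not.

```python
# KATAKANA LETTER SO, whose Shift_JIS encoding ends in byte 0x5C.
raw = "\u30bd".encode("shift_jis")

# Policy 1: escape only bytes >= 0x80, keep ASCII bytes as they are.
escaped_non_ascii_only = "".join(
    chr(b) if b < 0x80 else f"%{b:02X}" for b in raw
)

# Policy 2: escape every byte, which is what an unknown charset calls for.
escaped_everything = "".join(f"%{b:02X}" for b in raw)

print(raw)                      # two bytes, the second is an ASCII backslash
print(escaped_non_ascii_only)   # a backslash survives in the filename
print(escaped_everything)       # no raw ASCII byte is left
```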

Andries


