Re: [Bug-wget] bad filenames (again)

From:	Andries E. Brouwer
Subject:	Re: [Bug-wget] bad filenames (again)
Date:	Wed, 19 Aug 2015 23:52:12 +0200
User-agent:	Mutt/1.5.21 (2010-09-15)

On Wed, Aug 19, 2015 at 10:46:30PM +0300, Eli Zaretskii wrote:

> OK, then let me explain my line of reasoning.  Plain ASCII is valid
> UTF-8, and if converting with iconv assuming it's UTF-8 fails, you
> know it's not valid UTF-8.  So the last 3 possibilities in your
> suggestion boil down to "try converting as if it were UTF-8, and if
> that fails, you know it's Unknown".

Yes, although I would not invoke iconv to actually convert from UTF-8 to
UTF-8. Unicode is a complicated beast, and it is not certain that
conversion from UTF-8 to UTF-8 is the identity transformation.
(For example, implementations may prefer either NFC or NFD.
MacOS has its own NFD-like version for filenames.)
But you are right, one can use it as a test.
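
To illustrate the point about testing without converting, here is a
minimal sketch in Python (not wget code). It only checks whether the
bytes decode as strict UTF-8 and never rewrites them, so an NFC name and
an NFD name both pass while keeping their different byte sequences. The
names is_valid_utf8, NFC_BYTES and NFD_BYTES are made up for this
example.

```python
import unicodedata


def is_valid_utf8(raw: bytes) -> bool:
    """Return True if raw decodes as strict UTF-8; the bytes are never altered."""
    try:
        raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return False
    return True


# The same visible name (e with acute accent, then ".txt") in two
# normalization forms. Both are valid UTF-8, but the bytes differ, which is
# why a UTF-8 to UTF-8 "conversion" that normalizes is not the identity.
NFC_BYTES = unicodedata.normalize("NFC", "e\u0301.txt").encode("utf-8")
NFD_BYTES = unicodedata.normalize("NFD", "\u00e9.txt").encode("utf-8")
```

Used purely as a test, this keeps the filename bytes exactly as the
server sent them.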

After finding out that the charset is unknown I want to hex-encode
the entire filename. On the other hand, if the appropriate thing
is to invoke iconv to convert from one charset to another, I want
to hex-encode only the failing bytes.

The difference arises because (a) if there is reason to expect that
conversion should be possible, for example because the user
specified the from-charset as GB18030, and it fails, then it often
fails only in a few isolated places where Microsoft extensions are used,
and it is more user-friendly to do the conversion where possible;
but (b) if nothing is known, then the character set can be a
multibyte one like SJIS where ASCII bytes occur as second halves
of symbols, and not escaping such ASCII bytes is confusing
and sometimes leads to strange problems.
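
The two escaping policies could be sketched roughly as follows, again in
Python rather than wget's C. Everything here is an assumption made for
illustration: the %XX escape syntax, the function names, and the
"hexescape" error-handler name are not taken from wget. Case (a) decodes
with the stated charset and escapes only the bytes the decoder rejects;
case (b) escapes every byte, ASCII included, unless the name happens to
be valid UTF-8.

```python
import codecs


def hex_escape_all(raw: bytes) -> str:
    """Case (b): charset unknown, so escape every byte, ASCII included."""
    return "".join(f"%{b:02X}" for b in raw)


def _hex_escape_handler(err):
    """Decode error handler: replace only the rejected bytes with %XX."""
    if not isinstance(err, UnicodeDecodeError):
        raise err
    bad = err.object[err.start:err.end]
    return "".join(f"%{b:02X}" for b in bad), err.end


# "hexescape" is a name chosen for this sketch, not a standard handler.
codecs.register_error("hexescape", _hex_escape_handler)


def convert_escaping_failures(raw: bytes, from_charset: str) -> str:
    """Case (a): charset stated, so convert and escape only failing bytes."""
    return raw.decode(from_charset, errors="hexescape")


def local_filename(raw: bytes, from_charset=None) -> str:
    """Pick a policy: stated charset, else valid UTF-8, else unknown."""
    if from_charset is not None:
        return convert_escaping_failures(raw, from_charset)
    try:
        return raw.decode("utf-8", errors="strict")
    except UnicodeDecodeError:
        return hex_escape_all(raw)
```

For example, the Shift_JIS bytes 0x83 0x5C form a single katakana
character whose second byte is the ASCII backslash. With no charset
known, local_filename(b"\x83\x5c.txt") escapes all six bytes, so no
stray backslash ends up in the local name.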

Andries
