
Re: [Bug-wget] patch: Stored file name coversion logic correction


From: Tim Ruehsen
Subject: Re: [Bug-wget] patch: Stored file name coversion logic correction
Date: Thu, 16 Feb 2017 12:06:13 +0100
User-agent: KMail/5.2.3 (Linux/4.9.0-1-amd64; KDE/5.28.0; x86_64; ; )

On Thursday, February 16, 2017 4:10:22 PM CET YX Hao wrote:
> My bad! I made a stupid mistake!
> Then, how can Tim's case pass the 'iconv' function? Maybe the
> 'from_encoding' in the 'convert_fname' function is the same as the
> 'to_encoding'. Did he download from a server that uses the same encoding?
> On 2017-02-16 at 14:07, "Eli Zaretskii" <address@hidden> wrote:
> > Date: Thu, 16 Feb 2017 12:42:23 +0800 (CST)
> > From: "YX Hao" <address@hidden>
> > 
> > I downloaded the 'mbox format' original, and found out the reason why you
> > can't reproduce the issue. The non-ASCII characters you use are encoded in
> > "iso-8859-1" in your email, and should be displayed correctly in your
> > environment. So, your encoding is compatible with 'UTF8', which is the
> > remote server's default encoding. That won't cause an iconv error :) Think
> > about 'UTF8' incompatible encoding environments ...
> 
> Maybe I misunderstand, but ISO-8859-1 (a.k.a. "Latin-1") is NOT
> compatible with UTF-8.  Trying to decode Latin-1 text as UTF-8 will
> get you errors from the conversion routines, because Latin-1 byte
> sequences are generally not valid UTF-8 sequences.

You might be right... I made up a test that reproduces the issue (I guess/hope).
The patch is attached for playing around. Here are the steps I took; they
depend on the locales installed on my system.

$ locale
LANG=en_US.UTF-8
... (everything set to en_US.UTF-8)

$ locale -a
C
C.UTF-8
address@hidden
address@hidden
de_DE.utf8
en_US.iso885915
en_US.utf8
POSIX
tr_TR.utf8

Convert a special character from UTF-8 to ISO-8859-15 and get its byte sequence:
$ echo -n ü|iconv -f utf-8 -t iso-8859-15|od -t x1
0000000 fc
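
As a cross-check that does not depend on the terminal's encoding, the same
byte values can be derived in a few lines of Python. This is only an
illustrative sketch, not part of the attached patch; the variable names are
made up for the example.

```python
# Encode the same character in both encodings and compare the raw bytes.
char = "\u00fc"  # 'ü', written as an escape so the source file encoding does not matter

iso_bytes = char.encode("iso-8859-15")   # single byte in ISO-8859-15
utf8_bytes = char.encode("utf-8")        # two-byte sequence in UTF-8

print(iso_bytes.hex())   # fc
print(utf8_bytes.hex())  # c3bc
```

The single byte 0xFC is what the test below uses as the directory prefix.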

Now I copied tests/Test-iri.px to Test-iri-P.px, amended it, and added it to
Makefile.am (don't forget to recreate the Makefile with ./config.status in the
main directory).
All I changed in the new test is
my $iso885915_path = "\xfc";
my $cmdline = $WgetTest::WGETPATH . " -d -P ${iso885915_path} --iri --trust-server-names --restrict-file-names=nocontrol -nH -r http://localhost:{{port}}/";

$ cd tests
$ LC_ALL=en_US.iso885915 make check TESTS=Test-iri-P

And voila, in the .log file:
Incomplete or invalid multibyte sequence encountered
Failed to convert file name 'ü/index.html' (UTF-8) -> '?' (ISO-8859-15)

My editor (kwrite) auto-detected iso-8859-15, so the 'ü' copied and pasted above
is in whatever encoding this email ends up having. In the log, however, it is
correctly iso-8859-15 encoded (0xFC).
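
One plausible reading of the log is that a name already containing the
ISO-8859-15 byte 0xFC gets treated as UTF-8 input. The Python sketch below
mimics that situation. It is an assumption-laden illustration, not wget's
code: `convert_name` is a hypothetical stand-in for the conversion step, and
Python's codecs are used in place of iconv.

```python
def convert_name(raw: bytes, from_enc: str, to_enc: str) -> bytes:
    """Hypothetical stand-in for a file name conversion step (not wget's code)."""
    return raw.decode(from_enc).encode(to_enc)


# 'ü/index.html' where the prefix is already ISO-8859-15 encoded (0xFC).
local_name = b"\xfc/index.html"

try:
    # Declaring the input as UTF-8 fails: 0xFC cannot start a UTF-8 sequence.
    convert_name(local_name, "utf-8", "iso-8859-15")
    outcome = "converted"
except UnicodeDecodeError as exc:
    outcome = f"failed: {exc.reason}"

print(outcome)

# For comparison, genuine UTF-8 input (0xC3 0xBC) converts cleanly to 0xFC.
converted = convert_name(b"\xc3\xbc/index.html", "utf-8", "iso-8859-15")
print(converted)
```

If this reading is right, the failure comes from the mismatch between the
declared source encoding and the actual bytes of the prefix, not from the
character itself being unrepresentable in ISO-8859-15.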

The above error occurs even before the first download (I guess when building 
the local filename). That means, we can reduce the test much further...

Regards, Tim

Attachment: 0001-Add-tests-Test-iri-P.log.patch
Description: Text Data

Attachment: signature.asc
Description: This is a digitally signed message part.

