Re: [Bug-wget] wget 1.12 utf-8 webpage with convert-links generate illeg

From: Ángel González
Subject: Re: [Bug-wget] wget 1.12 utf-8 webpage with convert-links generate illegail utf-8 sequence
Date: Sat, 09 Jun 2012 20:39:12 +0200
User-agent: Thunderbird

On 08/06/12 18:26, address@hidden wrote:
> Hi,
> I have a problem when using --convert-links (-k) on a utf-8 encoded web page.
> To reproduce:
> wget -k --restrict-file-names=nocontrol http://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%8B%E3%81%93%E3%81%86%E3%81%B8%E3%81%84
> (This is a Japanese wiki page.)
> The file name is UTF-8. To check the UTF-8 sequence:
> iconv -f utf-8 -t utf-8 [downloadedfile(replaced for non-utf-8 env)] > /dev/null
> iconv: illegal input sequence at position 77822
> (Opening the file with gedit also shows the corruption.)
> Without the -k option, the file is not broken. The corruption usually
> appears near the end of the file, typically as only one or two bytes of
> illegal UTF-8. Some of the data near the illegal bytes is also missing.
> The added illegal bytes are typically 0xe3 or 0xe383, but are not limited
> to those. Whether the problem occurs depends on the input file; around 20%
> of Japanese wiki pages show it.
> I have not yet tried wget 1.13, and I could not find any related
> information on the web. I looked at convert.c, but I am not familiar
> with the code.
I'm not seeing that error (wget 1.13.4).

I ran:
> wget http://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%8B%E3%81%93%E3%81%86%E3%81%B8%E3%81%84 -O Without-k
> wget -k http://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%8B%E3%81%93%E3%81%86%E3%81%B8%E3%81%84 -O With-k
The differences between the two files seem to be the expected ones.
(I found it is converting <a href="#cite_ref-0"> to <a
href="With-k#cite_ref-0">, which is unneeded, but that'd be a different
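
To illustrate why that fragment rewrite makes no difference to where the link
points, here is a minimal Python sketch. The base URL is a made-up placeholder
(only the file name With-k comes from the commands above), and this is not how
wget itself resolves links; it only uses the standard urljoin rules.

```python
from urllib.parse import urljoin

# Hypothetical location of the saved page; only the name "With-k" is from the thread.
base = "http://example.invalid/dir/With-k"

# The link as it appears in the downloaded page without -k.
original = urljoin(base, "#cite_ref-0")

# The link as rewritten by -k.
converted = urljoin(base, "With-k#cite_ref-0")

print(original)
print(converted)
print(original == converted)
```

Both forms resolve to the same target as long as the saved file keeps the name
With-k; they would diverge only if the file were renamed afterwards.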

Iconv conversion doesn't show any error either:
> iconv -f utf-8 -t utf-8 < With-k > /dev/null
> iconv -f utf-8 -t utf-8 < Without-k > /dev/null
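
If iconv does report an error on some page, it may help to see exactly which
bytes are affected. The following is a rough Python sketch of the same check,
not anything wget or iconv provides: the function names first_invalid_utf8 and
check_file are invented for illustration, and the offset it prints is a
zero-based byte offset, which may not match the convention iconv uses for its
"position".

```python
def first_invalid_utf8(data: bytes):
    """Return (offset, offending_bytes) for the first invalid UTF-8
    sequence in data, or None if data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start and err.end delimit the bytes the decoder rejected.
        return err.start, data[err.start:err.end]
    return None


def check_file(path, context=8):
    """Print a one-line report for path and return True if it is valid UTF-8."""
    with open(path, "rb") as fh:
        data = fh.read()
    result = first_invalid_utf8(data)
    if result is None:
        print(f"{path}: valid UTF-8")
        return True
    offset, bad = result
    # Show a few bytes on either side to help spot missing data.
    nearby = data[max(0, offset - context):offset + context]
    print(f"{path}: invalid byte(s) {bad.hex()} at offset {offset}; "
          f"nearby bytes: {nearby.hex()}")
    return False
```

Calling check_file("With-k") and check_file("Without-k") after the two
downloads would then give a per-file report; a stray 0xe3 like the one
described in the original message would show up in the hex output.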
