Re: [Bug-wget] wget 1.12 utf-8 webpage with convert-links generate illegal utf-8 sequence
Sat, 09 Jun 2012 20:39:12 +0200
On 08/06/12 18:26, address@hidden wrote:
> I have a problem when using --convert-links (-k) on a utf-8 encoded web page.
> To reproduce:
> wget -k --restrict-file-names=nocontrol
> (This is a Japanese wiki page.)
> The file name is utf-8. To check the utf-8 sequence:
> iconv -f utf-8 -t utf-8 [downloadedfile(replaced for non-utf-8 env)]
> iconv: illegal input sequence at position 77822
> (Opening the file with gedit also shows the corruption.)
> Without the -k option, the file is not broken. The corruption usually
> appears near the end of the file, typically as only one or two bytes of
> illegal utf-8. Near the illegal bytes, some of the data is missing. The
> added illegal bytes are typically 0xe3 or 0xe383, but not limited to
> those. Whether the problem occurs depends on the input file; around 20% of
> Japanese wiki pages show it.
> I have not yet tried wget 1.13, and I could not find any related
> information on the web. I looked at convert.c, but I am not familiar
> with the code.
I'm not seeing that error (wget 1.13.4).
> -O Without-k
> wget -k
> -O With-k
Comparing the two files, the differences seem to be the expected ones.
(I found it is converting <a href="#cite_ref-0"> to <a
href="With-k#cite_ref-0">, which is unneeded, but that'd be a different
issue.)
Iconv conversion doesn't show any error either:
> iconv -f utf-8 -t utf-8 < With-k > /dev/null
> iconv -f utf-8 -t utf-8 < Without-k > /dev/null
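
For anyone who wants to repeat the check on several downloaded files, something along these lines should work. It is only a sketch: check_utf8 is a helper name made up for this example, and it assumes that iconv exits with a non-zero status when it meets an illegal or incomplete sequence.

```shell
#!/bin/sh
# check_utf8 is a hypothetical helper, not part of wget or iconv.
# It converts the file from utf-8 to utf-8, discards the output, and
# uses only iconv's exit status to decide whether the input was valid.
check_utf8() {
    if [ ! -r "$1" ]; then
        echo "$1: cannot read file" >&2
        return 2
    fi
    if iconv -f utf-8 -t utf-8 < "$1" > /dev/null 2>&1; then
        echo "$1: valid utf-8"
    else
        echo "$1: invalid utf-8"
        return 1
    fi
}
```

With the two files from the commands above, that would be `check_utf8 With-k` and `check_utf8 Without-k`. If one of them fails, running iconv on it without discarding stderr prints the position of the first illegal sequence, as in the original report.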