From: Micah Cowan
Subject: Re: [Bug-wget] wget 1.12 utf-8 webpage with convert-links generate illegail utf-8 sequence
Date: Sat, 09 Jun 2012 12:03:33 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:12.0) Gecko/20120430 Thunderbird/12.0.1

On 06/09/2012 11:39 AM, Ángel González wrote:
> On 08/06/12 18:26, address@hidden wrote:
>> Hi,
>> I have a problem when using --convert-links (-k) on a utf-8 encoded web page.
>> How to reproduce is:
>> wget -k --restrict-file-names=nocontrol http://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%8B%E3%81%93%E3%81%86%E3%81%B8%E3%81%84
>> (This is a Japanese wiki page.)
>> The file name is UTF-8. To check the UTF-8 sequence:
>> iconv -f utf-8 -t utf-8 [downloadedfile(replaced for non-utf-8 env)] > /dev/null
>> iconv: illegal input sequence at position 77822
>> (Opening the file with gedit also shows the corruption.)

Given that both Ángel and I were unable to reproduce the problem (I
tried with the latest development version, and also 1.10.2 and 1.12), it
would seem additional information is needed.

What operating system are you running this on? And what is your version
of wget? Are your .wgetrc (or wget.ini) and /etc/wgetrc empty, and if
not, what are their contents?

Could you attach an example of the broken file contents? The full file
itself is perhaps a bit large to attach to a mailing list post (~85k?),
but you could use a pastebin, or otherwise put it up on a server, or
just post a snippet that illustrates exactly what sort of corruption is
taking place in your setup.
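
If it helps, here is one way to produce such a snippet. This is only a
minimal sketch in Python, not anything wget provides: the function name
find_invalid_utf8 and the 16-byte context window are my own choices for
illustration. It finds the first byte sequence that fails to decode as
UTF-8 and returns its byte offset along with a hex dump of the
surrounding bytes. The offset should be comparable to the position
iconv reported, though I have not checked that against your file.

```python
import sys


def find_invalid_utf8(data: bytes, context: int = 16):
    """Return (offset, hex_snippet) for the first invalid UTF-8
    sequence in data, or None if data decodes cleanly."""
    try:
        data.decode("utf-8")
    except UnicodeDecodeError as err:
        # err.start / err.end delimit the offending bytes.
        start = max(err.start - context, 0)
        end = min(err.end + context, len(data))
        return err.start, data[start:end].hex(" ")
    return None


# Usage (hypothetical script name): python3 check_utf8.py downloaded-file
if __name__ == "__main__" and len(sys.argv) > 1:
    with open(sys.argv[1], "rb") as fh:
        result = find_invalid_utf8(fh.read())
    if result is None:
        print("no invalid UTF-8 found")
    else:
        offset, snippet = result
        print(f"first invalid byte at offset {offset}")
        print(snippet)
```

Posting the reported offset and the hex line from that output would
already tell us a lot about what is being mangled.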

Good luck,

