
[Bug-wget] wget 1.12 utf-8 webpage with convert-links generate illegal


From: hitoshi
Subject: [Bug-wget] wget 1.12 utf-8 webpage with convert-links generate illegal utf-8 sequence
Date: Fri, 8 Jun 2012 18:26:37 +0200
User-agent: SquirrelMail/1.4.21

Hi,

I have a problem when using --convert-links (-k) on a UTF-8 encoded web page.

To reproduce:

wget -k --restrict-file-names=nocontrol \
  http://ja.wikipedia.org/wiki/%E3%81%A4%E3%81%8B%E3%81%93%E3%81%86%E3%81%B8%E3%81%84
(This is a Japanese wiki page.)

The downloaded file name is in UTF-8. To check the file for illegal UTF-8
sequences:

iconv -f utf-8 -t utf-8 [downloaded file (name replaced for non-UTF-8 environments)] > /dev/null
iconv: illegal input sequence at position 77822
(Opening the file with gedit also shows the corruption.)
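
For anyone who wants the same check without iconv, here is a minimal Python sketch. The function name first_invalid_utf8_offset is made up for illustration. It only reports the byte offset where strict UTF-8 decoding first fails, and that offset may not match the position iconv prints in every case.

```python
from typing import Optional


def first_invalid_utf8_offset(data: bytes) -> Optional[int]:
    """Return the byte offset of the first invalid UTF-8 sequence, or None."""
    try:
        data.decode("utf-8")  # strict decoding raises on the first bad byte
    except UnicodeDecodeError as err:
        return err.start  # offset of the first byte that could not be decoded
    return None


def check_file(path: str) -> Optional[int]:
    """Read a downloaded file as raw bytes and check it."""
    with open(path, "rb") as handle:
        return first_invalid_utf8_offset(handle.read())
```

Calling check_file() on the downloaded page should return None for a clean file and an integer offset for a corrupted one.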

Without the -k option, the file is not broken. The corruption usually
appears near the end of the file and typically consists of only one or two
bytes of illegal UTF-8. Some of the data near the illegal bytes is also
missing. The added illegal bytes are typically 0xe3 or 0xe383, but are not
limited to these. Whether the problem occurs depends on the input file;
around 20% of Japanese wiki pages show it.
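
The stray bytes mentioned above are consistent with a multi-byte character cut off partway: in UTF-8, 0xe3 is the lead byte of a three-byte sequence, which covers most kana. The Python sketch below shows only that encoding fact and does not diagnose wget itself. The helper names utf8_sequence_length and is_truncated_prefix are hypothetical.

```python
def utf8_sequence_length(lead: int) -> int:
    """Length of the UTF-8 sequence started by this lead byte (0 if not a lead)."""
    if lead < 0x80:
        return 1
    if 0xC2 <= lead <= 0xDF:
        return 2
    if 0xE0 <= lead <= 0xEF:
        return 3
    if 0xF0 <= lead <= 0xF4:
        return 4
    return 0  # continuation byte or a byte that never starts a sequence


def is_truncated_prefix(seq: bytes) -> bool:
    """True if seq is a lead byte plus continuation bytes, but too short."""
    if not seq:
        return False
    needed = utf8_sequence_length(seq[0])
    continuation_ok = all(0x80 <= b <= 0xBF for b in seq[1:])
    return 0 < len(seq) < needed and continuation_ok


# The katakana character U+30C4 is encoded as the three bytes e3 83 84,
# so both b"\xe3" and b"\xe3\x83" are incomplete prefixes of it.
full = "\u30c4".encode("utf-8")
print(full.hex(), is_truncated_prefix(full[:1]), is_truncated_prefix(full[:2]))
```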

I have not yet tried wget 1.13, and I could not find any related
information on the web. I looked at convert.c, but I am not familiar
with the code.

The missing data is critical for me. I am currently thinking of downloading
the files without the -k option and converting the links with my own
program. So far, this problem has not happened with English or German wiki
pages.
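
A rough Python sketch of what such a link-conversion program could look like follows. It is an assumption about one possible approach, not how wget's convert.c works. The function convert_links and the local_names mapping are hypothetical, and the regular expression handles only double-quoted href attributes and does not decode HTML entities. Working on decoded text rather than raw bytes means a multi-byte character is never split.

```python
import re
from urllib.parse import urljoin

# Matches href="..." with double quotes only (a deliberate simplification).
HREF_RE = re.compile(r'(href\s*=\s*")([^"]*)(")', re.IGNORECASE)


def convert_links(html_text: str, base_url: str, local_names: dict) -> str:
    """Rewrite href values: downloaded pages point to local files,
    everything else becomes an absolute URL."""
    def replace(match):
        absolute = urljoin(base_url, match.group(2))
        target = local_names.get(absolute, absolute)
        return match.group(1) + target + match.group(3)

    return HREF_RE.sub(replace, html_text)


def convert_file(path: str, base_url: str, local_names: dict) -> None:
    """Decode as UTF-8, rewrite links on text, and write UTF-8 back."""
    with open(path, "r", encoding="utf-8") as handle:
        text = handle.read()
    with open(path, "w", encoding="utf-8") as handle:
        handle.write(convert_links(text, base_url, local_names))
```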

Any hint is appreciated. Thank you!

---
address@hidden % wget --version
GNU Wget 1.12 built on linux-gnu.

+digest +ipv6 +nls +ntlm +opie +md5/openssl +https -gnutls +openssl
-iri

Wgetrc:
    /etc/wgetrc (system)
Locale: /usr/share/locale
Compile: gcc -DHAVE_CONFIG_H -DSYSTEM_WGETRC="/etc/wgetrc"
    -DLOCALEDIR="/usr/share/locale" -I. -I../lib -g -O2 -DNO_SSLv2
    -D_FILE_OFFSET_BITS=64 -O2 -g -Wall
Link: gcc -g -O2 -DNO_SSLv2 -D_FILE_OFFSET_BITS=64 -O2 -g -Wall
    -Wl,-Bsymbolic-functions /usr/lib/libssl.so /usr/lib/libcrypto.so
    -ldl -lrt ftp-opie.o openssl.o http-ntlm.o gen-md5.o
    ../lib/libgnu.a
---


Hitoshi Yamauchi



