

From: Andries E. Brouwer
Subject: Re: [Bug-wget] Tilde issue with recursive download when IRI is enabled and a page uses Shift JIS
Date: Tue, 7 Feb 2017 00:29:23 +0100
User-agent: Mutt/1.5.24 (2015-08-30)

On Mon, Feb 06, 2017 at 10:55:32PM +0100, Tim Rühsen wrote:
> On Montag, 6. Februar 2017 05:02:57 CET William Prescott wrote:
> > Hello,
> > 
> > I'm encountering a problem when recursively downloading from a website when
> > the URL contains a tilde and the page encoding claims to be Shift JIS.
> > 
> > I've tried both Wget 1.17.1 (from Ubuntu 16.04) and 1.19 (from source,
> > with Libidn2 0.16).
> > I believe my local character encoding is UTF-8.
> > 
> > The first page will download okay, but then most pages after it will get the
> > tilde converted to "%E2%80%BE" ("‾"), which, as one would expect, doesn't
> > work.
> 
> Hi William,
> 
> reproducible with:
> 
> $ echo '~' | iconv -f SHIFT-JIS -t utf-8
> ‾
> 
> $ echo -n '~' | iconv -f SHIFT-JIS -t utf-8 | od -t x1
> 0000000 e2 80 be
> 
> So this seems not to be a Wget issue, but maybe a general character conversion
> issue. Not sure what Wget could do...
> 
> Regards, Tim


Shift JIS is not a single well-defined character set.
There are x-sjis-unicode, x-sjis-cp932, x-sjis-jisx0221 and x-sjis-jdk117,
all of which are called "shift-jis" and which differ subtly.
See also https://www.w3.org/TR/japanese-xml/#sjis .


SJIS and CP932 (the "Microsoft version of SJIS") are almost identical,
and CP932 does contain a tilde.
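
This can be checked locally with something like the following. It is only
a sketch: tilde_as_utf8_hex is a made-up helper name, and the encoding names
SHIFT-JIS and CP932 are assumed to be spelled the way the local iconv
expects them.

```shell
# Hypothetical helper: show, as hex, the UTF-8 bytes iconv produces for "~"
# when the input is declared to be in the encoding named by $1.
tilde_as_utf8_hex() {
    printf '~' | iconv -f "$1" -t UTF-8 | od -An -t x1 | tr -d ' \n'
    echo
}

tilde_as_utf8_hex SHIFT-JIS   # Tim's run above produced e2 80 be
tilde_as_utf8_hex CP932       # 7e here means the tilde survives as ASCII
```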

Java did (does?) treat SJIS bytes 0x5c and 0x7e as ASCII 0x5c (backslash)
and 0x7e (tilde), rather than as yen sign and overline.
The docs say "This is in keeping with standard industry practice within Japan."

Could wget use a fallback? First use the given bytes converted from SJIS.
If that fails, use the same bytes converted from CP932 (if the result differs).
If that fails too, use the bytes without any conversion?
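
Roughly, the fallback order could be sketched as below. This assumes that
"fails" means the request for the converted URL fails, so the helper only
lists the candidates in order and leaves the fetching to the caller.
url_candidates is a hypothetical name, not anything in wget, and iconv
merely stands in for whatever conversion wget performs.

```shell
# Hypothetical helper: print the candidate forms of a link, one per line,
# in the proposed order: SJIS conversion, CP932 conversion, raw bytes.
# Duplicates are skipped, and so is any conversion that iconv rejects.
url_candidates() {
    raw=$1
    sjis=$(printf '%s' "$raw" | iconv -f SHIFT-JIS -t UTF-8 2>/dev/null) || sjis=
    cp932=$(printf '%s' "$raw" | iconv -f CP932 -t UTF-8 2>/dev/null) || cp932=

    if [ -n "$sjis" ]; then
        printf '%s\n' "$sjis"
    fi
    if [ -n "$cp932" ] && [ "$cp932" != "$sjis" ]; then
        printf '%s\n' "$cp932"
    fi
    if [ "$raw" != "$sjis" ] && [ "$raw" != "$cp932" ]; then
        printf '%s\n' "$raw"
    fi
    return 0
}
```

A caller would try each printed line in turn and stop at the first one the
server accepts, e.g. url_candidates '/~user/' for the link in question.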


It looks like http://seesaawiki.jp/w/kou1okada/d/wget%20-%20troubleshooting
describes the same problem. Three successful suggestions are given there
(for wget 1.13.4): (i) pass one of ASCII, EUC-JP or UTF-8 with the
--remote-encoding option, (ii) pass the --no-iri option, (iii) export LANG=C.
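
Spelled out as commands, the three suggestions might look like this. The
wrapper names are made up for illustration; only the wget options and
LANG=C come from that page, and I have not checked them against versions
other than the one it mentions.

```shell
# Hypothetical wrappers, one per suggestion. $1 (or $2) is the start URL.
wget_remote_encoding() { wget -r --remote-encoding="$1" "$2"; }  # (i)  $1: ASCII, EUC-JP or UTF-8
wget_no_iri()          { wget -r --no-iri "$1"; }                # (ii)
wget_c_locale()        { LANG=C wget -r "$1"; }                  # (iii)
```

For example: wget_no_iri 'http://example.com/~user/' (a placeholder URL).
Note that LANG=C has no effect on its own if LC_ALL or LC_CTYPE is also set,
since those take precedence over LANG.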

Andries


