From: William Prescott
Subject: Re: [Bug-wget] Tilde issue with recursive download when IRI is enabled and a page uses Shift JIS
Date: Mon, 6 Feb 2017 19:02:16 -0500

Thanks for the responses.

Indeed, that seems to be the case: Shift JIS replaces ASCII \ and ~
with ¥ and ‾, respectively
(with exceptions as per Andries' message).
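To make the difference concrete, here is a minimal Python sketch of the
single-byte range as a strict decoder would see it. The override table and
the function name decode_jisx0201_roman are my own illustration, not taken
from Wget or iconv, and the sketch only covers 7-bit input:

```python
# Bytes where the single-byte (JIS X 0201 Roman) half of Shift JIS
# differs from ASCII under a strict interpretation.
JISX0201_ROMAN_OVERRIDES = {
    0x5C: "\u00a5",  # YEN SIGN instead of backslash
    0x7E: "\u203e",  # OVERLINE instead of tilde
}


def decode_jisx0201_roman(data: bytes) -> str:
    """Decode 7-bit bytes the way a strict Shift JIS decoder would."""
    chars = []
    for byte in data:
        if byte > 0x7F:
            raise ValueError("this sketch only handles the 7-bit range")
        chars.append(JISX0201_ROMAN_OVERRIDES.get(byte, chr(byte)))
    return "".join(chars)


# The tilde in a path silently becomes an overline.
print(decode_jisx0201_roman(b"/~user/"))
```

Every other 7-bit byte passes through unchanged, which is why only URLs
containing a tilde (or a backslash) are affected.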

In addition, RFC 3987 (Internationalized Resource Identifiers (IRIs))
section 6.3 states that:
"In cases where the document as a whole has a
   native character encoding, IRIs MUST also be encoded in this
   character encoding and converted accordingly by a parser or
   interpreter."
This suggests that the behavior observed in Wget is correct and that
the document is at fault.
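
As a rough illustration of that conversion step (not Wget's actual code),
the sketch below percent-encodes an already decoded link as UTF-8. The
helper name iri_path_to_uri and the example paths are assumptions of mine;
urllib.parse.quote is from the Python standard library:

```python
from urllib.parse import quote


def iri_path_to_uri(path: str) -> str:
    """Percent-encode non-ASCII characters as UTF-8; keep "/" and "~" literal."""
    return quote(path, safe="/~", encoding="utf-8")


# A link decoded with an ASCII-compatible table keeps its tilde.
print(iri_path_to_uri("/~user/index.html"))

# The same link decoded strictly as Shift JIS contains U+203E (overline),
# which is then percent-encoded as three UTF-8 bytes.
print(iri_path_to_uri("/\u203euser/index.html"))
```

The second call yields "/%E2%80%BEuser/index.html", which matches the
"%E2%80%BE" seen in the failing requests.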

I would also like to note that, even when the document's links don't contain
a tilde, Wget will still fail to fetch the pages as long as there is a tilde in
the URL that Wget was called with.

Best regards,
William Prescott

On Mon, Feb 6, 2017 at 6:29 PM, Andries E. Brouwer
<address@hidden> wrote:
> On Mon, Feb 06, 2017 at 10:55:32PM +0100, Tim Rühsen wrote:
>> On Montag, 6. Februar 2017 05:02:57 CET William Prescott wrote:
>> > Hello,
>> >
>> > I'm encountering a problem when recursively downloading from a website when
>> > the URL contains a tilde and the page encoding claims to be Shift JIS.
>> >
>> > I've tried both Wget 1.17.1 (from Ubuntu 16.04) and 1.19 (from source,
>> > with Libidn2 0.16).
>> > I believe my local character encoding is UTF-8.
>> >
>> > The first page will download okay, but then most pages after it will get
>> > the tilde converted to "%E2%80%BE" ("‾"), which, as one would expect,
>> > doesn't work.
>>
>> Hi William,
>>
>> reproducible by:
>>
>> $ echo '~'|iconv -f SHIFT-JIS -t utf-8
>> ‾
>>
>> $ echo -n '~'|iconv -f SHIFT-JIS -t utf-8|od -t x1
>> 0000000 e2 80 be
>>
>> So this seems not to be a Wget issue, but maybe a general character
>> conversion issue. Not sure what Wget could do...
>>
>> Regards, Tim
>
>
> Shift JIS is not a single well-defined character set.
> There are x-sjis-unicode, x-sjis-cp932, x-sjis-jisx0221, x-sjis-jdk117
> that all are called "shift-jis", and are subtly different.
> See also https://www.w3.org/TR/japanese-xml/#sjis .
>
>
> SJIS and CP932 (the "Microsoft version of SJIS") are almost identical,
> and CP932 does contain a tilde.
>
> Java did (does?) treat SJIS 5c and 7e as ASCII 5c and 7e.
> The docs say "This is in keeping with standard industry practice within 
> Japan."
>
> Can wget use a fallback? Use the given bytes converted from SJIS.
> When that fails use these bytes converted from CP932 (if different).
> When that fails use these bytes without any conversion?
>
>
> It looks like http://seesaawiki.jp/w/kou1okada/d/wget%20-%20troubleshooting
> describes the same problem. There, three successful suggestions are given
> (for wget 1.13.4): (i) Give one of ASCII, EUC-JP or UTF-8 with the
> --remote-encoding option, (ii) Give the --no-iri option, (iii) Export LANG=C.
>
> Andries
>
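
The fallback Andries suggests above could be sketched roughly as follows.
This is only an illustration under my own assumptions, not a patch for
Wget: the names decode_strict_sjis, candidate_paths and fetch_with_fallback
are hypothetical, "fetch" stands in for whatever performs the request and
reports success, and the strict decoder is emulated by remapping the two
affected characters after decoding.

```python
from urllib.parse import quote


def decode_strict_sjis(raw: bytes) -> str:
    """Emulate a strict Shift JIS decoder (0x5C -> yen sign, 0x7E -> overline)."""
    text = raw.decode("shift_jis")
    return text.translate({0x5C: 0x00A5, 0x7E: 0x203E})


def decode_cp932(raw: bytes) -> str:
    """CP932 keeps 0x5C and 0x7E as ASCII backslash and tilde."""
    return raw.decode("cp932")


def candidate_paths(raw: bytes):
    """Yield distinct percent-encoded paths in the suggested order."""
    seen = set()
    for decode in (decode_strict_sjis, decode_cp932):
        try:
            path = quote(decode(raw), safe="/~")
        except UnicodeDecodeError:
            continue  # these bytes are not valid in this encoding
        if path not in seen:
            seen.add(path)
            yield path
    # Last resort: the original bytes without any conversion.
    path = quote(raw, safe="/~")
    if path not in seen:
        yield path


def fetch_with_fallback(raw: bytes, fetch):
    """Return the first candidate for which fetch(path) is truthy, else None."""
    for path in candidate_paths(raw):
        if fetch(path):
            return path
    return None


# Stand-in for a server that only knows the ASCII-tilde path.
print(fetch_with_fallback(b"/~user/", lambda p: p == "/~user/"))
```

The obvious cost is one or two extra requests per link in the failing
case, so this would presumably only make sense after the first attempt
has actually failed.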


