bug-wget

Re: [Bug-wget] Tilde issue with recursive download when IRI is enabled a


From: Tim Rühsen
Subject: Re: [Bug-wget] Tilde issue with recursive download when IRI is enabled and a page uses Shift JIS
Date: Tue, 07 Feb 2017 15:15:59 +0100
User-agent: KMail/5.2.3 (Linux/4.9.0-1-amd64; KDE/5.28.0; x86_64; ; )

On Montag, 6. Februar 2017 19:02:16 CET William Prescott wrote:
> Thanks for the responses.
> 
> Indeed, that seems to be the case: Shift JIS replaces ASCII \ and ~
> with ¥ and ‾, respectively
> (with exceptions as per Andries' message).
> 
> In addition, RFC 3987 (Internationalized Resource Identifiers (IRIs))
> section 6.3 states that:
> "In cases where the document as a whole has a
>    native character encoding, IRIs MUST also be encoded in this
>    character encoding and converted accordingly by a parser or
>    interpreter."
> This would make it seem that the observed behavior in Wget is correct and
> that the document is faulty.
> 
> I would also like to note that, even when the document's links don't
> contain a tilde, Wget will still fail to fetch the pages as long as there
> is a tilde in the URL that Wget was called with.

Hi William,

Your locale is UTF-8, so copying and pasting a URL from the original document
does not perform the Shift JIS to UTF-8 conversion. If your editor (or text
viewer) is locale/charset aware (e.g. here on KDE I use kate and can manually
tell it that the charset encoding of the viewed document is 'sjis'), set it to
the right encoding and then try the copy&paste again.

Another way would be to convert your string from Shift JIS to UTF-8, as I did
in my example:

$ wget `echo 'http://domain.jp/~withtilde'|iconv -f SHIFT-JIS -t utf-8`

Or convert your whole document to UTF-8 the same way:
$ iconv -f SHIFT-JIS -t utf-8 shiftjis_text.html >utf8_text.html

Now you should be able to copy&paste URLs from that document.
Ah yes, that only works on Unix/Linux/BSD systems.
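
To check what such a conversion does to a tilde before handing the result to
wget, a small sketch like the one below may help. It assumes an iconv that
accepts both the SHIFT-JIS and CP932 names (glibc's does); 'tilde_bytes' is a
made-up helper name, not part of wget or iconv.

```shell
#!/bin/sh
# Print, as hex, the UTF-8 bytes iconv produces for the single byte 0x7e ('~')
# when that byte is interpreted in the source charset given as $1.
tilde_bytes() {
    printf '~' | iconv -f "$1" -t UTF-8 | od -An -t x1 | tr -d ' \n'
}

# The thread reports e2 80 be (U+203E, overline) for SHIFT-JIS.
printf 'SHIFT-JIS: %s\n' "$(tilde_bytes SHIFT-JIS)"
# CP932 is said to contain a real tilde; compare the two results.
printf 'CP932:     %s\n' "$(tilde_bytes CP932)"
```

If the two lines differ, the charset name given to iconv decides whether the
tilde in a URL survives the conversion.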

Regards, Tim

> On Mon, Feb 6, 2017 at 6:29 PM, Andries E. Brouwer <address@hidden> wrote:
> > On Mon, Feb 06, 2017 at 10:55:32PM +0100, Tim Rühsen wrote:
> >> On Montag, 6. Februar 2017 05:02:57 CET William Prescott wrote:
> >> > Hello,
> >> > 
> >> > I'm encountering a problem when recursively downloading from a website
> >> > when the URL contains a tilde and the page encoding claims to be
> >> > Shift JIS.
> >> > 
> >> > I've tried both Wget 1.17.1 (from Ubuntu 16.04) and 1.19 (from source,
> >> > with Libidn2 0.16).
> >> > I believe my local character encoding is UTF-8.
> >> > 
> >> > The first page will download okay, but then most pages after it will
> >> > get the tilde converted to "%E2%80%BE" ("‾"), which, as one would
> >> > expect, doesn't work.
> >> 
> >> Hi William,
> >> 
> >> reproducible with:
> >> 
> >> $ echo '~'|iconv -f SHIFT-JIS -t utf-8
> >> ‾
> >> 
> >> $ echo -n '~'|iconv -f SHIFT-JIS -t utf-8|od -t x1
> >> 0000000 e2 80 be
> >> 
> >> So this seems not to be a Wget issue, but maybe a general character
> >> conversion issue. Not sure what Wget could do...
> >> 
> >> Regards, Tim
> > 
> > Shift JIS is not a single well-defined character set.
> > There are x-sjis-unicode, x-sjis-cp932, x-sjis-jisx0221, x-sjis-jdk117
> > that all are called "shift-jis", and are subtly different.
> > See also https://www.w3.org/TR/japanese-xml/#sjis .
> > 
> > 
> > SJIS and CP932 (the "Microsoft version of SJIS") are almost identical,
> > and CP932 does contain a tilde.
> > 
> > Java did (does?) treat SJIS 5c and 7e as ASCII 5c and 7e.
> > The docs say "This is in keeping with standard industry practice within
> > Japan."
> > 
> > Can wget use a fallback? Use the given bytes converted from SJIS.
> > When that fails use these bytes converted from CP932 (if different).
> > When that fails use these bytes without any conversion?
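
A rough sketch of this fallback order, outside of Wget itself: the
hypothetical helper 'url_candidates' below prints the spellings to try, in the
order proposed above. It assumes iconv accepts the names SHIFT-JIS and CP932,
and it is not a description of how Wget currently behaves.

```shell
#!/bin/sh
# Print candidate spellings of a link taken from a page declared as Shift JIS,
# one per line: decoded as SHIFT-JIS, then decoded as CP932 (if different),
# then the raw bytes (if different from both). Failed conversions are skipped.
url_candidates() {
    raw=$1
    sjis=$(printf '%s' "$raw" | iconv -f SHIFT-JIS -t UTF-8 2>/dev/null) || sjis=
    cp932=$(printf '%s' "$raw" | iconv -f CP932 -t UTF-8 2>/dev/null) || cp932=

    if [ -n "$sjis" ]; then
        printf '%s\n' "$sjis"
    fi
    if [ -n "$cp932" ] && [ "$cp932" != "$sjis" ]; then
        printf '%s\n' "$cp932"
    fi
    if [ "$raw" != "$sjis" ] && [ "$raw" != "$cp932" ]; then
        printf '%s\n' "$raw"
    fi
}

url_candidates 'http://domain.jp/~withtilde'

# Hypothetical use: try each candidate until one download succeeds.
#   url_candidates "$link" | while IFS= read -r u; do wget "$u" && break; done
```

The function only lists the candidates; deciding that a fetch "failed" and
moving on to the next spelling is left to the caller.
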
> > 
> > 
> > It looks like
> > http://seesaawiki.jp/w/kou1okada/d/wget%20-%20troubleshooting
> > describes the same problem. There, three successful suggestions are given
> > (for wget 1.13.4): (i) Give one of ASCII, EUC-JP or UTF-8 with the
> > --remote-encoding option, (ii) Give the --no-iri option, (iii) Export
> > LANG=C.
> > 
> > Andries
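
For reference, the three suggestions from that page correspond to invocations
along these lines. This is only a sketch: 'wget_workaround' and the WGET
override are hypothetical conveniences, and UTF-8 is just one of the encodings
named above. --remote-encoding, --no-iri and -r are existing Wget options.

```shell
#!/bin/sh
# Run a recursive wget with one of the three workarounds listed above.
# $1 selects the workaround, $2 is the start URL.
# Setting WGET (e.g. WGET=echo) replaces the command, which allows a preview.
wget_workaround() {
    mode=$1
    url=$2
    case $mode in
        remote-encoding)
            # (i) override the encoding the page claims to use
            ${WGET:-wget} --remote-encoding=UTF-8 -r "$url" ;;
        no-iri)
            # (ii) switch IRI support off entirely
            ${WGET:-wget} --no-iri -r "$url" ;;
        c-locale)
            # (iii) run with LANG=C, set here for this one command only
            LANG=C ${WGET:-wget} -r "$url" ;;
        *)
            echo "usage: wget_workaround remote-encoding|no-iri|c-locale URL" >&2
            return 2 ;;
    esac
}

# Preview without touching the network:
#   WGET=echo wget_workaround no-iri 'http://domain.jp/~withtilde'
```

Which of the three is appropriate depends on the site; the page cited above
reports them for wget 1.13.4.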
