From: William Prescott
Subject: [Bug-wget] Tilde issue with recursive download when IRI is enabled and a page uses Shift JIS
Date: Mon, 6 Feb 2017 02:25:53 -0500

Hello,

I'm encountering a problem when recursively downloading from a website whose
URLs contain a tilde and whose pages declare Shift JIS as their encoding.

I've tried both Wget 1.17.1 (from Ubuntu 16.04) and 1.19 (from source,
with Libidn2 0.16).
I believe my local character encoding is UTF-8.

The first page downloads fine, but in most of the requests that follow, the
tilde is converted to "%E2%80%BE" ("‾", U+203E OVERLINE), which, as one would
expect, doesn't work.

Disabling IRI with --no-iri solves the issue.
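
The escape sequence quoted above can be checked by hand. The following is a
minimal sketch, not anything taken from Wget: percent_encode_bytes is a
hypothetical helper that percent-encodes every byte it reads, so the UTF-8
bytes of U+203E can be compared with the single byte of an ASCII tilde.

```shell
#!/bin/sh
# Hypothetical helper (not part of Wget): percent-encode every byte on stdin.
percent_encode_bytes() {
    # Dump stdin as hex bytes and strip od's whitespace.
    hex=$(od -An -v -tx1 | tr -d ' \t\n')
    # Upper-case the hex digits and put a '%' in front of each byte.
    printf '%s\n' "$hex" | tr 'a-f' 'A-F' | sed 's/../%&/g'
}

# U+203E OVERLINE is the bytes E2 80 BE in UTF-8 (given in octal here).
printf '\342\200\276' | percent_encode_bytes   # prints %E2%80%BE

# An ASCII tilde is the single byte 7E.
printf '~' | percent_encode_bytes              # prints %7E
```

On a system with iconv, running
printf '~' | iconv -f SHIFT_JIS -t UTF-8 | percent_encode_bytes
shows how that particular converter treats byte 0x7E; some Shift JIS tables map
it to the overline rather than to the tilde. Whether that is what happens
inside Wget is only a guess and is not confirmed by this report.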

----------------------------------------
A simple way to reproduce the issue is to create a web-accessible directory that
looks like this:
*testPath
*--index.html
*--~tildeFolder
*----index.html
*----bar.html

testPath/index.html contains:
<meta http-equiv="Content-Type"
content="text/html;charset=Shift_JIS"><a href="~tildeFolder/">Foo</a>

testPath/~tildeFolder/index.html contains:
<meta http-equiv="Content-Type"
content="text/html;charset=Shift_JIS"><a href="bar.html">Bar</a>
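
The layout above could be created with a short script along these lines. This
is a sketch under assumptions: DOCROOT is a hypothetical variable standing for
the web server's document root (it defaults to the current directory), and the
content of bar.html is a placeholder, since the report does not say what that
file contains.

```shell
#!/bin/sh
set -e

# Assumption: DOCROOT is the directory your web server serves from.
DOCROOT="${DOCROOT:-.}"

# The tilde is literal here; it is not at the start of a word, so the
# shell does not expand it.
mkdir -p "$DOCROOT/testPath/~tildeFolder"

# Top-level page: declares Shift_JIS and links into the tilde folder.
cat > "$DOCROOT/testPath/index.html" <<'EOF'
<meta http-equiv="Content-Type"
content="text/html;charset=Shift_JIS"><a href="~tildeFolder/">Foo</a>
EOF

# Page inside the tilde folder: declares Shift_JIS and links to bar.html.
cat > "$DOCROOT/testPath/~tildeFolder/index.html" <<'EOF'
<meta http-equiv="Content-Type"
content="text/html;charset=Shift_JIS"><a href="bar.html">Bar</a>
EOF

# Placeholder content (assumption): the report only says the file exists.
printf 'bar\n' > "$DOCROOT/testPath/~tildeFolder/bar.html"
```

With the directory served by any web server, the wget command in the example
output below can be run against it.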

----------------------------------------
EXAMPLE OUTPUT (note that bar.html is never retrieved):
$ wget -r -np 'http://127.0.0.1/testPath/'
--2017-02-06 02:09:49--  http://127.0.0.1/testPath/
Connecting to 127.0.0.1:80... connected.
HTTP request sent, awaiting response... 200 OK
Length: 102 [text/html]
Saving to: ‘127.0.0.1/testPath/index.html’

127.0.0.1/testPath/index.html

2017-02-06 02:09:49 (1.33 MB/s) - ‘127.0.0.1/testPath/index.html’ saved [102/102]

Loading robots.txt; please ignore errors.
--2017-02-06 02:09:49--  http://127.0.0.1/robots.txt
Reusing existing connection to 127.0.0.1:80.
HTTP request sent, awaiting response... 404 Not Found
2017-02-06 02:09:49 ERROR 404: Not Found.

--2017-02-06 02:09:49--  http://127.0.0.1/testPath/%E2%80%BEtildeFolder/
Reusing existing connection to 127.0.0.1:80.
HTTP request sent, awaiting response... 404 Not Found
2017-02-06 02:09:49 ERROR 404: Not Found.

--2017-02-06 02:09:49--  http://127.0.0.1/testPath/~tildeFolder/
Reusing existing connection to 127.0.0.1:80.
HTTP request sent, awaiting response... 200 OK
Length: 110 [text/html]
Saving to: ‘127.0.0.1/testPath/‾tildeFolder/index.html’

127.0.0.1/testPath/‾tildeFolder/index.html

2017-02-06 02:09:49 (8.68 MB/s) - ‘127.0.0.1/testPath/‾tildeFolder/index.html’ saved [110/110]

FINISHED --2017-02-06 02:09:49--

----------------------------------------

Best regards,
William Prescott
