[Bug-wget] escaped URLs and recursive retrieval


From: Bram Vandoren
Subject: [Bug-wget] escaped URLs and recursive retrieval
Date: Mon, 21 Nov 2011 16:20:29 +0100

Hi,
I encountered a bug in wget that occurs with recursive retrieval when a page contains two (or more) links such as:
<a href="http://example.com/~user/blah"> and
<a href="http://example.com/%7Euser/blah">

Both links point to the same page; only the encoding differs (%7E is the escaped form of '~'). wget doesn't recognise them as the same page, so it downloads the page 'blah' twice, and the second download overwrites the first file. In addition, if you specify the link conversion option '-k', only one of the two links is converted.

I had a quick look at the source code. It could be solved by changing url_parse in url.c to call url_unescape before parsing the URL, so that both links yield the same parsed URL. I am not sure this is a good way to solve it. The conversion should probably be similar to the one that's done to determine the file name of the URL.
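
To illustrate the idea (this is not wget code), here is a minimal sketch in Python. The names normalize_escapes and should_download are made up for the example. It assumes it is enough to unescape only the characters RFC 3986 calls unreserved (letters, digits, '-', '.', '_', '~'), since unescaping everything would change the meaning of escapes such as %2F:

```python
import re
import string

# Characters RFC 3986 calls "unreserved": escaping them never changes
# what a URL means, so their escaped and literal forms are equivalent.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_escapes(url):
    """Hypothetical helper: decode escapes of unreserved characters and
    uppercase the hex digits of every other escape."""
    def fix(match):
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char                      # e.g. %7E becomes ~
        return "%" + match.group(1).upper()  # e.g. %2f stays escaped as %2F
    return _ESCAPE.sub(fix, url)


# Hypothetical stand-in for the set of URLs a recursive crawl has already
# visited, keyed on the normalized form rather than the raw link text.
seen = set()


def should_download(url):
    key = normalize_escapes(url)
    if key in seen:
        return False
    seen.add(key)
    return True
```

With this, should_download accepts the first of the two links above and rejects the second, because both normalize to the same key.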

Kind regards,
Bram.
