
Re: [Bug-wget] escaped URLs and recursive retrieval


From: Paul Wratt
Subject: Re: [Bug-wget] escaped URLs and recursive retrieval
Date: Wed, 23 Nov 2011 05:56:12 +1300

I have also had this problem for at least six months (I can't say since
which version).

For me it seems to be an issue with Japanese URLs containing "~": the
"%7E" also gets re-encoded.

It is almost as if wget is using the "HTTP-ENCODING" to re-encode URLs
that it has already encoded ("%7E").
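
To illustrate the kind of double encoding I mean, here is a minimal sketch. It is not wget's code: escape_path and its keep_escapes flag are hypothetical names, and I am assuming an encoder that escapes every byte outside a small safe set, "%" and "~" included. Run twice, such an encoder turns "%7E" into "%257E"; leaving well-formed "%XX" sequences alone avoids that.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical encoder, for illustration only (not taken from wget).
   Percent-encodes every byte of src except ASCII letters, digits and
   "-._/".  If keep_escapes is non-zero, a '%' that already starts a
   well-formed "%XX" sequence is copied through unchanged.
   Output is truncated if dst is too small; dst is always terminated. */
static void escape_path(const char *src, char *dst, size_t dst_size,
                        int keep_escapes)
{
    static const char hex[] = "0123456789ABCDEF";
    size_t n = 0;
    size_t i;

    assert(dst_size > 0);
    for (i = 0; src[i] != '\0'; i++) {
        unsigned char c = (unsigned char)src[i];
        int safe = isalnum(c) || strchr("-._/", c) != NULL;

        /* Optionally treat the '%' of an existing escape as safe. */
        if (!safe && keep_escapes && c == '%'
            && isxdigit((unsigned char)src[i + 1])
            && isxdigit((unsigned char)src[i + 2]))
            safe = 1;

        if (n + (safe ? 1u : 3u) >= dst_size)
            break;                      /* no room left: truncate */

        if (safe) {
            dst[n++] = (char)c;
        } else {
            dst[n++] = '%';
            dst[n++] = hex[c >> 4];
            dst[n++] = hex[c & 0x0F];
        }
    }
    dst[n] = '\0';
}
```

With keep_escapes set to 0, "/~user/blah" becomes "/%7Euser/blah", and feeding that result back in gives "/%257Euser/blah". With keep_escapes set to 1, the second pass leaves "/%7Euser/blah" as it is.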

Paul

On Tue, Nov 22, 2011 at 4:20 AM, Bram Vandoren
<address@hidden> wrote:
> Hi,
> I encountered a bug in wget that occurs with recursive retrieval: if a page
> contains 2 (or more) links:
> <a href="http://example.com/~user/blah"> and
> <a href="http://example.com/%7Euser/blah">
>
> Both links point to the same page but the encoding is different. wget
> doesn't recognise this as the same page and downloads the page 'blah' twice.
> It also overwrites the first downloaded file.
> Also if you specify the conversion option '-k', it only converts one of the
> two links.
>
> I had a quick look at the source code. It can be solved by changing
> url_parse in url.c.  Call url_unescape before parsing the url. This way you
> get the same parsed url for both links. I am not sure if this is a good
> way to solve it. The conversion should probably be similar to the conversion
> that's done to determine the file name of the URL.
>
> Kind regards,
> Bram.
>
>
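
As a rough sketch of the idea of bringing both spellings to one form before comparing them: the function below is not the url_unescape call Bram suggests and is not wget code. It is a narrower, swapped-in variant that decodes only escapes of RFC 3986 "unreserved" characters (letters, digits, "-", ".", "_", "~"), so that an escape such as "%2F" keeps its meaning. The name normalize_escapes and the helpers are made up for illustration.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>
#include <string.h>

/* Returns the value of a hex digit, or -1 if c is not one. */
static int hex_value(unsigned char c)
{
    if (c >= '0' && c <= '9')
        return c - '0';
    c = (unsigned char)tolower(c);
    if (c >= 'a' && c <= 'f')
        return c - 'a' + 10;
    return -1;
}

/* RFC 3986 unreserved characters: ALPHA / DIGIT / "-" / "." / "_" / "~". */
static int is_unreserved(unsigned char c)
{
    return isalnum(c) || (c != '\0' && strchr("-._~", c) != NULL);
}

/* Hypothetical helper (not wget code).  Rewrites url in place:
   "%XX" escapes of unreserved characters are decoded, all other
   well-formed escapes are kept with upper-case hex digits, and
   anything else is copied unchanged.  The result is never longer
   than the input, so no extra buffer is needed. */
static void normalize_escapes(char *url)
{
    const char *in = url;
    char *out = url;

    while (*in != '\0') {
        int hi = -1, lo = -1;

        if (in[0] == '%') {
            hi = hex_value((unsigned char)in[1]);
            if (hi >= 0)
                lo = hex_value((unsigned char)in[2]);
        }

        if (hi >= 0 && lo >= 0) {
            unsigned char decoded = (unsigned char)(hi * 16 + lo);
            if (is_unreserved(decoded)) {
                *out++ = (char)decoded;          /* e.g. "%7E" -> "~" */
            } else {
                *out++ = '%';                    /* e.g. "%2f" -> "%2F" */
                *out++ = (char)toupper((unsigned char)in[1]);
                *out++ = (char)toupper((unsigned char)in[2]);
            }
            in += 3;
        } else {
            *out++ = *in++;                      /* ordinary character */
        }
    }
    *out = '\0';
}
```

Applied to both links from the report, this yields the same string, "http://example.com/~user/blah", so a simple strcmp would treat them as one page. Whether wget's own duplicate detection could use such a step is an open question here.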
