Re: [Bug-wget] escaped URLs and recursive retrieval

From:	Paul Wratt
Subject:	Re: [Bug-wget] escaped URLs and recursive retrieval
Date:	Wed, 23 Nov 2011 05:56:12 +1300

I have also had this problem for at least six months (I can't say from
which version).

For me it seems to be an issue with Japanese URLs containing "~": the
"%7E" also gets re-encoded.

It is almost as if wget is using the "HTTP-ENCODING" to re-encode URLs
that it has already encoded ("%7E").
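
To illustrate the kind of double encoding I mean, here is a minimal
sketch in Python (not wget's code; whether wget does exactly this is
unconfirmed, and the variable names are made up for the example).
Escaping a path that is already escaped turns the "%" itself into
"%25":

```python
from urllib.parse import quote

# A path that has already been percent-encoded once ("~" -> "%7E").
already_escaped = "/%7Euser/blah"

# Escaping again treats "%" as a literal character and encodes it as "%25".
double_escaped = quote(already_escaped)

# Marking "%" as safe leaves the existing escape untouched.
preserved = quote(already_escaped, safe="/%")

print(double_escaped)  # /%257Euser/blah
print(preserved)       # /%7Euser/blah
```
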

Paul

On Tue, Nov 22, 2011 at 4:20 AM, Bram Vandoren
<address@hidden> wrote:
> Hi,
> I encountered a bug in wget that occurs with recursive retrieval: if a page
> contains 2 (or more) links:
> <a href="http://example.com/~user/blah"> and
> <a href="http://example.com/%7Euser/blah">
>
> Both links point to the same page but the encoding is different. wget
> doesn't recognise this as the same page and downloads the page 'blah' twice.
> It also overwrites the first downloaded file.
> Also if you specify the conversion option '-k', it only converts one of the
> two links.
>
> I had a quick look at the source code. It can be solved by changing
> url_parse in url.c: call url_unescape before parsing the URL. This way you
> get the same parsed URL for both links. I am not sure if this is a good
> way to solve it. The conversion should probably be similar to the conversion
> that's done to determine the file name of the URL.
>
> Kind regards,
> Bram.
>
>
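
As a rough illustration of the comparison problem Bram describes, here
is a minimal Python sketch (not wget's C code). Rather than unescaping
the whole URL before parsing, which could also turn escaped delimiters
such as "%2F" into real ones, it follows the normalization described in
RFC 3986: decode only escapes of unreserved characters and uppercase
the remaining hex digits. The function name normalize_escapes is
hypothetical.

```python
import re
import string

# Unreserved characters per RFC 3986: letters, digits, "-", ".", "_", "~".
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")

_PCT = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_escapes(url: str) -> str:
    """Return url with escapes of unreserved characters decoded."""

    def repl(match: re.Match) -> str:
        ch = chr(int(match.group(1), 16))
        if ch in UNRESERVED:
            return ch  # "%7E" -> "~"
        # Keep reserved or non-ASCII escapes, but normalise the hex case.
        return "%" + match.group(1).upper()

    return _PCT.sub(repl, url)


a = normalize_escapes("http://example.com/~user/blah")
b = normalize_escapes("http://example.com/%7Euser/blah")
print(a == b)  # True: both links now compare equal
```

With keys normalized this way, a recursive crawler could store each URL
in a set of visited pages and skip the second link instead of
downloading 'blah' twice.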