bug-wget

Re: [Bug-wget] Unexpected result with -H and -D


From: Tim Rühsen
Subject: Re: [Bug-wget] Unexpected result with -H and -D
Date: Wed, 17 Jan 2018 15:53:58 +0100
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.5.2

Hi,

This is not a PSL matching issue, so libpsl is not needed.

Only sufmatch() has to be fixed so that it does (sub)domain matching.

Attached is a fix.
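
To illustrate the idea (a minimal sketch, not the attached patch): a plain
suffix comparison accepts "werkenbijscapino.nl" for the pattern "scapino.nl"
because the host merely ends with that string. A domain-aware match also
requires that the suffix starts at a label boundary, meaning the host is
either identical to the pattern or the character just before the suffix is
a dot. The function name domain_matches() and the handling of a leading dot
in the pattern are assumptions made for this example; they are not taken
from src/host.c.

```c
#include <stdbool.h>
#include <stddef.h>
#include <string.h>
#include <strings.h>

/* Hypothetical helper, NOT wget's sufmatch(): returns true if `host`
   equals `domain` or is a subdomain of it. */
static bool
domain_matches (const char *host, const char *domain)
{
  size_t hlen = strlen (host);
  size_t dlen = strlen (domain);

  /* Assumption: tolerate a leading dot in the pattern (".scapino.nl"). */
  if (dlen > 0 && domain[0] == '.')
    {
      domain++;
      dlen--;
    }

  if (dlen == 0 || dlen > hlen)
    return false;

  /* The tail of the host must equal the domain, ignoring case. */
  if (strcasecmp (host + (hlen - dlen), domain) != 0)
    return false;

  /* Accept only an exact match or a suffix that begins right after a dot. */
  return hlen == dlen || host[hlen - dlen - 1] == '.';
}
```

With this check, "www.scapino.nl" and "cdn.scapino.nl" match "scapino.nl",
while "werkenbijscapino.nl" does not.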


With Best Regards, Tim



On 01/17/2018 03:01 PM, Darshit Shah wrote:
> Hi,
> 
> This is a bug in Wget, and apparently a really old one! It seems the bug has
> been around since at least 1997.
> 
> Looking at the source, the issue is that Wget does a very simple suffix
> matching on the actual domain and accepted domains list. This is obviously
> wrong as you have just found out.
> 
> I'm going to try and implement this correctly, but I'm currently a little
> short on time, so if anyone else wants to pick it up, please feel free to.
> It's simple: use libpsl to get the proper domain name and match against that.
> 
> 
> Of course, this change will require libpsl to no longer be an optional
> dependency.
> 
> * Friso van Vollenhoven <address@hidden> [180117 14:40]:
>> Hello all,
>>
>> I am trying to do a recursive download of a webpage and span multiple hosts
>> within the same domain, but not cross to other domains. The issue is that
>> the crawl does extend to other domains. My full command is this:
>>
>> wget \
>> --recursive \
>> --no-clobber \
>> --page-requisites \
>> --adjust-extension \
>> --span-hosts \
>> --domains=scapino.nl \
>> --no-parent \
>> --tries=2 \
>> --wait=1 \
>> --random-wait \
>> --waitretry=2 \
>> --header='User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36' \
>> https://www.scapino.nl/winkels/scapino-utrecht-510061
>>
>> From this combination of --span-hosts and --domains, I would expect to
>> download assets from cdn.scapino.nl and www.scapino.nl, but not other
>> domains. For some reason that I don't understand, wget also starts to do
>> what looks like a full crawl of the domain werkenbijscapino.nl, which is
>> referenced from the original page.
>>
>> Any thoughts or direction would be much appreciated.
>>
>> I am using wget 1.18 on Debian.
>>
>>
>> Best regards,
>> Friso
> 

Attachment: 0001-src-host.c-sufmatch-Fix-to-domain-matching.patch
Description: Text Data

Attachment: signature.asc
Description: OpenPGP digital signature

