bug-wget

Re: [Bug-wget] Unexpected result with -H and -D


From: Darshit Shah
Subject: Re: [Bug-wget] Unexpected result with -H and -D
Date: Wed, 17 Jan 2018 15:01:21 +0100
User-agent: NeoMutt/20171215

Hi,

This is a bug in Wget, and apparently a really old one! It seems to have been
around since at least 1997.

Looking at the source, the issue is that Wget does a very simple suffix match
of the actual host name against the accepted domains list. This is obviously
wrong, as you have just found out: "werkenbijscapino.nl" ends with
"scapino.nl", so it is accepted.
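
To illustrate the problem, here is a minimal Python sketch. It is not Wget's
actual C code, and both function names are made up for this example. The first
function mimics a plain string-suffix comparison; the second only accepts the
domain itself or a host that ends in "." plus the domain.

```python
def naive_suffix_match(host: str, accepted: str) -> bool:
    """Plain string-suffix comparison (hypothetical stand-in for the bug)."""
    return host.lower().endswith(accepted.lower())


def label_boundary_match(host: str, accepted: str) -> bool:
    """Accept the domain itself or a true subdomain of it (hypothetical)."""
    host = host.lower().rstrip(".")
    accepted = accepted.lower().strip(".")
    return host == accepted or host.endswith("." + accepted)


# Hosts taken from the report quoted below.
for host in ("www.scapino.nl", "cdn.scapino.nl", "werkenbijscapino.nl"):
    print(host,
          naive_suffix_match(host, "scapino.nl"),
          label_boundary_match(host, "scapino.nl"))
```

With the naive check all three hosts match; with the label-boundary check
only the two real subdomains of scapino.nl do.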

I'm going to try to implement this correctly, but I'm currently a little short
on time, so if anyone else wants to pick it up, please feel free to. It's
simple: use libpsl to get the proper domain name and match against that.


Of course, this change will require libpsl to no longer be an optional
dependency.
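
As a rough sketch of the idea, again in Python rather than Wget's C: reduce
both the host and the accepted domain to their registrable domain and compare
those. The tiny suffix set and the function names below are assumptions made
for illustration only; a real implementation would ask libpsl (which, as far
as I know, offers psl_registrable_domain() for this) instead of hard-coding
anything.

```python
from typing import Optional

# Toy stand-in for the Public Suffix List (assumption for this sketch).
# The real list is far larger and has wildcard and exception rules.
TOY_PUBLIC_SUFFIXES = {"nl", "com", "co.uk"}


def registrable_domain(host: str) -> Optional[str]:
    """Return the public suffix plus one label, or None if there is none."""
    labels = host.lower().rstrip(".").split(".")
    # Scanning from the left finds the longest matching public suffix first.
    for i in range(len(labels)):
        if ".".join(labels[i:]) in TOY_PUBLIC_SUFFIXES:
            if i == 0:
                return None  # the host is itself a public suffix
            return ".".join(labels[i - 1:])
    return None  # suffix not in the toy list


def same_registrable_domain(host: str, accepted: str) -> bool:
    """True if both names reduce to the same registrable domain."""
    reg = registrable_domain(host)
    return reg is not None and reg == registrable_domain(accepted)


print(same_registrable_domain("cdn.scapino.nl", "scapino.nl"))
print(same_registrable_domain("werkenbijscapino.nl", "scapino.nl"))
```

Note that this only compares registrable domains. If someone passes a
subdomain to --domains, a label-boundary check would still be needed on top.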

* Friso van Vollenhoven <address@hidden> [180117 14:40]:
> Hello all,
> 
> I am trying to do a recursive download of a webpage and span multiple hosts
> within the same domain, but not cross to other domains. The issue is that
> the crawl does extend to other domains. My full command is this:
> 
> wget \
> --recursive \
> --no-clobber \
> --page-requisites \
> --adjust-extension \
> --span-hosts \
> --domains=scapino.nl \
> --no-parent \
> --tries=2 \
> --wait=1 \
> --random-wait \
> --waitretry=2 \
> --header='User-Agent:Mozilla/5.0 (Macintosh; Intel Mac OS X 10_13_2)
> AppleWebKit/537.36 (KHTML, like Gecko) Chrome/63.0.3239.132 Safari/537.36' \
> https://www.scapino.nl/winkels/scapino-utrecht-510061
> 
> From this combination of --span-hosts and --domains, I would expect to
> download assets from cdn.scapino.nl and www.scapino.nl, but not other
> domains. For some reason that I don't understand, wget also starts to do
> what looks like a full crawl of the domain werkenbijscapino.nl, which is
> referenced from the original page.
> 
> Any thoughts or direction would be much appreciated.
> 
> I am using wget 1.18 on Debian.
> 
> 
> Best regards,
> Friso

-- 
Thanking You,
Darshit Shah
PGP Fingerprint: 7845 120B 07CB D8D6 ECE5 FF2B 2A17 43ED A91A 35B6
