Re: [Bug-wget] avoiding a large number of HEAD reqs when resuming
From: UukGoblin
Subject: Re: [Bug-wget] avoiding a large number of HEAD reqs when resuming
Date: Fri, 1 May 2015 11:50:19 +0000
User-agent: Mutt/1.4.2.3i
On Thu, Apr 30, 2015 at 11:02:31PM +0200, Tim Rühsen wrote:
> The top-down approach would be something like
>
> wget -r --extract-links | distributor host1 host2 ... hostN
>
> 'distributor' is a program that starts one instance of wget on each
> given host, taking the (absolute) URLs via stdin and feeding them to
> the wget instances (e.g. via round-robin... better would be to know
> whether a file download has finished).
Yes, something like that, although not quite that simple. The distributor
would have to know what has just been downloaded by each worker, and invoke
the link extractor on each newly-downloaded HTML file in order to
append the links found in it to the download queue.
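The distributor idea above could be sketched roughly like this. Note this is a hypothetical illustration, not part of wget: the function names and the round-robin policy are assumptions, and a real distributor would additionally feed links extracted from finished downloads back into the queue.

```python
# Hypothetical sketch of the 'distributor': assign URLs (e.g. read from
# stdin) to worker hosts round-robin. All names here are illustrative.
from itertools import cycle

def distribute(urls, hosts):
    """Round-robin each URL to one host; returns {host: [urls...]}."""
    assignment = {h: [] for h in hosts}
    rr = cycle(hosts)
    for url in urls:
        assignment[next(rr)].append(url)
    return assignment

# A real distributor would then pipe each host's list into a remote
# wget instance (e.g. via ssh + 'wget -i -'), watch for completed HTML
# files, extract their links, and append those back onto the queue.
```

Knowing when a download has actually finished (rather than blind round-robin) is the harder part, as noted above.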
> I assume '-r --extract-links' does not download, but just recursively
> scans/extracts the existing files!?
Yes, that's exactly what I had in mind.
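The scan/extract step on an already-downloaded HTML file can be sketched with nothing but the standard library. This is only an illustration of the idea (the mget examples mentioned later in the thread do it in C, with CSS support as well); the class and function names are made up here.

```python
# Minimal sketch: pull href/src links out of already-downloaded HTML
# using only Python's standard library html.parser.
from html.parser import HTMLParser

class LinkExtractor(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []

    def handle_starttag(self, tag, attrs):
        # attrs is a list of (name, value) pairs for the start tag
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def extract_links(html_text):
    parser = LinkExtractor()
    parser.feed(html_text)
    return parser.links
```

Relative links would still need to be resolved against the page's base URL before being queued.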
> Wget also has to be adjusted to start downloading immediately on the
> first URL read from stdin. Right now it collects all URLs until stdin
> closes and then starts downloading.
Ah, good point, I wasn't aware of that.
> I wrote a C library for the nextgen Wget (starting to move the code to
> wget this autumn) with which you can also do the extraction part. There
> are small C examples that you might extend to work recursively. It
> works with CSS and HTML.
>
> https://github.com/rockdaboot/mget/tree/master/examples
Nice, thank you! I'll check it out :-)