
Re: [Bug-wget] avoiding a large number of HEAD reqs when resuming


From: Tim Rühsen
Subject: Re: [Bug-wget] avoiding a large number of HEAD reqs when resuming
Date: Thu, 30 Apr 2015 23:02:31 +0200
User-agent: KMail/4.14.2 (Linux/3.16.0-4-amd64; KDE/4.14.2; x86_64; ; )

On Thursday, 30 April 2015, 12:04:18, User Goblin wrote:
> The situation: I'm trying to resume a large recursive download of a site
> with many files (-r -l 10 -c)
> 
> The problem: When resuming, wget issues a large number of HEAD requests
> for each file that it already downloaded. This triggers the upstream
> firewall, making the download impossible.
> 
> My initial idea was to parse wget's -o output and figure out which files
> still need to be downloaded, and then feed them via -i when continuing the
> download. This led me to the conclusion that I'd need two pieces of
> functionality, (1) machine-parseable output of -o, and (2) a way to convert
> a partially downloaded directory structure to links that still need
> downloading.
> 
> I could work around (1); the output of -o is just hard to parse.
> 
> For (2), I could use lynx or w3m or something like that, but then I am never
> sure that the links produced are the same ones wget would produce. Therefore
> I'd love an option like `wget --extract-links ./index.html` that'd just
> read an html file and produce a list of links on output. Or perhaps an
> assertion that some other tool like urlscan will do it exactly the same way
> as wget.
> 
> There's a third idea that we discussed on IRC with darnir, namely having
> wget store its state when downloading. That would solve the original problem
> and would be pretty nice. However, I'd still like to have (1) and (2) done,
> because I'm also thinking of distributing this large download to a number
> of IP addresses, by running many instances of wget on many different
> servers (and writing a script that'd distribute the load).
> 
> Thoughts welcome :-)

The top-down approach would be something like

wget -r --extract-links | distributor host1 host2 ... hostN

'distributor' is a program that starts one instance of wget on each given host, 
takes the (absolute) URLs via stdin, and hands them out to the wget instances 
(e.g. round-robin... better would be to know whether a file download has 
finished).
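
A minimal sketch of such a distributor (hypothetical glue code, not part of 
wget; it assumes the hosts are reachable via ssh and that each remote wget 
takes its URL list on stdin via '-i -'):

#!/usr/bin/env python3
# Rough sketch of the 'distributor' idea: start one remote wget per host
# (reading URLs from its stdin via 'wget -i -') and feed it URLs round-robin.
# Host names, ssh access and the wget options are assumptions for
# illustration; error handling and load feedback are left out.
import itertools
import subprocess
import sys

def main():
    hosts = sys.argv[1:]
    if not hosts:
        sys.exit("usage: distributor host1 host2 ... hostN")

    # One long-running wget per host, taking its URL list on stdin.
    procs = [
        subprocess.Popen(["ssh", host, "wget", "-x", "-c", "-i", "-"],
                         stdin=subprocess.PIPE)
        for host in hosts
    ]

    # Hand each URL from our stdin to the next host, round-robin.
    for url, proc in zip(sys.stdin, itertools.cycle(procs)):
        proc.stdin.write(url.encode())
        proc.stdin.flush()

    # Close the pipes so the remote wget instances see end of input.
    for proc in procs:
        proc.stdin.close()
    sys.exit(max(proc.wait() for proc in procs))

if __name__ == "__main__":
    main()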

I assume '-r --extract-links' does not download, but just recursively 
scans/extracts the existing files!?

Wget also has to be adjusted to start downloading immediately on the first URL 
read from stdin. Right now it collects all URLs until stdin closes and then 
starts downloading.

I wrote a C library for the next-generation Wget (I'll start moving the code to 
wget this autumn) with which you can also do the extraction part. There are 
small C examples that you might extend to work recursively. It works with CSS 
and HTML.

https://github.com/rockdaboot/mget/tree/master/examples
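
For illustration only (this is not the mget API): the extraction step for a 
single HTML file could be sketched with Python's standard library like this, 
resolving relative links against a base URL you would have to supply.

#!/usr/bin/env python3
# Illustrative only -- prints the absolute URLs referenced by one HTML file.
# A recursive variant would feed downloaded files back through this.
import sys
from html.parser import HTMLParser
from urllib.parse import urljoin

class LinkExtractor(HTMLParser):
    def __init__(self, base):
        super().__init__()
        self.base = base

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            # href/src cover <a>, <link>, <img>, <script>, <iframe>, ...
            if name in ("href", "src") and value:
                print(urljoin(self.base, value))

if __name__ == "__main__":
    if len(sys.argv) != 3:
        sys.exit("usage: extract_links.py index.html http://example.com/")
    with open(sys.argv[1], encoding="utf-8", errors="replace") as f:
        LinkExtractor(base=sys.argv[2]).feed(f.read())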

Regards, Tim
 
