From: User Goblin
Subject: [Bug-wget] avoiding a large number of HEAD reqs when resuming
Date: Thu, 30 Apr 2015 12:04:18 +0000
User-agent: Mutt/1.4.2.3i

The situation: I'm trying to resume a large recursive download of a site
with many files (-r -l 10 -c).

The problem: When resuming, wget issues a HEAD request for every file it has
already downloaded, which adds up to a very large number of requests. This
triggers the upstream firewall and makes finishing the download impossible.

My initial idea was to parse wget's -o output, figure out which files still
need to be downloaded, and feed them back in via -i when continuing the
download. This led me to the conclusion that I'd need two pieces of
functionality: (1) machine-parseable -o output, and (2) a way to convert a
partially downloaded directory structure into a list of links that still
need downloading.

I could work around (1); the output of -o is just hard to parse.
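
For the record, here is the kind of log scraping I had in mind. It's only a
sketch: it assumes -o lines of the form "--<timestamp>--  <url>" for each
request and a closing "... saved [N/N]" once a file is complete, and the
exact wording depends on the wget version and locale:

    #!/usr/bin/env python
    # Sketch: print URLs that wget started but never finished, from a -o log.
    # Assumes the log line formats described above.
    import re
    import sys

    url_re  = re.compile(r'^--\S+ \S+--\s+(\S+)')   # "--2015-04-30 12:04:18--  http://..."
    done_re = re.compile(r'saved \[(\d+)/\1\]')     # "... 'file' saved [1234/1234]"

    started, finished = [], set()
    current = None
    for line in open(sys.argv[1]):
        m = url_re.match(line)
        if m:
            current = m.group(1)
            started.append(current)
        elif done_re.search(line) and current:
            finished.add(current)

    for url in started:
        if url not in finished:
            print(url)          # feed this list back to wget with -i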

For (2), I could use lynx or w3m or something like that, but then I can
never be sure that the links they produce are the same ones wget would
produce. Therefore I'd love an option like `wget --extract-links ./index.html`
that would just read an HTML file and print the list of links it contains.
Or perhaps an assurance that some other tool, like urlscan, extracts links
exactly the same way wget does.
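
In the meantime the best I can do is a quick-and-dirty extractor along these
lines; it just pulls href and src attributes and resolves them against a
base URL, which is exactly why I want the real thing: I have no guarantee
this matches what wget's own HTML parser collects (base tags, srcset,
CSS url(), and so on):

    #!/usr/bin/env python3
    # Sketch of an "--extract-links"-style tool: print every href/src found
    # in an HTML file, resolved against a base URL. Not guaranteed to match
    # wget's own link extraction.
    import sys
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    class LinkDump(HTMLParser):
        def handle_starttag(self, tag, attrs):
            for name, value in attrs:
                if name in ('href', 'src') and value:
                    print(urljoin(sys.argv[2], value))

    with open(sys.argv[1], errors='replace') as f:
        LinkDump().feed(f.read())

(Run as e.g. `extract_links.py index.html http://example.com/`.)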

There's a third idea that darnir and I discussed on IRC, namely having wget
store its state while downloading. That would solve the original problem and
would be pretty nice. However, I'd still like to have (1) and (2), because
I'm also thinking of distributing this large download across a number of IP
addresses by running many instances of wget on many different servers (and
writing a script that distributes the load).
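
That distribution script would be nothing fancier than splitting the
remaining URL list round-robin into one file per host and starting a wget
per host; roughly the sketch below (the host names are placeholders):

    #!/usr/bin/env python
    # Sketch: split a URL list round-robin into one file per host and print
    # the commands that would start each host's share. Host names are
    # placeholders.
    import sys

    hosts = ['mirror1.example.org', 'mirror2.example.org', 'mirror3.example.org']
    urls = [line.strip() for line in open(sys.argv[1]) if line.strip()]

    for i, host in enumerate(hosts):
        listfile = 'urls.%s.txt' % host
        with open(listfile, 'w') as f:
            f.write('\n'.join(urls[i::len(hosts)]) + '\n')
        print('scp %s %s: && ssh %s wget -x -c -i %s' % (listfile, host, host, listfile))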

Thoughts welcome :-)


