Re: [Bug-wget] avoiding a large number of HEAD reqs when resuming


From: Ángel González
Subject: Re: [Bug-wget] avoiding a large number of HEAD reqs when resuming
Date: Mon, 04 May 2015 22:39:28 +0200
User-agent: Thunderbird

On 30/04/15 14:04, User Goblin wrote:
> My initial idea was to parse wget's -o output to figure out which
> files still need to be downloaded, and then feed them back via -i
> when continuing the download. That led me to the conclusion that I'd
> need two pieces of functionality: (1) machine-parseable -o output,
> and (2) a way to turn a partially downloaded directory structure into
> a list of the links that still need downloading.
>
> I can work around (1); the output of -o is merely hard to parse (see
> the sketch after this quote).
>
> For (2), I could use lynx or w3m or something like that, but then I
> am never sure that the links they produce are the same ones wget
> would produce. I'd therefore love an option like `wget
> --extract-links ./index.html` that would just read an HTML file and
> print the list of links it contains. Or perhaps a guarantee that some
> other tool like urlscan extracts links exactly the same way wget
> does.
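
As an aside on (1), here is a rough sketch of that log-parsing
workaround. It assumes the English-locale line shapes of a typical run
("--<timestamp>--  <URL>" when a request starts, "... saved [n/n]" on
completion); both shapes vary across wget versions and locales, which
is exactly why machine-parseable output would help. The script and
function names are made up for illustration.

    #!/usr/bin/env python3
    """Sketch: recover still-pending URLs from a `wget -o wget.log` run."""
    import re
    import sys

    # A request start looks like: --2015-04-30 14:04:01--  http://host/path
    REQUEST = re.compile(r'^--\d{4}-\d{2}-\d{2} \d{2}:\d{2}:\d{2}--\s+(\S+)')
    # A completed download ends with: ... - 'file' saved [1234/1234]
    SAVED = re.compile(r'saved \[(\d+)/(\d+)\]')

    def pending_urls(log_path):
        started = []       # every URL wget began fetching, in order
        finished = set()   # indexes of downloads that completed in full
        current = None
        with open(log_path, encoding='utf-8', errors='replace') as log:
            for line in log:
                m = REQUEST.match(line)
                if m:
                    started.append(m.group(1))
                    current = len(started) - 1
                    continue
                m = SAVED.search(line)
                # Only count a file done when the byte counts agree ([n/n]).
                if m and current is not None and m.group(1) == m.group(2):
                    finished.add(current)
        return [url for i, url in enumerate(started) if i not in finished]

    if __name__ == '__main__':
        for url in pending_urls(sys.argv[1]):
            print(url)     # e.g.: pending.py wget.log | wget -c -i -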
I wrote such a program some time ago, but it was never merged into
wget. See “Exposing wget functionality for extracting links from a web
page”:
https://lists.gnu.org/archive/html/bug-wget/2013-09/msg00079.html

0001-Moved-free_urlpos.patch no longer applies cleanly, so I'm attaching a
rebased one (it's a trivial change, though).

Attachment: 0001-Move-free_urlpos.patch
Description: Text Data
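
And for (2), until something like that patch lands, here is a minimal
standalone sketch of what an `--extract-links` mode might print. This
is emphatically not wget's own parser (html-url.c), only a stdlib
approximation, and the tag/attribute table is an assumed subset; its
output can diverge from wget's, which is the whole point of wanting
the real code exposed.

    #!/usr/bin/env python3
    """Sketch: print the links an HTML file contains, one per line."""
    import sys
    from html.parser import HTMLParser
    from urllib.parse import urljoin

    # Assumed subset of the link-bearing attributes wget considers.
    LINK_ATTRS = {
        'a': ('href',), 'area': ('href',), 'link': ('href',),
        'img': ('src',), 'script': ('src',), 'iframe': ('src',),
    }

    class LinkExtractor(HTMLParser):
        def __init__(self, base_url):
            super().__init__()
            self.base_url = base_url
            self.links = []

        def handle_starttag(self, tag, attrs):
            wanted = LINK_ATTRS.get(tag, ())
            for name, value in attrs:
                if name in wanted and value:
                    # Resolve relative links against the page's own URL.
                    self.links.append(urljoin(self.base_url, value))

    if __name__ == '__main__':
        # Usage: extract_links.py index.html http://example.com/
        path, base = sys.argv[1], sys.argv[2]
        extractor = LinkExtractor(base)
        with open(path, encoding='utf-8', errors='replace') as page:
            extractor.feed(page.read())
        print('\n'.join(extractor.links))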

