Hi,
Below is an idea for an enhancement to wget, which might be a
two-day-ish project for someone familiar with C, maybe less if
one is also really familiar with wget internals.
The feature I'm looking for is an alternative to the existing
"--convert-links" option that would allow the scraped content to be
hosted online (from a transparent proxy, e.g. a squid cache),
instead of being limited to offline viewing via "file://".
I would be very happy to help review and test any patches
implementing something like this, but can't contribute any C code
myself, for lawyerly, copyright-assignment-related reasons.
I am also willing and able to buy beer, should we ever meet in person
(e.g. at linuxconf in Seattle later this year) :)
Here go the details:
When recursively scraping a site, the -E (--adjust-extension) option
will append .html or .css to the names of the files in which HTML or CSS
output generated by script calls is saved.
Then, -k (--convert-links) will modify the html documents referencing
such scripts, so that the respective links will also have their extension
adjusted to match the file name(s) to which script output is saved.
Unfortunately, -k also modifies the beginning (protocol://host...) portion
of links during conversion. For instance, a link:
    "//host.example.net/cgi-bin/foo.cgi?param"
might get turned into:
    "../../../host.example.net/cgi-bin/foo.cgi%3Fparam.html"
which is fine when the scraped site is viewed locally (e.g. in a browser
via "file://..."), but breaks if one attempts to host the scraped content
for access via "http://..." (e.g. in a transparent proxy, think populating
a squid cache from a recursive wget run).
In the latter case, we'd like to still be able to convert links, but they'd
have to look something like this instead:
    "//host.example.net/cgi-bin/foo.cgi%3Fparam.html"
In other words, we want to be able to convert the filename portion of the
link only (in Unix terms, that's the "basename"), and leave the protocol,
host, and path portions alone (i.e., don't touch the "dirname" part of the
link).
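To make the intended transformation concrete, here is a minimal C sketch
of the string manipulation involved. It is not wget code: the function name
convert_basename_only is hypothetical, and the set of characters it escapes
('?', '#' and '%') is an assumption chosen to reproduce the example above.
It also assumes the link has a non-empty path (a bare "//host" would be
mishandled). A real patch would hook into wget's existing link conversion
rather than stand alone like this.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical helper, NOT part of wget.  Builds a "basename-only"
   converted link: keeps everything in LINK up to and including the last
   '/' that precedes any '?' or '#' (the "dirname"), then appends the
   basename of LOCAL_FILE with '?', '#' and '%' percent-encoded.
   Returns a malloc'd string the caller must free, or NULL on failure.  */
char *
convert_basename_only (const char *link, const char *local_file)
{
  assert (link != NULL && local_file != NULL);

  /* The path part of the link ends at the first '?' or '#'.  */
  size_t path_len = strcspn (link, "?#");

  /* Length of the dirname part, including its trailing '/'.  */
  size_t dir_len = path_len;
  while (dir_len > 0 && link[dir_len - 1] != '/')
    dir_len--;

  /* Basename of the file that was saved to disk.  */
  const char *base = strrchr (local_file, '/');
  base = base ? base + 1 : local_file;

  /* Worst case: every basename character expands to three.  */
  char *out = malloc (dir_len + 3 * strlen (base) + 1);
  if (!out)
    return NULL;

  memcpy (out, link, dir_len);
  char *p = out + dir_len;
  for (; *base; base++)
    {
      switch (*base)
        {
        case '?': memcpy (p, "%3F", 3); p += 3; break;
        case '#': memcpy (p, "%23", 3); p += 3; break;
        case '%': memcpy (p, "%25", 3); p += 3; break;
        default:  *p++ = *base; break;
        }
    }
  *p = '\0';
  return out;
}
```

Called with the link "//host.example.net/cgi-bin/foo.cgi?param" and a
local file name ending in "cgi-bin/foo.cgi?param.html", the sketch keeps
"//host.example.net/cgi-bin/" untouched and only rewrites the last
component.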
The specification below is formatted as a patch against the current wget
git master, but contains no actual code, just instructions on how one
would write this alternative version of --convert-links.
I have also built a small two-server test for this functionality. Running:
    wget -rpH -l 1 -P ./vhosts --adjust-extension --convert-links \
        www.contrib.andrew.cmu.edu/~somlo/WGET/
will result in three very small html documents with stylesheet links
that look like "../../../host/script.html". Once the spec below is
successfully implemented, running
    wget -rpH -l 1 -P ./vhosts --adjust-extension --basename-only-convert-option \
        www.contrib.andrew.cmu.edu/~somlo/WGET/
should result in the stylesheet references being converted to the desired
"//host/script.html" format instead.
Thanks in advance, and please feel free to get in touch if this sounds
interesting!
-- Gabriel