From: Gabriel L. Somlo
Subject: Re: [Bug-wget] [RFE / project idea]: convert-links for "transparent proxy" mode
Date: Sun, 5 Jul 2015 14:29:26 -0400
User-agent: Mutt/1.5.23 (2014-03-12)

On Sun, Jul 05, 2015 at 07:34:07PM +0200, Ander Juaristi wrote:
> Hi Gabriel,
> 
> So, if I understood correctly, you want to keep the modifications made by Wget
> to the basename (such as escaping the reserved characters) but not touch the
> hostname production, right?
> 
> So that instead of
> 
>     ../../../mirror.ini.cmu.edu/cgi-bin/cssfoo.cgi%3Ftwo.html
> 
> it would have to be
> 
>     http://mirror.ini.cmu.edu/cgi-bin/cssfoo.cgi%3Ftwo.html
> 
> For a URI that originally was (BTW, I don't know if omitting the scheme is
> correct, but anyway, that's how it was)
> 
>     //mirror.ini.cmu.edu/cgi-bin/cssfoo.cgi?two
> 
> Thus, without looking at the code (either Wget's original or your proposed
> changes), and from a purely algorithmic approach, the original behaviour of
> Wget is something like this:
> 
>     for each absolute URI found as uri
>     loop
>         convert_relative(uri)
>         escape(uri)
>     end loop
> 
> And what you want is something like this:
> 
>     for each absolute URI found as uri
>     loop
>         escape(uri)       // keep the URI as-is but escape the reserved characters
>     end loop
> 
> Am I right?

Almost :)

Leaving out escaping the URI (which needs to happen in all cases),
construct_relative() does two things:

        1. modify "basename" according to what --adjust-extension did
           to the file name of the document targeted by the URI. I.e.,
           foo.cgi?arg1&arg2&arg3  -> foo.cgi\?arg1\&arg2\&arg3.html
                                                               ^^^^^

        2. modify "dirname" to compute the correct number of "../"'s
           required to back out of the directory hierarchy
           representing the current web server before being ready to
           enter the target web server's folder hierarchy.
           That, btw, is the part which assumes one always ends up
           using "file://" to view the scraped content :)

We do need #1 to still happen, so just escaping the URI isn't going
to be enough. When the document containing the URI in question is
being served from a transparent proxy, the client will request the
URI, which then had better match something else also available on the 
transparent proxy. When --adjust-extension was used, that something
will have a different basename than what's in the link URI.

Regarding #2, we clearly don't want

        //mirror.ini.cmu.edu/foo...

to be converted into

        ../../../mirror.ini.cmu.edu/foo...

However, we ALSO do not want it converted into

        http://mirror.ini.cmu.edu/foo...

Why would that matter? Leaving out the scheme (i.e., "//host/path")
in a document translates into "the client will use the same scheme as for
the referencing document which contains the URI". So, if I'm
downloading index.html using https, then a stylesheet link inside
index.html written as "//some-other-host/foo/stylesheet.css" will also
be requested via https.

If you hardcode it to "http://some-other-host/foo/stylesheet.css",
then when loading the referencing index.html via https, the stylesheet
will NOT load, and the document will be displayed all wrong and ugly.
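
To make that concrete, here is a tiny stand-alone C illustration
(resolve() is a made-up helper, not anything from wget):

#include <stdio.h>
#include <string.h>

/* a "//host/path" link inherits the scheme of the referencing document */
static void resolve(const char *referrer_scheme, const char *link,
                    char *out, size_t outsz)
{
    if (strncmp(link, "//", 2) == 0)
        snprintf(out, outsz, "%s:%s", referrer_scheme, link);
    else
        snprintf(out, outsz, "%s", link);   /* link already has a scheme */
}

int main(void)
{
    char out[256];

    /* index.html fetched via https: the stylesheet is fetched via https too */
    resolve("https", "//some-other-host/foo/stylesheet.css", out, sizeof out);
    printf("%s\n", out);    /* https://some-other-host/foo/stylesheet.css */

    /* a hardcoded http:// link stays http://, so an https referencing page
     * ends up with a mixed-content request the browser will block */
    resolve("https", "http://some-other-host/foo/stylesheet.css", out, sizeof out);
    printf("%s\n", out);    /* http://some-other-host/foo/stylesheet.css */
    return 0;
}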

So, in conclusion, we want a "construct_transparent_proxy"-specific
function which converts links inside documents to match what
--adjust-extension did to the actual files being referenced, but
WITHOUT touching "dirname" in any way, leaving it the way it was in
the original document.
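
In other words, something along the lines of this stand-alone sketch
(the function name and the minimal '?'-only escaping are placeholders
on my part, not a real patch):

#include <stdio.h>
#include <string.h>

static void convert_transparent_proxy(const char *uri, char *out, size_t outsz)
{
    const char *slash = strrchr(uri, '/');
    const char *base = slash ? slash + 1 : uri;
    size_t dirlen = (size_t)(base - uri);
    size_t n = 0;

    /* copy the "dirname" part (scheme, host, path) through verbatim */
    if (dirlen < outsz) {
        memcpy(out, uri, dirlen);
        n = dirlen;
    }

    /* rewrite only the "basename", the same way --adjust-extension
     * renamed the saved file: escape '?' and append ".html" */
    for (const char *p = base; *p && n + 6 < outsz; p++) {
        if (*p == '?') {
            memcpy(out + n, "%3F", 3);
            n += 3;
        } else {
            out[n++] = *p;
        }
    }
    snprintf(out + n, outsz - n, ".html");
}

int main(void)
{
    char out[512];

    convert_transparent_proxy("//mirror.ini.cmu.edu/cgi-bin/cssfoo.cgi?two",
                              out, sizeof out);
    /* prints //mirror.ini.cmu.edu/cgi-bin/cssfoo.cgi%3Ftwo.html */
    printf("%s\n", out);
    return 0;
}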

Hope that makes sense.

Thanks,
--Gabriel

> 
> On 06/29/2015 04:03 PM, Gabriel L. Somlo wrote:
> >Hi,
> >
> >Below is an idea for an enhancement to wget, which might be a
> >two-day-ish project for someone familiar with C, maybe less if
> >one is also really familiar with wget internals.
> >
> >
> >The feature I'm looking for consists of an alternative to the existing
> >"--convert-links" option, which would allow the scraped content to be
> >hosted online (from a transparent proxy, like e.g. a squid cache),
> >instead of being limited to offline viewing, via "file://".
> >
> >
> >I would be very happy to collaborate (review and test) any patches
> >implementing something like this, but can't contribute any C code
> >myself, for lawyerly, copyright-assignment related reasons.
> >
> >I am also willing and able to buy beer, should we ever meet in person
> >(e.g.  at linuxconf in Seattle later this year) :)
> >
> >
> >Here go the details:
> >
> >When recursively scraping a site, the -E (--adjust-extension) option
> >will append .html or .css to output generated by script calls.
> >
> >Then, -k (--convert-links) will modify the html documents referencing
> >such scripts, so that the respective links will also have their extension
> >adjusted to match the file name(s) to which script output is saved.
> >
> >Unfortunately, -k also modifies the beginning (protocol://host...) portion
> >of links during conversion. For instance, a link:
> >
> >   "//host.example.net/cgi-bin/foo.cgi?param"
> >
> >might get turned into:
> >
> >   "../../../host.example.net/cgi-bin/foo.cgi%3Fparam.html"
> >
> >which is fine when the scraped site is viewed locally (e.g. in a browser
> >via "file://..."), but breaks if one attempts to host the scraped content
> >for access via "http://..." (e.g. in a transparent proxy, think populating
> >a squid cache from a recursive wget run).
> >
> >In the latter case, we'd like to still be able to convert links, but they'd
> >have to look something like this instead:
> >
> >   "//host.example.net/cgi-bin/foo.cgi%3Fparam.html"
> >
> >In other words, we want to be able to convert the filename portion of the
> >link only (in Unix terms, that's the "basename"), and leave the protocol,
> >host, and path portions alone (i.e., don't touch the "dirname" part of the
> >link).
> >
> >
> >The specification below is formatted as a patch against the current wget
> >git master, but contains no actual code, just instructions on how one
> >would write this alternative version of --convert-links.
> >
> >
> >I have also built a small two-server test for this functionality. Running:
> >
> >wget -rpH -l 1 -P ./vhosts --adjust-extension --convert-links \
> >      www.contrib.andrew.cmu.edu/~somlo/WGET/
> >
> >will result in three very small html documents with stylesheet links
> >that look like "../../../host/script.html". Once the spec below is
> >successfully implemented, running
> >
> >wget -rpH -l 1 -P ./vhosts --adjust-extension --basename-only-convert-option \
> >      www.contrib.andrew.cmu.edu/~somlo/WGET/
> >
> >should result in the stylesheet references being converted to the desired
> >"//host/script.html" format instead.
> >
> >
> >Thanks in advance, and please feel free to get in touch if this sounds 
> >interesting!
> >
> >   -- Gabriel
> -- 
> Regards,
> - AJ


