Gabriel L. Somlo
Re: [Bug-wget] [RFE / project idea]: convert-links for "transparent proxy" mode
Mon, 31 Aug 2015 20:02:41 -0400
On Mon, Aug 31, 2015 at 03:14:42AM +0200, Ander Juaristi wrote:
> Since no one expressed either interest or refusal to this idea (and I found
> myself in an unexpected situation of having more free time than usual :D), I
> decided to work on it a bit, which I've been doing during this week.
> After hacking some code over your inline comments, I did several test runs
> over your provided test servers (www.contrib.andrew.cmu.edu/...) and still
> Wget was processing net paths automatically by prefixing the protocol
> ("http://"). So I thought the problem could be tackled down by just not
> converting net paths ("//") into schemes (ie "http://"), when transforming
> the downloaded HTML/CSS files.
> Sorry if I'm still unable to see through your use case but I think it all
> could be solved by simply introducing a new switch that prevents that
> conversion. For example:
> $ wget --keep-net-paths ...
> So that "//mirror.cmu.edu/..." would not be converted into
> "http://mirror.cmu.edu/...". The rest of the job (such as #1 in your
> previous answer) would be done by the other switches, such as
> '--convert-links' itself.
> You've got a broader overview than me. You think this is enough?
I started by looking at
char *newname = construct_relative(file, link->localname)
That function uses the disk file name of the file containing the link,
and the disk file name of the file the link is pointing at.
All it needs is the two file names, so it can build a relative file
system reference (e.g. backing out of the current dir. of 'file'
enough to then be able to descend into the current dir of
link->localname).
It returns a freshly allocated string (newname) which then gets
quoted, and used to replace the value of the original link in the
document.
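
To make the "back out, then descend" idea concrete, here is a minimal
sketch of that kind of computation. It is not wget's construct_relative();
the name relative_ref() and the example paths in the usage note are made
up for illustration, and it assumes both arguments are relative paths
using '/' as the separator.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Hypothetical stand-in for the behaviour described above; NOT wget's code.
   Builds a relative reference from 'file' to 'target'. Caller frees it. */
char *relative_ref(const char *file, const char *target)
{
    assert(file != NULL && target != NULL);

    /* Find the end of the last directory component both names share. */
    size_t common = 0;
    for (size_t i = 0; file[i] != '\0' && file[i] == target[i]; i++)
        if (file[i] == '/')
            common = i + 1;

    /* Every '/' left in 'file' is one directory we have to back out of. */
    size_t ups = 0;
    for (const char *s = file + common; *s != '\0'; s++)
        if (*s == '/')
            ups++;

    const char *rest = target + common;
    char *out = malloc(3 * ups + strlen(rest) + 1);
    if (out == NULL)
        return NULL;

    /* Emit one "../" per directory, then descend into the target. */
    char *w = out;
    for (size_t i = 0; i < ups; i++) {
        memcpy(w, "../", 3);
        w += 3;
    }
    strcpy(w, rest);
    return out;
}
```

With these made-up inputs, relative_ref("a/b/page.html", "a/img/logo.png")
yields "../img/logo.png": one step out of "b/", then down into "img/".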
For "transparent proxy" mode we want something kinda like that, but not
really: we still want a newly allocated 'newname', except not something
related to a local-disk file name.
We need the original value of the link from the downloaded document
(may start with '[http[s]:]//...', depending on whatever the author
of the web page used in their original html), and we need the
extension-adjusted name of the saved link target (that's still
link->localname).
The original value of the link starts at 'p' (or 'url_start'), and
its size is given by link->size.
So we could call a function
char *newname = construct_tpu(p, link)
p points at a string which looks like this:
link->size includes the surrounding single or double quotes, if
present in the original file.
So, if *p=='"' or if *p=='\'', the real link size is shorter by
two characters than the value of link->size :), and the actual link
text starts at *(p+1).
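
The quote handling could look roughly like the sketch below. The helper
name unquoted_link() is hypothetical, and it takes the size as a plain
argument instead of reading link->size, so that it stands on its own.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: given the start of the link as it appears in the
   document (p) and its size including any surrounding quotes, return a
   pointer to the actual link text and store its length in *len. */
const char *unquoted_link(const char *p, size_t size, size_t *len)
{
    assert(p != NULL && len != NULL);

    if (size >= 2 && (*p == '"' || *p == '\'')) {
        *len = size - 2;   /* drop the opening and the closing quote */
        return p + 1;      /* link text starts right after the quote */
    }

    *len = size;           /* unquoted attribute value: use it as-is */
    return p;
}
```

The result is not NUL-terminated at the end of the link, since it still
points into the document buffer, so callers have to respect *len.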
link->localname will be something like
All we need to do is calculate dirname(p) and basename(link->localname),
concatenate them together, and we've ended up with a "transparent
proxy URL" link, which uses the "online" (i.e. NOT file://...)
protocol to request the *adjusted* filename scraped and saved by wget.
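
A rough sketch of that concatenation follows. It is only an illustration
under a few assumptions: the function takes plain strings rather than
wget's link structure (so the name construct_tpu_sketch() and its
signature are made up), the link text has already had its quotes
stripped, and "dirname" is taken to mean everything up to and including
the last '/'. Links with no path component, or with a '/' inside a query
string, would need extra care that is not shown here.

```c
#include <assert.h>
#include <stdlib.h>
#include <string.h>

/* Sketch only, not wget code. 'link' is the unquoted original link text
   (link_len bytes, not necessarily NUL-terminated); 'localname' is the
   extension-adjusted name of the saved file. Caller frees the result. */
char *construct_tpu_sketch(const char *link, size_t link_len,
                           const char *localname)
{
    assert(link != NULL && localname != NULL);

    /* "dirname" of the link: up to and including its last '/'. */
    size_t dir_len = 0;
    for (size_t i = 0; i < link_len; i++)
        if (link[i] == '/')
            dir_len = i + 1;

    /* "basename" of the local file: whatever follows its last '/'. */
    const char *base = strrchr(localname, '/');
    base = (base != NULL) ? base + 1 : localname;

    char *out = malloc(dir_len + strlen(base) + 1);
    if (out == NULL)
        return NULL;

    memcpy(out, link, dir_len);
    strcpy(out + dir_len, base);
    return out;
}
```

For a made-up link "//example.com/docs/page" saved locally as
"example.com/docs/page.html", this gives "//example.com/docs/page.html":
the original scheme and host survive untouched, and only the last path
component is swapped for the adjusted file name.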
In other words,
Does that make sense?
Please feel free to grab me on IRC some time during "work hours" (I'm on
US Eastern time, hope there's some useful overlap with your active
hours :) and we can chat about it in some more detail, if you'd like.