

From: Gabriel L. Somlo
Subject: [Bug-wget] [RFE / project idea]: convert-links for "transparent proxy" mode
Date: Mon, 29 Jun 2015 10:03:21 -0400
User-agent: Mutt/1.5.23 (2014-03-12)

Hi,

Below is an idea for an enhancement to wget, which might be a
two-day-ish project for someone familiar with C, maybe less if
one is also really familiar with wget internals.


The feature I'm looking for consists of an alternative to the existing
"--convert-links" option, which would allow the scraped content to be
hosted online (from a transparent proxy, like e.g. a squid cache),
instead of being limited to offline viewing, via "file://".


I would be very happy to collaborate on (review and test) any patches
implementing something like this, but can't contribute any C code
myself, for lawyerly, copyright-assignment-related reasons.

I am also willing and able to buy beer, should we ever meet in person
(e.g. at linuxconf in Seattle later this year) :)


Here go the details:

When recursively scraping a site, the -E (--adjust-extension) option
will append .html or .css to output generated by script calls.

Then, -k (--convert-links) will modify the html documents referencing
such scripts, so that the respective links will also have their extension
adjusted to match the file name(s) to which script output is saved.

Unfortunately, -k also modifies the beginning (protocol://host...) portion
of links during conversion. For instance, a link:

  "//host.example.net/cgi-bin/foo.cgi?param"

might get turned into:

  "../../../host.example.net/cgi-bin/foo.cgi%3Fparam.html"

which is fine when the scraped site is viewed locally (e.g. in a browser
via "file://..."), but breaks if one attempts to host the scraped content
for access via "http://..." (e.g. in a transparent proxy, think populating
a squid cache from a recursive wget run).

In the latter case, we'd like to still be able to convert links, but they'd
have to look something like this instead:

  "//host.example.net/cgi-bin/foo.cgi%3Fparam.html"

In other words, we want to be able to convert the filename portion of the
link only (in Unix terms, that's the "basename"), and leave the protocol,
host, and path portions alone (i.e., don't touch the "dirname" part of the
link).
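In code, this is a pure basename substitution. Below is a minimal
standalone sketch of the idea, assuming nothing about wget's internals
(the helper name graft_basename is made up for illustration):

```c
#include <stdlib.h>
#include <string.h>

/* Return a newly allocated copy of orig_link with its basename replaced
   by the basename of local_name; the protocol, host, and path portions
   are left untouched.  The caller frees the result.  */
static char *
graft_basename (const char *orig_link, const char *local_name)
{
  /* Dirname length: everything up to and including the last '/'.
     If there is no '/', the whole link is its basename.  */
  const char *slash = strrchr (orig_link, '/');
  size_t dir_len = slash ? (size_t) (slash - orig_link) + 1 : 0;

  /* Basename of the extension-adjusted local file.  */
  const char *base = strrchr (local_name, '/');
  base = base ? base + 1 : local_name;

  char *result = malloc (dir_len + strlen (base) + 1);
  if (!result)
    return NULL;
  memcpy (result, orig_link, dir_len);
  strcpy (result + dir_len, base);
  return result;
}
```

With the example above, grafting "foo.cgi%3Fparam.html" (the basename of
the saved file) onto "//host.example.net/cgi-bin/" yields the desired
"//host.example.net/cgi-bin/foo.cgi%3Fparam.html".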


The specification below is formatted as a patch against the current wget
git master, but contains no actual code, just instructions on how one
would write this alternative version of --convert-links.


I have also built a small two-server test for this functionality. Running:

wget -rpH -l 1 -P ./vhosts --adjust-extension --convert-links \
     www.contrib.andrew.cmu.edu/~somlo/WGET/

will result in three very small html documents with stylesheet links
that look like "../../../host/script.html". Once the spec below is
successfully implemented, running

wget -rpH -l 1 -P ./vhosts --adjust-extension --basename-only-convert-option \
     www.contrib.andrew.cmu.edu/~somlo/WGET/

should result in the stylesheet references being converted to the desired
"//host/script.html" format instead.


Thanks in advance, and please feel free to get in touch if this sounds 
interesting!

  -- Gabriel

---
 doc/wget.texi |  4 ++++
 src/convert.c | 73 +++++++++++++++++++++++++++++++++++++++++++++++++++++++++++
 src/convert.h |  2 ++
 src/init.c    |  1 +
 src/main.c    |  9 ++++++++
 src/options.h |  1 +
 6 files changed, 90 insertions(+)

diff --git a/doc/wget.texi b/doc/wget.texi
index 16cc5db..6f45e8d 100644
--- a/doc/wget.texi
+++ b/doc/wget.texi
@@ -1998,6 +1998,10 @@ Note that only at the end of the download can Wget know which links have
 been downloaded.  Because of that, the work done by @samp{-k} will be
 performed at the end of all the downloads.
 
+@c %** FIXME: blurb about alternative link conversion flag here
+@c %** (like --convert-links above, but leave "dirname link" alone,
+@c %**  and only convert "basename link", for online rather than local viewing)
+
 @cindex backing up converted files
 @item -K
 @itemx --backup-converted
diff --git a/src/convert.c b/src/convert.c
index 6d78945..06a0ff6 100644
--- a/src/convert.c
+++ b/src/convert.c
@@ -138,6 +138,9 @@ convert_links_in_hashtable (struct hash_table *downloaded_set,
                  not be identical to that on the server (think `-nd',
                  `--cut-dirs', etc.)  */
               cur_url->convert = CO_CONVERT_TO_RELATIVE;
+              /* FIXME: if alternative convert option selected, set
+               * cur_url->convert to that other constant instead!
+               * Also, update comment above to describe that */
               cur_url->local_name = xstrdup (local_name);
              DEBUGP (("will convert url %s to local %s\n", u->url, local_name));
             }
@@ -176,6 +179,8 @@ convert_links_in_hashtable (struct hash_table *downloaded_set,
    downloaded get converted to the relative URL which will point to
    that file.  And the other URLs get converted to the remote URL on
    the server.
+   FIXME: update above paragraph to include basename-only link adjustment
+   (where the beginning of the URL -- protocol and host -- is left untouched).
 
    All the downloaded HTMLs are kept in downloaded_html_files, and
    downloaded URLs in urls_downloaded.  All the information is
@@ -206,6 +211,7 @@ static const char *replace_attr_refresh_hack (const char *, int, FILE *,
                                               const char *, int);
 static char *local_quote_string (const char *, bool);
 static char *construct_relative (const char *, const char *);
+/* FIXME: new construct function decl. for alternative link-adjust option */
 
 /* Change the links in one file.  LINKS is a list of links in the
    document, along with their positions and the desired direction of
@@ -320,6 +326,18 @@ convert_links (const char *file, struct urlpos *links)
             ++to_file_count;
             break;
           }
+        /* FIXME: Clone CO_CONVERT_TO_RELATIVE case above for the new
+         * basename-only link adjust option (new constant added to convert.h).
+         * Instead of calling construct_relative(file, link->localname),
+         * call the new construct function whose prototype we added above.
+         * Arguments should be p (the pointer to the beginning of the
+         * link data in the file), and link (the structure containing
+         * all the details about the original link and its extension-adjusted
+         * version). Something like:
+         *   ...
+         *   char *newname = new_construct_function (p, link);
+         *   ...
+         */
         case CO_CONVERT_TO_COMPLETE:
           /* Convert the link to absolute URL. */
           {
@@ -422,6 +440,61 @@ construct_relative (const char *basefile, const char *linkfile)
   return link;
 }
 
+/* FIXME: function to construct and return a "transparent proxy" URL
+ * reflecting changes made by --adjust-extension to the file component
+ * (i.e., "basename") of the original URL, but leaving the "dirname"
+ * of the URL (protocol://hostname... portion) untouched.
+ *
+ * Think: populating a squid cache via a recursive wget scrape, where
+ * changing URLs to work locally with "file://..." is NOT desirable.
+
+   Example:
+
+   if
+                     p = "//foo.com/bar.cgi?xyz"
+   and
+      link->local_name = "docroot/foo.com/bar.cgi?xyz.css"
+   then
+
+      new_construct_func(p, link);
+   will return
+      "//foo.com/bar.cgi?xyz.css"
+
+   Essentially, we do s/$(basename orig_url)/$(basename link->local_name)/
+
+Here goes:
+
+static char *
+new_construct_function (const char *p, const struct urlpos *link)
+{
+
+  - make a copy of the original link at position p, using strndup().
+    The length of the current link at p is given by link->size.
+    BEWARE: if p starts with a single or double quote, then the *real*
+            link starts at position p+1, and length is (link->size - 2)
+            i.e., make sure to strip "..." or '...' quotes before cloning
+            the link text.
+
+  - compute the "basename" of the original link (i.e., strrchr(..., '/'),
+    then one position past that). If strrchr returns null, then the whole
+    link is the basename, i.e. we have no protocol, host, and path in front
+    of the file name that is the link.
+
+  - compute the basename of the adjusted link (in link->local_name).
+    Same procedure as computing basename for original link text.
+
+  - if the basenames of the original and adjusted links are the same, then
+    this link is not affected by --adjust-extension, so we return the
+    strndup-ed clone of the original link (caller expects a newly allocated
+    string it can free).
+
+  - if the basenames differ, build a new string where the adjusted basename
+    is grafted onto the "dirname" of the original link. Make sure to free
+    all but the newly allocated "grafted" string, which gets returned to the
+    caller
+}
+*/
+
 /* Used by write_backup_file to remember which files have been
    written. */
 static struct hash_table *converted_files;
diff --git a/src/convert.h b/src/convert.h
index 3105db1..05a89c1 100644
--- a/src/convert.h
+++ b/src/convert.h
@@ -40,6 +40,8 @@ enum convert_options {
   CO_NOCONVERT = 0,             /* don't convert this URL */
   CO_CONVERT_TO_RELATIVE,       /* convert to relative, e.g. to
                                    "../../otherdir/foo.gif" */
+  /* FIXME: insert constant for converting basename only, leaving "dirname"
+   * (i.e., the beginning of the converted URLs) unchanged */
   CO_CONVERT_TO_COMPLETE,       /* convert to absolute, e.g. to
                                   "http://orighost/somedir/bar.jpg". */
   CO_NULLIFY_BASE               /* change to empty string. */
diff --git a/src/init.c b/src/init.c
index a436ef2..1453452 100644
--- a/src/init.c
+++ b/src/init.c
@@ -160,6 +160,7 @@ static const struct {
   { "contentonerror",   &opt.content_on_error,  cmd_boolean },
   { "continue",         &opt.always_rest,       cmd_boolean },
   { "convertlinks",     &opt.convert_links,     cmd_boolean },
+  /* FIXME: add entry for alternative (basename only) link conversion option */
   { "cookies",          &opt.cookies,           cmd_boolean },
 #ifdef HAVE_SSL
   { "crlfile",          &opt.crl_file,          cmd_file_once },
diff --git a/src/main.c b/src/main.c
index a0044d9..bf04b9f 100644
--- a/src/main.c
+++ b/src/main.c
@@ -193,6 +193,7 @@ static struct cmdline_option option_data[] =
     { "connect-timeout", 0, OPT_VALUE, "connecttimeout", -1 },
     { "continue", 'c', OPT_BOOLEAN, "continue", -1 },
     { "convert-links", 'k', OPT_BOOLEAN, "convertlinks", -1 },
+    /* FIXME: option for alternative, basename-only link conversion */
     { "content-disposition", 0, OPT_BOOLEAN, "contentdisposition", -1 },
     { "content-on-error", 0, OPT_BOOLEAN, "contentonerror", -1 },
     { "cookies", 0, OPT_BOOLEAN, "cookies", -1 },
@@ -748,6 +749,7 @@ Recursive download:\n"),
     N_("\
  -k,  --convert-links             make links in downloaded HTML or CSS point to\n\
                                      local files\n"),
+    /* FIXME: help blurb for alternative, basename-only link convert option */
     N_("\
       --backups=N                 before writing file X, rotate up to N backup files\n"),
 
@@ -1257,6 +1259,7 @@ main (int argc, char **argv)
      interoption dependency checks. */
 
   if (opt.noclobber && opt.convert_links)
+                     /* FIXME: convert_links should be OR-ed with new option */
     {
       fprintf (stderr,
                _("Both --no-clobber and --convert-links were specified,"
@@ -1315,6 +1318,9 @@ Can't timestamp and not clobber old files at the same time.\n"));
   if (opt.output_document)
     {
       if (opt.convert_links
+          /* FIXME: convert_links should be OR-ed with new option
+           * also, the error message should be updated to mention
+           * the new alternative link convert option */
           && (nurl > 1 || opt.page_requisites || opt.recursive))
         {
           fputs (_("\
@@ -1620,6 +1626,8 @@ for details.\n\n"));
             output_stream_regular = true;
         }
       if (!output_stream_regular && opt.convert_links)
+                       /* FIXME: convert_links should be OR-ed with new option
+                        * also, maybe update error msg. to mention new opt */
         {
           fprintf (stderr, _("-k can be used together with -O only if \
 outputting to a regular file.\n"));
@@ -1770,6 +1778,7 @@ outputting to a regular file.\n"));
     save_cookies ();
 
   if (opt.convert_links && !opt.delete_after)
+     /* FIXME: convert_links should be OR-ed with new option */
     convert_all_links ();
 
   cleanup ();
diff --git a/src/options.h b/src/options.h
index bef1f10..c922d32 100644
--- a/src/options.h
+++ b/src/options.h
@@ -177,6 +177,7 @@ struct options
                                    NULL. */
   bool convert_links;           /* Will the links be converted
                                    locally? */
+  /* FIXME: new bool for alternative (basename-only) link convert option */
   bool remove_listing;          /* Do we remove .listing files
                                    generated by FTP? */
   bool htmlify;                 /* Do we HTML-ify the OS-dependent
-- 
2.1.0
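P.S.: for concreteness, the pseudocode in the convert.c comment block of
the patch could be fleshed out roughly as follows. This is a sketch only:
urlpos_sketch is a stripped-down stand-in for wget's struct urlpos (real
code would use the actual structure and its size and local_name fields),
and new_construct_function is just the placeholder name from the spec.

```c
#define _POSIX_C_SOURCE 200809L  /* for strndup */
#include <stdlib.h>
#include <string.h>

/* Minimal stand-in for the fields of wget's struct urlpos that this
   sketch needs.  */
struct urlpos_sketch
{
  size_t size;       /* length of the link text at position p */
  char *local_name;  /* extension-adjusted on-disk file name */
};

/* Sketch of the proposed construct function: strip surrounding quotes,
   then swap the original basename for that of local_name.  Returns a
   newly allocated string the caller must free.  */
static char *
new_construct_function (const char *p, const struct urlpos_sketch *link)
{
  size_t len = link->size;

  /* Strip surrounding single or double quotes, if present.  */
  if (len >= 2 && (*p == '"' || *p == '\''))
    {
      p += 1;
      len -= 2;
    }
  char *orig = strndup (p, len);
  if (!orig)
    return NULL;

  /* Basename of the original link text; if there is no '/', the whole
     link is the basename.  */
  const char *obase = strrchr (orig, '/');
  obase = obase ? obase + 1 : orig;

  /* Basename of the adjusted local file.  */
  const char *lbase = strrchr (link->local_name, '/');
  lbase = lbase ? lbase + 1 : link->local_name;

  /* Link not affected by --adjust-extension: return the clone as-is.  */
  if (strcmp (obase, lbase) == 0)
    return orig;

  /* Graft the adjusted basename onto the original dirname.  */
  size_t dir_len = (size_t) (obase - orig);
  char *grafted = malloc (dir_len + strlen (lbase) + 1);
  if (grafted)
    {
      memcpy (grafted, orig, dir_len);
      strcpy (grafted + dir_len, lbase);
    }
  free (orig);
  return grafted;
}
```

With p = "\"//foo.com/bar.cgi?xyz\"" and local_name =
"docroot/foo.com/bar.cgi?xyz.css", this returns
"//foo.com/bar.cgi?xyz.css", matching the example in the spec.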



