From: Andrew Cady
Subject: [Bug-wget] Patch: new option --rename-output: modify output filename with perl
Date: Fri, 15 Jan 2010 23:56:28 -0500
User-agent: Mutt/1.5.18 (2008-05-17)

This patch adds an option that allows the user to specify a perl
expression used to modify the target filenames of a call to wget.  It
works similarly to perl's "rename" script, in terms of how perl is used
to modify the filename string.  That is, the original filename is stored
in the perl variable $_, which the user-supplied code can modify; the
value left in $_ is used instead of the original.

Perl treats $_ as the default variable for regular expressions (among
other operations), so a user who knows only perl-compatible regexes can
supply a bare substitution such as s/old/new/ and it will work fine,
without any other perl code.

I implemented this feature back in August or so, in order to mirror
thepiratebay.org with wget.  By default, wget would have put 1M files
into a single directory in order to mirror that site, which (with ext3)
would have destroyed filesystem performance, to say the least.

Since there are many other sites whose visible directory structure is
inappropriate for direct representation in an actual filesystem, I
imagine this patch could be generally useful.

Example usage:


  $ wget -x --rename 's?/?%2f?g' http://www.gnu.org/software/wget/manual/html_node/index.html

  --2010-01-15 23:01:23--  http://www.gnu.org/software/wget/manual/html_node/index.html
  Resolving www.gnu.org... 199.232.41.10
  Connecting to www.gnu.org|199.232.41.10|:80... connected.
  HTTP request sent, awaiting response... 200 OK
  Length: 8545 (8.3K) [text/html]
  Saving to: "www.gnu.org%2fsoftware%2fwget%2fmanual%2fhtml_node%2findex.html"

  100%[===========================================>] 8,545       --.-K/s   in 0s

  2010-01-15 23:01:23 (134 MB/s) - "www.gnu.org%2fsoftware%2fwget%2fmanual%2fhtml_node%2findex.html" saved [8545/8545]


Recursive retrieval with link conversion also works exactly as one would
want:


  $ wget -q --rename 's?/?%2f?g' -r --no-parent -k http://www.gnu.org/software/wget/manual/html_node/index.html


I.e., you get the site saved without any of the directory structure, and
all the internal links still work.

It is also possible to create directory structure by adding slashes.
(That is how I dealt with thepiratebay.org).

Regexes are probably the most useful thing to use with this option,
but since arbitrary perl is allowed, quite a lot more could be done.
(An example is generalizing the regex above, to translate some larger
set of characters to %hex codes.)  I originally wanted to use PCRE for
this, but (amazingly) it doesn't directly provide any facility for
substitution -- only matching.  I couldn't find such a facility in C
library form anywhere on the internet.  Rather than (re)implement it, I
just called perl.  I thought it was terribly hackish at the time, but
now I like it.  It actually adds much less to the binary (when you don't
use it) than the PCRE approach would have.

Attachment: rename-output.patch
Description: Text Data

