From: Micah Cowan
Subject: [Bug-wget] Thoughts on regex support
Date: Tue, 22 Sep 2009 22:08:17 -0700
User-agent: Thunderbird 2.0.0.23 (X11/20090817)

Alright, so I've been thinking lately about how to implement support for
regular expressions in Wget, and thought I'd share my thoughts.

First, here are what I've come up with for a list of requirements in
adding regex support:

  - It should use extended regular expressions, if not perl-compatible
    regexes. This means it should support intervals - {x,y} - and
    grouping/alternations - (foo|bar). Having to backslash-escape
    parentheses to indicate grouping is a PITA. PCREs are less
    important; it's acceptable to have to write [[:space:]] instead of
    \s, and to forego handy things such as look-behind/look-ahead...

  - It should be possible to match against just certain components of an
    URL, and not have to specify a regex for the entire URL every time.
    In particular, people don't want to write a regex like
    '.*/index.php$', and then become frustrated when it fails to match
    "www.example.com/dir/index.php?id=1", because they forgot to make
    their regex take the query-string portion into account.

  - We should avoid adding many more options than are necessary. Just
    because we want to allow users to select which portions of the URL
    they're matching, doesn't mean we should have separate options for
    each (--match-url, --match-body, --match-path,
    --match-path-plus-query-string, etc)

  - It should be easy to match against individual fields from HTML
    forms, within query strings. If a user wants to avoid fetching pages
    with "action=print", it's cumbersome to concoct just the right
    query-string regex: ^(.*\&)?action=print(\&.*)?$ (writing it as just
    .*action=print.* means it would match "fraction=printed", too)

  - We should avoid unnecessary external dependencies if possible.
    Gnulib provides a "regex" module (which will prefer the system
    libraries' regex facilities if they pass muster). However, since
    ERE syntax is essentially a subset of PCRE syntax, I don't see what
    harm there would be in offering PCREs as an alternative for those
    who want it.
    Probably not in the first iteration of this feature, though.

  - We should provide short options. They're quickly becoming scarce,
    but this is a valuable enough feature to warrant it.

Alright, so given these, here are my thoughts on the interface.

  { --match | --no-match }  [ : components [ / flags ] : ] regex

With short options -z and -Z for --match and --no-match, respectively.
The regex will be matched against the full URL (or the selected
components from the URL) - that is, it is implicitly anchored at the
beginning and end; think of it as having an invisible ^ at the
beginning, and a $ at the end. (This is also the behavior of the `-regex'
option to the GNU find command.)
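
To make the argument syntax concrete, here's a rough sketch of how the
optional components/flags prefix might be split off the regex (purely
illustrative; the function and its edge-case behavior are not worked
out):

  #include <string.h>

  /* Illustrative only: split "[ :components[/flags]: ] regex" into
     its parts.  Modifies ARG in place; COMPONENTS and FLAGS come
     back NULL when the optional prefix is absent.  */
  static void
  parse_match_arg (char *arg, char **components, char **flags,
                   char **regex)
  {
    *components = *flags = NULL;
    *regex = arg;

    if (arg[0] == ':')
      {
        char *end = strchr (arg + 1, ':');
        if (end != NULL)
          {
            char *slash;

            *end = '\0';
            *components = arg + 1;
            *regex = end + 1;

            slash = strchr (*components, '/');
            if (slash != NULL)
              {
                *slash = '\0';
                *flags = slash + 1;
              }
          }
      }
  }

So ':p+q/i:.*\.php' comes apart as components "p+q", flags "i", and
regex '.*\.php'. One wrinkle: a regex that itself begins with ':' would
need some escape, which is a point in favor of the mandatory-components
variant discussed further below.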

If the components aren't specified, it would default to matching just
the pathname portion of the URL. This allows for simple examples like:

  --match '.*/common\.php'
  . Match any file named "common.php" (similar to -A common.php, except
    the -A version does suffix-matching, so it would match
    "uncommon.php" as well.)

  --match '.*\.php'
  . Identical to -A .php

  --match '.*/SPECIAL/foo\.html'
  . Match the file `foo.html' directly within any directory named
    "SPECIAL". There's no real equivalent to this in Wget, as the
    wildcard-matching of -I doesn't provide a way to match against
    arbitrary sequences of directories ("*" can't match slash).

You could match against query strings by specifying that component:

  --no-match ':query:nofollow'
  . Don't fetch foo.php?nofollow

  --no-match ':field:action=(edit|print)'
  . Don't follow links for producing printer-ready output, or editing
    pages. Equivalent to
    --no-match ':query:(.*&)?action=(edit|print)(&.*)?', but somewhat
    easier to write.

Components can also be specified using the shortest unique identifier,
so that last example could be written simply: -Z ':f:action=(edit|print)'
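
Under the hood, the :field: form could simply be rewritten into the
equivalent :query: pattern before compilation. A minimal sketch, with a
hypothetical helper name:

  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Hypothetical helper: wrap a per-field regex in the glue that
     lets it match one &-separated field of a query string, per the
     equivalence above.  Caller frees the result.  */
  static char *
  field_to_query_regex (const char *field_re)
  {
    /* "(.*&)?" + field_re + "(&.*)?" + NUL */
    char *query_re = malloc (strlen (field_re) + 13);

    if (query_re != NULL)
      sprintf (query_re, "(.*&)?%s(&.*)?", field_re);
    return query_re;
  }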

Components may be combined; to match against the combination of path and
query string, you just specify :path+query:. That could be abbreviated
as :p+q:. Combinations are only allowed if all the components involved
are consecutive; :domain+query: (no path) would be illegal.
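
For what it's worth, the "consecutive" rule is cheap to enforce if the
selected components are kept as bits in URL order; a combination is
legal exactly when the set bits form one contiguous run. A sketch (the
particular component list is just my guess):

  /* The component set and its URL-order numbering are my guess.  */
  enum url_part
    {
      PART_SCHEME = 1 << 0,
      PART_DOMAIN = 1 << 1,
      PART_PORT   = 1 << 2,
      PART_PATH   = 1 << 3,
      PART_QUERY  = 1 << 4
    };

  /* True iff MASK's set bits are consecutive: path+query passes,
     domain+query (skipping path) fails.  */
  static int
  parts_consecutive (unsigned int mask)
  {
    if (mask == 0)
      return 0;
    while (!(mask & 1))
      mask >>= 1;                     /* drop trailing zero bits */
    return (mask & (mask + 1)) == 0;  /* rest all-ones from bit 0? */
  }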

Case-sensitivity could be disabled using the "/i" flag.

  -z ':/i:.*\.htm'
  . Match any suffix like ".htm", ".HTM", ".HtM", etc.

Possible variations and other thoughts:

  - If we make the components-specification mandatory, we could eschew
    the initial colon.

  - Avoid adding both a --match and a --no-match option, by making
    negation a flag instead (/n or something: --match 'p/ni:.*\.js'
    would reject any paths ending in any case variant of ".js").

  - Other anchoring options. I suspect that many common use cases
    will begin with '.*'. We could remove the implicit anchoring, but
    then we'd probably usually want it at the end, forcing us to write
    the final '$'. That's one character versus two, but my gut tells me
    it's easier to forget anchors than it is to forget "match-any"
    patterns, which is why I lean toward implicit anchors (see the
    sketch just below this list). Meanwhile, anchoring just the right
    side and leaving the left open violates the principle of least
    surprise more than I like. Open to discussion, though.
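
To show what the implicit-anchoring approach would look like in
practice, here's a minimal sketch against the POSIX/gnulib regex API
(regcomp/regexec are real; everything else here is illustrative):

  #include <regex.h>
  #include <stdio.h>
  #include <stdlib.h>
  #include <string.h>

  /* Compile PATTERN as an ERE that must match the whole string.
     The grouping parens keep a top-level alternation like "foo|bar"
     from escaping the implicit anchors.  */
  static int
  compile_anchored (regex_t *re, const char *pattern, int icase)
  {
    char *anchored = malloc (strlen (pattern) + 5); /* "^(" ")$" NUL */
    int err;

    if (anchored == NULL)
      return REG_ESPACE;
    sprintf (anchored, "^(%s)$", pattern);
    err = regcomp (re, anchored,
                   REG_EXTENDED | REG_NOSUB | (icase ? REG_ICASE : 0));
    free (anchored);
    return err;
  }

With REG_NOSUB, a plain regexec (re, path, 0, NULL, 0) == 0 then
answers whether the whole path matches.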

== The "--traverse" option ==

This may not quite fit the rest of the theme, but while we're visiting
the subject of enhancing our accept/reject rules, why not take the
opportunity to improve upon a bit of current Wget behavior that can
really get in the way of making Wget do what you want it to do.

Currently, regardless of what you decree to be acceptable or
unacceptable, Wget will always download files whose paths end in ".htm"
or ".html", and there is no means to turn this off. Want to disable all
pages with the parameter string "action=edit"? Well, you're out of luck
when you hit http://wiki.example.com/About.html?action=edit - we
download it anyway.

Worse, that file would then be deleted after we'd parsed links from it,
even if we specified `-A .html'. Pattern matching for determining what
to download is only ever done against the filename portion of the URL,
but Wget matches against the final name of the locally-downloaded file
(query string and all) to determine what it should delete for not
having been "acceptable".

Meanwhile, there's no way to tell Wget about _other_ sorts of files that
should be downloaded (to find links), but not kept in the final results.
.php, .asp, directory names, etc., are all passed over. You could
specify them in the accept rules, but you'd have to delete them by hand
afterward. Unless of course, some of them ended with query strings, in
which case Wget would conveniently forget that they'd ever matched in
the first place.

Even more: Wget will download these extra .html files, even when it
knows it won't be doing anything with the links it parses from them! If
you set -r --level=1 --accept=.zip, and point Wget at an Apache-style
indexing page, Wget will happily download all the index.html?<blah-blah>
variants for different sorting options, even though it can't traverse
the resulting links without exceeding the recursion depth.

Proposed solution? Make this behavior (a) configurable, and (b)
consistent. A --traverse option would specify a regex for files which
should be parsed for links, but not kept in the results. It would follow
the same syntax as --match, so it could be used with query strings. And
we would only decide whether or not to keep a file around based on its
original URL, and not on the local file name after download.

Additionally, the --traverse settings would be ignored when we're one
level away from the maximum recursion depth. Why download something just
to throw it out without doing anything more?
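
In sketch form (the two predicate names are made up for illustration),
the decision might look like:

  /* Hypothetical stand-ins for the match/traverse rule checks.  */
  extern int matches_match_rules (const char *url);
  extern int matches_traverse (const char *url);

  /* Traverse-only files are useless at the last level: their links
     would exceed the recursion depth anyway, so don't fetch them.  */
  static int
  should_download (const char *url, int depth, int max_depth)
  {
    if (matches_match_rules (url))
      return 1;                   /* kept in the results regardless */
    if (matches_traverse (url))
      return depth < max_depth;   /* links must still be followable */
    return 0;
  }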

Caveat: I'm against giving --traverse an implicit default value of
'.*\.html?' in the interests of preserving previous default behavior
(we'd then also need to provide a way to "clear" --traverse of its
existing settings). The downside to this decision is that it could
potentially BREAK EXISTING SCRIPTS. I'm not too worried about that:
users can slap a rule in their .wgetrc files; it's more important to
ensure that there's a means to reproduce the historical behavior than
it is to ensure Wget's default behavior is backward-compatible.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/