Re: [Bug-wget] Thoughts on regex support


From: Micah Cowan
Subject: Re: [Bug-wget] Thoughts on regex support
Date: Thu, 24 Sep 2009 20:21:27 -0700
User-agent: Thunderbird 2.0.0.23 (X11/20090817)

Tony Lewis wrote:
> Micah Cowan wrote:
> 
>> - It should use extended regular expressions
> Agreed
> 
>> PCREs are less important
> I have a very strong preference for \s over [[:space:]]

Meh; looks like gnulib's EREs do that anyway.

>> - It should be possible to match against just certain components of an
>>   URL
> Agreed. In your exchange with Matthew some possible labels were discussed. I
> compared the identifiers you suggested with the definition of Location in
> JavaScript and noted that there is very little overlap. (I'm not sure that
> JavaScript should be the deciding factor, but these are well-known names for
> the components.)

The names are based primarily on RFC 3986 (which governs URIs; also
various other RFCs use the same lingo). "field" comes from HTML, since
that's where the format originates.

> url (href)
> scheme (protocol)
> domain (is it host, which includes port, or hostname, which does not?)
> path (pathname)
> query (search [includes ?])
> field (no equivalent)
> 
> Also, Location includes port and hash. How do you plan to deal with these
> aspects of a URL?

"hash" doesn't apply to URIs that wget would handle (it's called the
"fragment" portion in relevant RFCs), as that's not normally part of
what gets sent to the server.

I'd forgotten the port number... probably we should include that with
domain, and consider calling it "host" instead. Actually, since Wget
commonly uses the term "domain", we should at least provide that as an
alternative name.

> There should be a simple way of matching www.site.com and site.com. It might
> be explicitly specified as ':domain:^.*\bsite.com$', but I suspect most
> people will really want ':domain:site.com' to match both, but not to match
> othersite.com.

Well, "^.*" seems redundant if anchors aren't implied. It'd be
".*\bsite\.com" if anchors are implied, and "\bsite\.com$" otherwise.

Note, of course, that -D is still an option, and -D site.com would be
equivalent, so it's probably still the best choice.
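
To put the two side by side (the --match form is the hypothetical
syntax from this proposal, written without implied anchors; -D works
today):

    wget -r -D site.com http://www.site.com/
    wget -r --match ':domain:\bsite\.com$' http://www.site.com/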

>> - It should be easy to match against individual fields from HTML
>>   forms, within query strings.
> I agree that it is convenient to separate the query string into "fields" by
> splitting on '&'. However, I think it should also be easy to match on the
> name and value portions of name=value. For example, exclude any URL where
> 'action' is specified. Perhaps that will be ':field:^action='.

Yes. Or if we go with implicit anchors (which I currently favor
specifically because it makes the most sense for :field:), it'd be
':field:action=.*'

>> - We should avoid unnecessary external dependencies if possible.
> Agreed, but we should not lose functionality for most users because some
> implementation has a broken or missing regex library.

If we're going with gnulib (likely), gnulib runs a set of tests that the
system regex library must pass before it will be used in place of the
bundled implementation. It's not clear to me whether those tests cover
sugar like \s and \b.

>> - We should provide short options.
> Perhaps, but I would put this in the "nice to have" category.

Sure.

> 
>> { --match | --no-match }  [ : components [ / flags ] : ] regex
> Sounds OK, but I think you mean: [ : [ components ] [ / flags ] : ]

Or something similar. '::' would be silly :)

> That is, I think you meant to allow ':/i:foo' since you use that syntax
> later in your message.

Right.
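
To make the grammar concrete, a few examples of how it would parse
(hypothetical, of course, since none of this exists yet):

    --match ':path/i:\.jpe?g$'    component and flags both given
    --match ':/i:foo'             flags only; component defaulted
    --match 'foo'                 bare regex, no ':...:' prefix at all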

>> With short options -z and -Z for --match and --no-match, respectively.
> Those are not intuitive choices to me, but OK.

No, they're not. But we only have a handful of sane single-character
options remaining, and only three pairings of uppercase/lowercase. The
other two are -J/j and -G/g.

>> it is implicitly anchored at the beginning and end
> I think this is a bad idea. If someone wants ^ and $, they should specify
> them.

So, that makes two against so far. Though I think I may have persuaded
Matt Woehlke down to a slight, vague preference.

> I realize that my argument for domain matching is not entirely consistent
> with explicit anchors. Going back to domain matching, I think
> ':domain:site.com' should be interpreted as ':domain:^.*\bsite\.com$', but I
> also think domain matching is a special case.

Definitely not doing that. Special-casing the syntax for different
components strikes me as a bad idea.

And I don't think domain matching is nearly the special case that
field matching is, and I'm not special-casing that, either. Syntax-wise,
that is: field matching is already a special case in the sense that a
single regex gets matched against multiple candidates, rather than once
against a single slice of the URL.
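
Concretely, with a made-up query string:

    URL query:  ?action=save&id=42
    fields:     "action=save", "id=42"

A ':field:' regex gets tried against each of those fields in turn, and
the URL counts as matched if any one of them matches.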

> In the more general case of anchoring, I think ':path:foo' should match
> '/path/to/foo.html' and '/foo/baz/index.html'.

Yeah, except when is that useful?

But that is the essence of the anchoring versus no-anchoring debate.

>> If the components aren't specified, it would default to matching just
>> the pathname portion of the URL.
> I'm not sure this is the obvious behavior, but I would get used to it.

It's open for discussion. What do you think the most obvious behavior
would be? Full-url? I'm currently trying to aim for
most-frequently-used, over most-obvious, so if you think that'd be a
different component (or slice of components), lemme know.

> It is not clear to me how one would combine matches. Let's say that I want
> all ZIP files from directory a (but not from directory A) and all JPG files
> from directory b (but not from directory B). How do I indicate that I want
> to match:
> 
> (':path:\ba\b' AND ':path/i:\.zip$') OR (':path:\bb\b' AND ':path/i:\.jpg$')

I'd probably go for --match ':path:.*/a/.*\.[Zz][Ii][Pp]' and --match
':path:.*/b/.*\.[Jj][Pp][Ee]?[Gg]'. PCREs would make that somewhat nicer.
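
So, assuming multiple --match options combine the way multiple -A
suffixes do, the whole thing would look something like (hypothetical
syntax, as always):

    wget -r --match ':path:.*/a/.*\.[Zz][Ii][Pp]' \
            --match ':path:.*/b/.*\.[Jj][Pp][Ee]?[Gg]' \
            http://example.com/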

As already discussed, --match and --no-match would be analogs to -A and
-R; they'd just use regexes rather than wildcards (and have wider
options for what portions you're matching against). This means that
arbitrary decision tables aren't possible. If we really want that,
we should either provide a complete boolean expression syntax (or action
tables, but those are ugly), or we should eschew regexes altogether and
just go straight for farming out to external commands (which, again, I
do plan on adding, probably for 1.13).

>> == The "--traverse" option ==
> In general, I agree with the thinking in this entire section.
> 
>> Additionally, the --traverse settings would be ignored when we're one
>> level away from the maximum recursion depth. Why download something just
>> to throw it out without doing anything more?
> What if you're recording unfollowed links to the SIDB? Don't you still want
> those links to appear?

Dunno. What do you think?

>> Caveat: I'm against giving --traverse an implicit default value of
>> '.*\.html?'
> What's wrong with treating --traverse as meaning --traverse
> ':path/i:^.*\.html?$' and then having --traverse ':path/i:^.*\.php$'
> override that behavior and only download PHP pages. In other words, if you
> don't specify a matching pattern to traverse, it behaves the way it does
> now, but if you do specify one, you have to include '.html' if you want HTML
> suffixes as well.

Mainly, because I don't want to continue what I consider to be broken
default behavior, if I can get away with it. :)

> Given that the most common use case is to match against suffixes in the
> path, perhaps ':path/i:^.*\.' and '$' should be implied so that --traverse
> '(html?|php)' is interpreted as ':path/i:^.*\.(html?|php)$'.

Again, I really want consistency with the regex rules.

Perhaps we should simply say, "most convenience be damned", and go for
explicit anchors everywhere, even if that leads to a little more typing
in most places. It certainly follows the principle of least surprise...
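
To see what the two conventions look like side by side (hypothetical
syntax, using the examples from earlier in this thread):

    with implied ^...$:     ':field:action=.*'    ':path:.*\.zip'
    with explicit anchors:  ':field:^action='     ':path:\.zip$'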

> By the way, it would probably be helpful to have a variation of traverse
> that looks for Content-Type headers that contain "text/html" regardless of
> path extension.

Maybe, but I doubt it. The problem is that this would require us to do a
HEAD request on _every_ link we come across, just to check the header.
And HEAD isn't even all that reliable: many servers fail to support it,
and I think some even report inaccurate Content-Type information.
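
(For the record, that means an extra round trip like this for every
single link we encounter, before we've even decided whether to fetch
it; path and host made up for illustration:

    HEAD /some/page HTTP/1.1
    Host: example.com

    HTTP/1.1 200 OK
    Content-Type: text/html

...and then trusting whatever Content-Type the server reports.)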

The bug tracker already has a similar issue open for an "--accept-type"
or something, which would do the same thing, but as an accept/reject
rule. I think it may have been my idea, or else it was already in the
TODO list when I started. But it suffers from the same problems, and I'm
not sure the benefits outweigh the shortcomings enough to make it a
useful option.

--
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/



