bug-wget

Re: [Bug-wget] Re: Thoughts on regex support


From: Micah Cowan
Subject: Re: [Bug-wget] Re: Thoughts on regex support
Date: Wed, 23 Sep 2009 13:33:06 -0700
User-agent: Thunderbird 2.0.0.23 (X11/20090817)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Matthew Woehlke wrote:
> Micah Cowan wrote:
>> [stuff about regex matching]
> 
> How will you handle nested boolean expressions? Same as 'find'?
> 
> IOW, how do you do this?
> [url matches foo] AND ( [domain matches bar] OR [query matches baz] )
> 
> (Obviously I am intentionally choosing an example where the 'or' part
> can't be easily expressed in the regex.)

Actually, my plan is... not to. The current method for checking
accept/reject rules - if it's in the acceptance list, and not in the
reject list, it's in - has never garnered any complaints to my
knowledge. And technically, such logic is representable in the regex
itself; it would just be butt-ugly, and a pain to craft (but only if you
wanted the literal case you mention: I can't think of a circumstance
where it'd actually be useful).
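
For illustration only, the accept/reject rule described above can be
sketched in Python. The function name is_accepted and the treatment of an
empty accept list as "accept everything" are assumptions made for this
sketch; it is not Wget code, and Python's re module stands in for whatever
regex engine Wget would end up using.

```python
import re


def is_accepted(url, accept_patterns, reject_patterns):
    """Hypothetical helper: in the accept list and not in the reject list."""
    # Assumption: an empty accept list accepts every URL.
    accepted = not accept_patterns or any(
        re.fullmatch(p, url) for p in accept_patterns
    )
    # A single hit in the reject list is enough to drop the URL.
    rejected = any(re.fullmatch(p, url) for p in reject_patterns)
    return accepted and not rejected
```

With an accept list of [r".*\.html"] and a reject list of
[r".*/private/.*"], a page under /private/ is dropped even though it ends
in .html.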

I had thought that, if we really want a robust system, we could do it as
a series of result tables, à la PAM or iptables. But that tends to be
vastly more confusing to users. My expectation is that the real use
cases for such a thing are going to be incredibly rare.

At some point (1.13, I'm hoping?) Wget will provide more options for
delegating tasks to outside programs, which is a good deal more Unixy.
In that event, we could pawn off acceptability decisions to an awk
script or what have you.

My primary concern for the immediate future, really, is to start
supporting query string-matching, and fix some of the things about
accept/reject that I consider fundamentally broken. Regexes were
something that had always been attractive, and seem to be a convenient
way to address the query string problem at the same time.

OTOH, I realize that we want to make it as robust as possible. If
someone can come up with a simple and easy-to-use system, I'm interested.

>>   --no-match ':field:action=(edit|print)'
> 
> Something like 'param[eter]' or 'arg[ument]' seems more sensible to me
> (though as a programmer I am not the best to ask about usability
> things). Such URL's coming from a form isn't always obvious... and in
> some cases is even untrue.

"Parameter", at least, suffers from the shortcoming that it forces both
itself and path to specify a minimum of three characters to get a unique
label.

I'd go so far as to say it's frequently the case that such query-string
formats don't come from an HTML form; however, as far as I know, HTML
forms are the only thing that actually specifies that format, and I assume
they're directly responsible for the popularity of this representation.
CGI itself doesn't lay any constraints or even expectations on what the
query string should look like (though most libraries implementing CGI
provide facilities for the HTML forms format).

However, it could still be termed a "field", I believe, regardless of
whether it comes from an HTML form or not. I personally prefer it over
either "parameter" or "argument", but I'm willing to hear more opinions.

(I might prefer "parameter", actually, if it weren't for the conflict
with "path". But I doubt we'll identify a more appropriate name for "path".)

>>   . Don't follow links for producing printer-ready output, or editing
>>     pages. Equivalent to --no-match ':query:(.*&)?action=print(&.*)?',
>>     but somewhat easier to write.
> 
> Just in case you're planning on a conversion to that regex in the code,
> remember that it is really:
>   '^.*[?]([^&]*&)*action=print(&.*)?$'

No, it's not. The anchors are already implicit, remember? And the
":query:" label means it only tries that regex against the query-string
portion of the URL, so the .*[?] would break. If I'd used :url: instead
of :query:, then that modification would be necessary (though still
without the anchors).
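
A rough Python sketch of the difference, under the assumptions that
re.fullmatch models the implicit anchoring and that urlsplit().query
models the ":query:" component. The helper names and the example.com URL
are made up for illustration; none of this is Wget code.

```python
import re
from urllib.parse import urlsplit

# Pattern as written for the :query: component (no leading ".*[?]").
QUERY_PATTERN = r"(.*&)?action=print(&.*)?"
# Variant needed when matching the whole URL instead (still no anchors).
URL_PATTERN = r".*[?](.*&)?action=print(&.*)?"


def matches_query(url, pattern):
    """Hypothetical: match only the query-string portion, implicitly anchored."""
    return re.fullmatch(pattern, urlsplit(url).query) is not None


def matches_url(url, pattern):
    """Hypothetical: match the entire URL, implicitly anchored."""
    return re.fullmatch(pattern, url) is not None


url = "http://example.com/wiki?title=Foo&action=print"
print(matches_query(url, QUERY_PATTERN))  # query pattern against the query
print(matches_query(url, URL_PATTERN))    # ".*[?]" has no "?" left to match
print(matches_url(url, URL_PATTERN))      # URL pattern against the full URL
```

The second call fails because the query string handed to the regex no
longer contains the question mark.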

> For that matter, if you support '\b', I wonder if you need "components"
> at all...

I don't see how that would help anywhere save for the "fields"
components; but then I gather from the above that you may have been a
bit confused about what the component selection does.

If I were going to support that (and I may), then I'd probably go with
\< and \> instead, as that's what seems to be commonly used for EREs,
outside of Perl.

>> Components may be combined; to match against the combination of path and
>> query string, you just specify :path+query:. That could be abbreviated
>> as :p+q:. Combinations are only allowed if all the components involved
>> are consecutive; :domain+query: (no path) would be illegal.
> 
> I can probably figure out technical reasons for that, but it doesn't
> make much sense from a user perspective. Why shouldn't I be able to write:
>   -z ':d,f:foo'
> ...and have it match both
>    'http://foobar.com/'
>  and
>    'http://baz.org/index?title=foobar'
> ?

No. It means that the entire regex is matched, once, against the combined
components, not matched once for each of the components. There is no
sane way to combine only the domain and a field (field would not be
allowed to combine with anything, in fact).

> BTW, what exactly are the components? Is this right?
> 
> [u]rl: http://foobar.com/site/images/thumb.php?name=baz.jpg&x=64&y=64
> p[r]otocol: "http"
> [d]omain: "foobar.com"
> [p]ath: "site/images"
> [f]ile: "thumb.php"
> [q]uery: "name=baz.jpg&x=64&y=64"
> [a]rgs: "name=baz.jpg", "x=64", "y=64"

This is the diagram I did (but didn't include in the message I sent out).

  /---\ scheme             /------ path ---\ /---------- query ------\
  https://addictivecode.org/foo/bar/baz.html?fee=fi&fo=fum&bludtype=en
          \---- domain ---/                         \----/   field

The idea would be that "query" would match everything after the question
mark (so don't include the mark in your regex), though "path" would
include the initial / (it would always have one). The main advantage to
that for "path" would be that you can do ".*/index.html"; I can't think
of any similar advantage to including the ? in the query string.
However, if you specify :path+query:, then the question mark is included.

Similarly, :scheme: wouldn't include the "://", but :scheme+domain: would.
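
The component rules above can be approximated with Python's urllib.parse,
as a sketch only. The names ORDER, SEPARATORS, select and fields are
hypothetical, and using netloc for "domain" is an assumption (the proposal
does not say how a port or userinfo would be handled).

```python
from urllib.parse import urlsplit

# Components in URL order; only consecutive runs may be combined.
ORDER = ["scheme", "domain", "path", "query"]
# Text restored between two adjacent components when both are selected.
SEPARATORS = {
    ("scheme", "domain"): "://",
    ("domain", "path"): "",      # path already starts with "/"
    ("path", "query"): "?",
}


def select(url, spec):
    """Hypothetical: return the string a spec like 'path+query' matches against."""
    parts = urlsplit(url)
    values = {
        "scheme": parts.scheme,
        "domain": parts.netloc,  # assumption: netloc stands in for "domain"
        "path": parts.path,
        "query": parts.query,
    }
    names = spec.split("+")
    indexes = [ORDER.index(n) for n in names]
    if indexes != list(range(indexes[0], indexes[0] + len(indexes))):
        raise ValueError("components must be consecutive: " + spec)
    out = values[names[0]]
    for prev, cur in zip(names, names[1:]):
        out += SEPARATORS[(prev, cur)] + values[cur]
    return out


def fields(url):
    """Hypothetical: each field is matched on its own, never combined."""
    query = urlsplit(url).query
    return query.split("&") if query else []
```

For the URL in the diagram, select(url, "path") gives "/foo/bar/baz.html",
select(url, "path+query") puts the "?" back between the two parts, and
select(url, "domain+query") raises ValueError because the path is skipped.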

>>   - Avoid adding both a --match and a --no-match option, by making
>>     negation a flag instead (/n or something: --match 'p/ni:.*\.js'
>>     would reject any paths ending in any case variant of ".js").
> 
> Similar ideas:
>  -z '(?!expr)'

This one's of course automatic with PCRE if we provide that with an
option; we'd have to "emulate" it in builds not including PCRE.

>>   - Other anchoring options. I suspect that the many common use cases
>>     will begin with '.*'. We could remove the implicit anchoring, but
>>     then we'd probably usually want it at the end, forcing us to write
>>     the final '$'. That's one character versus two, but my gut tells me
>>     it's easier to forget anchors than it is to forget "match-any"
>>     patterns, which is why I lean toward implicit anchors.
> 
> MHO: implicit anchoring violates traditional regex usage. There is
> probably an example of implicit anchoring somewhere, but offhand I can't
> think of it. (And at any rate, sed/grep sure don't use implicit anchoring.)

Both sed and grep use regex as the basis of a _search_. We're not
_searching_ for a pattern in a string, we're matching. (Find's manual
uses this same reasoning.) Additionally, implicit anchoring is obviously
unhelpful to sed and grep, because by far the most common use cases want
to match anywhere.

On the other hand, this reasoning doesn't apply so cleanly to Perl and
(especially) Awk, where you can argue that the regex's primary function
is for matching, not searching (in Awk, you have to call a separate
function afterwards to get the search results; the regex just returns a
boolean value). Still, it seems to me we're most closely trying to do
what find does, and not what sed, grep, perl or awk do. We're not the
least bit interested in transforming the URL afterwards, or remembering
match start/end positions*.

 * Actually, it'd be great to do transformations on URLs, and especially
file names. We'll do that eventually, but not via built-in Wget
facilities; we'll outsource it to sed or other user-specified commands.

> Of course, if you support '\b' (and require explicit anchoring), then it
> is somewhat hard to justify args (as you can just use '\bexpr\b' against
> query, instead of '^expr$' against args).

Not really. \b would falsely match punctuation that can be a legitimate
part of a field name and/or value (and, in particular, %XX).
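
A small Python sketch of that false match, assuming Python's \b behaves
like the \b under discussion. The helper names and the sample query string
are invented for illustration.

```python
import re


def word_boundary_hit(query, expr):
    """Hypothetical: search the whole query for expr between \\b boundaries."""
    return re.search(r"\b" + expr + r"\b", query) is not None


def field_hit(query, expr):
    """Hypothetical: match expr, implicitly anchored, against each field."""
    return any(re.fullmatch(expr, f) for f in query.split("&"))


query = "title=Foo&action=print%20all"
# "%" is not a word character, so \b sees a boundary right after "print".
print(word_boundary_hit(query, "action=print"))
# The field is "action=print%20all", which is not exactly "action=print".
print(field_hit(query, "action=print"))
```

The \b version reports a hit on a field whose value is really
"print%20all", while the per-field match does not.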

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkq6hgIACgkQ7M8hyUobTrHvRwCfd7rndVUln9ZMmKJs3Twvx7rf
l3gAn2x+t1JTHuKT9xY1YtmtLxLyqXIM
=ebCu
-----END PGP SIGNATURE-----



