[Bug-wget] Re: Thoughts on regex support


From: Matthew Woehlke
Subject: [Bug-wget] Re: Thoughts on regex support
Date: Wed, 23 Sep 2009 16:34:44 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.23) Gecko/20090825 Fedora/2.0.0.23-1.fc10 Thunderbird/2.0.0.23 Mnenhy/0.7.5.0

Micah Cowan wrote:
Matthew Woehlke wrote:
Micah Cowan wrote:
[stuff about regex matching]
How will you handle nested boolean expressions? Same as 'find'?

IOW, how do you do this?
[url matches foo] AND ( [domain matches bar] OR [query matches baz] )

(Obviously I am intentionally choosing an example where the 'or' part
can't be easily expressed in the regex.)

Actually, my plan is... not to.

That's a perfectly valid answer, and a possibility I was thinking of (though I didn't explicitly say so).

The current method for checking
accept/reject rules - if it's in the acceptance list, and not in the
reject list, it's in - has never garnered any complaints to my
knowledge.

Okay, so it works like this:

[matches any of match-expr's] AND [does not match any of exclude-expr's]
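As a rough illustration of that rule, here is a minimal Python sketch. The function name `is_accepted` and its arguments are hypothetical, not anything in wget. It assumes implicit anchoring (hence `re.fullmatch`) and assumes that an empty accept list accepts everything; neither point is settled in this thread.

```python
import re

def is_accepted(url, match_exprs, no_match_exprs):
    """Return True if url matches any accept regex and no reject regex."""
    # Assumption: with no accept rules at all, everything is accepted.
    accepted = not match_exprs or any(
        re.fullmatch(p, url) for p in match_exprs
    )
    # A single matching reject rule is enough to exclude the URL.
    rejected = any(re.fullmatch(p, url) for p in no_match_exprs)
    return accepted and not rejected
```

Note that there is no nesting here: the only boolean structure is "any of the accepts, none of the rejects".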

My only comment here is that, if that's how multiple expressions will be handled, I'm less sure about having an invert match, as opposed to --no-match and /only/ --no-match.

  --no-match ':field:action=(edit|print)'
Something like 'param[eter]' or 'arg[ument]' seems more sensible to me
(though as a programmer I am not the best to ask about usability
things). It isn't always obvious that such URLs come from a form... and in
some cases it isn't even true.

However, it could still be termed a "field", I believe, regardless of
whether it comes from an HTML form or not. I personally prefer it over
either "parameter" or "argument", but I'm willing to hear more opinions.

(I might prefer "parameter", actually, if it weren't for the conflict
with "path". But I doubt we'll identify a more appropriate name for "path".)

Agreed. But then, you don't have 'file'? That's probably the main reason I prefer 'args', it doesn't conflict with anything.

Well, okay, I /actually/ prefer \b and forget fields/args/whatever :-). It should make the code less complicated also (otherwise as I understand it, this would be a match against any of a list of strings, where everything else is a match against a single string.)

Actually, it might even make sense to implement \b as only matching start/end and '[&?/.]'. That way matching path components (well, unless the paths contain '.') is also "safe".
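To make the idea concrete, here is a small Python sketch of such a URL-aware boundary. It is only an approximation of the suggestion: the names `URL_BOUNDARY`, `expand_boundaries` and `search_with_url_boundaries` are made up for illustration, the `\b` rewrite is a plain text substitution, and `re.search` is used so the boundary behaviour is visible without implicit anchors.

```python
import re

# Zero-width "boundary": start of string, end of string, or a position
# adjacent to one of the delimiters & ? / .
URL_BOUNDARY = r"(?:\A|\Z|(?<=[&?/.])|(?=[&?/.]))"

def expand_boundaries(pattern):
    # Naive textual substitution. It would mis-handle an escaped backslash
    # followed by "b" ("\\b"); a real implementation would need a parser.
    return pattern.replace(r"\b", URL_BOUNDARY)

def search_with_url_boundaries(pattern, text):
    """Search text with \\b reinterpreted as the URL-aware boundary."""
    return re.search(expand_boundaries(pattern), text)
```

With this, `\baction=print\b` finds `action=print` in `title=x&action=print` but not inside `reaction=print` or `action=printer`.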

  . Don't follow links for producing printer-ready output, or editing
    pages. Equivalent to --no-match ':query:(.*&)?action=print(&.*)?',
    but somewhat easier to write.
Just in case you're planning on a conversion to that regex in the code,
remember that it is really:
  '^.*[?]([^&]*&)*action=print(&.*)?$'

No it's not. The anchors are already implicit, remember?

You should finish reading my mail first :-). (Anyway I included them mostly for clarity.)

and the ":query:" label means it only tries that regex against the
query-string portion of the URL, so the .*[?] would break. If I'd
used :url: instead of :query:, then that modification would be
necessary (though still without the anchors).

Er... yeah. Never mind then.
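For the record, the difference between the two forms can be sketched in Python. This assumes implicit anchoring is modelled by `re.fullmatch` and that the query string is whatever `urllib.parse.urlsplit` reports; the helper names are hypothetical.

```python
import re
from urllib.parse import urlsplit

# Rule as it would be written against the :query: component only.
QUERY_RULE = r"(.*&)?action=print(&.*)?"
# Rule as it would have to be written against the whole URL (:url:).
URL_RULE = r".*[?]([^&]*&)*action=print(&.*)?"

def query_matches(url, pattern=QUERY_RULE):
    # Match only the query string, anchored at both ends.
    return re.fullmatch(pattern, urlsplit(url).query) is not None

def url_matches(url, pattern=URL_RULE):
    # Match the complete URL, anchored at both ends.
    return re.fullmatch(pattern, url) is not None
```

The `.*[?]` prefix is needed only in the whole-URL form; applied to the bare query string it could never match, since the `?` is not part of that component.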

For that matter, if you support '\b', I wonder if you need "components"
at all...

I don't see how that would help anywhere save for the "fields"
components;

See above, it is useful in query, domain, and path (and to an extent even in file).

but then I gather from the above that you may have been a
bit confused about what the effect of the component-selection does.

True. If you clearly specified what the components are in the original mail, I obviously missed it.

If I were going to support that (and I may), then I'd probably go with
\< and \> instead, as that's what seems to be commonly used for EREs,
outside of Perl.

I don't object to having those also, but they're harder to type ;-).

Components may be combined; to match against the combination of path and
query string, you just specify :path+query:. That could be abbreviated
as :p+q:. Combinations are only allowed if all the components involved
are consecutive; :domain+query: (no path) would be illegal.
I can probably figure out technical reasons for that, but it doesn't
make much sense from a user perspective. Why shouldn't I be able to write:
  -z ':d,f:foo'
...and have it match both
   'http://foobar.com/'
 and
   'http://baz.org/index?title=foobar'
?

No. It means that the entire regex is matched, once, against the combined
components, not matched once for each of the components.

How, then, did you plan for 'fields' to be matched?

There is no sane way to combine only the domain and a field (field
would not be allowed to combine with anything, in fact).

Sure there is; match each separately.

If you're going to insist on only matching contiguous parts of the url (and that is okay), then I would prefer a syntax that makes it clear that is what is happening. That is why I suggested the git 'from..to' syntax, instead of using '+'.

If you're going to match against 'fields' as a list of strings, then I question why you can't also match against a list of url components as a list (rather than a concatenation). (Okay, maybe it doesn't /make sense/, but you said "there is no way to [do this]".)

Besides, 'a..b' syntax deals more neatly IMHO with what is a legal combination and what is not :-). (Since e.g. 's..q' is a legal way to say 'u', but 's+q' is illegal. And you avoid 's+d+p+q'.)

BTW, what exactly are the components? Is this right?

[u]rl: http://foobar.com/site/images/thumb.php?name=baz.jpg&x=64&y=64
p[r]otocol: "http"
[d]omain: "foobar.com"
[p]ath: "site/images"
[f]ile: "thumb.php"
[q]uery: "name=baz.jpg&x=64&y=64"
[a]rgs: "name=baz.jpg", "x=64", "y=64"

This is the diagram I did (but didn't include in the message I sent out).

  /---\ scheme             /------ path ---\ /---------- query ------\
  https://addictivecode.org/foo/bar/baz.html?fee=fi&fo=fum&bludtype=en
          \---- domain ---/                         \----/   field

Thanks for that diagram, it helps a lot.

So... I got it mostly right. You use 'scheme' instead of 'protocol', which is better. You /don't/ have a match against just file. But...

"path" would include the initial / (it would always have one)

...I guess this is sufficient, as per your note, even if you want to match only the leading path (without the file).

I can't think of any similar advantage to including the ? in the
query string.

Agreed.

However, if you specify :path+query:, then the question mark is
included. Similarly, :scheme: wouldn't include the "://", but
:scheme+domain: would.

Using 'p..q' I would expect this. Again this feels less natural with 'p+q', because that doesn't look to me as much like concatenation.
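A sketch of how the component selection described above might behave, using Python's `urllib.parse.urlsplit`. The names `select`, `fields`, `ORDER` and `SEP_BEFORE` are invented for this example, only the long component names with the '+' syntax are handled, and the separator rules are my reading of the description, not a specification.

```python
from urllib.parse import urlsplit

# Components in URL order; only consecutive runs may be combined.
ORDER = ["scheme", "domain", "path", "query"]
# Separator inserted before a component when it follows its predecessor.
SEP_BEFORE = {"domain": "://", "path": "", "query": "?"}

def select(url, spec):
    """Return the text a regex would be matched against for e.g. 'path+query'."""
    parts = urlsplit(url)
    values = {"scheme": parts.scheme, "domain": parts.netloc,
              "path": parts.path, "query": parts.query}
    names = spec.split("+")
    idx = [ORDER.index(n) for n in names]  # ValueError on unknown names
    if idx != list(range(idx[0], idx[0] + len(idx))):
        raise ValueError("components must be consecutive: " + spec)
    text = values[names[0]]
    for name in names[1:]:
        # An empty query still gets a "?" here; a real implementation
        # would have to decide what to do in that case.
        text += SEP_BEFORE[name] + values[name]
    return text

def fields(url):
    """Return the query fields as a list, to be matched one at a time."""
    query = urlsplit(url).query
    return query.split("&") if query else []
```

For the URL in the diagram, `select(url, "path+query")` includes the question mark and `select(url, "scheme+domain")` includes the "://", while `select(url, "domain+query")` raises `ValueError`.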

  - Avoid adding both a --match and a --no-match option, by making
    negation a flag instead (/n or something: --match 'p/ni:.*\.js'
    would reject any paths ending in any case variant of ".js").
Similar ideas:
 -z '(?!expr)'

This one's of course automatic with PCRE if we provide that with an
option; we'd have to "emulate" it in builds not including PCRE.

Actually, this is interesting w.r.t. the first point... I don't think I would consider '--match foo' and '--no-match (?!foo)' the same. Rather, one is an accept rule (which happens to accept anything that doesn't match 'foo'), and one is a reject rule. This is actually useful since it lets you accept anything that «matches [list] AND matches [expr]».

  - Other anchoring options. I suspect that many common use cases
    will begin with '.*'. We could remove the implicit anchoring, but
    then we'd probably usually want it at the end, forcing us to write
    the final '$'. That's one character versus two, but my gut tells me
    it's easier to forget anchors than it is to forget "match-any"
    patterns, which is why I lean toward implicit anchors.
MHO: implicit anchoring violates traditional regex usage. There is
probably an example of implicit anchoring somewhere, but offhand I can't
think of it. (And at any rate, sed/grep sure don't use implicit anchoring.)

Both sed and grep use regex as the basis of a _search_. We're not
_searching_ for a pattern in a string, we're matching.

You might apply that logic to syntax highlighters also, yet at least kate's doesn't use implicit anchoring. Besides, I'm not convinced I want to write '.*/index\.html?' all the time (yes, same reason you gave originally). Especially when maybe I would rather write '/index\.'.

On the other hand, you're right about 'find'...

Okay, I suppose I can live with implicit anchoring :-).
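For anyone following along, the trade-off looks like this in Python terms, taking `re.fullmatch` as a stand-in for implicit anchoring and `re.search` as the sed/grep-style behaviour. The URL is just an example.

```python
import re

url = "http://example.com/docs/index.html"

# Implicit anchors: the pattern must cover the whole string,
# so the leading ".*" has to be written out.
anchored = re.fullmatch(r".*/index\.html?", url)

# Search semantics: the short pattern is enough.
searched = re.search(r"/index\.", url)

# The same short pattern under implicit anchors matches nothing.
short_anchored = re.fullmatch(r"/index\.", url)
```

Here `anchored` and `searched` are match objects and `short_anchored` is `None`.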

--
Matthew
Please do not quote my e-mail address unobfuscated in message bodies.
--
I picked up a Magic 8-Ball the other day and it said 'Outlook not so good.' I said 'Sure, but Microsoft still ships it.'
  -- Anonymous (from cluefire.net)




