
[Bug-wget] Re: Thoughts on regex support


From: Matthew Woehlke
Subject: [Bug-wget] Re: Thoughts on regex support
Date: Fri, 25 Sep 2009 12:40:44 -0500
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.8.1.23) Gecko/20090825 Fedora/2.0.0.23-1.fc10 Thunderbird/2.0.0.23 Mnenhy/0.7.5.0

Micah Cowan wrote:
Tony Lewis wrote:
Also, Location includes port and hash. How do you plan to deal with these
aspects of a URL?

I'd forgotten the port number... probably we should include that with
domain, and consider calling it "host" instead. Actually, since Wget
commonly uses the term "domain", we should at least provide that as an
alternative name.

Maybe we could keep 'domain' as w/o port, and have 'host' as with port?
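
To make the 'domain' vs. 'host' distinction concrete, here is a rough sketch in Python (not Wget code). The function name domain_and_host and the example URL are made up for illustration; urlsplit is the standard-library URL parser.

```python
from urllib.parse import urlsplit

def domain_and_host(url):
    """Return (domain, host): domain has no port, host carries the port if one is given."""
    parts = urlsplit(url)
    domain = parts.hostname  # lowercased host name, never includes the port
    host = domain if parts.port is None else f"{domain}:{parts.port}"
    return domain, host

# Hypothetical URL, used only to show the two components side by side.
domain, host = domain_and_host("http://example.com:8080/a/file.zip#frag")
```

With no explicit port in the URL, both values come out the same under this sketch.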

Micah Cowan wrote:
{ --match | --no-match }  [ : components [ / flags ] : ] regex
Sounds OK, but I think you mean: [ : [ components ] [ / flags ] : ]

Or something similar. '::' would be silly :)

Sure, but no need to go out of the way to make it illegal :-).
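
For what it's worth, the '[ : [ components ] [ / flags ] : ] regex' form is easy to split apart. The following is only a sketch in Python under my own assumptions: parse_spec is a hypothetical helper, and a spec that doesn't fit the ':...:' prefix is treated as a bare regex. Note that '::' simply yields empty components, so nothing special is needed to allow it.

```python
import re

# Optional ':components/flags:' prefix followed by the regex itself.
_SPEC = re.compile(
    r":(?P<components>[^:/]*)(?:/(?P<flags>[^:/]*))?:(?P<regex>.*)",
    re.DOTALL,
)

def parse_spec(spec):
    """Split a match spec into (components, flags, regex).

    Missing components or flags come back as empty strings; a spec
    without a well-formed prefix is returned whole as the regex.
    """
    m = _SPEC.fullmatch(spec)
    if m is None:
        return "", "", spec
    return m.group("components"), m.group("flags") or "", m.group("regex")
```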

it is implicitly anchored at the beginning and end
I think this is a bad idea. If someone wants ^ and $, they should specify
them.

So, that makes two against so far. Though I think I may have persuaded
Matt Woehlke down to a slight, vague preference.

Something like that. So we have two vague feelings against it, and one moderate feeling in favor :-).
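
To illustrate what is at stake, here is a small Python sketch of the two behaviors. It assumes implicit anchoring means "the pattern must cover the whole string" (re.fullmatch) and explicit anchoring means "search anywhere unless the user writes ^ or $" (re.search); the function names are mine.

```python
import re

def implicit_match(pattern, text):
    """Implicitly anchored at both ends: the pattern must match the whole text."""
    return re.fullmatch(pattern, text) is not None

def explicit_match(pattern, text):
    """No implied anchors: the pattern may match anywhere unless it uses ^ or $."""
    return re.search(pattern, text) is not None

# With implicit anchors a bare suffix pattern needs a leading '.*';
# with explicit anchors it needs a trailing '$' to stay a suffix match.
suffix_implicit = implicit_match(r".*\.zip", "/a/file.zip")
suffix_explicit = explicit_match(r"\.zip$", "/a/file.zip")
```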

I realize that my argument for domain matching is not entirely consistent
with explicit anchors. Going back to domain matching, I think
':domain:site.com' should be interpreted as ':domain:^.*\bsite\.com$', but I
also think domain matching is a special case.

Definitely not doing that. Special-casing the syntax for different
components strikes me as a bad idea.

This was one point on which I strongly agreed (partly why my own opinion is mushy).
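
For reference, this is how the expanded form quoted above behaves when written out explicitly, sketched in Python with made-up domain names. One thing to be aware of: '\b' also sits between a hyphen and a letter, so a name like 'my-site.com' matches too.

```python
import re

# The explicit pattern that ':domain:site.com' was proposed to expand to.
SITE = re.compile(r"^.*\bsite\.com$")

def domain_matches(domain):
    """True if the domain matches the expanded 'site.com' pattern."""
    return SITE.search(domain) is not None

# Hypothetical domains, chosen to show where the word boundary falls.
results = {d: domain_matches(d)
           for d in ("site.com", "www.site.com", "mysite.com", "my-site.com")}
```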

If the components aren't specified, it would default to matching just
the pathname portion of the URL.
I'm not sure this is the obvious behavior, but I would get used to it.

It's open for discussion. What do you think the most obvious behavior
would be? Full-url? I'm currently trying to aim for
most-frequently-used, over most-obvious, so if you think that'd be a
different component (or slice of components), lemme know.

I'd missed this point in the original message. I would think full URL is most obvious. I'd be hesitant to guess what 'most used' would be; that tends to be a failing proposition for at least some audiences. Ergo, since no solution is best from a 'most used' standpoint, 'most sensible' wins out IMHO.

(And I personally think url is more obvious than ':s-p:'...)

It is not clear to me how one would combine matches. Let's say that I want
all ZIP files from directory a (but not from directory A) and all JPG files
from directory b (but not from directory B). How do I indicate that I want
to match:

(':path:\ba\b' AND ':path/i:\.zip$') OR (':path:\bb\b' AND ':path/i:\.jpg$')

I'd probably go for --match ':path:.*/a/.*\.[Zz][Ii][Pp]' and --match
':path:.*/b/.*\.[Jj][Pp][Ee]?[Gg]'. PCREs would make that somewhat nicer.

That was the same solution I came up with, plus this alternative:

-z ':p/i:/a/.*\.zip' -z ':p/i:/b/.*\.jpg' -Z ':p:/(?!(a|b))/'

(Obviously the latter needs either a /n flag or PCREs. I think I'm increasingly in favor of supporting PCREs if at all possible. Yes, I realize that means a dependency on libpcre.)

Also: I see that both Micah and I assumed you really meant '/a/', not '\ba\b' :-).
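
Here is a quick Python sketch of the "several --match options, any one of which accepts" reading, using the two patterns from Micah's reply. It assumes the patterns are implicitly anchored (hence re.fullmatch) and that the path component starts with '/'; the accepted helper and the sample paths are hypothetical.

```python
import re

# The two --match patterns from the reply above, without the ':path:' prefix.
ACCEPT = [
    r".*/a/.*\.[Zz][Ii][Pp]",
    r".*/b/.*\.[Jj][Pp][Ee]?[Gg]",
]

def accepted(path, patterns=ACCEPT):
    """A path is accepted if any one pattern matches it in full (logical OR)."""
    return any(re.fullmatch(p, path) for p in patterns)
```

The AND within each pair is folded into a single regex, and the OR comes from giving the option more than once.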

As already discussed, --match and --no-match would be analogs to -A and
-R; they'd just use regexes rather than wildcards (and have wider
options for what portions you're matching against).

Thought: is it possible to alter the syntax of -A/-R to tell these that you are matching a regex rather than a glob? Maybe by requiring the '::'?

Given that the most common use case is to match against suffixes in the
path, perhaps ':path/i:^.*\.' and '$' should be implied so that --traverse
'(html?|php)' is interpreted as ':path/i:^.*\.(html?|php)$'.

Again, I really want consistency with the regex rules.

Perhaps we should simply say, "most convenience be damned", and go for
explicit anchors everywhere, even if that leads to a little more typing
in most places. It certainly follows the principle of least surprise...
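
As a rough idea of what "explicit anchors everywhere" costs for the suffix case, here is the '(html?|php)' example written out in full in Python. It assumes the /i flag maps to case-insensitive matching and that matching is a plain search; the wanted helper and the paths are made up.

```python
import re

# Explicit form of the suffix match: the user writes '\.' and '$' themselves.
SUFFIX = re.compile(r"\.(html?|php)$", re.IGNORECASE)

def wanted(path):
    """True if the path ends in .htm, .html or .php, ignoring case."""
    return SUFFIX.search(path) is not None
```

That is three extra characters over the bare '(html?|php)', and the pattern means exactly what it says.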

:-)

--
Matthew
Please do not quote my e-mail address unobfuscated in message bodies.
--
I want to vote for a Conservative Democrat. Too bad they're about as rare as an Honest Politician. Maybe I'll get lucky and someone will come along that's both.




