Re: [Bug-wget] Thoughts on regex support


From: Micah Cowan
Subject: Re: [Bug-wget] Thoughts on regex support
Date: Fri, 25 Sep 2009 09:31:13 -0700
User-agent: Thunderbird 2.0.0.23 (X11/20090817)

-----BEGIN PGP SIGNED MESSAGE-----
Hash: SHA1

Tony Lewis wrote:
> Micah Cowan wrote:
> 
>> Tony Lewis wrote:
> 
>> "hash" doesn't apply to URIs that wget would handle (it's called the
>> "fragment" portion in relevant RFCs), as that's not normally part of
>> what gets sent to the server.
> But it can appear in the links within a page. Are you going to discard the
> fragment portion before doing the match?

Of course. What possible use could matching the fragment portion be for
deciding whether to download the entire page?

I just noticed RFC 3986 explicitly states this should be the case, in
fact. From 6.1: "When URIs are compared to select (or avoid) a network
action, such as retrieval of a representation, fragment components (if
any) should be excluded from the comparison."

This looks to match exactly what we're doing.
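
As an illustration of the RFC 3986 section 6.1 rule quoted above, here is a minimal Python sketch (not Wget code) that drops the fragment before URLs are compared. The helper name comparison_key and the example.com links are assumptions made up for the example.

```python
from urllib.parse import urldefrag


def comparison_key(url: str) -> str:
    """Return the URL with any fragment removed, as used for comparison."""
    # urldefrag splits "http://host/page#frag" into the URL and the fragment.
    return urldefrag(url).url


# Hypothetical links found within one page: they differ only by fragment.
links = [
    "http://example.com/doc.html#intro",
    "http://example.com/doc.html#usage",
    "http://example.com/doc.html",
]

# All three collapse to a single retrieval target.
unique_targets = {comparison_key(link) for link in links}
```

Any pattern match would then be run against comparison_key(link) rather than against the raw link.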

>> I'd forgotten the port number... probably we should include that with
>> domain, and consider calling it "host" instead. Actually, since Wget
>> commonly uses the term "domain", we should at least provide that as an
>> alternative name.
> 
> While we can't ignore it, I doubt that people want to match on the port very
> often (if ever!). It's probably best to have :domain: refer just to the host
> name portion of the URL.

Well, perhaps we should provide both: domain, without the port, and
host, with the port, for those who want it. But then, "host" strikes me
as a poor name for it in that case. The RFC calls it "authority" (which
also includes "username:password@"; we definitely won't match that
either: it'll be stripped from any URL it appears in). I don't like that
name much either, though.

I could see someone legitimately wanting to match against port, if there
are two services running on the same host, and the user's only
interested in one. Of course, it would be no great loss if we failed to
support them; we don't currently, after all.
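
To make the distinction concrete, here is a rough Python sketch of splitting the authority into a port-less "domain" and a "host" that keeps the port, with the userinfo dropped. The function name, the dictionary keys, and the sample URL are assumptions for illustration; nothing here is settled naming or Wget code.

```python
from urllib.parse import urlsplit


def authority_parts(url: str) -> dict:
    """Split the authority into a port-less domain and a host with port."""
    parts = urlsplit(url)
    # .hostname omits "user:password@" and the port, and is lowercased.
    domain = parts.hostname or ""
    # .port is None when the URL carries no explicit port.
    port = parts.port
    host = domain if port is None else f"{domain}:{port}"
    return {"domain": domain, "host": host}


# Hypothetical URL with userinfo and an explicit port.
example = authority_parts("http://user:secret@www.Example.com:8080/a/b")
```

Here example["domain"] is "www.example.com" and example["host"] is "www.example.com:8080", so a user interested in only one of two services on the same machine could match on the latter.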

>> Note, of course, that -D is still an option, and -D site.com would be
>> equivalent, so probably still the best choice.
> 
> So many different ways to accomplish the same thing. Yikes! I can just see
> someone posting a question to the list and getting three correct answers
> using different command line options (-D, -A, and -z).

We could consider deprecating some, but I think they are all more
convenient than the regexes would be, for many cases.

>>> Sounds OK, but I think you mean: [ : [ components ] [ / flags ] : ]
>> Or something similar. '::' would be silly :)
> 
> I agree that '::' is silly and I'm assuming the parser would treat it as a
> no-op.

I'd be inclined to require at least one of components and flags.
{ components [ / flags ]  |  / flags }
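
A possible reading of that grammar, sketched in Python. Everything here is an assumption for illustration: the function name split_spec, the choice to treat component names as lowercase letters and hyphens, flags as lowercase letters, and the decision to reject '::' outright.

```python
import re

# ":" { components [ "/" flags ] | "/" flags } ":"
_PREFIX = re.compile(
    r":(?:(?P<components>[a-z-]+)(?:/(?P<flags>[a-z]+))?"
    r"|/(?P<only_flags>[a-z]+)):"
)


def split_spec(spec: str):
    """Return (components, flags, regex) for a --match style argument."""
    if not spec.startswith(":"):
        # No prefix at all: the caller applies its default components.
        return None, "", spec
    m = _PREFIX.match(spec)
    if m is None:
        # Covers "::", which names neither components nor flags.
        raise ValueError(f"malformed component prefix in {spec!r}")
    flags = m.group("flags") or m.group("only_flags") or ""
    return m.group("components"), flags, spec[m.end():]
```

For example, split_spec(r":path/i:^.*\.html?$") yields the components "path", the flags "i", and the regex "^.*\.html?$". A regex that itself begins with ':' would need some escape convention, which this sketch does not attempt.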

>> No, they're not. But we only have a handful of sane single-character
>> options remaining, and only three pairings of uppercase/lowercase. The
>> other two are -J/j and -G/g.
> 
> Hmm... -g for --grep? :-)

I meant to mention, but forgot, that you can find a script named
"freeopts" in util/ (only in the development sources), that scans to see
which options are still available.

>> So, that makes two against so far. Though I think I may have persuaded
>> Matt Woehlke down to a slight, vague preference.
> 
> I could live with either solution, but I think explicit anchors makes the
> most intuitive sense when you're doing a pattern match.
> 
>>> In the more general case of anchoring, I think ':path:foo' should match
>>> '/path/to/foo.html' and '/foo/baz/index.html'.
>> Yeah, except when is that useful?
> 
> When foo is a CGI script such that /path/to/foo and /path/to/foo/arg/u/ments
> invoke the same script.

Eh? That doesn't seem to have anything to do with your example. And when
would /path/to/foo.html and /foo/baz/index.html ever be the same script?
Sure, possible, but very, very contrived, and easily solved with the use
of ".*".

>>>> If the components aren't specified, it would default to matching just
>>>> the pathname portion of the URL.
>>> I'm not sure this is the obvious behavior, but I would get used to it.
>> It's open for discussion. What do you think the most obvious behavior
>> would be?
> 
> I think :scheme-path: is the most obvious default.

(To list) Any other opinions on this?

You're probably right that it's the "most obvious", but as I mentioned,
I was shooting for "most convenient". Which is that?

>>> What if you're recording unfollowed links to the SIDB? Don't you still
>>> want those links to appear?
>> Dunno. What do you think?
> 
> I would expect the URLs from the last traversal level to appear as
> unfollowed links.

I'm inclined not to. If we must, we could list the links that would
otherwise have been traversed as "unfollowed", but downloading a file
just to report that we're not following the (newly found) links within
it seems ridiculous. Could anyone ever find an actual use for that behavior?

>>> What's wrong with treating --traverse as meaning --traverse
>>> ':path/i:^.*\.html?$' and then having --traverse ':path/i:^.*\.php$'
>>> override that behavior and only download PHP pages.
>> Mainly, because I don't want to continue what I consider to be broken
>> default behavior, if I can get away with it. :)
> 
> Either the current behavior is a bug in which case it should be fixed or
> it's a feature that needs to be maintained. I was assuming it was a feature
> and suggested a simple way to maintain the behavior while easily allowing it
> to be overridden.

Depends on your point of view, I imagine. It was certainly intended to
be a feature when it was written, and perhaps some consider it to be so.
It would have been more featureful around the time it was written, when
presumably virtually all html pages really did have "html" or "htm"
suffixes, and the use of ".php", ".cfm", ".asp" and such was not yet
widespread.

However, on the modern web, the combination of the fact that it rarely
covers everything you want it to, and frequently covers many things you
don't want it to, makes it a misfeature. I might not go so far as to
call it a bug, but it is definitely a thorn in my and others' sides.
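
To illustrate the first half of that complaint, here is a small Python sketch that applies the suffix rule as Tony spelled it (':path/i:^.*\.html?$') to a handful of made-up paths. Python's re module stands in for whatever regex library would actually be used, and this is not a description of Wget's real traversal logic.

```python
import re

# The suffix rule from the quoted proposal, case-insensitive ("/i").
HTML_SUFFIX = re.compile(r"^.*\.html?$", re.IGNORECASE)

# Hypothetical paths of the kind a modern site might serve HTML from.
paths = [
    "/index.html",
    "/docs/PAGE.HTM",
    "/forum/viewtopic.php",
    "/wiki/Main_Page",
    "/about/",
]

traversed = [p for p in paths if HTML_SUFFIX.search(p)]
skipped = [p for p in paths if not HTML_SUFFIX.search(p)]
```

Only the first two paths end up in traversed; the PHP page, the suffix-less wiki page, and the directory URL are all skipped even though each may well be HTML.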

I don't doubt there are those that rely on it. But my support experience
on IRC suggests (among that particular sample set, anyway) that they are
vastly fewer than those who are surprised and annoyed by it (I actually
don't recall ever encountering someone there who relied on it). I don't
mind annoying the few who use it (they can have peace of mind again with
a quick .wgetrc entry) if it means less confusing behavior for everyone
else.

...shoot, I just realized we're really going to need a --no-traverse
option as well. And not only that, but there will be expressions that we
want in both --no-match and --no-traverse; "action=edit"-type things, or
anything that might tax the server. I'm not crazy about
--no-match-or-traverse... maybe a better system can be had.

>>> Given that the most common use case is to match against suffixes in the
>>> path, perhaps ':path/i:^.*\.' and '$' should be implied so that --traverse
>>> '(html?|php)' is interpreted as ':path/i:^.*\.(html?|php)$'.
>> Again, I really want consistency with the regex rules.
> 
> OK. So how about adding :suffix: to the mix. Then one can say --traverse
> ':suffix/i:(html?|php)'.

I dislike that idea, particularly since it's not a "real" component of
a URL. Neither is :field:, of course: but it's much more useful (IMO)
than :suffix: would be, given that you can just use :path: and add ".*"
at the front (if we go with explicit anchors) or "$" at the back
(otherwise).

>> Perhaps we should simply say, "most convenience be damned", and go for
>> explicit anchors everywhere, even if that leads to a little more typing
>> in most places. It certainly follows the principle of least surprise...
> 
> In all the places that I work with regular expressions, anchors are
> explicitly specified so *I* would be most surprised by having implicit
> anchors.

That's what I said, wasn't it? The only place I've seen implicit anchors
so far is "find", and in truth that was surprising to me the first time
I encountered it.

Okay. We're doing explicit anchors everywhere. :)
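
For comparison, a short Python sketch of the two behaviors using Tony's ':path:foo' example. re.search plays the role of explicit anchoring and re.fullmatch the role of implicit anchoring as in "find"; Python's re is only a stand-in, and the third path is made up.

```python
import re

paths = ["/path/to/foo.html", "/foo/baz/index.html", "/bar/index.html"]

# Explicit anchors: the pattern may match anywhere unless "^" or "$" pins it.
unanchored = [p for p in paths if re.search(r"foo", p)]

# Implicit anchors: the pattern has to cover the whole path.
implicit = [p for p in paths if re.fullmatch(r"foo", p)]

# Under implicit anchors the same intent needs ".*" on both sides.
implicit_wild = [p for p in paths if re.fullmatch(r".*foo.*", p)]

# Under explicit anchors a suffix test needs only a trailing "$".
html_only = [p for p in paths if re.search(r"\.html$", p)]
```

With explicit anchors, "foo" matches the first two paths as Tony expected; with implicit anchors it matches none of them until the ".*" is added.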

I'm not going to love that for :fields: though.

>> I'd probably go for --match ':path:.*/a/.*\.[Zz][Ii][Pp]' and --match
>> ':path:.*/b/.*\.[Jj][Pp][Ee]?[Gg]'. PCREs would make that somewhat nicer.
> 
> What about the possibility of including multiple components in the same
> argument to match?
> For example, --match ':path:.*/a/.*:suffix/i:zip'. This would mean that you
> have to escape a colon when it appears between the scheme and domain as in
> ':url:http\://www.site.com/.*'.

Easier and more clearly stated as --match ':path:.*/a/.*\.zip'. And no,
I have no interest in such a thing.

> In your proposal am I allowed to supply two --match parameters that are
> OR'ed together?

Yes. Just like --accept does. What we lack is an "and" operation.
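
A minimal Python sketch of that OR behavior, using the two ':path:' expressions from earlier in this message. The function name accepted and the sample paths are assumptions, and Python's re stands in for the eventual regex engine.

```python
import re

# Each entry corresponds to one --match argument (component prefix omitted).
match_patterns = [
    r".*/a/.*\.[Zz][Ii][Pp]",
    r".*/b/.*\.[Jj][Pp][Ee]?[Gg]",
]


def accepted(path: str) -> bool:
    """A path is accepted when any one of the patterns matches it."""
    return any(re.search(pattern, path) for pattern in match_patterns)
```

So accepted("/files/a/archive.ZIP") and accepted("/files/b/photo.jpeg") are both true, while accepted("/files/a/photo.jpg") is false because neither pattern matches on its own. Requiring both patterns at once (all() instead of any()) is the "and" operation that has no spelling yet.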

> Also, will URLs be converted to canonical form before the --match operation
> is performed? Will the argument to match be canonicalized? Will '=' match
> '=', '%3d', or '%3D'? Likewise, will '%61' match 'a' or '%61'?

Probably. Not the '%61' thing, though: we definitely would _not_ be
normalizing the regex; it's not remotely practical. And note that '/'
and '%2F' cannot have the same meaning, as they mean different things
in the path portion of a URI (one separates two components of a path,
the other is valid as part of a single path component).
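
One way the URL side could be canonicalized, sketched in Python along the lines of RFC 3986 section 6.2.2: hex digits in escapes are uppercased, and escapes of unreserved characters are decoded, while escapes of reserved characters such as an escaped slash (%2F) are left alone. The function name is hypothetical and this is only an assumption about what "canonical form" might mean here, not a decided behavior.

```python
import re
import string

# RFC 3986 "unreserved" characters: safe to decode without changing meaning.
UNRESERVED = set(string.ascii_letters + string.digits + "-._~")
_ESCAPE = re.compile(r"%([0-9A-Fa-f]{2})")


def normalize_escapes(text: str) -> str:
    """Decode escapes of unreserved characters; uppercase all other escapes."""
    def fix(match: re.Match) -> str:
        char = chr(int(match.group(1), 16))
        if char in UNRESERVED:
            return char
        # Reserved or non-ASCII byte: keep it escaped, normalize hex case only.
        return "%" + match.group(1).upper()

    return _ESCAPE.sub(fix, text)
```

Under this scheme "/%61/b%2fc" becomes "/a/b%2Fc" and "x%3dy" becomes "x%3Dy". A literal '=' and '%3D' stay distinct, since '=' is a reserved character; whether a pattern containing '=' should also match '%3D' would be a separate policy decision.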

- --
Micah J. Cowan
Programmer, musician, typesetting enthusiast, gamer.
Maintainer of GNU Wget and GNU Teseq
http://micah.cowan.name/
-----BEGIN PGP SIGNATURE-----
Version: GnuPG v1.4.9 (GNU/Linux)
Comment: Using GnuPG with Mozilla - http://enigmail.mozdev.org

iEYEARECAAYFAkq88FAACgkQ7M8hyUobTrG0VgCeMjG/xD+BJVurYMqr95lI3us6
4NoAni/f1KsIGs7VsMzWdu101vKKi+Vb
=RNR+
-----END PGP SIGNATURE-----




