[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: repeated matches in grep
From: |
Tom Lord |
Subject: |
Re: repeated matches in grep |
Date: |
Sat, 20 Oct 2001 13:34:15 -0700 (PDT) |
>> grep '\(.\).*\1.*\1.*\1.*\1.*\1' /usr/dict/words
>>
>> but this does not:
>>
>> grep '\(.\)\(.*\1\)\{5\}' /usr/dict/words
Without too much thought to it, I would say that the engine is
being greedy and matches the longest possible match with this
expression (.*\1) before trying {5}
This refers to a silly, quite old, informal rivalry among regexp
implementors.
Bugs along these lines plague nearly all Posix regexp matchers. The
Posix spec is quite difficult to implement correctly. (I'm not sure,
off the top of my head, whether the example regexps above are really
Posix BRE or extensions to BRE syntax -- but claiming that the
leftmost longest rule doesn't apply to BRE or ERE extensions is kind
of wormy.)
You can probably fix such bugs, if you're very careful, but that opens
up a few additional cans of worms.
See:
http://www.regexps.com/rx-posix.html
in particular:
http://www.regexps.com/manual/libhackerlab/html/posix-regexps.html
especially:
http://www.regexps.com/manual/libhackerlab/html/posix-regexps.html#Rx_and_Conformance_to_the_Posix_Standard
and:
http://www.regexps.com/manual/libhackerlab/html/introduction-to-regexps.html
especially:
http://www.regexps.com/manual/libhackerlab/html/introduction-to-regexps.html#The_Leftmost_Longest_Rule
Cans of worms include the open question of how much code will break
if you fix such bugs, and, complexities of performance and tuning.
On the latter point, see:
http://www.regexps.com/manual/libhackerlab/html/rx-posix-data-sheet.html
and
http://www.regexps.com/manual/libhackerlab/html/posix-regexps.html
and
http://www.regexps.com/manual/libhackerlab/html/tuning.html
A related, slightly depressing phenomenon surrounds the XML/Unicode
regexp language defined for XML Schema. Several implementations of
that language suffer from similar bugs, even though they don't have to
deal with the complexities of subexpression reporting and
back-references. One that appears to get it right is:
http://www.regexps.com/rx-xml.html
though be sure to see:
http://www.regexps.com/manual/libhackerlab/html/rx-xml-data-sheet.html
to understand where we're at in terms of testing rx-xml.
If you want a posix engine that, like the one in GNU grep, combines a
DFA-cache for simple regexps and a full-blown backtracking engine for
general regexps, rx-posix is worth taking a look at.
Grep's engine saves a cycle or two on DFA table look-ups. On the
other hand, rx-posix uses the DFA engine in a greater range of
circumstances and also gets the hard posix cases right (at least so
far as testing has yet determined).
-t