bug-gnu-utils
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: repeated matches in grep


From: Tom Lord
Subject: Re: repeated matches in grep
Date: Sat, 20 Oct 2001 13:34:15 -0700 (PDT)

      >> grep '\(.\).*\1.*\1.*\1.*\1.*\1' /usr/dict/words
      >> 
      >> but this does not:
      >> 
      >> grep '\(.\)\(.*\1\)\{5\}' /usr/dict/words

      Without too much thought to it, I would say that the engine is
      being greedy and matches the longest possible match with this
      expression (.*\1) before trying {5}

This refers to a silly, quite old, informal rivalry among regexp
implementors.

Bugs along these lines plague nearly all Posix regexp matchers.  The
Posix spec is quite difficult to implement correctly.  (I'm not sure,
off the top of my head, whether the example regexps above are really
Posix BRE or extensions to BRE syntax -- but claiming that the
leftmost longest rule doesn't apply to BRE or ERE extensions is kind
of wormy.)

You can probably fix such bugs, if you're very careful, but that opens
up a few additional cans of worms.

See:

http://www.regexps.com/rx-posix.html

in particular:

http://www.regexps.com/manual/libhackerlab/html/posix-regexps.html

especially:

http://www.regexps.com/manual/libhackerlab/html/posix-regexps.html#Rx_and_Conformance_to_the_Posix_Standard

and:

http://www.regexps.com/manual/libhackerlab/html/introduction-to-regexps.html

especially:

http://www.regexps.com/manual/libhackerlab/html/introduction-to-regexps.html#The_Leftmost_Longest_Rule

Cans of worms include the open question of how much code will break
if you fix such bugs, and, complexities of performance and tuning.

On the latter point, see:

http://www.regexps.com/manual/libhackerlab/html/rx-posix-data-sheet.html

and

http://www.regexps.com/manual/libhackerlab/html/posix-regexps.html

and 

http://www.regexps.com/manual/libhackerlab/html/tuning.html

A related, slightly depressing phenomenon surrounds the XML/Unicode
regexp language defined for XML Schema.  Several implementations of
that language suffer from similar bugs, even though they don't have to
deal with the complexities of subexpression reporting and
back-references.  One that appears to get it right is:

http://www.regexps.com/rx-xml.html

though be sure to see: 

http://www.regexps.com/manual/libhackerlab/html/rx-xml-data-sheet.html

to understand where we're at in terms of testing rx-xml.

If you want a posix engine that, like the one in GNU grep, combines a
DFA-cache for simple regexps and a full-blown backtracking engine for 
general regexps, rx-posix is worth taking a look at.

Grep's engine saves a cycle or two on DFA table look-ups.  On the
other hand, rx-posix uses the DFA engine in a greater range of
circumstances and also gets the hard posix cases right (at least so
far as testing has yet determined).

-t



reply via email to

[Prev in Thread] Current Thread [Next in Thread]