bug-gnu-utils

Re: split-function


From: Tom Lord
Subject: Re: split-function
Date: Mon, 9 Apr 2001 17:07:12 -0700 (PDT)

        Under the following circumstances the split-function seems to
        run endlessly.

        [
                expression: /\*([^*]|([*]+[^/]))*\*/
                string: //**********************************//
        ]

Wow -- that's a reasonably realistic example of the famous problems
that make Posix regexp matchers difficult to implement correctly
and well.
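
To give a rough feel for why: inside a long run of stars, each repetition
of the inner group can only match via [*]+[^/], which consumes at least
two stars, so the run can be carved up in very many ways.  The sketch
below just counts those ways.  It is an illustration only: the function
name count_splits is made up, and it assumes a naive matcher that would
try every decomposition while looking for the longest match.  It is not a
description of how regex.c or any other particular matcher works.

```python
from functools import lru_cache


@lru_cache(maxsize=None)
def count_splits(n: int) -> int:
    """Count the ways to cover a run of n '*' with repetitions of [*]+[^/].

    Inside a run of stars, each repetition consumes at least two
    characters: one or more stars for [*]+, then one more star for [^/].
    """
    if n == 0:
        return 1  # nothing left to cover: exactly one (empty) way
    # The first repetition takes k >= 2 stars; cover the rest recursively.
    return sum(count_splits(n - k) for k in range(2, n + 1))


if __name__ == "__main__":
    for n in (8, 16, 32):
        print(n, count_splits(n))
```

The counts grow like the Fibonacci numbers (13 ways for 8 stars, 610 for
16, over a million for 32), so a matcher that walks all of them for a run
of about thirty stars can easily look like it is hanging.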

rx-posix (from regexps.com) seems to handle that expression well.  It
comes with extensive Posix tests, including some that other matchers
(including GNU regex) don't pass.  On the other hand, it lacks GNU
extensions.  It doesn't handle Unicode, but the low-level engine
handles UTF-8 and all flavors of UTF-16, so there seems to be a finite
amount of straightforward work to get there.

I haven't tested the expression with the Tcl matcher, but would guess
that it also does well.  The Tcl matcher also lacks GNU extensions.
It handles Unicode with a UTF-8 encoding; handling UTF-16 or UTF-32
variations seems to be a finite amount of straightforward work.  It
has a few bugs and/or its author disagrees with me about what the
Posix spec means.

The combination of dfa.c and regex.c has a steep drop-off from 
fast expressions to slow expressions.  The two matchers mentioned 
above optimize a lot of cases that dfa.c can't handle, and that regex.c
handles slowly.

There's no telling how many scripts would break by switching to a
correct Posix matcher -- none, many, or something in between.  There
also seems to be disagreement on precisely what the Posix spec means
-- though my interpretation (in rx-posix) is, of course, the correct
one :-)

Thomas Lord


