bug-glibc
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

New regex implementation


From: Ville Laurikari
Subject: New regex implementation
Date: Thu, 15 Mar 2001 22:33:53 +0200
User-agent: Mutt/1.2.5i

Hello all,

I am writing my master's thesis on the subject of efficient submatch
addressing for regular expressions.  As a byproduct of the thesis,
I have written a prototype of a POSIX.2 compatible regex matching
library using the methods studied in my work.  The library currently 
supports only a modest subset of the stuff specified in POSIX.2, but
extended regular expressions with substring addressing are supported
(back referencing is not yet supported, but support for them can be
added). 

The thing is currently about 5-10 times faster than GNU regex-0.12 for
most patterns, and exponentially faster for cases where GNU regex
takes exponential time (like the pattern `(a*)*|b*' and strings of the
form `aaaaaaaaaaaaaa...b').   It uses `O(p)' memory for a regex of
size `p', and all memory allocations are done during pattern
compilation, nothing is allocated during matching.

Anyways, I checked out the open tasks list for glibc, and found this: 

   The regular expression matching functions need to be
   extended/rewritten for full internationalization support.  This
   includes support for character and collation classes.  Wide
   character versions are also necessary.  

I am not fully aware of how much work it would be to add all the
missing bits to my library.  I guess it's doable, and I am willing to
give it a shot.

How does this sound?  I presume patching up regex-0.12 for full
internationalization support is a lot of work, since it has not been
done yet.  I hope to be able to ultimately produce a drop-in
replacement for GNU regex with support for all the features currently
in GNU regex, support for the necessary features missing in GNU regex,
and a significant increase in speed.


-- 
http://www.iki.fi/vl/




reply via email to

[Prev in Thread] Current Thread [Next in Thread]