[Chicken-hackers] [PATCH] Update irregex to 0.9.0

chicken-hackers

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[Chicken-hackers] [PATCH] Update irregex to 0.9.0

From:	Peter Bex
Subject:	[Chicken-hackers] [PATCH] Update irregex to 0.9.0
Date:	Sun, 16 Sep 2012 21:46:44 +0200
User-agent:	Mutt/1.4.2.3i

Hi everyone,

Because the version of master was bumped to 4.8.1, I decided to go ahead
and update our core irregex to the latest version.  The attached patches
(all 4 of them) synchronizes us with upstream 0.9.0 irregex.  This gives
some performance improvements for submatches.

Before 0.9, irregex would rely on backtracking whenever submatches
occurred in the SRE.  The new version uses a clever algorithm by
Ville Laurikari based on "tagged" NFAs (or tNFAs), which are compiled
to DFAs which can perform submatch bookkeeping while processing the
input, in one pass.  You can get more info by reading the thesis that
introduced this: http://laurikari.net/ville/regex-submatch.pdf
There's also a very short paper that describes the concepts:
http://laurikari.net/ville/spire2000-tnfa.pdf
But beware, the short paper uses slightly different notation,
terminology, and algorithms(!)

Irregex (and thus, Chicken) is one of the few implementations of
Laurikari's algorithm.  The others that I've found were:

- Ville's own "TRE" C library (http://laurikari.net/tre/).  He wrote this
   as an industrial-strength regex lib after completing the prototype
   for his thesis in some obscure C++ macro hackery.  This library is
   included in NetBSD 6.0 and if I understand correctly there are plans
   to replace Henry Spencer's regexp engine in libc with TRE.
- A regex class for the Tango library for the "D" language
   
(http://www.dsource.org/projects/tango/docs/current/htmlsrc/tango.text.Regex.html,
   http://www.dsource.org/projects/tango/docs/current/tango.text.Regex.html).
   I used this implementation to gain a detailed working understanding of the
   algorithm where the paper was too hand-wavy.
- A tNFA Emacs library (http://www.emacswiki.org/emacs/TaggedNFA)
   The heavy imperative coding style and (to me) strange naming
   conventions made this useless to understand the algorithm.
   Man, people will write anything in Emacs Lisp!
- Haskell has a pluggable Regex engine, which also has a tNFA backend.
   (http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa)
   However, they've invented an own concept ("Orbits") to fit their type
   system.  This boggled my mind, so I kind of left it at that.
   Design notes about Orbits: http://www.haskell.org/haskellwiki/RegexpDesign

One possibly interesting aspect of this algorithm is that, according
to Laurikari's thesis, with some modifications it should be able to
perform "approximate" or "fuzzy" matching.  NetBSD 6 now includes an
"agrep" command which leverages TRE's ability to do this.  I was only
interested in the performance improvements so, didn't look into this.

I've done some quick benchmarks.  There's the old "Chicken vs Perl"
chestnut: 
http://lists.nongnu.org/archive/html/chicken-users/2011-09/msg00131.html

Without patch:
interpreted: csi -s href-extraction.scm  3.75s user 0.11s system 97% cpu 3.962 
total
compiled (-O3): ./href-extraction  2.53s user 0.05s system 98% cpu 2.605 total

With patch:
interpreted: csi -s href-extraction.scm  2.01s user 0.08s system 98% cpu 2.126 
total
compiled (-O3): ./href-extraction  1.34s user 0.08s system 98% cpu 1.434 total

Unfortunately, Perl still takes our lunch:
perl extract.pl  0.05s user 0.01s system 83% cpu 0.072 total

I'm unsure what the bottleneck is here.  Warrants more investigation.


As another test, I ran the irregex alternative implementation of uri-generic.
This is simply the output of running the following at the REPL after
loading the compiled uri-generic.irregex.so:
#;1> ,t (let loop ((i 0)) (unless (> i 10000) (uri-reference 
"http://call-with-current-continuation.org/foo/bar?mooh#qux";) (loop (add1 i))))

Without patch:
1.683s CPU time, 0.083s GC time (major), 600142 mutations, 26/4716 GCs 
(major/minor)
With patch:
1.313s CPU time, 0.073s GC time (major), 600143 mutations, 23/4720 GCs 
(major/minor)
This is very close to the "main" implementation, which is based on the
matchable egg.  It converts input strings into lists of chars and matches
on those:
1.268s CPU time, 0.038s GC time (major), 1320215 mutations, 15/5555 GCs 
(major/minor)

So it looks like this consistently improves performance for some
real-world tasks.

Attached you also find the output of the benchmark program in the
upstream irregex distribution, with an unpatched and a patched
Chicken (changed the code to (use irregex) instead of loading
it from the source dir).  You can see that for some benchmarks
the improvement is quite a lot larger than in the "real-world" ones.

There's a trade-off though: Regex compilation time is up for most of
these, for some *considerably* much.  I think this is an acceptable
tradeoff, as long as the compilation time doesn't go through the roof.
In earlier tests, before some tweaking, this was the case.

Anyone using irregex in practice is welcome to try this patch and report
back results, especially regarding compilation times.  Of course match
times are useful to know as well.

Perhaps we can bring down compilation times by making clever use of some
more specializations.  Better representations of tNFA multi-states are
also welcome.  Tweaking this representation and the algorithms that
manipulate them is what influenced compilation times the most during
the implementation of this stuff.

Cheers,
Peter
-- 
http://sjamaan.ath.cx
--
"The process of preparing programs for a digital computer
 is especially attractive, not only because it can be economically
 and scientifically rewarding, but also because it can be an aesthetic
 experience much like composing poetry or music."
                                                        -- Donald Knuth

0001-Convert-irregex-s-NFA-representation-to-support-tags.patch
Description: Text document

0002-Irregex-Implement-Laurikari-s-algorithm-for-tNFA-t-D.patch
Description: Text document

0003-Irregex-Small-test-changes-Add-regression-test-for-f.patch
Description: Text document

0004-Irregex-Convert-strings-with-charset-ranges-into-lar.patch
Description: Text document

irregex-bench-with-patch
Description: Text document

irregex-bench-without-patch
Description: Text document

[Prev in Thread]

Current Thread

[Next in Thread]

[Chicken-hackers] [PATCH] Update irregex to 0.9.0, Peter Bex <=
- Re: [Chicken-hackers] [PATCH] Update irregex to 0.9.0, Alex Shinn, 2012/09/17
  - Re: [Chicken-hackers] [PATCH] Update irregex to 0.9.0, Peter Bex, 2012/09/17
- Re: [Chicken-hackers] [PATCH] Update irregex to 0.9.0, Felix, 2012/09/18
  - Re: [Chicken-hackers] [PATCH] Update irregex to 0.9.0, Sven Hartrumpf, 2012/09/20
    - [Chicken-hackers] [PATCH] Re: [PATCH] Update irregex to 0.9.0, Peter Bex, 2012/09/23
    - Re: [Chicken-hackers] [PATCH] Re: [PATCH] Update irregex to 0.9.0, Felix, 2012/09/23

Prev by Date: Re: [Chicken-hackers] will we have a release this year?
Next by Date: Re: [Chicken-hackers] [PATCH] Update irregex to 0.9.0
Previous by thread: [Chicken-hackers] NetBSD w/ 4.8.0rc3
Next by thread: Re: [Chicken-hackers] [PATCH] Update irregex to 0.9.0
Index(es):
- Date
- Thread