|
From: | Peter Bex |
Subject: | [Chicken-hackers] [PATCH] Update irregex to 0.9.0 |
Date: | Sun, 16 Sep 2012 21:46:44 +0200 |
User-agent: | Mutt/1.4.2.3i |
Hi everyone, Because the version of master was bumped to 4.8.1, I decided to go ahead and update our core irregex to the latest version. The attached patches (all 4 of them) synchronizes us with upstream 0.9.0 irregex. This gives some performance improvements for submatches. Before 0.9, irregex would rely on backtracking whenever submatches occurred in the SRE. The new version uses a clever algorithm by Ville Laurikari based on "tagged" NFAs (or tNFAs), which are compiled to DFAs which can perform submatch bookkeeping while processing the input, in one pass. You can get more info by reading the thesis that introduced this: http://laurikari.net/ville/regex-submatch.pdf There's also a very short paper that describes the concepts: http://laurikari.net/ville/spire2000-tnfa.pdf But beware, the short paper uses slightly different notation, terminology, and algorithms(!) Irregex (and thus, Chicken) is one of the few implementations of Laurikari's algorithm. The others that I've found were: - Ville's own "TRE" C library (http://laurikari.net/tre/). He wrote this as an industrial-strength regex lib after completing the prototype for his thesis in some obscure C++ macro hackery. This library is included in NetBSD 6.0 and if I understand correctly there are plans to replace Henry Spencer's regexp engine in libc with TRE. - A regex class for the Tango library for the "D" language (http://www.dsource.org/projects/tango/docs/current/htmlsrc/tango.text.Regex.html, http://www.dsource.org/projects/tango/docs/current/tango.text.Regex.html). I used this implementation to gain a detailed working understanding of the algorithm where the paper was too hand-wavy. - A tNFA Emacs library (http://www.emacswiki.org/emacs/TaggedNFA) The heavy imperative coding style and (to me) strange naming conventions made this useless to understand the algorithm. Man, people will write anything in Emacs Lisp! - Haskell has a pluggable Regex engine, which also has a tNFA backend. (http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa) However, they've invented an own concept ("Orbits") to fit their type system. This boggled my mind, so I kind of left it at that. Design notes about Orbits: http://www.haskell.org/haskellwiki/RegexpDesign One possibly interesting aspect of this algorithm is that, according to Laurikari's thesis, with some modifications it should be able to perform "approximate" or "fuzzy" matching. NetBSD 6 now includes an "agrep" command which leverages TRE's ability to do this. I was only interested in the performance improvements so, didn't look into this. I've done some quick benchmarks. There's the old "Chicken vs Perl" chestnut: http://lists.nongnu.org/archive/html/chicken-users/2011-09/msg00131.html Without patch: interpreted: csi -s href-extraction.scm 3.75s user 0.11s system 97% cpu 3.962 total compiled (-O3): ./href-extraction 2.53s user 0.05s system 98% cpu 2.605 total With patch: interpreted: csi -s href-extraction.scm 2.01s user 0.08s system 98% cpu 2.126 total compiled (-O3): ./href-extraction 1.34s user 0.08s system 98% cpu 1.434 total Unfortunately, Perl still takes our lunch: perl extract.pl 0.05s user 0.01s system 83% cpu 0.072 total I'm unsure what the bottleneck is here. Warrants more investigation. As another test, I ran the irregex alternative implementation of uri-generic. This is simply the output of running the following at the REPL after loading the compiled uri-generic.irregex.so: #;1> ,t (let loop ((i 0)) (unless (> i 10000) (uri-reference "http://call-with-current-continuation.org/foo/bar?mooh#qux") (loop (add1 i)))) Without patch: 1.683s CPU time, 0.083s GC time (major), 600142 mutations, 26/4716 GCs (major/minor) With patch: 1.313s CPU time, 0.073s GC time (major), 600143 mutations, 23/4720 GCs (major/minor) This is very close to the "main" implementation, which is based on the matchable egg. It converts input strings into lists of chars and matches on those: 1.268s CPU time, 0.038s GC time (major), 1320215 mutations, 15/5555 GCs (major/minor) So it looks like this consistently improves performance for some real-world tasks. Attached you also find the output of the benchmark program in the upstream irregex distribution, with an unpatched and a patched Chicken (changed the code to (use irregex) instead of loading it from the source dir). You can see that for some benchmarks the improvement is quite a lot larger than in the "real-world" ones. There's a trade-off though: Regex compilation time is up for most of these, for some *considerably* much. I think this is an acceptable tradeoff, as long as the compilation time doesn't go through the roof. In earlier tests, before some tweaking, this was the case. Anyone using irregex in practice is welcome to try this patch and report back results, especially regarding compilation times. Of course match times are useful to know as well. Perhaps we can bring down compilation times by making clever use of some more specializations. Better representations of tNFA multi-states are also welcome. Tweaking this representation and the algorithms that manipulate them is what influenced compilation times the most during the implementation of this stuff. Cheers, Peter -- http://sjamaan.ath.cx -- "The process of preparing programs for a digital computer is especially attractive, not only because it can be economically and scientifically rewarding, but also because it can be an aesthetic experience much like composing poetry or music." -- Donald Knuth
0001-Convert-irregex-s-NFA-representation-to-support-tags.patch
Description: Text document
0002-Irregex-Implement-Laurikari-s-algorithm-for-tNFA-t-D.patch
Description: Text document
0003-Irregex-Small-test-changes-Add-regression-test-for-f.patch
Description: Text document
0004-Irregex-Convert-strings-with-charset-ranges-into-lar.patch
Description: Text document
irregex-bench-with-patch
Description: Text document
irregex-bench-without-patch
Description: Text document
[Prev in Thread] | Current Thread | [Next in Thread] |