[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: [Chicken-hackers] [PATCH] Update irregex to 0.9.0
From: |
Alex Shinn |
Subject: |
Re: [Chicken-hackers] [PATCH] Update irregex to 0.9.0 |
Date: |
Mon, 17 Sep 2012 14:01:25 +0900 |
Thanks Peter!
On Mon, Sep 17, 2012 at 4:46 AM, Peter Bex <address@hidden> wrote:
> Hi everyone,
>
> Because the version of master was bumped to 4.8.1, I decided to go ahead
> and update our core irregex to the latest version. The attached patches
> (all 4 of them) synchronizes us with upstream 0.9.0 irregex. This gives
> some performance improvements for submatches.
>
> Before 0.9, irregex would rely on backtracking whenever submatches
> occurred in the SRE. The new version uses a clever algorithm by
> Ville Laurikari based on "tagged" NFAs (or tNFAs), which are compiled
> to DFAs which can perform submatch bookkeeping while processing the
> input, in one pass. You can get more info by reading the thesis that
> introduced this: http://laurikari.net/ville/regex-submatch.pdf
> There's also a very short paper that describes the concepts:
> http://laurikari.net/ville/spire2000-tnfa.pdf
> But beware, the short paper uses slightly different notation,
> terminology, and algorithms(!)
>
> Irregex (and thus, Chicken) is one of the few implementations of
> Laurikari's algorithm. The others that I've found were:
>
> - Ville's own "TRE" C library (http://laurikari.net/tre/). He wrote this
> as an industrial-strength regex lib after completing the prototype
> for his thesis in some obscure C++ macro hackery. This library is
> included in NetBSD 6.0 and if I understand correctly there are plans
> to replace Henry Spencer's regexp engine in libc with TRE.
> - A regex class for the Tango library for the "D" language
>
> (http://www.dsource.org/projects/tango/docs/current/htmlsrc/tango.text.Regex.html,
> http://www.dsource.org/projects/tango/docs/current/tango.text.Regex.html).
> I used this implementation to gain a detailed working understanding of the
> algorithm where the paper was too hand-wavy.
> - A tNFA Emacs library (http://www.emacswiki.org/emacs/TaggedNFA)
> The heavy imperative coding style and (to me) strange naming
> conventions made this useless to understand the algorithm.
> Man, people will write anything in Emacs Lisp!
> - Haskell has a pluggable Regex engine, which also has a tNFA backend.
> (http://hackage.haskell.org/cgi-bin/hackage-scripts/package/regex-tdfa)
> However, they've invented an own concept ("Orbits") to fit their type
> system. This boggled my mind, so I kind of left it at that.
> Design notes about Orbits: http://www.haskell.org/haskellwiki/RegexpDesign
>
> One possibly interesting aspect of this algorithm is that, according
> to Laurikari's thesis, with some modifications it should be able to
> perform "approximate" or "fuzzy" matching. NetBSD 6 now includes an
> "agrep" command which leverages TRE's ability to do this. I was only
> interested in the performance improvements so, didn't look into this.
>
> I've done some quick benchmarks. There's the old "Chicken vs Perl"
> chestnut:
> http://lists.nongnu.org/archive/html/chicken-users/2011-09/msg00131.html
>
> Without patch:
> interpreted: csi -s href-extraction.scm 3.75s user 0.11s system 97% cpu
> 3.962 total
> compiled (-O3): ./href-extraction 2.53s user 0.05s system 98% cpu 2.605 total
>
> With patch:
> interpreted: csi -s href-extraction.scm 2.01s user 0.08s system 98% cpu
> 2.126 total
> compiled (-O3): ./href-extraction 1.34s user 0.08s system 98% cpu 1.434 total
>
> Unfortunately, Perl still takes our lunch:
> perl extract.pl 0.05s user 0.01s system 83% cpu 0.072 total
>
> I'm unsure what the bottleneck is here. Warrants more investigation.
Note this is an apples to oranges comparison - Perl
is using an optimistic algorithm that does well in common
cases like this. It will outperform even highly tuned C
implementations of NFA matching like TRE or Google's RE
here. On the other hand, there are patterns for which
Perl takes exponential time.
> As another test, I ran the irregex alternative implementation of uri-generic.
> This is simply the output of running the following at the REPL after
> loading the compiled uri-generic.irregex.so:
> #;1> ,t (let loop ((i 0)) (unless (> i 10000) (uri-reference
> "http://call-with-current-continuation.org/foo/bar?mooh#qux") (loop (add1
> i))))
>
> Without patch:
> 1.683s CPU time, 0.083s GC time (major), 600142 mutations, 26/4716 GCs
> (major/minor)
> With patch:
> 1.313s CPU time, 0.073s GC time (major), 600143 mutations, 23/4720 GCs
> (major/minor)
> This is very close to the "main" implementation, which is based on the
> matchable egg. It converts input strings into lists of chars and matches
> on those:
> 1.268s CPU time, 0.038s GC time (major), 1320215 mutations, 15/5555 GCs
> (major/minor)
>
> So it looks like this consistently improves performance for some
> real-world tasks.
>
> Attached you also find the output of the benchmark program in the
> upstream irregex distribution, with an unpatched and a patched
> Chicken (changed the code to (use irregex) instead of loading
> it from the source dir). You can see that for some benchmarks
> the improvement is quite a lot larger than in the "real-world" ones.
>
> There's a trade-off though: Regex compilation time is up for most of
> these, for some *considerably* much. I think this is an acceptable
> tradeoff, as long as the compilation time doesn't go through the roof.
> In earlier tests, before some tweaking, this was the case.
>
> Anyone using irregex in practice is welcome to try this patch and report
> back results, especially regarding compilation times. Of course match
> times are useful to know as well.
>
> Perhaps we can bring down compilation times by making clever use of some
> more specializations. Better representations of tNFA multi-states are
> also welcome. Tweaking this representation and the algorithms that
> manipulate them is what influenced compilation times the most during
> the implementation of this stuff.
>
> Cheers,
> Peter
> --
> http://sjamaan.ath.cx
> --
> "The process of preparing programs for a digital computer
> is especially attractive, not only because it can be economically
> and scientifically rewarding, but also because it can be an aesthetic
> experience much like composing poetry or music."
> -- Donald Knuth
>
> _______________________________________________
> Chicken-hackers mailing list
> address@hidden
> https://lists.nongnu.org/mailman/listinfo/chicken-hackers
>