Re: [Chicken-hackers] need irregex benchmark
From: matt welland
Subject: Re: [Chicken-hackers] need irregex benchmark
Date: Sun, 22 May 2011 15:26:18 -0700
On Sun, 2011-05-22 at 23:33 +0200, Peter Bex wrote:
> On Sun, May 22, 2011 at 11:25:44AM -0700, matt welland wrote:
> > Hi Felix,
> >
> > It isn't a particularly complex benchmark but my logpro app relies
> > heavily on regexes and I'm seeing somewhat slow performance. You
> > can get it here http://www.kiatoa.com/fossils/logpro.
>
> I tried to clone it, but wasn't able to update:
>
> $ fossil clone http://www.kiatoa.com/fossils/logpro logpro.fossil
> Bytes Cards Artifacts Deltas
> Sent: 53 1 0 0
> Received: 312 1 0 0
> Sent: 680 27 0 0
> fossil: *** time skew *** server is slow by 89.7 seconds
> Total network traffic: 641 bytes sent, 0 bytes received
> Rebuilding repository meta-data...
> 0.0% complete...
> project-id: (null)
> server-id: 78c9b6811813c697324d20c0ac39eb9850f38ca2
> admin-user: sjamaan (password is "7de28c")
> $ mkdir logpro
> $ cd logpro
> $ fossil open ../logpro.fossil
> $ ls
> _FOSSIL_
> $
My mistake. I created that repo with a newer version of fossil than the
one the server is running. Fossil's error message and failure behavior
are poor in this case, though. I think it will work if you try again.
> I have no idea what's going on here. Christian was able
> to rebuild the repo structure using today's fossil:
> http://paste.call-cc.org/paste?id=16cedff25f7cfa8bc83f6cb677bad9ba8e02274f
>
> > We process some
> > very large log files and have many waivers, ignores and error patterns.
> > The procedure using 50% of the cycles, misc:line-match-regexs, merely
> > applies a list of regexes to a line of text looking for the first match.
> > I suspect there is a better way (suggestions welcome).
>
> Perhaps you can construct one big regex from the list with an
> (or X Y Z)-like combination. If you need to know which regex
> was matched you can use (submatch X) or (submatch-named X), and
> query the result object to see which submatch is non-#f.
>
> When possible, irregex tries to collapse common prefixes. So
> (or "aaabz" "aaacz") will be compiled to something equivalent
> to (seq "aaa" (or "b" "c") "z"). Of course the prefix can be
> easily checked in a loop. The suffixes are two state changes
> which only need to check their two characters.
>
> If you match something like "aaaby" against "aaabz" and *then*
> against "aaacz", it will need to check the prefix twice.
>
> I apologize if this is obvious to you; I haven't been able to
> look at your code, so I'm basically just guessing. Perhaps
> you can enable the code browser for anonymous users?
It did occur to me that sticking all the regexes into a single regex
might be faster, but now I have an idea why :) However, the subtleties
concern me. Will it stop at the first match? Does it work left to right?
Thanks for the idea; I will experiment with it.
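For reference, a minimal sketch of the combined-regex suggestion as I
understand it. The three patterns, the rule names (err, warn, ign) and the
helper first-matching-rule are made up for illustration and are not taken
from logpro; the import line assumes CHICKEN 5, where irregex lives in the
(chicken irregex) module.

```scheme
;; Sketch only: patterns and rule names are hypothetical, not from logpro.
(import (chicken irregex))   ; assumption: CHICKEN 5; CHICKEN 4 used (use irregex)

;; One alternation, with each branch wrapped in a named submatch so the
;; match object records which branch took part in the match.
(define combined-rx
  (irregex '(or (submatch-named err  "ERROR")
                (submatch-named warn "WARNING")
                (submatch-named ign  "ignored"))))

;; Names to probe, in the order the rules were listed.
(define rule-names '(err warn ign))

;; Search LINE once with the combined regex. Return the name of the rule
;; whose submatch is non-#f, or #f when nothing in the line matches.
(define (first-matching-rule line)
  (let ((m (irregex-search combined-rx line)))
    (and m
         (let loop ((names rule-names))
           (cond ((null? names) #f)
                 ((irregex-match-substring m (car names)) (car names))
                 (else (loop (cdr names))))))))
```

One subtlety to check when experimenting: a single search reports the match
it finds in the line, which is not necessarily the rule listed first, so a
line containing text for several rules may not behave the same as trying the
regexes one by one in list order.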
> > On a separate note I'd like to turn logpro and megatest into egg apps
> > someday. I read the distributed egg system docs and will give it a go
> > one of these days....
>
> I'd be interested to hear how this goes; I was unable to convince
> fossil to create pre-packaged tarballs or point to the "tip" of
> a file through the web interface.
The more recent versions will make a tarball (browse the timeline to a
version node), and you can point to the tip of a file using the
doc/trunk path, for example:
http://www.kiatoa.com/cgi-bin/fossils/logpro/doc/trunk/logprocessor.scm
> Cheers,
> Peter