bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#43527: [PATCH] grep: avoid unneeded compilation of regex


From: Norihiro Tanaka
Subject: bug#43527: [PATCH] grep: avoid unneeded compilation of regex
Date: Tue, 22 Sep 2020 23:54:10 +0900

On Mon, 21 Sep 2020 17:33:25 -0700
Jim Meyering <jim@meyering.net> wrote:

> On Sun, Sep 20, 2020 at 6:34 PM Jim Meyering <jim@meyering.net> wrote:
> >
> > On Sun, Sep 20, 2020 at 12:17 AM Norihiro Tanaka <noritnk@kcn.ne.jp> wrote:
> > > Hi,
> > > Performace for as following case is fixed in bug#43040.
> > >
> > >   $ yes 0 | head -100000 | sed '$s/././' >pat
> > >   $ grep -vf pat /dev/null
> > >
> > > However, still slow and a lot of memory wasted for the following cases.
> > >
> > >   $ grep -vf /usr/share/dict/linux.words /usr/share/dict/linux.words
> > >
> > > This bug is introduced in commit abb7f4f2325f26f930ff59b702fe42568a8e81e7.
> > > Though it's an optimization for patterns with backreferences, it seems
> > > to cause performance degradation in many cases due to regex
> > > implementation issues.
> > >
> > > grep needs regex engine when patterns is not supported by DFA engine,
> > > and when either given only matching (-o) or color option (--color) is
> > > given.
> > >
> > > In other words, if none of them are met, grep only uses regex to check
> > > the syntax.  grep avoids compilation of regex not to check syntax by this
> > > patch.
> >
> > Yikes. Thank you!
> > That exposes (and fixes in this common case) a problem that makes grep
> > require memory that is quadratic in the number of regular expressions.
> >
> > To illustrate, I ran some timings.
> > With only 80,000 lines of /usr/share/dict/linux.words, the following
> > would use 100GB of RSS and take 3 minutes. With the fix, it used less
> > than 400MB and took less than one second.
> >
> >   head -$N /usr/share/dict/linux.words > w; grep -vf w w
> >
> > N            Mem(k): Old         New
> > 20000     6341188 (2.4s)    103168
> > 40000    25241288 (9.29s)   199188 (0.31s)
> > 80000   100547432 (180s)    392872 (0.66s)
> >
> > I've just pushed the gnulib-adjusting patch and will push the other soon.
> > I'll also add a test and a NEWS item in a separate patch.
> 
> Here are the two patches (tested on top of a third that updates to
> latest gnulib). I'll await an 'ok' from Norihiro Tanaka before
> pushing, since commit-log metadata is essentially immutable once
> pushed.

Great, thank you.  I confirmed it.






reply via email to

[Prev in Thread] Current Thread [Next in Thread]