bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#34053: [PATCH] grep: fix slow for multiple word matching


From: Jim Meyering
Subject: bug#34053: [PATCH] grep: fix slow for multiple word matching
Date: Wed, 18 Dec 2019 18:55:01 -0800

On Tue, Nov 26, 2019 at 2:38 PM Norihiro Tanaka <address@hidden> wrote:
> On Sun, 13 Jan 2019 08:45:47 +0900
> Norihiro Tanaka <address@hidden> wrote:
> > grep uses KWset matcher for multiple word matching.  It is very slow when
> > most of the parts matched to a pattern are not words.  So, if a part firstly
> > matched to pattern is not a word, use the grep matcher to match for its 
> > line.
> >
> > By the way, if START_PTR is set, grep matcher uses regex matcher which is
> > very slow to match words.  Therefore, we use grep matcher when only 
> > START_PTR
> > is not set.
> >
> > Example, although it is a very extreme case...
> >
> > $ cat >pat <<EOF
> > 0
> > 00 0
> > 00 00 0
> > 00 00 00 0
> > 00 00 00 00 0
> > 00 00 00 00 00 0
> > 00 00 00 00 00 00 0
> > 00 00 00 00 00 00 00 0
> > 00 00 00 00 00 00 00 00 0
> > 00 00 00 00 00 00 00 00 00 0
> > 00 00 00 00 00 00 00 00 00 00 0
> > 00 00 00 00 00 00 00 00 00 00 00 0
> > 00 00 00 00 00 00 00 00 00 00 00 00 0
> > 00 00 00 00 00 00 00 00 00 00 00 00 00 0
> > EOF
> > $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | head -1000000 >inp
> >
> > $ env LC_ALL=C time -p src/grep -wf pat inp
> > real 5.75
> > user 5.67
> > sys 0.02
> >
> > Retry after applied the patch.
> >
> > $ env LC_ALL=C time -p src/grep -wf pat inp
> > real 0.32
> > user 0.31
> > sys 0.00
> >
> > Thanks,
> > Norihiro
>
> I fix previous patch.
>
> This change should not be applied for multibyte locales, as grep matcher
> uses regex with pattern with invert charclass in word matching in
> multibyte locales and it is very slow.

Good timing, in both senses :-) Thank you.
This confirms a ~15x speedup for that test case:

$ t="$(yes 00 |head -14|paste -sd' ') 0"; while :; do echo $t; test
"$t" = 0 && break; t=${t#00 }; done > pat;cat pat
00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 00 0
00 00 00 00 00 00 00 0
00 00 00 00 00 00 0
00 00 00 00 00 0
00 00 00 00 0
00 00 00 0
00 00 0
00 0
0
$ env LC_ALL=C time -p grep -wf pat inp
...
before: real 4.75
after: real 0.30

Similarly, retrying my counterexample, I now see a significant
improvement (needed a larger input file to demonstrate):

$ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | head -5000000 > inp

Best of ten, running this, before vs after:
$ env LC_ALL=C time -p grep -w 000 inp
0.62s vs 0.43s

I have revised the wording in the commit message. Please ACK that. The
only other change was in formatting, to split a line that went one
byte past the 80-column maximum.

l'd like to push the attached some time tomorrow.

Attachment: kws.diff
Description: Binary data


reply via email to

[Prev in Thread] Current Thread [Next in Thread]