bug-grep
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

bug#34053: [PATCH] grep: fix slow for multiple word matching


From: Norihiro Tanaka
Subject: bug#34053: [PATCH] grep: fix slow for multiple word matching
Date: Fri, 20 Dec 2019 00:06:09 +0900

On Wed, 18 Dec 2019 18:55:01 -0800
Jim Meyering <address@hidden> wrote:

> On Tue, Nov 26, 2019 at 2:38 PM Norihiro Tanaka <address@hidden> wrote:
> > On Sun, 13 Jan 2019 08:45:47 +0900
> > Norihiro Tanaka <address@hidden> wrote:
> > > grep uses KWset matcher for multiple word matching.  It is very slow when
> > > most of the parts matched to a pattern are not words.  So, if a part 
> > > firstly
> > > matched to pattern is not a word, use the grep matcher to match for its 
> > > line.
> > >
> > > By the way, if START_PTR is set, grep matcher uses regex matcher which is
> > > very slow to match words.  Therefore, we use grep matcher when only 
> > > START_PTR
> > > is not set.
> > >
> > > Example, although it is a very extreme case...
> > >
> > > $ cat >pat <<EOF
> > > 0
> > > 00 0
> > > 00 00 0
> > > 00 00 00 0
> > > 00 00 00 00 0
> > > 00 00 00 00 00 0
> > > 00 00 00 00 00 00 0
> > > 00 00 00 00 00 00 00 0
> > > 00 00 00 00 00 00 00 00 0
> > > 00 00 00 00 00 00 00 00 00 0
> > > 00 00 00 00 00 00 00 00 00 00 0
> > > 00 00 00 00 00 00 00 00 00 00 00 0
> > > 00 00 00 00 00 00 00 00 00 00 00 00 0
> > > 00 00 00 00 00 00 00 00 00 00 00 00 00 0
> > > EOF
> > > $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | head -1000000 >inp
> > >
> > > $ env LC_ALL=C time -p src/grep -wf pat inp
> > > real 5.75
> > > user 5.67
> > > sys 0.02
> > >
> > > Retry after applied the patch.
> > >
> > > $ env LC_ALL=C time -p src/grep -wf pat inp
> > > real 0.32
> > > user 0.31
> > > sys 0.00
> > >
> > > Thanks,
> > > Norihiro
> >
> > I fix previous patch.
> >
> > This change should not be applied for multibyte locales, as grep matcher
> > uses regex with pattern with invert charclass in word matching in
> > multibyte locales and it is very slow.
> 
> Good timing, in both senses :-) Thank you.
> This confirms a ~15x speedup for that test case:
> 
> $ t="$(yes 00 |head -14|paste -sd' ') 0"; while :; do echo $t; test
> "$t" = 0 && break; t=${t#00 }; done > pat;cat pat
> 00 00 00 00 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 00 0
> 00 00 00 00 00 00 0
> 00 00 00 00 00 0
> 00 00 00 00 0
> 00 00 00 0
> 00 00 0
> 00 0
> 0
> $ env LC_ALL=C time -p grep -wf pat inp
> ...
> before: real 4.75
> after: real 0.30
> 
> Similarly, retrying my counterexample, I now see a significant
> improvement (needed a larger input file to demonstrate):
> 
> $ yes '00 00 00 00 00 00 00 00 00 00 00 00 00' | head -5000000 > inp
> 
> Best of ten, running this, before vs after:
> $ env LC_ALL=C time -p grep -w 000 inp
> 0.62s vs 0.43s
> 
> I have revised the wording in the commit message. Please ACK that. The
> only other change was in formatting, to split a line that went one
> byte past the 80-column maximum.
> 
> l'd like to push the attached some time tomorrow.

Thanks, I agree to your adjuestment for commit message.






reply via email to

[Prev in Thread] Current Thread [Next in Thread]