From: Jim Meyering
Subject: bug#21763: poor performance since grep 2.19 when comparing files with grep
Date: Mon, 26 Oct 2015 11:39:54 -0700
On Mon, Oct 26, 2015 at 5:54 AM, Bennett, Steve
<address@hidden> wrote:
> Apologies in advance if this is more of a "discuss" question, but it looks
> like a particular use-case shows a marked change in performance between
> recent versions of grep.
>
> A colleague mentioned a performance issue with grep to me, and it's puzzling
> me a bit.
> It turns out that he was using "grep -Fvif" to find lines in one file that
> are not present in another.
>
> Up until grep 2.18 this seems to work with linear performance and it takes
> less than 50ms to compare files up to about 20,000 lines.
> With grep 2.19 and later, even relatively small files are quite slow; runtime
> (and memory use) grows far faster than linearly (e.g. 300ms to compare 200
> lines, 1.5s to compare 400 lines, 5s to compare 600 lines).
>
> I've shown my colleague how to use sort and diff (and "comm", which I think
> is vastly underrated), but it made me wonder if this is a reasonable thing to
> expect grep to be able to do, and whether such a performance drop should be
> seen as a bug.
>
> The way he was using it, he had two (unsorted) data sets (about 6000 rows in
> each), with most lines being common, and he was just using:
> grep -Fvif FILE1 FILE2
> In his case, the older version of grep took way less than a second to run,
> but after he had upgraded his machine it took 20 minutes before running out
> of swap and seg faulting.
>
> In terms of comparing performance, I've found that the following works to
> compare performance (vary N to try different sized data files):
> N=600; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; time grep -Fvif $F $F; rm $F
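[As an aside on the sort/comm alternative mentioned above, here is a minimal
sketch of the comm idiom for finding lines of one file absent from another.
The file names are hypothetical examples; note that comm requires sorted
input, and that it does not reproduce grep's -i (case-insensitive) behavior.]

```shell
# Hypothetical example files:
printf '%s\n' apple banana cherry > a.txt
printf '%s\n' banana cherry date  > b.txt
# comm requires both inputs to be sorted:
sort a.txt > a.sorted
sort b.txt > b.sorted
# -1 suppresses lines unique to a.sorted, -3 suppresses common lines,
# leaving only lines unique to b.sorted (here: "date"), roughly what
# "grep -Fvxf a.txt b.txt" computes:
comm -13 a.sorted b.sorted
rm -f a.txt b.txt a.sorted b.sorted
```

[To approximate grep's -i as well, one could lowercase both inputs first,
e.g. with "tr '[:upper:]' '[:lower:]'", before sorting.]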
Thank you for reporting that.
Interesting: that progression (time vs. increasing N) is clearly
quadratic or worse when using a multibyte locale, but is linear with
LC_ALL=C. I suspect when you run "locale", it reports something like
en_US.utf8.
I.e., if you have no need for multi-byte matching, set LC_ALL=C, and
that idiom will be very quick, even for a million lines:
$ N=1000000; F=/tmp/zz.$$; seq -f '%g bottles of beer on the wall' 1 $N > $F; LC_ALL=C env time grep -Fvif $F $F; rm $F
0.78user 0.08system 0:00.86elapsed 100%CPU (0avgtext+0avgdata 131536maxresident)k
0inputs+0outputs (0major+32587minor)pagefaults 0swaps
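[For anyone wanting to reproduce the multibyte-vs-C-locale comparison, here
is a rough timing sketch, assuming bash (for the "time" keyword) and GNU seq;
the N values are arbitrary. With a multibyte locale the runtimes should grow
far faster than linearly in N, and stay roughly linear with LC_ALL=C.]

```shell
for N in 200 400 600; do
  F=$(mktemp)
  seq -f '%g bottles of beer on the wall' 1 "$N" > "$F"
  echo "== N=$N, current locale =="
  # "|| true" because grep exits 1 here: every line of $F matches a
  # pattern from $F, so -v prints nothing.
  time grep -Fvif "$F" "$F" || true
  echo "== N=$N, LC_ALL=C =="
  time LC_ALL=C grep -Fvif "$F" "$F" || true
  rm -f "$F"
done
```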
Currently, I am not planning even to investigate this before the imminent release.