Grep with UTF8 is slow

bug-gnu-utils

From:	A. Hjortland
Subject:	Grep with UTF8 is slow
Date:	24 Apr 2003 21:25:48 +0200

Grep (at least 2.5 and 2.5.1) is *very* slow in cases where most input
lines match and the charset is UTF8.

Look at the second case (3 seconds!):
----------------------------------
> export LC_ALL=no_NO.UTF8
> time find /usr/bin | grep -c zzz
0

real    0m0.035s
user    0m0.000s
sys     0m0.030s
> time find /usr/bin | grep -c bin
2054

real    0m3.364s
user    0m3.320s
sys     0m0.000s
----------------------------------
> export LC_ALL=C
> time find /usr/bin | grep -c zzz
0

real    0m0.021s
user    0m0.000s
sys     0m0.010s
> time find /usr/bin | grep -c bin
2054

real    0m0.021s
user    0m0.010s
sys     0m0.010s
----------------------------------

We tracked the problem down to EGexecute in search.c.
As I understand it, the function scans a buffer and returns _one_ match
from the buffer each time it's called. The problem is that for each call
to EGexecute, check_multibyte_string (also in search.c) is called once,
on the *entire buffer*. If all N lines match and the buffer contains,
say, 1000 lines, check_multibyte_string has to process N*1000 lines in
total, not just N lines. Hence the low performance.
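
The cost pattern described above can be reproduced with a small model.
This is only a sketch, not the code from search.c: classify_buffer,
find_one_match, count_matching_lines and the bytes_classified counter
are hypothetical names, and the "classification" merely touches each
byte so that the amount of work can be counted.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Total bytes visited by the stand-in classifier; a proxy for CPU time. */
static size_t bytes_classified = 0;

/* Hypothetical stand-in for check_multibyte_string: walks the whole buffer. */
static void classify_buffer(const char *buf, size_t len)
{
    for (size_t i = 0; i < len; i++) {
        (void)buf[i];
        bytes_classified++;
    }
}

/* Model of one EGexecute-style call: classify the ENTIRE buffer, then
   return the offset just past the first line at or after `start` that
   contains `needle`, or 0 if no such line exists. */
static size_t find_one_match(const char *buf, size_t len, size_t start,
                             const char *needle)
{
    assert(buf != NULL && needle != NULL);
    classify_buffer(buf, len);              /* repeated on every call */
    size_t nlen = strlen(needle);
    while (start < len) {
        const char *eol = memchr(buf + start, '\n', len - start);
        size_t line_end = eol ? (size_t)(eol - buf) : len;
        size_t next = line_end < len ? line_end + 1 : len;
        for (size_t i = start; i + nlen <= line_end; i++)
            if (memcmp(buf + i, needle, nlen) == 0)
                return next;                /* one match per call */
        start = next;
    }
    return 0;
}

/* Count matching lines the way a caller would: one call per match. */
static size_t count_matching_lines(const char *buf, size_t len,
                                   const char *needle)
{
    size_t count = 0, pos = 0, next;
    while (pos < len && (next = find_one_match(buf, len, pos, needle)) != 0) {
        count++;
        pos = next;
    }
    return count;
}
```

With a buffer in which every line matches, bytes_classified grows as
(number of matches) * (buffer length); with a pattern that matches
nothing, the buffer is classified only once. That mirrors the gap
between the "bin" and "zzz" timings above.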

Patch suggestion attached.
Warning: This is just something I whipped together rather haphazardly.
No guarantees here :)

How the patch works:
Instead of parsing the _entire buffer at once_ with
check_multibyte_string, parse it _incrementally_ in chunks of 100 bytes,
as far as needed.
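
A minimal sketch of the incremental idea follows. It is not the
attached patch. The names mb_scan, mb_scan_ensure and mb_is_char_start
are hypothetical, and the hand-rolled UTF-8 length check is an
assumption made to keep the example independent of the locale; real
code would more likely go through the C library's multibyte functions.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

enum { CHUNK = 100 };   /* classify this many bytes per step */

typedef struct {
    const unsigned char *buf;
    size_t len;
    unsigned char *props;  /* props[i] = length of the character starting
                              at i, or 0 for a continuation byte */
    size_t done;           /* bytes [0, done) have been classified */
} mb_scan;

/* Length of the UTF-8 character at p; malformed or truncated
   sequences are treated as single bytes. */
static size_t utf8_len(const unsigned char *p, size_t avail)
{
    size_t n;
    if (p[0] < 0x80) return 1;
    else if (p[0] >= 0xC2 && p[0] <= 0xDF) n = 2;
    else if (p[0] >= 0xE0 && p[0] <= 0xEF) n = 3;
    else if (p[0] >= 0xF0 && p[0] <= 0xF4) n = 4;
    else return 1;
    if (n > avail) return 1;
    for (size_t i = 1; i < n; i++)
        if ((p[i] & 0xC0) != 0x80) return 1;
    return n;
}

/* Returns 0 on success, -1 if the property array cannot be allocated. */
static int mb_scan_init(mb_scan *s, const char *buf, size_t len)
{
    s->buf = (const unsigned char *)buf;
    s->len = len;
    s->done = 0;
    s->props = calloc(len ? len : 1, 1);
    return s->props ? 0 : -1;
}

/* Classify further chunks until bytes [0, upto) are covered. A character
   that straddles a chunk boundary is classified whole, so `done` may
   overshoot the boundary by up to three bytes. */
static void mb_scan_ensure(mb_scan *s, size_t upto)
{
    if (upto > s->len) upto = s->len;
    while (s->done < upto) {
        size_t stop = s->done + CHUNK;
        if (stop > s->len) stop = s->len;
        while (s->done < stop) {
            size_t n = utf8_len(s->buf + s->done, s->len - s->done);
            s->props[s->done] = (unsigned char)n;
            s->done += n;
        }
    }
}

/* 1 if byte i starts a character, 0 if it is a continuation byte. */
static int mb_is_char_start(mb_scan *s, size_t i)
{
    assert(i < s->len);
    mb_scan_ensure(s, i + 1);
    return s->props[i] != 0;
}

static void mb_scan_free(mb_scan *s)
{
    free(s->props);
    s->props = NULL;
}
```

A caller that finds its match near the start of the buffer then pays
for roughly one chunk rather than for the whole buffer. Note that the
saving only holds if the state in mb_scan is kept across calls for the
same buffer, or if each call starts classifying at a known character
boundary such as the start of a line.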


Obviously, the other execute-functions must be patched too.


--Håkon A. Hjortland

grep-2.5_search.c_utf8speed.diff
Description: Text document
