Grep with UTF8 is slow
From: A. Hjortland
Subject: Grep with UTF8 is slow
Date: 24 Apr 2003 21:25:48 +0200
Grep (at least 2.5 and 2.5.1) is *very* slow in cases where most input
lines match and the charset is UTF8.
Look at the second case (3 seconds!):
----------------------------------
> export LC_ALL=no_NO.UTF8
> time find /usr/bin | grep -c zzz
0
real 0m0.035s
user 0m0.000s
sys 0m0.030s
> time find /usr/bin | grep -c bin
2054
real 0m3.364s
user 0m3.320s
sys 0m0.000s
----------------------------------
> export LC_ALL=C
> time find /usr/bin | grep -c zzz
0
real 0m0.021s
user 0m0.000s
sys 0m0.010s
> time find /usr/bin | grep -c bin
2054
real 0m0.021s
user 0m0.010s
sys 0m0.010s
----------------------------------
We tracked the problem down to EGexecute in search.c.
As I understand it, the function scans a buffer and returns _one_ match
from the buffer each time it is called. The problem is that every call to
EGexecute calls check_multibyte_string (also in search.c) once, on
the *entire buffer*. If N lines match and the buffer contains, say,
1000 lines, check_multibyte_string has to process N*1000 lines in
total, not just the 1000 lines in the buffer. Hence the low performance.
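
To put numbers on the reasoning above, here is a small cost model in C. It is only a sketch of the arithmetic, not code from grep: both function names are made up for illustration, and it assumes (as described above) that every call rescans the whole buffer.

```c
#include <stddef.h>

/* Hypothetical cost model, not grep code.
   Counts how many lines get classified when the whole buffer is
   rescanned once for every match that is returned. */
size_t lines_scanned_full_rescan(size_t buffer_lines, size_t matching_lines)
{
    size_t total = 0;
    for (size_t call = 0; call < matching_lines; call++)
        total += buffer_lines;  /* each call reclassifies the entire buffer */
    return total;
}

/* The same buffer classified a single time, however many lines match. */
size_t lines_scanned_once(size_t buffer_lines, size_t matching_lines)
{
    (void) matching_lines;      /* cost no longer depends on the match count */
    return buffer_lines;
}
```

With a 1000-line buffer in which every line matches, the first function returns 1000000 and the second returns 1000, which is the same kind of gap as between the two timings above.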
Patch suggestion attached.
Warning: This is just something I whipped together rather haphazardly. No
guarantees here :)
How the patch works:
Instead of parsing the _entire buffer at once_ with
check_multibyte_string, parse it _incrementally_ in chunks of 100 bytes,
as far as needed.
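
The idea can be sketched roughly as follows. This is not the attached patch: the struct and function names are hypothetical, and the sketch decodes UTF-8 lead bytes by hand (without validating continuation bytes) so that it does not depend on the locale. Only the chunk size of 100 bytes is taken from the description above.

```c
#include <stddef.h>
#include <stdlib.h>

#define CHUNK 100  /* chunk size from the description above */

/* Hypothetical state; none of these names come from search.c. */
struct mb_scan {
    const unsigned char *buf;
    size_t size;
    size_t done;           /* bytes [0, done) are already classified */
    unsigned char *props;  /* 1 = byte starts a character, 0 = it does not */
    size_t work;           /* bytes classified so far, for illustration */
};

/* Length of a UTF-8 sequence judged by its lead byte only. */
static size_t utf8_len(unsigned char c)
{
    if (c < 0x80) return 1;
    if ((c & 0xE0) == 0xC0) return 2;
    if ((c & 0xF0) == 0xE0) return 3;
    if ((c & 0xF8) == 0xF0) return 4;
    return 1;  /* stray byte: treat it as a single unit */
}

int mb_scan_init(struct mb_scan *s, const char *buf, size_t size)
{
    s->buf = (const unsigned char *) buf;
    s->size = size;
    s->done = 0;
    s->work = 0;
    s->props = calloc(size ? size : 1, 1);
    return s->props ? 0 : -1;
}

/* Classify further chunks, but only until byte `upto` is covered. */
void mb_scan_extend(struct mb_scan *s, size_t upto)
{
    while (s->done <= upto && s->done < s->size) {
        size_t limit = s->done + CHUNK;
        if (limit > s->size)
            limit = s->size;
        while (s->done < limit) {
            size_t len = utf8_len(s->buf[s->done]);
            if (len > s->size - s->done)
                len = s->size - s->done;  /* truncated final character */
            s->props[s->done] = 1;
            s->work += len;
            s->done += len;  /* may pass `limit` slightly to finish a character */
        }
    }
}

/* Query used by the matcher: is byte i the start of a character? */
int mb_is_char_start(struct mb_scan *s, size_t i)
{
    mb_scan_extend(s, i);
    return i < s->size && s->props[i];
}

void mb_scan_free(struct mb_scan *s)
{
    free(s->props);
    s->props = NULL;
}
```

A caller would run mb_scan_init once per buffer and then ask mb_is_char_start for each candidate match position. Positions that were already classified cost nothing extra, so the total work stays bounded by the buffer size instead of growing with the number of matches.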
Obviously, the other execute-functions must be patched too.
--Håkon A. Hjortland
grep-2.5_search.c_utf8speed.diff
Description: Text document