From: Paolo Bonzini
Subject: Re: [PATCH 16/17] grep: remove check_multibyte_string, fix non-UTF8 missed match
Date: Sun, 14 Mar 2010 13:33:00 +0100
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.7) Gecko/20100120 Fedora/3.0.1-1.fc12 Lightning/1.0b2pre Thunderbird/3.0.1
On 03/14/2010 02:16 AM, Norihiro Tanaka wrote:
> Hi, with this patch, `dfaexec' will be called even when the multibyte check fails for a simple pattern that contains no wildcard or repetition expression. Is that intended?
Yes, see for example bug 23814. There, I'm searching for \xAA\xBB; kwset could give an exact match, but in \xBB\xAA\xBB\xAA it only finds an unaligned match, one that starts in the middle of a multibyte character. Note that the DFA search runs only on the line that kwset selected anyway. Also, for UTF-8 the is_mb_middle test should always succeed unless an invalid UTF-8 character gets into the DFA's "must" kwset.
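To make the unaligned-match case concrete, here is a minimal sketch in Python. It is not grep's code: it assumes a toy double-byte charset in which any byte >= 0x80 starts a two-byte character, and the helper names (char_len, is_char_boundary, find_aligned, utf8_is_boundary) are hypothetical. It only shows why a byte-level search can report a hit that no character-level search would accept, and why valid UTF-8 does not have this problem.

```python
def char_len(buf: bytes, i: int) -> int:
    """Length of the character starting at buf[i] in the toy charset."""
    # Assumption: a byte >= 0x80 is the lead byte of a two-byte character.
    if buf[i] >= 0x80 and i + 1 < len(buf):
        return 2
    return 1


def is_char_boundary(buf: bytes, offset: int) -> bool:
    """Walk from the start of the line; True if offset begins a character."""
    i = 0
    while i < offset:
        i += char_len(buf, i)
    return i == offset


def find_aligned(buf: bytes, pattern: bytes) -> int:
    """First byte-level match that starts on a character boundary, or -1."""
    start = 0
    while True:
        pos = buf.find(pattern, start)
        if pos < 0:
            return -1
        if is_char_boundary(buf, pos):
            return pos
        start = pos + 1


def utf8_is_boundary(buf: bytes, pos: int) -> bool:
    """In UTF-8, continuation bytes have the form 0b10xxxxxx."""
    return (buf[pos] & 0xC0) != 0x80


line = b"\xbb\xaa\xbb\xaa"       # two characters: BB AA, BB AA
pattern = b"\xaa\xbb"

raw = line.find(pattern)          # byte-level hit at offset 1 (mid-character)
aligned = find_aligned(line, pattern)   # no character-aligned hit

# UTF-8 is self-synchronizing: a valid pattern never starts with a
# continuation byte, so every byte-level hit starts on a boundary.
text = "né".encode("utf-8")
hit = text.find("é".encode("utf-8"))
utf8_ok = utf8_is_boundary(text, hit)
```

In this sketch the byte-level search reports offset 1, which falls inside the first character, while the boundary-aware search finds nothing. That is the situation in which the line still has to be handed to the slower matcher instead of trusting the byte-level result.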
The alternative is to make kwset multibyte-aware, which is probably not impossible but not easy either; the only way I know to do it is to specialize kwset with knowledge of the particular charsets, which is not good.
Paolo