Re: Question about critical_factorization() in the Two-Way algorithm
From: Eric Blake
Subject: Re: Question about critical_factorization() in the Two-Way algorithm
Date: Mon, 21 Jun 2010 17:30:53 -0600
User-agent: Mozilla/5.0 (X11; U; Linux x86_64; en-US; rv:1.9.1.9) Gecko/20100430 Fedora/3.0.4-3.fc13 Lightning/1.0b2pre Mnenhy/0.8.2 Thunderbird/3.0.4
[adding bug-gnulib]
On 06/20/2010 10:00 PM, Carlos wrote:
> Howdy -- I'm writing an article for Code Quarterly on pattern
> matching, featuring the Two-Way algorithm, which you added
> to libc and gnulib a couple of years back.
>
> There is one result of the critical factorization phase I haven't been
> able to understand, and I was hoping you could give me a clue. Two
> strings that would appear to have identical periodicity, eg "banana"
> and "zanana", result in different factorizations when I run them
> through the example implementation you cite on the libc bugzilla.
Thanks again for the report. This affects gnulib's memmem, strstr,
strcasestr, and c-strcasestr modules (it would also affect memcasemem,
if we ever have any reason to implement that counterpart to memcasecmp).
>
> I have reimplemented the algorithm from the description and I get the
> same result as the Java demo at
> http://www-igm.univ-mlv.fr/~lecroq/string/node26.html
>
> Factorization:
> xl="ba", xr="nana"
> xl="z", xr="anana"
I already explained to Carlos off-list that the difference here comes
from the fact that the example code on this web page (which is where I
started when writing the gnulib implementation) includes the entire
string in the search for the maximal suffix.  Both results (2/4 for
"banana" and 1/5 for "zanana") are the only two valid critical
factorizations for that 6-letter pattern XYZYZY over a 3-element
alphabet, even though it does indeed look odd that which of the two
factorizations is chosen depends on how X, Y, and Z sort
lexicographically.
For another example, consider "abcabc", which has three critical
factorizations where a non-empty left half still fits within the period
of the overall string (1/5, 2/4, and 3/3), but sorting lexicographic
suffixes can find at most two of those factorizations.
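For the curious, the two-ordering maximal-suffix computation from
Lecroq's page can be sketched roughly as follows.  This is an
illustrative translation of his maxSuf/maxSufTilde routines, not the
gnulib code itself, and the function names are mine:

```c
#include <assert.h>
#include <stddef.h>

/* Maximal suffix of x[0..m-1] under normal (rev == 0) or reversed
   (rev != 0) byte ordering.  Returns the start index of that suffix
   and stores its period in *period.  ms is the index just before the
   current best suffix; ms == (size_t)-1 wraps so x[ms + 1] is x[0].  */
static size_t
max_suffix (const char *x, size_t m, int rev, size_t *period)
{
  size_t ms = (size_t) -1;      /* best suffix so far starts at ms + 1 */
  size_t j = 0, k = 1, p = 1;
  while (j + k < m)
    {
      unsigned char a = x[j + k];
      unsigned char b = x[ms + k];
      if (rev ? a > b : a < b)
        {
          /* Suffix at j loses; skip past the mismatch.  */
          j += k; k = 1; p = j - ms;
        }
      else if (a == b)
        {
          if (k != p) ++k;
          else { j += p; k = 1; }
        }
      else
        {
          /* Suffix at j is larger; it becomes the new candidate.  */
          ms = j; j = ms + 1; k = p = 1;
        }
    }
  *period = p;
  return ms + 1;
}

/* Length of the left half of the chosen critical factorization: the
   later of the two maximal suffixes wins, taking its period along.  */
static size_t
critical_factorization (const char *x, size_t m, size_t *period)
{
  size_t p, q;
  size_t i = max_suffix (x, m, 0, &p);
  size_t j = max_suffix (x, m, 1, &q);
  if (i > j) { *period = p; return i; }
  else       { *period = q; return j; }
}
```

Running this on the two needles above reproduces the asymmetry Carlos
saw: "banana" splits at 2 ("ba"/"nana") because the ordinary-order
suffix "nana" starts later, while "zanana" splits at 1 ("z"/"anana")
because the reversed-order suffix wins there.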
> I'm not exactly sure what's going on here, but I wonder if it's a bug
> in the more efficient implementation. Another (likely) possibility is
> a subtlety of the algorithm that I am not understanding. Can you shed
> some light on what might be going on?
Ultimately, it is NOT a behavioral bug. But it IS an optimization bug.
For any needle of length 1 or 2, we don't need to do any comparisons to
determine a critical factorization (for "", 0/0 is critical, but empty
needles are already special cased; for "a", both 0/1 and 1/0 are
critical, but only 0/1 plays well with the rest of the code that assumes
a non-empty suffix; for "aa", both 0/2 and 1/1 are critical; and for
"ab", only 1/1 is critical). Needles of length 0 are already forbidden
by the documented two_way_* interface. And anything of length 3 or
longer can safely ignore the suffix created by the entire needle itself
(since the 0/n factorization will always be a local period of 1, and can
only be critical if the entire needle consists of the same byte; but in
that case, the 1/n-1 factorization would also be critical).  Therefore,
instead of checking the entire needle for a longest suffix (that is,
starting with a comparison of x[0] and x[1]), we can start with only
reduced-length suffixes (that is, with a comparison of x[1] and x[2]),
saving one comparison and slightly speeding up factorization.
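In terms of the usual maximal-suffix scan, the tweak amounts to
starting with the offset-1 suffix already accepted as the candidate, so
the first comparison is x[2] against x[1].  A sketch (valid only for
needles of length >= 3, per the argument above; the name is mine, not
glibc's):

```c
#include <assert.h>
#include <stddef.h>

/* Maximal-suffix scan (normal ordering) that skips the suffix equal to
   the whole needle: initialize with the candidate at offset 1 already
   accepted (ms = 0, j = 1), so the x[0]/x[1] comparison is never made.
   Requires m >= 3.  Returns the start of the maximal suffix among
   suffixes beginning at index >= 1, and stores its period.  */
static size_t
max_suffix_skip_full (const char *x, size_t m, size_t *period)
{
  size_t ms = 0;                /* candidate suffix starts at ms + 1 == 1 */
  size_t j = 1, k = 1, p = 1;
  while (j + k < m)
    {
      unsigned char a = x[j + k];
      unsigned char b = x[ms + k];
      if (a < b)
        {
          j += k; k = 1; p = j - ms;
        }
      else if (a == b)
        {
          if (k != p) ++k;
          else { j += p; k = 1; }
        }
      else
        {
          ms = j; j = ms + 1; k = p = 1;
        }
    }
  *period = p;
  return ms + 1;
}
```

For "banana" this still finds the suffix "nana" at index 2 with period
2, and for the all-one-byte needle "aaaa" it finds the suffix at index
1 with period 1, the 1/n-1 factorization that is equally critical.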
Since I implemented my enhanced string-search code for gnulib first and
only later ported it to glibc, I will post a patch here soon, then file
a glibc bug to get the fix folded in there.
Meanwhile, glibc bug 11607 already complains that the time spent on
factorization is extremely noticeable for short needles.  I have not
benchmarked things myself (partly because I don't know how cache-line
effects would play into a benchmark), but I do wonder whether the
heuristics should be tweaked to use the known-quadratic naive algorithm
for short needles, so that on average we do less work than we currently
spend on factorization.  For a worst-case needle like "aaaa", the naive
algorithm has no initialization cost but ends up comparing each "a"
almost 4 times per byte of the haystack, while more common needles like
"abcd" easily approach one comparison per haystack byte.  The two-way
algorithm guarantees no more than two comparisons per haystack byte,
but also costs 8 comparisons and several conditional jumps up front for
any 4-byte needle.  So, given that probability favors the needle not
being periodic, at what needle length does the two-way algorithm save
rather than cost time on an average use?  I'm also thinking 4- or
8-byte needles might be able to take advantage of SSE-type vector
operations for some assembly-based speedups to naive searches, when the
entire needle fits in special-purpose large registers.
http://sourceware.org/bugzilla/show_bug.cgi?id=11607
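The naive fallback being weighed there would look roughly like this (a
sketch for illustration; the name is mine, and any real crossover
length would have to come from benchmarks):

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/* Naive O(n*m) search: zero preprocessing cost, at most m comparisons
   per haystack position.  For tiny needles the lack of startup cost is
   the whole point of the trade-off discussed above.  Returns a pointer
   to the first match, or NULL.  */
static const char *
naive_search (const char *hay, size_t n, const char *needle, size_t m)
{
  if (m == 0)
    return hay;
  for (size_t i = 0; i + m <= n; i++)
    if (hay[i] == needle[0] && memcmp (hay + i, needle, m) == 0)
      return hay + i;
  return NULL;
}
```

A dispatcher would then pick naive_search below some threshold needle
length and fall through to the two-way code above it; where exactly
that threshold sits is the open benchmarking question.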
>
> Thank you!
> Carlos Bueno
> Facebook Engineering
> address@hidden
>
--
Eric Blake address@hidden +1-801-349-2682
Libvirt virtualization library http://libvirt.org