grep branch, master, updated. v2.22-16-g71c206b


From: Paul Eggert
Subject: grep branch, master, updated. v2.22-16-g71c206b
Date: Wed, 06 Jan 2016 07:29:47 +0000

This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "grep".

The branch, master has been updated
       via  71c206b5042a11c976c25a9f77aff04ebb29fcd9 (commit)
      from  40ed879db22d57516a31fefd1c39416974b74ec4 (commit)

Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.

- Log -----------------------------------------------------------------
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=71c206b5042a11c976c25a9f77aff04ebb29fcd9


commit 71c206b5042a11c976c25a9f77aff04ebb29fcd9
Author: Paul Eggert <address@hidden>
Date:   Tue Jan 5 23:29:07 2016 -0800

    Fix calculation of unibyte_mask
    
    * src/grep.c (initialize_unibyte_mask): The old method worked for
    UTF-8 and other typical encodings, but did not work for weird
    encodings, e.g., one where all bytes other than 0x7f and 0x80 are
    unibyte characters.
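
The difference between the two methods can be checked on the "weird"
encoding that the log message describes. The sketch below is not grep
code: the `is_unibyte` table is a hypothetical stand-in for grep's
`mbclen_cache`, the helper names are invented for illustration, and
8-bit bytes (UCHAR_MAX == 255) are assumed. `old_mask` follows the
removed lines of the diff and `new_mask` follows the added ones.

```c
#include <assert.h>
#include <limits.h>
#include <stdbool.h>

/* Old method: AND together every byte that is not a unibyte character.
   IS_UNIBYTE is a hypothetical table with UCHAR_MAX + 1 entries.  */
static unsigned char
old_mask (bool const *is_unibyte)
{
  unsigned char mask = UCHAR_MAX;
  for (int i = 1; i <= UCHAR_MAX; i++)
    if (!is_unibyte[i])
      mask &= i;
  return mask;
}

/* New method: for each non-unibyte byte that MASK does not yet match,
   OR that byte's most significant 1 bit into MASK.  */
static unsigned char
new_mask (bool const *is_unibyte)
{
  unsigned char mask = 0;
  int ms1b = 1;
  for (int i = 1; i <= UCHAR_MAX; i++)
    if (!is_unibyte[i] && !(mask & i))
      {
        while (ms1b * 2 <= i)
          ms1b *= 2;
        mask |= ms1b;
      }

  /* Postcondition: every non-unibyte byte is detected by MASK.  */
  for (int i = 1; i <= UCHAR_MAX; i++)
    assert (is_unibyte[i] || (mask & i));
  return mask;
}

/* The log message's example: all bytes except 0x7f and 0x80 are
   unibyte characters.  */
static void
fill_weird (bool *table)
{
  for (int i = 0; i <= UCHAR_MAX; i++)
    table[i] = i != 0x7f && i != 0x80;
}

/* A UTF-8-like layout: only bytes 0x00..0x7f are unibyte characters.  */
static void
fill_utf8_like (bool *table)
{
  for (int i = 0; i <= UCHAR_MAX; i++)
    table[i] = i < 0x80;
}
```

For the weird table, `old_mask` yields 0x7f & 0x80 = 0, which would
treat every byte as unibyte, while `new_mask` yields 0xc0. That mask
catches both 0x7f and 0x80 but also flags every other byte from 0x40
upward, which is the "cry wolf" case the new comment mentions. For the
UTF-8-like table both methods yield 0x80.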

diff --git a/src/grep.c b/src/grep.c
index a5f1fa2..f6fb0bc 100644
--- a/src/grep.c
+++ b/src/grep.c
@@ -502,10 +502,10 @@ clean_up_stdout (void)
 /* An unsigned type suitable for fast matching.  */
 typedef uintmax_t uword;
 
-/* All bytes that are not unibyte characters, ANDed together, and then
-   with the pattern repeated to fill a uword.  For an encoding where
+/* A mask to test for unibyte characters, with the pattern repeated to
+   fill a uword.  For a multibyte character encoding where
    all bytes are unibyte characters, this is 0.  For UTF-8, this is
-   0x808080....  For encodings where unibyte characters have no useful
+   0x808080....  For encodings where unibyte characters have no discerned
    pattern, this is all 1s.  The unsigned char C is a unibyte
    character if C & UNIBYTE_MASK is zero.  If the uword W is the
    concatenation of bytes, the bytes are all unibyte characters
@@ -515,10 +515,23 @@ static uword unibyte_mask;
 static void
 initialize_unibyte_mask (void)
 {
-  unsigned char mask = UCHAR_MAX;
+  /* For each encoding error I that MASK does not already match,
+     accumulate I's most significant 1 bit by ORing it into MASK.
+     Although any 1 bit of I could be used, in practice high-order
+     bits work better.  */
+  unsigned char mask = 0;
+  int ms1b = 1;
   for (int i = 1; i <= UCHAR_MAX; i++)
-    if (mbclen_cache[i] != 1)
-      mask &= i;
+    if (mbclen_cache[i] != 1 && ! (mask & i))
+      {
+        while (ms1b * 2 <= i)
+          ms1b *= 2;
+        mask |= ms1b;
+      }
+
+  /* Now MASK will detect any encoding-error byte, although it may
+     cry wolf and it may not be optimal.  Build a uword-length mask by
+     repeating MASK.  */
   uword uword_max = -1;
   unibyte_mask = uword_max / UCHAR_MAX * mask;
 }

-----------------------------------------------------------------------

Summary of changes:
 src/grep.c |   25 +++++++++++++++++++------
 1 files changed, 19 insertions(+), 6 deletions(-)
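
The last two lines of initialize_unibyte_mask repeat the one-byte mask
across a whole uword. A minimal sketch of that arithmetic and of a
word-at-a-time test follows. The function names `repeat_mask`,
`load_word` and `word_is_all_unibyte` are hypothetical and do not
appear in grep; the word test is an assumption based on the byte test
"C & UNIBYTE_MASK is zero" stated in the comment.

```c
#include <limits.h>
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

typedef uintmax_t uword;

/* Repeat MASK in every byte of a uword.  Dividing the all-ones word by
   UCHAR_MAX gives 0x0101...01; multiplying by MASK then copies MASK
   into each byte position.  */
static uword
repeat_mask (unsigned char mask)
{
  uword uword_max = -1;
  return uword_max / UCHAR_MAX * mask;
}

/* Load sizeof (uword) bytes starting at BYTES into a uword.  */
static uword
load_word (char const *bytes)
{
  uword w;
  memcpy (&w, bytes, sizeof w);
  return w;
}

/* Return true if no byte of W has a bit in common with the repeated
   mask, i.e. every byte of W passes the single-byte test.  */
static bool
word_is_all_unibyte (uword w, uword repeated_mask)
{
  return (w & repeated_mask) == 0;
}
```

With a 64-bit uword, `repeat_mask (0x80)` is 0x8080808080808080, the
UTF-8 value named in the comment. A word of ASCII bytes passes the
test, and replacing any one byte with a byte that has its high bit set
makes it fail, regardless of byte order.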


hooks/post-receive
-- 
grep


