bug-libunistring

[bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring


From: Ben Pfaff
Subject: [bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring
Date: Sat, 13 Nov 2010 12:20:33 -0800
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)

Bruno Haible <address@hidden> writes:

>> But it has only the u8_prev() function for iterating backward.
>> That function has the pitfall that it only operates on
>> well-formed UTF-8 sequences
>
> Indeed. I'm adding a note about it to the manual.

Thank you.

>> (the manual also implies that it only 
>> works on null-terminated UTF-8 strings, but in fact it doesn't).
>
> Indeed. But I wanted to have u8_prev documented near u8_next, and
> u8_next _does_ assume a NUL-terminated UTF-8 string.

That makes sense, of course.

>> Consider how u8_mbtouc() treats ill-formed sequences.
>> There are three cases (the examples show byte sequences and the
>> code points for the immediately preceding bytes in parentheses):
>> 
>>   (a) For an incomplete sequence, it reports the whole incomplete
>>       sequence as a single code point U+FFFD.
>> 
>>         e0 a0 (U+FFFD)
>> 
>>           (Only 2 bytes in 3-byte sequence.)
>> 
>>   (b) For a sequence that is invalid in UTF-8, but that would be
>>       valid if overlong sequences or invalid Unicode code points
>>       were allowed, it reports the whole invalid sequence as a
>>       single code point U+FFFD.
>> 
>>         e0 80 (U+FFFD)
>> 
>>           (Not a prefix of any valid UTF-8 sequence, because the
>>           second byte must be at least 0xa0 when the first byte is 0xe0.)
>> 
>>         f5 80 80 (U+FFFD)
>> 
>>           (This would be greater than the maximum code point U+10FFFF
>>           if it was allowed.)
>> 
>>   (c) For an invalid (but complete) sequence, it reports each
>>       byte as a separate code point U+FFFD.
>> 
>>         c0 (U+FFFD) 41 (U+0041)
>> 
>>           (c0 never appears in UTF-8)
>> 
>>         e1 (U+FFFD) e1 (U+FFFD) 80 (U+FFFD)
>>
>>           (This would be a UTF-16 surrogate if it was allowed.)

I goofed on this one: e1 e1 80 would not represent a UTF-16
surrogate and doesn't make any sense here.  A correct substitute
would be e0 80 80.

>>         e0 (U+FFFD) a0 (U+FFFD) 00 (U+0000)
>> 
>>           (e0 starts a 3-byte sequence but 00 is invalid as the
>>           third byte.)
>
> Part (c) is actually a bug. Now I'm looking at Markus Kuhn's recommendations
> on how to parse UTF-8
>    <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>
>    <http://www.w3.org/2001/06/utf-8-wrong/UTF-8-test.html>
> and am finding that u8_mbtouc needs to be corrected so that in case (c)
> the results are:
>
>           c0 (U+FFFD) 41 (U+0041)
>           e1 e1 80 (U+FFFD)

Of course this should also be corrected to e0 80 80 (U+FFFD).

>           e0 a0 (U+FFFD) 00 (U+0000)
[...]
> After correcting the forward iteration, the answer for e0 a0 00
> is clear: e0 a0 must be grouped together, for a single U+FFFD.
>
> Are there other cases where the forward iteration behaviour does
> not allow an equivalent O(1) backward iteration?

No, that's the only one.  I have a corrected version of
u8-mbtouc-aux.c here, along with a draft of a reverse-iterating
version.  A test program that exhaustively tests all of the
possibilities in forward and reverse order reports that it works
OK.
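
To make the corrected grouping concrete, here is a minimal
standalone sketch of the rule as I understand it: a lead byte is
grouped with the continuation bytes that follow it, up to the
length the lead byte announces, and the whole group becomes one
U+FFFD if it is incomplete or invalid.  This is only an
illustration, not the actual u8-mbtouc-aux.c; the names
sketch_mbtouc and is_cont are made up for the example, and it
uses plain uint32_t instead of ucs4_t.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if B is a continuation byte (0x80..0xbf); same test as the patch. */
static int
is_cont (uint8_t b)
{
  return (b ^ 0x80) < 0x40;
}

/* Hypothetical stand-in for the forward decoder.  Decodes one unit from
   S[0..N-1], stores its code point (or U+FFFD) in *PUC, and returns the
   number of bytes consumed. */
static size_t
sketch_mbtouc (uint32_t *puc, const uint8_t *s, size_t n)
{
  uint8_t c;
  size_t want, len, i;
  uint32_t uc;

  assert (n > 0);
  c = s[0];
  if (c < 0x80)
    {
      *puc = c;
      return 1;
    }

  /* Length announced by the lead byte; 1 for bytes that cannot start a
     sequence (stray continuation bytes, c0, c1, f8..ff). */
  want = c < 0xc2 ? 1 : c < 0xe0 ? 2 : c < 0xf0 ? 3 : c < 0xf8 ? 4 : 1;

  /* Group the lead byte with the continuation bytes that follow it. */
  len = 1;
  while (len < want && len < n && is_cont (s[len]))
    len++;

  *puc = 0xfffd;
  if (want == 1 || len < want)
    return len;                 /* Bad lead byte, or truncated group. */

  uc = c & (0xff >> (want + 1));
  for (i = 1; i < want; i++)
    uc = (uc << 6) | (s[i] & 0x3f);

  /* Complete group: reject overlong forms, surrogates, and > U+10FFFF. */
  if (!((want == 3 && (uc < 0x800 || (uc >= 0xd800 && uc <= 0xdfff)))
        || (want == 4 && (uc < 0x10000 || uc > 0x10ffff))))
    *puc = uc;
  return want;
}
```

Under this rule e0 a0 00 decodes as U+FFFD (2 bytes) followed by
U+0000, and e0 80 80 decodes as a single U+FFFD covering all 3
bytes.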

>> So, I'd like to propose the following:
>> 
>>   1. Change libunistring functions that detect ill-formed UTF-8
>>      sequences to return each byte of an ill-formed sequence as a
>>      separate U+FFFD code point (and for case (b) to return these
>>      even when, e.g. e0 80 is seen but a third byte isn't
>>      available).  (This actually simplifies code.)
>
> No, according to the guidelines set out by Markus Kuhn and republished
> by the W3C it is better to return a single U+FFFD for a sequence like
> e0 80 or e0 80 bf.

OK.

> But it is quite possible that with this changed behaviour of u8_mbtouc
> you can write a u8_prev function that also works for invalid input
> and yet satisfies both of your requirements. No?

Yes, it does satisfy both of my requirements with that change.

>>   2. Add libunistring functions to get the last code point out of
>>      a UTF-8 string.  Tentatively I was planning to add a "r"
>>      prefix (e.g. u8_rmbtouc(), u8_rmbtoucr()) but other
>>      conventions are fine too.
>
> 'r' is already used as a suffix, therefore I would not use that.
> Maybe u8_mb_prev_uc is a better name for such a function?

That's better, yes.
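
For illustration, here is roughly why backward iteration can be
O(1) once the forward decoder groups a lead byte with the
continuation bytes that follow it (up to the length the lead
byte announces).  Every non-continuation byte starts a unit, so
finding the last unit means looking back over at most 3
continuation bytes.  This is a sketch under that assumption, not
my draft and not a proposed API; sketch_prev_len and want_len are
hypothetical names, and the function only finds the length of the
last unit (decoding it is then a forward decode of those bytes).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Number of bytes the byte C announces when it starts a unit: 1 for
   ASCII, stray continuation bytes, and bytes that never start a
   sequence (c0, c1, f8..ff). */
static size_t
want_len (uint8_t c)
{
  return c < 0xc2 ? 1 : c < 0xe0 ? 2 : c < 0xf0 ? 3 : c < 0xf8 ? 4 : 1;
}

/* Returns the length in bytes of the last unit in S[0..N-1], N > 0,
   matching what forward iteration over the same bytes would produce
   under the grouping rule described above. */
static size_t
sketch_prev_len (const uint8_t *s, size_t n)
{
  size_t j;

  assert (n > 0);
  if ((s[n - 1] ^ 0x80) >= 0x40)
    return 1;                   /* Non-continuation byte: its own unit. */

  /* The last byte is a continuation byte.  Look back over at most 3
     bytes for the byte that could have absorbed it. */
  for (j = 1; j <= 3 && j < n; j++)
    {
      uint8_t c = s[n - 1 - j];
      if ((c ^ 0x80) >= 0x40)
        return want_len (c) > j ? j + 1 : 1;
    }
  return 1;                     /* Stray continuation byte. */
}
```

For example, for e0 80 80 this returns 3, while for e0 80 80 80
it returns 1, because the forward pass would have closed the
group after the third byte and left the last 80 on its own.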

>> Bruno, does this sound like a worthwhile project, and would you
>> accept this kind of contribution, if it was written following the
>> existing libunistring conventions, etc.?
>
> Yes, if it satisfies your two requirements and is consistent with the
> new forward iteration behaviour (modified as of today).

OK.  I'll work on it.

Here's the diff for the u8-mbtouc-aux.c that I've got here so far,
by the way.  I'm not very satisfied with the style, and it
doesn't update the #if 0'd out portion, but it does do the trick.

Thanks,

Ben.

diff --git a/lib/unistr/u8-mbtouc-aux.c b/lib/unistr/u8-mbtouc-aux.c
index c997589..39a8258 100644
--- a/lib/unistr/u8-mbtouc-aux.c
+++ b/lib/unistr/u8-mbtouc-aux.c
@@ -61,13 +61,24 @@ u8_mbtouc_aux (ucs4_t *puc, const uint8_t *s, size_t n)
                          | (unsigned int) (s[2] ^ 0x80);
                   return 3;
                 }
-              /* invalid multibyte character */
+              else
+                {
+                  *puc = 0xfffd;
+                  if ((s[1] ^ 0x80) >= 0x40)
+                    return 1;
+                  else if ((s[2] ^ 0x80) >= 0x40)
+                    return 2;
+                  else
+                    return 3;
+                }
             }
           else
             {
-              /* incomplete multibyte character */
               *puc = 0xfffd;
-              return n;
+              if (n >= 2 && (s[1] ^ 0x80) < 0x40)
+                return 2;
+              else
+                return 1;
             }
         }
       else if (c < 0xf8)
@@ -88,13 +99,28 @@ u8_mbtouc_aux (ucs4_t *puc, const uint8_t *s, size_t n)
                          | (unsigned int) (s[3] ^ 0x80);
                   return 4;
                 }
-              /* invalid multibyte character */
+              else
+                {
+                  *puc = 0xfffd;
+                  if ((s[1] ^ 0x80) >= 0x40)
+                    return 1;
+                  else if ((s[2] ^ 0x80) >= 0x40)
+                    return 2;
+                  else if ((s[3] ^ 0x80) >= 0x40)
+                    return 3;
+                  else
+                    return 4;
+                }
             }
           else
             {
-              /* incomplete multibyte character */
               *puc = 0xfffd;
-              return n;
+              if (n < 2 || (s[1] ^ 0x80) >= 0x40)
+                return 1;
+              else if (n < 3 || (s[2] ^ 0x80) >= 0x40)
+                return 2;
+              else
+                return 3;
             }
         }
 #if 0

-- 
Ben Pfaff 
http://benpfaff.org


