[bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring
From: Ben Pfaff
Subject: [bug-libunistring] Re: UTF-8 backward iteration proposal for libunistring
Date: Sat, 13 Nov 2010 12:20:33 -0800
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)
Bruno Haible <address@hidden> writes:
>> But it has only the u8_prev() function for iterating backward.
>> That function has the pitfall that it only operates on
>> well-formed UTF-8 sequences
>
> Indeed. I'm adding a note about it to the manual.
Thank you.
>> (the manual also implies that it only
>> works on null-terminated UTF-8 strings, but in fact it doesn't).
>
> Indeed. But I wanted to have u8_prev documented near u8_next, and
> u8_next _does_ assume a NUL-terminated UTF-8 string.
That makes sense, of course.
>> Consider how u8_mbtouc() treats ill-formed sequences.
>> There are three cases (the examples show byte sequences and the
>> code points for the immediately preceding bytes in parentheses):
>>
>> (a) For an incomplete sequence, it reports the whole incomplete
>> sequence as a single code point U+FFFD.
>>
>> e0 a0 (U+FFFD)
>>
>> (Only 2 bytes in 3-byte sequence.)
>>
>> (b) For a sequence that is invalid in UTF-8, but that would be
>> valid if overlong sequences or invalid Unicode code points
>> were allowed, it reports the whole invalid sequence as a
>> single code point U+FFFD.
>>
>> e0 80 (U+FFFD)
>>
>> (Not a prefix of any valid UTF-8 sequence, because the
>> second byte must be at least 0xa0 when the first byte is 0xe0.)
>>
>> f5 80 80 (U+FFFD)
>>
>> (This would be greater than the maximum code point U+10FFFF
>> if it was allowed.)
>>
>> (c) For an invalid (but complete) sequence, it reports each
>> byte as a separate code point U+FFFD.
>>
>> c0 (U+FFFD) 41 (U+0041)
>>
>> (c0 never appears in UTF-8)
>>
>> e1 (U+FFFD) e1 (U+FFFD) 80 (U+FFFD)
>>
>> (This would be a UTF-16 surrogate if it was allowed.)
I goofed on this one: e1 e1 80 would not represent a UTF-16
surrogate and doesn't make any sense here. A correct substitute
would be e0 80 80.
>> e0 (U+FFFD) a0 (U+FFFD) 00 (U+0000)
>>
>> (e0 starts a 3-byte sequence but 00 is invalid as the
>> third byte.)
>
> Part (c) is actually a bug. Now I'm looking at Markus Kuhn's recommendations
> on how to parse UTF-8
> <http://www.cl.cam.ac.uk/~mgk25/ucs/examples/UTF-8-test.txt>
> <http://www.w3.org/2001/06/utf-8-wrong/UTF-8-test.html>
> and am finding that u8_mbtouc needs to be corrected so that in case (c)
> the results are:
>
> c0 (U+FFFD) 41 (U+0041)
> e1 e1 80 (U+FFFD)
Of course this should also be corrected to e0 80 80 (U+FFFD).
> e0 a0 (U+FFFD) 00 (U+0000)
[...]
> After correcting the forward iteration, the answer for e0 a0 00
> is clear: e0 a0 must be grouped together, for a single U+FFFD.
>
> Are there other cases where the forward iteration behaviour does
> not allow an equivalent O(1) backward iteration?
No, that's the only one. I have a corrected version of
u8-mbtouc-aux.c here, along with a draft of a reverse-iterating
version. A test program that exhaustively tests all of the
possibilities in forward and reverse order reports that it works
OK.
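
To make the corrected forward grouping concrete, here is a minimal sketch of
the length computation only. It is not the libunistring code: the name
sketch_unit_len is made up for illustration, it does not decode or validate
code points, and its handling of 1-byte and 2-byte leads is my assumption.
For 3-byte and 4-byte leads it is meant to return the same lengths as the
diff at the end of this message.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Nonzero if B has the form 10xxxxxx, i.e. is a UTF-8 continuation byte. */
static int
is_continuation (uint8_t b)
{
  return (b & 0xc0) == 0x80;
}

/* Hypothetical helper, not part of libunistring.  Returns how many of the
   N bytes at S one forward step consumes, whether that unit is a
   well-formed character or a malformed run reported as a single U+FFFD.
   Requires N > 0. */
static size_t
sketch_unit_len (const uint8_t *s, size_t n)
{
  size_t want, len;

  assert (n > 0);
  if (s[0] < 0xc2 || s[0] >= 0xf8)
    return 1;   /* ASCII, stray continuation byte, c0/c1, or f8..ff */
  want = s[0] < 0xe0 ? 2 : s[0] < 0xf0 ? 3 : 4;

  /* Absorb continuation bytes, but never more than the lead byte
     announces and never past the end of the buffer. */
  for (len = 1; len < want && len < n && is_continuation (s[len]); len++)
    continue;
  return len;
}
```

With this grouping, e0 a0 00 yields a unit of 2 bytes followed by the NUL,
and e0 80 80 yields a single unit of 3 bytes.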
>> So, I'd like to propose the following:
>>
>> 1. Change libunistring functions that detect ill-formed UTF-8
>> sequences to return each byte of an ill-formed sequence as a
>> separate U+FFFD code point (and for case (b) to return these
>> even when, e.g. e0 80 is seen but a third byte isn't
>> available). (This actually simplifies code.)
>
> No, according to the guidelines set out by Markus Kuhn and republished
> by the W3C it is better to return a single U+FFFD for a sequence like
> e0 80 or e0 80 bf.
OK.
> But it may well be possible that, with this changed behaviour of u8_mbtouc,
> you can write a u8_prev function that also works for invalid input
> and yet satisfies both of your requirements. No?
Yes, it does satisfy both of my requirements with that change.
>> 2. Add libunistring functions to get the last code point out of
>> a UTF-8 string. Tentatively I was planning to add a "r"
>> prefix (e.g. u8_rmbtouc(), u8_rmbtoucr()) but other
>> conventions are fine too.
>
> 'r' is already used as a suffix, therefore I would not use that.
> Maybe u8_mb_prev_uc is a better name for such a function?
That's better, yes.
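
As a rough illustration of what the backward step has to compute, here is a
sketch of the length of the last unit in a buffer. It is not my draft and
not a libunistring function: sketch_lead_len and sketch_prev_len are
hypothetical names, and the sketch assumes the forward grouping in which a
lead byte absorbs the continuation bytes that follow it, up to the count it
announces. It looks at no more than four bytes, so each step is O(1).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Length announced by lead byte B, or 1 for bytes that never start a
   multibyte sequence (ASCII, continuation bytes, c0/c1, f8..ff). */
static size_t
sketch_lead_len (uint8_t b)
{
  if (b < 0xc2 || b >= 0xf8)
    return 1;
  return b < 0xe0 ? 2 : b < 0xf0 ? 3 : 4;
}

/* Hypothetical helper: returns the length in bytes of the last unit in
   [START, END), where END is assumed to be a unit boundary.
   Requires START < END. */
static size_t
sketch_prev_len (const uint8_t *start, const uint8_t *end)
{
  size_t avail = (size_t) (end - start);
  size_t trail = 0;

  assert (avail > 0);

  /* Count at most three continuation bytes at the end of the buffer. */
  while (trail < 3 && trail < avail
         && (*(end - 1 - trail) & 0xc0) == 0x80)
    trail++;

  if (trail == 0)
    return 1;   /* last byte is ASCII, a lone lead byte, or invalid */

  /* The run belongs to the byte before it only if that byte is a lead
     byte announcing at least TRAIL + 1 bytes. */
  if (trail < avail && sketch_lead_len (*(end - 1 - trail)) >= trail + 1)
    return trail + 1;

  return 1;     /* stray continuation byte, reported on its own */
}
```

For e0 a0 00 this gives 1 for the trailing NUL and then 2 for e0 a0, which
matches the forward grouping discussed above.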
>> Bruno, does this sound like a worthwhile project, and would you
>> accept this kind of contribution, if it was written following the
>> existing libunistring conventions, etc.?
>
> Yes, if it satisfies your two requirements and is consistent with the
> new forward iteration behaviour (modified as of today).
OK. I'll work on it.
Here's the diff for the u8-mbtouc-aux.c that I've got here so far,
by the way. I'm not very satisfied with the style, and it
doesn't update the #if 0'd out portion, but it does do the trick.
Thanks,
Ben.
diff --git a/lib/unistr/u8-mbtouc-aux.c b/lib/unistr/u8-mbtouc-aux.c
index c997589..39a8258 100644
--- a/lib/unistr/u8-mbtouc-aux.c
+++ b/lib/unistr/u8-mbtouc-aux.c
@@ -61,13 +61,24 @@ u8_mbtouc_aux (ucs4_t *puc, const uint8_t *s, size_t n)
| (unsigned int) (s[2] ^ 0x80);
return 3;
}
- /* invalid multibyte character */
+ else
+ {
+ *puc = 0xfffd;
+ if ((s[1] ^ 0x80) >= 0x40)
+ return 1;
+ else if ((s[2] ^ 0x80) >= 0x40)
+ return 2;
+ else
+ return 3;
+ }
}
else
{
- /* incomplete multibyte character */
*puc = 0xfffd;
- return n;
+ if (n >= 2 && (s[1] ^ 0x80) < 0x40)
+ return 2;
+ else
+ return 1;
}
}
else if (c < 0xf8)
@@ -88,13 +99,28 @@ u8_mbtouc_aux (ucs4_t *puc, const uint8_t *s, size_t n)
| (unsigned int) (s[3] ^ 0x80);
return 4;
}
- /* invalid multibyte character */
+ else
+ {
+ *puc = 0xfffd;
+ if ((s[1] ^ 0x80) >= 0x40)
+ return 1;
+ else if ((s[2] ^ 0x80) >= 0x40)
+ return 2;
+ else if ((s[3] ^ 0x80) >= 0x40)
+ return 3;
+ else
+ return 4;
+ }
}
else
{
- /* incomplete multibyte character */
*puc = 0xfffd;
- return n;
+ if (n < 2 || (s[1] ^ 0x80) >= 0x40)
+ return 1;
+ else if (n < 3 || (s[2] ^ 0x80) >= 0x40)
+ return 2;
+ else
+ return 3;
}
}
#if 0
--
Ben Pfaff
http://benpfaff.org