[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
[bug-libunistring] UTF-8 backward iteration proposal for libunistring
From: |
Ben Pfaff |
Subject: |
[bug-libunistring] UTF-8 backward iteration proposal for libunistring |
Date: |
Fri, 12 Nov 2010 21:30:11 -0800 |
User-agent: |
Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux) |
libunistring has several functions for iterating forward in a
UTF-8 string, such as u8_mbtouc(), u8_mbtoucr(), and u8_next().
But it has only the u8_prev() function for iterating backward.
That function has the pitfall that it only operates on
well-formed UTF-8 sequences (the manual also implies that it only
works on null-terminated UTF-8 strings, but in fact it doesn't).
So I set out to write backward iterating analogues of
u8_mbtouc(), u8_mbtoucr(), etc. that work well in the presence of
invalid and incomplete UTF-8 byte sequences. After some thinking
and experimentation I came up with two requirements:
1. Forward and backward iteration of the same bytes report the
same code points (in opposite order, of course).
2. Backward iteration does not require any more context than
forward iteration does.
It is easy to satisfy these requirements if the string contains
only well-formed UTF-8 byte sequences. A byte inside the range
0x80 to 0xBF, inclusive, is the the second, third, or fourth byte
of a character. Any byte outside that range is the first byte of
a character.
If the string might have ill-formed sequences, the problem is
harder. Consider how u8_mbtouc() treats ill-formed sequences.
There are three cases (the examples show byte sequences and the
code points for the immediately preceding bytes in parentheses):
(a) For an incomplete sequence, it reports the whole incomplete
sequence as a single code point U+FFFD.
e0 a0 (U+FFFD)
(Only 2 bytes in 3-byte sequence.)
(b) For a sequence that is invalid in UTF-8, but that would be
valid if overlong sequences or invalid Unicode code points
were allowed, it reports the whole invalid sequence as a
single code point U+FFFD.
e0 80 (U+FFFD)
(Not a prefix of any valid UTF-8 sequence, because the
second byte must be at 0xa0 when the first byte is 0xe0.)
f5 80 80 (U+FFFD)
(This would be greater than the maximum code point U+10FFFF
if it was allowed.)
(c) For an invalid (but complete) sequence, it reports each
byte as a separate code point U+FFFD.
c0 (U+FFFD) 41 (U+0041)
(c0 never appears in UTF-8)
e1 (U+FFFD) e1 (U+FFFD) 80 (U+FFFD)
(This would be a UTF-16 surrogate if it was allowed.)
e0 (U+FFFD) a0 (U+FFFD) 00 (U+0000)
(e1 starts a 3-byte sequence but 00 is invalid as the
third byte.)
Given these three cases, I believe that a function for backward
iteration can satisfy either property 1 or property 2 above, but
not both:
* Clearly it is possible to satisfy property 1, ignoring
efficiency: iterate forward through the entire UTF-8 byte
sequence, recording the code points and their positions as
they are generated, and then return them during backward
iteration. But such an implementation would use extra state
not allowed by property 2.
* If the backward iteration function only uses the same amount
of state as u8_mbtouc(), that is, a pointer to the beginning
of the string and the number of bytes in it, then this
satisfies property 2. But some of the cases above cannot be
distinguished. For example, consider the sequence e0 a0 00:
- In forward iteration, this is case (c) for u8_mbtouc(),
and each byte will be treated as a separate code point.
- In backward iteration, after returning the final byte 00
as code point U+0000, the reverse iteration code cannot
tell whether the remaining bytes e0 a0 should be treated
as case (a), a single U+FFFD code point, or case (c), two
separate U+FFFD code points, because it does not know
whether following bytes were trimmed off earlier.
Distinguishing these cases requires additional state,
which would violate property 2.
However, backward iteration satisfying both properties becomes
possible if we change the treatment of ill-formed sequences for
forward iteration to use the same rule in every case, so that
every ill-formed byte is reported as a separate code point. The
examples above change to the following under this rule:
(a) e0 (U+FFFD) a0 (U+FFFD)
(b) e0 (U+FFFD) 80 (U+FFFD)
f5 (U+FFFD) 80 (U+FFFD) 80 (U+FFFD)
(c) c0 (U+FFFD) 41 (U+0041)
e1 (U+FFFD) e1 (U+FFFD) 80 (U+FFFD)
e0 (U+FFFD) a0 (U+FFFD) 00 (U+0000)
So, I'd like to propose the following:
1. Change libunistring functions that detect ill-formed UTF-8
sequences to return each byte of an ill-formed sequence as a
separate U+FFFD code point (and for case (b) to return these
even when, e.g. e0 80 is seen but a third byte isn't
available). (This actually simplifies code.)
2. Add libunistring functions to get the last code point out of
a UTF-8 string. Tentatively I was planning to add a "r"
prefix (e.g. u8_rmbtouc(), u8_rmbtoucr()) but other
conventions are fine too.
(I would add the corresponding functions for UTF-16 and
UTF-32 too, of course, although they are much simpler.)
I'm willing to do any or all of the work for #1 and #2, and
document it. I've already started in enough to convince myself
that it is doable.
Bruno, does this sound like a worthwhile project, and would you
accept this kind of contribution, if it was written following the
existing libunistring conventions, etc.?
Thanks,
Ben.
--
Ben Pfaff
http://benpfaff.org
- [bug-libunistring] UTF-8 backward iteration proposal for libunistring,
Ben Pfaff <=