bug-libunistring
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

[bug-libunistring] UTF-8 backward iteration proposal for libunistring


From: Ben Pfaff
Subject: [bug-libunistring] UTF-8 backward iteration proposal for libunistring
Date: Fri, 12 Nov 2010 21:30:11 -0800
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/23.2 (gnu/linux)

libunistring has several functions for iterating forward in a
UTF-8 string, such as u8_mbtouc(), u8_mbtoucr(), and u8_next().
But it has only the u8_prev() function for iterating backward.
That function has the pitfall that it only operates on
well-formed UTF-8 sequences (the manual also implies that it only
works on null-terminated UTF-8 strings, but in fact it doesn't).

So I set out to write backward iterating analogues of
u8_mbtouc(), u8_mbtoucr(), etc. that work well in the presence of
invalid and incomplete UTF-8 byte sequences.  After some thinking
and experimentation I came up with two requirements:

  1. Forward and backward iteration of the same bytes report the
     same code points (in opposite order, of course).

  2. Backward iteration does not require any more context than
     forward iteration does.

It is easy to satisfy these requirements if the string contains
only well-formed UTF-8 byte sequences.  A byte inside the range
0x80 to 0xBF, inclusive, is the the second, third, or fourth byte
of a character.  Any byte outside that range is the first byte of
a character.

If the string might have ill-formed sequences, the problem is
harder.  Consider how u8_mbtouc() treats ill-formed sequences.
There are three cases (the examples show byte sequences and the
code points for the immediately preceding bytes in parentheses):

  (a) For an incomplete sequence, it reports the whole incomplete
      sequence as a single code point U+FFFD.

        e0 a0 (U+FFFD)

          (Only 2 bytes in 3-byte sequence.)

  (b) For a sequence that is invalid in UTF-8, but that would be
      valid if overlong sequences or invalid Unicode code points
      were allowed, it reports the whole invalid sequence as a
      single code point U+FFFD.

        e0 80 (U+FFFD)

          (Not a prefix of any valid UTF-8 sequence, because the
          second byte must be at 0xa0 when the first byte is 0xe0.)

        f5 80 80 (U+FFFD)

          (This would be greater than the maximum code point U+10FFFF
          if it was allowed.)

  (c) For an invalid (but complete) sequence, it reports each
      byte as a separate code point U+FFFD.

        c0 (U+FFFD) 41 (U+0041)

          (c0 never appears in UTF-8)

        e1 (U+FFFD) e1 (U+FFFD) 80 (U+FFFD)

          (This would be a UTF-16 surrogate if it was allowed.)

        e0 (U+FFFD) a0 (U+FFFD) 00 (U+0000)

          (e1 starts a 3-byte sequence but 00 is invalid as the
          third byte.)

Given these three cases, I believe that a function for backward
iteration can satisfy either property 1 or property 2 above, but
not both:

  * Clearly it is possible to satisfy property 1, ignoring
    efficiency: iterate forward through the entire UTF-8 byte
    sequence, recording the code points and their positions as
    they are generated, and then return them during backward
    iteration.  But such an implementation would use extra state
    not allowed by property 2.

  * If the backward iteration function only uses the same amount
    of state as u8_mbtouc(), that is, a pointer to the beginning
    of the string and the number of bytes in it, then this
    satisfies property 2.  But some of the cases above cannot be
    distinguished.  For example, consider the sequence e0 a0 00:

      - In forward iteration, this is case (c) for u8_mbtouc(),
        and each byte will be treated as a separate code point.

      - In backward iteration, after returning the final byte 00
        as code point U+0000, the reverse iteration code cannot
        tell whether the remaining bytes e0 a0 should be treated
        as case (a), a single U+FFFD code point, or case (c), two
        separate U+FFFD code points, because it does not know
        whether following bytes were trimmed off earlier.
        Distinguishing these cases requires additional state,
        which would violate property 2.

However, backward iteration satisfying both properties becomes
possible if we change the treatment of ill-formed sequences for
forward iteration to use the same rule in every case, so that
every ill-formed byte is reported as a separate code point.  The
examples above change to the following under this rule:

  (a)   e0 (U+FFFD) a0 (U+FFFD)
  (b)   e0 (U+FFFD) 80 (U+FFFD)
        f5 (U+FFFD) 80 (U+FFFD) 80 (U+FFFD)
  (c)   c0 (U+FFFD) 41 (U+0041)
        e1 (U+FFFD) e1 (U+FFFD) 80 (U+FFFD)
        e0 (U+FFFD) a0 (U+FFFD) 00 (U+0000)

So, I'd like to propose the following:

  1. Change libunistring functions that detect ill-formed UTF-8
     sequences to return each byte of an ill-formed sequence as a
     separate U+FFFD code point (and for case (b) to return these
     even when, e.g. e0 80 is seen but a third byte isn't
     available).  (This actually simplifies code.)

  2. Add libunistring functions to get the last code point out of
     a UTF-8 string.  Tentatively I was planning to add a "r"
     prefix (e.g. u8_rmbtouc(), u8_rmbtoucr()) but other
     conventions are fine too.

     (I would add the corresponding functions for UTF-16 and
     UTF-32 too, of course, although they are much simpler.)

I'm willing to do any or all of the work for #1 and #2, and
document it.  I've already started in enough to convince myself
that it is doable.

Bruno, does this sound like a worthwhile project, and would you
accept this kind of contribution, if it was written following the
existing libunistring conventions, etc.?

Thanks,

Ben.
-- 
Ben Pfaff 
http://benpfaff.org



reply via email to

[Prev in Thread] Current Thread [Next in Thread]