Re: Proposed alternative encoding for stray UTF-8 bytes in strings
From: felix . winkelmann
Subject: Re: Proposed alternative encoding for stray UTF-8 bytes in strings
Date: Mon, 27 Nov 2023 13:49:00 +0100
> From the unicode-transition page:
>
> > The strategy that I favor in the moment is to handle all string data
> > injected into the system transparently, the actual bytes are unchanged and
> > unexpected UTF-8 bytes are decoded and marked as a U+DC80 - U+DCFF (low,
> > trailing) UTF-16 surrogate pair half.
>
>
> The trouble with this is that it means the internal representation is no
> longer valid UTF-8, which may cause problems down the line, since it is
> exposed to anyone dealing with bytevectors.
>
> There is an alternative based on the little-known "noncharacter" range.
> Despite the name, these really are perfectly valid characters, but Unicode
> guarantees that they will never be assigned to anything in the Real World
> and are reserved for internal use.[1] I propose using them instead of the
> surrogate space. Unfortunately there aren't enough of them to assign one
> to each possible stray byte, but we can assign one to each high and low
> nybble of each stray byte, analogously to the way Planes 1 to 1F are
> handled in UTF-16.
>
> Specifically, given a stray byte whose hex representation is xy, we decode
> it as the UTF-8 equivalent of U+FDDx U+FDEy, which is EF B7 9x EF B7 Ay in
> the internal encoding, which is now valid UTF-8. If any of these
> noncharacters (coming from a UTF-8 or UTF-16 source) is to be decoded, we
> escape it with the UTF-8 representation of U+FFFE, which is EF BF BE, so
> that (say) external U+FDDA is decoded as EF BF BE EF B7 AA. U+FFFE is also
> used to escape itself, so it becomes EF BF BE EF BF BE internally.
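
For concreteness, here is a rough Python sketch of the scheme quoted above, as I read it. The names to_internal, from_internal, and is_reserved are made up for illustration, and Python's "surrogateescape" error handler is used only as a convenient way to locate the stray bytes; none of this is taken from an existing implementation.

```python
ESCAPE = "\ufffe"  # noncharacter used as the escape prefix in the proposal


def is_reserved(ch: str) -> bool:
    """True for the noncharacters the scheme claims for itself."""
    return "\ufdd0" <= ch <= "\ufdef" or ch == ESCAPE


def to_internal(raw: bytes) -> bytes:
    """Map external bytes to an internal form that is always valid UTF-8."""
    out = []
    # surrogateescape turns each stray byte 0xNN into U+DCNN, which makes
    # the stray bytes easy to spot (an assumption of this sketch only).
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            byte = cp - 0xDC00
            # Stray byte xy becomes U+FDDx U+FDEy.
            out.append(chr(0xFDD0 + (byte >> 4)) + chr(0xFDE0 + (byte & 0xF)))
        elif is_reserved(ch):
            # Genuine noncharacters in the input are escaped with U+FFFE.
            out.append(ESCAPE + ch)
        else:
            out.append(ch)
    return "".join(out).encode("utf-8")


def from_internal(internal: bytes) -> bytes:
    """Inverse of to_internal: recover the original external bytes."""
    text = internal.decode("utf-8")  # strict: the internal form must be valid
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == ESCAPE and nxt:
            out += nxt.encode("utf-8")  # escaped literal noncharacter
            i += 2
        elif "\ufdd0" <= ch <= "\ufddf" and "\ufde0" <= nxt <= "\ufdef":
            # A high/low nybble pair stands for one stray byte.
            out.append(((ord(ch) - 0xFDD0) << 4) | (ord(nxt) - 0xFDE0))
            i += 2
        else:
            out += ch.encode("utf-8")
            i += 1
    return bytes(out)
```

With this sketch, a stray byte E9 turns into EF B7 9E EF B7 A9, and the result round-trips back to the original bytes. The cost is that every stray byte takes six bytes internally and every string has to pass through a translation step on the way in and out.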
This is indeed a clever idea; thanks for pointing it out. I thought about
it, and it seems that it may not be necessary to worry about the internal
encoding, since the current approach handles strings received from the OS
transparently.
There is no translation step, as all strings are by default assumed to be
UTF-8, with the exception of strings read from ports that have an explicit
binary/latin-1 encoding. The U+DCxx mapping is only relevant at the point of
decoding, when we extract characters from the underlying bytevector; it never
appears in the internal representation of a string itself. So, if we receive
a string (say a filename from a directory-read operation), it may or may not
be valid UTF-8, but it will be kept as the same sequence of bytes.
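
As an analogy only (this is Python, not the actual implementation), the "surrogateescape" error handler behaves in much the same way: the stored bytes are never touched, and the U+DCxx code points show up only when characters are extracted. The filename and the helper name chars below are invented for the example.

```python
# Hypothetical filename as the OS might hand it over: a stray Latin-1
# byte E9, followed later by a properly encoded UTF-8 "é" (C3 A9).
stored = b"r\xe9sum\xc3\xa9.txt"


def chars(raw: bytes) -> str:
    """Extract characters; stray bytes 0xNN appear as U+DCNN."""
    return raw.decode("utf-8", errors="surrogateescape")


decoded = chars(stored)

# The stray byte is visible as a lone low surrogate at decoding time,
# while the valid sequence decodes to an ordinary character.
stray = decoded[1]      # U+DCE9
accented = decoded[5]   # U+00E9

# Going back to bytes reproduces the stored sequence exactly.
round_trip = decoded.encode("utf-8", errors="surrogateescape")
```

Here round_trip is equal to stored, so nothing is lost, but stored itself is still not valid UTF-8: a strict stored.decode("utf-8") raises UnicodeDecodeError.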
This also applies to "string->utf8", which just produces a copy of the
internal bytevector, regardless of whether the string contains invalid
sequences or not. R7RS doesn't state (to my knowledge) any requirements
regarding the result of this procedure. It says "it is an error" for a
string to contain any "forbidden" characters, but, as I understand it, what
is or is not forbidden is up to the implementation.
This may be insufficient (I'm not a Unicode lawyer), but it appears to me to
be compatible with the standard, has a low overhead, and has the advantage
that we don't have to worry about what encoding OS sources of strings (other
than ports) may have.
Perhaps you can clarify what you mean when you say that this can
cause problems when dealing with bytevectors?
cheers,
felix