Re: Proposed alternative encoding for stray UTF-8 bytes in strings
From: elf
Subject: Re: Proposed alternative encoding for stray UTF-8 bytes in strings
Date: Mon, 27 Nov 2023 15:08:14 +0200
User-agent: K-9 Mail for Android
Question: if there is no translation at all, won't the invalid bytes cause
issues with procedures like string-length and string-copy? That is, since the
number of octets can't be reliably translated to a number of characters, there
will be some unpleasant side effects.
-elf
On 27 November 2023 14:49:00 GMT+02:00, felix.winkelmann@bevuta.com wrote:
>> From the unicode-transition page:
>>
>> > The strategy that I favor in the moment is to handle all string data
>> > injected into the system transparently, the actual bytes are unchanged and
>> > unexpected UTF-8 bytes are decoded and marked as a U+DC80 - U+DCFF (low,
>> > trailing) UTF-16 surrogate pair half.
>>
>>
>> The trouble with this is that it means the internal representation is no
>> longer valid UTF-8, which may cause problems down the line, since it is
>> exposed to anyone dealing with bytevectors.
>>
>> There is an alternative based on the little-known "noncharacter" range.
>> Despite the name, these really are perfectly valid characters, but Unicode
>> guarantees that they will never be assigned to anything in the Real World
>> and are reserved for internal use.[1] I propose using them instead of the
>> surrogate space. Unfortunately there aren't enough of them to assign one
>> to each possible stray byte, but we can assign one to each high and low
>> nybble of each stray byte, analogously to the way Planes 1 to 16 are
>> handled in UTF-16.
>>
>> Specifically, given a stray byte whose hex representation is xy, we decode
>> it as the UTF-8 equivalent of U+FDDx U+FDEy, which is EF B7 9x EF B7 Ay in
>> the internal encoding, which is now valid UTF-8. If any of these
>> noncharacters (coming from a UTF-8 or UTF-16 source) is to be decoded, we
>> escape it with the UTF-8 representation of U+FFFE, which is EF BF BE, so
>> that (say) external U+FDDA is decoded as EF BF BE EF B7 9A. U+FFFE is also
>> used to escape itself, so it becomes EF BF BE EF BF BE internally.
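
The nybble scheme described above can be sketched in a few lines. This is only
an illustration under stated assumptions: the names decode_external and
encode_external are hypothetical, Python stands in for the real
implementation language, and Python's own "surrogateescape" handler is used
merely as a convenient way to locate the stray bytes.

```python
ESCAPE = "\uFFFE"  # noncharacter used as the escape marker


def _is_reserved(ch: str) -> bool:
    # U+FDD0..U+FDEF carry nybbles; U+FFFE is the escape itself.
    return "\uFDD0" <= ch <= "\uFDEF" or ch == ESCAPE


def decode_external(raw: bytes) -> bytes:
    """Map arbitrary external bytes to an internal form that is valid UTF-8."""
    out = []
    # surrogateescape turns each undecodable byte 0xNN into U+DCNN,
    # which lets us find the stray bytes one by one.
    for ch in raw.decode("utf-8", errors="surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            byte = cp - 0xDC00
            out.append(chr(0xFDD0 + (byte >> 4)))    # high nybble: U+FDDx
            out.append(chr(0xFDE0 + (byte & 0x0F)))  # low nybble:  U+FDEy
        elif _is_reserved(ch):
            out.append(ESCAPE + ch)  # genuine noncharacter: escape it
        else:
            out.append(ch)
    return "".join(out).encode("utf-8")


def encode_external(internal: bytes) -> bytes:
    """Inverse of decode_external for well-formed internal data."""
    text = internal.decode("utf-8")  # strict: internal form must be valid
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        nxt = text[i + 1] if i + 1 < len(text) else ""
        if ch == ESCAPE and nxt:
            out += nxt.encode("utf-8")  # escaped noncharacter, emit as is
            i += 2
        elif "\uFDD0" <= ch <= "\uFDDF" and nxt and "\uFDE0" <= nxt <= "\uFDEF":
            out.append(((ord(ch) - 0xFDD0) << 4) | (ord(nxt) - 0xFDE0))
            i += 2
        else:
            out += ch.encode("utf-8")
            i += 1
    return bytes(out)
```

For example, the stray byte E9 in b"caf\xe9" becomes EF B7 9E EF B7 A9
internally, and encode_external(decode_external(raw)) gives back raw.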
>
>This is indeed a clever idea, thanks for pointing this out. I thought about
>this and it seems that it might not be necessary to worry about the internal
>encoding, as the current approach tries to handle strings received from the
>OS in a transparent manner.
>
>There is no translation step, as all strings are by default assumed to be
>UTF-8, with the exception of strings read from ports that have an explicit
>binary/latin-1 encoding. The U+DCxx range is only relevant at the point of
>decoding, when we extract characters from the underlying bytevector; it never
>appears in the internal representation of a string itself. So, if we receive a
>string (say a filename from a directory-read operation), it may or may not
>be valid UTF-8, but it will be kept as the same sequence of bytes.
>
>This also applies to "string->utf8", which just produces a copy of the
>internal bytevector, regardless of whether the string contains invalid
>sequences or not. R7RS doesn't state (to my knowledge) any requirements
>regarding the result of this procedure. It says "it is an error" for a
>string to contain any "forbidden" characters, but, as I understand it, what
>is or is not forbidden is up to the implementation.
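
A small model of the behaviour described above, as I read it: the bytes are
stored untouched, U+DCxx values only show up when characters are extracted,
and the "string->utf8" analogue returns a copy of the stored bytes. The class
LazyString and its method names are hypothetical; Python's "surrogateescape"
handler (PEP 383) is used because it happens to apply the same U+DC80 - U+DCFF
mapping, not because the actual implementation works this way.

```python
class LazyString:
    """Hypothetical model: a string that is just its original bytes."""

    def __init__(self, raw: bytes):
        self._raw = bytes(raw)  # stored unchanged, valid UTF-8 or not

    def chars(self) -> str:
        # Decoding happens only on access; each stray byte 0xNN is
        # reported as the lone surrogate U+DCNN.
        return self._raw.decode("utf-8", errors="surrogateescape")

    def __len__(self) -> int:
        # Every stray byte counts as exactly one character, so the
        # length is still well defined.
        return len(self.chars())

    def to_utf8(self) -> bytes:
        # Analogue of string->utf8: a copy of the stored bytes,
        # with no validation applied.
        return bytes(self._raw)


name = LazyString(b"caf\xe9.txt")  # Latin-1 filename, not valid UTF-8
print(len(name))                   # 8
print(hex(ord(name.chars()[3])))   # 0xdce9
print(name.to_utf8())              # b'caf\xe9.txt'
```

In this model string-length and similar operations keep working, but the
result of to_utf8 may fail a strict UTF-8 decode, which is where a consumer
of the bytevector could run into trouble.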
>
>This may be insufficient (I'm not a Unicode lawyer), but it appears to me
>to be compatible with the standard, has a low overhead, and has the
>advantage that we don't have to worry about what encoding OS sources of
>strings (other than ports) may have.
>
>Perhaps you can clarify what you mean when you say that this can
>cause problems when dealing with bytevectors?
>
>
>cheers,
>felix