chicken-hackers

Proposed alternative encoding for stray UTF-8 bytes in strings


From: John Cowan
Subject: Proposed alternative encoding for stray UTF-8 bytes in strings
Date: Fri, 24 Nov 2023 02:34:03 -0500

(If this is too late in the process, I understand.  I think the required code changes will be small and localized.)

From the unicode-transition page:

"The strategy that I favor in the moment is to handle all string data injected into the system transparently, the actual bytes are unchanged and unexpected UTF-8 bytes are decoded and marked as a U+DC80 - U+DCFF (low, trailing) UTF-16 surrogate pair half."

The trouble with this is that it means the internal representation is no longer valid UTF-8, which may cause problems down the line, since it is exposed to anyone dealing with bytevectors.  
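
To illustrate the concern, here is a minimal Python sketch. It assumes that Python's "surrogateescape" error handler, which also maps stray bytes to U+DC80 through U+DCFF, is a fair stand-in for the strategy quoted above; the helper is_valid_utf8 is a name made up for this example and is not part of CHICKEN.

```python
# A stray byte: Latin-1 "é" (0xE9) is not valid on its own in UTF-8.
stray = b"caf\xe9"

# Python's surrogateescape handler maps the stray byte to U+DCE9.
text = stray.decode("utf-8", "surrogateescape")
assert text[-1] == "\udce9"


def is_valid_utf8(data: bytes) -> bool:
    """Hypothetical helper: True if a strict UTF-8 decoder accepts data."""
    try:
        data.decode("utf-8")
        return True
    except UnicodeDecodeError:
        return False


# Writing the surrogate out as three bytes (ED B3 A9) does not help:
# strict UTF-8 forbids encoded surrogates.
surrogate_bytes = text.encode("utf-8", "surrogatepass")

print(is_valid_utf8(stray))            # False: raw stray byte
print(is_valid_utf8(surrogate_bytes))  # False: encoded lone surrogate
```

Whether the buffer keeps the raw byte or an encoded surrogate, a strict consumer of the bytevector will reject it.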

There is an alternative based on the little-known "noncharacter" range.  Despite the name, these really are perfectly valid characters, but Unicode guarantees that they will never be assigned to anything in the Real World and are reserved for internal use.[1]  I propose using them instead of the surrogate space.  Unfortunately there aren't enough of them to assign one to each possible stray byte (the contiguous block U+FDD0 to U+FDEF holds only 32), but we can assign one to each high and low nybble of each stray byte, analogously to the way Planes 1 to 10 (hex) are handled in UTF-16.

Specifically, given a stray byte whose hex representation is xy, we decode it as the UTF-8 equivalent of U+FDDx U+FDEy, which is EF B7 9x EF B7 Ay in the internal encoding, which is now valid UTF-8.  If any of these noncharacters (coming from a UTF-8 or UTF-16 source) is to be decoded, we escape it with the UTF-8 representation of U+FFFE, which is EF BF BE, so that (say) external U+FDDA is decoded as EF BF BE EF B7 9A.  U+FFFE is also used to escape itself, so it becomes EF BF BE EF BF BE internally.
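
The mapping can be sketched in a few lines of Python. This is only an illustration of the scheme as described, not proposed CHICKEN code: the names to_internal and from_internal are invented here, Python's "surrogateescape" handler is used merely as a convenient way to locate stray bytes, and from_internal assumes its input was produced by to_internal (it does no error checking on malformed internal data).

```python
ESCAPE = "\ufffe"    # U+FFFE escapes literal noncharacters, and itself
HIGH_BASE = 0xFDD0   # U+FDDx carries the high nybble of a stray byte
LOW_BASE = 0xFDE0    # U+FDEy carries the low nybble of a stray byte


def to_internal(raw: bytes) -> bytes:
    """External bytes -> internal form that is always valid UTF-8."""
    out = []
    # surrogateescape turns each undecodable byte 0xNN into U+DCNN,
    # which is used here only to find the stray bytes.
    for ch in raw.decode("utf-8", "surrogateescape"):
        cp = ord(ch)
        if 0xDC80 <= cp <= 0xDCFF:
            byte = cp - 0xDC00
            out.append(chr(HIGH_BASE + (byte >> 4)))
            out.append(chr(LOW_BASE + (byte & 0x0F)))
        elif 0xFDD0 <= cp <= 0xFDEF or ch == ESCAPE:
            out.append(ESCAPE + ch)  # genuine noncharacter: escape it
        else:
            out.append(ch)
    return "".join(out).encode("utf-8")


def from_internal(internal: bytes) -> bytes:
    """Inverse of to_internal; assumes well-formed internal data."""
    text = internal.decode("utf-8")  # strict decoding must succeed
    out = bytearray()
    i = 0
    while i < len(text):
        ch = text[i]
        cp = ord(ch)
        if ch == ESCAPE:
            # Escaped noncharacter: emit the following character as-is.
            out += text[i + 1].encode("utf-8")
            i += 2
        elif 0xFDD0 <= cp <= 0xFDDF:
            # High/low nybble pair: rebuild the original stray byte.
            low = ord(text[i + 1]) - LOW_BASE
            out.append(((cp - HIGH_BASE) << 4) | low)
            i += 2
        else:
            out += ch.encode("utf-8")
            i += 1
    return bytes(out)


print(to_internal(b"\xff").hex(" "))                    # ef b7 9f ef b7 af
print(to_internal("\ufdda".encode("utf-8")).hex(" "))   # ef bf be ef b7 9a
```

The cost is size: each stray byte grows to six bytes internally, and each genuine noncharacter from this set grows from three bytes to six. In exchange, the internal buffer can be handed to any strict UTF-8 consumer.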

I hope this is understandable.

[1] See https://www.unicode.org/versions/corrigendum9.html for details.
