gnustep-dev

Re: New ABI NSConstantString


From: Stefan Bidigaray
Subject: Re: New ABI NSConstantString
Date: Sat, 07 Apr 2018 13:48:15 +0000

I looked into this extensively when I was working on CFString, and came to the conclusion that this was probably the path of least resistance.

But just to clarify, the Unicode situation is even more complicated than that. Surrogates are reserved code points, and are not allowed in UTF-8. So to find the UTF-16 length of a UTF-8 string you have to decode the entire string to UTF-32, check that no surrogate code points appear (those should be treated as illegal), and count every character above 0xFFFF as two UTF-16 code units, since it must be encoded as a surrogate pair.
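To make that bookkeeping concrete, here is a minimal sketch in C of such a check; utf8_utf16_length is a hypothetical helper written for this message, not anything in GNUstep:

#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: return the UTF-16 code unit count of a UTF-8
 * buffer, or -1 if the input is malformed (bad lead or continuation
 * bytes, overlong forms, surrogate code points, or > 0x10FFFF). */
long utf8_utf16_length(const uint8_t *s, size_t len)
{
    static const uint32_t min_cp[] = { 0, 0x80, 0x800, 0x10000 };
    long units = 0;
    size_t i = 0;
    while (i < len) {
        uint32_t cp;
        size_t extra;
        uint8_t b = s[i];
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return -1;                   /* invalid lead byte */
        if (i + extra >= len) return -1;  /* truncated sequence */
        for (size_t j = 1; j <= extra; j++) {
            if ((s[i + j] & 0xC0) != 0x80) return -1; /* bad continuation */
            cp = (cp << 6) | (s[i + j] & 0x3F);
        }
        if (cp < min_cp[extra]) return -1;           /* overlong form */
        if (cp >= 0xD800 && cp <= 0xDFFF) return -1; /* surrogate */
        if (cp > 0x10FFFF) return -1;                /* out of range */
        units += (cp > 0xFFFF) ? 2 : 1;  /* supplementary -> surrogate pair */
        i += extra + 1;
    }
    return units;
}

The overlong, surrogate, and range checks are exactly the kind of constant validation I mean below.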

Regardless of what is done, you'll end up in a delicate situation. The UTF encodings must be constantly error checked, because there's always a chance that all this converting back and forth can introduce an invalid character.

On top of it all, UTF-16 surrogate pairs can only encode code points up to 0x10FFFF, which means that if/when this limit is reached, a new encoding will have to be devised.
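For reference, the arithmetic behind that ceiling, as a sketch (utf16_encode_pair is a hypothetical name, not a real API):

#include <assert.h>
#include <stdint.h>

/* Hypothetical helper: split a supplementary code point
 * (0x10000..0x10FFFF) into a UTF-16 surrogate pair.  Subtracting
 * 0x10000 leaves 20 bits, split 10/10 between the high (0xD800)
 * and low (0xDC00) surrogate ranges. */
void utf16_encode_pair(uint32_t cp, uint16_t *hi, uint16_t *lo)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    cp -= 0x10000;                          /* 20 payload bits remain */
    *hi = 0xD800 | (uint16_t)(cp >> 10);    /* top 10 bits */
    *lo = 0xDC00 | (uint16_t)(cp & 0x3FF);  /* bottom 10 bits */
}

Each surrogate half carries 10 bits, so pairs cover 0x100000 code points on top of the 0x10000 in the BMP, which is exactly where the 0x10FFFF ceiling comes from.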

On Sat, Apr 7, 2018, 04:49 David Chisnall <address@hidden> wrote:
On 5 Apr 2018, at 20:09, Stefan Bidigaray <address@hidden> wrote:
>
> I know this is probably going to be rejected, but how about making constant strings either ASCII or UTF-16 only, scrapping UTF-8 altogether? I know this would increase the byte count for most European languages using Latin characters, but I don't see the point of maintaining both the UTF-8 and UTF-16 encodings. Everything that can be encoded in UTF-16 can be encoded in UTF-8 (and vice versa), so how would the compiler pick between the two? Additionally, wouldn't sticking to just one of the two encodings simplify the code significantly?

I am leaning in this direction.  The APIs all want UTF-16 code units.  In ASCII, each character is precisely one UTF-16 code unit.  In UTF-16, every two-byte value is a UTF-16 code unit.  In UTF-8, a UTF-16 code unit corresponds to somewhere between one and three bytes, and the mapping is complicated.  It’s a shame that in the 64-bit transition Apple didn’t make unichar 32 bits and make it a Unicode character, so we’re stuck in the same situation as Windows, with a hasty s/UCS2/UTF-16/ and an attempt to keep the APIs working.
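That is the whole argument in miniature: with an ASCII or UTF-16 backing store, -characterAtIndex: is a constant-time array lookup, as in this sketch (the struct and field names are made up for illustration, not the real layout):

#include <stddef.h>
#include <stdint.h>

typedef uint16_t unichar;

/* Illustrative only: not the real NSConstantString ABI. */
struct const_str {
    int            is_ascii;  /* which representation is stored */
    const char    *ascii;     /* valid when is_ascii is non-zero */
    const unichar *utf16;     /* valid otherwise */
};

/* O(1) for both backing stores: one ASCII byte or one UTF-16 code
 * unit per unichar.  A UTF-8 store would instead need a linear scan
 * from the start of the string to find code unit i. */
unichar char_at_index(const struct const_str *s, size_t i)
{
    return s->is_ascii ? (unichar)(unsigned char)s->ascii[i] : s->utf16[i];
}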

My current plan is to make the format support ASCII, UTF-8, UTF-16, and UTF-32, but only generate ASCII and UTF-16 in the compiler and then decide later if we want to support generating UTF-8 and UTF-32.  I also won’t initialise the hash in the compiler initially, until we’ve decided a bit more what the hash should be.
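As a rough illustration of what such a format could look like (the field names and flag values here are placeholder assumptions, not the final ABI definition):

#include <stdint.h>

/* Sketch only: names and values are assumptions. */
enum str_encoding {
    STR_ENC_ASCII = 0,
    STR_ENC_UTF8  = 1,
    STR_ENC_UTF16 = 2,
    STR_ENC_UTF32 = 3
};

struct objc_constant_string {
    void       *isa;    /* class pointer, fixed up at load time */
    uint32_t    flags;  /* low bits hold a str_encoding value */
    uint32_t    length; /* length in characters */
    uint32_t    size;   /* size of the data in bytes */
    uint32_t    hash;   /* 0 until the hash function is settled */
    const void *data;   /* contents, in the declared encoding */
};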

David

