Re: Proposed alternative encoding for stray UTF-8 bytes in strings
From: elf
Subject: Re: Proposed alternative encoding for stray UTF-8 bytes in strings
Date: Mon, 27 Nov 2023 15:58:23 +0200
User-agent: K-9 Mail for Android
Yes, this is precisely my point: 'one or more'. The string-length of a string
with invalid embedded sequences is not guaranteed to be consistent, which seems
like a problem. Doing a decode to ensure all code points are valid, even those
that come from the undefined sequences, seems like a good idea to prevent
secondary issues.
I take your point that string-copy would not be affected, though, thank you.
-elf
On 27 November 2023 15:41:59 GMT+02:00, felix.winkelmann@bevuta.com wrote:
>> Question: if there is no translation at all, won't the invalid chars cause
>> issues with things like string-length and string-copy procs? That is, since
>> the number of octets can't be correctly translated to a number of glyphs,
>> there will be some unpleasant side effects.
>
>Converting an octet-sequence to a string involves a decoding step to compute
>the length. Any invalid embedded UTF-8 sequence is taken as one or more
>"illegal" code-points, counting for one or more characters in the final string
>length. Note that the length of the "backing store" bytevector for the string
>is retained together with the number of code-points that the string holds (the
>former is stored in the header of the string's bytevector buffer, the latter
>in a slot of the string).
>
>
>felix
>
>
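
To make the 'one or more' concern concrete, here is a small sketch in Python.
It is an illustration only and says nothing about how CHICKEN actually decodes:
Python's two built-in error handlers merely stand in for two possible counting
policies, and the sample bytes and variable names are made up for the example.

```python
# Illustration only: Python's decoders stand in for two possible policies.
# This is not CHICKEN's implementation.
data = b"ab\xe2\x82cd"  # "ab", a truncated 3-byte sequence, then "cd"

# Policy A: every undecodable byte becomes its own placeholder code point.
per_byte = data.decode("utf-8", errors="surrogateescape")

# Policy B: the whole malformed run collapses into a single U+FFFD.
per_run = data.decode("utf-8", errors="replace")

# Size of the backing store in bytes, independent of the decoding policy.
byte_length = len(data)

print(byte_length, len(per_byte), len(per_run))

# Policy A is lossless: the original bytes can be recovered exactly.
assert per_byte.encode("utf-8", errors="surrogateescape") == data
```

The same six bytes yield a string length of 6 under one policy and 5 under the
other, so the reported length depends on a rule that has to be fixed and applied
everywhere. The byte length of the backing store stays at 6 either way, which is
why a byte-wise copy is unaffected.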