chicken-hackers

Re: Proposed alternative encoding for stray UTF-8 bytes in strings


From: elf
Subject: Re: Proposed alternative encoding for stray UTF-8 bytes in strings
Date: Mon, 27 Nov 2023 15:58:23 +0200
User-agent: K-9 Mail for Android

Yes, this is precisely my point: 'one or more'. With invalid embedded
sequences, the result of string-length is not guaranteed to be consistent,
which seems like a problem. Doing a decode to ensure all code points are
valid, even those within the undefined sequences, seems like a good idea to
prevent secondary issues.
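
To make the concern concrete, here is a small Python sketch (not CHICKEN code, and not the proposed encoding itself) that counts code points for the same octets under two of Python's decoding policies. The helper name count_code_points and the sample inputs are made up for illustration; "surrogateescape" and "replace" are standard Python error handlers, used here only as stand-ins for "some policy for stray bytes".

```python
def count_code_points(octets: bytes, policy: str) -> int:
    """Decode octets as UTF-8 under the given error policy and count code points."""
    return len(octets.decode("utf-8", errors=policy))


samples = [
    "héllo".encode("utf-8"),  # valid UTF-8
    b"ab\xffcd",              # one stray byte that is never valid in UTF-8
    b"a\xe2\x82b",            # a 3-byte sequence cut short after two bytes
]

for octets in samples:
    escaped = count_code_points(octets, "surrogateescape")  # one code point per stray byte
    replaced = count_code_points(octets, "replace")          # U+FFFD per invalid subsequence
    print(octets, "bytes:", len(octets), "escaped:", escaped, "replaced:", replaced)
```

For the truncated sequence the two policies give different counts, which is the "one or more" ambiguity: the length is well defined only once the decoding policy is fixed and applied every time.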

I take your point that the string-copy would not be affected, though, thank you.

-elf

On 27 November 2023 15:41:59 GMT+02:00, felix.winkelmann@bevuta.com wrote:
>> Question: if there is no translation at all, won't the invalid chars cause 
>> issues with things like string-length and string-copy procs? That is, since 
>> the number of octets can't be correctly translated to a number of glyphs, 
>> there will be some unpleasant side effects.
>
>Converting an octet sequence to a string involves a decoding step to compute
>the length. Any invalid embedded UTF-8 sequence is taken as one or more
>"illegal" code-points, counting for one or more characters in the final
>string length. Note that the length of the "backing store" bytevector for
>the string is retained together with the number of code-points that the
>string holds (the former is stored in the header of the string's bytevector
>buffer, the latter in a slot of the string).
>
>
>felix
>
>
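
As a rough illustration of the quoted description, the following Python sketch models a string that keeps its backing bytes and a code-point count computed once at construction. It does not reflect CHICKEN's actual object layout; the class CountedString and its methods are hypothetical, and Python's "surrogateescape" handler is assumed as the decoding policy for stray bytes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class CountedString:
    """Hypothetical model: backing bytes plus a cached code-point count."""

    buffer: bytes    # the backing store; its byte length is len(buffer)
    codepoints: int  # number of code points, computed once when decoding

    @classmethod
    def from_octets(cls, octets: bytes) -> "CountedString":
        # The decoding step: each stray byte becomes one lone surrogate,
        # so the count is deterministic for a given input.
        text = octets.decode("utf-8", errors="surrogateescape")
        return cls(buffer=bytes(octets), codepoints=len(text))

    def byte_length(self) -> int:
        return len(self.buffer)

    def copy(self) -> "CountedString":
        # Copying carries the bytes and the cached count; no re-decoding.
        return CountedString(bytes(self.buffer), self.codepoints)


s = CountedString.from_octets(b"ab\xffcd")
print(s.byte_length(), s.codepoints, s.copy() == s)
```

In this model the length query reads the cached count and a copy duplicates the bytes unchanged, so neither operation depends on re-interpreting the invalid sequence.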


