poke-devel

From: Jose E. Marchesi
Subject: Supporting UTF-8 (Was Re: Challenges of adding octal and hexadecimal escape sequences in strings)
Date: Mon, 02 Nov 2020 21:07:50 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)

>> Hi.
>>
>> I've added support for octal and hexadecimal escape sequences in strings.
>> But there's a problem with Poke strings: they are null-terminated.
>>
>> Please consider the following example:
>>
>> ```poke
>> defvar s = "a\0b";
>>
>> assert (s'length == 1);
>> assert (s'size == 2#B);
>> assert (s == "a");
>> ```
>>
>> This behavior is IMHO annoying and counter-intuitive.
>>
>> The desired behavior (IMHO):
>>
>> ```poke
>> defvar s = "a\0b";
>>
>> assert (s'length == 3);
>> assert (s'size == 4#B);
>> assert (s == "a\0b");
>> assert (s + "cde" == "a\0bcde");
>> ```
>>
>> I'm not sure about how `printf` (and `format` in future) should behave:
>>
>> ```poke
>> printf ("%s\n", s); // Should this behave like C and print only two bytes?
>>                     // Or should it print all 3 bytes?
>> ```
>>
>> Maybe we could choose the first approach and also provide something like
>> the "%.*s" specifier (as in `C`) to let the user choose how many bytes
>> they want to print.
>>
>>
>> Possible solution for Poke:
>>   Using a property to keep track of length of string.
>
> I think this is a good idea, because there can be situations where you
> want to have strings including a NULL (C++ explicitly supports this in
> std::string for instance).
>
> Also, something comparable would be probably required anyway for proper
> UTF8 support, where the string's length is not equal to the number of
> bytes (minus 1).
>
> So starting this now sounds like a good idea.
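
For reference, here is a small C sketch (not Poke) of the behavior being compared against above. The `counted_str` type and the helper names are hypothetical, made up for illustration; only `strlen`, `snprintf` and `fwrite` are real libc calls. It shows that a length kept next to the data survives an embedded NUL, while `strlen` and even `"%.*s"` stop at it, since the precision in `"%.*s"` is only an upper bound on the bytes printed.

```c
#include <assert.h>
#include <stdio.h>
#include <string.h>

/* Hypothetical type: a string that carries its own length. */
struct counted_str {
  const char *data;
  size_t len; /* number of bytes, not counting any terminator */
};

/* Length as plain C sees it: counting stops at the first NUL byte. */
size_t c_length(struct counted_str s)
{
  assert(s.data != NULL);
  return strlen(s.data);
}

/* Format with "%.*s" into buf. The precision is a maximum, so the
   output still ends at the first NUL. Returns what snprintf returns. */
int format_precision(struct counted_str s, char *buf, size_t bufsize)
{
  assert(s.data != NULL);
  return snprintf(buf, bufsize, "%.*s", (int) s.len, s.data);
}

/* Write every tracked byte, embedded NULs included.
   Returns the number of bytes written. */
size_t write_all(struct counted_str s, FILE *out)
{
  assert(s.data != NULL);
  return fwrite(s.data, 1, s.len, out);
}
```

With the three bytes of `"a\0b"`, `c_length` reports 1 and `format_precision` produces just `a`, whereas `write_all` emits all three bytes. So a C-style `"%.*s"` alone would not be enough to print the full string; some length-aware output path would be needed as well.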

We discussed supporting UTF-8 at the mini-poke-conf in Switzerland
last January.

Find below an excerpt of some notes published at the time in
http://www.jemarch.net/pokology-20200113.html.  As you can see, we don't
really plan to support UTF-8 strings and characters as native language
types.

----------------------------------------------------------------------

Unicode

How to best support Unicode in poke? We concluded that mimicking the C
support (with its support of "wide" chars and strings) is not a good
idea. We will be splitting the support in several pickles:

 unicode.pk (also handling ucs encodings)
 utf8.pk
 utf16.pk

In addition to providing suitable Poke types (such as one for a UTF-8
character), we will want to implement additional functionality (we used
the GNU libunistring API as a base):

* display width (for printing)
* We don't need explicit check functions because that logic shall be
  implemented in the UTF8 type definitions (constraints.)
* Conversion functions (utf* -> utf*)
* mblen functions are not needed because the logic is implemented as part of
  the mapping.
* ditto for the *cpy functions.
* ditto for the *move functions.
* ditto for the *mbsnlen functions.
* ditto for *next and *prev.
* ditto for *strlen.
* Comparison functions are useful (for sorting for example.)
* strstr for unicode strings.
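
To illustrate why the length of a UTF-8 string differs from its byte count, here is a rough C sketch of the kind of logic that the notes expect to live in the type definitions and the mapping. It is not the planned pickle code and not the libunistring API; `utf8_seq_len` and `utf8_count` are hypothetical names. It only checks the structure of each sequence (lead byte plus continuation bytes) and does not reject overlong encodings or surrogates.

```c
#include <stddef.h>

/* Number of bytes in the sequence introduced by LEAD, or 0 if LEAD
   cannot start a sequence (for instance, a continuation byte). */
size_t utf8_seq_len(unsigned char lead)
{
  if (lead < 0x80) return 1;           /* 0xxxxxxx */
  if ((lead & 0xE0) == 0xC0) return 2; /* 110xxxxx */
  if ((lead & 0xF0) == 0xE0) return 3; /* 1110xxxx */
  if ((lead & 0xF8) == 0xF0) return 4; /* 11110xxx */
  return 0;
}

/* Count the code points in the N bytes at S.  The byte count is passed
   explicitly, so embedded NUL bytes are counted like any other
   character.  Returns (size_t) -1 on truncated or malformed input.
   Structural check only: overlong forms and surrogates are accepted. */
size_t utf8_count(const unsigned char *s, size_t n)
{
  size_t count = 0;
  size_t i = 0;

  while (i < n) {
    size_t len = utf8_seq_len(s[i]);

    if (len == 0 || len > n - i)
      return (size_t) -1;
    for (size_t k = 1; k < len; k++)
      if ((s[i + k] & 0xC0) != 0x80) /* must be 10xxxxxx */
        return (size_t) -1;
    i += len;
    count++;
  }
  return count;
}
```

For example, the seven bytes `68 C3 A9 00 E2 82 AC` (h, é, NUL, €) count as four code points, which is the distinction between size and length raised earlier in the thread.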


