Supporting UTF-8 (Was Re: Challenges of adding octal and hexadecimal esc
From: Jose E. Marchesi
Subject: Supporting UTF-8 (Was Re: Challenges of adding octal and hexadecimal escape sequences in strings)
Date: Mon, 02 Nov 2020 21:07:50 +0100
User-agent: Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)
>> Hi.
>>
>> I've added support for octal and hexadecimal escape sequences in strings.
>> But there's a problem with Poke strings: they are null-terminated.
>>
>> Please consider the following example:
>>
>> ```poke
>> defvar s = "a\0b";
>>
>> assert (s'length == 1);
>> assert (s'size == 2#B);
>> assert (s == "a");
>> ```
>>
>> This behavior is IMHO annoying and counter-intuitive.
>>
>> The desired behavior (IMHO):
>>
>> ```poke
>> defvar s = "a\0b";
>>
>> assert (s'length == 3);
>> assert (s'size == 4#B);
>> assert (s == "a\0b");
>> assert (s + "cde" == "a\0bcde");
>> ```
>>
>> I'm not sure about how `printf` (and `format` in future) should behave:
>>
>> ```poke
>> printf ("%s\n", s); // Should it behave like C and print only two bytes?
>> // Or should it print all 3 bytes?
>> ```
>>
>> Maybe choosing the first approach plus providing something like the "%.*s"
>> specifier (like in `C`) to let the user choose about how many bytes he/she
>> wants to print.
>>
>>
>> Possible solution for Poke:
>> Using a property to keep track of length of string.
>
> I think this is a good idea, because there can be situations where you
> want to have strings including a NULL (C++ explicitly supports this in
> std::string for instance).
>
> Also, something comparable would be probably required anyway for proper
> UTF8 support, where the string's length is not equal to the number of
> bytes (minus 1).
>
> So starting this now sounds like a good idea.
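
To make the length-tracking idea quoted above concrete, here is a minimal sketch in C (not Poke, and not how poke implements strings). The `struct lstring` type and the `lstring_from` and `lstring_equal` helpers are hypothetical names invented for this illustration. The point is only that once the length is stored next to the bytes, an embedded NUL no longer ends the string.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical length-tracked string: the length travels with the bytes,
   so an embedded NUL is just another byte.  */
struct lstring
{
  size_t length;     /* number of bytes, not counting any terminator */
  const char *data;  /* may contain NUL bytes */
};

/* Wrap the first N bytes of DATA.  No copy is made.  */
static struct lstring
lstring_from (const char *data, size_t n)
{
  struct lstring s = { n, data };
  return s;
}

/* Compare two strings byte by byte, embedded NULs included.
   Returns 1 if they are equal, 0 otherwise.  */
static int
lstring_equal (struct lstring a, struct lstring b)
{
  return a.length == b.length
         && memcmp (a.data, b.data, a.length) == 0;
}
```

With this representation, `strlen ("a\0b")` still reports 1, while `lstring_from ("a\0b", 3).length` is 3, which matches the desired `s'length == 3` above. Note that C's `%.*s` only sets an upper bound and still stops at a NUL, so printing all the bytes would need something like `fwrite`.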
We discussed supporting UTF-8 at the mini-poke-conf in Switzerland
last January.
Find below an excerpt of some notes published at the time in
http://www.jemarch.net/pokology-20200113.html. As you can see, we don't
really plan to support UTF-8 strings and characters as native language
types.
----------------------------------------------------------------------
Unicode
How to best support Unicode in poke? We concluded that mimicking the C
support (with its "wide" chars and strings) is not a good idea. We will
be splitting the support into several pickles:
* unicode.pk (also handling UCS encodings)
* utf8.pk
* utf16.pk
In addition to providing suitable Poke types (such as a UTF-8
character), we will want to implement additional functionality (we used
the GNU libunistring API as a base):
* Display width (for printing.)
* We don't need explicit check functions because that logic shall be
  implemented in the UTF-8 type definitions (constraints.)
* Conversion functions (utf* -> utf*.)
* mblen functions are not needed because the logic is implemented as
  part of the mapping.
* Ditto for the *cpy functions.
* Ditto for the *move functions.
* Ditto for the *mbsnlen functions.
* Ditto for *next and *prev.
* Ditto for *strlen.
* Comparison functions are useful (for sorting, for example.)
* strstr for Unicode strings.
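
As a rough illustration of why the check, mblen and strlen logic can live in the type definition itself, here is a sketch in C (not Poke, and not the libunistring API) of what decoding a UTF-8 character has to verify. The function names `utf8_decode` and `utf8_length` are hypothetical. The lead byte determines the sequence length (what mblen would report), and the remaining tests are the kind of conditions a UTF-8 type's constraints would presumably express.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one UTF-8 sequence from BUF (at most N bytes) into *CP.
   Returns the number of bytes consumed, or 0 if the sequence is
   invalid or truncated.  */
static size_t
utf8_decode (const unsigned char *buf, size_t n, uint32_t *cp)
{
  size_t len, i;
  uint32_t c, min;

  if (n == 0)
    return 0;

  /* The lead byte fixes the length and the smallest code point that
     may legitimately use that length.  */
  if (buf[0] < 0x80)
    { len = 1; c = buf[0]; min = 0; }
  else if ((buf[0] & 0xE0) == 0xC0)
    { len = 2; c = buf[0] & 0x1F; min = 0x80; }
  else if ((buf[0] & 0xF0) == 0xE0)
    { len = 3; c = buf[0] & 0x0F; min = 0x800; }
  else if ((buf[0] & 0xF8) == 0xF0)
    { len = 4; c = buf[0] & 0x07; min = 0x10000; }
  else
    return 0;  /* stray continuation byte or invalid lead byte */

  if (len > n)
    return 0;  /* truncated sequence */

  /* Every following byte must look like 10xxxxxx.  */
  for (i = 1; i < len; i++)
    {
      if ((buf[i] & 0xC0) != 0x80)
        return 0;
      c = (c << 6) | (buf[i] & 0x3F);
    }

  /* Reject overlong forms, surrogates and out-of-range values.  */
  if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
    return 0;

  *cp = c;
  return len;
}

/* Count the code points in the N bytes at BUF.
   Returns (size_t) -1 if the input is not valid UTF-8.  */
static size_t
utf8_length (const unsigned char *buf, size_t n)
{
  size_t count = 0, pos = 0;

  while (pos < n)
    {
      uint32_t cp;
      size_t len = utf8_decode (buf + pos, n - pos, &cp);

      if (len == 0)
        return (size_t) -1;
      pos += len;
      count++;
    }
  return count;
}
```

For the six bytes `68 C3 A9 6C 6C 6F`, `utf8_length` reports five code points, which is the case mentioned earlier in the thread where the string's length is not equal to the number of bytes.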