Supporting UTF-8 (Was Re: Challenges of adding octal and hexadecimal escape sequences in strings)

poke-devel

From:	Jose E. Marchesi
Subject:	Supporting UTF-8 (Was Re: Challenges of adding octal and hexadecimal escape sequences in strings)
Date:	Mon, 02 Nov 2020 21:07:50 +0100
User-agent:	Gnus/5.13 (Gnus v5.13) Emacs/28.0.50 (gnu/linux)

>> Hi.
>>
>> I've added support for octal and hexadecimal escape sequences in strings.
>> But there's a problem with Poke strings: they are null-terminated.
>>
>> Please consider the following example:
>>
>> ```poke
>> defvar s = "a\0b";
>>
>> assert (s'length == 1);
>> assert (s'size == 2#B);
>> assert (s == "a");
>> ```
>>
>> This behavior is IMHO annoying and counter-intuitive.
>>
>> The desired behavior (IMHO):
>>
>> ```poke
>> defvar s = "a\0b";
>>
>> assert (s'length == 3);
>> assert (s'size == 4#B);
>> assert (s == "a\0b");
>> assert (s + "cde" == "a\0bcde");
>> ```
>>
>> I'm not sure about how `printf` (and `format` in future) should behave:
>>
>> ```poke
>> printf ("%s\n", s); // Should this behave like C and print only two bytes?
>>                     // Or should it print all 3 bytes?
>> ```
>>
>> Maybe we should choose the first approach and also provide something like
>> the "%.*s" specifier (as in `C`) to let the user choose how many bytes
>> to print.
>>
>>
>> Possible solution for Poke:
>>   Using a property to keep track of length of string.
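
To make the two behaviours above concrete, here is a rough sketch in
Python (not Poke) that contrasts a NUL-terminated view of a buffer with
one whose length is tracked separately.  The helper names `c_strlen` and
`print_n` are invented for this illustration and are not part of libpoke;
`print_n` only approximates what C's "%.*s" does.

```python
# Illustration only: all names here are hypothetical, not libpoke API.

def c_strlen(buf: bytes) -> int:
    """Length as C's strlen would see it: bytes before the first NUL."""
    end = buf.find(b"\x00")
    return len(buf) if end == -1 else end


def print_n(buf: bytes, n: int) -> str:
    """Rough analogue of printf("%.*s", n, buf): at most n bytes,
    still stopping early at the first NUL."""
    return buf[:min(n, c_strlen(buf))].decode("ascii")


raw = b"a\x00b"

# What a NUL-terminated representation reports for "a\0b".
nul_terminated_len = c_strlen(raw)

# What a separately tracked length would report for the same bytes.
tracked_len = len(raw)
```

With a tracked length the embedded NUL is just another byte, so
concatenation and comparison can work on all three characters.
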
>
> I think this is a good idea, because there can be situations where you
> want to have strings including a NUL character (C++ explicitly supports
> this in std::string for instance).
>
> Also, something comparable would probably be required anyway for proper
> UTF-8 support, where the string's length is not equal to the number of
> bytes (minus 1).
>
> So starting this now sounds like a good idea.
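
The point about UTF-8 lengths can be illustrated with a small Python
sketch (Python rather than Poke, purely for illustration).  The function
name `utf8_length` is made up here; it counts characters by skipping
continuation bytes, which assumes the input is already valid UTF-8.

```python
# Illustration only: character count versus byte count in UTF-8.

def utf8_length(data: bytes) -> int:
    """Count code points in valid UTF-8 by counting every byte that is
    not a continuation byte (continuation bytes look like 10xxxxxx)."""
    return sum(1 for b in data if (b & 0xC0) != 0x80)


text = "na\u00efve"              # 'i' with diaeresis, precomposed
encoded = text.encode("utf-8")

char_count = utf8_length(encoded)  # number of characters
byte_count = len(encoded)          # number of bytes, no terminator
```

Here the string has five characters but occupies six bytes, so a length
derived from the byte count alone would be wrong.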

We discussed supporting UTF-8 at the mini-poke-conf in Switzerland
last January.

Find below an excerpt of some notes published at the time in
http://www.jemarch.net/pokology-20200113.html.  As you can see, we don't
really plan to support UTF-8 strings and characters as native language
types.

----------------------------------------------------------------------

Unicode

How to best support Unicode in poke? We concluded that mimicking the C
support (with its support of "wide" chars and strings) is not a good
idea. We will be splitting the support in several pickles:

 unicode.pk (also handling ucs encodings)
 utf8.pk
 utf16.pk

In addition to providing suitable Poke types (such as one for a UTF-8
character) we will want to implement additional functionality (we used
the GNU libunistring API as a base):

* display width (for printing)
* We don't need explicit check functions because that logic shall be
  implemented in the UTF-8 type definitions (constraints.)
* Conversion functions (utf* -> utf*)
* mblen functions are not needed because the logic is implemented as part
  of the mapping.
* ditto for the *cpy functions.
* ditto for the *move functions.
* ditto for the *mbsnlen functions.
* ditto for *next and *prev.
* ditto for *strlen.
* Comparison functions are useful (for sorting for example.)
* strstr for unicode strings.
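
As a rough illustration of why separate check, mblen, next and strlen
functions become redundant once validity is part of the type itself, here
is a Python sketch (not Poke, and not the planned pickle) of a decoder
whose checks play the role of the constraints.  The names `decode_one`
and `code_points` are hypothetical.

```python
# Illustration only: a UTF-8 decoder whose validity checks stand in for
# the constraints a utf8.pk type could carry.  Names are hypothetical.
from typing import List, Tuple


def decode_one(data: bytes, pos: int = 0) -> Tuple[int, int]:
    """Decode one character at pos; return (code point, byte length)."""
    lead = data[pos]
    if lead < 0x80:                      # single-byte (ASCII) character
        return lead, 1
    if 0xC2 <= lead <= 0xDF:             # 0xC0 and 0xC1 would be overlong
        length, cp = 2, lead & 0x1F
    elif 0xE0 <= lead <= 0xEF:
        length, cp = 3, lead & 0x0F
    elif 0xF0 <= lead <= 0xF4:
        length, cp = 4, lead & 0x07
    else:
        raise ValueError("invalid lead byte")
    if pos + length > len(data):
        raise ValueError("truncated sequence")
    for b in data[pos + 1:pos + length]:
        if (b & 0xC0) != 0x80:           # must look like 10xxxxxx
            raise ValueError("invalid continuation byte")
        cp = (cp << 6) | (b & 0x3F)
    if (length == 3 and cp < 0x800) or (length == 4 and cp < 0x10000):
        raise ValueError("overlong encoding")
    if 0xD800 <= cp <= 0xDFFF or cp > 0x10FFFF:
        raise ValueError("code point out of range")
    return cp, length


def code_points(data: bytes) -> List[int]:
    """Walk a buffer one character at a time."""
    out, pos = [], 0
    while pos < len(data):
        cp, length = decode_one(data, pos)
        out.append(cp)
        pos += length
    return out
```

The byte length falls out of decoding (the "mblen" case), stepping by
that length gives "next", and the number of decoded items gives the
string length, so none of these needs its own function.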

Current Thread

Re: Challenges of adding octal and hexadecimal escape sequences in strings, Dan Čermák, 2020/11/01
- Re: Challenges of adding octal and hexadecimal escape sequences in strings, Mohammad-Reza Nabipoor, 2020/11/01
- Supporting UTF-8 (Was Re: Challenges of adding octal and hexadecimal escape sequences in strings), Jose E. Marchesi <=
