Re: [Chicken-hackers] CR #1142 and upcoming changes

From: Alex Shinn
Date: Wed, 20 Aug 2014 17:51:54 +0900

On Wed, Aug 20, 2014 at 5:40 PM, Felix Winkelmann <address@hidden> wrote:

From: Peter Bex <address@hidden>
Subject: Re: [Chicken-hackers] CR #1142 and upcoming changes

Date: Wed, 20 Aug 2014 10:02:51 +0200

> On Wed, Aug 20, 2014 at 11:59:58AM +0400, Yaroslav Tsarko wrote:
>> On 19.08.2014 19:24, Felix Winkelmann wrote:
>> >
>> >Sounds like a good first step, even though I personally would prefer
>> >UCS-4 strings (constant lookup + modification and so on). But that
>> >seems to be unpopular, AFAICT...
>>
>> Wouldn't it be possible to specify which internal string encoding is
>> used by the core as a CHICKEN build-time option? For embedded systems
>> with limited resources that would give a useful trade-off to choose
>> from: either consume more memory but get faster lookups etc. (in the
>> case of UCS-4), or consume less memory at the cost of UTF-8
>> conversions on the fly during string operations.
>
> I think it would be possible, but I dislike the idea because it is hard
> to maintain two separate compilation options like that.

Well, actually we might as well support several: ASCII/Latin-1, UTF-8
and UCS-2/UCS-4. Without UTF-8 it would just be a variable
element-size option. But I agree that this doesn't make maintenance
any easier... Let's think some more about this. We don't have to
decide right now.

Shouldn't be too hard for the core. Larceny provides multiple
string encoding options along these lines.
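
To make the "variable element-size option" concrete, here is a rough
sketch in C of a fixed-width string representation whose element size
is picked at build time. All names here (STR_ELEM_BITS, str_elem_t,
str_ref, str_set) are made up for illustration and are not CHICKEN or
Larceny internals. UTF-8 does not fit this scheme, since its elements
are variable-width.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical build-time knob: compile with -DSTR_ELEM_BITS=8, 16 or 32. */
#ifndef STR_ELEM_BITS
#define STR_ELEM_BITS 32
#endif

#if STR_ELEM_BITS == 8
typedef uint8_t str_elem_t;            /* ASCII/Latin-1 */
#define STR_ELEM_MAX 0xFFu
#elif STR_ELEM_BITS == 16
typedef uint16_t str_elem_t;           /* UCS-2 */
#define STR_ELEM_MAX 0xFFFFu
#elif STR_ELEM_BITS == 32
typedef uint32_t str_elem_t;           /* UCS-4 */
#define STR_ELEM_MAX 0x10FFFFu
#else
#error "STR_ELEM_BITS must be 8, 16 or 32"
#endif

/* Constant-time lookup: just an array index, whatever the width.
   The caller is responsible for bounds checking. */
uint32_t str_ref(const str_elem_t *s, size_t i)
{
    return s[i];
}

/* Constant-time in-place update. Returns 0 on success, -1 if the index
   is out of range or the character does not fit the chosen width. */
int str_set(str_elem_t *s, size_t len, size_t i, uint32_t c)
{
    if (i >= len || c > STR_ELEM_MAX)
        return -1;
    s[i] = (str_elem_t)c;
    return 0;
}
```

The only thing that varies between builds is the typedef and the range
check, which is why the fixed-width cases are cheap to support
together.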

The trickier part is the FFI. The majority of C libraries use UTF-8,
and users will expect a Scheme string to map naturally to char*
arguments. So for an internal encoding of Latin-1, UTF-16 or UCS-4
you'd need to recode at the FFI boundary.
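
As an illustration of what that recoding costs, here is a minimal
sketch in C of the UCS-4 to UTF-8 direction, i.e. what would have to
run before a string is handed to a char* parameter. The function name
ucs4_to_utf8 is hypothetical and not an existing CHICKEN API; it
assumes the internal string is a plain array of code points.

```c
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Hypothetical helper: encode len code points as a freshly allocated,
   NUL-terminated UTF-8 buffer. Returns NULL on allocation failure or
   on an invalid code point (surrogate or above U+10FFFF). The caller
   frees the result. A U+0000 in the input will look like the end of
   the string to C code. */
char *ucs4_to_utf8(const uint32_t *src, size_t len)
{
    /* Worst case is 4 bytes per code point, plus the terminator. */
    unsigned char *out = malloc(len * 4 + 1);
    size_t i, o = 0;

    if (out == NULL)
        return NULL;

    for (i = 0; i < len; i++) {
        uint32_t c = src[i];

        if (c < 0x80) {
            out[o++] = (unsigned char)c;
        } else if (c < 0x800) {
            out[o++] = (unsigned char)(0xC0 | (c >> 6));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c < 0x10000) {
            if (c >= 0xD800 && c <= 0xDFFF) {   /* surrogates are not characters */
                free(out);
                return NULL;
            }
            out[o++] = (unsigned char)(0xE0 | (c >> 12));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else if (c <= 0x10FFFF) {
            out[o++] = (unsigned char)(0xF0 | (c >> 18));
            out[o++] = (unsigned char)(0x80 | ((c >> 12) & 0x3F));
            out[o++] = (unsigned char)(0x80 | ((c >> 6) & 0x3F));
            out[o++] = (unsigned char)(0x80 | (c & 0x3F));
        } else {
            free(out);
            return NULL;
        }
    }
    out[o] = '\0';
    return (char *)out;
}
```

Note that this is an allocation plus a full copy on every call into C,
and the reverse conversion is needed for every char* result coming
back, which is the overhead an internal UTF-8 representation avoids.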

Similarly it might be nice to convert internal UTF-8 strings for APIs
based on wint_t, but these are less common and also locale-dependent,
which makes them a pain to work with.
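
For the opposite direction, a locale-independent decoder from UTF-8 to
code points might look like the sketch below. The name utf8_to_ucs4 is
hypothetical. It deliberately produces uint32_t rather than wchar_t,
because whether a wchar_t holds a Unicode code point depends on the
platform and locale.

```c
#include <stddef.h>
#include <stdint.h>

/* Hypothetical helper: decode NUL-terminated UTF-8 into dst, which has
   room for cap code points. Returns the number of code points written,
   or (size_t)-1 on malformed input or if dst is too small. */
size_t utf8_to_ucs4(const char *src, uint32_t *dst, size_t cap)
{
    const unsigned char *p = (const unsigned char *)src;
    size_t n = 0;

    while (*p != 0) {
        uint32_t c, min;
        int extra, k;

        /* The lead byte gives the sequence length and the smallest
           code point that may legally use that length. */
        if (*p < 0x80)                { c = *p;        extra = 0; min = 0; }
        else if ((*p & 0xE0) == 0xC0) { c = *p & 0x1F; extra = 1; min = 0x80; }
        else if ((*p & 0xF0) == 0xE0) { c = *p & 0x0F; extra = 2; min = 0x800; }
        else if ((*p & 0xF8) == 0xF0) { c = *p & 0x07; extra = 3; min = 0x10000; }
        else return (size_t)-1;       /* stray continuation or invalid byte */
        p++;

        for (k = 0; k < extra; k++) {
            /* A NUL here fails the test too, so truncated input is rejected. */
            if ((*p & 0xC0) != 0x80)
                return (size_t)-1;
            c = (c << 6) | (*p & 0x3F);
            p++;
        }

        /* Reject overlong forms, surrogates and out-of-range values. */
        if (c < min || c > 0x10FFFF || (c >= 0xD800 && c <= 0xDFFF))
            return (size_t)-1;
        if (n == cap)
            return (size_t)-1;
        dst[n++] = c;
    }
    return n;
}
```

The result can be copied straight into a wchar_t buffer only where the
implementation defines __STDC_ISO_10646__; elsewhere you are back to
mbstowcs and whatever the current locale happens to be.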

Alex