Re: [Chicken-hackers] Numbers egg interaction with other compiled code.


From: John Cowan
Subject: Re: [Chicken-hackers] Numbers egg interaction with other compiled code.
Date: Sat, 24 Oct 2009 15:04:25 -0400
User-agent: Mutt/1.5.13 (2006-08-11)

Tony Sidaway scripsit:

> I was under the impression that Chicken was already Unicode-aware, but
> apparently it's only partially so.

This is the story:

1) The character data type can handle the full range of Unicode characters
from #\u0 to #\u10FFFF.

2) Literal strings containing \u escapes encode those escapes as UTF-8,
independently of the input encoding in use.

3) Otherwise, literal strings just contain the bytes provided by the
current encoding.

4) Using string or string-set! or any similar operation to put characters
into a string will chop them to the lowest 8 bits.

Thus:

#;1> (string-length "€")
3
#;2> (string-length "\u20ac")
3
#;3> (string #\€)
"�"
#;4> (char->integer (string-ref #3 0))
172
#;5> (number->string #4 16)
"ac"

(It's sheer coincidence that "ac" is both the last two hex digits of
the Unicode code point value of the euro sign and the hex digits for
the last byte of its UTF-8 representation, E2 82 AC.)
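
To see where the 3 and the "ac" come from, here is a minimal sketch in
portable Scheme that uses only integer arithmetic. The helper names
utf8-3-bytes and low-byte are made up for illustration and are not part
of Chicken, and the three-byte formula only applies to code points from
U+0800 to U+FFFF.

```scheme
;; Illustrative helpers (hypothetical names, not Chicken procedures).

;; UTF-8 bytes for a code point in the three-byte range U+0800..U+FFFF.
(define (utf8-3-bytes cp)
  (list (+ #xE0 (quotient cp 4096))               ; 1110xxxx: top 4 bits
        (+ #x80 (remainder (quotient cp 64) 64))  ; 10xxxxxx: middle 6 bits
        (+ #x80 (remainder cp 64))))              ; 10xxxxxx: low 6 bits

;; What is left when a code point is chopped to its lowest 8 bits.
(define (low-byte cp)
  (remainder cp 256))

;; Code point of the euro sign.
(define euro #x20AC)

;; Show both results as hex strings.
(display (map (lambda (b) (number->string b 16)) (utf8-3-bytes euro)))
(newline)
(display (number->string (low-byte euro) 16))
(newline)
```

Evaluating (utf8-3-bytes euro) gives the three bytes E2 82 AC, which is
why string-length reports 3 for the literal, and (low-byte euro) gives
AC, the single byte that survives the chopping.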

> The Euro symbol "€" is utf-8 #x20ac
> 
> (string-length (string #\€))
> ===> 1

Because the string is only 1 byte long: it holds just the low 8 bits
(#xac) of the euro sign's code point.

> (string=? (string #\€) (string (integer->char #x20ac)))
> ===> #t

They're equal because they're both bogus.

> (char=? (integer->char #x20ac) #\€)
> ===> #t

Here you're dealing with the real euro sign, not chopped.

> but (on my system at least):
> 
> (number->string (char->integer (string-ref (string (integer->char
> #x20ac )) 0)) 16)
> ===> "ac"

The "string" function is doing the chopping here.

> This is deeply puzzling.  string-length knows that the string is a
> single character, but string-ref will only let you look at the first
> byte. And worse, it refuses to look at the second byte because as far
> as it's concerned the string only contains 1 byte:

It really does contain only one byte.

> (string-ref (string (integer->char  #x20ac )) 1)
> ===> Error: (string-ref) out of range
> 
> This sounds like something that is relatively easy to fix. There's no
> reason that I can think of why Chicken shouldn't be fully UTF-aware,

Basically because strings are used as both character sequences and
byte sequences, and that's the way Felix wants it.  He considers that
the practical way of doing things.  (I've discussed the question with
him offline.)  Those who want Unicode, according to him, should pay
the price for it and everyone else should not.  The fact that this
can result in modules that treat non-ASCII strings inconsistently,
he considers one of the things programmers should work around.

(I am reasonably sure I am not misrepresenting him, but I'm constrained
from quoting his private email.)

> Is this a limitation due to Chicken's being a Scheme-to-C implementation?

Definitely not.  It's a choice about data structure representations.

My view is that string-set! and string-fill! should be removed from
Scheme, leaving immutable full-Unicode character strings, and then a
separate byte-vector/blob datatype should be provided.  However, that
seems unlikely to happen in Chicken unless and until Felix abandons
creative control.

-- 
John Cowan    address@hidden    http://ccil.org/~cowan
Objective consideration of contemporary phenomena compel the conclusion
that optimum or inadequate performance in the trend of competitive
activities exhibits no tendency to be commensurate with innate capacity,
but that a considerable element of the unpredictable must invariably be
taken into account. --Ecclesiastes 9:11, Orwell/Brown version



