Re: [Gcl-devel] utf8 and emacs text/string multibyte representation
From: Raymond Toy
Date: Sat, 01 Nov 2014 11:23:31 -0700
User-agent: Gnus/5.101 (Gnus v5.10.10) XEmacs/21.5-b34 (darwin)
>>>>> "Matt" == Matt Kaufmann <address@hidden> writes:
Matt> Hi --
Matt> I think you and Camm know more about this than I do, but to answer
Matt> your question, below is what I get in GCL 2.6.12. Except, I don't
Matt> know how mailers handle high characters of the sort GCL printed in the
Matt> output from (string (code-char 232)) below, so although that string
Matt> was printed using a single character, here I show it as four
Matt> characters (that visually appear just like the one-character version).
>> (code-char 232)
Matt> #\\350
>> (string (code-char 232))
Matt> "\350"
I think this is really what we're trying to figure out. What you show
is what gcl does today. The question is what would happen if Unicode
support were added to gcl using 8-bit characters with utf-8 strings.
I think gcl would then do pretty much the same as above, but since the
string would be utf-8 encoded, a string consisting of a single octet
with value 232 would not be a valid utf-8 string. Codepoint 232 needs
two octets in utf-8: 195 followed by 168.
To make a utf-8 string, you would have to do something like
(let ((s (make-string 2)))
  (setf (aref s 0) (code-char 195))
  (setf (aref s 1) (code-char 168))
  s)
Or maybe a utility function codepoints-to-string that takes a vector
of codepoints and creates a utf-8 string out of them.
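A rough sketch of what such a utility might look like, assuming the
8-bit-character model discussed above, where each character of the
result holds one utf-8 octet. The helper name codepoint-octets is made
up for illustration; nothing like this exists in gcl as far as this
thread shows.

```lisp
;; Hypothetical helper: return the utf-8 octets for one codepoint as a list.
(defun codepoint-octets (cp)
  (cond ((<= #xD800 cp #xDFFF)
         ;; Surrogates are not valid Unicode scalar values.
         (error "Surrogate codepoint: ~S" cp))
        ((< cp #x80)
         (list cp))
        ((< cp #x800)
         (list (logior #xC0 (ash cp -6))
               (logior #x80 (logand cp #x3F))))
        ((< cp #x10000)
         (list (logior #xE0 (ash cp -12))
               (logior #x80 (logand (ash cp -6) #x3F))
               (logior #x80 (logand cp #x3F))))
        ((< cp #x110000)
         (list (logior #xF0 (ash cp -18))
               (logior #x80 (logand (ash cp -12) #x3F))
               (logior #x80 (logand (ash cp -6) #x3F))
               (logior #x80 (logand cp #x3F))))
        (t (error "Not a Unicode codepoint: ~S" cp))))

;; Sketch of codepoints-to-string: CODEPOINTS is a vector of integers.
;; Assumption: every character of the result stores a single octet.
(defun codepoints-to-string (codepoints)
  (let ((octets (loop for cp across codepoints
                      append (codepoint-octets cp))))
    (map 'string #'code-char octets)))
```

With this, (codepoints-to-string #(232)) builds the same two-character
string as the let form above, with char-codes 195 and 168.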
Matt> Interestingly, your (count nil (loop ...)) form also evaluates to 63
Matt> in CCL, CLISP, and SBCL, but it evaluates to 32 in Allegro CL and 66
Matt> in LispWorks. It seems to me that the HyperSpec documentation allows
Matt> for these differences.
Interesting. I don't have a copy of acl or lispworks, but cmucl
decides whether a character is alpha-char-p from the Unicode general
category of its codepoint. I wonder how they differ....
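One way to see where the implementations disagree would be to list the
codes each one treats as alphabetic and diff the lists. This is only a
sketch and is not the (count nil (loop ...)) form quoted above; the
function name alpha-codes and the default limit of 256 are assumptions
made for illustration.

```lisp
;; Collect the char-codes below LIMIT whose characters satisfy alpha-char-p.
;; code-char may return NIL for codes with no character, so guard for that.
(defun alpha-codes (&optional (limit 256))
  (loop for code from 0 below (min limit char-code-limit)
        for ch = (code-char code)
        when (and ch (alpha-char-p ch))
          collect code))
```

Running (alpha-codes) in each Lisp and comparing the results should
show which codes account for the different counts.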
--
Ray