
Re: characters set's problem


From: Sascha Brawer
Subject: Re: characters set's problem
Date: Tue, 4 Nov 2003 11:55:48 +0100

jsona laio <address@hidden> wrote on Thu, 30 Oct 2003 07:55:17 +0000:

>however, lately i want to participate a porject, in
>which involves developing encoding like CCCII (CJK
>based characters set for asian characters, which
>defines more characters than unicode supports in the
>parts of CJK). however, as i know, java vm is based
>upon unicode.

What exactly do you mean when you say "Unicode"? It seems that you think
that Java uses only U+0000 .. U+FFFF (the "basic multilingual plane",
which is the range that UCS-2 can represent). As far as I know, this was
true in the past, but this restriction has been lifted in the meantime.
Nowadays, Java uses the full Unicode character set.

The following is quoted from [1]:

>The native coded character set of the Java programming language is that
>of the first seventeen planes of the Unicode version 3.0 character set;
>that is, it consists in the basic multilingual plane (BMP) of Unicode
>version 1 plus the next sixteen planes of Unicode version 3. This is
>because the language's internal representation of characters uses the
>UTF-16 encoding, which encodes the BMP directly and uses surrogate pairs,
>a simple escape mechanism, to encode the other planes. Hence a charset in
>the Java platform defines a mapping between sequences of sixteen-bit
>values in UTF-16 and sequences of bytes.
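
To make the surrogate mechanism mentioned in the quote concrete, here is a minimal sketch of the arithmetic that turns a code point beyond the BMP into two sixteen-bit values and back. The class and method names are made up for illustration; newer Java versions may offer library methods for this, but the sketch only uses plain integer arithmetic.

```java
class SurrogatePairSketch {
    // Splits a supplementary code point (U+10000 .. U+10FFFF) into its
    // UTF-16 high and low surrogate.
    static char[] toSurrogates(int codePoint) {
        if (codePoint < 0x10000 || codePoint > 0x10FFFF) {
            throw new IllegalArgumentException("not a supplementary code point");
        }
        int offset = codePoint - 0x10000;                // 20-bit value
        char high = (char) (0xD800 + (offset >> 10));    // upper 10 bits
        char low = (char) (0xDC00 + (offset & 0x3FF));   // lower 10 bits
        return new char[] { high, low };
    }

    // Reverses toSurrogates: combines a surrogate pair into one code point.
    static int toCodePoint(char high, char low) {
        return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00);
    }

    public static void main(String[] args) {
        // U+20000 is the first code point of CJK Unified Ideographs Extension B.
        char[] pair = toSurrogates(0x20000);
        System.out.printf("U+20000 -> %04X %04X%n", (int) pair[0], (int) pair[1]);
        // prints "U+20000 -> D840 DC00"
    }
}
```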

Basically, you are free to use any Unicode code point that can be mapped
to and from UTF-16. You also might want to have a look at Unicode 4.0,
which has added many additional code points for CJK ideographs. (By the
way, CCCII is listed among the "Source Standards and Specifications" of
Unicode 4.0, chapter R.1, page 1385; also available online from the
Unicode site). But if you really need ideographs that are not covered by
Unicode, there are the "Private Use Areas". These are large ranges of
Unicode code points that are reserved for private purposes.
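
For reference, the ranges reserved for private use are U+E000 .. U+F8FF in the BMP, plus most of planes 15 and 16. A small sketch of a range check follows; the class and method names are hypothetical and only serve as illustration.

```java
class PrivateUseSketch {
    // Returns true if the code point lies in one of the three ranges
    // that Unicode reserves for private use.
    static boolean isPrivateUse(int codePoint) {
        return (codePoint >= 0xE000 && codePoint <= 0xF8FF)        // BMP
            || (codePoint >= 0xF0000 && codePoint <= 0xFFFFD)      // plane 15
            || (codePoint >= 0x100000 && codePoint <= 0x10FFFD);   // plane 16
    }

    public static void main(String[] args) {
        System.out.println(isPrivateUse(0xE000));   // true
        System.out.println(isPrivateUse(0x4E00));   // false, a regular CJK ideograph
        System.out.println(isPrivateUse(0xF0000));  // true
    }
}
```

Note that the two ranges outside the BMP need surrogate pairs in UTF-16, so each such code point occupies two char values.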

If you want to write a converter between some character encoding and
Unicode (possibly using code points from a private usage area, if Unicode
does not provide a code point for a specific ideograph), please have a
look at the java.nio.charset package. I think that GNU Classpath would
be glad to accept converters.
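
As a rough idea of what such a converter looks like, here is a minimal sketch of a Charset subclass. The encoding itself is invented for this example and has nothing to do with the real CCCII tables: bytes 0x00 .. 0x7F are ASCII, and bytes 0x80 .. 0xFF are mapped to the private use code points U+E000 .. U+E07F. The name "X-Toy-PUA" and all class and helper names are assumptions for illustration only.

```java
import java.nio.ByteBuffer;
import java.nio.CharBuffer;
import java.nio.charset.Charset;
import java.nio.charset.CharsetDecoder;
import java.nio.charset.CharsetEncoder;
import java.nio.charset.CoderResult;

// Hypothetical toy encoding, NOT real CCCII.
class ToyPuaCharset extends Charset {
    ToyPuaCharset() {
        super("X-Toy-PUA", null);   // invented name, no aliases
    }

    public boolean contains(Charset cs) {
        return cs instanceof ToyPuaCharset || cs.name().equals("US-ASCII");
    }

    public CharsetDecoder newDecoder() {
        // One byte always yields exactly one char.
        return new CharsetDecoder(this, 1.0f, 1.0f) {
            protected CoderResult decodeLoop(ByteBuffer in, CharBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    int b = in.get() & 0xFF;
                    out.put((char) (b < 0x80 ? b : 0xE000 + (b - 0x80)));
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    public CharsetEncoder newEncoder() {
        // One char yields at most one byte.
        return new CharsetEncoder(this, 1.0f, 1.0f) {
            protected CoderResult encodeLoop(CharBuffer in, ByteBuffer out) {
                while (in.hasRemaining()) {
                    if (!out.hasRemaining()) return CoderResult.OVERFLOW;
                    char c = in.get(in.position());   // peek without consuming
                    int b;
                    if (c < 0x80) {
                        b = c;
                    } else if (c >= 0xE000 && c <= 0xE07F) {
                        b = 0x80 + (c - 0xE000);
                    } else {
                        // Leave the char unconsumed and report it.
                        return CoderResult.unmappableForLength(1);
                    }
                    in.get();                         // consume the char
                    out.put((byte) b);
                }
                return CoderResult.UNDERFLOW;
            }
        };
    }

    // Convenience helpers for trying the charset out.
    static byte[] toBytes(String s) {
        ByteBuffer bb = new ToyPuaCharset().encode(s);
        byte[] result = new byte[bb.remaining()];
        bb.get(result);
        return result;
    }

    static String fromBytes(byte[] bytes) {
        return new ToyPuaCharset().decode(ByteBuffer.wrap(bytes)).toString();
    }

    public static void main(String[] args) {
        byte[] encoded = toBytes("A\uE001");
        System.out.println(fromBytes(encoded).equals("A\uE001"));
    }
}
```

A real converter would of course use mapping tables instead of a fixed offset, and would have to deal with multi-byte sequences. To make a charset available by name, the package also defines a provider mechanism (java.nio.charset.spi.CharsetProvider).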

The distinction between character sets and character encodings can cause
a lot of confusion. A helpful introduction is [2].

[1] http://java.sun.com/j2se/1.4.2/docs/api/java/nio/charset/Charset.html
[2] http://www.unicode.org/standard/principles.html


Best regards,

-- Sascha

Sascha Brawer, address@hidden, http://www.dandelis.ch/people/brawer/ 





