help-flex

RE: Flex and 32-bits characters


From: Hans Aberg
Subject: RE: Flex and 32-bits characters
Date: Mon, 26 Aug 2002 11:51:02 +0200

At 05:54 +0100 2002/08/26, Mark Weaver wrote:
>Other notes on unicode:
>
>- It is designed to be 16-bit (UTF-16).

What is this? Unicode proper does not prefer any particular encoding; it
merely assigns numbers to symbols.

>- No unicode characters are assigned by the unicode consortium that are
>outside the 16-bit range.  Currently they have assigned 48K characters and
>have space for up to 64K characters.

You must have an old Unicode version. Later versions have added a
significant number of characters outside the 0..2^16 range, including the
mathematical semantic styles.

>- Private use areas exist in unicode which are surrogate pairs and there may
>be up to 1 million of these.

What do you mean by "surrogate pairs" -- does that refer to a specific
Unicode encoding? Do you know which range these private numbers are in?

>- (Presumably) the other 1Mb surrogate pair space is for use by unicode.
>- Unicode itself will therefore never be more than the 21-bits that they
>suggest.  Something that was, would not in my view be unicode, nor is it
>likely to happen as they have carefully analysed the character set space
>requirements, then multiplied their requirements by a factor of 15 or so.
>Language does not develop as fast as computers!

My notes say that Unicode code points have the range U+0000..U+10FFFF, but
I do not know if the private symbols should fall within that range.

And Unicode is expected to never add any symbols outside that range.
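
For reference, a minimal sketch of how the surrogate-pair question relates
to the U+0000..U+10FFFF range, assuming the standard UTF-16 arithmetic and
the private-use ranges as published by Unicode. The function names are made
up for illustration and do not come from any library.

```python
MAX_CODE_POINT = 0x10FFFF  # highest code point; it fits in 21 bits

# Private-use ranges; all of them lie inside U+0000..U+10FFFF.
PRIVATE_USE_RANGES = [
    (0xE000, 0xF8FF),      # Private Use Area (inside the 16-bit range)
    (0xF0000, 0xFFFFD),    # Supplementary Private Use Area-A
    (0x100000, 0x10FFFD),  # Supplementary Private Use Area-B
]


def is_private_use(cp):
    """Return True if the code point lies in a private-use range."""
    return any(first <= cp <= last for first, last in PRIVATE_USE_RANGES)


def to_surrogates(cp):
    """Split a code point above U+FFFF into a (high, low) UTF-16 pair."""
    if not 0x10000 <= cp <= MAX_CODE_POINT:
        raise ValueError("not a supplementary code point")
    v = cp - 0x10000  # 20-bit value, 10 bits go into each surrogate
    return 0xD800 + (v >> 10), 0xDC00 + (v & 0x3FF)


def from_surrogates(high, low):
    """Combine a (high, low) UTF-16 surrogate pair into one code point."""
    if not (0xD800 <= high <= 0xDBFF and 0xDC00 <= low <= 0xDFFF):
        raise ValueError("not a valid surrogate pair")
    return 0x10000 + ((high - 0xD800) << 10) + (low - 0xDC00)
```

So a surrogate pair is a feature of the UTF-16 encoding only: two 16-bit
units that together stand for one code point above U+FFFF. The two
supplementary private-use ranges are reachable in UTF-16 only through such
pairs, which may be what the quoted remark refers to.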

>As regards handling of unicode character classes &c, the simplest way to
>implement this in my view would be to use ICU
>(http://oss.software.ibm.com/icu/).  ICU is open source, available for free
>commercial use and developed/supported largely by IBM.  IBM's OS products
>all (AFAIK) use UTF-16 internally where applicable, including Xalan/Xerces,
>and ICU.  Also this is the base character set for Java (sun), the only
>'wide' character set for Oracle (oracle corp).  So UTF-16 is /not/ limited
>simply to MS.

If you use 16-bit, I think you have to solve the problems with location
tracking on variable-width characters and C++.

And I would make sure that the 16-bit stuff isn't just a hangover from the
days when one thought most characters would fit within the 16-bit range.
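
To illustrate the location-tracking problem mentioned above, here is a
rough sketch, assuming the input is held as a list of 16-bit UTF-16 code
units. The helper names are hypothetical. A column counter that simply
counts units is off by one after every character outside the 16-bit range,
so the counter has to recognize surrogate pairs.

```python
import struct


def utf16_units(text):
    """Encode a string as a list of 16-bit UTF-16 code units."""
    data = text.encode("utf-16-le")
    return list(struct.unpack("<%dH" % (len(data) // 2), data))


def column_of(units, index):
    """Return the 1-based character column of the code unit at `index`,
    counting each surrogate pair as a single character."""
    column = 0
    i = 0
    while i <= index:
        column += 1
        is_pair = (
            0xD800 <= units[i] <= 0xDBFF
            and i + 1 < len(units)
            and 0xDC00 <= units[i + 1] <= 0xDFFF
        )
        i += 2 if is_pair else 1  # skip both halves of a pair together
    return column


# "a", MATHEMATICAL BOLD CAPITAL A (U+1D400), "b": four units, three columns.
units = utf16_units("a\U0001D400b")
```

With fixed-width units of 21 bits or more, the column is just the index
plus one and no such scan is needed.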

>Dealing with multibyte encodings (which is rare in Unicode, as you have to
>be doing something off the wall with it) is a fact of life.

You mean, like using a math variable?

-- The problem is that you assume that multibyte characters in UTF-16 are
going to be extremely rare, so that it does not make any difference if the
program crashes. That's probably because you haven't checked the later
Unicode versions.

They probably figured that not everything needed would fit into the 16-bit
range, and had to break it. Once that limit has been broken, there is no
reason not to add all characters that fit into the new range.

>  So I would (and
>do) plump for UTF-16.  But I don't have terribly strong feelings on this.
>Where you run into trouble is say in the CRT where if you provide wide
>versions of isalpha() and so on, then you have to worry if those take a
>16-bit wchar_t (which for MS is true).  That means that if (when, but a
>while yet I would think) unicode starts to assign from the larger space they
>have allocated, things will start to be broken.  Plus there is no way of
>loading the character defns into that particular CRT.  ICU of course takes
>care of this by giving you u_isalpha that takes 32-bit.

So I think that once one starts to think about all those details, it is
better to use UTF-n, n >= 21.

Also, note that in the group comp.std.c++ they have mentioned some
platforms with strange paddings. So I would not assume that it is UTF-32
unless needed, but UTF-n, n >= 21 for some suitable n, plus padding. In a
C/C++ compiler, this may make a difference.

>What this means for flex is that it's going to start to have to be clever
>about the way it characterises characters.  It's no longer sufficient to get
>the user to specify the character classes, it is going to have to use (or
>implement) the services of something that knows about the semantics of
>unicode characters.  This isn't trivial, so I would strongly suggest ICU for
>this purpose.  Relying on base OS services certainly isn't a good move.  I
>don't know what you would call this new version of flex; but it would be
>quite a bit different (and more heavy weight) from the traditional ASCII
>based version.

Unicode publishes a text file where those properties are described in a
simple format. So one idea that comes to my mind is to design a program
that reads such a file and compiles it into a format that Flex can read.
This way, it's simple to accommodate Unicode updates.
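
A rough sketch of such a program, assuming the input follows the
semicolon-separated layout of UnicodeData.txt (code point in hex, name,
general category, ...). The output format is invented here purely for
illustration, since Flex defines no such format; entries that mark the
first and last code point of a large range are not handled.

```python
def parse_categories(lines):
    """Map each general category (e.g. 'Lu') to a sorted list of code points."""
    categories = {}
    for line in lines:
        line = line.strip()
        if not line:
            continue
        fields = line.split(";")
        code_point = int(fields[0], 16)  # field 0: code point in hex
        category = fields[2]             # field 2: general category
        categories.setdefault(category, []).append(code_point)
    for code_points in categories.values():
        code_points.sort()
    return categories


def to_ranges(code_points):
    """Collapse sorted code points into inclusive (first, last) ranges."""
    ranges = []
    for cp in code_points:
        if ranges and cp == ranges[-1][1] + 1:
            ranges[-1] = (ranges[-1][0], cp)  # extend the current range
        else:
            ranges.append((cp, cp))           # start a new range
    return ranges


def emit_definitions(categories):
    """Format one line per category: name followed by its ranges
    (hypothetical output format)."""
    out = []
    for name in sorted(categories):
        body = " ".join("%04X-%04X" % r for r in to_ranges(categories[name]))
        out.append("%s %s" % (name, body))
    return out


SAMPLE = [
    "0041;LATIN CAPITAL LETTER A;Lu;0;L;;;;;N;;;;0061;",
    "0042;LATIN CAPITAL LETTER B;Lu;0;L;;;;;N;;;;0062;",
    "0061;LATIN SMALL LETTER A;Ll;0;L;;;;;N;;;0041;;0041",
    "1D400;MATHEMATICAL BOLD CAPITAL A;Lu;0;L;<font> 0041;;;;N;;;;;",
]

for definition in emit_definitions(parse_categories(SAMPLE)):
    print(definition)
```

Rerunning the program on a newer data file would regenerate the tables
without any change to the scanner generator itself.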

>Basically, what we need is examples of how it would be used in order to give
>a sensible set of requirements, and then to proceed from there.

So I think Unicode support must be developed over some period of time, so
that one can see what Unicode practices eventually emerge.

  Hans Aberg





