help-flex

RE: Flex and 32-bits characters


From: Mark Weaver
Subject: RE: Flex and 32-bits characters
Date: Mon, 26 Aug 2002 16:49:29 +0200

> At 05:54 +0100 2002/08/26, Mark Weaver wrote:
> >Other notes on unicode:
> >
> >- It is designed to be 16-bit (UTF-16).
>
> What is this? Unicode proper does not have any preference for a particular
> encoding, but merely assign numbers to symbols.

That is taken literally from "The Unicode Standard":
http://www.unicode.org

> >- No unicode characters are assigned by the unicode consortium that are
> >outside the 16-bit range.  Currently they have assign 48K characters and
> >have space for up to 64K characters.
>
> You must have an old Unicode version. In the later versions, one has added
> a significant amount of characters outside the 0..2^16 range, including the
> mathematical semantic styles.

Mathematical alphanumeric symbols U+1D400-U+1D7FF presumably :)  You're
correct; I was wrong to assert that Unicode has not assigned any
characters outside of the 16-bit range.  However:

"For the vast majority of computer text in the world, UTF-16 code units
correspond to code points, since the frequency of characters with code
points above U+FFFF is and will remain vanishingly small."
from http://www.unicode.org/unicode/reports/tr19/
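
To make the code unit / code point distinction concrete, here is a minimal sketch in C that counts code points in a UTF-16 buffer.  The function name utf16_count_code_points is made up for illustration and is not part of flex or any library.  For text that stays within U+0000-U+FFFF the result equals the number of code units; it differs only when surrogate pairs appear.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Count code points in a UTF-16 buffer of n code units.
   A high surrogate (D800-DBFF) immediately followed by a low surrogate
   (DC00-DFFF) counts as one code point; every other unit, including an
   unpaired surrogate, counts as one.  Hypothetical helper. */
size_t utf16_count_code_points(const uint16_t *s, size_t n)
{
    size_t count = 0;
    size_t i = 0;

    assert(s != NULL || n == 0);
    while (i < n) {
        if (s[i] >= 0xD800 && s[i] <= 0xDBFF &&
            i + 1 < n && s[i + 1] >= 0xDC00 && s[i + 1] <= 0xDFFF) {
            i += 2;   /* well-formed surrogate pair: two units, one point */
        } else {
            i += 1;   /* BMP code point or unpaired surrogate */
        }
        count++;
    }
    return count;
}
```

A buffer holding only BMP characters gives count == n, which is the common case the quotation describes.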

> >- Private use areas exist in unicode which are surrogate pairs and there may
> >be up to 1 million of these.
>
> What do you mean by "surrogate pairs" -- does that refer to a specific
> Unicode encoding. Do you know which range these private numbers are in?

Yes, to UTF-16.  And yes, the range is well defined.  You can look it up in
the standard ;)
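
For reference, the surrogate pair arithmetic in UTF-16 can be sketched as follows.  The function names are hypothetical; the constants (0xD800, 0xDC00, 0x10000 and the 10-bit split) are the ones UTF-16 defines.

```c
#include <assert.h>
#include <stdint.h>

/* Split a supplementary code point (U+10000..U+10FFFF) into a UTF-16
   surrogate pair.  Hypothetical helper for illustration. */
void utf16_encode_pair(uint32_t cp, uint16_t *high, uint16_t *low)
{
    assert(cp >= 0x10000 && cp <= 0x10FFFF);
    uint32_t v = cp - 0x10000;                  /* 20-bit offset */
    *high = (uint16_t)(0xD800 + (v >> 10));     /* top 10 bits */
    *low  = (uint16_t)(0xDC00 + (v & 0x3FF));   /* bottom 10 bits */
}

/* Recombine a high/low surrogate pair into a code point. */
uint32_t utf16_decode_pair(uint16_t high, uint16_t low)
{
    assert(high >= 0xD800 && high <= 0xDBFF);
    assert(low >= 0xDC00 && low <= 0xDFFF);
    return 0x10000 + ((uint32_t)(high - 0xD800) << 10)
                   + (uint32_t)(low - 0xDC00);
}
```

With these, U+1D400 (the first mathematical alphanumeric symbol) becomes the pair D835 DC00, and decoding that pair gives U+1D400 back.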

> My notes say that Unicode code points have range U+0000..10FFFF; but I do
> not know if the private symbols should fall within that range.

Yes, they have to.  U+F0000-U+FFFFD and U+100000-U+10FFFD are the private
code ranges (last two code points reserved for non-characters apparently).
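
As a sketch of what a range test for these private code points might look like in C (the function names are made up; the ranges are the ones given above):

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Plane number (0..16) of a valid code point. */
static uint32_t plane_of(uint32_t cp)
{
    assert(cp <= 0x10FFFF);
    return cp >> 16;
}

/* True for U+F0000-U+FFFFD and U+100000-U+10FFFD, i.e. planes 15 and 16
   minus the last two code points of each plane.  Hypothetical helper. */
bool is_supplementary_private_use(uint32_t cp)
{
    if (cp > 0x10FFFF)
        return false;               /* outside the Unicode code space */
    uint32_t plane = plane_of(cp);
    return (plane == 15 || plane == 16) && (cp & 0xFFFF) <= 0xFFFD;
}

/* True for the last two code points of any plane (xxFFFE and xxFFFF),
   which are reserved as non-characters. */
bool is_plane_end_noncharacter(uint32_t cp)
{
    return cp <= 0x10FFFF && (cp & 0xFFFF) >= 0xFFFE;
}
```

Each of the two ranges holds 65,534 code points, so this covers 131,068 private code points in total.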

> And Unicode is expected to never add any symbols outside that range.

Correct.

> If you use 16-bit, I think one has to solve the problems with location
> tracking on variable width characters and C++.
> And I would make sure that the 16-bit stuff isn't just some hangover from
> the days one thought most characters would be able to fit within the 16-bit
> range.

True, see my other reply on this.  Basically, UTF-16 or UTF-32 puts you in
the same hole.

>
> >Dealing with multibyte encodings (which is rare in Unicode, as you have to
> >be doing something off the wall with it) is a fact of life.
>
> You mean, like using a math variable?
>
> -- The problem is that you assume that multibyte characters in UTF-16 are
> going be extremely rare, so it does not make any difference if the program
> crashes. That's probably because you haven't checked the later Unicode
> versions.
>
Not true!  I was suggesting that the /majority/ of the time you will be
dealing with code points in the range U+0000-U+FFFF, which I still think is
true.  I wasn't suggesting you should write code that falls over if it
encounters something legitimate that it doesn't expect ;)
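
One way to avoid falling over, sketched in C under the assumption that the input is an array of 16-bit code units: take the fast path for U+0000-U+FFFF, combine well-formed surrogate pairs, and substitute U+FFFD for anything malformed.  The name utf16_next and the choice of U+FFFD as the substitute are illustrative, not something flex provides.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define REPLACEMENT_CHAR 0xFFFDu   /* U+FFFD REPLACEMENT CHARACTER */

/* Decode the code point starting at s[*pos] and advance *pos past it.
   Requires *pos < n.  Unpaired surrogates yield REPLACEMENT_CHAR and
   consume exactly one code unit, so the caller always makes progress. */
uint32_t utf16_next(const uint16_t *s, size_t n, size_t *pos)
{
    assert(s != NULL && pos != NULL && *pos < n);

    uint16_t u = s[(*pos)++];
    if (u < 0xD800 || u > 0xDFFF)
        return u;                           /* common case: BMP code point */

    if (u <= 0xDBFF && *pos < n) {          /* high surrogate, more input */
        uint16_t next = s[*pos];
        if (next >= 0xDC00 && next <= 0xDFFF) {
            (*pos)++;
            return 0x10000 + ((uint32_t)(u - 0xD800) << 10)
                           + (uint32_t)(next - 0xDC00);
        }
    }
    return REPLACEMENT_CHAR;                /* unpaired surrogate */
}
```

The first test handles the majority case with a single comparison pair, so supporting the rare supplementary characters costs little on ordinary text.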

> Also, note that in the group comp.std.c++ they have mentioned some
> platforms with strange paddings. So I would not assume that it is UTF-32
> unless needed, but UTF-n, n >= 21 for some suitable n, plus padding. In a
> C/C++ compiler, this may make a difference.

OK, you've lost me there.  Are you suggesting that the input stream would be
weirdly padded?  If so, that is of little consequence to flex, if it
is going to be taking its input in UTF-{pick a number}.

> Unicode publishes a text file where those properties are described in a
> simple format. So one idea that comes to my mind is to design a program
> able to read such a file and then compiles it into a format that Flex can
> read. This way, it's simple to accommodate for Unicode updates.

True, true.  But I'm lazy and tend to try to build on software provided by
others if they are willing to do that for me ;)

> So I think Unicode support must be developed over some period of time, so
> one can see what Unicode practises eventually emerge.

True enough, but it is always useful to have the input of others to see what
kind of things they want to do, so that any work on flex could be done with a
view to getting something that would work for everyone.

Thanks,

Mark