help-flex

RE: Flex and 32-bits characters


From: Hans Aberg
Subject: RE: Flex and 32-bits characters
Date: Mon, 26 Aug 2002 16:49:34 +0200

At 13:02 +0100 2002/08/26, Mark Weaver wrote:
>> >- It is designed to be 16-bit (UTF-16).
>>
>> What is this? Unicode proper does not have any preference for a particular
>> encoding, but merely assign numbers to symbols.
>
>literally from "The Unicode Standard"
>http://www.unicode.org

Which version? Have you checked out some of the later references:
  Unicode 3.1: http://www.unicode.org/unicode/reports/tr27/
  Unicode 3.2:
    http://www.unicode.org/Public/BETA/Unicode3.2/NamesList-3.2.0d4.txt
    http://www.unicode.org/versions/beta.html
    http://www.unicode.org/Public/BETA/Unicode3.2/UnicodeData-3.2.0d1.html
    http://www.unicode.org/unicode/reports/tr25/

>Mathematical alphanumeric symbols U+1D400-U+1D7FF presumably :)  You're
>correct, I'm wrong in my assertion that Unicode has not assigned any
>characters outside of the 16-bit range.  However:
>
>"For the vast majority of computer text in the world, UTF-16 code units
>correspond to code points, since the frequency of characters with code
>points above U+FFFF is and will remain vanishingly small."
>from http://www.unicode.org/unicode/reports/tr19/

Unless this is an old report, referring to Unicode 3.0 and older.

>> My notes say that Unicode code points have range U+0000..10FFFF; but I do
>> not know if the private symbols should fall within that range.
>
>Yes, they have to.  U+F0000-U+FFFFD and U+100000-U+10FFFD are the private
>code ranges (last two code points reserved for non-characters apparently).

Good, so Unicode code points fit in 21 bits and will never exceed that.
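
As a minimal sketch of the ranges quoted above (the function names are
made up for illustration and come from no library), the checks could be
written in C like this:

```c
#include <assert.h>
#include <stdint.h>

/* Largest Unicode code point; 0x10FFFF needs 21 bits. */
#define UNICODE_MAX 0x10FFFFu

/* Hypothetical helper: is cp inside the Unicode code space? */
int is_code_point(uint_least32_t cp)
{
    return cp <= UNICODE_MAX;
}

/* Hypothetical helper: is cp in one of the two private-use ranges
   quoted above (U+F0000-U+FFFFD and U+100000-U+10FFFD)? */
int is_supplementary_private_use(uint_least32_t cp)
{
    assert(is_code_point(cp));
    return (cp >= 0xF0000u && cp <= 0xFFFFDu) ||
           (cp >= 0x100000u && cp <= 0x10FFFDu);
}
```

The last two code points of each range are excluded, matching the
remark that they are reserved as non-characters.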

>> If you use 16-bit, I think one has to solve the problems with location
>> tracking on variable width characters and C++.
>> And I would make sure that the 16-bit stuff isn't just some hangover from
>> the days one thought most characters would be able to fit within
>> the 16-bit
>> range.
>
>True, see my other reply on this.  Basically, UTF-16 or UTF-32 puts you in
>the same hole.

I don't see this: UTF-32 is fixed width, which UTF-16 isn't, and that may
make a big difference for the internal implementation.
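
To illustrate the difference, here is a rough sketch of what reading one
character from UTF-16 involves. The function name and its return
convention are my own assumptions, not anything in Flex. With UTF-32 the
same step is a plain array access.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Decode one code point from the UTF-16 code units s[0..n-1].
   Returns the number of units consumed (1 or 2), or 0 if the input
   is empty, truncated, or contains an unpaired surrogate. */
size_t utf16_decode(const uint16_t *s, size_t n, uint32_t *cp)
{
    assert(s != NULL && cp != NULL);
    if (n == 0)
        return 0;
    uint16_t hi = s[0];
    if (hi < 0xD800 || hi > 0xDFFF) {   /* ordinary 16-bit character */
        *cp = hi;
        return 1;
    }
    if (hi > 0xDBFF || n < 2)           /* lone low surrogate, or cut off */
        return 0;
    uint16_t lo = s[1];
    if (lo < 0xDC00 || lo > 0xDFFF)     /* high surrogate without a partner */
        return 0;
    *cp = 0x10000u + (((uint32_t)(hi - 0xD800) << 10) |
                      (uint32_t)(lo - 0xDC00));
    return 2;
}
```

For example, U+1D400 from the mathematical block is stored as the two
units 0xD835 0xDC00, so the scanner has to look at both before it knows
which character it has.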

>> -- The problem is that you assume that multibyte characters in UTF-16 are
>> going be extremely rare, so it does not make any difference if the program
>> crashes. That's probably because you haven't checked the later Unicode
>> versions.
>>
>Not true!  I was suggesting that the /majority/ of the time you will be
>dealing with code points in the range U+0000-U+FFFF.  Which i still think is
>true.  I wasn't suggesting you should write code that falls over if it
>encounters something legitimate that it doesn't expect ;)

Well, the main point is that when the other characters appear, the system
does not break. Use UTF-8 if you think that is implementable. This ought to
be easier with current Flex, as the tables need not be enlarged. Then one
can see what problems one encounters if any.
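
The reason I expect the tables to stay the same size is that every UTF-8
code unit is still a byte, so a scanner indexed by 0..255 never sees a
larger value. A minimal sketch of an encoder (the name utf8_encode is
hypothetical) shows this:

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Encode one code point as UTF-8 into out[0..3].
   Returns the number of bytes written (1 to 4), or 0 if cp is a
   surrogate or lies above U+10FFFF. */
size_t utf8_encode(uint32_t cp, unsigned char out[4])
{
    assert(out != NULL);
    if (cp > 0x10FFFFu || (cp >= 0xD800u && cp <= 0xDFFFu))
        return 0;
    if (cp < 0x80u) {
        out[0] = (unsigned char)cp;
        return 1;
    }
    if (cp < 0x800u) {
        out[0] = (unsigned char)(0xC0u | (cp >> 6));
        out[1] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 2;
    }
    if (cp < 0x10000u) {
        out[0] = (unsigned char)(0xE0u | (cp >> 12));
        out[1] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
        out[2] = (unsigned char)(0x80u | (cp & 0x3Fu));
        return 3;
    }
    out[0] = (unsigned char)(0xF0u | (cp >> 18));
    out[1] = (unsigned char)(0x80u | ((cp >> 12) & 0x3Fu));
    out[2] = (unsigned char)(0x80u | ((cp >> 6) & 0x3Fu));
    out[3] = (unsigned char)(0x80u | (cp & 0x3Fu));
    return 4;
}
```

The price is that one character becomes a sequence of one to four
bytes, so the patterns in the grammar would have to be expressed over
byte sequences rather than over single characters.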

>> Also, note that in the group comp.std.c++ they have mentioned some
>> platforms with strange paddings. So I would not assume that it is UTF-32
>> unless needed, but UTF-n, n >= 21 for some suitable n, plus padding. In a
>> C/C++ compiler, this may make a difference.
>
>OK, you've lost me there.  You are suggesting that the input stream would be
>weirdly padded?  In which case that is of little consequence to flex, if it
>is going to be taking its input in UTF-{pick a number}.

I do not know if that will happen in the case of integral types or
character types, but it does happen for floating point numbers, and it is
also legal for integral types. -- The reason is archaic: In the old days,
bytes could have, say, 9 bits and words 18 bits, as on the PDP 15/40 used at
EMS (Electronic Music Studio) in Stockholm in the early 1970s.

That's why I suggest that one settles for, say, UTF-21 plus padding instead
of just saying it must be UTF-32: Most likely, the padding will be such that
one ends up at 32 bits, because that is likely to be the chunk that the CPU
is going to handle. But some computers may use, say, 64 bits, because that
is what their CPUs are optimized around.

This is also the reason that I am sceptical about UTF-16 for the internals
of the program: Most CPUs are at least 32 bit, and cutting characters into
16-bit chunks is going to give a performance penalty. -- Perhaps when MS
started with 16-bit Unicode, 16-bit CPUs were still common. But they
aren't nowadays on the level of PCs and up (where one might use Flex).

Note that we are only speaking about what is going to be used by the Flex
lexer internally: By suitable code converters, one still can use UTF-16
externally, if one so wants.

Alternatives to n = 16, 32 might be UTF-20, because it will take some time
before the 21st bit is hit, and UTF-24, which is three bytes per
character. For the latter, one then still is using fixed width characters.
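
A sketch of the three-bytes-per-character idea, assuming a big-endian
byte order within each character (the order and the names u24_put and
u24_get are my own choices for the example):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Store cp as character number i in a packed 3-byte-per-character
   buffer, most significant byte first. */
void u24_put(unsigned char *buf, size_t i, uint32_t cp)
{
    assert(buf != NULL && cp <= 0xFFFFFFu);
    unsigned char *p = buf + 3 * i;
    p[0] = (unsigned char)(cp >> 16);
    p[1] = (unsigned char)((cp >> 8) & 0xFFu);
    p[2] = (unsigned char)(cp & 0xFFu);
}

/* Fetch character number i from the packed buffer. */
uint32_t u24_get(const unsigned char *buf, size_t i)
{
    assert(buf != NULL);
    const unsigned char *p = buf + 3 * i;
    return ((uint32_t)p[0] << 16) | ((uint32_t)p[1] << 8) | (uint32_t)p[2];
}
```

Character i always sits at byte offset 3*i, so random access stays
simple, at the cost of assembling each character from three bytes.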

I think the main divide will be between using fixed width characters and
variable width characters. -- The latter will require some specialty
techniques in order to keep track of character boundaries.

But location tracking should perhaps give both character count and stream
positions.
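
For a variable-width encoding, keeping both numbers could look roughly
like this. The sketch assumes well-formed UTF-8 input, and the struct
and function names are hypothetical:

```c
#include <assert.h>
#include <stddef.h>

/* Position in the input: raw stream offset and character count. */
struct location {
    size_t bytes;   /* bytes consumed so far */
    size_t chars;   /* characters consumed so far */
};

/* Advance loc over the n bytes at s. A character is counted at each
   byte that is not a UTF-8 continuation byte (binary 10xxxxxx). */
void location_advance(struct location *loc, const unsigned char *s, size_t n)
{
    assert(loc != NULL && (s != NULL || n == 0));
    for (size_t i = 0; i < n; i++) {
        loc->bytes++;
        if ((s[i] & 0xC0u) != 0x80u)
            loc->chars++;
    }
}
```

With fixed-width characters the two counts differ only by a constant
factor, so only one of them would need to be stored.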

So if somebody wants to implement UTF-16, that's OK with me, but I would
settle for UTF-32. But I leave it up to the guy doing the actual
implementation to decide: When one sits down with the code, I figure one
will see what is best.

  Hans Aberg





