RE: Flex and 32-bits characters


From: Hans Aberg
Subject: RE: Flex and 32-bits characters
Date: Sat, 24 Aug 2002 19:29:36 +0200

At 13:42 +0100 2002/08/24, Mark Weaver wrote:
>UCS-4 is a 4-byte representation of ISO-10646-1, which essentially equals
>unicode.  There is some historical difference between ISO-10646-1 and
>unicode, but they are (basically) the same thing.

I have a vague memory of it. So it is Unicode one should use then.

>  Unicode is a 21 bit
>character set, which guarantees (presumably by having plenty of spare code
>points) never to expand upon that range.

Even though Unicode-defined characters proper will not extend beyond 21
bits, I think some numbers are reserved for "user characters", so that
Unicode may in reality need 24 bits or so. (Such a "user character"
range will be important in typesetting, because one cannot expect Unicode
to fully cover all symbols.)
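
For reference, the "user character" ranges that Unicode itself reserves
are the Private Use Areas. A small illustration (the ranges are Unicode's
own; the function name is mine):

    /* Test whether a code point falls in one of Unicode's Private Use
     * Areas: U+E000..U+F8FF in the BMP, plus planes 15 and 16. */
    #include <stdint.h>

    static int is_private_use(uint32_t cp)
    {
        return (cp >= 0xE000   && cp <= 0xF8FF)     /* BMP PUA  */
            || (cp >= 0xF0000  && cp <= 0xFFFFD)    /* plane 15 */
            || (cp >= 0x100000 && cp <= 0x10FFFD);  /* plane 16 */
    }

Whether these areas suffice for typesetting symbols is another matter.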

>  Then there are 5 basic ways of
>representing the unicode character set:
>
>UTF-8, which uses 8 bit characters and stores a code point with 1-4 bytes.
>UTF-16 (little or big endian), which uses 16 bit integers, and stores a code
>point with 1 or 2 integers
>UTF-32 (LE/BE), which uses 32 bit integers and stores a code point with 1
>integer

I think you should put UTF-n, where n >= 24. The thing is that in an
implementation like Bison, one may want to reserve the Unicode character
numbers and on top of that have token numbers. (I am not sure there ever
will be such an implementation, as there will be problems with table
sizes.) Therefore, some padding bits should be reserved for other uses.
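
For instance (a sketch of my own, not how Bison actually numbers its
tokens), one could start the token numbers just past the character range:

    #include <stdint.h>

    #define UNICODE_MAX 0x10FFFFUL          /* highest Unicode code point */

    enum token {
        TOK_IDENTIFIER = UNICODE_MAX + 1,   /* tokens live above the characters */
        TOK_NUMBER,
        TOK_STRING
    };

    /* Values up to UNICODE_MAX are literal characters; larger values
     * are token numbers. */
    static int is_literal_char(uint32_t v) { return v <= UNICODE_MAX; }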

>Most processing schemes (ref. ICU, windows) choose to use UTF-16 internally
>as it provides the nicest trade-off between memory and performance.

From discussions in the comp.std.c++ newsgroup I infer that when people say
"most", they mean MSOS (Microsoft OS), because that is what they use.
Microsoft probably did that because Unicode fit into 16 bits when they
decided on it. Linux, I am told, is not doing that.

So I think people will settle for 32 bits, at least internally in programs.
It is too difficult to handle variable-width characters inside a program,
and it is probably not very efficient either, as most CPUs (PCs and up)
are at least 32-bit.

So I suspect it is best to use 32-bit integral types internally in
programs to handle Unicode, with code converters for the externals.
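
As a sketch of such a converter (my own, and it assumes well-formed input;
a real one must also reject overlong encodings, surrogates, and values
above 0x10FFFF):

    /* Decode one UTF-8 sequence starting at *s into a 32-bit code
     * point, advancing *s past it. */
    #include <stdint.h>

    static uint32_t utf8_decode(const unsigned char **s)
    {
        const unsigned char *p = *s;
        uint32_t cp;
        int extra;

        if (p[0] < 0x80)      { cp = p[0];        extra = 0; }  /* 1 byte  */
        else if (p[0] < 0xE0) { cp = p[0] & 0x1F; extra = 1; }  /* 2 bytes */
        else if (p[0] < 0xF0) { cp = p[0] & 0x0F; extra = 2; }  /* 3 bytes */
        else                  { cp = p[0] & 0x07; extra = 3; }  /* 4 bytes */

        p++;
        while (extra-- > 0) {
            cp = (cp << 6) | (*p & 0x3F);   /* take 6 payload bits */
            p++;
        }
        *s = p;
        return cp;
    }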

I think intuition may play one a trick here, because 32 bits for every
character sounds like a lot. But memory has lately been doubling about
every year, so in reality it is a smaller cost than 8 bits per character
was only half a decade ago!

>The patch just defines YY_CHAR as wchar_t, which (cough) causes somewhat of
>an explosion in the table size, as you would predict.  Flex spits out its
>handy equivalence class table, which it indexes directly by character to
>produce an int.  With a plain old char, this table is a modest 1K, with a
>wchar_t of 16-bits, it's 256K, and if your platform (which some do) defines
>wchar_t as 32-bits, well...

As you note, there is no compiler-independent definition of the size of
wchar_t in C/C++. Nor is there a good way to find a 32-bit integral type.

One knows that a long will hold 2^32 (signed) integer values, even though
that does not say anything about padding. So one can define macros that
check wchar_t and various smaller integral types, moving towards long,
until one finds a type that holds 2^32 integer values.
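
A sketch of that idea using <limits.h> (the typedef name is mine; the
wchar_t test assumes an Amendment 1 compiler that provides WCHAR_MAX in
<wchar.h>):

    #include <limits.h>
    #include <wchar.h>      /* WCHAR_MAX, since Amendment 1 (C95) */

    #if defined(WCHAR_MAX) && WCHAR_MAX >= 0xFFFFFFFFUL
    typedef wchar_t       ucs4_t;   /* wchar_t is wide enough here       */
    #elif UINT_MAX >= 0xFFFFFFFFUL
    typedef unsigned int  ucs4_t;   /* int already holds 2^32 values     */
    #else
    typedef unsigned long ucs4_t;   /* long is guaranteed at least 32 bits */
    #endif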

>The main questions to be addressed as I see it are:
>
>1) What representation of unicode are you expecting to lex?  Are we
>expecting flex to deal with this?  On this, I would pick one (say UTF-16
>default platform endianness), and leave it to the caller to do the
>relatively trivial conversion on the input stream.

I will give my personal opinions (not knowing what the Flex developers
eventually will do):

First, variable-width characters in the Flex internals will probably be
hell, as it is difficult to keep track of locations and relate them to file
positions. So that suggests UTF-n, n >= 24, in the Flex lexer internals,
with padding to the nearest integral type (most likely 32 bits).

>  I don't see it as an
>issue that some characters are represented by pairs of 16-bit integers; the
>scanner would just treat these as two separate characters to match, with the
>correct result ensuing.

Location tracking becomes difficult, as one does not know whether one
should count characters or bytes. There may be other problems as well. For
example, C++ does not seem to be built for variable-width characters at
all.
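
For example, with UTF-16 one character may be two 16-bit units, so "column
42" is ambiguous. Combining such a surrogate pair looks like this (a
sketch of my own, assuming a well-formed pair; the names are mine):

    #include <stdint.h>

    static int is_lead(uint16_t u)  { return u >= 0xD800 && u <= 0xDBFF; }
    static int is_trail(uint16_t u) { return u >= 0xDC00 && u <= 0xDFFF; }

    /* Combine a lead/trail surrogate pair into one code point. */
    static uint32_t combine_surrogates(uint16_t lead, uint16_t trail)
    {
        return 0x10000 + (((uint32_t)(lead - 0xD800) << 10) | (trail - 0xDC00));
    }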

>  The same applies to UTF-8.  In fact, what you are
>doing with accepting UTF-16 as an input is simply a speed optimisation (at
>the expense of memory); you save the scanner dealing with single bytes when
>you know that you never really have any - hence the DFA is smaller/faster.

There was a post lately suggesting that an NFA that converts to DFA
states as needed and caches them is faster than a full DFA. The reason
might be that large tables are slow to look up relative to a set of local
CPU computations. (The bottlenecks might be large table jumps and reading
memory into the CPU.)
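
Something like the following toy sketch (my own, not the posted code) is
what I have in mind. NFA state sets are bitmasks here, so it only handles
NFAs of up to 32 states, and the supplied nfa_step is hypothetical:

    #include <stdint.h>
    #include <string.h>

    #define MAX_DFA  256
    #define ALPHABET 256

    typedef uint32_t nfa_set;            /* bitmask of up to 32 NFA states */

    /* Supplied by the user: the set of NFA states reachable from
     * 'states' on byte 'c', with epsilon closure already applied. */
    extern nfa_set nfa_step(nfa_set states, unsigned char c);

    static nfa_set dfa_key[MAX_DFA];             /* NFA set behind each DFA state */
    static int     dfa_next[MAX_DFA][ALPHABET];  /* -1 = not yet computed */
    static int     dfa_count = 0;

    /* Find the cached DFA state for an NFA set, creating it if needed.
     * (A real implementation would hash, and handle cache overflow.) */
    static int dfa_state_for(nfa_set s)
    {
        int i;
        for (i = 0; i < dfa_count; i++)
            if (dfa_key[i] == s)
                return i;
        dfa_key[dfa_count] = s;
        memset(dfa_next[dfa_count], -1, sizeof dfa_next[dfa_count]);
        return dfa_count++;
    }

    /* Transition: computed from the NFA the first time, cached after. */
    static int dfa_move(int d, unsigned char c)
    {
        if (dfa_next[d][c] < 0)
            dfa_next[d][c] = dfa_state_for(nfa_step(dfa_key[d], c));
        return dfa_next[d][c];
    }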

>2) Flex itself needs to have a unicode input for this to be truly useful.
>As it stands, it arbitrarily widens characters, which works for ASCII but is
>unhelpful for anything else.  If you want to parse something that was
>outside the ASCII range then you'd need flex to accept unicode (of some
>form) input.  (Of course, flex scans its input with a scanner it generates,
>so this wouldn't be too hard once you'd found an acceptable way of doing
>so).
>
>3) Some way of solving the memory explosion problem that results from having
>large ec tables.  More details on your NFA->DFA algorithm?

This is described in the book by Aho, Sethi & Ullman, "Compilers...", sec.
3.6, p. 117. One can do this for general grammars and pushdown automata as
well; see the Parsing Techniques book:
    http://www.cs.vu.nl/~dick/PTAPG.html

The Aho et al. book says at the end of sec. 3.7, p. 128 that this technique
provides the best combined time/space complexity.

>  Is this what was
>being discussed earlier, when the great Mr Paxson popped in?  (I was keeping
>a lazy eye on it).

He mentioned that he experimented with something like that, and noted that
curiously enough the technique was faster than a full static DFA.

Since the NFA tables are expected to be small, this might be worth
investigating.

Perhaps it does not actually solve the problem with large Unicode tables,
because the DFA states must be stored somehow.

  Hans Aberg





