RE: Flex and 32-bits characters


From: Mark Weaver
Subject: RE: Flex and 32-bits characters
Date: Mon, 26 Aug 2002 05:54:56 +0100

Actually I sat down and thought about this, and did a little research,
basically from the point of view of how you would want to use flex to
tokenise Unicode streams.  Let's take a hypothetical example: say I'm a
database vendor and want to allow my customers to specify table names
outside the restrictive ASCII set.

So now maybe I'm matching names with an expression like:

[A-Za-z]+[A-Za-z0-9_]*

How do I say that for a Unicode flex?  Well, what I really mean is "alpha"
followed by "alpha/digit/underscore".  How do I express the notion of alpha
to flex?  Certainly a useful Unicode flex would have this feature built in.
The notion of alpha is available, by the way, because character classes are
defined in Unicode (and a character can belong to more than one functional
group).  So you are likely to want to match arbitrary subsets of these
classes.
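
Flex already understands POSIX classes inside brackets for the ASCII world;
one could imagine a Unicode-aware flex accepting the same spelling but
backing it with the Unicode property tables, so the name rule above might
be written:

    [[:alpha:]][[:alpha:][:digit:]_]*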

Then the rest proceeds as normal: simply translate your tokens
(SELECT/INSERT/...) into the local language, and spit out a parser.  I would
be interested in other uses of flex that are off the wall compared to this
traditional lexing of a computer(ish) language and that may not fit this set
of requirements.
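
For instance, in a Unicode-capable flex the localised keywords would just be
ordinary literal rules; a hypothetical sketch (the token names and the
French spellings are made up for illustration):

    %%
    "SELECT"|"CHOISIR"      return TOK_SELECT;
    "INSERT"|"INSÉRER"      return TOK_INSERT;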

Other notes on Unicode:

- It was designed to be 16-bit (the basis of UTF-16).
- The Unicode consortium has so far assigned characters almost exclusively
inside the 16-bit range.  Currently they have assigned around 48K characters
there and have space for up to 64K.
- Private use areas also exist in the space reached via surrogate pairs,
which holds roughly 1 million code points in total.
- (Presumably) the rest of that surrogate-pair space is reserved for future
assignment by Unicode.
- Unicode itself will therefore never be more than the 21 bits that they
suggest.  Something that was would not, in my view, be Unicode; nor is it
likely to happen, as they have carefully analysed the character set space
requirements and then multiplied them by a factor of 15 or so.  Language
does not develop as fast as computers!

As regards handling of Unicode character classes &c, the simplest way to
implement this in my view would be to use ICU
(http://oss.software.ibm.com/icu/).  ICU is open source, available for free
commercial use, and developed/supported largely by IBM.  IBM's OSS products
all (AFAIK) use UTF-16 internally where applicable, including Xalan/Xerces
and ICU itself.  It is also the base character set for Java (Sun) and the
only 'wide' character set for Oracle (Oracle Corp).  So UTF-16 is /not/
limited simply to MS.

Dealing with multi-unit encodings is a fact of life (though it is rare in
UTF-16, as you have to be doing something off the wall to run into surrogate
pairs).  So I would (and do) plump for UTF-16.  But I don't have terribly
strong feelings on this.  Where you run into trouble is, say, in the CRT:
if you provide wide versions of isalpha() and so on, then you have to worry
whether those take a 16-bit wchar_t (which for MS is true).  That means that
if (when, but a while yet I would think) Unicode starts to assign from the
larger space they have allocated, things will start to be broken.  Plus
there is no way of loading the character definitions into that particular
CRT.  ICU of course takes care of this by giving you u_isalpha(), which
takes a 32-bit code point.
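
A minimal sketch of what that looks like (u_isalpha/u_isdigit are the real
ICU property functions; the identifier helpers are just made up for the
table-name example above):

    #include <unicode/uchar.h>      /* ICU character property API */

    /* Classify over the full 21-bit range, instead of relying on
       the CRT's 16-bit wchar_t functions. */
    static int is_ident_start(UChar32 c)
    {
        return u_isalpha(c);
    }

    static int is_ident_part(UChar32 c)
    {
        return u_isalpha(c) || u_isdigit(c) || c == 0x5F /* '_' */;
    }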

What this means for flex is that it's going to have to start being clever
about the way it classifies characters.  It's no longer sufficient to get
the user to specify the character classes; it is going to have to use (or
implement) the services of something that knows about the semantics of
Unicode characters.  This isn't trivial, so I would strongly suggest ICU for
this purpose.  Relying on base OS services certainly isn't a good move.  I
don't know what you would call this new version of flex, but it would be
quite a bit different from (and more heavyweight than) the traditional
ASCII-based version.

Due to this, I would say that it is almost certain that you would see flex
start to emit u_isalpha() and similar calls in the generated scanners.  I'm
not sure what the best way to implement this would be.  Once that is done,
the classified characters can be fed to the standard state machines.
Effectively we are compressing the character set space based on the
semantics of the characters, which is pretty much what you do when you
define your own lexer.  It's just that in this case the space is so large
and so complicated that the user can't do this alone.
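
A hypothetical sketch of that compression step (the classes chosen and their
names are invented for illustration):

    #include <unicode/uchar.h>

    /* Map a code point onto a small equivalence class; the DFA
       tables are indexed by class rather than by raw code point,
       so they stay as small as in the ASCII case. */
    enum { EC_OTHER, EC_ALPHA, EC_DIGIT, EC_UNDERSCORE };

    static int equiv_class(UChar32 c)
    {
        if (c == 0x5F)    return EC_UNDERSCORE;
        if (u_isdigit(c)) return EC_DIGIT;
        if (u_isalpha(c)) return EC_ALPHA;
        return EC_OTHER;
    }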

In my view this way of working would be a good thing.  Imagine that Unicode
releases a new character which they somehow forgot from one of your
customers' languages.  If flex were not doing this and instead had some kind
of internal classification tables, you would have to compile and ship a new
parser.  With ICU, however, you just ship a new data file (which on Win32 is
implemented as a resource-only DLL; I'm not sure how it does this on other
OSes).  You can be more sure that this isn't going to break, as it's just
data.  Well, slightly more sure ;)

Basically, what we need is examples of how it would be used, in order to
draw up a sensible set of requirements, and then to proceed from there.  The
way I propose seems sane to me, but my impoverished language skills stick me
firmly in the land of ISO-8859-15, so I would certainly say that I am not
going to be the best person to analyse requirements.  I guess anyone who has
needed this would simply have implemented a custom lexer, then fed the
results back to bison.  If anyone has done such a thing, I would be
interested to hear from them.

Oh, nearly forgot.  Yes, keeping track of UTF-16 strings is a little bit of
a pain, but it's not too vast.  There are macros provided that will help you
iterate through a string.  And I would simply report yyleng as the correct
character count, and yytext as the UTF-16 string; the user can take it from
there.  As input, of course, the scanner would also take UTF-16, analysing
the characters as it pulled them in (which it does one at a time, does it
not?) in order to determine whether they form a surrogate pair.  If so, the
whole pair is classified; if not, just the single 16-bit word.  The original
16-bit word (or pair thereof) is stored in yytext, and yyleng is incremented
by 1.  (Backing up through a string is, sensibly, a symmetric process.)
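
A sketch of that input loop using ICU's UTF-16 macros (U16_NEXT is real ICU;
the surrounding scaffolding is invented):

    #include <unicode/utf16.h>      /* U16_NEXT() and friends */

    /* Pull code points out of a UTF-16 buffer one at a time,
       consuming one 16-bit unit, or two for a surrogate pair. */
    static void scan(const UChar *buf, int32_t len)
    {
        int32_t i = 0;
        while (i < len) {
            UChar32 c;
            U16_NEXT(buf, i, len, c);   /* advances i by 1 or 2 */
            /* classify c, append the original unit(s) to yytext,
               and increment yyleng by 1 ... */
        }
    }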

Mark

> -----Original Message-----
> From: address@hidden [mailto:address@hidden] On Behalf
> Of Hans Aberg
> Sent: 24 August 2002 18:30
> To: Mark Weaver
> Cc: address@hidden
> Subject: RE: Flex and 32-bits characters
>
>
> At 13:42 +0100 2002/08/24, Mark Weaver wrote:
> >UCS-4 is a 4-byte representation of ISO-10646-1, which essentially equals
> >unicode.  There is some historical difference between ISO-10646-1 and
> >unicode, but they are (basically) the same thing.
>
> I have a vague memory of it. So it is Unicode one should use then.
>
> >  Unicode is a 21 bit
> >character set, which guarantees (presumably by having plenty of spare
> >code points) never to expand upon that range.
>
> Even though Unicode-defined characters proper will not extend beyond 21
> bits, I think they reserved some numbers for "user characters". -- So that
> Unicode may in reality be 24 bit or something. (Such a "user character"
> range will be important in typesetting, because one cannot expect Unicode
> to fully cover all symbols.)
>
> >  Then there are 5 basic ways of
> >representing the unicode character set:
> >
> >UTF-8, which uses 8 bit characters and stores a code point with 1-4
> >bytes.
> >UTF-16 (little or big endian), which uses 16 bit integers, and stores a
> >code point with 1 or 2 integers.
> >UTF-32 (LE/BE), which uses 32 bit integers and stores a code point with
> >1 integer.
>
> I think you should put UTF-n, where n >= 24. The thing is that in an
> implementation like Bison, one may want to reserve, say, Unicode character
> numbers and on top of that have token numbers. (I am not sure there ever
> will be, as there will be problems with table sizes.) Therefore, some
> padding bits should be reserved for other uses.
>
> >Most processing schemes (ref. ICU, windows) choose to use UTF-16
> >internally as it provides the nicest trade-off between memory and
> >performance.
>
> From discussions in the comp.std.c++ newsgroup I infer that when people
> say "most", they mean MSOS (Microsoft OS), because that is what they use.
> Microsoft probably did that because Unicode fit into 16 bits when they
> decided about it. Linux, I am told, is not doing that.
>
> So I think people will settle for 32 bit, at least internally in programs.
> -- It is too difficult to handle variable width characters inside a
> program, and it is probably not very efficient either, as most CPUs (PCs
> and up) are at least 32-bit.
>
> So I suspect that one should best use 32-bit integral types internally in
> programs to handle Unicode, with code converters for the externals.
>
> I think that intuition may play one a trick, because 32 bits for every
> character sounds like a lot. But memory doubles about every year lately,
> so it is in reality tinier than 8 bits were only half a decade ago!
>
> >The patch just defines YY_CHAR as wchar_t, which (cough) causes somewhat
> >of an explosion in the table size, as you would predict.  Flex spits out
> >its handy equivalence class table, which it indexes directly by character
> >to produce an int.  With a plain old char, this table is a modest 1K;
> >with a 16-bit wchar_t, it's 256K; and if your platform (as some do)
> >defines wchar_t as 32 bits, well...
>
> As you note, there is no compiler independent defined size for wchar_t in
> C/C++. Nor is there a good way to find a 32-bit integral type.
>
> One knows that a long will hold 2^32 (signed) integer numbers, even though
> that does not say anything about padding. So one can define macros that
> check wchar_t and various smaller integral types, moving towards "long",
> until one finds one that holds 2^32 integer numbers.
>
> >The main questions to be addressed as I see it are:
> >
> >1) What representation of unicode are you expecting to lex?  Are we
> >expecting flex to deal with this?  On this, I would pick one (say UTF-16
> >default platform endianness), and leave it to the caller to do the
> >relatively trivial conversion on the input stream.
>
> I will give my personal opinions (not knowing what the Flex developers
> eventually will do):
>
> First, variable width characters in the Flex internals will probably be
> a hell, as it is difficult to keep track of locations and relate those to
> file positions. So that suggests UTF-n, n >= 24, in the Flex lexer
> internals, with padding to the nearest integral type (most likely 32 bit).
>
> >  I don't see it as an
> >issue that some characters are represented by pairs of 16-bit integers;
> >the scanner would just treat these as two separate characters to match,
> >with the correct result ensuing.
>
> Location tracking becomes difficult, as one does not know if one should
> count characters or bytes. There might be some other points as well. For
> example, C++ does not seem at all built for variable width characters.
>
> >  The same applies to UTF-8.  In fact, what you are
> >doing with accepting UTF-16 as an input is simply a speed optimisation
> >(at the expense of memory); you save the scanner dealing with single
> >bytes when you know that you never really have any - hence the DFA is
> >smaller/faster.
>
> There was a post lately that suggested that implementing an NFA that
> converts to DFA states as needed and caches them is faster than a DFA.
> The reason might be that large tables are slow to look up relative to a
> set of local CPU computations. (The bottlenecks might be large table jumps
> and reading memory into the CPU.)
>
> >2) Flex itself needs to have a Unicode input for this to be truly
> >useful.  As it stands, it arbitrarily widens characters, which works for
> >ASCII but is unhelpful for anything else.  If you want to parse something
> >that was outside the ASCII range then you'd need flex to accept Unicode
> >(of some form) input.  (Of course, flex scans its input with a scanner it
> >generates, so this wouldn't be too hard once you'd found an acceptable
> >way of doing so.)
> >
> >3) Some way of solving the memory explosion problem that results from
> >having large ec tables.  More details on your NFA->DFA algorithm?
>
> This is described in the book by Aho, Sethi & Ullman, "Compilers...", sec.
> 3.6, p.117. One can do this for general grammars and pushdown automata as
> well, see the Parsing Techniques book:
>     http://www.cs.vu.nl/~dick/PTAPG.html
>
> The Aho et al. book says at the end of sec. 3.7, p.128 that this technique
> provides the best combined time/space complexity.
>
> >  Is this what was
> >being discussed earlier, when the great Mr Paxson popped in?  (I was
> >keeping a lazy eye on it.)
>
> He mentioned that he experimented with something like that, and noted that
> curiously enough the technique was faster than a full static DFA.
>
> Since the NFA tables are expected to be small, this might be worth
> investigating.
>
> Perhaps it does not actually solve the problem with large Unicode tables,
> because the DFA states must be stored somehow.
>
>   Hans Aberg




