help-flex

Re: Flex and 32-bits characters


From: Hans Aberg
Subject: Re: Flex and 32-bits characters
Date: Sat, 24 Aug 2002 11:32:35 +0200

At 16:20 -0400 2002/08/23, Antoine Fink wrote:
>I am currently working on a regular expression parser, using Flex and
>Yacc, and I want to be able to parse either ASCII, UTF-8 or UCS-4 strings.

What is UCS-4? Is it the same as UTF-n, with n = 24-32?

>The general idea was to convert anything to ucs-4 (using 32 bits chars),
>parse the regex, then re-convert (whenever possible) to the specified
>matching encoding. (That part was already done some time ago when we used
>our own C parsing program instead of Flex & Yacc, so this is not really
>the issue).

First note that I am not a Flex developer. But the issue was discussed here
and in the Bison lists quite a bit some time ago:

I think what you mention is the major candidate for implementing Unicode
support in Flex: hook up code converters (like C++ std::codecvt), so that
internally Flex only sees, say, UTF-32. This is the only way to handle the
many different possible encodings, as well as the problem of variable-width
characters.
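
As a rough illustration of the converter idea, here is a minimal sketch of
the UTF-8 to UTF-32 step, written by hand rather than with std::codecvt. The
function name decode_utf8 is hypothetical and not part of Flex, and the
sketch does not reject overlong encodings or surrogate code points.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: decode a UTF-8 byte string into UTF-32 code points,
// so that a scanner behind it only ever sees fixed-width characters.
// Simplification: overlong forms and surrogates are not rejected.
std::u32string decode_utf8(const std::string& in) {
    std::u32string out;
    std::size_t i = 0;
    while (i < in.size()) {
        unsigned char b = static_cast<unsigned char>(in[i]);
        char32_t cp;
        std::size_t extra;  // number of continuation bytes expected
        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else throw std::runtime_error("invalid UTF-8 lead byte");

        if (extra > in.size() - i - 1)
            throw std::runtime_error("truncated UTF-8 sequence");

        for (std::size_t k = 1; k <= extra; ++k) {
            unsigned char c = static_cast<unsigned char>(in[i + k]);
            if ((c & 0xC0) != 0x80)
                throw std::runtime_error("invalid UTF-8 continuation byte");
            cp = (cp << 6) | (c & 0x3F);  // append 6 payload bits
        }
        out.push_back(cp);
        i += extra + 1;
    }
    return out;
}
```

The variable-width handling stays entirely inside the converter; whatever
consumes the std::u32string can treat each element as one character.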

The real problem is that Flex uses static tables indexed by the character
type. Thus, with a 32-bit character type, you end up in principle with a
table of 2^32 entries, unless you somehow cut it down using some form of
table compression.
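
To put a number on it: assuming two-byte entries, a dense row of 2^32
entries would take 8 GiB for a single state. One possible way to cut this
down, sketched below, is to map code point ranges to a handful of character
classes and index the table by class instead. This is only an illustration
of the general idea, not a description of Flex's generated tables; all names
(Range, class_of, Dfa, make_identifier_dfa) are hypothetical.

```cpp
#include <algorithm>
#include <cassert>
#include <vector>

// Inclusive code point range mapped to a small class id (hypothetical layout).
struct Range { char32_t lo, hi; int cls; };

// Ranges must be sorted by lo and must not overlap.
// Code points outside every range fall into class 0 ("other").
int class_of(const std::vector<Range>& ranges, char32_t cp) {
    assert(cp <= 0x10FFFF);  // the Unicode code space ends here
    auto it = std::upper_bound(
        ranges.begin(), ranges.end(), cp,
        [](char32_t v, const Range& r) { return v < r.lo; });
    if (it == ranges.begin()) return 0;  // below the first range
    --it;                                // last range with lo <= cp
    return cp <= it->hi ? it->cls : 0;
}

// Dense table over (state, class) instead of (state, character).
struct Dfa {
    std::vector<Range> ranges;
    int num_classes;
    std::vector<int> next;  // next[state * num_classes + cls], -1 = reject

    int step(int state, char32_t cp) const {
        return next[state * num_classes + class_of(ranges, cp)];
    }
};

// Toy scanner for identifiers: a letter followed by letters or digits.
Dfa make_identifier_dfa() {
    Dfa d;
    d.ranges = {
        {U'0', U'9', 2},      // digits
        {U'A', U'Z', 1},      // Latin letters
        {U'a', U'z', 1},
        {0x0370, 0x03FF, 1},  // Greek block, treated as letters here
    };
    d.num_classes = 3;
    d.next = {
        // other  letter  digit
        -1,       1,      -1,  // state 0: start
        -1,       1,       1,  // state 1: inside an identifier
    };
    return d;
}
```

Here the whole transition table has 2 states x 3 classes = 6 entries, and
each lookup costs one binary search over the range list.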

One interesting alternative might be to make Flex produce a very compact
NFA machine table, which is converted to DFA states and cached as needed.
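
A minimal sketch of what such on-demand conversion could look like, assuming
an NFA with range-labelled edges and epsilon edges, and state 0 as the start
state. The types Nfa, Edge and LazyDfa are hypothetical; nothing like this
exists in Flex as far as this message indicates.

```cpp
#include <cstddef>
#include <map>
#include <set>
#include <utility>
#include <vector>

struct Edge { char32_t lo, hi; int to; };  // transition on [lo, hi]
struct Nfa {
    std::vector<std::vector<Edge>> edges;  // labelled edges per state
    std::vector<std::vector<int>> eps;     // epsilon edges per state
};
using StateSet = std::set<int>;

// Add everything reachable through epsilon edges.
StateSet closure(const Nfa& n, StateSet s) {
    std::vector<int> work(s.begin(), s.end());
    while (!work.empty()) {
        int q = work.back();
        work.pop_back();
        for (int t : n.eps[q])
            if (s.insert(t).second) work.push_back(t);
    }
    return s;
}

// DFA states are built only when the input actually reaches them.
// The Nfa must outlive the LazyDfa (it is held by reference).
class LazyDfa {
public:
    explicit LazyDfa(const Nfa& n) : nfa_(n) {
        start_ = intern(closure(n, StateSet{0}));
    }
    int start() const { return start_; }

    // Returns the DFA state reached from d on c, or -1 if there is none.
    int step(int d, char32_t c) {
        auto key = std::make_pair(d, c);
        auto hit = cache_.find(key);
        if (hit != cache_.end()) return hit->second;  // already computed
        StateSet next;
        for (int q : sets_[d])
            for (const Edge& e : nfa_.edges[q])
                if (e.lo <= c && c <= e.hi) next.insert(e.to);
        int id = next.empty() ? -1 : intern(closure(nfa_, next));
        cache_[key] = id;
        return id;
    }
    bool contains(int d, int q) const { return sets_[d].count(q) != 0; }
    std::size_t cached() const { return cache_.size(); }

private:
    // Give each distinct set of NFA states one DFA state id.
    int intern(const StateSet& s) {
        auto it = ids_.find(s);
        if (it != ids_.end()) return it->second;
        int id = static_cast<int>(sets_.size());
        ids_.emplace(s, id);
        sets_.push_back(s);
        return id;
    }
    const Nfa& nfa_;
    std::map<StateSet, int> ids_;
    std::vector<StateSet> sets_;
    std::map<std::pair<int, char32_t>, int> cache_;
    int start_ = -1;
};
```

The cache here is keyed on individual code points, so its size grows with
the distinct (state, character) pairs seen in the input rather than with the
size of the character type. A real implementation would presumably want to
bound or evict the cache.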

>The problem is that I am unable to make Flex read in 32-bit characters
>(in an easy fashion... say, typedef'ing chars to 32-bit integers, or
>re-#define'ing chars as 32-bit integers, but that won't work at all, for
>numerous reasons.)

There is a "Unicode Flex" on the Internet:
    Unicode Flex: ftp://ftp.lauton.com/pub/flex-2.5.4-unicode-patch.tar.gz
In reality, though, I think that it only changes char to wchar_t, and
assumes that the latter type is 16 bits wide.
If you want to experiment with current Flex, it is at:
    Flex Beta (2.5.15): ftp://ftp.uncg.edu/people/wlestes/

>After reading lots of posts on the web, I have found some ways to
>accomplish this 32-bit character lexing, and that which makes the most
>sense to me is to (locally) modify Flex's own source code and make it
>generate lexers that use 32-bit integers instead of chars.

So, as said above, this is the obvious way to go, but one then hits the
problem of large tables.

  Hans Aberg





