help-flex
Re: Flex and 32-bits characters


From: Antoine Fink
Subject: Re: Flex and 32-bits characters
Date: Mon, 26 Aug 2002 11:46:41 -0400

On Sat, 24 Aug 2002 11:32:35 +0200
Hans Aberg <address@hidden> wrote:

> At 16:20 -0400 2002/08/23, Antoine Fink wrote:
> >I am currently working on a regular expression parser, using Flex and
> >Yacc, and I want to be able to parse either ASCII, UTF-8 or UCS-4 strings.
> 
> What is UCS-4; same as UTF-n, n = 24-32?

It's a different 32-bit encoding, not one of the UTFs, and not very widely
used. I can't really go into details here because (a) I don't know a lot about
it either and (b) it's not really the issue here.

As I said (see below), converting to and from the different encodings (ASCII,
UTF-8, UCS-4...) is already solved. I only need Flex to read in 32-bit chars,
and I will write the lexing rules to interpret them. It's more a memory/data
structure issue than an interpreting/encoding problem. As long as Flex can
read 32-bit characters (whether they're UCS-4 or not), I'll be happy :)

> >The general idea was to convert anything to ucs-4 (using 32 bits chars),
> >parse the regex, then re-convert (whenever possible) to the specified
> >matching encoding. (That part was already done some time ago when we used
> >our own C parsing program instead of Flex & Yacc, so this is not really
> >the issue).
> 
> First note that I am not a Flex developer. But the issue was discussed here
> and in the Bison lists quite a bit some time ago:
> 
> I think what you mention is the major candidate for implementing Unicode
> onto Flex: Hook up code converters (like C++ std::codecvt), so that
> internally Flex only sees say UTF-32. This is the only way to handle the
> many different possible encodings, plus the problem of variable width
> characters.

Hm.. I'm sorry, I must have forgotten to specify this: the solution I seek
must use the C language and not C++... I agree that codecvt would be helpful
here, but again, in my current situation, the conversion is already done (back
and forth).
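
For readers who want the C-side equivalent of such a converter, here is a
minimal sketch of the UTF-8 to UCS-4 direction. It is not our actual
conversion code; the names ucs4_t and utf8_to_ucs4 are made up for
illustration, and the sketch does not reject overlong forms or surrogates,
which a production converter would have to do.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical name: a 32-bit code unit type for the lexer input. */
typedef uint32_t ucs4_t;

/*
 * Decode a UTF-8 byte string into 32-bit code points.
 * Returns the number of code points written, or (size_t)-1 on malformed
 * input or when the output buffer is too small.
 * Simplification: overlong forms and surrogates are NOT rejected.
 */
size_t utf8_to_ucs4(const unsigned char *in, size_t in_len,
                    ucs4_t *out, size_t out_cap)
{
    size_t i = 0, n = 0;

    assert(in != NULL && out != NULL);

    while (i < in_len) {
        unsigned char b = in[i];
        ucs4_t cp;
        size_t extra;   /* number of continuation bytes expected */
        size_t k;

        if (b < 0x80)                { cp = b;        extra = 0; }
        else if ((b & 0xE0) == 0xC0) { cp = b & 0x1F; extra = 1; }
        else if ((b & 0xF0) == 0xE0) { cp = b & 0x0F; extra = 2; }
        else if ((b & 0xF8) == 0xF0) { cp = b & 0x07; extra = 3; }
        else return (size_t)-1;      /* stray continuation or invalid lead */

        if (extra > in_len - i - 1)
            return (size_t)-1;       /* sequence truncated */

        for (k = 1; k <= extra; k++) {
            unsigned char c = in[i + k];
            if ((c & 0xC0) != 0x80)
                return (size_t)-1;   /* not a continuation byte */
            cp = (cp << 6) | (c & 0x3F);
        }

        if (n >= out_cap)
            return (size_t)-1;       /* output buffer full */
        out[n++] = cp;
        i += extra + 1;
    }
    return n;
}
```

The resulting ucs4_t array is the kind of buffer I would like a Flex-generated
scanner to consume directly.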

> The problem is really that Flex uses static tables indexed on the character
> type. Thus, if a 32-bit character type, you end up in principle with a 2^32
> large table, unless you somehow cut it down using some form of table
> compression.
> 
> One interesting alternative might be to make Flex produce a very compact
> NFA machine table, which is converted to DFA states and cached as needed.
> 

I know about the character-type indexed tables. There is more than one
possible workaround (hash tables, cached compressed tables, DFAs..) but in
fact, the goal behind this whole 32-bits-in-Flex thing is to convert a regular
expression to a DFA. We are quite experienced with DFAs and transducers so
this could be a little easier for us, but then again, we have to look closely
at the problem (if this is the way we want to go, that is, digging into Flex's
own source code..)
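
To make the table-size problem concrete: a dense transition row with one
2-byte entry per possible 32-bit value would take 2^32 * 2 bytes = 8 GiB per
state. One possible way around it, sketched below, is to map each code point
to a small equivalence class with a binary search over sorted ranges, and
index the DFA table by class instead of by character. This is only an
illustration of the idea, not how Flex does it internally; every name here
(cp_range, ec_of, next_state, match_len) and the toy ranges are assumptions of
mine.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical structure: an inclusive code point range and its class. */
struct cp_range {
    uint32_t lo, hi;
    unsigned char ec;
};

/* Sorted, non-overlapping ranges for a toy lexer.
 * Any code point not listed falls into class 0 ("other"). */
static const struct cp_range ranges[] = {
    { 0x0030, 0x0039, 1 },   /* 0-9: digit */
    { 0x0041, 0x005A, 2 },   /* A-Z: letter */
    { 0x0061, 0x007A, 2 },   /* a-z: letter */
    { 0x0391, 0x03C9, 2 },   /* Greek alpha..omega (toy range): letter */
};
#define N_RANGES  (sizeof ranges / sizeof ranges[0])
#define N_CLASSES 3

/* Binary search: O(log N_RANGES) per character, no 2^32-entry table. */
unsigned char ec_of(uint32_t cp)
{
    size_t lo = 0, hi = N_RANGES;

    while (lo < hi) {
        size_t mid = lo + (hi - lo) / 2;
        if (cp < ranges[mid].lo)
            hi = mid;
        else if (cp > ranges[mid].hi)
            lo = mid + 1;
        else
            return ranges[mid].ec;
    }
    return 0;
}

/* Dense DFA table indexed by [state][class]; -1 means no transition.
 * State 0 = start, 1 = inside a number, 2 = inside an identifier. */
static const signed char next_state[3][N_CLASSES] = {
    /*        other digit letter */
    /* 0 */ {  -1,    1,    2 },
    /* 1 */ {  -1,    1,   -1 },
    /* 2 */ {  -1,    2,    2 },
};

/* Length of the longest number or identifier at the start of the input. */
size_t match_len(const uint32_t *in, size_t len)
{
    int state = 0;
    size_t i;

    for (i = 0; i < len; i++) {
        int next = next_state[state][ec_of(in[i])];
        if (next < 0)
            break;
        state = next;
    }
    return i;
}
```

With this layout the per-state cost is N_CLASSES entries rather than 2^32; the
price is a logarithmic lookup per input character.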

> >The problem is that I am unable to make Flex read in 32-bit characters
> >(in an easy fashion... say, typedef'ing chars to 32-bit integers, or
> >re-#define'ing chars as 32-bit integers, but that won't work at all, for
> >numerous reasons.)
> 
> There is a "Unicode Flex" on the Internet:
>     Unicode Flex: ftp://ftp.lauton.com/pub/flex-2.5.4-unicode-patch.tar.gz
> In reality, though, I think it only changes char to wchar_t, and assumes
> that the latter type is 16 bits.
> If you want to experiment with current Flex, it is at:
>     Flex Beta (2.5.15): ftp://ftp.uncg.edu/people/wlestes/

Yep, this patch only helps for wchar_t (16-bit) characters. I might consider
doing such a patch (using the same approach) but for 32-bit chars...
 
> >After reading lots of posts on the web, I have found some ways to
> >accomplish this 32-bit character lexing, and that which makes the most
> >sense to me is to (locally) modify Flex's own source code and make it
> >generate lexers that use 32-bit integers instead of chars.
> 
> So, as said above, this is the obvious way to go, but one then hits the
> problem of large tables.
> 
>   Hans Aberg

Thanks! Your interest in my problem is appreciated!! I will keep posting on
help-flex about this (and hopefully, submit a decent patch or something useful
to others :)
------------ 
Antoine Fink, 
Co-op Software Designer
Solidum Systems corp
(613)724-6004 x268 - address@hidden




