help-flex


Re: UTF-8 doc scanning


From: Hans Aberg
Subject: Re: UTF-8 doc scanning
Date: Thu, 28 Oct 2004 19:21:13 +0200
User-agent: Microsoft-Outlook-Express-Macintosh-Edition/5.0.6

At 11:43 -0700 2004/10/06, Raman Muthukrishnan wrote:
>Does anyone have experience with scanning a UTF-8 doc
>with UTF-8 regular expressions?
>Theoretically is a 8-bit scanner suited to match UTF-8
>regular expressions?

One idea is perhaps to turn Flex into a UTF-8 scanner in the future. The
advantage of this approach is that the index ranges of the scanner tables do
not become larger. There is a patch for 16-bit characters, made in the days
when Unicode still fit into 16 bits:
  ftp://ftp.lauton.com/pub/flex-2.5.4-unicode-patch.tar.gz
But then the scanner tables become very large, since they are indexed over
2^16 values.
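
To illustrate why the table index range stays small with a UTF-8 scanner, here
is a minimal sketch in Python (not part of Flex or of the patch above). The
helper names utf8_encode and table_entries are hypothetical. Every code point
is broken into bytes, so the scanner only ever indexes its tables by values in
0..255, whereas a 16-bit scanner indexes them by values in 0..65535.

```python
def utf8_encode(cp: int) -> bytes:
    """Encode one Unicode code point as UTF-8 bytes (hypothetical helper)."""
    if cp < 0 or cp > 0x10FFFF or 0xD800 <= cp <= 0xDFFF:
        raise ValueError("not a Unicode scalar value")
    if cp < 0x80:                       # 1 byte:  0xxxxxxx
        return bytes([cp])
    if cp < 0x800:                      # 2 bytes: 110xxxxx 10xxxxxx
        return bytes([0xC0 | (cp >> 6), 0x80 | (cp & 0x3F)])
    if cp < 0x10000:                    # 3 bytes: 1110xxxx 10xxxxxx 10xxxxxx
        return bytes([0xE0 | (cp >> 12),
                      0x80 | ((cp >> 6) & 0x3F),
                      0x80 | (cp & 0x3F)])
    return bytes([0xF0 | (cp >> 18),    # 4 bytes: 11110xxx 10xxxxxx 10xxxxxx 10xxxxxx
                  0x80 | ((cp >> 12) & 0x3F),
                  0x80 | ((cp >> 6) & 0x3F),
                  0x80 | (cp & 0x3F)])


def table_entries(num_states: int, alphabet_size: int) -> int:
    """Entries in an uncompressed transition table: one per (state, symbol)."""
    return num_states * alphabet_size
```

For a scanner with, say, 100 states (an arbitrary figure chosen for
illustration), table_entries(100, 256) gives 25600 entries for byte-indexed
tables, against table_entries(100, 65536) = 6553600 for 16-bit indexing. The
price of the byte-oriented approach is that a pattern over non-ASCII
characters has to be expressed as a pattern over byte sequences.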

Unicode code points need up to 21 bits; if characters of 21 bits or more are
to be admitted directly as table indices, one needs to write table
compression algorithms.
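
As one possible illustration of such table compression, here is a sketch of a
two-stage table in Python. This is a generic technique, not something Flex is
known to do for Unicode, and the names BLOCK, toy_class, compress and lookup
are all hypothetical. The flat table over the whole code point range is cut
into fixed-size blocks, and identical blocks are stored only once.

```python
BLOCK = 256  # assumed block size; any divisor of the table length works here


def toy_class(cp: int) -> int:
    """Toy character class: 1 = ASCII letter, 2 = U+0370..U+03FF, 0 = other."""
    if 0x41 <= cp <= 0x5A or 0x61 <= cp <= 0x7A:
        return 1
    if 0x370 <= cp <= 0x3FF:
        return 2
    return 0


def compress(table):
    """Return (index, blocks): each distinct BLOCK-sized chunk is stored once."""
    blocks = []   # the distinct chunks
    index = []    # for each chunk position, which distinct chunk it uses
    seen = {}     # chunk contents -> position in blocks
    for start in range(0, len(table), BLOCK):
        chunk = tuple(table[start:start + BLOCK])
        if chunk not in seen:
            seen[chunk] = len(blocks)
            blocks.append(chunk)
        index.append(seen[chunk])
    return index, blocks


def lookup(index, blocks, cp: int) -> int:
    """Two-stage lookup, equivalent to table[cp] on the uncompressed table."""
    return blocks[index[cp // BLOCK]][cp % BLOCK]
```

With flat = [toy_class(cp) for cp in range(0x110000)], compress(flat) replaces
1114112 entries by an index of 4352 entries plus the distinct blocks. How well
this works for real scanner tables depends on how repetitive their rows are,
which this toy example does not measure.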

-- 
  Hans Aberg





