help-flex

RE: UTF-8 doc scanning


From: Thurn, Martin
Subject: RE: UTF-8 doc scanning
Date: Thu, 7 Oct 2004 08:20:14 -0400

> Theoretically is a 8-bit scanner suited to match UTF-8
> regular expressions?

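  Depends on what exactly you mean by "UTF-8 regexen".  An 8-bit scanner can 
match UTF-8 one byte at a time.  Start by reading the UTF-8 spec and writing 
patterns from it.  I did this years ago and my patterns looked like this (each 
rule matches exactly ONE multi-byte Unicode character; plain ASCII bytes are 
not covered).  The standard may have changed since then.  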

UB     [\200-\277]
%%
[\300-\337]{UB}             { UNICODE }
[\340-\357]{UB}{2}          { UNICODE }
[\360-\367]{UB}{3}          { UNICODE }
[\370-\373]{UB}{4}          { UNICODE }
[\374-\375]{UB}{5}          { UNICODE }
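
For readers who want to check the byte ranges outside of flex, here is a rough C sketch of the same five rules. It is only an illustration, not part of the original scanner: the function name utf8_multibyte_len and the max_len parameter are made up for this example. The octal ranges above correspond to 0xC0-0xDF, 0xE0-0xEF, 0xF0-0xF7, 0xF8-0xFB and 0xFC-0xFD for the lead byte and 0x80-0xBF for UB. On the point about the standard changing: RFC 3629 (November 2003) limits UTF-8 to four bytes per character, so the sketch takes a cap on the accepted length. Like the patterns, it does not reject overlong encodings or surrogates.

```c
#include <stddef.h>

/* Continuation byte, same range as UB: [\200-\277] == 0x80..0xBF. */
static int is_ub(unsigned char c)
{
    return c >= 0x80 && c <= 0xBF;
}

/*
 * Hypothetical helper mirroring the five flex rules.
 * Returns the length in bytes of the multi-byte sequence starting at s
 * (with n bytes available), or 0 if none of the rules would match.
 * max_len caps the accepted length: pass 6 for the patterns as posted,
 * or 4 to follow the RFC 3629 limit.
 */
size_t utf8_multibyte_len(const unsigned char *s, size_t n, size_t max_len)
{
    size_t len, i;

    if (n == 0)
        return 0;

    /* Lead byte decides how many bytes the character occupies. */
    if (s[0] >= 0xC0 && s[0] <= 0xDF)
        len = 2;                /* [\300-\337]{UB}    */
    else if (s[0] >= 0xE0 && s[0] <= 0xEF)
        len = 3;                /* [\340-\357]{UB}{2} */
    else if (s[0] >= 0xF0 && s[0] <= 0xF7)
        len = 4;                /* [\360-\367]{UB}{3} */
    else if (s[0] >= 0xF8 && s[0] <= 0xFB)
        len = 5;                /* [\370-\373]{UB}{4} */
    else if (s[0] >= 0xFC && s[0] <= 0xFD)
        len = 6;                /* [\374-\375]{UB}{5} */
    else
        return 0;               /* ASCII, stray UB, or 0xFE/0xFF */

    if (len > max_len || len > n)
        return 0;

    /* Every remaining byte must be a continuation byte. */
    for (i = 1; i < len; i++)
        if (!is_ub(s[i]))
            return 0;

    return len;
}
```

Calling it with max_len set to 6 accepts everything the rules above accept; setting it to 4 drops the last two rules.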

 - - Martin




