Re: [vile] spellflt.l: Include UTF-8 code points
From: Thomas Dickey
Subject: Re: [vile] spellflt.l: Include UTF-8 code points
Date: Sun, 23 Jun 2019 16:35:01 -0400
User-agent: Mutt/1.5.23 (2014-03-12)
On Sun, Jun 23, 2019 at 09:47:18PM +0200, Michael von der Heide wrote:
> It works (hunspell) for me with words like "prüfen" or "Straße". Flex
> generates an 8-bit scanner. UTF-8 should work. Would you mind testing it?
sorry - when you said "code points", I had in mind Unicode.
Applying the term to UTF-8 sequences doesn't seem entirely correct,
though I'm aware people use the two interchangeably. (not to argue,
but a string isn't a point)
lex/flex will allow ranges, and hexadecimal escapes are standard (hence supported by "lex" too):
http://pubs.opengroup.org/onlinepubs/9699919799/utilities/lex.html
> --
> Michael von der Heide
>
>
> Thomas Dickey <address@hidden> schrieb am So., 23. Juni 2019, 21:24:
>
> > On Sun, Jun 23, 2019 at 07:42:26PM +0200, Michael von der Heide wrote:
> > > Would it be possible to include UTF-8 code points to check words
> > containing
> > > umlauts?
> > >
> > > WORD ([a-zA-Z]|\xc3[\x80-\xbf])+
for reference, that's the UTF-8 encoding for the Unicode codepoints 192-255:
192: 192 0300 0xc0 text "\300" utf8 \303\200
255: 255 0377 0xff text "\377" utf8 \303\277
and
0303: 195 0303 0xc3 text "\303" utf8 \303\203
0200: 128 0200 0x80 text "\200" utf8 \302\200
0277: 191 0277 0xbf text "\277" utf8 \302\277
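The same mapping can be checked mechanically. This is only a sketch in Python (not part of vile or flex; the helper name utf8_range is made up for illustration) which encodes each code point from 192 to 255 and collects the lead and trail bytes:

```python
# Sketch: confirm that U+00C0..U+00FF all encode as 0xC3 followed by 0x80..0xBF.
# utf8_range is a hypothetical helper name, used here only for illustration.

def utf8_range(first, last):
    """Return the UTF-8 byte sequence for each code point in first..last."""
    return [chr(cp).encode("utf-8") for cp in range(first, last + 1)]

encoded = utf8_range(0xC0, 0xFF)

# Every sequence is two bytes long; gather the distinct lead bytes
# and the sorted list of trail bytes.
lead_bytes = {seq[0] for seq in encoded}
trail_bytes = sorted(seq[1] for seq in encoded)

print(sorted(hex(b) for b in lead_bytes))          # ['0xc3']
print(hex(trail_bytes[0]), hex(trail_bytes[-1]))   # 0x80 0xbf
```

Note that the range is not letters only: it also covers U+00D7 (multiplication sign) and U+00F7 (division sign), so a pattern built on it will accept those as well.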
Possibly clearer (ispell on my Debian8 works with this):
diff -u -r1.59 filters/spellflt.l
--- filters/spellflt.l 2013/12/02 01:32:53 1.59
+++ filters/spellflt.l 2019/06/23 20:28:42
@@ -157,7 +157,10 @@
%}
-WORD [[:alpha:]]([[:alnum:]])*
+ALPHA [[:alpha:]]
+UMLAUT \xc3[\x80-\xbf]
+LETTER ({ALPHA}|{UMLAUT})+
+WORD {LETTER}({LETTER}|[[:digit:]])*
%%
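To try the patterns without rebuilding the filter, the definitions can be approximated with Python's re module working on bytes. This is a rough sketch, not what flex generates: [A-Za-z] stands in for [[:alpha:]] and [0-9] for [[:digit:]] (both assumptions), and the words() helper is hypothetical.

```python
import re

# Byte-level approximations of the definitions in the patch.
ALPHA = rb"[A-Za-z]"                 # assumed stand-in for [[:alpha:]]
UMLAUT = rb"\xc3[\x80-\xbf]"         # UTF-8 for U+00C0..U+00FF
LETTER = rb"(?:" + ALPHA + rb"|" + UMLAUT + rb")+"
WORD = re.compile(LETTER + rb"(?:" + LETTER + rb"|[0-9])*")

def words(text):
    """Return the substrings of text that the WORD pattern would match."""
    data = text.encode("utf-8")
    return [m.group().decode("utf-8") for m in WORD.finditer(data)]

print(words("prüfen Straße x2"))     # ['prüfen', 'Straße', 'x2']
```

Because the 0xC3 lead byte is only ever consumed together with its trail byte, each match ends on a complete UTF-8 sequence and can be decoded safely.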
> > > WORD ([a-zA-Z]|\xc3[\x80-\xbf])+
> >
> > lex/flex doesn't do that :-(
> >
> > They use small (256-entry) tables for the character types.
> >
> > I've seen a (long ago) patch to use big tables (which I've read
> > doesn't work well).
> >
> > on my (too-long) to-do list, I have an idea which could be developed,
> > to provide the feature using character-classes. That is, flex could
> > be modified (perhaps a month's work...)
--
Thomas E. Dickey <address@hidden>
https://invisible-island.net
ftp://ftp.invisible-island.net