Re: Regular expressions for Unicode general categories
From: Derick Eddington
Subject: Re: Regular expressions for Unicode general categories
Date: Sun, 07 Dec 2008 16:49:55 -0800
On Mon, 2008-12-08 at 00:35 +0100, Peter Dyballa wrote:
> On 07.12.2008 at 21:47, Derick Eddington wrote:
>
> > So, what can I do? If Emacs regular expressions' backslash construct
> > `\cC' supported Unicode general categories, or if there was some
> > construct which did, I think that would do it nicely. Is that
> > planned, or should I resort to doing more manual parsing, or something
> > else?
>
>
> Can't you use the Unicode characters themselves? In ranges like
> [À-Ëà-ë]?
I'm already using `rx-to-string' on my computed character sets: my one
big SRE (s-expression regular expression) has sub-SREs of the form
`(char . ,<list-of-characters>), and `rx-to-string' consolidates the
characters into ranges. I'm assuming it does so as much as possible, so
I think I'm already using ranges as much as I can. Here's a modified,
simplified version of what I'm doing, which shows that `rx-to-string'
is computing ranges:
(require 'rx)
(let* ((general-categories
        (let ((al (list (list 'Po) (list 'Sc))) ;; removed a bunch of others
              (c 0))
          (while (< c #x110000)
            (unless (and (<= #xD800 c) (<= c #xDFFF))
              (let* ((gc (get-char-code-property c 'general-category))
                     (a (assq gc al)))
                (when a (setcdr a (cons c (cdr a))))))
            (setq c (1+ c)))
          al))
       (char-set (lambda (gc) `(char . ,(cdr (assq gc general-categories)))))
       (Po (funcall char-set 'Po))
       (Sc (funcall char-set 'Sc))
       ;; removed a bunch of other stuff
       (thing `(seq "foo" (or ,Po ,Sc) "bar")))
  (rx-to-string thing))
=>
"\\(?:foo\\(?:[!-#%-'*,./:;?@\\¡·¿;·՚-՟։׀׃׆׳״؉؊،؍؛؞؟٪-٭۔܀-܍߷-߹।॥॰෴๏๚๛༄-༒྅࿐-࿔၊-၏჻፡-፨᙭᙮᛫-᛭᜵᜶។-៖៘-៚᠀-᠅᠇-᠊᥄᥅᧞᧟᨞᨟᭚-᭠᰻-᰿᱾᱿‖‗†-‧‰-‸※-‾⁁-⁃⁇-⁑⁓⁕-⁞⳹-⳼⳾⳿⸀⸁⸆-⸈⸋⸎-⸖⸘⸙⸛⸞⸟⸪-⸮⸰、-〃〽・꘍-꘏꙳꙾꡴-꡷꣎꣏꤮꤯꥟꩜-꩟︐-︖︙︰﹅﹆﹉-﹌﹐-﹒﹔-﹗﹟-﹡﹨﹪﹫!-#%-'*,./:;?@\。、・𐄀𐄁𐎟𐏐𐤟𐤿𐩐-𐩘𒑰-𒑳]\\|[$¢-¥؋৲৳૱௹฿៛₠-₵﷼﹩$¢£¥₩]\\)bar\\)"
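For illustration only, the kind of consolidation visible in the output
above can be sketched by hand. This is not how `rx-to-string' is
implemented; `my-codepoints-to-ranges' is a hypothetical helper name
introduced here, and it merely collapses a list of code points into
inclusive (START . END) conses:

```elisp
;; Hypothetical helper, not part of rx: collapse a list of code points
;; into a sorted list of inclusive (START . END) ranges.
(defun my-codepoints-to-ranges (codepoints)
  "Return CODEPOINTS as a sorted list of inclusive (START . END) ranges."
  (let ((ranges nil))
    ;; Sort a copy so the caller's list is left untouched.
    (dolist (c (sort (copy-sequence codepoints) #'<))
      (let ((last (car ranges)))
        (cond
         ;; Duplicate of a code point already covered: nothing to do.
         ((and last (<= c (cdr last))))
         ;; Directly follows the current range: extend it.
         ((and last (= c (1+ (cdr last))))
          (setcdr last c))
         ;; Otherwise start a new single-character range.
         (t (push (cons c c) ranges)))))
    (nreverse ranges)))

(my-codepoints-to-ranges '(?c ?a ?b ?x ?! ?a))
;; => ((33 . 33) (97 . 99) (120 . 120))
```

Even with consolidation like this, a scattered category such as Po
still ends up as many short ranges, which is what the regexp above
shows.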
If I did type the characters themselves in "[x-y]" ranges, I'd have to
figure out a lot of them by hand, because the Unicode general categories
are not simple ranges; they're scattered across the code points. I need
these general categories, which have these numbers of elements:
((Lu 1438) (Ll 1770) (Lt 31) (Lm 187) (Lo 90794) (Mn 1082) (Nl 214)
(No 349) (Pd 20) (Pc 10) (Po 318) (Sc 41) (Sm 946) (Sk 99) (So 3695)
(Co 137468) (Nd 408) (Mc 269) (Me 13) (Zs 18) (Zl 1) (Zp 1))
which is way more than I can manually manage.
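Counts like these can be gathered with a loop of the same shape as the
one above. The following is a minimal sketch, not the code that
produced the numbers in this message; `my-count-general-categories' is
a hypothetical name, it returns dotted (CATEGORY . COUNT) pairs rather
than the two-element lists shown, and the totals depend on the Unicode
data shipped with the Emacs in use:

```elisp
;; Hypothetical helper: tally characters per Unicode general category.
(defun my-count-general-categories (from to)
  "Count characters per general category for code points FROM..TO inclusive.
Return an alist of (CATEGORY . COUNT).  Surrogates are skipped."
  (let ((counts nil)
        (c from))
    (while (<= c to)
      ;; Skip the surrogate range, as in the loop above.
      (unless (and (<= #xD800 c) (<= c #xDFFF))
        (let* ((gc (get-char-code-property c 'general-category))
               (cell (assq gc counts)))
          (if cell
              (setcdr cell (1+ (cdr cell)))
            (push (cons gc 1) counts))))
      (setq c (1+ c)))
    counts))

;; Small demonstration on the ASCII range only; scanning everything up
;; to #x10FFFF works the same way but takes noticeably longer.
(my-count-general-categories 0 127)
```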
--
: Derick