help-gnu-emacs

Re: Regular expressions for Unicode general categories


From: Derick Eddington
Subject: Re: Regular expressions for Unicode general categories
Date: Sun, 07 Dec 2008 16:49:55 -0800

On Mon, 2008-12-08 at 00:35 +0100, Peter Dyballa wrote:
> Am 07.12.2008 um 21:47 schrieb Derick Eddington:
> 
> > So, what can I do?  If Emacs regular expressions' backslash construct
> > `\cC' supported Unicode general categories, or if there was some
> > construct which did, I think that would do it nicely.  Is that
> > planned, or should I resort to doing more manual parsing, or something
> > else?
> 
> 
> Can't you use the Unicode characters themselves? In ranges like
> [À-Ëà-ë]?

I'm using `rx-to-string' on my computed character sets: I call it on one
big SRE (s-expression regular expression) that has sub-SREs of the form
`(char . ,<list-of-characters>).  `rx-to-string' consolidates the
characters into ranges, and I'm assuming it does so as much as possible,
so I think I'm already using ranges wherever I can.  Here's a modified,
simplified version of what I'm doing; it shows that `rx-to-string' is
computing ranges:

(require 'rx)

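(let* (;; Alist mapping each general category to the list of its characters.
       (general-categories
        (let ((al (list (list 'Po) (list 'Sc))) ;; removed a bunch of others
              (c 0))
          ;; Walk every code point, skipping the surrogates U+D800..U+DFFF.
          (while (< c #x110000)
            (unless (and (<= #xD800 c) (<= c #xDFFF))
              (let* ((gc (get-char-code-property c 'general-category))
                     (a (assq gc al)))
                (when a (setcdr a (cons c (cdr a))))))
            (setq c (1+ c)))
          al))
       ;; Build the SRE form (char C1 C2 ...) for category GC.
       (char-set (lambda (gc) `(char . ,(cdr (assq gc general-categories)))))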
       (Po (funcall char-set 'Po))
       (Sc (funcall char-set 'Sc))
       ;; removed a bunch of other stuff
       (thing `(seq "foo" (or ,Po ,Sc) "bar")))
  (rx-to-string thing))
=> 
"\\(?:foo\\(?:[!-#%-'*,./:;?@\\¡·¿;·՚-՟։׀׃׆׳״؉؊،؍؛؞؟٪-٭۔܀-܍߷-߹।॥॰෴๏๚๛༄-༒྅࿐-࿔၊-၏჻፡-፨᙭᙮᛫-᛭᜵᜶។-៖៘-៚᠀-᠅᠇-᠊᥄᥅᧞᧟᨞᨟᭚-᭠᰻-᰿᱾᱿‖‗†-‧‰-‸※-‾⁁-⁃⁇-⁑⁓⁕-⁞⳹-⳼⳾⳿⸀⸁⸆-⸈⸋⸎-⸖⸘⸙⸛⸞⸟⸪-⸮⸰、-〃〽・꘍-꘏꙳꙾꡴-꡷꣎꣏꤮꤯꥟꩜-꩟︐-︖︙︰﹅﹆﹉-﹌﹐-﹒﹔-﹗﹟-﹡﹨﹪﹫!-#%-'*,./:;?@\。、・𐄀𐄁𐎟𐏐𐤟𐤿𐩐-𐩘𒑰-𒑳]\\|[$¢-¥؋৲৳૱௹฿៛₠-₵﷼﹩$¢£¥₩]\\)bar\\)"

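The range consolidation visible in the output above can be illustrated
with a small standalone sketch.  This is not how `rx-to-string' is
actually implemented; `my-chars-to-ranges' is a hypothetical helper
name, and the sketch only assumes that adjacent code points should be
merged into (FROM . TO) pairs:

```elisp
;; Sketch only: `my-chars-to-ranges' is a hypothetical helper, not part of rx.
(defun my-chars-to-ranges (chars)
  "Collapse CHARS, a list of code points, into a list of (FROM . TO) ranges.
The input list is not modified; duplicates are ignored."
  (let ((sorted (sort (copy-sequence chars) #'<))
        (ranges nil))
    (dolist (c sorted)
      (let ((prev (car ranges)))
        (cond
         ;; C is already covered by the most recent range (a duplicate).
         ((and prev (<= c (cdr prev))) nil)
         ;; C directly follows the most recent range: extend that range.
         ((and prev (= c (1+ (cdr prev)))) (setcdr prev c))
         ;; Otherwise start a new single-character range.
         (t (push (cons c c) ranges)))))
    (nreverse ranges)))
```

For example, (my-chars-to-ranges '(?a ?c ?b ?x)) returns
((97 . 99) (120 . 120)), i.e. the ranges a-c and x.
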
If I typed the characters themselves in "[x-y]" ranges, I'd have to
figure out a lot of them by hand, because the Unicode general categories
are not simple ranges; they're scattered across the code points.  I need
these general categories, which have these numbers of elements:

((Lu 1438) (Ll 1770) (Lt 31) (Lm 187) (Lo 90794) (Mn 1082) (Nl 214) 
(No 349) (Pd 20) (Pc 10) (Po 318) (Sc 41) (Sm 946) (Sk 99) (So 3695) 
(Co 137468) (Nd 408) (Mc 269) (Me 13) (Zs 18) (Zl 1) (Zp 1))

which is way more than I can manually manage.
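
Counts like these can be tallied with the same kind of loop as above.
A minimal sketch, assuming the surrogates are skipped as before;
`my-count-general-categories' is a hypothetical name, and the exact
numbers depend on the Unicode data shipped with the Emacs in use:

```elisp
;; Sketch only: `my-count-general-categories' is a hypothetical helper.
(defun my-count-general-categories (categories)
  "Return an alist of (CATEGORY . COUNT) for each symbol in CATEGORIES.
Surrogate code points U+D800..U+DFFF are skipped."
  (let ((counts (mapcar (lambda (gc) (cons gc 0)) categories))
        (c 0))
    (while (< c #x110000)
      (unless (and (<= #xD800 c) (<= c #xDFFF))
        (let ((cell (assq (get-char-code-property c 'general-category)
                          counts)))
          ;; Only categories that were asked for have a cell to bump.
          (when cell (setcdr cell (1+ (cdr cell))))))
      (setq c (1+ c)))
    counts))
```

Calling (my-count-general-categories '(Zl Zp)) walks every code point
once and returns one (CATEGORY . COUNT) pair per requested category.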

-- 
: Derick