emacs-devel
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Improving regexp-opt


From: Miguel V. S. Frasson
Subject: Re: Improving regexp-opt
Date: Fri, 12 Apr 2019 12:40:48 -0300

Hi

Adding some thoughts on why current fa implementation is so slow. 

It strives to keep well nested trees in all steps, so it spends a lot of time on branch compassion. This ensures well nested trees for the fa and eases final fa to regexp conversion. 

If we drop well-formness for fa and manage to convert an not well-formed fa to regexp, the process would be very fast... 

Regards

Miguel

Em sex, 12 de abr de 2019 12:06, Miguel V. S. Frasson <address@hidden> escreveu:
Dear Stefan and others

Some time ago I suggested an improvement in regexp-opt, factoring
similarities at the end of groups. At the end, Stefan wrote:

Em sex, 8 de fev de 2019 às 01:48, Stefan Monnier
<address@hidden> escreveu:
> A much better approach is to go for a real "regexp to NFA/DFA
> conversion".  The `lex.el` package is one such example, but it's very
> inefficient (in terms of building the FA and in the size of the FA, not
> in terms of running the FA).

After some time, I had an idea of simplification by FA.  The base idea
is implement FA as "nodes" being lists of ARROWi and arrows being
(CHAR . NODE). For example the initial FA for strings ("abd" "acd") is

 >1 --a--> 2 --b--> 3 --d--> 4 --epsilon--> nil
           |                                 ^
           +---c--> 5 --d--> 6 --epsilon-----|
and inplemented as
(?a (?b (?d (0))) (?c (?d (0))))
= ((?a . ((?b . ((?d . ((0 . nil)))))
          (?c . ((?d . ((0 . nil)))))))

Note that nodes 3 and 5 are `equal'.  Simplification is to make them `eq'.

Also, there is simplification where a node is equal to a subset of a
parent node, resulting is a ? construction.
For example,
            +---f--> 9 ----------epsilon -------------\
            |                                          v
  >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon--> nil
            |                 ^
            +---d--> 6 --e---/
Node 4 is equal to a subnode derived from 2, resulting on
            +---epsilon-------+
            |                 v
  >1 --a--> 2 --b--> 3 --c--> 4 --f--> 5 --epsilon--> nil
            |                 ^
            +---d--> 6 --e--> 7

I made an inplementation, a patch to regexp-opt.el

The pros:
If the resulting strings "came from" a regexp that is splittable, the
FA implementation always simplifies to it.  In pratice, these are
uncommun, and in most cases, the results are equivalent.

The cons:
The algorithm for FA seams to have greater computation complexity,
takes about 20 times to compute in average.

Example, the case on the

(setq strings '("cond" "if" "when" "unless" "while"
                "let" "let*" "progn" "prog1" "prog2"
                "save-restriction" "save-excursion" "save-window-excursion"
                "save-current-buffer" "save-match-data"
                "catch" "throw" "unwind-protect" "condition-case"))

(regexp-opt strings) ->
"\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n]\\|save-\\(?:current-buffer\\|excursion\\|match-data\\|\\(?:restrict\\|window-excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\\)\\)"

(regexp-opt2 strings) ->
"\\(?:c\\(?:atch\\|ond\\(?:ition-case\\)?\\)\\|if\\|let\\*?\\|prog[12n]\\|save-\\(?:current-buffer\\|match-data\\|\\(?:restrict\\|\\(?:window-\\)?excurs\\)ion\\)\\|throw\\|un\\(?:less\\|wind-protect\\)\\|wh\\(?:en\\|ile\\)\\)"

The difference is that FA algorithm

BUT my version takes 70 times more time to compute.

Example 2:
(setq strings2 '("car" "cdr"
                 "caar" "cadr" "cdar" "cddr"
                 "caaar" "caadr" "cadar" "caddr"
                 "cdaar" "cdadr" "cddar" "cdddr"))

(regexp-opt strings2) ->
"\\(?:c\\(?:\\(?:a\\(?:a[ad]\\|d[ad]\\|[ad]\\)\\|d\\(?:a[ad]\\|d[ad]\\|[ad]\\)\\|[ad]\\)r\\)\\)"

(regexp-opt2 strings2) ->
"\\(?:c[ad]\\(?:[ad][ad]?\\)?r\\)"

FA is 7 times slower here.

If this implementation is useful, I would like very much to contribute it.

I actually have the other implementation from previous idea. It is
faster than FA, same complexity of current regexp-opt, a bit slower of
course, but I like this better.

Best regards

Miguel Frasson

--
Miguel Vinicius Santini Frasson
address@hidden

reply via email to

[Prev in Thread] Current Thread [Next in Thread]