bug-gnu-emacs

bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)

From:	Mattias Engdegård
Subject:	bug#63225: Compiling regexp patterns (and REGEXP_CACHE_SIZE in search.c)
Date:	Sun, 7 May 2023 12:32:52 +0200

On 6 May 2023, at 15:38, Ihor Radchenko <yantar92@posteo.net> wrote:

> I may, but it will be even more complex regexp. Currently, ordinary
> drawers have somewhat complex :BEGIN: line, because they can have any
> word there, while property drawers require very complex match for the
> lines inside. Also, property drawers only occur right after headings, as
> marked by appropriate parser flag. So, matching property drawers mostly
> happens where they are supposed to be. If we try to match ordinary
> drawers at the same time, it will actually be slower in practice.

What I meant was that the consolidated root regexp could just match the initial 
:BEGIN: line and then dispatch to different branches for parsers specific to 
the drawer type. That would reduce complexity and time spent at the critical 
parser root.
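
A rough sketch of that idea, for illustration only: the names below are hypothetical and the pattern is a simplified assumption, not Org's actual drawer syntax. One cheap regexp recognises the opening line, and the captured name selects the branch.

```elisp
;; Hypothetical, simplified pattern: optional indentation, then :NAME:
;; alone on the line.  Not Org's real drawer regexp.
(defconst my-drawer-begin-re
  (rx bol (* (in " \t"))
      ":" (group (+ (in "_-" word))) ":"
      (* (in " \t")) eol))

(defun my-classify-drawer-line (line)
  "Return `property-drawer', `drawer', or nil for the string LINE."
  ;; Bind `case-fold-search' explicitly so the match does not depend
  ;; on the caller's setting.
  (let ((case-fold-search nil))
    (when (string-match my-drawer-begin-re line)
      ;; Dispatch on the captured name; a real parser would call the
      ;; drawer-specific parser here instead of returning a symbol.
      (if (string= (match-string 1 line) "PROPERTIES")
          'property-drawer
        'drawer))))
```

The expensive matching of the drawer contents would then happen only in the branch that needs it.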

> This will account for Org syntax change, so no.

Don't dismiss it out of hand. I'm not trying to optimise a few regexps, but to 
use examples to illustrate some useful principles that would help you improve 
many of them yourself.

When matching something terminated by a specific character, it's particularly 
useful if the regexp engine can be made to understand that the terminator 
doesn't occur in what precedes it, as that enables it to omit backtracking 
points. For example, in "a*b", the engine doesn't need to save backtracking 
points for each 'a' matched since the sets {a} and {b} are obviously disjoint.
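
To make the principle concrete, here is a small sketch with two hypothetical patterns (neither is taken from Org). In the first, the repeated set excludes the terminator, so the two sets are disjoint by construction; in the second they overlap.

```elisp
;; Hypothetical patterns for illustration only.

;; The repeated set excludes the terminator ":", so the set being
;; repeated and the set {":"} are disjoint.
(defconst my-key-re-disjoint
  (rx (group (* (not (any ":\n")))) ":"))

;; Here the repeated set (any character except newline) also contains
;; ":", so the engine has to be able to back up to find a ":".
(defconst my-key-re-overlapping
  (rx (group (* nonl)) ":"))

(defun my-key-of (regexp string)
  "Return the text of group 1 when REGEXP matches STRING, else nil."
  (and (string-match regexp string)
       (match-string 1 string)))
```

Note that the two patterns are not equivalent: on "key: a:b" the first captures "key", while the second captures "key: a" because the greedy repetition runs to the end of the line and then backs up to the last colon.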

In this case, the

   (group (+ (| wordchar (in "_-"))))

part is unnecessarily slow because it's an or-pattern, which also inhibits that 
optimisation. Fortunately it can easily be rewritten as

   (group (+ (in "_-" word)))

which solves both problems.
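
A quick way to convince yourself that the two forms accept the same text is to compare them on a sample string. This is only a sketch: the `my-` names and the sample input are made up, and the check runs in a temporary buffer so that the standard syntax table applies.

```elisp
;; The original form: an or-pattern inside the repetition.
(defconst my-name-re-or
  (rx (group (+ (| wordchar (in "_-"))))))

;; The rewritten form: a single character set.
(defconst my-name-re-set
  (rx (group (+ (in "_-" word)))))

(defun my-first-group (regexp string)
  "Return group 1 of REGEXP matched against STRING, or nil.
The match runs in a temporary buffer, so word syntax is taken from
the standard syntax table rather than from the caller's buffer."
  (with-temp-buffer
    (and (string-match regexp string)
         (match-string 1 string))))
```

Both `(my-first-group my-name-re-or "foo_bar-baz: x")` and the same call with `my-name-re-set` capture the text before the colon.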

> Slight improvement in performance cannot justify syntax changes.

Always question your assumptions. A slight change of spec may not be so bad 
after all if it buys speed and/or improves our understanding of the code. Do 
you know what characters have 'word' syntax in org-mode? If not, better be 
careful before using them in regexps.

(Looks like org-tags-expand permanently adds @ and _ to the set of word chars. 
A bug, surely?)
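
The following sketch shows why this matters: whether a character counts as a word character depends entirely on the syntax table in effect. The helper names are hypothetical, and the modified table only imitates the effect described above; it is not what org-tags-expand actually does internally.

```elisp
(defun my-word-syntax-p (char table)
  "Return non-nil if CHAR has word syntax in syntax TABLE."
  (with-syntax-table table
    (eq (char-syntax char) ?w)))

(defun my-table-with-word-chars (chars)
  "Return a copy of the standard syntax table where CHARS are word chars.
The standard syntax table itself is left unmodified."
  (let ((table (copy-syntax-table (standard-syntax-table))))
    (dolist (c chars table)
      (modify-syntax-entry c "w" table))))
```

With the standard table, `(my-word-syntax-p ?_ (standard-syntax-table))` is nil, but it becomes non-nil with `(my-table-with-word-chars '(?@ ?_))`, and any regexp using `word` or `\w` changes meaning accordingly.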

> (defvar org--item-re-cache nil
>  "Results cache for `org-item-re'.")
> (defsubst org-item-re ()
>  "Return the correct regular expression for plain lists."
>  (or (plist-get
>       org--item-re-cache
>       (cons org-list-allow-alphabetical
>             org-plain-list-ordered-item-terminator)
>       #'equal)
>      ...))
> 
> It should not give much overhead.

Maybe, but you still cons each time. (And remember that the plist-get equality 
funarg is new in Emacs 29.)
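
One possible way to avoid both the allocation and the Emacs 29 dependency is a single-entry cache compared with `eq`. This is only a sketch under assumptions: the names are hypothetical, BUILD stands in for the elided regexp construction, and it assumes both settings are symbols or characters so that `eq` is a sufficient comparison.

```elisp
(defvar my-item-re--cache nil
  "Either nil or a vector [ALPHABETICAL TERMINATOR REGEXP].")

(defun my-item-re (alphabetical terminator build)
  "Return the regexp for ALPHABETICAL and TERMINATOR.
BUILD is a hypothetical function of two arguments that constructs the
regexp; it is called only when the cached entry does not match."
  (let ((cache my-item-re--cache))
    (if (and cache
             (eq (aref cache 0) alphabetical)
             (eq (aref cache 1) terminator))
        ;; Hit: nothing is allocated on this path.
        (aref cache 2)
      ;; Miss: build the regexp and remember it with its key values.
      (let ((regexp (funcall build alphabetical terminator)))
        (setq my-item-re--cache
              (vector alphabetical terminator regexp))
        regexp))))
```

A single entry is enough as long as the two settings rarely change; if they alternate frequently, a small alist of such entries would be the next step.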

> A larger number of regexps is matched in the individual
> element parsers. They just don't contribute as much as
> `org-element--current-element' individually and thus do not show up high
> in the profiler.

Still, if called often enough they do outsized damage by evicting regexps used 
elsewhere.

Also make sure that a regexp used in multiple places is always matched with 
the same `case-fold-search` value; otherwise the occurrences will be considered 
different regexps for cache purposes.
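
One simple way to keep this consistent is to funnel every use of such a regexp through a helper that fixes the setting. The sketch below uses a hypothetical helper name and picks case-sensitive matching arbitrarily; the point is only that the value is the same at every call site.

```elisp
(defun my-looking-at-case-sensitive (regexp)
  "Like `looking-at' for REGEXP, but always with `case-fold-search' nil.
Using one helper everywhere keeps the setting identical across call
sites, whatever the caller has bound."
  (let ((case-fold-search nil))
    (looking-at regexp)))
```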

> [ now we are just 2x slower than tree-sitter rather than 2.5x :) ]

Progress!
