bug-gnu-emacs

bug#64128: regexp parser zero-width assertion bugs


From: Mattias Engdegård
Subject: bug#64128: regexp parser zero-width assertion bugs
Date: Sat, 17 Jun 2023 14:20:27 +0200

In Emacs regexps, some but not all zero-width assertions have the special 
property that they are not treated as an element by an immediately 
following ?, * or +. For example,

  \b*

matches a literal asterisk at a word boundary -- the `*` becomes literal 
because it is treated as if there were nothing for it to act upon. Even 
stranger:

  xy\b*

is parsed as, in rx syntax, (* "xy" word-boundary) which is remarkable: the 
repetition operator encompasses several elements even though there are no 
brackets given. Demo:

(and (string-match "quack,\\b*" "quack,quack,quack,quaaaack!")
     (match-data))
=> (0 18)
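
For comparison, here is a small sketch that spells out the two candidate readings of that pattern by hand, so the result does not depend on how a particular Emacs version parses the ambiguous form. The function name bug64128-quack-demo is made up for this illustration, and the sketch assumes a syntax table in which letters are word constituents and the comma is not.

```elisp
;; Hypothetical helper, for illustration only (not part of Emacs).
(defun bug64128-quack-demo ()
  "Return results for two explicit readings of \"quack,\\\\b*\"."
  (let ((s "quack,quack,quack,quaaaack!"))
    (list
     ;; Explicit shy group: the repetition covers "quack," plus the
     ;; word boundary, which is the parse described above.
     (and (string-match "\\(?:quack,\\b\\)*" s) (match-data))
     ;; Escaped asterisk: the literal reading.  S contains no
     ;; asterisk, so this cannot match.
     (string-match "quack,\\b\\*" s))))

(bug64128-quack-demo)
;; => ((0 18) nil)
```

The explicitly grouped form reproduces the (0 18) result of the demo, while the literal reading finds no match at all.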

Zero-width assertions that have the property:
^ (bol), $ (eol), \` (bos), \' (eos), \b (word-boundary), \B (not-word-boundary)

Zero-width assertions that do not have the property (and are treated as any 
other element):
\< (bow), \> (eow), \_< (symbol-start), \_> (symbol-end), \= (point)

Such regexp patterns should be very rare in practice and are almost certainly 
always a mistake, but it would be nice if they behaved in a way that makes 
some kind of sense.

A modest improvement would be to make operators become literal after any 
zero-width assertion, so that

  \<*

becomes (: word-start "*") instead of (* word-start), and

  xy\b*

becomes (: "xy" word-boundary "*") instead of (* "xy" word-boundary).
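
To make the difference between the two readings of xy\b* concrete, the following sketch builds both with rx and matches them against the string "xy*". The helper bug64128-match-end is a made-up name for this illustration; the sketch again assumes a syntax table in which letters are word constituents and the asterisk is not.

```elisp
;; Hypothetical helper, for illustration only (not part of Emacs):
;; return the end of the match of REGEXP against "xy*", or nil.
(defun bug64128-match-end (regexp)
  (and (string-match regexp "xy*") (match-end 0)))

;; Proposed reading: "xy", a word boundary, then a literal asterisk.
(bug64128-match-end (rx "xy" word-boundary "*"))   ; => 3

;; Current reading: zero or more repetitions of "xy" + word boundary.
(bug64128-match-end (rx (* "xy" word-boundary)))   ; => 2
```

Under the proposed reading the asterisk in the subject string is consumed; under the current reading the match stops after "xy".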

Suggested patch attached.

Attachment: regexp-zero-width-assertion-bug.diff
Description: Binary data

