bug-gnu-emacs

bug#37659: rx additions: anychar, unmatchable, unordered-or


From: Mattias Engdegård
Subject: bug#37659: rx additions: anychar, unmatchable, unordered-or
Date: Thu, 24 Oct 2019 10:58:43 +0200

On 24 Oct 2019, at 01:14, Paul Eggert <eggert@cs.ucla.edu> wrote:
> 
>> how do we make it easy to match one of multiple strings --- keywords, say 
>> --- in rx?
> 
> If that's the real problem, perhaps the name should be "or-tokens" or 
> something like that, to help remind the reader of the limitations of the 
> proposed operator: it's meant only for greedy tokenization and it isn't 
> suited for regular expressions in general. A problem with the name "or-max" 
> is that it implies a more-general functionality than the implementation 
> really has.

'or-strings' then perhaps, since there is nothing really restricting it to 
'tokens' (which is a bit hazardous terminology given that regexps are commonly 
used for tokenising). In particular, there is no delimiting; (or-max "IN" 
"OUT") will match the first part of "INSPECT", which may be unexpected of 
something ostensibly matching tokens.
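
To make the lack of delimiting concrete, here is a minimal sketch. It uses Python's `re` module purely as a stand-in, on the assumption that a plain alternation of two literal strings behaves the same way there as in an Emacs regexp. The `\b` word boundary is just one way to add delimiting by hand; it is not something the patch does.

```python
import re

# Plain alternation of two keywords, standing in for (or-max "IN" "OUT").
keywords = re.compile(r"IN|OUT")

# Nothing anchors the end of the keyword, so the start of a longer word matches.
prefix = keywords.match("INSPECT")
print(prefix.group(0))

# An explicit word boundary after the alternation rejects the longer word...
delimited = re.compile(r"(?:IN|OUT)\b")
print(delimited.match("INSPECT"))

# ...while still accepting the keyword when it stands alone.
print(delimited.match("IN x").group(0))
```

So whoever wants token-like behaviour still has to supply the delimiters around the alternation.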

On the other hand, 'or-strings' sort of precludes a future relaxation of the 
argument restriction.

> What happens if you apply or-tokens to arguments that aren't strings or other 
> or-tokens? Does rx diagnose this? I hope it does.

Yes, of course. Working patch attached (it still uses the name 'or-max').

'or-max' isn't a vital addition; it just seemed to fill a gap, after experience 
with traditional regexp usage. It clearly shouldn't be added on a whim. I 
wanted to get it in place for 27.1, but such a version rush has rarely resulted 
in good design.

> I was thinking of something more-compatible: we could say that \| is 
> left-to-right (for users who need compatibility with regexp "|"), and that 
> 'or' is not necessarily left-to-right (to make room for future extensions 
> that make 'or' greedy, or more efficient, or both).

Sorry, by '\|' I meant the string regexp operator; I take it you propose 
separate semantics for the rx '|' and 'or' operators? Maybe we should worry 
about that if we ever get near the point of replacing the engine. There are 
other concerns, such as how capture groups are set (even if two branches match 
equally long texts).

I honestly don't think much would break if '\|' (in string regexps) became 
greedy overnight, but there is plenty of room to confuse the user if we 
introduce subtle distinctions between what have hitherto been perceived as 
synonyms.
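
For reference, a small sketch of the difference under discussion, again in Python, whose `|` is ordered (first matching branch wins) in the same way as the current string-regexp '\|'. The helper `longest_first` is hypothetical and only emulates longest-match for literal strings by sorting them; it is not how the attached patch is implemented.

```python
import re

def longest_first(*strings):
    """Hypothetical helper: alternation over literal strings, longest first."""
    ordered = sorted(strings, key=len, reverse=True)
    return "(?:" + "|".join(re.escape(s) for s in ordered) + ")"

# Ordered alternation: "a" is tried first and wins, although "ab" also matches.
first = re.match(r"a|ab", "ab").group(0)

# Trying the longest literal first gives the longest match at this position.
longest = re.match(longest_first("a", "ab"), "ab").group(0)

# Two branches matching equally long text differ only in which group is set.
groups = re.match(r"(a)|(a)", "a").groups()

print(first, longest, groups)
```

The last case is the capture-group concern: both branches match the same text, so only the order of the branches decides which group ends up set.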

Attachment: 0003-Add-the-rx-or-max-operator.patch
Description: Binary data

