From 076ed98ff6d7debff3929beab048c8a90e48dbb8 Mon Sep 17 00:00:00 2001
From: Paul Eggert
Date: Tue, 2 Apr 2019 00:17:37 -0700
Subject: [PATCH] More regexp advice and clarifications
MIME-Version: 1.0
Content-Type: text/plain; charset=UTF-8
Content-Transfer-Encoding: 8bit

* doc/lispref/searching.texi (Regexp Special):
Simplify style advice for order of ], ^, and - in character
alternatives. Stick with saying that it’s not a good idea to put
‘-’ after a range. Remove the special case about raw 8-bit bytes
and unibyte characters, as this documentation is confusing and
seems to be incorrect in some cases. Say that z-a is the
preferred style for reversed ranges, since it’s clearer and is
typically what’s used in practice. Mention some bad styles:
duplicates in character alternatives, ranges that denote <=3
characters, and ‘-’ as the first character.
---
 doc/lispref/searching.texi | 52 +++++++++++++++++++++++---------------
 1 file changed, 31 insertions(+), 21 deletions(-)

diff --git a/doc/lispref/searching.texi b/doc/lispref/searching.texi
index 748ab586af..72ee9233a3 100644
--- a/doc/lispref/searching.texi
+++ b/doc/lispref/searching.texi
@@ -398,17 +398,11 @@ Regexp Special
 The usual regexp special characters are not special inside a
 character alternative. A completely different set of characters is
 special inside character alternatives: @samp{]}, @samp{-} and @samp{^}.
-
-To include a @samp{]} in a character alternative, you must make it the first
-character. For example, @samp{[]a]} matches @samp{]} or @samp{a}. To include
-a @samp{-}, write @samp{-} as the last character of the character alternative,
-tho you can also put it first or after a range. Thus, @samp{[]-]} matches both
-@samp{]} and @samp{-}. (As explained below, you cannot use @samp{\]} to
-include a @samp{]} inside a character alternative, since @samp{\} is not
-special there.)
-
-To include @samp{^} in a character alternative, put it anywhere but at
-the beginning.
+To include @samp{]} in a character alternative, put it at the
+beginning. To include @samp{^}, put it anywhere but at the beginning.
+To include @samp{-}, put it at the end. Thus, @samp{[]^-]} matches
+all three of these special characters. You cannot use @samp{\} to
+escape these three characters, since @samp{\} is not special here.
 
 The following aspects of ranges are specific to Emacs, in that POSIX
 allows but does not require this behavior and programs other than
@@ -426,17 +420,33 @@ Regexp Special
 outside the C or POSIX locale.
 
 @item
-As a special case, if either bound of a range is a raw 8-bit byte, the
-other bound should be a unibyte character, and the range matches only
-unibyte characters.
+If the lower bound of a range is greater than its upper bound, the
+range is empty and represents no characters. Thus, @samp{[z-a]}
+always fails to match, and @samp{[^z-a]} matches any character,
+including newline. However, a reversed range should always be from
+the letter @samp{z} to the letter @samp{a} to make it clear that it is
+not a typo; for example, @samp{[+-*/]} should be avoided, because it
+matches only @samp{/} rather than the likely-intended four characters.
+@end enumerate
+
+Some kinds of character alternatives are not the best style even
+though they are standardized by POSIX and are portable. They include:
 
+@enumerate
 @item
-If the lower bound of a range is greater than its upper bound, the
-range is empty and represents no characters. Thus, @samp{[b-a]}
-always fails to match, and @samp{[^b-a]} matches any character,
-including newline. However, the lower bound should be at most one
-greater than the upper bound; for example, @samp{[c-a]} should be
-avoided.
+A character alternative can include duplicates. For example,
+@samp{[XYa-yYb-zX]} is less clear than @samp{[XYa-z]}.
+
+@item
+A range can denote just one, two, or three characters.
+For example, @samp{[(-(]} is less clear than @samp{[(]}, @samp{[*-+]} is less clear
+than @samp{[*+]}, and @samp{[*-,]} is less clear than @samp{[*+,]}.
+
+@item
+A @samp{-} can also appear at the beginning of a character alternative, or
+as the upper bound of a range. For example, although @samp{[-a-z]} is
+valid, @samp{[a-z-]} is better style; and although @samp{[!--/]} is
+valid, @samp{[!-,/-]} is clearer.
 @end enumerate
 
 A character alternative can also specify named character classes
@@ -452,7 +462,7 @@ Regexp Special
 @cindex @samp{^} in regexp
 @samp{[^} begins a @dfn{complemented character alternative}. This
 matches any character except the ones specified. Thus,
-@samp{[^a-z0-9A-Z]} matches all characters @emph{except} letters and
+@samp{[^a-z0-9A-Z]} matches all characters @emph{except} ASCII letters and
 digits.
 
 @samp{^} is not special in a character alternative unless it is the first
-- 
2.17.1