grep branch, master, updated. v2.16-7-g1078b64
From: Paul Eggert
Subject: grep branch, master, updated. v2.16-7-g1078b64
Date: Fri, 17 Jan 2014 22:32:44 +0000
This is an automated email from the git hooks/post-receive script. It was
generated because a ref change was pushed to the repository containing
the project "grep".
The branch, master has been updated
via 1078b64302bbf5c0a46635772808ff7f75171dbc (commit)
via 45284e38cfb07343ab50d20b116375c8a1d64196 (commit)
from 97d3430c75a9dd82d871eca170b13c1f8d895fad (commit)
Those revisions listed above that are new to this repository have
not appeared on any other notification email; so we list those
revisions in full, below.
- Log -----------------------------------------------------------------
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=1078b64302bbf5c0a46635772808ff7f75171dbc
commit 1078b64302bbf5c0a46635772808ff7f75171dbc
Author: Paul Eggert <address@hidden>
Date: Fri Jan 17 14:32:10 2014 -0800
grep: DFA now uses rational ranges in unibyte locales
Problem reported by Aharon Robbins in <http://bugs.gnu.org/16481>.
* NEWS:
* doc/grep.texi (Environment Variables)
(Character Classes and Bracket Expressions):
Document this.
* src/dfa.c (parse_bracket_exp): Treat unibyte locales like multibyte.
diff --git a/NEWS b/NEWS
index 6e46684..589b2ac 100644
--- a/NEWS
+++ b/NEWS
@@ -7,6 +7,14 @@ GNU grep NEWS -*- outline -*-
grep -i in a multibyte locale is now typically 10 times faster
for patterns that do not contain \ or [.
+ Range expressions in unibyte locales now ordinarily use the rational
+ range interpretation, in which [a-z] matches only lower-case ASCII
+ letters regardless of locale, and similarly for other ranges. (This
+ was already true for multibyte locales.) Portable programs should
+ continue to specify the C locale when using range expressions, since
+ these expressions have unspecified behavior in non-GNU systems and
+ are not yet guaranteed to use the rational range interpretation even
+ in GNU systems.
* Noteworthy changes in release 2.16 (2014-01-01) [stable]
diff --git a/doc/grep.texi b/doc/grep.texi
index 473a181..42fb9a2 100644
--- a/doc/grep.texi
+++ b/doc/grep.texi
@@ -960,8 +960,8 @@ They are omitted (i.e., false) by default and become true when specified.
@cindex national language support
@cindex NLS
These variables specify the locale for the @code{LC_COLLATE} category,
-which determines the collating sequence
-used to interpret range expressions like @samp{[a-z]}.
+which might affect how range expressions like @samp{[a-z]} are
+interpreted.
@item LC_ALL
@itemx LC_CTYPE
@@ -1223,14 +1223,13 @@ For example, the regular expression
Within a bracket expression, a @dfn{range expression} consists of two
characters separated by a hyphen.
It matches any single character that
-sorts between the two characters, inclusive, using the locale's
-collating sequence and character set.
-For example, in the default C
-locale, @samp{[a-d]} is equivalent to @samp{[abcd]}.
-Many locales sort
-characters in dictionary order, and in these locales @samp{[a-d]} is
-typically not equivalent to @samp{[abcd]};
-it might be equivalent to @samp{[aBbCcDd]}, for example.
+sorts between the two characters, inclusive.
+In the default C locale, the sorting sequence is the native character
+order; for example, @samp{[a-d]} is equivalent to @samp{[abcd]}.
+In other locales, the sorting sequence is not specified, and
+@samp{[a-d]} might be equivalent to @samp{[abcd]} or to
+@samp{[aBbCcDd]}, or it might fail to match any character, or the set of
+characters that it matches might even be erratic.
To obtain the traditional interpretation
of bracket expressions, you can use the @samp{C} locale by setting the
@env{LC_ALL} environment variable to the value @samp{C}.
diff --git a/src/dfa.c b/src/dfa.c
index 6ab4e05..5e3140d 100644
--- a/src/dfa.c
+++ b/src/dfa.c
@@ -1108,30 +1108,14 @@ parse_bracket_exp (void)
}
else
{
- /* Defer to the system regex library about the meaning
- of range expressions. */
- regex_t re;
- char pattern[6] = { '[', 0, '-', 0, ']', 0 };
- char subject[2] = { 0, 0 };
c1 = c;
if (case_fold)
{
c1 = tolower (c1);
c2 = tolower (c2);
}
-
- pattern[1] = c1;
- pattern[3] = c2;
- regcomp (&re, pattern, REG_NOSUB);
- for (c = 0; c < NOTCHAR; ++c)
- {
- if ((case_fold && isupper (c)))
- continue;
- subject[0] = c;
- if (regexec (&re, subject, 0, NULL, 0) != REG_NOMATCH)
- setbit_case_fold_c (c, ccl);
- }
- regfree (&re);
+ for (c = c1; c <= c2; c++)
+ setbit_case_fold_c (c, ccl);
}
colon_warning_state |= 8;
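
For readers who want to see the effect of the src/dfa.c hunk above in
isolation, here is a minimal sketch of a "rational range": every byte
value from c1 through c2 is added to the set, with no call into the
system regex library. It is not the code from dfa.c. The names
byteset, byteset_add, byteset_has, add_with_case_fold,
add_rational_range, self_check and NCHARS are hypothetical stand-ins
chosen for this illustration, and the case folding shown here is only
an assumption about what setbit_case_fold_c does.

```c
#include <assert.h>
#include <ctype.h>
#include <limits.h>
#include <stdbool.h>
#include <string.h>

/* Hypothetical stand-in for the number of distinct byte values.  */
enum { NCHARS = UCHAR_MAX + 1 };

/* A simple bit set with one bit per byte value.  */
typedef struct
{
  unsigned char bits[(NCHARS + CHAR_BIT - 1) / CHAR_BIT];
} byteset;

static void
byteset_add (byteset *s, int c)
{
  s->bits[c / CHAR_BIT] |= (unsigned char) (1u << (c % CHAR_BIT));
}

static bool
byteset_has (const byteset *s, int c)
{
  return (s->bits[c / CHAR_BIT] >> (c % CHAR_BIT)) & 1;
}

/* Add C, and when CASE_FOLD is set also its other-case forms
   (an assumption about the real helper's behavior).  */
static void
add_with_case_fold (byteset *s, int c, bool case_fold)
{
  byteset_add (s, c);
  if (case_fold)
    {
      byteset_add (s, tolower (c));
      byteset_add (s, toupper (c));
    }
}

/* Rational range: add every byte value from C1 through C2 in native
   character order.  An inverted range (C1 > C2) adds nothing.  */
static void
add_rational_range (byteset *s, int c1, int c2, bool case_fold)
{
  for (int c = c1; c <= c2; c++)
    add_with_case_fold (s, c, case_fold);
}

/* Usage example: [a-d] behaves like [abcd].  */
static void
self_check (void)
{
  byteset s;
  memset (&s, 0, sizeof s);
  add_rational_range (&s, 'a', 'd', false);
  assert (byteset_has (&s, 'a') && byteset_has (&s, 'd'));
  assert (!byteset_has (&s, 'e') && !byteset_has (&s, 'B'));
}
```

Because the loop depends only on the byte values, the result is the
same whatever LC_COLLATE says, which is the property the NEWS entry
describes for [a-z].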
http://git.savannah.gnu.org/cgit/grep.git/commit/?id=45284e38cfb07343ab50d20b116375c8a1d64196
-----------------------------------------------------------------------
Summary of changes:
NEWS | 8 ++++++++
doc/grep.texi | 19 +++++++++----------
src/dfa.c | 20 ++------------------
src/grep.c | 14 ++++++++++++++
4 files changed, 33 insertions(+), 28 deletions(-)
hooks/post-receive
--
grep