Re: bug introduced in gawk 3.1.1, still in 3.1.3
From: Aharon Robbins
Subject: Re: bug introduced in gawk 3.1.1, still in 3.1.3
Date: Thu, 29 Jan 2004 17:15:02 +0200
Greetings. Re this:
> To: address@hidden
> Subject: bug introduced in gawk 3.1.1, still in 3.1.3
> Date: Wed, 28 Jan 2004 16:23:30 -0700 (MST)
> From: address@hidden (Bill Bruno)
>
> Here's the bug:
>
> cpg[95]% ./gawk '{sub(/[a-z]/,"&"); print}'
> aaa
> &aa
>
> I get this in 3.1.1 and 3.1.3. In 3.1.0 I get the correct
> behavior:
>
> motif[5]% gawk '{sub ( /[a-z]/, "&"); print}'
> aaa
> aaa
>
> I find the same problem in gsub, where it is more relevant
> because this command can be used to count occurrences of
> a regexp without changing the string. If there is a more
> standard way to do that, please tell me.
>
> I guess the work around is to duplicate the string first...
> Bill
As I said in my earlier mail, this is related to the locale in use.
With LC_ALL=C, it doesn't happen. The fix is included below. For free,
you get a bonus bug fix: with --posix gawk will now follow the 2001
POSIX standard for sub and gsub. Thank you for shopping at gnu.org. (:-)
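For reference, the 2001 POSIX rules for the replacement string are short: a bare `&' stands for the matched text, `\&' is a literal ampersand, `\\' is a single backslash, and any other backslash is kept as is. Here is a minimal, stand-alone sketch of those rules. It is not the code from the patch: posix_expand() and its parameters are hypothetical names made up for illustration, and it ignores multibyte locales entirely.

```c
#include <assert.h>
#include <stddef.h>
#include <string.h>

/*
 * Hypothetical helper, NOT gawk's sub_common(): expand the replacement
 * string `repl' for a single match, following the 2001 POSIX rules:
 *
 *	&	--> the matched text
 *	\&	--> a literal &
 *	\\	--> a single literal backslash
 *	\q	--> \q (any other backslash is left alone)
 *
 * Single-byte characters only. Writes a NUL-terminated result into
 * `out'. Returns 0 on success, -1 if `out' is too small.
 */
static int
posix_expand(const char *repl, const char *match, char *out, size_t outsize)
{
	size_t mlen, n = 0;
	const char *scan;

	assert(repl != NULL && match != NULL && out != NULL);

	if (outsize == 0)
		return -1;

	mlen = strlen(match);
	for (scan = repl; *scan != '\0'; scan++) {
		if (*scan == '&') {
			/* unescaped ampersand: copy in the matched text */
			if (n + mlen >= outsize)
				return -1;
			memcpy(out + n, match, mlen);
			n += mlen;
			continue;
		}
		if (*scan == '\\' && (scan[1] == '&' || scan[1] == '\\'))
			scan++;		/* drop the escaping backslash */
		/* everything else, including a lone backslash, is copied */
		if (n + 1 >= outsize)
			return -1;
		out[n++] = *scan;
	}
	out[n] = '\0';
	return 0;
}
```

With a match of "aaa", a replacement of `[&]' comes out as `[aaa]', `\&' comes out as `&', and `\q' stays `\q'. Note that these are the characters sub() sees after awk's own string-literal processing, so in an awk program you would usually need to double the backslashes.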
Enjoy.
Arnold
-------------------------------------------
Thu Jan 29 17:04:51 2004 Arnold D. Robbins <address@hidden>
* builtin.c (sub_common): Fix logic for `&' in replacement for
multibyte case. Simplify code a bit.
Sun Jan 18 12:01:29 2004 Arnold D. Robbins <address@hidden>
* builtin.c (sub_common): Add comment and support for 2001 POSIX
behavior when --posix in effect.
--- ../gawk-3.1.3/builtin.c 2003-07-07 01:08:08.000000000 +0300
+++ builtin.c 2004-01-29 17:04:28.000000000 +0200
@@ -1956,6 +2001,33 @@
*/
/*
+ * 1/2004: The gawk sub/gsub behavior dates from 1996, when we proposed it
+ * for POSIX. The proposal fell through the cracks, and the 2001 POSIX
+ * standard chose a more simple behavior.
+ *
+ * The relevant text is to be found on lines 6394-6407 (pages 166, 167) of the
+ * 2001 standard:
+ *
+ * sub(ere, repl[, in ])
+ * Substitute the string repl in place of the first instance of the extended regular
+ * expression ERE in string in and return the number of substitutions. An ampersand
+ * ('&') appearing in the string repl shall be replaced by the string from in that
+ * matches the ERE. An ampersand preceded with a backslash ('\') shall be
+ * interpreted as the literal ampersand character. An occurrence of two consecutive
+ * backslashes shall be interpreted as just a single literal backslash character. Any
+ * other occurrence of a backslash (for example, preceding any other character) shall
+ * be treated as a literal backslash character. Note that if repl is a string literal (the
+ * lexical token STRING; see Grammar (on page 170)), the handling of the
+ * ampersand character occurs after any lexical processing, including any lexical
+ * backslash escape sequence processing. If in is specified and it is not an lvalue (see
+ * Expressions in awk (on page 156)), the behavior is undefined. If in is omitted, awk
+ * shall use the current record ($0) in its place.
+ *
+ * Because gawk has had its behavior for 7+ years, that behavior is remaining as
+ * the default, with the POSIX behavior available for do_posix. Fun, fun, fun.
+ */
+
+/*
* NB: `howmany' conflicts with a SunOS 4.x macro in <sys/param.h>.
*/
@@ -2068,7 +2140,15 @@
repllen--;
scan++;
}
- } else { /* (proposed) posix '96 mode */
+ } else if (do_posix) {
+ /* \& --> &, \\ --> \ */
+ if (scan[1] == '&' || scan[1] == '\\') {
+ repllen--;
+ scan++;
+ } /* else
+ leave alone, it goes into the output */
+ } else {
+ /* gawk default behavior since 1996 */
if (strncmp(scan, "\\\\\\&", 4) == 0) {
/* \\\& --> \& */
repllen -= 2;
@@ -2130,22 +2210,24 @@
* making substitutions as we go.
*/
for (scan = repl; scan < replend; scan++)
+ if (*scan == '&'
#ifdef MBS_SUPPORT
- if ((gawk_mb_cur_max == 1
- || (repllen > 0 && mb_indices[scan - repl] == 1))
- && (*scan == '&'))
-#else
- if (*scan == '&')
+ /*
+ * Don't test repllen here. A simple "&" could
+ * end up with repllen == 0.
+ */
+ && (gawk_mb_cur_max == 1
+ || mb_indices[scan - repl] == 1)
#endif
+ ) {
for (cp = matchstart; cp < matchend; cp++)
*bp++ = *cp;
+ } else if (*scan == '\\'
#ifdef MBS_SUPPORT
- else if ((gawk_mb_cur_max == 1
+ && (gawk_mb_cur_max == 1
|| (repllen > 0 && mb_indices[scan - repl] == 1))
- && (*scan == '\\')) {
-#else
- else if (*scan == '\\') {
#endif
+ ) {
if (backdigs) { /* gensub, behave sanely */
if (ISDIGIT(scan[1])) {
int dig = scan[1] - '0';
@@ -2161,7 +2243,13 @@
scan++;
} else /* \q for any q --> q */
*bp++ = *++scan;
- } else { /* posix '96 mode, bleah */
+ } else if (do_posix) {
+ /* \& --> &, \\ --> \ */
+ if (scan[1] == '&' || scan[1] == '\\')
+ scan++;
+ *bp++ = *scan;
+ } else {
+ /* gawk default behavior since 1996 */
if (strncmp(scan, "\\\\\\&", 4) == 0) {
/* \\\& --> \& */
*bp++ = '\\';