Re: Oddity in regular expressions
From: Paul Eggert
Subject: Re: Oddity in regular expressions
Date: Fri, 13 Apr 2001 13:37:25 -0700 (PDT)
> From: Andrew Koenig <address@hidden>
> Date: Fri, 13 Apr 2001 10:20:29 -0400 (EDT)
>
> BEGIN {
> s = "[::]"
> print "Before: " s
> gsub(/\[/, "\\\\&\\f(CW", s)
> print "After: " s
> }
>
> Running under GNU Awk 3.0.6 yields this output:
>
> Before: [::]
> After: \[\f(CW::]
>
> Running under AT&T awk yields this output
>
> Before: [::]
> After: \&\f(CW::]
My reading of POSIX.2-1992 section 4.1.7.6.2.2, pages 178-179, lines
647-653, is that it requires the gawk-style behavior, and that AT&T awk
does not conform.
For what it's worth, Solaris 8 /usr/xpg4/bin/awk agrees with gawk
whereas /usr/bin/nawk agrees with AT&T awk, so perhaps Sun has noticed
the incompatibility and has decided that POSIX does indeed require the
gawk-style behavior. Also, mawk agrees with gawk.
HOWEVER......
The latest draft of the next POSIX revision agrees with AT&T awk!
It says:
    sub(ere, repl[, in ])
        Substitute the string repl in place of the first instance of
        the extended regular expression ERE in string in and return
        the number of substitutions. An ampersand ('&') appearing in
        the string repl shall be replaced by the string from in that
        matches the ERE. An ampersand preceded with a backslash ('\')
        shall be interpreted as the literal ampersand character. Any
        other occurrence of a backslash (for example, preceding any
        other character) shall be treated as a literal backslash
        character....
So the gawk behavior, while required by the current POSIX standard, will be
disallowed by the next standard (assuming the draft is approved as-is).
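
To make the draft wording quoted above concrete, here is a minimal sketch in Python (not awk, so that the result does not depend on which awk is installed). It models only the three rules in the quoted paragraph; the function name expand_repl and its arguments are hypothetical, and the input is the replacement string as gsub would receive it, that is, after awk's lexical analysis has already turned the literal "\\\\&\\f(CW" into \\&\f(CW.

```python
def expand_repl(repl, matched):
    """Expand a sub/gsub replacement string per the quoted draft wording.

    Assumption: this models only the three rules in the quoted text and is
    not taken from any awk implementation.
    """
    out = []
    i = 0
    while i < len(repl):
        ch = repl[i]
        if ch == "\\" and i + 1 < len(repl) and repl[i + 1] == "&":
            out.append("&")        # backslash-ampersand: a literal ampersand
            i += 2
        elif ch == "&":
            out.append(matched)    # bare ampersand: the text that matched
            i += 1
        else:
            out.append(ch)         # anything else, including '\', is literal
            i += 1
    return "".join(out)


# The replacement from the original report, as gsub sees it, with "[" matched.
print(expand_repl(r"\\&\f(CW", "[") + "::]")   # prints \&\f(CW::]
```

Under these assumptions the result is the same as the AT&T awk output quoted at the top, which is consistent with the draft siding with AT&T awk.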
And the draft rationale says:
    In sub and gsub, if repl is a string literal (the lexical
    token STRING), then two consecutive backslash characters
    should be used in the string to ensure a single backslash will
    precede the ampersand when the resultant string is passed to
    the function. (For example, to specify one literal ampersand
    in the replacement string, use gsub(ERE, "\\&").)

    Historically the only special character in the repl argument
    of sub and gsub string functions was the ampersand ('&')
    character and preceding it with the backslash character was
    used to turn off its special meaning.

    The description in the ISO POSIX-2: 1993 standard introduced
    behavior such that the backslash character was another special
    character and it was unspecified whether there were any other
    special characters. This description introduced several
    portability problems, some of which are described below, and
    so it has been replaced with the more historical
    description. Some of the problems include:

    * Historically, to create the replacement string, a script
      could use gsub(ERE, "\\&"), but with the ISO POSIX-2: 1993
      standard wording, it was necessary to use gsub(ERE,
      "\\\\&"). Backslash characters are doubled here because
      all string literals are subject to lexical analysis, which
      would reduce each pair of backslash characters to a single
      backslash before being passed to gsub.

    * Since it was unspecified what the special characters were,
      for portable scripts to guarantee that characters are
      printed literally, each character had to be preceded with a
      backslash. (For example, a portable script had to use
      gsub(ERE, "\\h\\i") to produce a replacement string of
      "hi".)
(Confused enough yet? I am. :-)
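
For anyone untangling the doubled backslashes in the rationale, the lexical-analysis step it mentions can be sketched separately. This is a Python model under a stated assumption: it handles only the one reduction the rationale describes (each pair of backslashes in a string literal becomes a single backslash) and ignores every other awk escape sequence. The name lex_string_literal is hypothetical.

```python
def lex_string_literal(body):
    """Reduce each pair of backslashes to one, as the rationale describes.

    Assumption: 'body' is the text between the double quotes of an awk
    string literal; other escapes such as \\n or \\" are deliberately
    not handled in this sketch.
    """
    out = []
    i = 0
    while i < len(body):
        if body.startswith("\\\\", i):   # two consecutive backslashes
            out.append("\\")             # become a single backslash
            i += 2
        else:
            out.append(body[i])
            i += 1
    return "".join(out)


# What gsub receives for the literals mentioned in the rationale.
for literal in (r"\\&", r"\\\\&", r"\\h\\i"):
    print(literal, "->", lex_string_literal(literal))
```

So the literal "\\&" reaches gsub as \&, "\\\\&" reaches it as \\&, and "\\h\\i" reaches it as \h\i; what gsub then does with those backslashes is exactly the point on which the two standards differ.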