bug-gnu-utils

Re: Oddity in regular expressions


From: Paul Eggert
Subject: Re: Oddity in regular expressions
Date: Fri, 13 Apr 2001 13:37:25 -0700 (PDT)

> From: Andrew Koenig <address@hidden>
> Date: Fri, 13 Apr 2001 10:20:29 -0400 (EDT)
> 
>        BEGIN {
>            s = "[::]"
>            print "Before: " s
>            gsub(/\[/, "\\\\&\\f(CW", s)
>            print "After: " s
>        }
> 
> Running under GNU Awk 3.0.6 yields this output:
> 
>        Before: [::]
>        After: \[\f(CW::]
> 
> Running under AT&T awk yields this output:
> 
>        Before: [::]
>        After: \&\f(CW::]

My reading of POSIX.2-1992 section 4.1.7.6.2.2, pages 178-179, lines
647-653, is that it requires the gawk-style behavior, and that AT&T awk
does not conform.

For what it's worth, Solaris 8 /usr/xpg4/bin/awk agrees with gawk
whereas /usr/bin/nawk agrees with AT&T awk, so perhaps Sun has noticed
the incompatibility and has decided that POSIX does indeed require the
gawk-style behavior.  Also, mawk agrees with gawk.


HOWEVER......

The latest draft of the next POSIX revision agrees with AT&T awk!
It says:

   sub(ere, repl[, in])

        Substitute the string repl in place of the first instance of
        the extended regular expression ERE in string in and return
        the number of substitutions. An ampersand ('&') appearing in
        the string repl shall be replaced by the string from in that
        matches the ERE. An ampersand preceded with a backslash ('\')
        shall be interpreted as the literal ampersand character. Any
        other occurrence of a backslash (for example, preceding any
        other character) shall be treated as a literal backslash
        character....

So the gawk behavior, while required by the current POSIX standard, will be
disallowed by the next standard (assuming the draft is approved as-is).
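
To make the difference concrete, here is a rough Python model of the two
readings applied to the replacement string from the script above. It is a
sketch, not an awk implementation: the function names are made up for
illustration, expand_draft follows the draft wording quoted above, and
expand_1992 is only an assumed model of the 1992 rule, inferred from the
gawk output shown earlier rather than from the standard's text.

```python
def expand_draft(repl, matched):
    """Draft wording: '&' is the matched text, backslash-ampersand is a
    literal '&', and every other backslash is a literal backslash."""
    out, i = [], 0
    while i < len(repl):
        if repl.startswith("\\&", i):
            out.append("&")
            i += 2
        elif repl[i] == "&":
            out.append(matched)
            i += 1
        else:
            out.append(repl[i])
            i += 1
    return "".join(out)


def expand_1992(repl, matched):
    """ASSUMED model of the POSIX.2-1992 reading: a backslash also
    escapes a following backslash, so two backslashes become one."""
    out, i = [], 0
    while i < len(repl):
        if repl.startswith("\\\\", i) or repl.startswith("\\&", i):
            out.append(repl[i + 1])
            i += 2
        elif repl[i] == "&":
            out.append(matched)
            i += 1
        else:
            out.append(repl[i])
            i += 1
    return "".join(out)


# The string gsub receives after awk's lexer halves the backslashes
# in the literal "\\\\&\\f(CW".
repl = r"\\&\f(CW"

print("1992 :", "[::]".replace("[", expand_1992(repl, "[")))
print("draft:", "[::]".replace("[", expand_draft(repl, "[")))
```

Under these assumptions the "1992" line reproduces the gawk result and the
"draft" line reproduces the AT&T awk result quoted at the top of this
message.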

And the draft rationale says:

        In sub and gsub, if repl is a string literal (the lexical
        token STRING), then two consecutive backslash characters
        should be used in the string to ensure a single backslash will
        precede the ampersand when the resultant string is passed to
        the function. (For example, to specify one literal ampersand
        in the replacement string, use gsub(ERE, "\\&").)
        
        Historically the only special character in the repl argument
        of sub and gsub string functions was the ampersand ('&')
        character and preceding it with the backslash character was
        used to turn off its special meaning.

        The description in the ISO POSIX-2: 1993 standard introduced
        behavior such that the backslash character was another special
        character and it was unspecified whether there were any other
        special characters. This description introduced several
        portability problems, some of which are described below, and
        so it has been replaced with the more historical
        description. Some of the problems include:

         * Historically, to create the replacement string, a script
           could use gsub(ERE, "\\&"), but with the ISO POSIX-2: 1993
           standard wording, it was necessary to use gsub(ERE,
           "\\\\&").  Backslash characters are doubled here because
           all string literals are subject to lexical analysis, which
           would reduce each pair of backslash characters to a single
           backslash before being passed to gsub.

         * Since it was unspecified what the special characters were,
           for portable scripts to guarantee that characters are
           printed literally, each character had to be preceded with a
           backslash. (For example, a portable script had to use
           gsub(ERE, "\\h\\i") to produce a replacement string of
           "hi".)

(Confused enough yet?  I am.  :-)
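
For the backslash counting in the rationale's first bullet, a tiny Python
sketch of the lexical-analysis step may help. It models only the one escape
that matters here (a pair of backslashes in an awk string literal denotes a
single backslash) and ignores every other escape sequence awk recognizes;
the function name lex_awk_string is hypothetical.

```python
def lex_awk_string(source):
    """Simplified model: collapse each pair of backslashes written in an
    awk string literal into the single backslash the function receives."""
    return source.replace("\\\\", "\\")


# Text as written between the quotes in the awk script.
for written in (r"\\&", r"\\\\&", r"\\h\\i"):
    received = lex_awk_string(written)
    print(f"written {written!s:8} -> gsub receives {received}")
```

So the literal with two backslashes hands gsub one backslash before the
ampersand, and the literal with four hands it two, which is why the two
wordings of the standard call for different amounts of doubling.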


