bug-gnu-utils

Re: Oddity in regular expressions


From: Aharon Robbins
Subject: Re: Oddity in regular expressions
Date: Wed, 18 Apr 2001 15:56:10 +0300

Hi Paul & Andrew.

Paul is correct that the draft requires the previous "AT&T" behavior.
(I prefer to refer to Brian's awk as the "Bell Labs awk", since BWK
hasn't worked at AT&T for several years.)

HOWEVER, I managed to fix this.  It turns out that a critical sentence
was dropped back in 1995 when the text was changed, which also made \\
special.  I got an interpretation request into the POSIX cycle in time for
the next (and last?) recirculation draft, which changes the wording such
that

        input   output
        -----   ------
        \\        \
        \&        &
        \q        \q

In other words, \ is special before either a \ or an &.  Otherwise the \
is left alone and appears in the output.
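
For anyone who wants to check a replacement string against the table
above, here is a minimal sketch in Python.  It models only the rule as
I've summarized it here, not the code of any particular awk, and the
function name expand_replacement is made up for the illustration.  It
takes the replacement string as sub/gsub would see it, i.e. after awk's
lexical analysis of the string literal has already been done.

```python
def expand_replacement(repl: str, matched: str) -> str:
    """Expand a sub/gsub replacement string under the revised wording.

    `repl` is the string as the function receives it (after lexical
    analysis of the awk string literal); `matched` is the text that
    the regular expression matched.
    """
    out = []
    i = 0
    while i < len(repl):
        ch = repl[i]
        if ch == "\\" and i + 1 < len(repl) and repl[i + 1] in "\\&":
            # Backslash is special only before a backslash or an ampersand:
            # drop it and emit the following character literally.
            out.append(repl[i + 1])
            i += 2
        elif ch == "&":
            # A bare ampersand stands for the matched text.
            out.append(matched)
            i += 1
        else:
            # Everything else, including a backslash before any other
            # character, is left alone.
            out.append(ch)
            i += 1
    return "".join(out)


if __name__ == "__main__":
    # The three rows of the table, with "[" as the matched text.
    print(expand_replacement(r"\\", "["))   # prints \
    print(expand_replacement(r"\&", "["))   # prints &
    print(expand_replacement(r"\q", "["))   # prints \q
```

Feeding it Andrew's replacement string as gsub receives it, \\&\f(CW,
with "[" as the matched text gives \[\f(CW under this model.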

This is, I believe, exactly what mawk currently does, so Mike Brennan
is in good shape here.  It's not quite what gawk 3.0 and 3.1 do, but
I'll fix it down the road for gawk 3.2.

This just happened within the past month, so it's no surprise that it's
not particularly publicized.

(And before anyone asks, gawk 3.1 has hit code and documentation freeze.
I'm not changing this for 3.1.)

Thanks,

Arnold

> Date: Fri, 13 Apr 2001 13:37:25 -0700 (PDT)
> From: Paul Eggert <address@hidden>
> To: address@hidden
> CC: address@hidden
> Subject: Re: Oddity in regular expressions
>
> > From: Andrew Koenig <address@hidden>
> > Date: Fri, 13 Apr 2001 10:20:29 -0400 (EDT)
> > 
> >        BEGIN {
> >            s = "[::]"
> >            print "Before: " s
> >            gsub(/\[/, "\\\\&\\f(CW", s)
> >            print "After: " s
> >        }
> > 
> > Running under GNU Awk 3.0.6 yields this output:
> > 
> >        Before: [::]
> >        After: \[\f(CW::]
> > 
> > Running under AT&T awk yields this output:
> > 
> >        Before: [::]
> >        After: \&\f(CW::]
>
> My reading of POSIX.2-1992 section 4.1.7.6.2.2 page 178-179 lines
> 647-653 is that it requires the gawk-style behavior, and that AT&T awk
> does not conform.
>
> For what it's worth, Solaris 8 /usr/xpg4/bin/awk agrees with gawk
> whereas /usr/bin/nawk agrees with AT&T awk, so perhaps Sun has noticed
> the incompatibility and has decided that POSIX does indeed require the
> gawk-style behavior.  Also, mawk agrees with gawk.
>
> HOWEVER......
>
> The latest draft of the next POSIX revision agrees with AT&T awk!
> It says:
>
>    sub(ere,  repl[,  in  ])
>
>       Substitute the string repl in place of the first instance of
>       the extended regular expression ERE in string in and return
>       the number of substitutions. An ampersand ('&') appearing in
>       the string repl shall be replaced by the string from in that
>       matches the ERE. An ampersand preceded with a backslash ('\')
>       shall be interpreted as the literal ampersand character. Any
>       other occurrence of a backslash (for example, preceding any
>       other character) shall be treated as a literal backslash
>       character....
>
> So the gawk behavior, while required by the current POSIX standard, will be
> disallowed by the next standard (assuming the draft is approved as-is).
>
> And the draft rationale says:
>
>       In sub and gsub, if repl is a string literal (the lexical
>       token STRING), then two consecutive backslash characters
>       should be used in the string to ensure a single backslash will
>       precede the ampersand when the resultant string is passed to
>       the function. (For example, to specify one literal ampersand
>       in the replacement string, use gsub(ERE, "\\&").)
>       
>       Historically the only special character in the repl argument
>       of sub and gsub string functions was the ampersand ('&')
>       character and preceding it with the backslash character was
>       used to turn off its special meaning.
>
>       The description in the ISO POSIX-2: 1993 standard introduced
>       behavior such that the backslash character was another special
>       character and it was unspecified whether there were any other
>       special characters. This description introduced several
>       portability problems, some of which are described below, and
>       so it has been replaced with the more historical
>       description. Some of the problems include:
>
>        * Historically, to create the replacement string, a script
>          could use gsub(ERE, "\\&"), but with the ISO POSIX-2: 1993
>          standard wording, it was necessary to use gsub(ERE,
>          "\\\\&").  Backslash characters are doubled here because
>          all string literals are subject to lexical analysis, which
>          would reduce each pair of backslash characters to a single
>          backslash before being passed to gsub.
>
>        * Since it was unspecified what the special characters were,
>          for portable scripts to guarantee that characters are
>          printed literally, each character had to be preceded with a
>          backslash. (For example, a portable script had to use
>          gsub(ERE, "\\h\\i") to produce a replacement string of
>          "hi".)
>
> (Confused enough yet?  I am.  :-)


