bug-gnu-utils

Re: gawk/POSIX regex metacharacter bug


From: Aharon Robbins
Subject: Re: gawk/POSIX regex metacharacter bug
Date: Sun, 13 Jun 2004 13:19:13 +0300

The documentation has been there for many years.  Possibly the indexing
could use improvement, but a straight-through read of the regex chapter
should have found it.

I won't disagree that the incompatibilities are problematic.  There is a
tension between historical compatibility, where { and } are NOT special,
and standards compliance, where they are.

Things seem to be shifting to a state where standards compliance is
more important, so possibly for the next major release I may make
inclusion of interval expressions the default behavior.

Arnold

> Date: Fri, 11 Jun 2004 10:17:44 -0700 (PDT)
> From: Shawn Smout <address@hidden>
> Subject: Re: gawk/POSIX regex metacharacter bug
> To: Aharon Robbins <address@hidden>, address@hidden
>
> I haven't looked at this in several months now, but at
> the time I spent a great deal of time searching for
> such documentation and couldn't find it, so if it was
> documented it certainly wasn't documented well enough.
> I don't know if anything has changed since then or
> what.  However, such incompatibilities in general are
> problematic.
>
> --- Aharon Robbins <address@hidden> wrote:
> > Greetings.
> > 
> > I am catching up on my gawk work.  In regard to the email quoted below,
> > I note the following text in gawk-3.1.3/doc/gawk.texi, starting at
> > line 3194:
> > 
> > | @cindex interval expressions
> > | @item @{@var{n}@}
> > | @itemx @{@var{n},@}
> > | @itemx @{@var{n},@var{m}@}
> > | One or two numbers inside braces denote an @dfn{interval expression}.
> > | If there is one number in the braces, the preceding regexp is repeated
> > | @var{n} times.
> > | If there are two numbers separated by a comma, the preceding regexp is
> > | repeated @var{n} to @var{m} times.
> > | If there is one number followed by a comma, then the preceding regexp
> > | is repeated at least @var{n} times:
> > |
> > | @table @code
> > | @item wh@{3@}y
> > | Matches @samp{whhhy}, but not @samp{why} or @samp{whhhhy}.
> > |
> > | @item wh@{3,5@}y
> > | Matches @samp{whhhy}, @samp{whhhhy}, or @samp{whhhhhy}, only.
> > |
> > | @item wh@{2,@}y
> > | Matches @samp{whhy} or @samp{whhhy}, and so on.
> > | @end table
> > |
> > | @cindex POSIX @command{awk}, interval expressions in
> > | Interval expressions were not traditionally available in @command{awk}.
> > | They were added as part of the POSIX standard to make @command{awk}
> > | and @command{egrep} consistent with each other.
> > |
> > | @cindex @command{gawk}, interval expressions and
> > | However, because old programs may use @samp{@{} and @samp{@}} in regexp
> > | constants, by default @command{gawk} does @emph{not} match interval expressions
> > | in regexps.  If either @option{--posix} or @option{--re-interval} are specified
> > | (@pxref{Options}), then interval expressions
> > | are allowed in regexps.
> > |
> > | For new programs that use @samp{@{} and @samp{@}} in regexp constants,
> > | it is good practice to always escape them with a backslash.  Then the
> > | regexp constants are valid and work the way you want them to, using
> > | any version of @command{awk}.@footnote{Use two backslashes if you're
> > | using a string constant with a regexp operator or function.}
> > 
> > This seems pretty clear to me.  What about this makes it seem that
> > gawk's treatment of { and } depending upon the use/absence of --posix is
> > "undocumented"?
> > 
> > Arnold Robbins
> > 
> > > Date: Sun, 28 Sep 2003 13:36:04 -0700 (PDT)
> > > From: Shawn Smout <address@hidden>
> > > Subject: gawk/POSIX regex metacharacter bug
> > > To: address@hidden
> > >
> > > I am running Slackware 9.1 with Linux 2.4.22 on a
> > > Pentium 4.  My gawk version is 3.1.3, and I am
> > > reasonably certain it was compiled with gcc 3.2.3.
> > >
> > > Gawk apparently handles metacharacters specially based
> > > on context normally, but does not in POSIX
> > > compatibility mode.  This is not listed in the
> > > documentation (info or man) as one of the POSIX/GNU
> > > differences.
> > >
> > > For this example, the file "file" contains one line:
> > >     {s}
> > >
> > > Ordinarily,
> > >     gawk '/{.}/' file
> > > will print:
> > >     {s}
> > > However,
> > >     gawk --posix '/{.}/' file
> > > fails with an invalid regular expression error.
> > > Apparently gawk normally decides based on context
> > > whether the {} characters are metacharacters or
> > > literal characters; since they are not valid as
> > > metacharacters in this example, gawk interprets them
> > > as literal characters.  In POSIX mode, gawk does not
> > > change its interpretation of the metacharacters based
> > > on context.
> > >
> > > The correct POSIX awk syntax is
> > >     awk '/\{.\}/' file
> > > with the metacharacters escaped so they are
> > > interpreted as literals.  This prints
> > >     {s}
> > > This syntax works in gawk in both normal and POSIX
> > > modes.
> > >
> > > The problem here is not the discrepancy between normal
> > > and POSIX modes; I am fully aware that most such
> > > discrepancies are deliberate.  However, this
> > > particular one is not documented, which is a major
> > > problem.  I discovered this gawk issue while compiling
> > > third-party software (specifically, the ALSA drivers)
> > > that uses gawk.  I had the POSIXLY_CORRECT environment
> > > variable set, which causes gawk to behave in POSIX
> > > mode, and the compilation failed; it took me a long
> > > time to figure out why.  This problem might never have
> > > existed if the discrepancy had been documented; even if
> > > it did exist, it would then become the fault of the
> > > developers for either (a) not checking the
> > > documentation and making sure their code was
> > > compatible with either mode of gawk, or (b) not
> > > informing the user that gawk needed to run in
> > > non-compatible mode.  However, it was not documented,
> > > so there was nothing the developers could have done
> > > about it.
> > >
> > > It is bad enough that so much GNU software allows lax
> > > syntax like this.  Allowing context-based
> > > interpretation of metacharacters doesn't add any
> > > functionality at all, because the developer can always
> > > escape the metacharacters to achieve the same result;
> > > it only allows harmful ambiguity, which in turn causes
> > > hard-to-find bugs that never should have been there to
> > > start out with.  If we are ever to have good bug-free
> > > code, we should try to eliminate ambiguity, not
> > > promote it.  However, I would consider the ambiguity
> > > tolerable in the software of others who choose to use
> > > it, if it were documented properly.
>
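For readers following the thread, the portable form recommended in the quoted manual text can be tried with a short shell sketch. It is an illustration only, not part of the original messages: the file name "sample.txt" is a hypothetical stand-in for the reporter's "file", and the gawk-specific commands are skipped when gawk is not installed.

```shell
# Sketch only: "sample.txt" is a hypothetical file name, and the gawk-specific
# part is skipped when gawk is not installed.
printf '{s}\n' > sample.txt

# Escaped braces are literal, so this prints {s} with any awk.
awk '/\{.\}/' sample.txt

if command -v gawk >/dev/null 2>&1; then
    # Same result in POSIX mode, where an unescaped { would start an interval.
    gawk --posix '/\{.\}/' sample.txt

    # With interval expressions enabled, wh{3}y selects only the line "whhhy".
    printf 'why\nwhhhy\nwhhhhy\n' | gawk --re-interval '/wh{3}y/'
fi
```

Escaping the braces avoids depending on whether a given awk, or a given gawk mode, treats { and } as interval operators.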
