bug-gnu-utils

Re: gawk/POSIX regex metacharacter bug


From: Aharon Robbins
Subject: Re: gawk/POSIX regex metacharacter bug
Date: Thu, 10 Jun 2004 14:11:32 +0300

Greetings.

I am catching up on my gawk work.  In regard to the email quoted below,
I note the following text in gawk-3.1.3/doc/gawk.texi, starting at
line 3194:

| @cindex interval expressions
| @item @{@var{n}@}
| @itemx @{@var{n},@}
| @itemx @{@var{n},@var{m}@}
| One or two numbers inside braces denote an @dfn{interval expression}.
| If there is one number in the braces, the preceding regexp is repeated
| @var{n} times.
| If there are two numbers separated by a comma, the preceding regexp is
| repeated @var{n} to @var{m} times.
| If there is one number followed by a comma, then the preceding regexp
| is repeated at least @var{n} times:
| 
| @table @code
| @item wh@{3@}y
| Matches @samp{whhhy}, but not @samp{why} or @samp{whhhhy}.
| 
| @item wh@{3,5@}y
| Matches @samp{whhhy}, @samp{whhhhy}, or @samp{whhhhhy}, only.
| 
| @item wh@{2,@}y
| Matches @samp{whhy} or @samp{whhhy}, and so on.
| @end table
| 
| @cindex POSIX @command{awk}, interval expressions in
| Interval expressions were not traditionally available in @command{awk}.
| They were added as part of the POSIX standard to make @command{awk}
| and @command{egrep} consistent with each other.
| 
| @cindex @command{gawk}, interval expressions and
| However, because old programs may use @samp{@{} and @samp{@}} in regexp
| constants, by default @command{gawk} does @emph{not} match interval
| expressions in regexps.  If either @option{--posix} or @option{--re-interval}
| are specified (@pxref{Options}), then interval expressions
| are allowed in regexps.
| 
| For new programs that use @samp{@{} and @samp{@}} in regexp constants,
| it is good practice to always escape them with a backslash.  Then the
| regexp constants are valid and work the way you want them to, using
| any version of @command{awk}.@footnote{Use two backslashes if you're
| using a string constant with a regexp operator or function.}

This seems pretty clear to me.  What about this makes it seem that
gawk's treatment of { and } depending upon the use/absence of --posix is
"undocumented"?
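
As a rough illustration of the two points in the quoted text, here is a
small shell sketch.  It assumes some POSIX-style awk is installed as "awk"
and uses "grep -E" only as a stand-in for an interval-aware matcher, since
whether awk itself treats braces as an interval depends on the version and
options in use.  The scratch file and variable names are made up for the
example.

```shell
# Scratch file with the one-line input from the original report.
tmp=$(mktemp)
printf '%s\n' '{s}' > "$tmp"

# Escaped braces are literal characters, whether or not intervals are enabled.
escaped=$(awk '/\{.\}/' "$tmp")
printf 'escaped braces matched: %s\n' "$escaped"

# grep -E stands in for an interval-aware matcher: h{3} means exactly three h's.
interval=$(printf '%s\n' why whhhy whhhhy | grep -E 'wh{3}y')
printf 'interval matched: %s\n' "$interval"

rm -f "$tmp"
```

The escaped form is the one the documentation recommends for new programs,
because it does not depend on which mode the awk is running in.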

Arnold Robbins

> Date: Sun, 28 Sep 2003 13:36:04 -0700 (PDT)
> From: Shawn Smout <address@hidden>
> Subject: gawk/POSIX regex metacharacter bug
> To: address@hidden
>
> I am running Slackware 9.1 with Linux 2.4.22 on a
> Pentium 4.  My gawk version is 3.1.3, and I am
> reasonably certain it was compiled with gcc 3.2.3.
>
> Gawk apparently handles metacharacters specially based
> on context normally, but does not in POSIX
> compatibility mode.  This is not listed in the
> documentation (info or man) as one of the POSIX/GNU
> differences.
>
> For this example, the file "file" contains one line:
>     {s}
>
> Ordinarily,
>     gawk '/{.}/' file
> will print:
>     {s}
> However,
>     gawk --posix '/{.}/' file
> fails with an invalid regular expression error. 
> Apparently gawk normally decides based on context
> whether the {} characters are metacharacters or
> literal characters; since they are not valid as
> metacharacters in this example, gawk interprets them
> as literal characters.  In POSIX mode, gawk does not
> change its interpretation of the metacharacters based
> on context.
>
> The correct POSIX awk syntax is
>     awk '/\{.\}/' file
> with the metacharacters escaped so they are
> interpreted as literals.  This prints
>     {s}
> This syntax works in gawk in both normal and POSIX
> modes.
>
> The problem here is not the discrepancy between normal
> and POSIX modes; I am fully aware that most such
> discrepancies are deliberate.  However, this
> particular one is not documented, which is a major
> problem.  I discovered this gawk issue while compiling
> third-party software (specifically, the ALSA drivers)
> that uses gawk.  I had the POSIXLY_CORRECT environment
> variable set, which causes gawk to behave in POSIX
> mode, and the compilation failed; it took me a long
> time to figure out why.  This problem may never have
> existed if the discrepancy was documented; even if it
> did exist, it would then become the fault of the
> developers for either (a) not checking the
> documentation and making sure their code was
> compatible with either mode of gawk, or (b) not
> informing the user that gawk needed to run in
> non-compatible mode.  However, it was not documented,
> so there was nothing the developers could have done
> about it.
>
> It is bad enough that so much GNU software allows lax
> syntax like this.  Allowing context-based
> interpretation of metacharacters doesn't add any
> functionality at all, because the developer can always
> escape the metacharacters to achieve the same result;
> it only allows harmful ambiguity, which in turn causes
> hard-to-find bugs that never should have been there to
> start out with.  If we are ever to have good bug-free
> code, we should try to eliminate ambiguity, not
> promote it.  However, I would consider the ambiguity
> tolerable in the software of others who choose to use
> it, if it were documented properly.



