Re: gawk/POSIX regex metacharacter bug
From: Shawn Smout
Subject: Re: gawk/POSIX regex metacharacter bug
Date: Fri, 11 Jun 2004 10:17:44 -0700 (PDT)
I haven't looked at this in several months now, but at
the time I spent a great deal of time searching for
such documentation and couldn't find it, so if it was
documented, it certainly wasn't documented well enough.
I don't know whether anything has changed since then.
However, such incompatibilities in general are
problematic.
--- Aharon Robbins <address@hidden> wrote:
> Greetings.
>
> I am catching up on my gawk work. With regard to the
> email quoted below, I note the following text in
> gawk-3.1.3/doc/gawk.texi, starting at line 3194:
>
> | @cindex interval expressions
> | @item @{@var{n}@}
> | @itemx @{@var{n},@}
> | @itemx @{@var{n},@var{m}@}
> | One or two numbers inside braces denote an @dfn{interval expression}.
> | If there is one number in the braces, the preceding regexp is repeated
> | @var{n} times.
> | If there are two numbers separated by a comma, the preceding regexp is
> | repeated @var{n} to @var{m} times.
> | If there is one number followed by a comma, then the preceding regexp
> | is repeated at least @var{n} times:
> |
> | @table @code
> | @item wh@{3@}y
> | Matches @samp{whhhy}, but not @samp{why} or @samp{whhhhy}.
> |
> | @item wh@{3,5@}y
> | Matches @samp{whhhy}, @samp{whhhhy}, or @samp{whhhhhy}, only.
> |
> | @item wh@{2,@}y
> | Matches @samp{whhy} or @samp{whhhy}, and so on.
> | @end table
> |
> | @cindex POSIX @command{awk}, interval expressions in
> | Interval expressions were not traditionally available in @command{awk}.
> | They were added as part of the POSIX standard to make @command{awk}
> | and @command{egrep} consistent with each other.
> |
> | @cindex @command{gawk}, interval expressions and
> | However, because old programs may use @samp{@{} and @samp{@}} in regexp
> | constants, by default @command{gawk} does @emph{not} match interval expressions
> | in regexps. If either @option{--posix} or @option{--re-interval} are specified
> | (@pxref{Options}), then interval expressions
> | are allowed in regexps.
> |
> | For new programs that use @samp{@{} and @samp{@}} in regexp constants,
> | it is good practice to always escape them with a backslash. Then the
> | regexp constants are valid and work the way you want them to, using
> | any version of @command{awk}.@footnote{Use two backslashes if you're
> | using a string constant with a regexp operator or function.}
>
> This seems pretty clear to me. What about this makes it seem that
> gawk's treatment of { and } depending upon the use/absence of --posix is
> "undocumented"?
>
> Arnold Robbins
>
> > Date: Sun, 28 Sep 2003 13:36:04 -0700 (PDT)
> > From: Shawn Smout <address@hidden>
> > Subject: gawk/POSIX regex metacharacter bug
> > To: address@hidden
> >
> > I am running Slackware 9.1 with Linux 2.4.22 on a
> > Pentium 4. My gawk version is 3.1.3, and I am
> > reasonably certain it was compiled with gcc 3.2.3.
> >
> > Gawk apparently handles metacharacters specially based
> > on context normally, but does not in POSIX
> > compatibility mode. This is not listed in the
> > documentation (info or man) as one of the POSIX/GNU
> > differences.
> >
> > For this example, the file "file" contains one line:
> > {s}
> >
> > Ordinarily,
> > gawk '/{.}/' file
> > will print:
> > {s}
> > However,
> > gawk --posix '/{.}/' file
> > fails with an invalid regular expression error.
> > Apparently gawk normally decides based on context
> > whether the {} characters are metacharacters or
> > literal characters; since they are not valid as
> > metacharacters in this example, gawk interprets them
> > as literal characters. In POSIX mode, gawk does not
> > change its interpretation of the metacharacters based
> > on context.
> >
> > The correct POSIX awk syntax is
> > awk '/\{.\}/' file
> > with the metacharacters escaped so they are
> > interpreted as literals. This prints
> > {s}
> > This syntax works in gawk in both normal and POSIX
> > modes.
> >
> > The problem here is not the discrepancy between normal
> > and POSIX modes; I am fully aware that most such
> > discrepancies are deliberate. However, this
> > particular one is not documented, which is a major
> > problem. I discovered this gawk issue while compiling
> > third-party software (specifically, the ALSA drivers)
> > that uses gawk. I had the POSIXLY_CORRECT environment
> > variable set, which causes gawk to behave in POSIX
> > mode, and the compilation failed; it took me a long
> > time to figure out why. This problem might never have
> > existed if the discrepancy had been documented; even if
> > it did exist, it would then become the fault of the
> > developers for either (a) not checking the
> > documentation and making sure their code was
> > compatible with either mode of gawk, or (b) not
> > informing the user that gawk needed to run in
> > non-compatible mode. However, it was not documented,
> > so there was nothing the developers could have done
> > about it.
> >
> > It is bad enough that so much GNU software allows lax
> > syntax like this. Allowing context-based
> > interpretation of metacharacters doesn't add any
> > functionality at all, because the developer can always
> > escape the metacharacters to achieve the same result;
> > it only allows harmful ambiguity, which in turn causes
> > hard-to-find bugs that never should have been there to
> > start out with. If we are ever to have good bug-free
> > code, we should try to eliminate ambiguity, not
> > promote it. However, I would consider the ambiguity
> > tolerable in the software of others who choose to use
> > it, if it were documented properly.
> >