bug-gawk

Re: difference in RS handling for equivalent regexps with unending input


From: Ed Morton
Subject: Re: difference in RS handling for equivalent regexps with unending input stream
Date: Wed, 3 Jul 2024 06:58:27 -0500
User-agent: Mozilla Thunderbird

Arnold - thanks for the quick response. Having to look ahead for the longest regexp match makes sense. It's surprising that different chars in the bracket expression change what matches (see `[;|=]` vs `[;a=]` in my 2nd email), but I'm happy to chalk that up to regexp implementation magic.

Unfortunately I don't have access to other awk variants and, for IT policy reasons, can't install them on the systems I have access to. I did now try setting a READ_TIMEOUT, but it either didn't change the results:

---
$ printf 'A;B;C;' > file
$ cat file - | awk -v RS='[;]' 'BEGIN{PROCINFO["/dev/stdin", "READ_TIMEOUT"]=100} {print $0; fflush()}'
A
B

$
---

or just led to a fatal error:

---
$ cat file - | awk -v RS='[;=]' 'BEGIN{PROCINFO["-", "READ_TIMEOUT"]=100} {print $0; fflush()}'
A
B
awk: cmd. line:1: (FILENAME=- FNR=3) fatal: error reading input file `-': Connection timed out
---

I also tried using stdbuf to disable all buffering but that didn't help either. I also thought I might be able to read 1 char at a time:

---
$ cat file - | awk -v RS='(.)' '{print RT; fflush()}'
A
;
B
;
C
---

and that got close but I'm still missing the final `;` and so can't tell from that if `C` would be a complete record or not.

I'd be interested to hear if you or anyone else reading this knows of a way to read the input 1 char at a time in a case like this where the input is unending and we can't rely on a regexp match for RS to find each character.
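One possible workaround, sketched here under assumptions and not confirmed anywhere in this thread, is to do the character-at-a-time reading outside awk and hand awk newline-terminated records. The `split_records` function below is a hypothetical helper written for POSIX sh. It reads one byte per `dd` call, so it never needs to look past a delimiter, at the cost of one process per byte (slow, and likely slower still on Cygwin).

```shell
#!/bin/sh
# split_records (hypothetical helper): copy stdin to stdout, ending the
# current output line at every ';' or '=' so that each record is one line.
split_records() {
    rec=''
    while :; do
        # Read exactly one byte. The trailing 'x' keeps a newline byte from
        # being stripped by command substitution.
        c=$(dd bs=1 count=1 2>/dev/null; printf x)
        c=${c%x}
        # An empty result means end of input (a NUL byte also ends the loop).
        [ -n "$c" ] || break
        case $c in
            ';'|'=') printf '%s\n' "$rec"; rec='' ;;
            *)       rec=$rec$c ;;
        esac
    done
}

# Example: three complete records; the unterminated "tail" is not emitted.
printf 'A;B=C;tail' | split_records
```

It could then sit in front of awk, for example `cat file - | split_records | awk '{ print NR, $0; fflush() }'`. With the default newline RS, awk should not need regexp lookahead to find the end of each record, though any newline already present in the input would also end a record.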

Regards,

    Ed.

On 7/3/2024 6:19 AM, arnold@skeeve.com wrote:
Hi Ed.

There are two separate questions here. One is why are the different
regexps handled differently? The answer is I don't know, although I
can guess that since the third one is guaranteed to only match a
single character, the matching is a little smarter.

The second question is why does gawk not parse all the input when
standard input remains open.  The reason is that it has to read ahead
a bit to be sure that it has completely matched the regular expression
and can tell where the definitive end of the record is.

You can try with mawk and the One True Awk and see if the behavior
is any different. Both of those allow RS to be a regexp.

I looked at the stack overflow post. Gawk has a read timeout mechanism
(see the manual, I don't remember the details) that will likely work
on pipes, sockets and terminals; that might do the trick, it might not.

In any case, there are no real bugs here, just limits as to what is
possible.

Thanks,

Arnold

Ed Morton <mortoneccc@comcast.net> wrote:

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: cygwin
Compiler: gcc
Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
-Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
--param=ssp-buffer-size=4
-fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/build=/usr/src/debug/gawk-5.3.0-1
-fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/src/gawk-5.3.0=/usr/src/debug/gawk-5.3.0-1
-DNDEBUG
uname output: CYGWIN_NT-10.0-22631 TournaMart_2023 3.5.3-1.x86_64
2024-04-03 17:25 UTC x86_64 Cygwin
Machine Type: x86_64-pc-cygwin

Gawk Version: 5.3.0

Attestation 1:
          I have read
https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
          Yes

Attestation 2:
          I have not modified the sources before building gawk.
          True

Description:

     Someone asked a question on SO about handling unending input from
     netcat with a regexp delimiter that's just 2 possible chars, see
     https://stackoverflow.com/q/78700014/1745001, where gawk seems to be
     a record behind in its processing. I'm using bash on cygwin, they
     used zsh on MacOS.

Repeat-By:

     I can reproduce the problem with this (hitting control-C to stop
     each command when it stops to wait for more input):

     $ printf 'A;B;C;\n' > file

     $ cat file - | awk -v RS='(;|=)' '{print NR, $0}'
     1 A

     $ cat file - | awk -v RS=';|=' '{print NR, $0}'
     1 A
     2 B

     $ cat file - | awk -v RS='[;=]' '{print NR, $0}'
     1 A
     2 B
     3 C

     Obviously that's 3 supposedly equivalent regexps producing 3
     different results.
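A variant of the reproduction that does not need Control-C can be sketched by replacing `cat file -` with a writer that sends the data and then holds the pipe open for a short time. The `repro` function and its delay argument are hypothetical names used only for illustration, `awk` is assumed to be gawk as in the report, and the input omits the trailing newline so that end of input does not add a fourth record.

```shell
#!/bin/sh
# repro (hypothetical helper): run the three RS variants against a writer
# that sends 'A;B;C;' and then holds the pipe open for $1 seconds (default 2).
repro() {
    delay=${1:-2}
    for rs in '(;|=)' ';|=' '[;=]'; do
        printf 'RS=%s\n' "$rs"
        { printf 'A;B;C;'; sleep "$delay"; } |
            awk -v RS="$rs" '{ print NR, $0; fflush() }'
    done
}

repro 2
```

Records that appear before the pause were recognized without waiting for end of input; records that appear only after it needed the writer to close the pipe. Once the pipe closes, all three variants should have printed all three records.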

