Hi Ed.
There are two separate questions here. The first is why the different
regexps are handled differently. The answer is that I don't know,
although I can guess that since the third one is guaranteed to match
only a single character, the matching is a little smarter.
The second question is why gawk does not parse all the input while
standard input remains open. The reason is that it has to read ahead
a bit to be sure that it has completely matched the regular expression
and can tell where the definitive end of the record is.
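A minimal sketch of why the read-ahead is needed. The separator RS='ab+'
and the sample input are made up for illustration and are not from the
report; AWK is an assumed variable naming any awk that accepts a regexp RS.
After the first chunk 'xab' arrives, the terminator could already be
complete ('ab') or could still grow ('abbb'), so the end of the record is
not known until more input or end-of-file shows up.

```shell
#!/bin/sh
# Illustrative only: RS='ab+' and the input below are invented for this sketch.
# AWK may be set to any awk that accepts a regexp RS (gawk, mawk, ...).
AWK=${AWK:-awk}

# split_ab: split standard input into records on the regexp 'ab+' and
# print each record between brackets so embedded newlines are visible.
split_ab() {
    "$AWK" -v RS='ab+' '{ print NR ": [" $0 "]" }'
}

# Send 'xab', stay silent for 2 seconds, then send the rest and close.
# During the pause awk cannot know whether 'ab' is the whole terminator.
{ printf 'xab'; sleep 2; printf 'bby\n'; } | split_ab
```

When the whole input is available at once, the records are 'x' and 'y'
followed by the trailing newline. What a given awk prints during the pause
is implementation-specific and worth checking rather than assuming.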
You can try with mawk and the One True Awk and see if the behavior
is any different. Both of those allow RS to be a regexp.
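One possible way to run that comparison without typing control-C each
time, as a sketch. AWK is an assumed variable: set it to gawk, mawk, or
whatever your system calls the One True Awk (original-awk is one common
package name, but that is an assumption about your installation). The
2-second sleep stands in for the open standard input in the report.

```shell
#!/bin/sh
# AWK is an assumption: point it at the awk you want to test,
# e.g. AWK=gawk, AWK=mawk or AWK=original-awk. Defaults to plain "awk".
AWK=${AWK:-awk}

# split_with RS: split standard input into records using the given RS.
split_with() {
    "$AWK" -v RS="$1" '{ print NR, $0 }'
}

for rs in '(;|=)' ';|=' '[;=]'; do
    printf '== %s RS=%s\n' "$AWK" "$rs"
    # Send the data, then hold the pipe open for 2 seconds before EOF.
    { printf 'A;B;C;\n'; sleep 2; } | split_with "$rs"
done
```

Run it with output going to a terminal and watch the timing: records that
appear only after the 2-second pause are the ones that implementation was
holding back. Once end-of-file arrives, all three regexps should yield the
same records.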
I looked at the Stack Overflow post. Gawk has a read timeout mechanism
(see the manual, I don't remember the details) that will likely work
on pipes, sockets and terminals; that might do the trick, it might not.
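A hedged sketch of trying the timeout, based on my reading of the gawk
manual's description of PROCINFO[input_name, "READ_TIMEOUT"] (a value in
milliseconds, with "-" naming standard input). TIMEOUT_MS is a
hypothetical knob for this example. As far as I understand it, a timeout
on the main input ends gawk with a fatal error, so whether this releases
the pending record or just stops the program is exactly what needs testing.

```shell
#!/bin/sh
# Requires gawk. TIMEOUT_MS is a hypothetical knob for this sketch.
TIMEOUT_MS=${TIMEOUT_MS:-1000}

# split_timeout RS: split standard input into records using the given RS,
# with a read timeout (milliseconds) set on standard input ("-").
split_timeout() {
    gawk -v RS="$1" -v ms="$TIMEOUT_MS" '
        BEGIN { PROCINFO["-", "READ_TIMEOUT"] = ms + 0 }
        { print NR, $0 }'
}

# Data, then 3 seconds of silence, which is longer than the timeout.
{ printf 'A;B;C;\n'; sleep 3; } | split_timeout '(;|=)' ||
    echo "gawk exited with status $?"
```

If end-of-file arrives before the timeout expires, the timeout never
fires and the records are split as usual.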
In any case, there are no real bugs here, just limits as to what is
possible.
Thanks,
Arnold
Ed Morton <mortoneccc@comcast.net> wrote:
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: cygwin
Compiler: gcc
Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
-Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
--param=ssp-buffer-size=4
-fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/build=/usr/src/debug/gawk-5.3.0-1
-fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.3.0-1.x86_64/src/gawk-5.3.0=/usr/src/debug/gawk-5.3.0-1
-DNDEBUG
uname output: CYGWIN_NT-10.0-22631 TournaMart_2023 3.5.3-1.x86_64
2024-04-03 17:25 UTC x86_64 Cygwin
Machine Type: x86_64-pc-cygwin
Gawk Version: 5.3.0
Attestation 1:
I have read
https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
Yes
Attestation 2:
I have not modified the sources before building gawk.
True
Description:
Someone asked a question on SO about handling unending input from
netcat with a regexp delimiter that's just 2 possible chars, see
https://stackoverflow.com/q/78700014/1745001, where gawk seems to be
a record behind in its processing. I'm using bash on Cygwin; they
used zsh on macOS.
Repeat-By:
I can reproduce the problem with this (hitting control-C to stop
each command when it stops to wait for more input):
$ printf 'A;B;C;\n' > file
$ cat file - | awk -v RS='(;|=)' '{print NR, $0}'
1 A
$ cat file - | awk -v RS=';|=' '{print NR, $0}'
1 A
2 B
$ cat file - | awk -v RS='[;=]' '{print NR, $0}'
1 A
2 B
3 C
Obviously that's 3 supposedly equivalent regexps producing 3
different results.