Hello, Arnold.
Parenthesized subexpressions also need fixing. Here's an alternative
patch in case it's of interest. I've also included the tests I used,
in case they're helpful.
Take care,
Miguel
#!/bin/sh
awk=${1:-gawk}
export LC_CTYPE=en_US.UTF-8
cat <<EOF
Correct Results:
1 0
1 1 1 0 2 0
2 0
2 2 3 0
================
EOF
$awk 'BEGIN {
match("\342\200\257", /^/, m)
print RSTART, RLENGTH
}'
$awk 'BEGIN {
match("\342\200\257", /^(a?)\u202F(b?)$/, m)
print RSTART, RLENGTH, m[1,"start"], m[1,"length"], m[2,"start"], m[2,"length"]
}'
$awk 'BEGIN {
match("\342\200\257", /$/, m)
print RSTART, RLENGTH
}'
$awk 'BEGIN {
match("\342\200\257ac", /a(b?)c/, m)
print RSTART, RLENGTH, m[1,"start"], m[1,"length"]
}'
diff --git a/ChangeLog b/ChangeLog
index 2fb55e19..b484d179 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,16 @@
+2023-09-01 Miguel Pineiro Jr <mpj@pineiro.cc>
+
+ Fix the handling of zero-length matches in multibyte locales.
+ Thanks to Ed Morton <mortoneccc@comcast.net> for the report.
+
+ * builtin.c (do_match): Translate rstart (byte idx to char idx)
+ even when rlength is zero. For this we tweak the conversion of
+ rlength to keep it in bounds when rstart and rlength are both 0.
+ * node.c (str2wstr): Add an entry to the indices array for the
+ terminating null. It facilitates the tweak above and is needed
+ to translate the idx of a zero-width match at the end of the
+ string.
+
2023-08-27 Arnold D. Robbins <arnold@skeeve.com>
* re.c (make_regexp): When do_traditional and looking to see
diff --git a/builtin.c b/builtin.c
index e394cc34..2bc0aaa3 100644
--- a/builtin.c
+++ b/builtin.c
@@ -2791,9 +2791,9 @@ do_match(int nargs)
size_t *wc_indices = NULL;
rlength = REEND(rp, t1->stptr) - RESTART(rp, t1->stptr); /* byte length */
- if (rlength > 0 && gawk_mb_cur_max > 1) {
+ if (gawk_mb_cur_max > 1) {
t1 = str2wstr(t1, & wc_indices);
- rlength = wc_indices[rstart + rlength - 1] - wc_indices[rstart] + 1;
+ rlength = wc_indices[rstart + rlength] - wc_indices[rstart];
rstart = wc_indices[rstart];
}
@@ -2816,9 +2816,9 @@ do_match(int nargs)
start = t1->stptr + s;
subpat_start = s;
subpat_len = len = SUBPATEND(rp, t1->stptr, ii) - s;
- if (len > 0 && gawk_mb_cur_max > 1) {
+ if (gawk_mb_cur_max > 1) {
subpat_start = wc_indices[s];
- subpat_len = wc_indices[s + len - 1] - subpat_start + 1;
+ subpat_len = wc_indices[s + len] - subpat_start;
}
it = make_string(start, len);
diff --git a/node.c b/node.c
index 5de4e082..bc4e777d 100644
--- a/node.c
+++ b/node.c
@@ -851,7 +851,7 @@ str2wstr(NODE *n, size_t **ptr)
* Create the array.
*/
if (ptr != NULL) {
- ezalloc(*ptr, size_t *, sizeof(size_t) * n->stlen, "str2wstr");
+ ezalloc(*ptr, size_t *, sizeof(size_t) * (n->stlen + 1), "str2wstr");
}
sp = n->stptr;
@@ -923,6 +923,11 @@ str2wstr(NODE *n, size_t **ptr)
}
}
+ /* Needed for zero-length matches at the end of a string */
+ assert(sp - n->stptr == n->stlen);
+ if (ptr != NULL)
+ (*ptr)[sp - n->stptr] = i;
+
*wsp = L'\0';
n->wstlen = wsp - n->wstptr;
n->flags |= WSTRCUR;
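
For anyone reading the ChangeLog entry without the gawk sources at hand,
the index translation can be sketched standalone. This is only a rough
sketch, not the gawk code: byte_to_char_index() and span_to_chars() are
hypothetical names, and it assumes valid UTF-8 input and counts lead
bytes by hand, whereas str2wstr() goes through the locale's multibyte
conversion. It shows why the array wants one extra entry: with a slot
for the terminating null, a byte span converts to a character span by
plain subtraction, even when the span is empty or sits at the end of
the string.

```c
#include <assert.h>
#include <stddef.h>
#include <stdlib.h>

/*
 * Hypothetical helper (not gawk code): for each byte offset of a
 * UTF-8 string, record the 0-based index of the character that the
 * byte belongs to. The array has len + 1 entries; the extra entry
 * maps the offset of the terminating null to the character count.
 * Assumes valid UTF-8. The caller frees the result.
 */
size_t *
byte_to_char_index(const char *s, size_t len)
{
	size_t *idx;
	size_t nchars = 0;
	size_t i;

	assert(s != NULL);
	idx = calloc(len + 1, sizeof *idx);
	if (idx == NULL)
		return NULL;

	for (i = 0; i < len; i++) {
		/* any byte that is not 10xxxxxx starts a new character */
		if (((unsigned char) s[i] & 0xC0) != 0x80)
			nchars++;
		idx[i] = nchars > 0 ? nchars - 1 : 0;
	}
	idx[len] = nchars;	/* entry for the terminating null */
	return idx;
}

/*
 * Hypothetical helper: convert a byte span (0-based start, length)
 * into a 1-based character start and a character length. The byte
 * span must begin and end on character boundaries. No special case
 * is needed for blen == 0.
 */
void
span_to_chars(const size_t *idx, size_t bstart, size_t blen,
	      size_t *cstart, size_t *clen)
{
	*clen = idx[bstart + blen] - idx[bstart];
	*cstart = idx[bstart] + 1;
}
```

With the 5-byte string "\342\200\257ac", the byte span (3, 2) covering
"ac" comes out as character start 2, length 2, and the empty span at
byte offset 4 comes out as start 3, length 0. That agrees with the
"2 2 3 0" line in the expected results of the test script.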
On Fri, Sep 1, 2023, at 12:28 AM, arnold@skeeve.com wrote:
Hi Ed.
This was a really interesting corner case. Good catch. The fix
is attached and will be in git eventually.
Thanks for the report!
Arnold
Ed Morton <mortoneccc@comcast.net> wrote:
Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: cygwin
Compiler: gcc
Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
-Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
--param=ssp-buffer-size=4
-fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/build=/usr/src/debug/gawk-5.2.2-1
-fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/src/gawk-5.2.2=/usr/src/debug/gawk-5.2.2-1
-DNDEBUG
uname output: CYGWIN_NT-10.0-22621 TournaMart_2023 3.4.8-1.x86_64
2023-08-17 17:02 UTC x86_64 Cygwin
Machine Type: x86_64-pc-cygwin
Gawk Version: 5.2.2
Attestation 1:
I have read
https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
Yes
Attestation 2:
I have not modified the sources before building gawk.
True
Description:
Different string handling functions produce different results
for multi-byte characters.
Repeat-By:
Without "-b":
$ awk 'BEGIN{str="\342\200\257"; print length(str);
match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
1
1
4
Note that length() reports the string as 1 character and the first
call to match() agrees, but the 2nd call to match() treats it as 3
characters (RSTART puts the "end of string" at position 4).
Now with "-b" ("Cause gawk to treat all input data as
single-byte characters" per
https://www.gnu.org/software/gawk/manual/gawk.html#Options):
$ awk -b 'BEGIN{str="\342\200\257"; print length(str);
match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
3
3
4
Note that length() now thinks that string is 3 characters, the
first call to match() agrees again, and then the 2nd call to match() now
also agrees.
Per the manual, "in gawk, length(), substr(), split(), match()
and the other string functions ... all work in terms of characters in
the local character set, and not in terms of bytes" (from
https://www.gnu.org/software/gawk/manual/html_node/Bytes-vs_002e-Characters.html),
so I was expecting consistent results across those 3 function calls,
with all of them agreeing with length()'s results.
It may be that only match() has an issue; I haven't noticed a problem
with any other function, but I haven't been looking for one.
Attachments:
* fix.diff