Re: inconsistency with counting characters vs bytes for multi-byte chara

bug-gawk

From:	Ed Morton
Subject:	Re: inconsistency with counting characters vs bytes for multi-byte characters
Date:	Tue, 12 Sep 2023 06:59:22 -0500
User-agent:	Mozilla/5.0 (Windows NT 10.0; Win64; x64; rv:102.0) Gecko/20100101 Thunderbird/102.15.0

Arnold et al - someone on a forum just pointed out this:

    $ awk 'BEGIN{str="abc"; n=gsub(//,"X",str); print n, str }'
    4 XaXbXcX

    $ awk 'BEGIN{str="\342\200\257"; n=gsub(//,"X",str); print n, str }'
    4 X▒X▒X▒X

i.e. gsub() with an empty regexp matches around each byte in that 3-byte character. I don't recall ever having wanted to match an empty regexp and can't find a reference to that in documentation so I don't know if that's expected behavior or undefined behavior or a similar issue to the match() issue below so thought it best to just pass it along so you can decide what, if anything, to do about it.
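
To make the counts above concrete, here is a minimal C sketch (not gawk code) of why an empty regexp matches 4 times in a 3-byte string: there is one empty-match position before each unit plus one at the end, so the total depends on whether the units are bytes or characters. The helper names utf8_char_count and empty_match_positions are made up for illustration, and the character count assumes valid UTF-8 input.

```c
#include <stddef.h>

/* Hypothetical helper: count UTF-8 characters by counting every byte
 * that is not a continuation byte (10xxxxxx). Assumes valid UTF-8. */
size_t utf8_char_count(const char *s, size_t len)
{
    size_t n = 0;

    for (size_t i = 0; i < len; i++) {
        if (((unsigned char)s[i] & 0xC0) != 0x80)
            n++;
    }
    return n;
}

/* Hypothetical helper: an empty pattern can match once before each
 * unit and once at the end of the string, i.e. units + 1 times. */
size_t empty_match_positions(size_t units)
{
    return units + 1;
}
```

For "\342\200\257" (3 bytes, 1 character), empty_match_positions(3) gives the 4 seen in the output above, while empty_match_positions(utf8_char_count(s, 3)) gives 2. Which of the two gsub() ought to report is exactly the open question.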

In case some background would be useful, there's a discussion on this at the bottom of https://stackoverflow.com/a/77010950/1745001 - the person whose login there is "RARE Kpop Manifesto" advocating for not changing match() is the same Jason Kwan you've interacted with previously in this mailing list, e.g. at https://lists.gnu.org/archive/html/bug-gawk/2021-09/msg00073.html.


    Ed.

On 9/1/2023 2:25 PM, arnold@skeeve.com wrote:

Thanks Miguel, I think that looks good. Thanks for catching
the subexpression case.

Arnold

"Miguel Pineiro Jr." <mpj@pineiro.cc> wrote:

Hello, Arnold.

Parenthesized subexpressions also need fixing. Here's an alternative
patch in case it's of interest. I've also included the tests I used,
in case they're helpful.

Take care,
Miguel


#!/bin/sh

awk=${1:-gawk}
export LC_CTYPE=en_US.UTF-8

cat <<EOF
Correct Results:
1 0
1 1 1 0 2 0
2 0
2 2 3 0
================
EOF

$awk 'BEGIN {
        match("\342\200\257", /^/, m)
        print RSTART, RLENGTH
}'

$awk 'BEGIN {
        match("\342\200\257", /^(a?)\u202F(b?)$/, m)
        print RSTART, RLENGTH, m[1,"start"], m[1,"length"], m[2, "start"], m[2, "length"]
}'

$awk 'BEGIN {
        match("\342\200\257", /$/, m)
        print RSTART, RLENGTH
}'

$awk 'BEGIN {
        match("\342\200\257ac", /a(b?)c/, m)
        print RSTART, RLENGTH, m[1,"start"], m[1,"length"]
}'


diff --git a/ChangeLog b/ChangeLog
index 2fb55e19..b484d179 100644
--- a/ChangeLog
+++ b/ChangeLog
@@ -1,3 +1,16 @@
+2023-09-01         Miguel Pineiro Jr<mpj@pineiro.cc>
+
+       Fix the handling of zero-length matches in multibyte locales.
+       Thanks to Ed Morton<mortoneccc@comcast.net>  for the report.
+
+       * builtin.c (do_match): Translate rstart (byte idx to char idx)
+       even when rlength is zero. For this we tweak the conversion of
+       rlength to keep it in bounds when rstart and rlength are both 0.
+       * node.c (str2wstr): Add an entry to the indices array for the
+       terminating null. It facilitates the tweak above and is needed
+       to translate the idx of a zero-width match at the end of the
+       string.
+
  2023-08-27         Arnold D. Robbins<arnold@skeeve.com>

        * re.c (make_regexp): When do_traditional and looking to see

diff --git a/builtin.c b/builtin.c
index e394cc34..2bc0aaa3 100644
--- a/builtin.c
+++ b/builtin.c
@@ -2791,9 +2791,9 @@ do_match(int nargs)
                size_t *wc_indices = NULL;

                rlength = REEND(rp, t1->stptr) - RESTART(rp, t1->stptr); /* byte length */

-               if (rlength > 0 && gawk_mb_cur_max > 1) {
+               if (gawk_mb_cur_max > 1) {
                        t1 = str2wstr(t1, & wc_indices);
-                       rlength = wc_indices[rstart + rlength - 1] - wc_indices[rstart] + 1;
+                       rlength = wc_indices[rstart + rlength] - wc_indices[rstart];
                        rstart = wc_indices[rstart];
                }

@@ -2816,9 +2816,9 @@ do_match(int nargs)

                                        start = t1->stptr + s;
                                        subpat_start = s;
                                        subpat_len = len = SUBPATEND(rp, t1->stptr, ii) - s;
-                                       if (len > 0 && gawk_mb_cur_max > 1) {
+                                       if (gawk_mb_cur_max > 1) {
                                                subpat_start = wc_indices[s];
-                                               subpat_len = wc_indices[s + len - 1] - subpat_start + 1;
+                                               subpat_len = wc_indices[s + len] - subpat_start;
                                        }

                                        it = make_string(start, len);

diff --git a/node.c b/node.c
index 5de4e082..bc4e777d 100644
--- a/node.c
+++ b/node.c
@@ -851,7 +851,7 @@ str2wstr(NODE *n, size_t **ptr)
         * Create the array.
         */
        if (ptr != NULL) {
-               ezalloc(*ptr, size_t *, sizeof(size_t) * n->stlen, "str2wstr");
+               ezalloc(*ptr, size_t *, sizeof(size_t) * (n->stlen + 1), "str2wstr");
        }

        sp = n->stptr;

@@ -923,6 +923,11 @@ str2wstr(NODE *n, size_t **ptr)
                }
        }

+       /* Needed for zero-length matches at the end of a string */
+       assert(sp - n->stptr == n->stlen);
+       if (ptr != NULL)
+               (*ptr)[sp - n->stptr] = i;
+
        *wsp = L'\0';
        n->wstlen = wsp - n->wstptr;
        n->flags |= WSTRCUR;
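
As a rough, standalone illustration of the idea behind the patch above (not the gawk implementation): if the byte-offset-to-character-index table has one extra entry for the position just past the last byte, then a match's character length is simply the difference of two table lookups, and zero-length matches need no special case. The function names build_char_index and match_bytes_to_chars are hypothetical, and the sketch assumes valid UTF-8 and match offsets that fall on character boundaries.

```c
#include <stddef.h>
#include <stdlib.h>

/* Hypothetical helper: map each byte offset that starts a UTF-8
 * character to that character's 0-based index. The table has len + 1
 * entries; the last one covers the position just past the final byte.
 * Entries for continuation bytes stay 0 and are never consulted.
 * Returns NULL on allocation failure; the caller frees the result. */
size_t *build_char_index(const char *s, size_t len)
{
    size_t *idx = calloc(len + 1, sizeof *idx);
    size_t nchars = 0;

    if (idx == NULL)
        return NULL;
    for (size_t b = 0; b < len; b++) {
        if (((unsigned char)s[b] & 0xC0) != 0x80)
            idx[b] = nchars++;
    }
    idx[len] = nchars;	/* entry for the end-of-string position */
    return idx;
}

/* Hypothetical helper: convert a match given as a 0-based byte offset
 * and byte length into a 1-based character start and a character
 * length. Works unchanged when blen is 0. */
void match_bytes_to_chars(const size_t *idx, size_t bstart, size_t blen,
                          size_t *cstart, size_t *clen)
{
    *clen = idx[bstart + blen] - idx[bstart];
    *cstart = idx[bstart] + 1;
}
```

With the string "\342\200\257" (3 bytes), a zero-length match at byte offset 3 comes out as start 2, length 0, and a zero-length match at byte offset 0 as start 1, length 0, which agrees with the "2 0" and "1 0" lines in the test script's expected results.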


On Fri, Sep 1, 2023, at 12:28 AM, arnold@skeeve.com wrote:

Hi Ed.

This was a really interesting corner case. Good catch. The fix
is attached and will be in git eventually.

Thanks for the report!

Arnold

Ed Morton <mortoneccc@comcast.net> wrote:

Configuration Information [Automatically generated, do not change]:
Machine: x86_64
OS: cygwin
Compiler: gcc
Compilation CFLAGS: -ggdb -O2 -pipe -Wall -Werror=format-security
-Wp,-D_FORTIFY_SOURCE=2 -fstack-protector-strong
--param=ssp-buffer-size=4
-fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/build=/usr/src/debug/gawk-5.2.2-1
-fdebug-prefix-map=/cygdrive/d/a/scallywag/gawk/gawk-5.2.2-1.x86_64/src/gawk-5.2.2=/usr/src/debug/gawk-5.2.2-1
-DNDEBUG
uname output: CYGWIN_NT-10.0-22621 TournaMart_2023 3.4.8-1.x86_64
2023-08-17 17:02 UTC x86_64 Cygwin
Machine Type: x86_64-pc-cygwin

Gawk Version: 5.2.2

Attestation 1:
          I have read
https://www.gnu.org/software/gawk/manual/html_node/Bugs.html.
          Yes

Attestation 2:
          I have not modified the sources before building gawk.
          True

Description:
          Different string handling functions produce different results
for multi-byte characters.

Repeat-By:
          Without "-b":

          $ awk 'BEGIN{str="\342\200\257"; print length(str);
match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
          1
          1
          4

          Note that length() thinks that string is 1 character, the first
call to match() agrees, but then the 2nd call to match() thinks it's 3
characters (since RSTART tells us the "end of string" is at position 4).

          Now with "-b" ("Cause gawk to treat all input data as
single-byte characters" per
https://www.gnu.org/software/gawk/manual/gawk.html#Options):

          $ awk -b 'BEGIN{str="\342\200\257"; print length(str);
match(str,/.+/); print RLENGTH; match(str,/$/); print RSTART }'
          3
          3
          4

          Note that length() now thinks that string is 3 characters, the
first call to match() agrees again, and then the 2nd call to match() now
also agrees.

          Per the manual "in gawk, length(), substr(), split(), match()
and the other string functions ... all work in terms of characters in
the local character set, and not in terms of bytes." (from
https://www.gnu.org/software/gawk/manual/html_node/Bytes-vs_002e-Characters.html)
so I was expecting more consistent results between those 3 function
calls and that they'd basically all always agree with length()s results.
It may just be "match()" that has an issue, I haven't noticed a problem
with any other function but I haven't been looking for it.

Attachments:
* fix.diff
