bug#60618: unicode characters are not identified as such for \w and \b with -P

bug-grep

From:	Jim Meyering
Subject:	bug#60618: unicode characters are not identified as such for \w and \b with -P
Date:	Fri, 6 Jan 2023 23:28:44 -0800

On Fri, Jan 6, 2023 at 7:49 PM Carlo Arenas <carenas@gmail.com> wrote:
> Reported to PCRE[1] with mention of GNU grep being also affected.
>
> [1] https://github.com/PCRE2Project/pcre2/issues/185

Yikes. This is a big deal.
Thank you for the patch and added test.
I made a tiny comment tweak and this test logic change that was
required to make the new test pass with the fixed version.

-grep -Po 'r\w' in > out && fail=1
+grep -Po 'r\w' in > out || fail=1
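
A minimal sketch of why the operator had to flip, for readers less familiar with this test idiom. It is not the test from the patch: the file names "in" and "out" follow the diff above, but the input and pattern here are plain ASCII (an assumption made so the result does not depend on PCRE support or locale). The "&&" form records a failure when grep finds a match, while the "||" form records a failure when it does not.

```shell
#!/bin/sh
# Illustrative only: ASCII input and a POSIX bracket expression stand in
# for the real 'r\w' test, so any grep should behave the same way.
old=$PWD
dir=$(mktemp -d) || exit 1
cd "$dir" || exit 1
printf 'Peru\n' > in

# "||" form: fail is set only when grep exits nonzero (no match).
fail=0
grep 'r[[:alnum:]]' in > out || fail=1
expect_match=$fail
echo "|| form: fail=$expect_match"    # prints "|| form: fail=0"

# "&&" form: fail is set when grep succeeds, so a match counts as a failure.
fail=0
grep 'r[[:alnum:]]' in > out && fail=1
expect_no_match=$fail
echo "&& form: fail=$expect_no_match" # prints "&& form: fail=1"

cd "$old" || exit 1
rm -rf "$dir"
```

Since the fixed grep is expected to match "rú", only the "||" form lets the test pass.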

Also, "make syntax-check" required changing, e.g.,

-compare out exp || fail=1
+compare exp out || fail=1

Every bug fix needs a NEWS entry, so I added this:

  With -P, some non-ASCII UTF8 characters were not recognized as
  word-constituent due to our omission of the PCRE2_UCP flag. E.g.,
  given f(){ echo Perú|LC_ALL=en_US.UTF-8 grep -Po "$1"; } and
  this command, echo $(f 'r\w'):$(f '.\b'), before it would print ":r".
  After the fix, it prints the correct results: "rú:ú".
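
The example in that entry can be laid out as a small script, roughly as follows. This is a sketch under assumptions: it needs a GNU grep built with -P support and an installed en_US.UTF-8 locale, neither of which is guaranteed on a given system, and the "|| true" guards are an addition that is not part of the NEWS text.

```shell
#!/bin/sh
# f PATTERN: print each match of PATTERN in "Perú", one per line,
# using Perl-compatible regexps in a UTF-8 locale.
f() { echo 'Perú' | LC_ALL=en_US.UTF-8 grep -Po "$1"; }

# grep exits nonzero when nothing matches; "|| true" (not in the NEWS
# text) keeps the script going so an empty field can be shown.
result="$(f 'r\w' || true):$(f '.\b' || true)"

# Per the NEWS entry: an affected grep prints ":r", a fixed one "rú:ú".
echo "$result"
```

The first field is empty on an affected grep because "ú" is not treated as a word character, so 'r\w' has nothing to match. For the same reason the only word boundary that '.\b' can find is the one right after "r".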

Finally, I expanded the ChangeLog entry and gave credit where due.

I'll push this tomorrow:

Attachment: grep-pcre-fix.diff
