bug#60690: -P '\d' in GNU and git grep
From: Paul Eggert
Subject: bug#60690: -P '\d' in GNU and git grep
Date: Fri, 7 Apr 2023 12:00:16 -0700
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:102.0) Gecko/20100101 Thunderbird/102.9.0
On 2023-04-06 06:39, demerphq wrote:
> Unicode specifies that \d match any digit
> in any script that it supports.
"Specifies" is too strong. The Unicode Regular Expressions technical
standard (UTS#18) mentions \d only in Annex C[1], next to the word
"digit" in a column labeled "Property" (even though \d is really syntax,
not a property). This is at best an informal recommendation, not a
requirement, as UTS#18 0.2[2] says that UTS#18's syntax is only for
illustration and that although it's similar to Perl's, the two syntax
forms may not be exactly the same. So we can't look to UTS#18 for a
definitive way out of the \d mess, as the Unicode folks specifically
delegated matters to us.
Even ignoring the \d issue, the digit situation is messy. UTS#18 Annex C
says "\p{gc=Decimal_Number}" is the standard recommended syntax
assignment for digits. However, PCRE2 does not support this syntax; it
supports another variant \p{Nd} that UTS#18 also recommends. So it
appears that PCRE2 already does not implement every recommended aspect
of UTS#18 syntax. PCRE2 also doesn't match Perl, which does support
"\p{gc=Decimal_Number}".
Anyway, since grep -P '\p{Nd}' implements Unicode's decimal digit class,
that's clearly enough for grep -P to conform to UTS#18 with respect to
digits.
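To make the distinction concrete, here is a small Python sketch. It is not grep or PCRE2: Python's 're' module has no \p{Nd}, so the sketch checks the Decimal_Number (Nd) general category through 'unicodedata' and compares it with Python's own two readings of \d. The names SAMPLES, is_decimal_number and classify are made up for illustration.

```python
import re
import unicodedata

# Hypothetical sample characters, chosen for illustration only.
SAMPLES = {
    "7": "ASCII digit seven",
    "\u0663": "Arabic-Indic digit three",
    "\u0969": "Devanagari digit three",
    "\u00b2": "superscript two (general category No, not Nd)",
    "x": "Latin letter",
}

# In Python, \d on a str pattern follows general category Nd ...
UNICODE_DIGIT = re.compile(r"\d")
# ... unless re.ASCII narrows it to [0-9].
ASCII_DIGIT = re.compile(r"\d", re.ASCII)


def is_decimal_number(ch: str) -> bool:
    """True if ch has Unicode general category Nd (Decimal_Number)."""
    return unicodedata.category(ch) == "Nd"


def classify(ch: str) -> tuple[bool, bool, bool]:
    """Return (is Nd, matches Unicode \\d, matches ASCII-only \\d)."""
    return (
        is_decimal_number(ch),
        UNICODE_DIGIT.fullmatch(ch) is not None,
        ASCII_DIGIT.fullmatch(ch) is not None,
    )


if __name__ == "__main__":
    for ch, description in SAMPLES.items():
        print(f"U+{ord(ch):04X} {description}: {classify(ch)}")
```

The point is only that "decimal digit class" and "\d" are separable questions: the Nd class is fixed by Unicode data, while what \d means is a choice made by each regex engine or option.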
> A) how do you tell the regular expression
> engine what semantics you want and B) how does the regular expression
> library identify the encoding in the file, and how does it handle
> malformed content in that file.
Here's how GNU grep does it:
* RE semantics are specified via command-line options like -P.
* Text encoding is specified by locale, e.g., LC_ALL='en_US.utf8'.
* REs do not match encoding errors.
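The last two points can be imitated in a rough Python sketch. This is an analogy under stated assumptions, not how grep is implemented: the encoding is passed explicitly instead of being taken from the locale, undecodable bytes are marked with Python's "surrogateescape" error handler, and the pattern is then searched only within the validly decoded runs. The name search_valid_text is hypothetical.

```python
import re

# With errors="surrogateescape", each undecodable byte 0x80..0xFF becomes a
# lone surrogate in U+DC80..U+DCFF, which cannot occur in validly decoded text.
_ESCAPED_BYTE = re.compile("[\udc80-\udcff]")


def search_valid_text(data: bytes, pattern: str, encoding: str = "utf-8") -> bool:
    """Search pattern in data, never letting it match an encoding error.

    Assumption: splitting at encoding errors is a good enough model here.
    It also makes anchors such as ^ and $ apply per run, which a real
    implementation would have to handle differently.
    """
    text = data.decode(encoding, errors="surrogateescape")
    regex = re.compile(pattern)
    return any(regex.search(run) for run in _ESCAPED_BYTE.split(text))


if __name__ == "__main__":
    # Valid UTF-8: '.' can match the e-acute between 'a' and 'b'.
    print(search_valid_text("a\u00e9b".encode("utf-8"), r"a.b"))
    # Stray 0xFF byte: '.' must not match the encoding error.
    print(search_valid_text(b"a\xffb", r"a.b"))
    # Same bytes, different encoding, different answer.
    print(search_valid_text(b"a\xe9b", r"a.b", encoding="latin-1"))
```

The third call shows why the encoding has to come from somewhere outside the file: the same bytes are an error under UTF-8 and a perfectly good character under Latin-1.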
> on *nix there is no tradition of using BOM's to
> distinguish the 6 different possible encodings of Unicode (UTF-8,
> UTF-EBCDIC, UTF-16LE, UTF-16BE, UTF-32LE, UTF-32BE)
Yes, GNU/Linux never really experienced the joys of UTF-EBCDIC, Oracle
UTFE, UTF-16LE vs UTF-16BE etc. If you're running legacy IBM mainframe
or MS-Windows code these legacy encodings are obviously a big deal.
However, there seems little reason to force their nontrivial hassles
onto every GNU/Linux program that processes text. A few specialized apps
like 'iconv' deal with offbeat encodings, and that is probably a better
approach all around.
> there seems
> to be some level of desire of matching with unicode semantics against
> files that are not uniformly encoded in one of these formats.
That is a use case, yes. It's what 'strings' and 'grep' do.
[1]: https://unicode.org/reports/tr18/#Compatibility_Properties
[2]: https://unicode.org/reports/tr18/#Conformance