bug#38503: Locale can cause incorrect number parsing in binary files

bug-grep

From:	Eric Blake
Subject:	bug#38503: Locale can cause incorrect number parsing in binary files
Date:	Thu, 5 Dec 2019 14:29:19 -0600
User-agent:	Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2

tag 38503 notabug
thanks

On 12/5/19 12:30 PM, jan h wrote:

grep 3.3

I get a few weird symbols (seems valid utf-8), along with normal
numbers with the following simple snippet (.UTF-8 and .utf8 result in
same, even .UtF---8 is the same):
LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
wc -c counts 1047 and 1033 and 1036 etc, so they're multi-byte characters

It's important to note that POSIX says that the regex [0-9] has locale-dependent effects. Outside of the C/POSIX locale, it matches whatever the locale definition says it should. For example, some locales allow [A-Z] to match non-ASCII letters like Á. Similarly, as you have found, on your system, the en_US.UTF-8 locale is defined to match non-ASCII Unicode digits when a range expression for [0-9] is in force.

Note that the Rational Range Interpretation of ranges claims that [0-9] should have the expansion [0123456789] in ALL locales; and more and more versions of GNU utilities are starting to move to RRI (even newer glibc is trying to move towards RRI for more regex operations). If this example is run where RRI is in force, then it should not match non-ASCII Unicode digits. But you didn't mention which version of grep you are using, let alone which version of libc is providing your locale definitions, to make that determination; and POSIX does not require RRI.
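
One rough way to make that determination on a given system is to compare the range against the explicit list directly. This is only a sketch: rri_check is a hypothetical helper name, and the sample (three ASCII digits plus U+0663 ARABIC-INDIC DIGIT THREE) is an arbitrary choice for illustration, not data from the report.

```shell
# Sample bytes: ASCII digits 1, 2, 3 plus U+0663 (ARABIC-INDIC DIGIT THREE),
# written as octal escapes so the script itself stays pure ASCII.
sample=$(printf 'a1b2\331\243c3')

# rri_check LOCALE (hypothetical helper): prints "same" if [0-9] and
# [0123456789] select identical text from $sample under LOCALE,
# and "differs" otherwise.
rri_check() {
    range=$(printf '%s\n' "$sample" | LC_ALL="$1" grep -a -o '[0-9]')
    explicit=$(printf '%s\n' "$sample" | LC_ALL="$1" grep -a -o '[0123456789]')
    if [ "$range" = "$explicit" ]; then
        echo same
    else
        echo differs
    fi
}

rri_check C
```

Calling it with another installed locale name, such as en_US.UTF-8, would show whether that combination of grep and libc treats the two patterns alike for this one sample. A result of "same" on a single sample is a hint, not proof that RRI is in force.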

meanwhile, with LC_ALL being C.UTF-8 this is not the case,
LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
consistently results in 1024 characters/bytes, as it's supposed to be...

Well, in the POSIX locale (C.UTF-8 is not the POSIX locale, but follows enough of the same rules), [0-9] _is_ required to match the same as [0123456789]. That's the only locale where you get RRI for free, rather than having to worry if your choice of program version and locale definition provide it.
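
A script that wants only the ten ASCII digits from [0-9] can therefore force the C locale for the grep call alone. The sketch below uses a fixed input instead of /dev/urandom so that the result is repeatable; digits_only is a hypothetical helper name, and the input bytes (Á and U+0663 mixed in with ASCII digits) are made up for illustration.

```shell
# digits_only (hypothetical helper): keep only the ASCII digits from stdin.
# LC_ALL=C applies to this grep invocation only, not to the calling shell.
digits_only() {
    LC_ALL=C grep -a -o '[0-9]' | tr -d '\n'
}

# Input: "x4", then Á (octal 303 201), " 2 ", then U+0663 (octal 331 243), " 9".
out=$(printf 'x4\303\201 2 \331\243 9\n' | digits_only)
echo "$out"
```

Because every match in the C locale is a single byte, the byte count of the output equals the number of matches, which is the property the C.UTF-8 run in the report relied on.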

it's not just en_US, it seems ANY utf-8 locale, other than C results
in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
show this bug, nor does en_US.iso88591...

en_US.iso88591 does not have the problem because in that encoding, there aren't any non-ASCII digits. So [0-9] will never match any non-ASCII Unicode digits because the charset in use doesn't have such characters.


worthy of note is that [[:digit:]] works correctly, while [0-9] does
not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
doesn't change anything either...

POSIX requires [[:digit:]] to expand to the same 10 characters in ALL locales, regardless of what the implementation does with [0-9], and regardless of whether an implementation uses RRI. (This is true for [[:digit:]], but not for other named ranges; for example, [[:alpha:]] is still locale-dependent and may expand to more than 26 characters).
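
As a small illustration of that guarantee, the following sketch runs [[:digit:]] without overriding the locale at all. The sample string is an arbitrary one chosen for this example (ASCII 1, 2, 3 plus U+0663), and the variable names are placeholders.

```shell
# Sample bytes: ASCII digits 1, 2, 3 plus U+0663 (ARABIC-INDIC DIGIT THREE),
# written as octal escapes so the script itself stays pure ASCII.
sample=$(printf 'a1b2\331\243c3')

# No LC_ALL override here: [[:digit:]] should pick out only the ASCII
# digits whatever locale the caller happens to be running in.
class=$(printf '%s\n' "$sample" | grep -a -o '[[:digit:]]' | tr -d '\n')
echo "$class"
```

The same pipeline with [[:alpha:]] carries no such guarantee, since its expansion may vary with the locale.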

Since the problem you reported is due to your locale, I'm closing this as a non-bug. We may reopen it if additional details show that your version of grep was supposed to be using RRI but failed to do so. And feel free to continue conversation, even if we don't reopen the bug.


--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
