bug-grep

bug#38503: Locale can cause incorrect number parsing in binary files


From: Eric Blake
Subject: bug#38503: Locale can cause incorrect number parsing in binary files
Date: Thu, 5 Dec 2019 14:29:19 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:68.0) Gecko/20100101 Thunderbird/68.2.2

tag 38503 notabug
thanks

On 12/5/19 12:30 PM, jan h wrote:
grep 3.3

I get a few weird symbols (seems valid utf-8), along with normal
numbers with the following simple snippet (.UTF-8 and .utf8 result in
same, even .UtF---8 is the same):
LC_ALL=en_US.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"
wc -c counts 1047, 1033, 1036, etc., so they're multi-byte characters

It's important to note that POSIX says that the regex [0-9] has locale-dependent effects. Outside of the C/POSIX locale, it matches whatever the locale definition says it should. For example, some locales allow [A-Z] to match non-ASCII letters like Á. Similarly, as you have found, on your system, the en_US.UTF-8 locale is defined to match non-ASCII Unicode digits when a range expression for [0-9] is in force.
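The byte counts above 1024 make sense once you remember that a non-ASCII digit takes more than one byte in UTF-8. Here is a small sketch using U+0663 ARABIC-INDIC DIGIT THREE as the sample; that character is my own choice for illustration and does not come from the report. Only the C locale case is shown as a fixed result, because what a UTF-8 locale does with it depends on your grep and libc.

```shell
# U+0663 ARABIC-INDIC DIGIT THREE is the two bytes 0xD9 0xA3 in UTF-8.
# (Sample character chosen for illustration; it is not from the report.)
sample=$(printf '\331\243')

# Count its bytes; arithmetic expansion strips any padding wc may add.
nbytes=$(( $(printf '%s' "$sample" | wc -c) ))

# In the C locale each byte is its own character and neither byte is in
# 0-9, so no line matches. grep exits 1 on no match, hence "|| true".
c_matches=$(printf '%s\n' "$sample" | LC_ALL=C grep -a -c '[0-9]' || true)

echo "bytes=$nbytes C-locale-matches=$c_matches"
```

To probe your own setup, swap LC_ALL=C for the locale you care about and see whether the count changes.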

Note that the Rational Range Interpretation (RRI) says that [0-9] should have the expansion [0123456789] in ALL locales; and more and more versions of GNU utilities are moving to RRI (even newer glibc is trying to move towards RRI for more regex operations). If this example is run where RRI is in force, then it should not match non-ASCII Unicode digits. But you didn't mention which version of grep you are using, let alone which version of libc is providing your locale definitions, to make that determination; and POSIX does not require RRI.

meanwhile, with LC_ALL being C.UTF-8 this is not the case,
LC_ALL=C.UTF-8 grep -o "[0-9]" -a /dev/urandom|head -n 1024|tr -d "\n"|wc -c
consistently results in 1024 characters/bytes, as it's supposed to be...

Well, in the POSIX locale (C.UTF-8 is not the POSIX locale, but follows enough of the same rules), [0-9] _is_ required to match the same as [0123456789]. That's the only locale where you get RRI for free, rather than having to worry if your choice of program version and locale definition provide it.
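If you need ASCII-only matching no matter how the locale defines ranges, two workarounds follow from this: force the C locale for the grep call, or spell the ten digits out so there is no range to interpret. A minimal sketch, with a made-up input string (the variable names are only for illustration):

```shell
# Made-up input: ASCII digits, letters, and the UTF-8 bytes of U+0663
# (an illustrative non-ASCII digit).
input=$(printf '12ab\331\24334')

# Option 1: force the C locale just for grep, so [0-9] is exactly 0..9.
via_c_locale=$(printf '%s\n' "$input" | LC_ALL=C grep -a -o '[0-9]' | tr -d '\n')

# Option 2: list the ten digits explicitly; with no range in the bracket
# expression, the locale's collation order has nothing to reinterpret.
via_explicit=$(printf '%s\n' "$input" | grep -a -o '[0123456789]' | tr -d '\n')

echo "$via_c_locale $via_explicit"
```

Option 1 also changes how grep decodes the input (every byte becomes one character), so Option 2 is the safer choice when the rest of the pattern relies on multi-byte characters.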

it's not just en_US, it seems ANY utf-8 locale, other than C results
in this bug, whereas non-utf8 versions are fine, bare en_US doesn't
show this bug, nor does en_US.iso88591...

en_US.iso88591 does not have the problem because in that encoding, there aren't any non-ASCII digits. So [0-9] will never match any non-ASCII Unicode digits because the charset in use doesn't have such characters.


worthy of note is that [[:digit:]] works correctly, while [0-9] does
not (and 1-9 is same bug as 0-9, if you were wondering), setting -E
doesn't change anything either...

POSIX requires [[:digit:]] to expand to the same 10 characters in ALL locales, regardless of what the implementation does with [0-9], and regardless of whether an implementation uses RRI. (This is true for [[:digit:]], but not for other character classes; for example, [[:alpha:]] is still locale-dependent and may expand to more than the 52 ASCII letters).
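As a quick check of that, the sketch below runs [[:digit:]] in whatever locale is currently active, against a made-up input containing the UTF-8 bytes of U+0663 (again my own sample, not from the report). On a conforming system only the ASCII digits should come out.

```shell
# Made-up input: ASCII digits around the UTF-8 bytes of U+0663.
input=$(printf 'a1\331\2432b')

# No LC_ALL override here: [[:digit:]] is meant to be the same ten
# characters in every locale, so the active locale should not matter.
digits=$(printf '%s\n' "$input" | grep -a -o '[[:digit:]]' | tr -d '\n')

echo "$digits"
```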

Since the problem you reported is due to your locale, I'm closing this as a non-bug. We may reopen it if additional details show that your version of grep was supposed to be using RRI but failed to do so. And feel free to continue conversation, even if we don't reopen the bug.

--
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3226
Virtualization:  qemu.org | libvirt.org
