bug-gnu-utils

Re: Bug of grep -E


From: Eric Blake
Subject: Re: Bug of grep -E
Date: Wed, 6 Dec 2017 09:32:57 -0600
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:52.0) Gecko/20100101 Thunderbird/52.4.0

On 12/06/2017 09:02 AM, iPack wrote:
> address@hidden ~]$ cat test
> https://konachan.com/image/a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20-%20255941%20blush%20brown_eyes%20crying%20fate_kaleid_liner_prisma_illya%20fate_%28series%29%20japanese_clothes%20kimono%20long_hair%20miyu_edelfelt%20purple_hair%20tagme_%28artist%29%20tears.jpg
> 
> address@hidden ~]$ cat test | grep -Eo '[0-9a-f]{32}/[0-9A-Za-z%_\.\-]+'
> a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20-%20255941%20blush%20brown_eyes%20crying%20fate_kaleid_liner_prisma_illya%20fate_%28series%29%20japanese_clothes%20kimono%20long_hair%20miyu_edelfelt%20purple_hair%20tagme_%28artist%29%20tears.jpg
> 
> address@hidden ~]$ cat test | grep -Eo '[0-9a-f]{32}/[0-9A-Za-z\-%_\.]+'
> a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20
> 
> It is bug ? or just my syntax error ?

Your syntax error.

In the C locale,

[0-9A-Za-z%_\.\-] matches digits, letters, %, _, \ (listed twice, but
the second listing is ignored), ., and -.

[0-9A-Za-z\-%_\.] matches digits, letters, the range of ASCII bytes
between \ and % (whoops - in ASCII, \ is 92 but % is 37 - you have a
backwards range, so that portion of the bracket expression matches
nothing at all), then _, \, and .  Hence, '-' is not one of the
characters matched, and grep's output is shorter.  POSIX permits the
behavior you saw; it also permits an implementation that refuses to
grep at all by declaring your regex invalid because of the backwards
range.
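
If you want to check the ordering yourself, here is a minimal sketch
using POSIX printf, which prints the code point of the character that
follows a leading quote in its argument.  The variable names are only
for illustration.

```shell
# POSIX printf: an argument starting with a quote character yields the
# numeric value of the character that follows the quote.
backslash=$(printf '%d' "'\\")   # backslash
percent=$(printf '%d' "'%")      # percent sign
hyphen=$(printf '%d' "'-")       # hyphen
printf 'backslash=%s percent=%s hyphen=%s\n' "$backslash" "$percent" "$hyphen"
# prints: backslash=92 percent=37 hyphen=45
```

Since the range start sorts after the range end, \-% is a backwards
range, and the hyphen itself is never added to the set.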

In non-C locales, use of - in a [] expression that is not either the
first or the last member of the set is implementation-defined, and all
bets are off on what it matches (lately, GNU tools have been moving
towards rational-range-interpretation, which means treating the range as
the same bytes as it would match in the C locale; but other
implementations, or even older versions of GNU tools, tried to get fancy
and match any character that would collate between the two endpoints,
which gets weird fast).

It _looks_ like you were trying to use \- and \. as escape sequences.
But inside [] (at least, in the Extended Regular Expression syntax of
'grep -E' as defined by POSIX), \ is not an escape character; and
nothing needs escaping (there are only special rules about where ], ^,
and - must be placed to be taken literally).  Yes, there are other
flavors of regex engines (perl, for example) where \ DOES act as an
escape even inside [].  Which is why it is essential that you know the
quirks of each regex engine you are targeting.
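
A portable way to write what you probably meant is to drop the
backslashes and put - last in the bracket expression, where it is
always literal.  This is only a sketch: the url value below is a
shortened, hypothetical stand-in for the line in your test file, and
LC_ALL=C is set so the ranges are interpreted as in the C locale.

```shell
# Hypothetical shortened input, standing in for the URL in the report.
url='https://konachan.com/image/a4ff5caad2fa35faa2271df9badacd35/Konachan.com%20-%20255941%20tears.jpg'

# '-' is the last member of the set, so it is a literal hyphen;
# '.' is already literal inside [] and needs no backslash.
result=$(printf '%s\n' "$url" | LC_ALL=C grep -Eo '[0-9a-f]{32}/[0-9A-Za-z%_.-]+')
printf '%s\n' "$result"
```

With - at the end of the set, the match runs through the whole file
name instead of stopping at the first hyphen.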

By the way, bug-gnu-utils is no longer the preferred bug reporting
address for grep; it means your version of grep is probably quite
outdated.  These days, 'grep --help' suggests address@hidden for
reporting bugs.

-- 
Eric Blake, Principal Software Engineer
Red Hat, Inc.           +1-919-301-3266
Virtualization:  qemu.org | libvirt.org
