bug#22028: grep -Pc / grep -P | wc -l inconsistent results
From: Norihiro Tanaka
Subject: bug#22028: grep -Pc / grep -P | wc -l inconsistent results
Date: Sat, 28 Nov 2015 15:16:30 +0900
On Fri, 27 Nov 2015 06:29:31 -0500 (EST)
Jaroslav Skarvada <address@hidden> wrote:
> Hi,
>
> it seems for long files which starts with non binary data and if PCRE matcher
> is used, grep works in TEXTBIN_UNKNOWN mode until it finds binary data, then it
> switches to TEXTBIN_BINARY. But in -Pc mode in TEXTBIN_BINARY it exits
> on next match causing bogus -Pc results.
>
> Reproducer:
> $ grep -P -c 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt
> 1
> $ grep -P 'Blocked by (SpamAssassin|Spamfilter)' ./filtered.txt | wc -l
> 2
>
> The ./filtered.txt is long enough text file, that contains some NULLs after the
> first 32kB text, e.g. https://bugzilla.redhat.com/attachment.cgi?id=1080646
>
> Original downstream bugzilla:
> https://bugzilla.redhat.com/attachment.cgi?id=1080646
>
> Attached is my attempt to fix it, but it may be not the right way
> how to fix it. Especially the question is whether it should stop when
> it finds binary data or not. But at least the grep -Pc / grep -P | wc -l
> should behave the same
>
> thanks & regards
>
> Jaroslav
I see that filtered.txt is a binary file, since it contains NULs at line 647.
However, its first 32768 bytes are correctly encoded.
If the first 32768 bytes of a file are correctly encoded, grep -P marks the
file not as TEXTBIN_TEXT but as TEXTBIN_UNKNOWN, and once grep finds the first
match, it marks the file as TEXTBIN_TEXT and treats it as a text file from
then on. However, grep -P -c does not take that last step, so the file stays
TEXTBIN_UNKNOWN and the count can come up short once the NULs are reached.
As a workaround, you can get the number of matched lines with grep -a -P -c.
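For illustration, here is a minimal sketch of that workaround. The test file it builds is hypothetical (it is not the reporter's filtered.txt, only a file of similar shape: more than 32 KiB of text, then a NUL between two matching lines), and it assumes GNU grep built with PCRE support. With -a the file is always processed as text, so both ways of counting should agree.

```shell
#!/bin/sh
# Hypothetical test file: over 32 KiB of plain text, then a matching line,
# a line containing a NUL byte, and a second matching line.
f=$(mktemp)
{
  i=0
  while [ "$i" -lt 1000 ]; do
    echo "ordinary text line $i, padding padding padding padding"
    i=$((i + 1))
  done
  echo "Blocked by SpamAssassin"
  printf 'binary\000data\n'
  echo "Blocked by Spamfilter"
} > "$f"

pattern='Blocked by (SpamAssassin|Spamfilter)'

# -a forces text mode, so the binary-file heuristics are bypassed.
count_c=$(grep -a -P -c "$pattern" "$f")
count_wc=$(grep -a -P "$pattern" "$f" | wc -l | tr -d ' ')

echo "grep -a -P -c:       $count_c"
echo "grep -a -P | wc -l:  $count_wc"

rm -f "$f"
```

Dropping -a from both commands on an affected grep version is one way to check whether the -Pc discrepancy described in this report is present.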
Thanks,
Norihiro
0001-grep-P-grep-Pc-consistent-results.patch
Description: Text document