Re: Possible bug when field-separator used
From: Erik Auerswald
Subject: Re: Possible bug when field-separator used
Date: Thu, 30 Nov 2023 14:22:18 +0100
Hi Jeroen,
I think this is an interaction between GNU Datamash's locale support
and the way it parses numbers. You can work around it by temporarily
overriding the locale settings for a single invocation:
echo -e "a,14,1\nb,1,14\na,2,1" | \
LC_ALL=C datamash --field-separator=, -s groupby 1 sum 2,3
--> a,16,2
--> b,1,14
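If the override is needed in several places, it can be wrapped in a small helper. The following is only a sketch: the function name with_c_locale is made up for illustration and is not part of GNU Datamash. It relies only on the fact that a variable assignment placed in front of a command applies to that one command.

```shell
# Hypothetical helper (not part of datamash): run any command with
# LC_ALL=C so that '.' is the decimal separator for that command only.
with_c_locale() {
    LC_ALL=C "$@"
}

# Use it with the failing example from this thread, if datamash is installed.
if command -v datamash >/dev/null 2>&1; then
    printf 'a,14,1\nb,1,14\na,2,1\n' | \
        with_c_locale datamash --field-separator=, -s groupby 1 sum 2,3
fi
```

The locale of the surrounding shell session is left untouched, so other programs keep their localized behavior.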
The problem occurs as soon as the second column is summed over:
echo -e "a,14,1\nb,1,14\na,2,1" | \
datamash --field-separator=, -s groupby 1 sum 2
--> datamash: invalid numeric value in line 1 field 2: '14'
The root cause is that GNU Datamash uses the locale settings for parsing
its input, and thus treats ',' as the decimal separator in some locales
(e.g., de_DE.UTF-8). This conflicts with using ',' as the field
separator.
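To check whether a given environment is affected, you can ask the system which decimal separator a locale defines. This is a rough sketch that assumes the standard locale utility is available; the function name decimal_point is made up for illustration.

```shell
# Hypothetical helper: print the decimal separator defined by the
# named locale, or by the current environment if no name is given.
decimal_point() {
    if [ -n "$1" ]; then
        LC_ALL="$1" locale decimal_point
    else
        locale decimal_point
    fi
}

decimal_point C    # the C locale always uses '.'
decimal_point      # whatever the current environment uses
```

Calling decimal_point de_DE.UTF-8 should print ',' on systems where that locale is installed. If the current environment reports ',', then comma-separated input is likely to hit the problem described above.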
I have not looked into the code and thus do not know how involved it
would be to fix this. (I do think this is a bug.)
Best regards,
Erik
--
The computing scientist’s main challenge is not to get confused by
the complexities of his own making.
-- Edsger W. Dijkstra
On Thu, Nov 30, 2023 at 09:51:33AM +0100, Jeroen Hoek wrote:
> Hello!
>
> I was trying to sum up values from a CSV input, and I am seeing
> something odd.
>
> Tested in datamash 1.4 and 1.8.
>
>
> This works:
>
>
> # Tab separated input, sum column 2 and 3.
>
> echo -e "a\t14\t1\nb\t1\t14\na\t2\t1" | \
> datamash -s groupby 1 sum 2,3
>
>
> # Space separated input, sum column 2 and 3.
>
> echo -e "a 14 1\nb 1 14\na 2 1" | \
> datamash --field-separator=' ' -s groupby 1 sum 2,3
>
>
> # Semicolon separated input, sum column 2 and 3.
>
> echo -e "a;14;1\nb;1;14\na;2;1" | \
> datamash --field-separator=';' -s groupby 1 sum 2,3
>
>
> # Comma separated input, sum ONLY column 3.
>
> echo -e "a,14,1\nb,1,14\na,2,1" | \
> datamash --field-separator=, -s groupby 1 sum 3
>
>
> This fails:
>
>
> # Comma separated input, sum column 2 and 3.
>
> echo -e "a,14,1\nb,1,14\na,2,1" | \
> datamash --field-separator=, -s groupby 1 sum 2,3
>
> datamash: invalid numeric value in line 1 field 2: '14'
>
>
> Is this a bug, or am I overlooking something?
>
> Kind regards,
>
> Jeroen Hoek