bug-datamash

Re: Possible bug when field-separator used


From: Erik Auerswald
Subject: Re: Possible bug when field-separator used
Date: Thu, 30 Nov 2023 14:22:18 +0100

Hi Jeroen,

I think this is an interaction between the locale support of GNU Datamash
and the way GNU Datamash parses numbers.  You can work around it by
temporarily overriding the locale settings:

    echo -e "a,14,1\nb,1,14\na,2,1" | \
      LC_ALL=C datamash --field-separator=, -s groupby 1 sum 2,3
    --> a,16,2
    --> b,1,14
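
If the workaround is needed in more than one place, it could be wrapped
in a small shell function.  This is only a sketch: the name in_c_locale
is made up for illustration and is not part of GNU Datamash.  Note that
forcing LC_ALL=C changes every locale category for the command, so sort
order and number formatting in the output follow the C locale as well.

```shell
# Hypothetical helper (not part of datamash): run the given command
# with every locale category forced to the C locale.
in_c_locale() {
    env LC_ALL=C "$@"
}
```

It would then be used like this:

    echo -e "a,14,1\nb,1,14\na,2,1" | \
      in_c_locale datamash --field-separator=, -s groupby 1 sum 2,3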

The problem occurs as soon as the second column is summed:

    echo -e "a,14,1\nb,1,14\na,2,1" | \
      datamash --field-separator=, -s groupby 1 sum 2
    --> datamash: invalid numeric value in line 1 field 2: '14'

The root cause is that GNU Datamash uses the locale settings for parsing
its input, and thus treats ',' as the decimal separator in some locales
(e.g., in the de_DE.UTF-8 locale).  This clashes with using ',' as the
field separator.
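
To see whether a given environment is affected, one can compare the
decimal separator of the active locale with that of the C locale.  The
following is a minimal sketch that assumes the POSIX locale utility is
available; the variable names are chosen for illustration only.

```shell
# Decimal separator of the numeric locale currently in effect.
current_sep=$(locale decimal_point)
# Decimal separator of the C locale, for comparison.
c_sep=$(LC_ALL=C locale decimal_point)

if [ "$current_sep" = "$c_sep" ]; then
    echo "decimal separator is '$current_sep', same as in the C locale"
else
    echo "decimal separator is '$current_sep', C locale uses '$c_sep'"
fi
```

If the first value is ',', the active locale is one in which a ',' field
separator can collide with number parsing as described above.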

I have not looked into the code and thus do not know how involved it
would be to fix this.  (I do think this is a bug.)

Best regards,
Erik
-- 
The computing scientist’s main challenge is not to get confused by
the complexities of his own making.
                        -- Edsger W. Dijkstra


On Thu, Nov 30, 2023 at 09:51:33AM +0100, Jeroen Hoek wrote:
> Hello!
> 
> I was trying to sum up values from a CSV input, and I am seeing
> something odd.
> 
> Tested in datamash 1.4 and 1.8.
> 
> 
> This works:
> 
> 
> # Tab separated input, sum column 2 and 3.
> 
> echo -e "a\t14\t1\nb\t1\t14\na\t2\t1" | \
>     datamash -s groupby 1 sum 2,3
> 
> 
> # Space separated input, sum column 2 and 3.
> 
> echo -e "a 14 1\nb 1 14\na 2 1" | \
>     datamash --field-separator=' ' -s groupby 1 sum 2,3
> 
> 
> # Semicolon separated input, sum column 2 and 3.
> 
> echo -e "a;14;1\nb;1;14\na;2;1" | \
>     datamash --field-separator=';' -s groupby 1 sum 2,3
> 
> 
> # Comma separated input, sum ONLY column 3.
> 
> echo -e "a,14,1\nb,1,14\na,2,1" | \
>     datamash --field-separator=, -s groupby 1 sum 3
> 
> 
> This fails:
> 
> 
> # Comma separated input, sum column 2 and 3.
> 
> echo -e "a,14,1\nb,1,14\na,2,1" | \
>     datamash --field-separator=, -s groupby 1 sum 2,3
> 
> datamash: invalid numeric value in line 1 field 2: '14'
> 
> 
> Is this a bug or am I overlooking something?
> 
> Kind regards,
> 
> Jeroen Hoek


