bug-datamash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Possible bug when field-separator used


From: Tim Rice
Subject: Re: Possible bug when field-separator used
Date: Thu, 30 Nov 2023 20:49:44 +0000

Hi Jeroen,

Thanks for the bug report. I struggle to find time these days, but I've added 
fixing the bug to my todo list.

I agree that when there are interactions between the locale and the field 
separator, the input should be considered faulty. In this case, do you think it 
would be sufficient to simply report an error and stop further processing?

If you have some cycles to contribute, it would help me get started if there 
was a test available. The tests are written in perl, so you may find it fairly 
easy to add a new one, by copying from another test that already exists. It 
would be great to see a patch which implements a test for whether GNU Datamash 
handles this faulty data correctly.

Once we agree on a clear test case, it should make it a lot easier to work on 
solving the problem.

~ Tim


On Thu, Nov 30, 2023 at 02:52:30PM +0100, Jeroen Hoek wrote:
Hi Erik,

Good catch on the locale, that is indeed part of the reproduction case.
I didn't consider that.


Works:

echo -e "a,14,1\nb,1,14\na,2,1" | \
    LC_NUMERIC=en_GB.UTF-8 \
    datamash --field-separator=, -s groupby 1 sum 2,3


Fails:

echo -e "a,14,1\nb,1,14\na,2,1" | \
    LC_NUMERIC=nl_NL.UTF-8 \
    datamash --field-separator=, -s groupby 1 sum 2,3


And indeed, in Dutch as in German, the decimal separator is the comma.
Using . as separator reverses this: en_GB fails and nl_NL works.

As an end user I wouldn't mind if decimal separators which happen to
match the specified field separator do not get interpreted as decimal
separators at all. I would consider such input as faulty. (This goes for
periods too in relevant locales I suppose).

Thanks for looking into this.


On 30-11-2023 14:22, Erik Auerswald wrote:
Hi Jeroen,

I think this is an interaction with the locale support of GNU Datamash
and the way GNU Datamash parses numbers.  You can work around it by
temporarily overwriting the locale settings:

     echo -e "a,14,1\nb,1,14\na,2,1" | \
       LC_ALL=C datamash --field-separator=, -s groupby 1 sum 2,3
     --> a,16,2
     --> b,1,14

The problem occurs as soon as the second column is summed over:

     echo -e "a,14,1\nb,1,14\na,2,1" | \
       datamash --field-separator=, -s groupby 1 sum 2
     --> datamash: invalid numeric value in line 1 field 2: '14'

The root cause is that GNU Datamash uses the locale settings for parsing
its input, and thus treats ',' as decimal separator in some locales
(e.g., in the de_DE.UTF-8 locale).  This interacts with using ',' as
field separator.

I have not looked into the code and thus do not know how involved it
would be to fix this.  (I do think this is a bug.)

Best regards,
Erik



reply via email to

[Prev in Thread] Current Thread [Next in Thread]