bug-datamash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: Possible bug when field-separator used


From: Erik Auerswald
Subject: Re: Possible bug when field-separator used
Date: Fri, 1 Dec 2023 09:16:20 +0100

Hi all,

I had looked into this last year, see the thread starting at
<https://lists.gnu.org/archive/html/bug-datamash/2022-07/msg00015.html>.

I had added a few failing tests, commented out, as documentation
for the issues, to the file "tests/datamash-tests-2.pl"
in commit 2b181b38e86f067fa68a7b174171a24e80313361
<https://git.savannah.gnu.org/gitweb/?p=datamash.git;a=blob;f=tests/datamash-tests-2.pl;h=3d09c19138f6a83d19074c4aa8c122964f4b24ea;hb=HEAD#l692>.

The root cause is that datamash uses strtold() on the unchanged input
line.  Thus strtold() continues parsing over the field separator, which
also is the decimal separator.  There are further edge cases, e.g.,
with exponent indicators or special keywords (e.g., nan).

In <https://lists.gnu.org/archive/html/bug-datamash/2022-07/msg00018.html>
I wrote:

> This is done in the field_op_collect() function in the src/field-ops.c
> file.
> 
> This function contains code to copy the field contents into a temporary
> string with NUL termination that is only used if the system's strtold()
> does not stop at the default field separator TAB.  Always using this
> code should fix this problem.

I did continue back then, but I'd expect this could work.

Best regards,
Erik


On Thu, Nov 30, 2023 at 08:49:44PM +0000, Tim Rice wrote:
> Hi Jeroen,
> 
> Thanks for the bug report. I struggle to find time these days, but I've added 
> fixing the bug to my todo list.
> 
> I agree that when there are interactions between the locale and the field 
> separator, the input should be considered faulty. In this case, do you think 
> it would be sufficient to simply report an error and stop further processing?
> 
> If you have some cycles to contribute, it would help me get started if there 
> was a test available. The tests are written in perl, so you may find it 
> fairly easy to add a new one, by copying from another test that already 
> exists. It would be great to see a patch which implements a test for whether 
> GNU Datamash handles this faulty data correctly.
> 
> Once we agree on a clear test case, it should make it a lot easier to work on 
> solving the problem.
> 
> ~ Tim
> 
> 
> On Thu, Nov 30, 2023 at 02:52:30PM +0100, Jeroen Hoek wrote:
> >Hi Erik,
> >
> >Good catch on the locale, that is indeed part of the reproduction case.
> >I didn't consider that.
> >
> >
> >Works:
> >
> >echo -e "a,14,1\nb,1,14\na,2,1" | \
> >    LC_NUMERIC=en_GB.UTF-8 \
> >    datamash --field-separator=, -s groupby 1 sum 2,3
> >
> >
> >Fails:
> >
> >echo -e "a,14,1\nb,1,14\na,2,1" | \
> >    LC_NUMERIC=nl_NL.UTF-8 \
> >    datamash --field-separator=, -s groupby 1 sum 2,3
> >
> >
> >And indeed, in Dutch as in German, the decimal separator is the comma.
> >Using . as separator reverses this: en_GB fails and nl_NL works.
> >
> >As an end user I wouldn't mind if decimal separators which happen to
> >match the specified field separator do not get interpreted as decimal
> >separators at all. I would consider such input as faulty. (This goes for
> >periods too in relevant locales I suppose).
> >
> >Thanks for looking into this.
> >
> >
> >On 30-11-2023 14:22, Erik Auerswald wrote:
> >>Hi Jeroen,
> >>
> >>I think this is an interaction with the locale support of GNU Datamash
> >>and the way GNU Datamash parses numbers.  You can work around it by
> >>temporarily overwriting the locale settings:
> >>
> >>     echo -e "a,14,1\nb,1,14\na,2,1" | \
> >>       LC_ALL=C datamash --field-separator=, -s groupby 1 sum 2,3
> >>     --> a,16,2
> >>     --> b,1,14
> >>
> >>The problem occurs as soon as the second column is summed over:
> >>
> >>     echo -e "a,14,1\nb,1,14\na,2,1" | \
> >>       datamash --field-separator=, -s groupby 1 sum 2
> >>     --> datamash: invalid numeric value in line 1 field 2: '14'
> >>
> >>The root cause is that GNU Datamash uses the locale settings for parsing
> >>its input, and thus treats ',' as decimal separator in some locales
> >>(e.g., in the de_DE.UTF-8 locale).  This interacts with using ',' as
> >>field separator.
> >>
> >>I have not looked into the code and thus do not know how involved it
> >>would be to fix this.  (I do think this is a bug.)
> >>
> >>Best regards,
> >>Erik



reply via email to

[Prev in Thread] Current Thread [Next in Thread]