bug-datamash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: [Bug-datamash] Unable to use comma as field separator when the local


From: Erik Auerswald
Subject: Re: [Bug-datamash] Unable to use comma as field separator when the locale uses it as a decimal separator too
Date: Sun, 17 Jul 2022 13:42:33 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0

Hi,

On 14.07.22 14:41, Erik Auerswald wrote:
On Wed, Sep 12, 2018 at 05:42:09PM +0200, Jérémie Roquet wrote:
[...]
With datamash using the locale to determine which decimal separator to
use, the behavior becomes inconsistent when the field separator and
the decimal separator are the same.

For example:

   $ printf '1,2\n' | LC_NUMERIC=fr_FR.UTF-8 datamash -t, sum 1
   datamash: invalid numeric value in line 1 field 1: '1'
   $ printf '1,2\n' | LC_NUMERIC=fr_FR.UTF-8 datamash -t, sum 2
   2

whereas:

   $ printf '1,2\n' | LC_NUMERIC=en_US.UTF-8 datamash -t, sum 1
   1
   $ printf '1,2\n' | LC_NUMERIC=en_US.UTF-8 datamash -t, sum 2
   2

I can reproduce this with a German locale and current datamash, too:

     $ echo $LC_NUMERIC
     de_DE.UTF-8
     $ ./datamash --version | head -n1
     datamash (GNU datamash) 1.7.34-4187
     $ echo 1,2 | ./datamash -t, sum 1
     ./datamash: invalid numeric value in line 1 field 1: '1'
     $ echo 1,2 | ./datamash -t, sum 2
     2

Not all operations are affected:

     $ echo 1,2 | ./datamash -t, count 1
     1
     $ echo 1,2 | ./datamash -t, count 2
     1

This issue affects using a period ('.') as field separator in locales
where a period is used as decimal point, too:

     $ echo 1.2 | env LC_ALL=C ./datamash -t. sum 1
     ./datamash: invalid numeric value in line 1 field 1: '1'
     $ echo 1.2 | env LC_ALL=C ./datamash -t. sum 2
     2

     $ echo 1.2 | env LC_NUMERIC=enUS.UTF-8  ./datamash -t. sum 1
     ./datamash: invalid numeric value in line 1 field 1: '1'
     $ echo 1.2 | env LC_NUMERIC=enUS.UTF-8  ./datamash -t. sum 2
     2

In my opinion, this is surprising because:
  - working on “raw” tabular data, you don't expect “smart” handling of
anything (that would have been less surprising from Microsoft Excel,
because it implements escaping of separators in formats like CSV);
  - parsing works well for the last column, but not for the others;
  - the error message reports a well-parsed numeric value, not an
obviously invalid one.

I concur that the error message does not make sense.

I think that this is a bug.  I have not yet looked at the code regarding
this issue, thus I cannot say how effortful it would be to fix it.

This problem results from the way GNU Datamash parses its input.
The input data is kept as-is, and the individual fields are designated
via pointers to the beginning and the field length.  Thus the input
field separator is part of the string containing the field data unless
processing the last field of a line.

The start pointer of the field is given to the floating point parsing
function strtold().  If the field separator is the same as the decimal
separator, strtold() does not stop at the field separator, but continues
parsing a number.

After strtold() has possibly returned a number, it is checked if
parsing continued after the end of the field, which is interpreted as
an "invalid number".

This is done in the field_op_collect() function in the src/field-ops.c
file.

This function contains code to copy the field contents into a temporary
string with NUL termination that is only used if the system's strtold()
does not stop at the default field separator TAB.  Always using this
code should fix this problem.

It would also introduce additional data copying that is unnecessary
in most use-cases.

As an optimization, the data copy could be used only if it is needed,
i.e., when either the field separator is the same as the (locale
specific) decimal separator, or when the field separator is a TAB
and strtold() is known not to stop at a TAB (this is checked during
./configure).

I would say that the best interpretation of using the decimal separator
as field separator is to remove the ability to use floating point input
data as floating point numbers (it might be intended to split the input
into two integers, or one might work with a list of IPv4 addresses).

The possible changes described above would implement this.

[...]
[...]

Br,
Erik



reply via email to

[Prev in Thread] Current Thread [Next in Thread]