|
From: | Erik Auerswald |
Subject: | Re: [Bug-datamash] Unable to use comma as field separator when the locale uses it as a decimal separator too |
Date: | Sun, 17 Jul 2022 13:42:33 +0200 |
User-agent: | Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0 |
Hi, On 14.07.22 14:41, Erik Auerswald wrote:
On Wed, Sep 12, 2018 at 05:42:09PM +0200, Jérémie Roquet wrote:[...] With datamash using the locale to determine which decimal separator to use, the behavior becomes inconsistent when the field separator and the decimal separator are the same. For example: $ printf '1,2\n' | LC_NUMERIC=fr_FR.UTF-8 datamash -t, sum 1 datamash: invalid numeric value in line 1 field 1: '1' $ printf '1,2\n' | LC_NUMERIC=fr_FR.UTF-8 datamash -t, sum 2 2 whereas: $ printf '1,2\n' | LC_NUMERIC=en_US.UTF-8 datamash -t, sum 1 1 $ printf '1,2\n' | LC_NUMERIC=en_US.UTF-8 datamash -t, sum 2 2I can reproduce this with a German locale and current datamash, too: $ echo $LC_NUMERIC de_DE.UTF-8 $ ./datamash --version | head -n1 datamash (GNU datamash) 1.7.34-4187 $ echo 1,2 | ./datamash -t, sum 1 ./datamash: invalid numeric value in line 1 field 1: '1' $ echo 1,2 | ./datamash -t, sum 2 2 Not all operations are affected: $ echo 1,2 | ./datamash -t, count 1 1 $ echo 1,2 | ./datamash -t, count 2 1 This issue affects using a period ('.') as field separator in locales where a period is used as decimal point, too: $ echo 1.2 | env LC_ALL=C ./datamash -t. sum 1 ./datamash: invalid numeric value in line 1 field 1: '1' $ echo 1.2 | env LC_ALL=C ./datamash -t. sum 2 2 $ echo 1.2 | env LC_NUMERIC=enUS.UTF-8 ./datamash -t. sum 1 ./datamash: invalid numeric value in line 1 field 1: '1' $ echo 1.2 | env LC_NUMERIC=enUS.UTF-8 ./datamash -t. sum 2 2In my opinion, this is surprising because: - working on “raw” tabular data, you don't expect “smart” handling of anything (that would have been less surprising from Microsoft Excel, because it implements escaping of separators in formats like CSV); - parsing works well for the last column, but not for the others; - the error message reports a well-parsed numeric value, not an obviously invalid one.I concur that the error message does not make sense. I think that this is a bug. I have not yet looked at the code regarding this issue, thus I cannot say how effortful it would be to fix it.
This problem results from the way GNU Datamash parses its input. The input data is kept as-is, and the individual fields are designated via pointers to the beginning and the field length. Thus the input field separator is part of the string containing the field data unless processing the last field of a line. The start pointer of the field is given to the floating point parsing function strtold(). If the field separator is the same as the decimal separator, strtold() does not stop at the field separator, but continues parsing a number. After strtold() has possibly returned a number, it is checked if parsing continued after the end of the field, which is interpreted as an "invalid number". This is done in the field_op_collect() function in the src/field-ops.c file. This function contains code to copy the field contents into a temporary string with NUL termination that is only used if the system's strtold() does not stop at the default field separator TAB. Always using this code should fix this problem. It would also introduce additional data copying that is unnecessary in most use-cases. As an optimization, the data copy could be used only if it is needed, i.e., when either the field separator is the same as the (locale specific) decimal separator, or when the field separator is a TAB and strtold() is known not to stop at a TAB (this is checked during ./configure).
I would say that the best interpretation of using the decimal separator as field separator is to remove the ability to use floating point input data as floating point numbers (it might be intended to split the input into two integers, or one might work with a list of IPv4 addresses).
The possible changes described above would implement this.
[...][...]
Br, Erik
[Prev in Thread] | Current Thread | [Next in Thread] |