bug-datamash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: New bugs discovered when adding vnlog support


From: Tim Rice
Subject: Re: New bugs discovered when adding vnlog support
Date: Tue, 5 Jul 2022 03:12:30 +0000

Hey Dima,

Thanks for the reports.

On Mon, Jul 04, 2022 at 05:21:25PM -0700, Dima Kogan wrote:
A crosstab operation does this:

 $ < /tmp/data ./datamash -W --header-in --header-out crosstab 1,2

 GroupBy(x)     GroupBy(y)      count(x)
         0.5    1       1.5     2
 1      1       N/A     N/A     N/A
 2      N/A     1       N/A     N/A
 3      N/A     N/A     1       N/A
 4      N/A     N/A     N/A     1

Here "1, 2, 3, 4" labels the rows and "0.5, 1, 1.5, 2" labels the
columns. These are present regardless of --header-out. The "GroupBy()"
stuff is added by --header-out, and doesn't label the rows or the
columns. There are always 3 of these. These appear to label the crosstab
y axis, the crosstab x axis, and the operation respectively. This is
very confusing because normally --header-out produces labels for each
column. Proposal: if "--header-out -C" is given, put the "GroupBy" stuff
into a comment, with more text describing what they mean. If
"--header-out" is given without -C, don't produce this output at all.

I think it makes more sense if you think of the entire table as a single 
"block" of output. Without --header-out, you still see the column and row 
names. If --header-out also showed the column names, it would be redundant.

The existing output of --header-out does show how to interpret the block that 
immediately follows. If someone doesn't want that interpretation displayed, and 
simply wants the column and row names, they need only elide --header-out from 
their command.


Next. If I tweak this command to ask for the field by name (which should
workd because we passed --header-in), then this happens instead:

 $ < /tmp/data ./datamash -W --header-in --header-out crosstab x,y

 datamash: src/column-headers.c:62: get_input_field_name: Assertion `field_num > 0 
&& field_num <= num_input_column_headers' failed.
 zsh: IOT instruction  ./datamash -W --header-in --header-out crosstab x,y < 
/tmp/data

Yep, that definitely looks like a bug to me. I'd like to fix this before 1.8 if 
possible, and have added it to my todo list. Thanks for bringing it to my 
attention.



 $ < /tmp/data ./datamash --header-in --header-out -C -W pcov x:y

 pcov(y)
 0.625

Note the field label. For the self-covariance, it maybe makes sense. But
for the cross-covariance, it's definitely wrong: Cov(x,y) cannot be
labelled "pcov(y)".

Agreed, I'd like to fix this for the release too. I think it could say 
"pcov(x,y)". This is now todo.


Finally, there's an unexpected implementation detail with "reverse" and
"rmdup" that could be described as a bug. In all other modes, passing
--header-out ends up calling the print_column_headers() function before
writing out the data. But the "reverse" and "rmdup" have a separate code
path, which doesn't work the same way. This is a likely source of bugs.
I haven't looked very hard, but I just found one: "rmdup" appears to
ignore the output delimiter:

 $ < /tmp/data ./datamash --output-delimiter=: -H -W rmdup x

 x y
 1 0.5
 2 1
 3 1.5
 4 2

Yeah, I see. If it requires significant refactoring, it might need to wait for 
the next release after this. In the meantime, I should be able to make 
`--output-delimiter` work correctly in time for 1.8.

~ Tim



reply via email to

[Prev in Thread] Current Thread [Next in Thread]