[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: New bugs discovered when adding vnlog support
From: |
Tim Rice |
Subject: |
Re: New bugs discovered when adding vnlog support |
Date: |
Tue, 5 Jul 2022 03:12:30 +0000 |
Hey Dima,
Thanks for the reports.
On Mon, Jul 04, 2022 at 05:21:25PM -0700, Dima Kogan wrote:
A crosstab operation does this:
$ < /tmp/data ./datamash -W --header-in --header-out crosstab 1,2
GroupBy(x) GroupBy(y) count(x)
0.5 1 1.5 2
1 1 N/A N/A N/A
2 N/A 1 N/A N/A
3 N/A N/A 1 N/A
4 N/A N/A N/A 1
Here "1, 2, 3, 4" labels the rows and "0.5, 1, 1.5, 2" labels the
columns. These are present regardless of --header-out. The "GroupBy()"
stuff is added by --header-out, and doesn't label the rows or the
columns. There are always 3 of these. These appear to label the crosstab
y axis, the crosstab x axis, and the operation respectively. This is
very confusing because normally --header-out produces labels for each
column. Proposal: if "--header-out -C" is given, put the "GroupBy" stuff
into a comment, with more text describing what they mean. If
"--header-out" is given without -C, don't produce this output at all.
I think it makes more sense if you think of the entire table as a single
"block" of output. Without --header-out, you still see the column and row
names. If --header-out also showed the column names, it would be redundant.
The existing output of --header-out does show how to interpret the block that
immediately follows. If someone doesn't want that interpretation displayed, and
simply wants the column and row names, they need only elide --header-out from
their command.
Next. If I tweak this command to ask for the field by name (which should
workd because we passed --header-in), then this happens instead:
$ < /tmp/data ./datamash -W --header-in --header-out crosstab x,y
datamash: src/column-headers.c:62: get_input_field_name: Assertion `field_num > 0
&& field_num <= num_input_column_headers' failed.
zsh: IOT instruction ./datamash -W --header-in --header-out crosstab x,y <
/tmp/data
Yep, that definitely looks like a bug to me. I'd like to fix this before 1.8 if
possible, and have added it to my todo list. Thanks for bringing it to my
attention.
$ < /tmp/data ./datamash --header-in --header-out -C -W pcov x:y
pcov(y)
0.625
Note the field label. For the self-covariance, it maybe makes sense. But
for the cross-covariance, it's definitely wrong: Cov(x,y) cannot be
labelled "pcov(y)".
Agreed, I'd like to fix this for the release too. I think it could say
"pcov(x,y)". This is now todo.
Finally, there's an unexpected implementation detail with "reverse" and
"rmdup" that could be described as a bug. In all other modes, passing
--header-out ends up calling the print_column_headers() function before
writing out the data. But the "reverse" and "rmdup" have a separate code
path, which doesn't work the same way. This is a likely source of bugs.
I haven't looked very hard, but I just found one: "rmdup" appears to
ignore the output delimiter:
$ < /tmp/data ./datamash --output-delimiter=: -H -W rmdup x
x y
1 0.5
2 1
3 1.5
4 2
Yeah, I see. If it requires significant refactoring, it might need to wait for
the next release after this. In the meantime, I should be able to make
`--output-delimiter` work correctly in time for 1.8.
~ Tim