[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
New bugs discovered when adding vnlog support
From: |
Dima Kogan |
Subject: |
New bugs discovered when adding vnlog support |
Date: |
Mon, 04 Jul 2022 17:21:25 -0700 |
User-agent: |
mu4e 1.6.10; emacs 29.0.50 |
Hi.
I just looked into adding tests for the vnlog support. Almost done. In
the process I found a few bugs that we should fix.
I'm using the latest datamash in version control. I have /tmp/data:
x y
1 0.5
2 1
3 1.5
4 2
A crosstab operation does this:
$ < /tmp/data ./datamash -W --header-in --header-out crosstab 1,2
GroupBy(x) GroupBy(y) count(x)
0.5 1 1.5 2
1 1 N/A N/A N/A
2 N/A 1 N/A N/A
3 N/A N/A 1 N/A
4 N/A N/A N/A 1
Here "1, 2, 3, 4" labels the rows and "0.5, 1, 1.5, 2" labels the
columns. These are present regardless of --header-out. The "GroupBy()"
stuff is added by --header-out, and doesn't label the rows or the
columns. There are always 3 of these. These appear to label the crosstab
y axis, the crosstab x axis, and the operation respectively. This is
very confusing because normally --header-out produces labels for each
column. Proposal: if "--header-out -C" is given, put the "GroupBy" stuff
into a comment, with more text describing what they mean. If
"--header-out" is given without -C, don't produce this output at all.
Next. If I tweak this command to ask for the field by name (which should
workd because we passed --header-in), then this happens instead:
$ < /tmp/data ./datamash -W --header-in --header-out crosstab x,y
datamash: src/column-headers.c:62: get_input_field_name: Assertion `field_num
> 0 && field_num <= num_input_column_headers' failed.
zsh: IOT instruction ./datamash -W --header-in --header-out crosstab x,y <
/tmp/data
Next. There's an issue with the covariance operations. Here I compute
Cov(x,x) and Cov(y,y) and Cov(x,y):
$ < /tmp/data ./datamash --header-in --header-out -C -W pcov x:x
pcov(x)
1.25
$ < /tmp/data ./datamash --header-in --header-out -C -W pcov y:y
pcov(y)
0.3125
$ < /tmp/data ./datamash --header-in --header-out -C -W pcov x:y
pcov(y)
0.625
Note the field label. For the self-covariance, it maybe makes sense. But
for the cross-covariance, it's definitely wrong: Cov(x,y) cannot be
labelled "pcov(y)".
Finally, there's an unexpected implementation detail with "reverse" and
"rmdup" that could be described as a bug. In all other modes, passing
--header-out ends up calling the print_column_headers() function before
writing out the data. But the "reverse" and "rmdup" have a separate code
path, which doesn't work the same way. This is a likely source of bugs.
I haven't looked very hard, but I just found one: "rmdup" appears to
ignore the output delimiter:
$ < /tmp/data ./datamash --output-delimiter=: -H -W rmdup x
x y
1 0.5
2 1
3 1.5
4 2
Probably more bugs exist, and the extra code paths should be eliminated
to be sure.
Thanks for maintaining datamash!
- New bugs discovered when adding vnlog support,
Dima Kogan <=