bug-datamash
[Top][All Lists]
Advanced

[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]

Re: New bugs discovered when adding vnlog support


From: Tim Rice
Subject: Re: New bugs discovered when adding vnlog support
Date: Sat, 9 Jul 2022 05:07:51 +0000

Hi Dima,

I've done some work today, and I think most of the issues you've reported have 
been solved. The following commits are for your bug reports:

```
5111445 datamash: bugfix: Ensure rmdup respects --output-delimiter
4a3b349 datamash: bugfix: Allow crosstab to be called by header field name.
6dc46b8 tests: Add the test for pcov header in/out
e3462c5 datamash: Print all operation columns in output header
```

I've made additional todo items to solve the harder problems later (v2.0).

I'll wait a few days for feedback in case I've made any mistakes, and then it 
should be time for the final release of v1.8.

Thanks again for the reports!

~ Tim


On Tue, Jul 05, 2022 at 03:12:30AM +0000, Tim Rice wrote:
Hey Dima,

Thanks for the reports.

On Mon, Jul 04, 2022 at 05:21:25PM -0700, Dima Kogan wrote:
A crosstab operation does this:

$ < /tmp/data ./datamash -W --header-in --header-out crosstab 1,2

GroupBy(x)      GroupBy(y)      count(x)
        0.5     1       1.5     2
1       1       N/A     N/A     N/A
2       N/A     1       N/A     N/A
3       N/A     N/A     1       N/A
4       N/A     N/A     N/A     1

Here "1, 2, 3, 4" labels the rows and "0.5, 1, 1.5, 2" labels the
columns. These are present regardless of --header-out. The "GroupBy()"
stuff is added by --header-out, and doesn't label the rows or the
columns. There are always 3 of these. These appear to label the crosstab
y axis, the crosstab x axis, and the operation respectively. This is
very confusing because normally --header-out produces labels for each
column. Proposal: if "--header-out -C" is given, put the "GroupBy" stuff
into a comment, with more text describing what they mean. If
"--header-out" is given without -C, don't produce this output at all.

I think it makes more sense if you think of the entire table as a single 
"block" of output. Without --header-out, you still see the column and row 
names. If --header-out also showed the column names, it would be redundant.

The existing output of --header-out does show how to interpret the block that 
immediately follows. If someone doesn't want that interpretation displayed, and 
simply wants the column and row names, they need only elide --header-out from 
their command.


Next. If I tweak this command to ask for the field by name (which should
workd because we passed --header-in), then this happens instead:

$ < /tmp/data ./datamash -W --header-in --header-out crosstab x,y

datamash: src/column-headers.c:62: get_input_field_name: Assertion `field_num > 0 
&& field_num <= num_input_column_headers' failed.
zsh: IOT instruction  ./datamash -W --header-in --header-out crosstab x,y < 
/tmp/data

Yep, that definitely looks like a bug to me. I'd like to fix this before 1.8 if 
possible, and have added it to my todo list. Thanks for bringing it to my 
attention.



$ < /tmp/data ./datamash --header-in --header-out -C -W pcov x:y

pcov(y)
0.625

Note the field label. For the self-covariance, it maybe makes sense. But
for the cross-covariance, it's definitely wrong: Cov(x,y) cannot be
labelled "pcov(y)".

Agreed, I'd like to fix this for the release too. I think it could say 
"pcov(x,y)". This is now todo.


Finally, there's an unexpected implementation detail with "reverse" and
"rmdup" that could be described as a bug. In all other modes, passing
--header-out ends up calling the print_column_headers() function before
writing out the data. But the "reverse" and "rmdup" have a separate code
path, which doesn't work the same way. This is a likely source of bugs.
I haven't looked very hard, but I just found one: "rmdup" appears to
ignore the output delimiter:

$ < /tmp/data ./datamash --output-delimiter=: -H -W rmdup x

x y
1 0.5
2 1
3 1.5
4 2

Yeah, I see. If it requires significant refactoring, it might need to wait for 
the next release after this. In the meantime, I should be able to make 
`--output-delimiter` work correctly in time for 1.8.

~ Tim




reply via email to

[Prev in Thread] Current Thread [Next in Thread]