Re: Adding dot product operation to GNU Datamash
From: Erik Auerswald
Subject: Re: Adding dot product operation to GNU Datamash
Date: Sat, 6 Aug 2022 19:57:28 +0200
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:91.0) Gecko/20100101 Thunderbird/91.11.0
Hi,
On 06.08.22 03:30, Tim Rice wrote:
I've been thinking about this for a while: it would be nice to have an
operation which multiplies the corresponding records of two columns and
returns the sum of these products. Aka the dot product or scalar product
of the two columns.
At the moment, you could do something similar by combining GNU Datamash
with GNU Awk:
```
$ awk '{print $1 * $2}' /tmp/data.txt | datamash sum 1
```
Or you could do it all in gawk if you want:
```
$ awk '{sum += $1 * $2} END{print sum}' /tmp/data.txt
```
But I think doing it all in GNU Datamash allows a more intuitive command:
```
$ datamash -W dotprod 1:2 < /tmp/data.txt
```
A proposed implementation is attached. Please let me know if you see any
problems with it.
I looked at the diff and did not see any obvious problems. I do
not see a reason not to add that operation either.
If this looks good, then it should be trivial to also add a weighted
mean. That will just be like the dot product except for dividing the
result by one of the column sums. (But which column should be preferred
for that? Maybe need to pass an extra option?)
It might suffice to always divide by the sum of the first column,
if the code keeps the order of the given fields. I think it does,
but I did not verify this.
This would allow using "weighted_mean 1:2" or "weighted_mean 2:1"
to divide by the sum of column 1 or column 2, respectively.
("weighted_mean" is just a placeholder, of course, I just needed
some name to illustrate the idea.)
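For illustration, the same idea can be sketched with awk today, in the
style of the dot product commands above. This is only a rough sketch:
the sample data is made up, and the convention that the first field
given supplies the divisor is the assumption discussed above, not
something datamash implements.

```shell
# Hypothetical sample data, two whitespace-separated columns.
data=$(mktemp)
printf '2 4\n6 12\n' > "$data"

# Like "weighted_mean 1:2" (placeholder name): dot product of columns
# 1 and 2, divided by the sum of column 1.
wmean_1_2=$(awk '{dot += $1 * $2; w += $1} END {if (w != 0) print dot / w}' "$data")

# Like "weighted_mean 2:1": same dot product, divided by the sum of column 2.
wmean_2_1=$(awk '{dot += $1 * $2; w += $2} END {if (w != 0) print dot / w}' "$data")

echo "$wmean_1_2 $wmean_2_1"
rm -f "$data"
```

With this data the dot product is 80, so the two results are 80/8 = 10
and 80/16 = 5.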
Br,
Erik