[Top][All Lists]
[Date Prev][Date Next][Thread Prev][Thread Next][Date Index][Thread Index]
Re: moments.c
From: |
Ben Pfaff |
Subject: |
Re: moments.c |
Date: |
Wed, 21 Mar 2007 20:43:37 -0700 |
User-agent: |
Gnus/5.110006 (No Gnus v0.6) Emacs/21.4 (gnu/linux) |
Jason Stover <address@hidden> writes:
> On Wed, Mar 21, 2007 at 02:27:22PM -0700, Ben Pfaff wrote:
>> If correlations and moments are separate, you can compute them in
>> a single pass like this:
>>
>> for each case C do:
>> moments1_add(C)
>> correlations_add(C)
>
> But correlations_add() would need to know what happened in
> moments1_add(). Otherwise its computations would be redundant because
> it would have to compute the mean on its own.
>
> Correlations depend on the means. Computing correlations requires
> either computing the means first, then the correlation (meaning an
> extra data pass), or using a special algorithm that computes both
> simultaneously. Since the computations of correlations and means are
> intertwined in a one-pass algorithm, I thought the computation of
> correlation should be in moments.c. If it isn't in moments.c, then
> anyone who wants to compute correlations would have to do so by
> writing their own functions to manipulate the moments struct in the
> loop, or forgoing the use of moments.c and re-writing their own
> functions to compute means.
>
> An algorithm for computing correlations with one pass was recently
> added to the GSL cvs repository. You can see it at
> http://sourceware.org/cgi-bin/cvsweb.cgi/gsl/?cvsroot=gsl in
> statistics/covariance.c.
OK, here's the core of the correlation code from that file:
long double sum_xsq = 0.0;
long double sum_ysq = 0.0;
long double sum_cross = 0.0;
...
mean_x = data1[0 * stride1];
mean_y = data2[0 * stride2];
for (i = 1; i < n; ++i)
{
ratio = i / (i + 1.0);
delta_x = data1[i * stride1] - mean_x;
delta_y = data2[i * stride2] - mean_y;
sum_xsq += delta_x * delta_x * ratio;
sum_ysq += delta_y * delta_y * ratio;
sum_cross += delta_x * delta_y * ratio;
mean_x += delta_x / (i + 1.0);
mean_y += delta_y / (i + 1.0);
}
r = sum_cross / (sqrt(sum_xsq) * sqrt(sum_ysq));
We can break this into three functions, one for initializing
variables, one for processing each case, one for producing the
output, pretty easily:
correlation_create():
long double sum_xsq = 0.0;
long double sum_ysq = 0.0;
long double sum_cross = 0.0;
i = 0;
correlation_add(x, m1_x, y, m1_y):
i++;
ratio = i / (i + 1.0);
delta_x = data1[i * stride1] - moments1_mean(m1_x);
delta_y = data2[i * stride2] - moments1_mean(m1_y);
sum_xsq += delta_x * delta_x * ratio;
sum_ysq += delta_y * delta_y * ratio;
sum_cross += delta_x * delta_y * ratio;
correlation_calculate():
return sum_cross / (sqrt(sum_xsq) * sqrt(sum_ysq));
(I'm assuming that we write a function moments1_mean() that does
the same as moments1_calculate() except that it just returns the
mean.)
And then the main loop becomes:
moments1_create()
correlation_create()
for each case C do:
get x and y from case
moments1_add(&m1_x, x)
moments1_add(&m1_y, y)
correlations_add(x, &m1_x, y, &m1_y)
moments1_calculate()
correlation_calculate()
This works, no? And then correlations doesn't have a dependency
on moments1 except as a user. We could even get rid of that
minor dependency by changing the interface from taking the
moments1 objects to taking the value of the mean.
Let me know what you think.
--
Ben Pfaff
address@hidden
http://benpfaff.org