Re: Kolmogorov-Smirnov test 2
From: Kai Torben Ohlhus
Subject: Re: Kolmogorov-Smirnov test 2
Date: Tue, 2 Jul 2019 19:47:26 +0900
User-agent: Mozilla/5.0 (X11; Linux x86_64; rv:60.0) Gecko/20100101 Thunderbird/60.7.2
On 6/28/19 6:35 AM, Tommy McCann wrote:
>
> On Thu, Jun 27, 2019 at 1:09 AM Kai Torben Ohlhus wrote:
>
> On 6/22/19 8:15 AM, tmac017 wrote:
> > I was trying to use the kolmogorov_smirnov_test_2 and I got this error
> >
> > warning: kolmogorov_smirnov_test_2: cannot compute correct p-values with ties
> > warning: called from
> > kolmogorov_smirnov_test_2 at line 79 column 5
> >
> > I saw there was another thread about this, but it didn't answer the
> > question and that thread is closed. Since I spent some time looking at
> > the code, I'm re-posting.
> >
> > The warning means that some values in each set are exactly the same. The
> > reason this is a problem is that the code sorts the values from both sets
> > and the sorted values can't occupy the same place in an ordered series. In
> > order to avoid an error caused by the sorting, the function deletes the D
> > value at that point. I don't think this should cause any problems, but it
> > still prints a warning.
> >
> > The reason I got this error is that I was using the function
> > empirical_cdf to generate a cdf for each data set along the same range,
> > because the HELP info said the function required cdf inputs. Based on the
> > code, it seems the function takes two data sets, not CDFs. Because
> > CDFs alter the size of the set, using them messes with the results.
> >
> > Note: in the other thread Hamish was having a hard time using the
> > KS-test for
> > a = randn(2000,1);
> > b = randn(2000,1);
> > p = kolmogorov_smirnov_test_2(a,b)
> >
> > she got the same error and the results weren't consistent. This is
> > ironically BECAUSE of the large set size. The test statistic is
> > sqrt (n_x * n_y / (n_x + n_y)) * d. Since the curves were randomly
> > generated, some deviation was expected; the large sample size made the
> > test more sensitive to deviation, and increasing the sample size just
> > made the test even more sensitive.
> >
> Please can you tell the version of Octave and the version of the
> statistics package you are using? In version 4.4.0 many statistics
> functions moved to the statistics package of Octave Forge [1].
>
> Additionally, it would be nice to provide a reproducible test for this
> warning message. The example of Hamish from 2005 [2]
>
>
> N = 1e6; while 1, a = randn(N,1); b = randn(N,1); p = kolmogorov_smirnov_test_2(a,b), endwhile
>
> did not throw the warning you described, for N=2000 or N=1e6, within 5 minutes.
>
> Best,
> Kai
>
>
> [1] https://octave.sourceforge.io/statistics/NEWS.html
> [2] https://lists.gnu.org/archive/html/help-octave/2005-11/msg00232.html
>
>
> The version of Octave I'm using is 4.2.0, and it looks like the statistics
> package is 1.3.0.
>
> True, in the original thread the error only occurred during the first
> run and her primary question was why the p value wasn't consistent. I
> got the error because I was comparing two cdfs instead of two data sets.
> Here is a piece of code similar to the one I used to troubleshoot.
>
> %%begin script to test KS error...
>
> %create random vectors
> d1 = rand(100,1);
> d2 = rand(100,1);
>
> x = 0:0.05:1;
>
> %create cdfs to visualize kolmogorov_smirnov_test results
> d1_cdf = empirical_cdf(x,d1);
> d2_cdf = empirical_cdf(x,d2);
>
> %plot cdfs
> figure(1);
> plot(x,d1_cdf,'-',x,d2_cdf,'-');
>
> % run KS-test
> disp('first test, without duplicate does not produce warning: ');
> P = kolmogorov_smirnov_test_2(d1, d2)
>
> %add duplicate value to vectors then re-run test
> disp('second test containing duplicate produces warning: ');
> next = length(d1)+1;
> d1(next) = 0.5;
> d2(next) = 0.5;
>
> P = kolmogorov_smirnov_test_2(d1, d2)
>
> %attempt to use cdfs instead of raw data as specified in help info
> disp('using cdf produces warning and incorrect results:');
> P = kolmogorov_smirnov_test_2(d1_cdf, d2_cdf)
>
Thank you for your example. I created bug #56572 [3] so that your
example is not forgotten.
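On the third call in the example: a minimal sketch of why passing CDF
vectors instead of raw data always produces ties. It does not need the
statistics package; ecdf_on_grid is a hypothetical stand-in for
empirical_cdf, assumed here to return the fraction of samples <= each grid
point. Because rand draws from the open interval (0,1), both CDF vectors
are 0 at x = 0 and 1 at x = 1, so they share values whatever the data look
like, and the 100 observations shrink to 21 grid values.

```octave
% Sketch only: ecdf_on_grid is a hypothetical stand-in for empirical_cdf.
d1 = rand (100, 1);
d2 = rand (100, 1);
x  = (0:0.05:1)';

% Fraction of samples <= each grid point.  (data' <= grid) broadcasts a
% row against a column into a numel(grid)-by-numel(data) logical matrix.
ecdf_on_grid = @(grid, data) sum (data(:)' <= grid(:), 2) / numel (data);

d1_cdf = ecdf_on_grid (x, d1);
d2_cdf = ecdf_on_grid (x, d2);

% The "samples" handed to the test are now grid values, not observations.
printf ("raw sample sizes: %d and %d\n", numel (d1), numel (d2));
printf ("CDF vector sizes: %d and %d\n", numel (d1_cdf), numel (d2_cdf));

% Values that occur in both CDF vectors are ties from the test's point of view.
shared = intersect (d1_cdf, d2_cdf);
printf ("values present in both CDF vectors: %d\n", numel (shared));
```

The shared values always include 0 and 1, which would explain why the
warning appears on every run of the third call.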
I do not fully understand why the tie values are treated this way. If
you can help with this bug, please post your ideas for improving the
function at [3]. However, it seems that the statistics package does not
have a maintainer right now, so your changes might not be merged
immediately.
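As a starting point for the discussion at [3], here is one possible way to
treat ties. It is a sketch under stated assumptions and is not taken from
the package source: both empirical CDFs are evaluated on the unique pooled
values, so a tied value contributes a single comparison point instead of
being dropped. The variable names are illustrative, and only the statistic
is computed, not a p-value.

```octave
% Sketch: two-sample KS statistic that tolerates ties (illustrative only).
x = [0.1; 0.2; 0.5; 0.9];          % sample 1, shares the value 0.5 with sample 2
y = [0.3; 0.5; 0.5; 0.7; 0.8];     % sample 2, contains 0.5 twice
n_x = numel (x);
n_y = numel (y);

z = unique ([x; y]);               % sorted pooled values, ties collapsed

% Empirical CDFs at each pooled value: fraction of samples <= z(i).
% (x' <= z) broadcasts a row against a column into a numel(z)-by-n matrix.
Fx = sum (x' <= z, 2) / n_x;
Fy = sum (y' <= z, 2) / n_y;

d  = max (abs (Fx - Fy));                   % largest vertical distance
ks = sqrt (n_x * n_y / (n_x + n_y)) * d;    % scaled statistic quoted above

printf ("d = %g, ks = %g\n", d, ks);
```

For these two samples d is 0.5, reached at z = 0.2. The broadcast
comparison builds a numel(z)-by-n matrix, so this form is only meant for
small illustrations.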
Best regards,
Kai
[3] https://savannah.gnu.org/bugs/?56572