[gnugo-devel] statistical regression
From: Douglas Ridgway
Subject: [gnugo-devel] statistical regression
Date: Tue, 2 Mar 2004 16:26:44 -0700 (MST)
Hi all!
After reading some of the discussion on r.g.g. about whether --level 15
is any improvement over --level 10, I did some work on the statistics. The
question is: given the results of a series of games, is player A
stronger than player B? From the point of view of setting up the test, the
question becomes how many games are needed to detect a difference in
strength of a given size. I think people here have also run such tests.
I constructed [1] a table using KGS's formula for converting a strength
difference in stones to a probability of victory, allowing a 5% chance
(two-sided) of falsely identifying a difference when there is none, and a
10% chance of missing a real difference at the stated mismatch. p is the
stronger player's probability of winning a single game, N is the number of
games that need to be played, and Nw is the number of games that the
stronger player must win to be declared stronger.
Stones    p       N    Nw
 0.5     0.60    264   148
 1.0     0.69     67    42
 1.5     0.77     30    21
 2.0     0.83     18    14
 2.5     0.88     12    10
 3.0     0.92      9     8
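For readers without Matlab, the table can be recomputed with a rough Python sketch like the one below. It uses only the standard library and follows the expressions in footnote [1]; the function names, and the choice to do the binomial step in exact integer arithmetic instead of calling binoinv, are assumptions made for this illustration.

```python
from math import ceil, comb, exp, sqrt

def win_probability(stones, k=0.8):
    """Logistic stones-to-win-probability conversion; k = 0.8 is the constant in footnote [1]."""
    return 1.0 / (1.0 + exp(-k * stones))

def games_needed(p, z_alpha=1.96, z_beta=1.28):
    """Normal-approximation sample size, the same expression as Ns in footnote [1]."""
    return ceil(((z_alpha * sqrt(p * (1 - p)) + z_beta * sqrt(0.25)) / (p - 0.5)) ** 2)

def wins_needed(n, level_num=975, level_den=1000):
    """Smallest win count W with P(X >= W) <= 2.5% for X ~ Binomial(n, 1/2).

    Intended to mirror binoinv(0.975, n, 0.5) + 1, using exact integers
    so that no floating-point rounding affects the threshold.
    """
    cumulative = 0
    for x in range(n + 1):
        cumulative += comb(n, x)  # number of outcomes with at most x wins
        if cumulative * level_den >= level_num * 2 ** n:
            return x + 1
    return n + 1

for stones in (0.5, 1.0, 1.5, 2.0, 2.5, 3.0):
    p = win_probability(stones)
    n = games_needed(p)
    print(f"{stones:3.1f}  {p:.2f}  {n:4d}  {wins_needed(n):4d}")
```

The N column should agree with the table. For the 264-game row, a direct evaluation of the binomial threshold appears to give Nw = 149, one more than the table lists, so that entry may be worth rechecking before relying on it.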
The results are interesting. For a short series (<= 10 games), nothing less
than a near-complete blowout is statistically significant, and we wouldn't
expect to see that without a major difference in strength, perhaps 3
stones. Identifying a substantial strength difference, 1.5-2.0 stones,
requires 20 or 30 games, with the stronger player winning 70% or more of
them (21 of 30, or 14 of 18). To be sure of a strength difference of less
than a stone requires hundreds of games.
One idea is to check that a change at least hasn't made the program worse.
Short series are so dominated by noise that they may not be worth
running at all. A run of 20 or 30 games, on the other hand, with a
required winning fraction of about 70%, makes some sense. That at least
gives roughly a 90% chance of catching a mistake that costs 1.5 to 2.0
stones, and some chance of identifying smaller changes, positive or negative.
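To see how much "some chance" amounts to, here is a small Python sketch that computes the exact probability of reaching the win threshold in a 30-game series for several strength gaps. It assumes the same logistic conversion with the 0.8 constant from footnote [1] and the 21-of-30 threshold from the table; the function names are made up for the example, and color and komi effects are ignored.

```python
from math import comb, exp

def win_probability(stones, k=0.8):
    """Logistic stones-to-win-probability conversion; k = 0.8 is the constant in footnote [1]."""
    return 1.0 / (1.0 + exp(-k * stones))

def prob_at_least(wins, games, p):
    """Exact P(X >= wins) for X ~ Binomial(games, p)."""
    return sum(comb(games, x) * p ** x * (1 - p) ** (games - x)
               for x in range(wins, games + 1))

GAMES, WINS_REQUIRED = 30, 21  # the 1.5-stone row of the table

for stones in (0.0, 0.5, 1.0, 1.5, 2.0):
    power = prob_at_least(WINS_REQUIRED, GAMES, win_probability(stones))
    print(f"{stones:3.1f} stones: P(at least {WINS_REQUIRED} wins in {GAMES}) = {power:.3f}")
```

The 0.0-stone line is the false-alarm rate for one side of the test. The other lines are the chance of detecting a real gap of that size, which can be compared against the 90% target used to build the table.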
I tried 3.5.3 at --level 15 (always white, receiving 6.5 komi) against
--level 10. Assuming I did it right [2], they split the series 10-10,
which suggests a strength difference of a stone or less and gives no
indication of which one is stronger.
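One way to gauge what a 10-10 split rules out is to ask how likely the stronger side would be to win no more than 10 of 20 games at each strength gap. The Python sketch below does that with an exact binomial sum. It assumes the logistic conversion and 0.8 constant from footnote [1], treats games as independent, and ignores any effect of color or komi; the function names are illustrative.

```python
from math import comb, exp

def win_probability(stones, k=0.8):
    """Logistic stones-to-win-probability conversion; k = 0.8 is the constant in footnote [1]."""
    return 1.0 / (1.0 + exp(-k * stones))

def prob_at_most(wins, games, p):
    """Exact P(X <= wins) for X ~ Binomial(games, p)."""
    return sum(comb(games, x) * p ** x * (1 - p) ** (games - x)
               for x in range(wins + 1))

GAMES, WINS = 20, 10  # the observed 10-10 split

for stones in (0.0, 0.5, 1.0, 1.5, 2.0):
    p = win_probability(stones)
    chance = prob_at_most(WINS, GAMES, p)
    print(f"{stones:3.1f} stones: P(stronger side wins <= {WINS} of {GAMES}) = {chance:.4f}")
```

Gaps for which this probability is very small are hard to reconcile with an even split. Gaps for which it stays sizeable cannot be excluded by 20 games.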
doug.
address@hidden
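[1] For people who'd like to check the math, here's the Matlab code
(binoinv is in the Statistics Toolbox):
% p: win probability of the stronger player for gaps of 0.5 to 3.0 stones
p = 1./(1+exp(-0.8*[0.5:0.5:3.0]))
% Ns: games needed (1.96 for 5% two-sided significance, 1.28 for 90% power)
Ns = ceil(((1.96*sqrt(p.*(1-p))+1.28*sqrt(.5*(1-.5)))./(p-.5)).^2)
% Nw: wins needed out of Ns to reject the hypothesis of equal strength
Nw = binoinv(0.975, Ns, 0.5)+1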
[2] Does the command line
perl twogtp --white '/usr/local/bin/gnugo --mode gtp --level 15' \
            --black '/usr/local/bin/gnugo --mode gtp --level 10' \
            --komi 6.5 --games 20 --sgffile filename.sgf
look about right?