[Bug-gnubg] Re: GnuBG: Fractional-ply evaluators


From: Nis Jorgensen
Subject: [Bug-gnubg] Re: GnuBG: Fractional-ply evaluators
Date: Tue, 21 Oct 2003 01:54:20 +0200

On 17 Oct 2003 17:28:55 -0700
address@hidden (Tom Keith) wrote in rec.games.backgammon :

> Let me describe an experiment I did comparing zero-ply and one-ply
> evaluations in GnuBG, and follow it up with a request for the GnuBG
> developers (as if they don't have enough on their plate already).

Reply sent to rec.games.backgammon, cc'ed to address@hidden. If
possible, reply to both.
 
> ---
> 
> When GnuBG evaluates a position, you can tell it how far ahead you
> want it to look.  A zero-ply evaluation does no lookahead -- you just
> get the output of the program's neural net.  A one-ply evaluation
> looks ahead one roll:  it looks at all 21 possible rolls, makes what
> it believes the best play for each, and takes a weighted average of
> the resulting positions.  Each additional ply of lookahead takes about
> 21 times as long as the previous level.
> 
> When you do rollouts in GnuBG, one of the parameters you can set is
> what level of evaluation to use for checker plays.  Presumably one-ply
> evaluation plays better than zero-ply, and two-ply plays better than
> one-ply, etc.  However, there has been some discussion over the years
> about whether odd-ply evaluations are as reliable as even-ply.  (See
> http://www.bkgm.com/rgb/rgb.cgi?view+1061).
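
The one-roll lookahead described above can be sketched in a few lines of
Python. This is only an illustration of the weighted average over the 21
distinct rolls, not gnubg's actual code: value_after_roll is a
hypothetical callback standing in for "make the best play for this roll
and evaluate the resulting position".

```python
from itertools import combinations_with_replacement

# The 21 distinct rolls: 6 doubles plus 15 non-doubles.
ROLLS = list(combinations_with_replacement(range(1, 7), 2))

def one_ply_average(value_after_roll):
    """Weighted average of a per-roll value over all 21 distinct rolls.

    value_after_roll is a hypothetical callback (d1, d2) -> float; it is
    an assumption made for this sketch, not a gnubg function.
    """
    total = 0.0
    for d1, d2 in ROLLS:
        # A double comes up 1 time in 36, any other roll 2 times in 36.
        weight = 1 if d1 == d2 else 2
        total += weight * value_after_roll(d1, d2)
    return total / 36
```

Each extra ply repeats this loop for every roll of the previous level,
which is where the factor of about 21 in running time comes from.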


A little while ago, I suggested on the bug-gnubg list an explanation of
why 1-ply is off the mark. Here is the important part:

Nis wrote:

> If 0-ply is unbiased but imprecise (as in having average error 0) then
> the value of the best move will be overrated. Example
> 
> Move True Equity 0-ply equity
> A    0.4         0.35
> B    0.4         0.45
> C    0.5         0.45
> D    0.5         0.55 (*BEST MOVE*)
> 
> Note that the average error is 0, but the best move is off by 0.05.
> 
> The result of this should be that 1-ply, which is the average of 21
> BestMoves for the opponent, is underrated by some amount. This will be
> added to the (negated) 0-ply, so if 0-ply is overrated, 1-ply is even
> more underrated.

Note that this was written in the context of someone claiming that 0-ply
overrates positions. I do not think this is the case, but the general
principle holds: on average, 1-ply rates positions lower than 0-ply.
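
To make the example table concrete, here is a small Python sketch that
recomputes its two numbers: the average error of the 0-ply estimates and
the amount by which the value of the selected best move is overrated.
The figures are exactly those from the table; the variable names are
just for illustration.

```python
# (move, true equity, 0-ply equity), copied from the example table.
moves = [
    ("A", 0.40, 0.35),
    ("B", 0.40, 0.45),
    ("C", 0.50, 0.45),
    ("D", 0.50, 0.55),
]

# Average signed error of the 0-ply estimates (unbiased means about 0).
mean_error = sum(est - true for _, true, est in moves) / len(moves)

# Value of the best move: as it really is, and as 0-ply reports it.
best_true = max(true for _, true, _ in moves)
best_estimated = max(est for _, _, est in moves)

# Picking the maximum of noisy estimates favours the overrated ones.
selection_bias = best_estimated - best_true

print(f"mean error     {mean_error:+.3f}")
print(f"selection bias {selection_bias:+.3f}")
```

The individual errors cancel, but taking the maximum does not: that is
the whole effect.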

> I thought I'd try an experiment comparing zero-ply and one-ply
> evaluations.  Here's what I did:
> 
> 1.  I collected a large number of backgammon games between good players,
>     some human-vs-human, some human-vs-computer.  From these I took
>     a representative sample of positions. (However duplicate positions
>     were deleted so early game positions are under-represented.)

It would be nice to know the size of your sample. 

> 2.  I rolled out each position to the end of the game thirty-six times
>     using cubeless zero-ply evaluation. Variance reduction was applied.
>
> 3.  I took the root-mean-square average of the differences between
>     GnuBG's zero-ply evaluation and the rollout results, and between
>     GnuBG's one-ply evaluation and the rollout results.  I looked
>     only at game-winning chances; I didn't look at gammons or
>     backgammons.

Any specific reason for using the root-mean-square? I would probably go
for the average absolute error as the indicator.
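
For reference, the two indicators differ only in how they weight large
errors. A minimal Python sketch follows; the function names are mine and
the sample numbers are made up purely to show the difference.

```python
import math

def rms_error(estimates, references):
    """Root-mean-square of the differences; large errors dominate."""
    diffs = [e - r for e, r in zip(estimates, references)]
    return math.sqrt(sum(d * d for d in diffs) / len(diffs))

def mean_abs_error(estimates, references):
    """Average absolute difference; every error counts linearly."""
    diffs = [e - r for e, r in zip(estimates, references)]
    return sum(abs(d) for d in diffs) / len(diffs)

# Made-up data: three small errors and one large one.
rollout = [0.50, 0.50, 0.50, 0.50]
evaluation = [0.51, 0.51, 0.51, 0.59]

rms = rms_error(evaluation, rollout)
mae = mean_abs_error(evaluation, rollout)
```

The root-mean-square is never smaller than the average absolute error,
and the gap grows when a few positions are badly misjudged.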

> These are the results:
> 
>     Zero-ply evaluation:  Average error = 0.0300
>     One-ply evaluation:   Average error = 0.0284
> 
> So one-ply evaluation does do better on average.  This is to be
> expected; being able to look ahead one ply should be a help,
> especially in volatile positions.
>
> In certain games GnuBG's evaluation seems to oscillate back and forth
> according to which side's turn it is to play.  When this happens, a
> one-ply evaluation (which essentially looks at the game from the other
> player's side) can give quite different numbers than a zero-ply
> evaluation.  You might expect when zero-ply and one-ply evaluations
> differ by a lot that the true value of the position is probably
> somewhere in between.  I thought it would be interesting to see what
> would happen if you had an evaluator that used the average of zero-ply
> and one-ply.  I called this a "0.5-ply evaluation."
> 
>     0.5-ply evaluation:  Average error = 0.0245
> 
> So 0.5-ply does do better!  

This seems to agree nicely with the results of Joseph Heled:

http://mail.gnu.org/archive/html/bug-gnubg/2003-02/msg00218.html

which examine the ability of 0.5-ply to make actual game decisions.
(Both cube and checker decisions, if I am not mistaken.)

> In fact, it does enough better to make you
> wonder if it does even better than two-ply. (I didn't look into this.)
 
I am very interested in this as well. It would be great to add 1.5-ply
and 2-ply to the list. I think I asked Joseph to make his
benchmark available some time ago, and I would like to repeat the
request. If possible in some semi-readable format ...

The same goes for your sample of positions.

> Can we do even better?  Something I noticed is that you can often
> predict whether zero-ply or one-ply is better for a particular
> position by looking at the relative pipcount.  (The relative pipcount
> is your own pipcount minus your opponent's pipcount.)  When the
> relative pipcount is between -160 and -40, one-ply usually does
> better; when the relative pipcount is between 40 and 150, zero-ply
> usually does better.

The pipcount is strongly related to the game-winning chances (gwc).
Could you perhaps check if the correlation between the 0-ply eval and
BestPly is stronger or weaker than between pipcount and BestPly?
(BestPly is 0 if 0-ply is best, 1 if 1-ply is.)
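
The check I have in mind could look like the following Python sketch. It
computes a plain Pearson correlation, which with a 0/1 variable such as
BestPly is the point-biserial correlation. The numbers are made up to
exercise the function and say nothing about the real sample.

```python
import math

def pearson(xs, ys):
    """Pearson correlation coefficient of two equally long sequences."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    cov = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    var_x = sum((x - mean_x) ** 2 for x in xs)
    var_y = sum((y - mean_y) ** 2 for y in ys)
    return cov / math.sqrt(var_x * var_y)

# Made-up illustration data, one entry per position.
pipcount = [-120, -80, -50, 60, 90, 130]       # relative pipcount
p0 = [0.30, 0.42, 0.35, 0.60, 0.55, 0.70]      # 0-ply winning chances
best_ply = [1, 1, 1, 0, 0, 0]                  # 1 if 1-ply was closer

r_pip = pearson(pipcount, best_ply)
r_p0 = pearson(p0, best_ply)
# Compare abs(r_pip) with abs(r_p0) to see which predicts BestPly better.
```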

There might be some specific reason for 1-ply being better when you are
behind - I guess it has to do with the distribution of errors on 0-ply
evals (and thus the size of the 1-ply bias).

Bonus question: Find a best fit between "true equity" and
p1 + (a * p0 + b) * (p0 - p1)
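
The model is linear in a and b, so an ordinary least-squares fit is
enough. With d = p0 - p1 it reads: true - p1 = a * (p0 * d) + b * d.
Below is a minimal Python sketch that solves the 2x2 normal equations
directly; fit_blend and the tuple layout of the samples are my own
choices for illustration.

```python
def fit_blend(samples):
    """Least-squares fit of true ~ p1 + (a * p0 + b) * (p0 - p1).

    samples: iterable of (p0, p1, true_value) tuples. Returns (a, b).
    """
    s11 = s12 = s22 = s1y = s2y = 0.0
    for p0, p1, true_value in samples:
        d = p0 - p1
        x1, x2 = p0 * d, d          # regressors for a and b
        y = true_value - p1         # what the blend has to explain
        s11 += x1 * x1
        s12 += x1 * x2
        s22 += x2 * x2
        s1y += x1 * y
        s2y += x2 * y
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("samples do not determine a and b")
    a = (s1y * s22 - s12 * s2y) / det
    b = (s11 * s2y - s12 * s1y) / det
    return a, b
```

As a sanity check on the formula: a = 0, b = 0.5 is the straight
0.5-ply average, a = 0, b = 1 is pure 0-ply, and a = 0, b = 0 is pure
1-ply.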

> Let's call an evaluator based on this idea a
> "hybrid evaluator."  How well does the hybrid evaluator perform?

Just to clarify: What does the hybrid do when the pip-count is between
-40 and 40? Is it then using 0.5-ply?

What happens above 150 and below -160? 
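
To make the question precise, here is how I read the hybrid, as a Python
sketch. The two pipcount ranges are the ones quoted above. Everything
else is an assumption on my part: the inclusive boundaries, and the
fallback outside the stated ranges, which I have made an explicit
parameter (0.5 gives the straight 0.5-ply average).

```python
def hybrid_estimate(p0, p1, relative_pipcount, fallback_weight=0.5):
    """Pick or blend the 0-ply and 1-ply estimates by relative pipcount.

    fallback_weight is the weight on p0 outside the two stated ranges.
    It is an ASSUMPTION of this sketch; the original description does
    not say what happens there.
    """
    if -160 <= relative_pipcount <= -40:
        return p1          # one-ply usually does better here
    if 40 <= relative_pipcount <= 150:
        return p0          # zero-ply usually does better here
    # Unspecified region: between -40 and 40, above 150, below -160.
    return fallback_weight * p0 + (1 - fallback_weight) * p1
```

For example, hybrid_estimate(0.6, 0.5, -100) returns the 1-ply value 0.5,
and hybrid_estimate(0.6, 0.5, 100) returns the 0-ply value 0.6.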
 
>     Hybrid evaluator:  Average error = 0.0225
>
> It should be noted that these tests show how well GnuBG performs at
> computing the ABSOLUTE equity of a position.  They may or may not
> indicate an improvement in GnuBG's ability to *play* a position, since
> playing depends on having accurate RELATIVE equities.  

The importance of this cannot be stressed enough. The above
"improvement" says very little about how the hybrid would fare in
actual play. My guess is that you would see blunders made by neither
0-ply nor 1-ply, in cases where evaluations from different plies are
compared (in a hit/no-hit situation, for instance).

> Nevertheless,
> I'm guessing that the 0.5-ply and hybrid evaluators play better than
> the integer-ply evaluators too.

As I wrote above, this has been tested for 0.5-ply, and it
does indeed score better on the benchmark than both 0-ply and 1-ply.

> -  Would it be possible to build a fractional-ply evaluator into
>    GnuBG that could evaluate positions at 0.5-ply, 1.5-ply, etc.?

I have implemented fractional plies for gnubg - in a slightly different
way than the straight average used for both your tests and Joseph's. If
you are compiling your own gnubg, I'll be happy to send you the patch
(the one I sent to the bug-gnubg list earlier was broken).
Unfortunately, it only does fractional plies above 1-ply :-( I think I
will look at implementing 0.5-ply soon.

> -  Would it be possible to build a hybrid evaluator into GnuBG like the
>    one described above that would use a combination of zero-ply,
>    one-ply, and 0.5-ply evaluation depending on the relative pipcount?

I think this is a little too specialized for what I would want to put
into gnubg - at least until it has been more rigorously tested. Since it
is hard to test that which is not there, I volunteer to implement
it IF I can get someone to actually test it against Joseph's benchmarks.

> At least for cube decisions, they should be an improvement over
> zero-ply or one-ply evaluations.  And maybe checker-play will be 
> better too, making them better for rollouts.

My hope is that the standard settings of gnubg will one day be
fractional - at least for doubles.

-- 
MVH
Nis Jørgensen
Live from Hoofddorp



