
Re: [Bug-gnubg] Training neural nets: How does size matter?


From: Douglas Zare
Subject: Re: [Bug-gnubg] Training neural nets: How does size matter?
Date: Fri, 30 Aug 2002 16:22:23 -0400
User-agent: Internet Messaging Program (IMP) 3.1

Quoting Øystein O Johansen <address@hidden>:

> > I'm training some neural nets other than gnu, and would love to
> > exchange some ideas on training, architecture, etc. with the gnubg
> > developers, among others.
> 
> I guess you allow me to ask you some questions about your nets as well
> then?

Yes, but of course I might not answer them all. 

I'm happy to help the gnu effort, particularly since I don't want you to be 
beaten badly by Snowie 4. However, I also don't want my ideas to be duplicated 
and made freely available even before my program is released.

For those in the NY area, I will give a talk at IBM in the near future in which 
I will give some more details.

>[...]
> About the above statement: 1000K parameters? 250K parameters? This
> sounds like a lot to me. The networks gnubg is using, we have 250
> input nodes and 128 hidden nodes. That's 32640 weights. Is that
> what you call parameters?

Basically. It does mean the weight files I'm using are too large to fit on a 
floppy disk.
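For concreteness, the arithmetic behind the 32640 figure quoted below (my reading, assuming gnubg's usual five output units -- win, gammon win/loss, backgammon win/loss -- and ignoring bias terms):

```python
def n_weights(n_inputs, n_hidden, n_outputs):
    """Weights in a fully connected one-hidden-layer net, biases ignored."""
    return n_inputs * n_hidden + n_hidden * n_outputs

# 250 inputs and 128 hidden nodes, with (presumably) 5 output units:
print(n_weights(250, 128, 5))  # → 32640
```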
 
> Many years ago, I spoke to Fredrik Dahl. He doesn't say much about
> the JellyFish development, but the one thing he said was that there
> wasn't much point in having too many nodes -- the training process
> will just be slower.

It is valuable on multiple levels to be able to evaluate the network rapidly. 
However, more weights do not necessarily mean slower evaluations. I much 
prefer the crispness of Jellyfish's rapid play to Snowie's sluggishness, 
particularly given that Snowie does not seem to have a big advantage in money 
play. (I look forward to seeing more data on this.) I think FD mentioned that 
Jellyfish uses about 20K weights.

Computing power has improved quite a bit since then, of course.

> I think this is Joseph's experience as well. When he started to work
> on the gnubg networks he actually removed some of the input nodes
> that he believed didn't contribute to the training. I have also asked
> him about adding specific input nodes, but after some training with
> these input nodes, he concluded that they don't contribute, or that
> the weights connected to these inputs don't converge.

Do you mean training from scratch using the new inputs, or adding the new input 
with initially low weight to an existing trained network?
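For what it's worth, the second option can be done so that the extended net initially plays almost identically: append a near-zero row to the trained input-to-hidden weight matrix. A sketch, with illustrative shapes rather than gnubg's actual layout:

```python
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical trained input-to-hidden weights: (n_inputs, n_hidden).
W1 = rng.normal(size=(250, 128))

# Append one new input node with near-zero weights: the network's
# outputs are (almost) unchanged at first, and training can then
# grow the new weights only if the input actually helps.
new_row = rng.normal(scale=1e-3, size=(1, 128))
W1_ext = np.vstack([W1, new_row])

print(W1_ext.shape)  # → (251, 128)
```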
 
> Check also the history of eval.c [ref. 2] and look for changes made
> by Joseph.
> 
> > First, roughly what level of improvement do you expect with mature
> > networks of different numbers of hidden nodes?
> 
> No idea! I have never seen a ppg vs. hidden node chart either. I think
> Tesauro gradually increased the number of hidden nodes, starting at
> only 40, increasing to 80, and then using 160 hidden nodes in
> TD-Gammon 3.1 [ref. 1].

I'm familiar with his descriptions in earlier articles. However, I don't know 
what the corresponding improvements in playing strength are supposed to have 
been (the performance in short sessions is inconclusive, of course), and 
whether he felt that the networks were fully trained. 

It would probably be worth training networks with only a few hidden nodes and a 
fixed input set to see how well they perform. It wouldn't take much computing 
time to train the networks, but I had hoped that you all had already done it.
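As a rough illustration of such an experiment, here is a toy version: train one-hidden-layer sigmoid nets of several sizes on the same synthetic data and compare their final errors. Everything here (the data, learning rate, epoch count) is made up for illustration; a real run would use backgammon positions and a proper benchmark.

```python
import numpy as np

def train_mlp(X, y, n_hidden, epochs=500, lr=0.5, seed=0):
    """Train a one-hidden-layer sigmoid net by plain batch gradient
    descent and return the final mean squared error."""
    rng = np.random.default_rng(seed)
    W1 = rng.normal(scale=0.5, size=(X.shape[1], n_hidden))
    W2 = rng.normal(scale=0.5, size=(n_hidden, 1))
    sig = lambda z: 1.0 / (1.0 + np.exp(-z))
    for _ in range(epochs):
        H = sig(X @ W1)            # hidden activations
        out = sig(H @ W2)          # scalar "equity" estimate
        d_out = (out - y) * out * (1 - out)        # output-layer gradient
        d_hid = (d_out @ W2.T) * H * (1 - H)       # backpropagated gradient
        W2 -= lr * H.T @ d_out / len(X)
        W1 -= lr * X.T @ d_hid / len(X)
    return float(np.mean((sig(sig(X @ W1) @ W2) - y) ** 2))

# Toy stand-in for "position features -> game result":
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 10))
y = (X[:, 0] + 0.5 * X[:, 1] * X[:, 2] > 0).astype(float)[:, None]

for h in (1, 5, 20):
    print(h, train_mlp(X, y, h))
```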

> > The quality of a neural net is hard to quantify abstractly, so one
> > could pin it down to, say, correct absolute evaluations in
> > non-contact positions for the racing net, or elo, or cubeless ppg
> > against a decent standard.
> 
> Yes, this is one of the problems, yes!

This is more medicine than science. I think one should pick a few benchmarks 
and use them, and if they aren't enough, add more. Which benchmarks are you set 
up to use so far?

> > I don't think Snowie 3's nets were mature, but if they
> > and Snowie 4's nets are, then how much of an improvement should one
> > expect to see if Snowie 4 has neural nets with twice as many hidden
> > nodes?
> 
> Same answer as above: I have no idea! Maybe Joseph has an idea.

I don't believe the assumptions, but my guess is that the answer is more than 
0.02 ppg.

> > Second, how many fewer nodes can you use for the same quality, if
> > you release the net from predicting what is covered in the racing
> > database?
> 
> You don't train a network to evaluate something it is not supposed to
> evaluate in the future, do you?

Of course. You do, too, from what you write below. 

> I noticed a jump in the performance of the contact network after the
> crashed positions were separated out and the network was only trained
> on "contact" positions. It was as if some brain capacity was released,
> and this capacity was used to improve the game in the contact
> positions.

> > Third, Tesauro mentions that a neural network seems to learn a
> > linear regression first. Are there other describable qualitative
> > phases that one encounters? For example, does a neural network with
> > 50 nodes first imitate the linear regression, then a typical mature
> > 5 node network, then 10 node?
> 
> I have no idea whatsoever!

It's probably worth taking some time to understand these smaller nets. From the 
time of TD-Gammon onwards, backgammon programs have been better than almost all 
human players, which limits how much human critiques of their play can tell us. 
However, a network with intermediate play can be analyzed in a helpful fashion 
by any human expert. 
 
> > It might be wishful thinking, but if it is the case, it might be
> > possible to retain most of the information by training a smaller
> > network to imitate the larger network's evaluations. The smaller
> > network might be faster to train, and then one could pass the
> > information back.
> 
> This is roughly what Joseph is doing in fibs2html and mgnu_zp and
> other friends. He has a very small network with only 5 hidden nodes.
> This network is not only faster to train, but of course also faster
> to evaluate. As I understand it, this net is used to prune candidates
> for the real network. Joseph says this is a huge speed improvement.

That's what Jellyfish level 3 is (though not specifically 5 nodes), right? 
Though its play seems laughable to me now, playing primarily against JF level 3 
took me from the novice level (I learned that an opening 6-1 should not be 
played 13/6 in July or August of 1999) to 1800 on FIBS in a few months. I don't 
think that I would have learned as quickly from a slower program that played 
more accurately.
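The pruning scheme Øystein describes can be sketched generically: score every candidate move with the cheap net, keep the top few survivors, and re-evaluate only those with the full net. The names and the toy evaluators below are illustrative, not gnubg's actual code.

```python
def best_move(candidates, small_eval, big_eval, keep=4):
    """Rank candidates with a cheap evaluator, then pick the best of
    the survivors using the expensive one."""
    pruned = sorted(candidates, key=small_eval, reverse=True)[:keep]
    return max(pruned, key=big_eval)

# Toy demonstration: the cheap evaluator is a noisy version of the
# real one, but the right move still survives the pruning step.
moves = list(range(20))
true_value = lambda m: -(m - 13) ** 2            # move 13 is best
rough_value = lambda m: true_value(m) + (m % 3)  # cheap, slightly wrong
print(best_move(moves, rough_value, true_value))  # → 13
```

The speed win comes from calling `big_eval` only `keep` times instead of once per legal move, which in backgammon can mean dozens of positions per decision.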

> > Are there thresholds for the number of nodes necessary with one
> > hidden layer before particular backgammon concepts begin to be
> > understood?
> 
> Again, in the Tesauro article [Ref. 1], he writes:
> 
>    "The largest network examined in the raw encoding experiments had
>    40 hidden units, and its performance appeared to saturate after
>    about 200,000 games. This network achieved a strong intermediate
>    level of play approximately equal to Neurogammon."

That doesn't say that the raw encoding (which is already quite clever, 
including a lot of backgammon understanding) understood the same concepts that 
Neurogammon understood. Further, I'm more interested in the performance of 
networks that have more complicated inputs than the raw encoding. In my 
experience, intermediate-level play is achievable in a few GHz-minutes. 

> > In chess, people say that with enough lookahead, strategy becomes
> > tactics, but how many nodes do you need before the timing issues
> > of a high anchor holding game are understood by static evaluations?
> > How many for a deep anchor holding game?
> 
> Hard to say. But I must also say, I believe (and I might be wrong)
> this does not necessarily depend that much on the size of the net,
> but rather on how it is trained.

I think it clearly does depend on the size of the net, though of course a huge 
net might not play optimally for its size. By the size of the net, I don't just 
mean possible increases in size, but also possible decreases. So I mean that if 
you shrink the net too much, it won't be able to understand, e.g., what a safe 
contact bearoff structure looks like, or how to build 3 stacked points into a 
prime on 0-ply. Of course, asymptotically, perfect play is achievable with 
sufficiently many nodes.

It may well be the case that gnu's network is large enough to play much better 
than it does, and that training is the key to improvement. 

> Now to my questions:

I'm going to skip most of these. Suffice it to say that I and others working 
with me have introduced some innovations to most of these, but I don't want to 
describe them yet.

> How do you evaluate races? Do you have different inputs for your
> race network? Or do you use a race database? If you use a NN,
> how did you train this net? TD is completely useless here?

I'll answer here. First, from one-sided databases we have constructed some 
lookup tables that are used as inputs. For example, the lookup table can include 
the exact chances of winning at DMP against a pure n-roll position for each n. 
Second, one can apply these lookup tables even in contact positions. 
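One way to read the first point (my interpretation, not necessarily the actual implementation): a one-sided database gives the distribution of the number of rolls a position needs to bear off, and against a pure n-roll position the player on roll wins at DMP exactly when he bears off in at most n rolls. The table entry is then just a cumulative sum:

```python
def win_vs_pure_n_roll(roll_dist, n):
    """Chance that the player on roll, whose rolls-to-bear-off
    distribution is roll_dist[k] = P(exactly k rolls), beats a pure
    n-roll position at double match point: being on roll, he wins
    iff he bears off in at most n rolls."""
    return sum(p for k, p in roll_dist.items() if k <= n)

# Illustrative (made-up) distribution from a one-sided database:
dist = {2: 0.10, 3: 0.55, 4: 0.30, 5: 0.05}
table = {n: round(win_vs_pure_n_roll(dist, n), 4) for n in range(2, 6)}
print(table)  # → {2: 0.1, 3: 0.65, 4: 0.95, 5: 1.0}
```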

> How do you benchmark your nets?

I use a variety of methods. One is through checking evaluations of reference 
positions. Another is the level of disagreement between plies (see my "Bot 
Confusion" column). I expect to include rollouts of positions of one-sided 
errors and variance reduced play against opponents of fixed strength soon. 
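The ply-disagreement benchmark can be expressed abstractly: evaluate a sample of positions with the static evaluation and with a deeper search, and average the absolute differences. The evaluators below are trivial stand-ins, just to show the shape of the measurement:

```python
def mean_ply_disagreement(positions, eval_0ply, eval_deeper):
    """Average absolute difference between a static evaluation and a
    deeper (e.g. 2-ply) evaluation over a set of positions. Larger
    values suggest a less self-consistent evaluator."""
    diffs = [abs(eval_0ply(p) - eval_deeper(p)) for p in positions]
    return sum(diffs) / len(diffs)

# Toy stand-ins: the "deeper" evaluator corrects a systematic bias.
positions = range(10)
static = lambda p: 0.1 * p
deeper = lambda p: 0.1 * p + 0.02
print(round(mean_ply_disagreement(positions, static, deeper), 6))  # → 0.02
```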

Douglas Zare
