
Re: [Bug-gnubg] Training neural nets: How does size matter?


From: Øystein O Johansen
Subject: Re: [Bug-gnubg] Training neural nets: How does size matter?
Date: Fri, 30 Aug 2002 09:56:28 +0200

Hi,

I sent this mail to Doug yesterday, and I forgot to copy the list. Sorry!
Please correct me if something is wrong.

-Øystein

----- Forwarded by Øystein O Johansen on 30.08.2002 09:53 -----

From: Øystein O Johansen
To: Douglas Zare <address@hidden>
Date: 29.08.2002 11:27
Subject: Re: [Bug-gnubg] Training neural nets: How does size matter?

Hi,

> I'm training some neural nets other than gnu, and would love to
> exchange some ideas on training, architecture, etc. with the gnubg
> developers, among others.

I guess that means you'll allow me to ask some questions about your
nets as well?

> I have a few questions I hope some on this mailing list have the
> experience to answer. Some were prompted when a test network with
> 250K parameters that I was training surpassed (on some benchmarks,
> but perhaps not playing strength) a network with 1000K parameters,
> to my surprise.

(Your mail is probably best answered by Joseph. He is the one who
trains the nets. Unfortunately he's on vacation or something these
days, and won't be back until mid-October I believe.)

About the above statement: 1000K parameters? 250K parameters? That
sounds like a lot to me. In the networks gnubg uses, we have 250
input nodes and 128 hidden nodes. That's 32640 weights. Is that
what you call parameters?
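
To make the arithmetic explicit, here is a tiny C check of that
figure. It assumes the usual five outputs and no bias weights --
those two assumptions are mine, not a description of the actual net:

    #include <stdio.h>

    /* Weight count for the net described above: 250 inputs, 128
       hidden nodes, and (assumed) 5 outputs with no bias weights. */
    int main(void)
    {
        const int inputs = 250, hidden = 128, outputs = 5;
        printf("weights = %d\n",
               inputs * hidden + hidden * outputs);   /* prints 32640 */
        return 0;
    }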

Many years ago I spoke to Fredrik Dahl. He doesn't say much about
the JellyFish development, but the one thing he did say was that
there isn't much point in having too many nodes -- the training
process will just be slower.

I think this is Joseph's experience as well. When he started to work
on the gnubg networks he actually removed some of the input nodes
that he believed didn't contribute to the training. I have also asked
him about adding specific input nodes, but after some training with
them he concluded that the new input nodes don't contribute, or that
the weights connected to these inputs don't converge.

Check also the history of eval.c [ref. 2] and look for changes made
by Joseph.

> First, roughly what level of improvement do you expect with mature
> networks of different numbers of hidden nodes?

No idea! I have never seen a ppg vs. hidden nodes chart either. I
think Tesauro gradually increased the number of hidden nodes: he
started at only 40 hidden nodes, increased this to 80, and then used
160 hidden nodes in TD-Gammon 3.1 [ref. 1].

> The quality of a neural net is hard to quantify abstractly, so one
> could pin it down to, say, correct absolute evaluations in
> non-contact positions for the racing net, or elo, or cubeless ppg
> against a decent standard.

Yes, this is one of the problems!

> I don't think Snowie 3's nets were mature, but if they
> and Snowie 4's nets are, then how much of an improvement should one
> expect to see if Snowie 4 has neural nets with twice as many hidden
> nodes?

Same answer as above: I have no idea! Maybe Joseph has an idea.

> Second, how many fewer nodes can you use for the same quality, if
> you release the net from predicting what is covered in the racing
> database?

You don't train a network to evaluate something it is not supposed to
evaluate in the future, do you?

I noticed a jump in the performance of the contact network after the
crashed positions were separated out and the network was only trained
on "contact" positions. It was as if some brain capacity was
released, and this capacity was used to improve the play in contact
positions.

I can't put a number on the above question, but I believe the best
thing is to limit the size of the nets and have several nets for
different position classes. We have also discussed a meta-pi scheme
that interpolates between position classes, as in the sketch below.
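
To show what I mean by the meta-pi idea, here is a minimal C sketch
that blends the outputs of two class networks with a
position-dependent mixing weight. The two-net case and all the names
are made up for illustration; this is not gnubg code.

    enum { NUM_OUTPUTS = 5 };

    /* Blend the outputs of a race net and a contact net with a mixing
       weight w_race in [0,1].  A hard class split corresponds to
       w_race being exactly 0 or 1. */
    void meta_pi_blend(const float race_out[NUM_OUTPUTS],
                       const float contact_out[NUM_OUTPUTS],
                       float w_race,
                       float blended[NUM_OUTPUTS])
    {
        for (int i = 0; i < NUM_OUTPUTS; ++i)
            blended[i] = w_race * race_out[i]
                       + (1.0f - w_race) * contact_out[i];
    }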

gnubg uses a race network, a contact network and a crashed network.
I think this scheme works OK. The crashed network is not mature yet,
but there is work in progress. There have also been discussions about
splitting into more classes. I guess this will be done eventually,
but we must take one step at a time. Two years ago there was also a
network called BPG (Backgame and Prime). There were real problems
with this net, so it was removed.
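
For contrast with the meta-pi blend above, the hard split we use
today looks roughly like this. The classifier and the three
evaluators are placeholders; the real thing is in eval.c [ref. 2].

    typedef enum { CLASS_RACE, CLASS_CRASHED, CLASS_CONTACT } positionclass;

    /* Placeholder declarations -- stand-ins for the real classifier
       and nets. */
    positionclass classify(const int board[2][25]);
    void eval_race(const int board[2][25], float out[5]);
    void eval_crashed(const int board[2][25], float out[5]);
    void eval_contact(const int board[2][25], float out[5]);

    /* Classify first, then hand the position to the matching net. */
    void evaluate_position(const int board[2][25], float out[5])
    {
        switch (classify(board)) {
        case CLASS_RACE:    eval_race(board, out);    break;
        case CLASS_CRASHED: eval_crashed(board, out); break;
        default:            eval_contact(board, out); break;
        }
    }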

> Third, Tesauro mentions that a neural network seems to learn a
> linear regression first. Are there other describable qualitative
> phases that one encounters? For example, does a neural network with
> 50 nodes first imitate the linear regression, then a typical mature
> 5 node network, then 10 node?

I have no idea whatsoever!

> It might be wishful thinking, but if it is the case, it might be
> possible to retain most of the information by training a smaller
> network to imitate the larger network's evaluations. The smaller
> network might be faster to train, and then one could pass the
> information back.

This is roughly what Joseph is doing in fibs2html, mgnu_zp and other
friends. He has a very small network with only 5 hidden nodes. This
network is not only faster to train, but of course also faster to
evaluate. As I understand it, this net is used to prune candidates
for the real network. Joseph says this gives a huge speed
improvement.
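
As I understand the scheme (and this is my sketch, not Joseph's
actual code), it is roughly: score every candidate move with the tiny
net, keep only the best few, and spend the big net's time on those.
The "keep the top 8" cut-off below is an arbitrary number for
illustration.

    #include <stdlib.h>

    /* Hypothetical candidate type and evaluators. */
    typedef struct { int board[2][25]; float score; } candidate;

    float small_net_eval(int board[2][25]);  /* 5 hidden nodes, fast */
    float full_net_eval(int board[2][25]);   /* full net, slow */

    static int by_score_desc(const void *a, const void *b)
    {
        float d = ((const candidate *)b)->score
                - ((const candidate *)a)->score;
        return (d > 0) - (d < 0);
    }

    /* Return the index (in the re-sorted array) of the chosen move. */
    int pick_move(candidate *moves, int n_moves)
    {
        int keep, best = 0, i;
        float best_score;

        if (n_moves <= 0)
            return -1;

        /* Cheap pass: score everything with the small net. */
        for (i = 0; i < n_moves; ++i)
            moves[i].score = small_net_eval(moves[i].board);
        qsort(moves, n_moves, sizeof *moves, by_score_desc);

        /* Expensive pass: only the best few see the full net. */
        keep = n_moves < 8 ? n_moves : 8;
        best_score = full_net_eval(moves[0].board);
        for (i = 1; i < keep; ++i) {
            float s = full_net_eval(moves[i].board);
            if (s > best_score) { best_score = s; best = i; }
        }
        return best;
    }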

> Are there thresholds for the number of nodes necessary with one
> hidden layer before particular backgammon concepts begin to be
> understood?

Again, in the Tesauro article [ref. 1], he writes:

   "The largest network examined in the raw encoding experiments had
   40 hidden units, and its performance appeared to saturate after
   about 200,000 games. This network achieved a strong intermediate
   level of play approximately equal to Neurogammon."

> In chess, people say that with enough lookahead, strategy becomes
> tactics, but how many nodes do you need before the timing issues
> of a high anchor holding game are understood by static evaluations?
> How many for a deep anchor holding game?

Hard to say. But I must also say that I believe (and I might be
wrong) this does not depend that much on the size of the net, but
rather on how it is trained.

Now to my questions:
I guess you also use a normal network scheme: MLP, backpropagation
training and sigmoid squashing? Just one hidden layer, I guess?
Do you use one neural net for all kinds of positions? It sounds
to me like you're using a lot of input nodes. Basically, what are
all the input nodes? (TD-Gammon 3.1 uses approx. 300 inputs.)
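
To be explicit about what I mean by "normal", here is a
one-hidden-layer forward pass with sigmoid squashing. The sizes are
just illustrative, and biases are included here even though the
earlier weight count ignored them; our real evaluation is in eval.c
[ref. 2].

    #include <math.h>

    enum { N_IN = 250, N_HID = 128, N_OUT = 5 };

    static float sigmoid(float x) { return 1.0f / (1.0f + expf(-x)); }

    /* Forward pass: input -> sigmoid hidden layer -> sigmoid outputs. */
    void forward(const float in[N_IN],
                 const float w_ih[N_HID][N_IN], const float b_h[N_HID],
                 const float w_ho[N_OUT][N_HID], const float b_o[N_OUT],
                 float out[N_OUT])
    {
        float hid[N_HID];
        int i, j, k;

        for (j = 0; j < N_HID; ++j) {
            float sum = b_h[j];
            for (i = 0; i < N_IN; ++i)
                sum += w_ih[j][i] * in[i];
            hid[j] = sigmoid(sum);
        }
        for (k = 0; k < N_OUT; ++k) {
            float sum = b_o[k];
            for (j = 0; j < N_HID; ++j)
                sum += w_ho[k][j] * hid[j];
            out[k] = sigmoid(sum);
        }
    }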

How do you train your networks? TD(lambda)? Supervised training
based on rollouts or deeper ply search? gnubg has not been trained
with TD(lambda) for the last few years; instead it is trained against
rollouts and deeper ply evaluations. TD training took it to about a
1650 FIBS rating, but it was not able to advance from that point.
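
For reference, the TD(lambda) update I'm asking about keeps an
eligibility trace per weight and nudges the weights by the temporal
difference error. Here is a generic single-output sketch in my own
notation; it is not the code either of us actually runs.

    /* TD(lambda) update for a single output value V:
         e_i   <- lambda * e_i + dV/dw_i        (eligibility trace)
         delta <- V(s') - V(s)                  (or final result - V(s))
         w_i   <- w_i + alpha * delta * e_i
       Supervised training against rollouts just replaces delta with
       (rollout_target - V(s)) and drops the traces. */
    void td_lambda_update(float *w, float *e, const float *grad, int n,
                          float v_now, float v_next,
                          float alpha, float lambda)
    {
        int i;
        float delta = v_next - v_now;

        for (i = 0; i < n; ++i) {
            e[i] = lambda * e[i] + grad[i];
            w[i] += alpha * delta * e[i];
        }
    }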

How do you evaluate races? Do you have different inputs for your
race network? Or do you use a race database? If you use an NN,
how did you train this net? Is TD completely useless here?

You don't have special nets for different match scores, do you? Nets
trained for DMP, GG and GS would be a good improvement.

Are you using the same basic five outputs? Or have you tried
to get the network to estimate some kind of cube parameters as
well?

How do you benchmark your nets?

> Douglas Zare

regs,
Øystein Johansen

References:
[1] G. Tesauro, Artificial Intelligence 134 (2002) 181-199.
[2] http://savannah.gnu.org/cgi-bin/viewcvs/gnubg/gnubg/eval.c

