Temporal Difference Learning of Position Evaluation in the Game of Go
N. N. Schraudolph, P. Dayan, and T. J. Sejnowski. Temporal Difference Learning of Position Evaluation in the Game of Go. In Advances in Neural Information Processing Systems (NIPS), pp. 817–824, Morgan Kaufmann, San Francisco, CA, 1994.
Abstract
Despite their facility at most other games of strategy, computers remain inept players of the game of Go. Its high branching factor defeats the tree search approach used in computer chess, while its long-range spatiotemporal interactions make position evaluation extremely challenging. Further development of conventional Go programs is hampered by their knowledge-intensive nature. We demonstrate a viable alternative by training neural networks to evaluate Go positions via temporal difference (TD) learning. We developed network architectures that reflect the spatial organisation of both input and reinforcement signals on the Go board, and training protocols that provide exposure to competent (though unlabelled) play. These techniques yield far better performance than undifferentiated networks trained by self-play alone. A network with less than 500 weights learned within 3000 games of 9x9 Go a position evaluation function that enables a primitive one-ply search to defeat a commercial Go program at a low playing level.
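The approach described above, a small value network trained with temporal-difference updates and moves chosen by a primitive one-ply search over the learned evaluation, can be sketched roughly as follows. This is a minimal illustration rather than the authors' implementation: the plain fully connected architecture, TD(0), the layer sizes, learning rate, and the random "positions" in the toy run are all assumptions made for the sketch; the paper's actual networks exploit the spatial organisation of the Go board.

import numpy as np


class TDEvaluator:
    """Tiny feed-forward value network mapping board features to a win estimate.

    Illustrative only: sizes and learning rate are arbitrary assumptions.
    """

    def __init__(self, n_inputs=81, n_hidden=5, lr=0.1, seed=0):
        rng = np.random.default_rng(seed)
        self.W1 = rng.normal(scale=0.1, size=(n_hidden, n_inputs))
        self.W2 = rng.normal(scale=0.1, size=n_hidden)
        self.lr = lr

    def value(self, x):
        # Hidden tanh layer, sigmoid output in (0, 1): predicted chance of winning.
        h = np.tanh(self.W1 @ x)
        return 1.0 / (1.0 + np.exp(-(self.W2 @ h)))

    def td_update(self, x, target):
        # One TD(0) step: nudge value(x) toward the successor position's value
        # (or toward the final game result at the end of the game).
        h = np.tanh(self.W1 @ x)
        v = 1.0 / (1.0 + np.exp(-(self.W2 @ h)))
        delta = target - v                              # TD error
        grad_out = v * (1.0 - v)                        # sigmoid derivative
        dW2 = delta * grad_out * h
        dW1 = delta * grad_out * np.outer(self.W2 * (1.0 - h ** 2), x)
        self.W2 += self.lr * dW2
        self.W1 += self.lr * dW1


if __name__ == "__main__":
    net = TDEvaluator()
    rng = np.random.default_rng(1)
    # Toy run on random feature vectors standing in for successive board positions.
    positions = [rng.normal(size=81) for _ in range(5)]
    for x, x_next in zip(positions, positions[1:]):
        net.td_update(x, net.value(x_next))             # bootstrap from successor
    net.td_update(positions[-1], 1.0)                   # terminal target: game outcome

In actual play, a one-ply search would generate every legal move, encode each resulting board as a feature vector, and pick the move whose position the network values most highly for the side to move.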
Additional Information
Jay Scott has reviewed this paper for his archive on Machine Learning in Games.
BibTeX Entry
@inproceedings{SchDaySej94,
  author    = {Nicol N. Schraudolph and Peter Dayan and Terrence J. Sejnowski},
  title     = {\href{http://nic.schraudolph.org/pubs/SchDaySej94.pdf}{Temporal Difference Learning of Position Evaluation in the Game of Go}},
  pages     = {817--824},
  editor    = {Jack D. Cowan and Gerald Tesauro and Joshua Alspector},
  booktitle = nips,
  publisher = {Morgan Kaufmann, San Francisco, CA},
  volume    = 6,
  year      = 1994,
  b2h_type  = {Top Conferences},
  b2h_topic = {Reinforcement Learning},
  b2h_note  = {<a href="b2hd-SchDaySej01.html">Latest version</a>},
  b2h_info  = {<a href="http://satirist.org/">Jay Scott</a> has <a href="http://satirist.org/learn-game/systems/go-net.html">review</a>ed this paper for his archive on <a href="http://satirist.org/learn-game/">Machine Learning in Games</a>.},
  abstract  = {Despite their facility at most other games of strategy, computers remain inept players of the game of Go. Its high branching factor defeats the tree search approach used in computer chess, while its long-range spatiotemporal interactions make position evaluation extremely challenging. Further development of conventional Go programs is hampered by their knowledge-intensive nature. We demonstrate a viable alternative by training neural networks to evaluate Go positions via {\em temporal difference}\/ (TD) learning. We developed network architectures that reflect the spatial organisation of both input and reinforcement signals on the Go board, and training protocols that provide exposure to competent (though unlabelled) play. These techniques yield far better performance than undifferentiated networks trained by self-play alone. A network with less than 500 weights learned within 3\,000 games of 9x9 Go a position evaluation function that enables a primitive one-ply search to defeat a commercial Go program at a low playing level.}
}