Paper - Mastering the Game of Go with Deep Neural Networks and Tree Search
This was the paper that introduced AlphaGo, the first computer program to beat a professional human player in a full-sized game of Go.
Go is much harder for computers than chess because the search space is far larger (the average branching factor is around 250, compared to around 35 for chess), and because board positions are difficult to evaluate accurately.
The main innovation of the paper was augmenting Monte-Carlo tree search with two deep neural networks, a “value network” that evaluates board positions and a “policy network” that selects moves, both trained through a combination of supervised and reinforcement learning.
More notes
- All games of perfect information have an optimal value function $v^\ast (s)$, which determines the outcome of the game from every board position $s$ under perfect play by all players.
- Minimax search computes this optimal value function recursively, but exhaustive search is intractable for games as large as Go.
- The search space is typically reduced by two principles:
- Reducing the depth by cutting off the search when a depth limit is reached and replacing the subtree below with an approximate value function (see the first sketch after this list), or
- Reducing the breadth by sampling actions from a policy $p(a \mid s)$, a probability distribution over the possible moves $a$ in state $s$, rather than considering every legal action.
- Monte-Carlo rollouts evaluate a position by randomly sampling actions from a policy until the game is over and averaging the outcomes (see the second sketch after this list).
- Monte-Carlo tree search (MCTS) combines traditional tree search with Monte-Carlo rollouts: each simulation grows the search tree and refines its value estimates, so the policy used to select actions within the tree improves over time.
- AlphaGo uses a value network for evaluating states, and a policy network for sampling actions.
- Initially, two policies are trained by supervised learning from expert human moves: a large convolutional policy network $p _ \sigma$ and a much smaller, faster rollout policy $p _ \pi$ used during Monte-Carlo rollouts.
- Then a reinforcement learning policy network $p _ \rho$ is trained, initialised from $p _ \sigma$ and improved by optimising for the final outcome of games of self-play. This means the network learns to choose actions that lead to winning outcomes, rather than the move a human would most likely play (the update rule is sketched after this list).
- Then a value network $v _ \theta$ is trained by regression to predict the winner of games played by $p _ \rho$ against itself.
- These components are integrated into a modified Monte-Carlo tree search: the policy network provides prior probabilities for selecting moves in the tree, and leaf positions are evaluated by mixing the value network’s prediction with the outcome of a fast rollout using $p _ \pi$ (see the final sketch below). The search is implemented as an asynchronous, multi-threaded algorithm so that it can run at scale across many CPUs and GPUs.
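Some sketches to make the notes above concrete. First, the depth-reduction idea: a plain negamax (minimax) search that cuts off at a fixed depth and substitutes an approximate value for the truncated subtree. The toy `NimState` game and the `approx_value` heuristic are purely illustrative stand-ins, not anything from the paper.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class NimState:
    """Toy perfect-information game: take 1-3 stones, taking the last stone wins.
    A stand-in for a real game state, used only to make the sketch runnable."""
    stones: int

    def is_terminal(self):
        return self.stones == 0

    def outcome(self):
        # No stones remain, so the player to move has lost (the opponent took
        # the last stone). Values are from the perspective of the player to
        # move: +1 win, -1 loss.
        return -1

    def legal_moves(self):
        return [m for m in (1, 2, 3) if m <= self.stones]

    def play(self, move):
        return NimState(self.stones - move)


def negamax(state, depth, approx_value):
    """Depth-limited minimax (negamax form): search as usual, but when the
    depth limit is reached, replace the subtree below with an approximate
    value v(s) ~ v*(s) instead of searching to the end of the game."""
    if state.is_terminal():
        return state.outcome()
    if depth == 0:
        return approx_value(state)
    # Our value is the best achievable negation of the opponent's value.
    return max(-negamax(state.play(m), depth - 1, approx_value)
               for m in state.legal_moves())


if __name__ == "__main__":
    # A crude heuristic standing in for the approximation at the cutoff.
    print(negamax(NimState(10), depth=4, approx_value=lambda s: 0.0))
```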
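Second, the breadth-reduction / rollout idea, reusing the toy `NimState` above: rather than expanding every move, the value of a position is estimated by repeatedly sampling moves from a policy until the game ends and averaging the outcomes. The uniform `random_policy` is a placeholder; in AlphaGo the rollouts use the fast policy $p _ \pi$.

```python
import random

def random_policy(state):
    """Placeholder policy p(a | s): uniform over legal moves."""
    moves = state.legal_moves()
    return {m: 1.0 / len(moves) for m in moves}


def rollout_value(state, policy, n_rollouts=1000):
    """Estimate v(s) by Monte-Carlo rollouts: sample moves from `policy`
    until the game is over, and average the outcomes from the point of
    view of the player to move in `state`."""
    total = 0.0
    for _ in range(n_rollouts):
        s, sign = state, 1
        while not s.is_terminal():
            moves, probs = zip(*policy(s).items())
            s = s.play(random.choices(moves, weights=probs, k=1)[0])
            sign = -sign  # players alternate, so the perspective flips
        total += sign * s.outcome()
    return total / n_rollouts


# e.g. rollout_value(NimState(10), random_policy) gives a noisy estimate of v*(s).
```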
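The training updates behind the pipeline above, as given in the paper (up to learning rates): the reinforcement learning policy network is updated by a REINFORCE-style policy gradient that makes moves from won games more likely, and the value network is trained by regression (mean squared error) on the self-play outcome:

$$
\Delta \rho \propto \frac{\partial \log p _ \rho (a_t \mid s_t)}{\partial \rho} \, z_t,
\qquad
\Delta \theta \propto \frac{\partial v _ \theta (s)}{\partial \theta} \, \bigl( z - v _ \theta (s) \bigr)
$$

where $z_t = \pm 1$ is the final outcome of the self-play game from the perspective of the player to move at step $t$.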
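Finally, a sketch of how the networks plug into the search. During each simulation the tree is descended by picking the action that maximises $Q(s, a) + u(s, a)$, where the exploration bonus $u(s, a) \propto P(s, a) / (1 + N(s, a))$ uses the policy network’s output as a prior $P(s, a)$ and decays with the visit count $N(s, a)$; leaf positions are evaluated as a mixture $(1 - \lambda) v _ \theta (s_L) + \lambda z_L$ of the value network and the outcome $z_L$ of a fast rollout with $p _ \pi$. The code below sketches the selection step only; the `Edge` fields, the `c_puct` default, and the tie-breaking tweak are illustrative assumptions rather than the paper’s exact implementation.

```python
import math
from dataclasses import dataclass

@dataclass
class Edge:
    """Statistics stored on each edge (s, a) of the search tree."""
    P: float        # prior probability from the policy network
    N: int = 0      # visit count
    W: float = 0.0  # total value accumulated over simulations

    @property
    def Q(self):
        # Mean action value over the simulations that passed through this edge.
        return self.W / self.N if self.N else 0.0


def select_action(edges, c_puct=5.0):
    """Tree-policy selection: argmax_a Q(s, a) + u(s, a). The bonus u is
    proportional to the prior and decays as the edge is visited, so the
    search initially trusts the policy network and gradually shifts towards
    the actions with the best estimated values."""
    total = sum(e.N for e in edges.values())
    # Small tweak: if nothing has been visited yet, let the prior break the tie.
    sqrt_total = math.sqrt(total) if total else 1.0

    def score(item):
        _, e = item
        return e.Q + c_puct * e.P * sqrt_total / (1 + e.N)

    return max(edges.items(), key=score)[0]


# e.g. with edges = {"a": Edge(P=0.6), "b": Edge(P=0.4)}, the first selection
# follows the prior; as N and W are updated during backup, Q dominates and the
# most promising move accumulates the most visits.
```

Once the search budget is exhausted, the move actually played is the most-visited action at the root; the paper found $\lambda = 0.5$ to work best for the leaf mixing parameter.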