# Paper - Mastering the Game of Go with Deep Neural Networks and Tree Search

> Source: https://ollybritton.com/notes/papers/mastering-the-game-of-go/ · Updated: 2025-01-08

- [Link](https://www.davidsilver.uk/wp-content/uploads/2020/03/unformatted_final_mastering_go.pdf)

This was the paper that introduced AlphaGo Zero, which was the first Go computer program to beat a human professional player in a full-sized game of Go.

It's a lot harder to get a computer to play Go because the search space is much larger (the average branching factor is around ~250, compared to ~35 for chess), and because it's difficult to accurately evaluate the board.

The main innovation of this paper was augmenting Monte-Carlo tree search with two deep neural networks: a "value network", which evaluates board positions, and a "policy network" that selects moves, which were trained through a combination of supervised and reinforcement learning.

### More notes
- All games of perfect information have an optimal value function $v^\ast (s)$ which determines the outcome under perfect play by all players.
- Minimax is an algorithm for computing this optimal value function, but this is often intractable.
- The search space is typically reduced by two principles:
	- Reducing the depth by cutting off search when a depth limit is reached, and then using an approximation, or
	- Reducing the breadth by sampling actions from a policy $p(a \mid s)$, a distribution of the possible moves in a given state, rather than considering all actions in a given state.
- Monte-Carlo rollouts evaluates nodes by randomly sampling actions until the game is over.
- Monte-Carlo tree search combines traditional search with Monte-Carlo rollouts, and the policy is improved over time.
- AlphaZero uses a value network for evaluating states, and a policy network for sampling actions.
- Initially, two supervised learning policy networks $p_\sigma$ and $p_\pi$ are trained from expert human moves. $p_\sigma$ is a much larger network than $p_\pi$.
- Then a reinforcement learning policy network $p_\rho$ is trained which improves the $p_\sigma$ network by optimising for the final outcome of games of self-play. This means that the network predicts actions which will lead to winning outcomes, rather than the most likely move a human would play.
- Then a value network $v_\theta$ is trained that predicts the winner of games played by $p_\rho$ against itself.
- These are all integrated into the standard Monte-Carlo tree search algorithm, which is modified to also be able to run at scale.

---
Olly Britton — https://ollybritton.com. Machine-readable index: https://ollybritton.com/llms.txt