`AlphaZero.AbstractPlayer`

— Type `AbstractPlayer`

Abstract type for a game player.

`AlphaZero.AbstractSchedule`

— Type `AbstractSchedule{R}`

Abstract type for a parameter schedule, which represents a function from nonnegative integers to numbers of type `R`. Subtypes must implement the `getindex(s::AbstractSchedule, i::Int)` operator.
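
For illustration, here is a minimal sketch of a custom schedule (the type name `HalvingSchedule` is hypothetical, not part of the library):

```
using AlphaZero

# Hypothetical schedule that halves its value every `period` steps.
struct HalvingSchedule <: AlphaZero.AbstractSchedule{Float64}
  initial :: Float64
  period  :: Int
end

# The only required operation: map a nonnegative integer to a value.
Base.getindex(s::HalvingSchedule, i::Int) = s.initial / 2^(i ÷ s.period)
```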

`AlphaZero.ArenaParams`

— Type `ArenaParams`

Parameters that govern the evaluation process that compares the current neural network with the best one seen so far (which is used to generate data).

| Parameter | Type | Default |
|---|---|---|
| `mcts` | `MctsParams` | - |
| `sim` | `SimParams` | - |
| `update_threshold` | `Float64` | - |

**Explanation (two-player games)**

- The two competing networks are instantiated into two MCTS players of parameter `mcts`, which then play `sim.num_games` games.
- The evaluated network replaces the current best one if its average collected reward is greater than or equal to `update_threshold`.

**Explanation (single-player games)**

- The two competing networks each play `sim.num_games` games.
- The evaluated network replaces the current best one if its average collected reward exceeds the average collected reward of the old one by at least `update_threshold`.

**Remarks**

- See `necessary_samples` to make an informed choice for `sim.num_games`.

**AlphaGo Zero Parameters**

In the original AlphaGo Zero paper, 400 games are played to evaluate a network and the `update_threshold` parameter is set to a value that corresponds to a 55% win rate.
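
For illustration, arena parameters might be instantiated as follows (keyword construction using the parameter names above; the values are arbitrary, not recommendations):

```
arena = ArenaParams(
  sim=SimParams(num_games=128, num_workers=64, batch_size=32),
  mcts=MctsParams(
    num_iters_per_turn=400,
    dirichlet_noise_ϵ=0.05,
    dirichlet_noise_α=1.0),
  update_threshold=0.05)
```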

`AlphaZero.Env`

— Type `Env`

Type for an AlphaZero environment.

The environment features the current neural network, the best neural network seen so far that is used for data generation, a memory buffer and an iteration counter.

**Constructor**

`Env(game_spec, params, curnn, bestnn=copy(curnn), experience=[], itc=0)`

Construct a new AlphaZero environment:

- `game_spec` specifies the game being played
- `params` has type `Params`
- `curnn` is the current neural network and has type `AbstractNetwork`
- `bestnn` is the best neural network seen so far, which is used for data generation
- `experience` is the initial content of the memory buffer, as a vector of `TrainingSample`
- `itc` is the value of the iteration counter (0 at the start of training)

`AlphaZero.EpsilonGreedyPlayer`

— Type `EpsilonGreedyPlayer{Player} <: AbstractPlayer`

A wrapper on a player that makes it choose a random move with a fixed probability $ϵ$.

`AlphaZero.Human`

— Type `Human <: AbstractPlayer`

Human player that queries the standard input for actions.

Does not implement `think` but instead implements `select_move` directly.

`AlphaZero.LearningParams`

— Type `LearningParams`

Parameters governing the learning phase of a training iteration, where the neural network is updated to fit the data in the memory buffer.

| Parameter | Type | Default |
|---|---|---|
| `use_gpu` | `Bool` | `false` |
| `use_position_averaging` | `Bool` | `true` |
| `samples_weighing_policy` | `SamplesWeighingPolicy` | - |
| `optimiser` | `OptimiserSpec` | - |
| `l2_regularization` | `Float32` | - |
| `rewards_renormalization` | `Float32` | `1f0` |
| `nonvalidity_penalty` | `Float32` | `1f0` |
| `batch_size` | `Int` | - |
| `loss_computation_batch_size` | `Int` | - |
| `min_checkpoints_per_epoch` | `Float64` | - |
| `max_batches_per_checkpoint` | `Int` | - |
| `num_checkpoints` | `Int` | - |

**Description**

The neural network goes through `num_checkpoints` series of `n` updates using batches of size `batch_size` drawn from memory, where `n` is defined as follows:

`n = min(max_batches_per_checkpoint, ntotal ÷ min_checkpoints_per_epoch)`

with `ntotal` the total number of batches in memory. Between each series, the current network is evaluated against the best network so far (see `ArenaParams`).
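
As a quick illustration of this formula (example numbers only; names taken from the table above):

```
# With 2000 batches in memory, at least 4 checkpoints per epoch and at
# most 1000 batches per checkpoint, each series performs 500 updates.
ntotal = 2000
min_checkpoints_per_epoch = 4.0
max_batches_per_checkpoint = 1000
n = min(max_batches_per_checkpoint, ntotal ÷ min_checkpoints_per_epoch)  # 500.0
```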

- `nonvalidity_penalty` is the multiplicative constant of a loss term that corresponds to the average probability weight that the network puts on invalid actions.
- `batch_size` is the batch size used for gradient descent.
- `loss_computation_batch_size` is the batch size used to compute the loss between epochs.
- All rewards are divided by `rewards_renormalization` before the MSE loss is computed.
- If `use_position_averaging` is set to true, samples in memory that correspond to the same board position are averaged together. The merged sample is reweighted according to `samples_weighing_policy`.

**AlphaGo Zero Parameters**

In the original AlphaGo Zero paper:

- The batch size for gradient updates is $2048$.
- The L2 regularization parameter is set to $10^{-4}$.
- Checkpoints are produced every 1000 training steps, which corresponds to seeing about 20% of the samples in the memory buffer: $(1000 × 2048) / 10^7 ≈ 0.2$.
- It is unclear how many checkpoints are taken or how many training steps are performed in total.

`AlphaZero.MctsParams`

— Type `MctsParams`

Parameters of an MCTS player.

| Parameter | Type | Default |
|---|---|---|
| `num_iters_per_turn` | `Int` | - |
| `gamma` | `Float64` | `1.` |
| `cpuct` | `Float64` | `1.` |
| `temperature` | `AbstractSchedule{Float64}` | `ConstSchedule(1.)` |
| `dirichlet_noise_ϵ` | `Float64` | - |
| `dirichlet_noise_α` | `Float64` | - |
| `prior_temperature` | `Float64` | `1.` |

**Explanation**

An MCTS player picks an action as follows. Given a game state, it launches `num_iters_per_turn` MCTS iterations, with UCT exploration constant `cpuct`. Rewards are discounted using the `gamma` factor.

Then, an action is picked according to the distribution $π$ where $π_i ∝ n_i^{1/τ}$, with $n_i$ the number of times that the $i^{\text{th}}$ action was visited and $τ$ the `temperature` parameter.
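
Concretely, this selection rule can be sketched as follows (variable names are illustrative):

```
# Visit counts nᵢ for three actions, and temperature τ.
visits = [10, 40, 50]
τ = 1.0
weights = visits .^ (1 / τ)
probs = weights ./ sum(weights)  # [0.1, 0.4, 0.5] when τ = 1
```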

It is typical to use a high value of the temperature parameter $τ$ during the first moves of a game to increase exploration, and then to switch to a small value. Therefore, `temperature` is an `AbstractSchedule`.
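
For example, a schedule of this kind might be written as follows (illustrative values; `PLSchedule` is documented on this page):

```
# Keep τ = 1 for the first 20 moves, then ramp down to 0.2 by move 30.
temperature = PLSchedule([0, 20, 30], [1.0, 1.0, 0.2])
```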

For information on the `cpuct`, `dirichlet_noise_ϵ`, `dirichlet_noise_α` and `prior_temperature` parameters, see `MCTS.Env`.

**AlphaGo Zero Parameters**

In the original AlphaGo Zero paper:

- The discount factor `gamma` is set to 1.
- The number of MCTS iterations per move is 1600, which corresponds to 0.4s of computation time.
- The temperature is set to 1 for the first 30 moves and then to an infinitesimal value.
- The $ϵ$ parameter for the Dirichlet noise is set to $0.25$ and the $α$ parameter to $0.03$, which is consistent with the heuristic of using $α = 10/n$ with $n$ the maximum number of possible moves, which is $19 × 19 + 1 = 362$ in the case of Go.

`AlphaZero.MctsPlayer`

— Type `MctsPlayer{MctsEnv} <: AbstractPlayer`

A player that selects actions using MCTS.

**Constructors**

`MctsPlayer(mcts::MCTS.Env; τ, niters, timeout=nothing)`

Construct a player from an MCTS environment. When computing each move:

- If `timeout` is provided, MCTS simulations are executed for `timeout` seconds, by groups of `niters`.
- Otherwise, `niters` MCTS simulations are run.

The temperature parameter `τ` can be either a real number or an `AbstractSchedule`.

```
MctsPlayer(game_spec::AbstractGameSpec, oracle,
           params::MctsParams; timeout=nothing)
```

Construct an MCTS player from an oracle and an `MctsParams` structure.
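
For illustration, assuming `gspec` is a game specification and `network` an oracle (both hypothetical bindings), a player might be built as:

```
params = MctsParams(
  num_iters_per_turn=400,
  dirichlet_noise_ϵ=0.25,
  dirichlet_noise_α=1.0)
player = MctsPlayer(gspec, network, params)
```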

`AlphaZero.MemAnalysisParams`

— Type `MemAnalysisParams`

Parameters governing the analysis of the memory buffer (for debugging and profiling purposes).

| Parameter | Type | Default |
|---|---|---|
| `num_game_stages` | `Int` | - |

**Explanation**

The memory analysis consists of partitioning the memory buffer into `num_game_stages` parts of equal size, according to the number of remaining moves until the end of the game for each sample. Then, the quality of the predictions of the current neural network is evaluated on each subset (see `Report.Memory`).

This is useful to get an idea of how the neural network performance varies depending on the game stage (typically, good value estimates for endgame board positions are available earlier in the training process than good values for middlegame positions).

`AlphaZero.MemoryBuffer`

— Type `MemoryBuffer(game_spec, size, experience=[])`

A circular buffer to hold memory samples.

`AlphaZero.NetworkPlayer`

— Type `NetworkPlayer{Net} <: AbstractPlayer`

A player that uses the policy output by a neural network directly, instead of relying on MCTS. The given neural network must be in test mode.

`AlphaZero.PLSchedule`

— Type `PLSchedule{R} <: AbstractSchedule{R}`

Type for piecewise linear schedules.

**Constructors**

`PLSchedule(cst)`

Return a schedule with a constant value `cst`.

`PLSchedule(xs, ys)`

Return a piecewise linear schedule such that:

- For all `i`, `(xs[i], ys[i])` belongs to the schedule's graph.
- Before `xs[1]`, the schedule has value `ys[1]`.
- After `xs[end]`, the schedule has value `ys[end]`.
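
A usage sketch:

```
s = PLSchedule([0, 10], [1.0, 0.5])
s[0]   # 1.0
s[5]   # 0.75 (linear interpolation between the two points)
s[20]  # 0.5 (constant after xs[end])
```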

`AlphaZero.Params`

— Type `Params`

The AlphaZero training hyperparameters.

| Parameter | Type | Default |
|---|---|---|
| `self_play` | `SelfPlayParams` | - |
| `learning` | `LearningParams` | - |
| `arena` | `Union{Nothing, ArenaParams}` | - |
| `memory_analysis` | `Union{Nothing, MemAnalysisParams}` | `nothing` |
| `num_iters` | `Int` | - |
| `use_symmetries` | `Bool` | `false` |
| `ternary_rewards` | `Bool` | `false` |
| `mem_buffer_size` | `PLSchedule{Int}` | - |

**Explanation**

The AlphaZero training process consists of `num_iters` iterations. Each iteration can be decomposed into a self-play phase (see `SelfPlayParams`) and a learning phase (see `LearningParams`).

- `ternary_rewards`: set to `true` if the rewards issued by the game environment always belong to $\{-1, 0, 1\}$, so that the logging and profiling tools can take advantage of this property.
- `use_symmetries`: if set to `true`, board symmetries are used for data augmentation before learning.
- `mem_buffer_size`: size schedule of the memory buffer, in terms of number of samples. It is typical to start with a small memory buffer that is grown progressively so as to wash out the initial low-quality self-play data more quickly.
- `memory_analysis`: parameters for the memory analysis step that is performed at each iteration (see `MemAnalysisParams`), or `nothing` if no analysis is to be performed.

**AlphaGo Zero Parameters**

In the original AlphaGo Zero paper:

- About 5 million games of self-play are played across 200 iterations.
- The memory buffer contains 500K games, which amounts to about 100M samples, as an average game of Go lasts about 200 turns.

`AlphaZero.PlayerWithTemperature`

— Type `PlayerWithTemperature{Player} <: AbstractPlayer`

A wrapper on a player that enables overwriting the temperature schedule.

`AlphaZero.RandomPlayer`

— Type `RandomPlayer <: AbstractPlayer`

A player that picks actions uniformly at random.

`AlphaZero.SamplesWeighingPolicy`

— Type `SamplesWeighingPolicy`

During self-play, early board positions are possibly encountered many times across several games. The corresponding samples can be merged together and given a weight $W$ that is a nondecreasing function of the number $n$ of merged samples:

- `CONSTANT_WEIGHT`: $W(n) = 1$
- `LOG_WEIGHT`: $W(n) = \log_2(n) + 1$
- `LINEAR_WEIGHT`: $W(n) = n$
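
The three policies correspond to the following weight functions of the merge count `n` (a plain restatement in code, not the library's internals):

```
constant_weight(n) = 1.0           # CONSTANT_WEIGHT
log_weight(n)      = log2(n) + 1   # LOG_WEIGHT
linear_weight(n)   = float(n)      # LINEAR_WEIGHT
```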

`AlphaZero.SelfPlayParams`

— Type `SelfPlayParams`

Parameters governing self-play.

| Parameter | Type | Default |
|---|---|---|
| `mcts` | `MctsParams` | - |
| `sim` | `SimParams` | - |

**AlphaGo Zero Parameters**

In the original AlphaGo Zero paper, `sim.num_games=25_000` (5 million games of self-play across 200 iterations).

`AlphaZero.SimParams`

— Type `SimParams`

Parameters for parallel game simulations.

These parameters are common to self-play data generation, neural network evaluation and benchmarking.

| Parameter | Type | Default |
|---|---|---|
| `num_games` | `Int` | - |
| `num_workers` | `Int` | - |
| `batch_size` | `Int` | - |
| `use_gpu` | `Bool` | `false` |
| `fill_batches` | `Bool` | `true` |
| `flip_probability` | `Float64` | `0.` |
| `reset_every` | `Union{Nothing, Int}` | `1` |
| `alternate_colors` | `Bool` | `false` |

**Explanations**

- On each machine (process), `num_workers` simulation tasks are spawned. Inference requests are processed by an inference server in batches of size `batch_size`. Note that we must have `batch_size <= num_workers`.
- If `fill_batches` is set to `true`, we make sure that batches sent to the neural network for inference have constant size.
- Both players are reset (e.g. their MCTS trees are emptied) every `reset_every` games (or never, if `nothing` is passed).
- To add randomization, the game board is "flipped" before every game turn according to a symmetric transformation, with probability `flip_probability`.
- In the case of (symmetric) two-player games, if `alternate_colors` is set to `true`, then the colors of both players are swapped between each simulated game.
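
An illustrative instantiation (keyword construction using the parameter names above; the values are arbitrary, not recommendations):

```
sim = SimParams(
  num_games=512,
  num_workers=128,
  batch_size=64,     # must not exceed num_workers
  use_gpu=true,
  reset_every=4,
  alternate_colors=true)
```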

`AlphaZero.Simulator`

— Type `Simulator(make_player, make_oracles, measure)`

A distributed simulator that encapsulates the details of running simulations across multiple threads and multiple machines.

**Arguments**

- `make_oracles`: a function that takes no argument and returns the oracles used by the player, which can be either `nothing`, a single oracle or a pair of oracles.
- `make_player`: a function that takes as an argument the result of `make_oracles` and builds a player from it. In practice, an oracle returned by `make_oracles` may be replaced by a `BatchedOracle` before it is passed to `make_player`, which is why these two functions are specified separately.
- `measure(trace, colors_flipped, player)`: the function that is used to take measurements after each game simulation.

`AlphaZero.StepSchedule`

— Type `StepSchedule{R} <: AbstractSchedule{R}`

Type for step function schedules.

**Constructor**

`StepSchedule(;start, change_at, values)`

Return a schedule that has initial value `start`. For all `i`, the schedule takes value `values[i]` at step `change_at[i]`.
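
A usage sketch:

```
s = StepSchedule(start=1.0, change_at=[10, 20], values=[0.5, 0.2])
s[5]   # 1.0
s[10]  # 0.5
s[25]  # 0.2
```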

`AlphaZero.Trace`

— Type `Trace{State}`

An object that collects all states visited during a game, along with the rewards obtained at each step and the successive player policies to be used as targets for the neural network.

**Constructor**

`Trace(initial_state)`

`AlphaZero.TrainingSample`

— Type `TrainingSample{State}`

Type of a training sample. A sample features the following fields:

- `s::State` is the state
- `π::Vector{Float64}` is the recorded MCTS policy for this position
- `z::Float64` is the discounted reward accumulated from state `s`
- `t::Float64` is the (average) number of moves remaining before the end of the game
- `n::Int` is the number of times the state `s` was recorded

As revealed by the last field `n`, several samples that correspond to the same state can be merged, in which case the `π`, `z` and `t` fields are averaged together.
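
As a sketch of what such merging can look like (illustrative code, not the library's implementation; it assumes the averages are weighted by each sample's count `n`):

```
# Merge two samples recorded for the same state `s`.
function merge_samples(a, b)
  n = a.n + b.n
  avg(x, y) = (a.n * x + b.n * y) / n
  (s=a.s, π=avg.(a.π, b.π), z=avg(a.z, b.z), t=avg(a.t, b.t), n=n)
end
```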

`AlphaZero.TwoPlayers`

— Type `TwoPlayers <: AbstractPlayer`

If `white` and `black` are two `AbstractPlayer`s, then `TwoPlayers(white, black)` is a player that behaves as `white` when `white` is to play and as `black` when `black` is to play.
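
A usage sketch, assuming `gspec` is a specification for a two-player game:

```
trace = play_game(gspec, TwoPlayers(RandomPlayer(), RandomPlayer()))
```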

`AlphaZero.AlphaZeroPlayer`

— Method `AlphaZeroPlayer(::Env; [timeout, mcts_params, use_gpu])`

Create an AlphaZero player from the current training environment.

Note that the returned player may be slow as it does not batch MCTS requests.

`AlphaZero.CyclicSchedule`

— Method `CyclicSchedule(base, mid, term; n, xmid=0.45, xback=0.90)`

Return the `PLSchedule` that is typically used for cyclic learning rate scheduling.

`AlphaZero.get_experience`

— Method `get_experience(env::Env)`

Return the content of the agent's memory as a vector of `TrainingSample`.

`AlphaZero.get_experience`

— Method `get_experience(::MemoryBuffer) :: Vector{<:TrainingSample}`

Return all samples in the memory buffer.

`AlphaZero.initial_report`

— Method `initial_report(env::Env)`

Return a report summarizing the configuration of the agent before training starts, as an object of type `Report.Initial`.

`AlphaZero.interactive!`

— Function

```
interactive!(game)
interactive!(gspec)
interactive!(game, player)
interactive!(gspec, player)
interactive!(game, white, black)
interactive!(gspec, white, black)
```

Launch a possibly interactive game session.

This function takes either an `AbstractGameSpec` or an `AbstractGameEnv` as an argument.
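
For example, to play against a uniformly random opponent (assuming `gspec` is a two-player game specification):

```
interactive!(gspec, Human(), RandomPlayer())
```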

`AlphaZero.necessary_samples`

— Method `necessary_samples(ϵ, β) = log(1 / β) / (2 * ϵ^2)`

Compute the number of times $N$ that a random variable $X \sim \text{Ber}(p)$ has to be sampled so that, if the empirical average of $X$ is greater than $1/2 + ϵ$, then $p > 1/2$ with probability at least $1-β$.

This bound is based on Hoeffding's inequality.
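
For example, detecting a 55% win rate ($ϵ = 0.05$) with 95% confidence ($β = 0.05$) requires about 600 games:

```
necessary_samples(0.05, 0.05)  # log(20) / 0.005 ≈ 599.1
```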

`AlphaZero.play_game`

— Method `play_game(gspec::AbstractGameSpec, player; flip_probability=0.) :: Trace`

Simulate a game by an `AbstractPlayer`.

- For two-player games, please use `TwoPlayers`.
- If the `flip_probability` argument is set to $p$, the board is *flipped* randomly at every turn with probability $p$, using `GI.apply_random_symmetry!`.

`AlphaZero.player_temperature`

— Method `player_temperature(::AbstractPlayer, game, turn_number)`

Return the player temperature, given the number of actions that have been played before by both players in the current game.

A default implementation is provided that always returns 1.

`AlphaZero.push_trace!`

— Method `push_trace!(mem::MemoryBuffer, trace::Trace, gamma)`

Collect samples out of a game trace and add them to the memory buffer.

Here, `gamma` is the reward discount factor.

`AlphaZero.record_trace`

— Method `record_trace`

A measurement function to be passed to a `Simulator` that produces named tuples with two fields: `trace::Trace` and `colors_flipped::Bool`.

`AlphaZero.reset_player!`

— Method `reset_player!(::AbstractPlayer)`

Reset the internal memory of a player (e.g. the MCTS tree). The default implementation does nothing.

`AlphaZero.select_move`

— Method `select_move(player::AbstractPlayer, game, turn_number)`

Return a single action. A default implementation is provided that samples an action according to the distribution computed by `think`, with a temperature given by `player_temperature`.

`AlphaZero.simulate`

— Method `simulate(::Simulator, ::AbstractGameSpec, ::SimParams; <kwargs>)`

Play a series of games using a given `Simulator`.

**Keyword Arguments**

`game_simulated` is called every time a game simulation is completed (with no arguments).

**Return**

Return a vector of objects computed by `simulator.measure`.

`AlphaZero.simulate_distributed`

— Method `simulate_distributed(::Simulator, ::AbstractGameSpec, ::SimParams; <kwargs>)`

Identical to `simulate`, but splits the work across all available distributed workers, whose number is given by `Distributed.nworkers()`.

`AlphaZero.think`

— Function `think(::AbstractPlayer, game)`

Return a probability distribution over available actions as an `(actions, π)` pair.

`AlphaZero.train!`

— Function `train!(env::Env, handler=nothing)`

Start or resume the training of an AlphaZero agent.

A `handler` object can be passed that implements a subset of the callback functions defined in `Handlers`.

`Base.push!`

— Method `Base.push!(t::Trace, π, r, s)`

Add a (target policy, reward, new state) triple to a trace.