A combinatorial instance, also called an environment in reinforcement learning. An instance combines a combinatorial structure (i.e. which sets of arms the bandit is allowed to play in the environment) with the statistical properties of the arms (i.e. what reward the bandit gets).

Two properties are mandatory:

  • n_arms: the number of arms that are available in the instance (i.e. the dimension of the problem)
  • optimal_average_reward: the best reward that can be obtained at any round, in expectation
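As an illustration, the two mandatory properties can be sketched with a toy instance type. This is a hypothetical stand-in, not the package's concrete type: the struct name `BernoulliMABInstance` and the `means` field are assumptions for the example.

```julia
# Hypothetical sketch of an instance exposing the two mandatory properties.
# The concrete types in the package may differ.
struct BernoulliMABInstance
    n_arms::Int              # dimension of the problem
    means::Vector{Float64}   # expected reward of each arm
end

# Best expected reward obtainable in a single round: in this toy instance,
# one arm is played per round, so it is the largest mean.
optimal_average_reward(inst::BernoulliMABInstance) = maximum(inst.means)

inst = BernoulliMABInstance(3, [0.2, 0.8, 0.5])
```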

Memorise the (global) evolution of the bandit in the environment. This structure mainly contains the information that most bandit algorithms require to decide their next step.

  • round: the number of rounds (time steps) the bandit has already played
  • regret: the total regret the bandit faced (i.e. how much more reward it would have got if it always played the optimum solution)
  • reward: the total reward the bandit managed to gather
  • arm_counts: the number of times each arm (and not each solution, which is a set of arms) has been played. This information is used by most bandit algorithms
  • arm_reward: the total reward obtained by each arm
  • arm_average_reward: the average reward obtained by each arm. This information is used by most bandit algorithms
  • policy_extension: if a policy needs more information than this structure contains, it can store, in a key that uniquely identifies the policy, anything it might require
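The bookkeeping described by these fields can be sketched as an incremental update performed after each round. This is a minimal, hypothetical reimplementation (the type `SimpleState` and function `update!` are not part of the package) whose field names mirror the list above:

```julia
# Hypothetical sketch: update per-round and per-arm statistics after playing
# a set of arms. Field names follow the documentation above.
mutable struct SimpleState
    round::Int
    regret::Float64
    reward::Float64
    arm_counts::Vector{Int}
    arm_reward::Vector{Float64}
    arm_average_reward::Vector{Float64}
end

SimpleState(n_arms::Int) =
    SimpleState(0, 0.0, 0.0, zeros(Int, n_arms), zeros(n_arms), zeros(n_arms))

function update!(s::SimpleState, arms::Vector{Int}, rewards::Vector{Float64},
                 optimal_reward::Float64)
    s.round += 1
    s.reward += sum(rewards)
    s.regret += optimal_reward - sum(rewards)   # gap to the optimum solution
    for (arm, r) in zip(arms, rewards)
        s.arm_counts[arm] += 1
        s.arm_reward[arm] += r
        s.arm_average_reward[arm] = s.arm_reward[arm] / s.arm_counts[arm]
    end
    return s
end

s = SimpleState(2)
update!(s, [1, 2], [0.5, 1.0], 2.0)
```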

Memorise all the details about the evolution of the bandit in the environment. This object can therefore become very heavy for long episodes.

The information is stored in several vectors, all of the same length, equal to the number of rounds the bandit has played:

  • states: a list of states the bandit entered into over the episode
  • arms: a list of sets of arms that have been played over the episode
  • reward: a list of rewards the policy received over the episode
  • policy_details: a list of policy-defined details over the episode (typically, computation times)
  • time_choose_action: a list of times (in milliseconds) taken to choose an action at each round over the episode
push!(trace::Trace{T}, state::State{T}, arms::Vector{T}, reward::Vector{Float64}, policy_details::PolicyDetails,
      time_choose_action::Int) where T

Appends the arguments to the execution trace of the bandit algorithm. More specifically, trace's data structures are updated to also include state, arms, reward, policy_details, and time_choose_action (expressed in milliseconds).

All of these arguments are copied, except policy_details. (Indeed, the usual scenario is to keep updating the state, the chosen solution (set of arms), and the reward objects, but to build the details object from the ground up at each round.)
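This copy-except-details behaviour can be sketched with a toy trace. The names `MiniTrace` and `record!` are hypothetical stand-ins for the package's `Trace` and `push!`; the point is that mutating the caller's objects after the call must not alter the trace, except for the details object:

```julia
# Hypothetical sketch of the appending behaviour: every argument is copied
# into the trace's vectors, except the policy details, which are stored as-is.
mutable struct MiniTrace
    states::Vector{Any}
    arms::Vector{Vector{Int}}
    reward::Vector{Vector{Float64}}
    policy_details::Vector{Any}
    time_choose_action::Vector{Int}
end

MiniTrace() = MiniTrace(Any[], Vector{Int}[], Vector{Float64}[], Any[], Int[])

function record!(t::MiniTrace, state, arms::Vector{Int},
                 reward::Vector{Float64}, details, time_ms::Int)
    push!(t.states, deepcopy(state))    # copied
    push!(t.arms, copy(arms))           # copied
    push!(t.reward, copy(reward))       # copied
    push!(t.policy_details, details)    # NOT copied, per the documentation
    push!(t.time_choose_action, time_ms)
    return t
end

t = MiniTrace()
chosen = [1, 3]
record!(t, nothing, chosen, [0.4, 0.6], Dict(:iters => 5), 12)
push!(chosen, 7)   # mutating the caller's vector does not affect the trace
```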


Returns a list of arm indices for the given reward structure. For instance, for a vector of arms, it returns the list of indices into that vector: each index is associated with an arm.

all_arm_indices(instance::CombinatorialInstance{T}) where T

Returns a list of arm indices for the given combinatorial instance.
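For the vector case mentioned above, the behaviour amounts to enumerating the valid indices. A minimal stand-alone sketch (the name `all_arm_indices_demo` is hypothetical, not the package function):

```julia
# Hypothetical sketch: for arms stored in a plain vector, the arm indices
# are simply the indices 1:n into that vector.
all_arm_indices_demo(arms::Vector) = collect(eachindex(arms))

all_arm_indices_demo(["a", "b", "c"])
```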

is_feasible(instance::CombinatorialInstance{T}, arms::Vector{T})

Returns whether the set of arms is a solution that can be played for the given combinatorial instance.

is_partially_acceptable(instance::CombinatorialInstance{T}, arms::Vector{T})

Returns whether the set of arms is either a solution or a subset of a solution that can be played for the given combinatorial instance. In some cases, a subset of a solution is not itself a solution; the distinction between the two cases is made by calling is_feasible.

Typically, is_partially_acceptable returns true for an empty set of arms: even if this is not an acceptable solution, adding elements to the empty set may yield a perfectly acceptable solution.
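The distinction between the two predicates can be illustrated with a toy structure where a solution must contain exactly m distinct arms: any smaller subset is partially acceptable (it can still be extended into a solution), but only full m-sets are feasible. The types and function names here (`MSetInstance`, `is_feasible_demo`, `is_partially_acceptable_demo`) are hypothetical stand-ins:

```julia
# Hypothetical toy structure: a solution is any set of exactly m distinct
# arms drawn from {1, ..., n_arms}.
struct MSetInstance
    n_arms::Int
    m::Int
end

is_feasible_demo(inst::MSetInstance, arms::Vector{Int}) =
    length(unique(arms)) == inst.m && all(1 .<= arms .<= inst.n_arms)

# A subset of a solution is acceptable as long as it is not too large;
# in particular, the empty set is partially acceptable.
is_partially_acceptable_demo(inst::MSetInstance, arms::Vector{Int}) =
    length(unique(arms)) <= inst.m && all(1 .<= arms .<= inst.n_arms)

inst = MSetInstance(5, 3)
```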


function optimise_linear_sqrtlinear(instance::CombinatorialInstance{T}, algo::ESCB2OptimisationAlgorithm, linear::Dict{T, Float64}, sqrtlinear::Dict{T, Float64}, sqrtlinear_weight::Float64, bandit_round::Int; with_trace::Bool=false) where T

Optimise ESCB2's objective function, with a linear term (with coefficients in linear) and a square root (with coefficients in sqrtlinear):

max  linear^T x + sqrtlinear_weight * sqrt(sqrtlinear^T x)
s.t. x belongs to the combinatorial set defined by instance

Several implementations are provided, and can be chosen with the appropriate subtype of ESCB2OptimisationAlgorithm.

Returns one (approximately) optimum solution. The exact guarantees depend on the chosen algorithm. If with_trace=true, a second value is returned with comprehensive details of the behaviour of the algorithm.
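The objective above can be made concrete with a brute-force sketch that enumerates every feasible subset of a tiny ground set and keeps the best value of linear'x + w * sqrt(sqrtlinear'x). This is only an illustration of the objective, not one of the package's optimisation algorithms, which avoid exhaustive enumeration; the function name `escb2_brute_force` is hypothetical:

```julia
# Hypothetical brute force: enumerate all subsets of a small ground set via
# bitmasks, keep the feasible subset maximising the ESCB2-style objective.
function escb2_brute_force(arms::Vector{Int}, feasible::Function,
                           linear::Dict{Int,Float64},
                           sqrtlinear::Dict{Int,Float64}, w::Float64)
    best_val, best_sol = -Inf, Int[]
    for mask in 0:(2^length(arms) - 1)
        sol = [arms[i] for i in 1:length(arms) if (mask >> (i - 1)) & 1 == 1]
        feasible(sol) || continue
        val = sum(get(linear, a, 0.0) for a in sol; init=0.0) +
              w * sqrt(sum(get(sqrtlinear, a, 0.0) for a in sol; init=0.0))
        if val > best_val
            best_val, best_sol = val, sol
        end
    end
    return best_sol, best_val
end

# Toy combinatorial set: all subsets of exactly two arms.
lin = Dict(1 => 1.0, 2 => 0.5, 3 => 0.2)
sq  = Dict(1 => 0.1, 2 => 0.9, 3 => 0.1)
sol, val = escb2_brute_force([1, 2, 3], s -> length(s) == 2, lin, sq, 1.0)
```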

pull(instance::CombinatorialInstance{T}, arms::Vector{T}) where T

For the given bandit, plays the given solution (i.e. set of arms). This function returns both the reward and the regret this action caused (zero if the action is optimum, greater than zero otherwise).
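What pull computes can be sketched with a deterministic toy instance: the reward of the played arms is summed, and the regret is the gap to the best achievable expected reward (zero when the action is optimal). The real pull draws stochastic rewards; the names `DetInstance` and `pull_demo` are hypothetical stand-ins used only for this illustration:

```julia
# Hypothetical deterministic sketch of pull: reward is the sum over played
# arms; regret is the gap to the optimal average reward.
struct DetInstance
    means::Vector{Float64}            # deterministic reward per arm
    optimal_average_reward::Float64   # best single-round expected reward
end

function pull_demo(inst::DetInstance, arms::Vector{Int})
    reward = sum(inst.means[a] for a in arms)
    regret = inst.optimal_average_reward - reward
    return reward, regret
end

inst = DetInstance([0.2, 0.8, 0.5], 1.3)   # optimum: play arms 2 and 3
```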