CombinatorialBandits.CombinatorialInstance — Type

`CombinatorialInstance{T}`

A combinatorial instance, also called an environment in reinforcement learning. An instance is the combination of a combinatorial structure (i.e. what the bandit is allowed to play in the environment) and of the statistical properties of the arms (i.e. what reward the bandit gets).

Two properties are mandatory:

- `n_arms`: the number of arms available in the instance (i.e. the dimension of the problem)
- `optimal_average_reward`: the best reward that can be obtained at any round, in expectation
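As a rough illustration, the sketch below reads these two properties from an existing instance, assuming they are exposed as fields of the same name; how the instance itself is constructed depends on the concrete combinatorial structure and is not covered on this page.

```julia
using CombinatorialBandits

# Minimal sketch: summarise an existing CombinatorialInstance. The two
# properties read here are the mandatory ones listed above; `instance`
# is assumed to have been built elsewhere.
function describe(instance::CombinatorialInstance)
    println("Number of arms:                 ", instance.n_arms)
    println("Best expected reward per round: ", instance.optimal_average_reward)
end
```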
CombinatorialBandits.State — Type

`State{T}`

Memorises the (global) evolution of the bandit in the environment. This structure mostly contains the information required by most bandit algorithms to decide their next step.

- `round`: the number of rounds (time steps) the bandit has already played
- `regret`: the total regret the bandit has incurred (i.e. how much more reward it would have obtained had it always played the optimum solution)
- `reward`: the total reward the bandit has gathered
- `arm_counts`: the number of times each arm (not each solution, which is a set of arms) has been played. This information is used by most bandit algorithms
- `arm_reward`: the total reward obtained by each arm
- `arm_average_reward`: the average reward obtained by each arm. This information is used by most bandit algorithms
- `policy_extension`: if a policy needs more information than this structure contains, it can store anything it requires under a key that uniquely identifies the policy
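As a hedged sketch of how these fields are typically consumed, the snippet below computes a UCB-style index from the per-arm statistics. It is not one of the package's policies, only an illustration of the field names listed above.

```julia
# Sketch: a UCB-like index computed from the per-arm statistics of a
# State (illustrative only; the container types behind arm_counts and
# arm_average_reward are an implementation detail of the package).
function ucb_like_index(state::State, arm)
    n = state.arm_counts[arm]              # plays of this single arm so far
    n == 0 && return Inf                   # force exploration of unplayed arms
    μ = state.arm_average_reward[arm]      # empirical mean reward of the arm
    return μ + sqrt(2 * log(max(state.round, 2)) / n)
end
```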
CombinatorialBandits.Trace — Type

`Trace{T}`

Memorises all the details about the evolution of the bandit in the environment. This object can therefore become very heavy for long episodes.

The information is stored in several vectors, all of the same length, equal to the number of rounds the bandit has been used for:

- `states`: the states the bandit entered over the episode
- `arms`: the sets of arms that have been played over the episode
- `reward`: the rewards the policy received over the episode
- `policy_details`: the policy-defined details over the episode (typically, computation times)
- `time_choose_action`: the times (in milliseconds) taken to make a decision at each round of the episode
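For instance, a trace can be post-processed after an episode. The sketch below relies only on the vectors described above and on the `State` fields documented earlier.

```julia
# Sketch: summarise an episode from its Trace. The regret and reward
# are read from the last recorded State; decision times are averaged
# over the episode.
function summarise(trace::Trace)
    rounds = length(trace.states)
    rounds == 0 && return println("Empty trace.")
    println("Rounds played:         ", rounds)
    println("Total regret:          ", trace.states[end].regret)
    println("Total reward:          ", trace.states[end].reward)
    println("Average decision time: ", sum(trace.time_choose_action) / rounds, " ms")
end
```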
Base.push! — Method

`push!(trace::Trace{T}, state::State{T}, arms::Vector{T}, reward::Vector{Float64}, policy_details::PolicyDetails, time_choose_action::Int) where T`

Appends the arguments to the execution trace of the bandit algorithm. More specifically, `trace`'s data structures are updated to also include `state`, `arms`, `reward`, `policy_details`, and `time_choose_action` (expressed in milliseconds).

All of these arguments are copied, except `policy_details`. (Indeed, the usual scenario is to keep updating the state, the chosen solution (set of arms), and the reward objects, but to build the details object from the ground up at each round.)
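A typical call, at the end of a round, might look like the following sketch; `chosen_arms`, `observed_reward`, `details`, and `elapsed_ms` are placeholders for values produced earlier in the round.

```julia
# Sketch: record one round in the trace. All names after `trace` and
# `state` are placeholders produced by the policy and the environment.
push!(trace, state, chosen_arms, observed_reward, details, elapsed_ms)
```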
CombinatorialBandits.Kombinator.dimension — Method

`dimension(instance::CombinatorialInstance{T}) where T`

Returns the dimension of a solution for the given combinatorial instance.
CombinatorialBandits.all_arm_indices — Function

`all_arm_indices(reward)`

Returns a list of arm indices for the given reward object. For instance, for a vector of arms, it returns the list of indices into that vector: each index is associated with an arm.

`all_arm_indices(instance::CombinatorialInstance{T}) where T`

Returns a list of arm indices for the given combinatorial instance.
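A small sketch, assuming `instance` is an existing `CombinatorialInstance`:

```julia
# Sketch: arm indices and solution dimension of an existing instance.
indices = all_arm_indices(instance)   # one index per arm of the instance
d = dimension(instance)               # dimension of a solution
```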
CombinatorialBandits.initial_state — Method

`initial_state(instance::CombinatorialInstance{Int})`

Returns a new empty `State` object for the given combinatorial instance.
CombinatorialBandits.initial_trace — Method

`initial_trace(instance::CombinatorialInstance{Int})`

Returns a new empty `Trace` object for the given combinatorial instance.
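A fresh episode typically starts from both empty objects, as in this sketch (assuming `instance` already exists):

```julia
# Sketch: start of an episode with an empty state and an empty trace.
state = initial_state(instance)
trace = initial_trace(instance)
```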
CombinatorialBandits.is_feasible — Method

`is_feasible(instance::CombinatorialInstance{T}, arms::Vector{T})`

Returns whether the given set of arms is a solution that can be played for the given combinatorial instance.
CombinatorialBandits.is_partially_acceptable — Method

`is_partially_acceptable(instance::CombinatorialInstance{T}, arms::Vector{T})`

Returns whether the given set of arms is either a solution or a subset of a solution that can be played for the given combinatorial instance. In some cases, a subset of a solution is not itself a solution: the distinction between the two states is made by calling `is_feasible`.

Typically, `is_partially_acceptable` returns `true` for an empty set of arms: even though the empty set is not an acceptable solution, adding elements to it may yield a perfectly acceptable solution.
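The sketch below shows one way the two functions can be combined: grow a candidate solution greedily, keeping only extensions that remain partially acceptable, then check feasibility at the end. The greedy procedure itself is only an illustration, not part of the package.

```julia
# Sketch: greedily extend a set of arms, only keeping extensions that
# may still lead to a feasible solution, then check the final result.
function greedy_extend(instance::CombinatorialInstance{T}, candidates::Vector{T}) where T
    solution = T[]                        # the empty set is partially acceptable
    for arm in candidates
        extended = vcat(solution, arm)
        if is_partially_acceptable(instance, extended)
            solution = extended
        end
    end
    return is_feasible(instance, solution) ? solution : nothing
end
```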
CombinatorialBandits.optimise_linear_sqrtlinear — Method

`optimise_linear_sqrtlinear(instance::CombinatorialInstance{T}, algo::ESCB2OptimisationAlgorithm, linear::Dict{T, Float64}, sqrtlinear::Dict{T, Float64}, sqrtlinear_weight::Float64, bandit_round::Int; with_trace::Bool=false) where T`

Optimises ESCB2's objective function, which has a linear term (with coefficients in `linear`) and a square-root term (with coefficients in `sqrtlinear`):

max linear^T x + sqrtlinear_weight * sqrt(sqrtlinear^T x)
s.t. x belongs to the combinatorial set defined by `instance`

Several implementations are provided; they can be chosen with the appropriate subtype of `ESCB2OptimisationAlgorithm`.

Returns one (approximately) optimum solution. The exact guarantees depend on the chosen algorithm. If `with_trace=true`, a second value is returned with comprehensive details of the behaviour of the algorithm.
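A call might look like the following sketch, where `algo`, `linear`, `sqrtlinear`, and `round_number` are placeholders for values built by the ESCB2 policy earlier in the round.

```julia
# Sketch: solve ESCB2's subproblem for the current round. The concrete
# ESCB2OptimisationAlgorithm subtype, the coefficient dictionaries and
# the round number are assumed to have been built beforehand.
solution, details = optimise_linear_sqrtlinear(
    instance, algo, linear, sqrtlinear, 2.0, round_number;
    with_trace=true)
```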
CombinatorialBandits.pull — Method

`pull(instance::CombinatorialInstance{T}, arms::Vector{T}) where T`

For the given bandit instance, plays the given solution (i.e. set of arms). This function returns both the reward and the regret this action incurred (zero if the action is optimum, greater than zero otherwise).
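Putting the pieces together, a bare-bones interaction loop using only the functions documented on this page could look like the sketch below; `choose_arms` stands for whatever policy selects a solution from the current state and is not part of this page.

```julia
# Sketch: minimal interaction loop. The policy call and the state
# update are placeholders; only pull, initial_state and initial_trace
# are taken from this page.
state = initial_state(instance)
trace = initial_trace(instance)
for t in 1:100
    arms = choose_arms(instance, state)     # hypothetical policy call
    reward, regret = pull(instance, arms)   # play the solution, observe reward and regret
    # ... update `state` and push the round into `trace`, as shown for push! above
end
```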