MCMC Estimation

Sampler

BayesFlux.MCMCState (Type)
abstract type MCMCState end

Every MCMC method must be implemented via an MCMCState, which keeps track of all information the sampler needs across iterations.

Mandatory Fields

  • samples::Matrix of dimension (number of total parameters) × num_samples
  • nsampled: the number of samples drawn so far

Mandatory Functions

  • update!(sampler, θ, bnn, ∇θ) where θ is the current parameter vector and ∇θ(θ) is a function returning the gradient at θ. The function must return the updated θ and the number of samples drawn so far.
  • initialise!(sampler, θ, numsamples; continue_sampling) which initialises the sampler. If continue_sampling is true, the final goal is to obtain numsamples samples in total, so only the remaining ones still need to be drawn.
  • calculate_epochs(sampler, numbatches, numsamples; continue_sampling) which returns the number of epochs that must be run in order to obtain numsamples samples when each epoch consists of numbatches batches. If continue_sampling is true, the goal is to obtain numsamples samples in total, so only the number of epochs still needed to reach that total must be returned, NOT the number of epochs needed to draw numsamples new samples. A minimal sketch of this interface follows below.
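
For illustration, here is a minimal sketch of a custom sampler implementing this interface: a plain Gaussian random walk that ignores the gradient. The name MyRWSampler and all internals are hypothetical; only the fields and functions named above belong to the interface.

using Random

# Hypothetical sampler: a Gaussian random walk, for illustration only.
mutable struct MyRWSampler{T} <: MCMCState
    samples::Matrix{T}   # (number of total parameters) × num_samples
    nsampled::Int        # number of samples drawn so far
    stepsize::T
end

MyRWSampler(stepsize::T = 0.01f0) where {T} = MyRWSampler(Matrix{T}(undef, 0, 0), 0, stepsize)

function initialise!(s::MyRWSampler{T}, θ::AbstractVector{T}, numsamples::Int;
                     continue_sampling = false) where {T}
    if continue_sampling
        # Grow the sample matrix so that numsamples columns exist in total.
        extra = numsamples - s.nsampled
        s.samples = hcat(s.samples[:, 1:s.nsampled], Matrix{T}(undef, length(θ), extra))
    else
        s.samples = Matrix{T}(undef, length(θ), numsamples)
        s.nsampled = 0
    end
    return nothing
end

function calculate_epochs(s::MyRWSampler, numbatches, numsamples; continue_sampling = false)
    # One sample per batch; only count epochs still needed to reach the total.
    remaining = continue_sampling ? numsamples - s.nsampled : numsamples
    return ceil(Int, remaining / numbatches)
end

function update!(s::MyRWSampler{T}, θ::AbstractVector{T}, bnn, ∇θ) where {T}
    θ = θ .+ s.stepsize .* randn(T, length(θ))   # random-walk step, always accepted
    s.nsampled += 1
    s.samples[:, s.nsampled] = θ
    return θ, s.nsampled
end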
BayesFlux.mcmc (Function)
mcmc(args...; kwargs...)

Sample from a BNN using MCMC

Arguments

  • bnn: a Bayesian Neural Network
  • batchsize: the size of each minibatch
  • numsamples: number of samples to take
  • sampler: sampler to use

Keyword Arguments

  • shuffle::Bool=true: should data be shuffled after each epoch such that batches are different in each epoch?
  • partial::Bool=true: are partial batches allowed? If true, some batches might be smaller than batchsize
  • showprogress::Bool=true: should a progress bar be shown?
  • continue_sampling::Bool=false: If true and numsamples is larger than sampler.nsampled, then additional samples will be taken until numsamples samples exist in total
  • θstart::AbstractVector{T}=vcat(bnn.init()...): starting parameter vector
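
A typical call might look like the following sketch. It assumes a BNN bnn constructed elsewhere and uses SGLD with its default constructor arguments; the exact defaults may differ.

using BayesFlux

# Assumes `bnn` is a Bayesian Neural Network built elsewhere with BayesFlux.
sampler = SGLD()                          # any MCMCState implementation works here
ch = mcmc(bnn, 32, 10_000, sampler)       # batchsize 32, 10_000 samples

# Later: extend the same chain to 20_000 samples in total.
ch = mcmc(bnn, 32, 20_000, sampler; continue_sampling = true)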
BayesFlux.SGLD (Type)

Stochastic Gradient Langevin Dynamics as proposed in

Welling, M., & Teh, Y. W. (2011). Bayesian learning via stochastic gradient Langevin dynamics. In Proceedings of the 28th International Conference on Machine Learning (ICML), 681-688.

Fields

  • θ::AbstractVector: Current sample
  • samples::Matrix: Matrix of samples. Not all columns will be actual samples if sampling was stopped early. See nsampled for the actual number of samples taken.
  • nsampled::Int: Number of samples taken
  • min_stepsize::T: Stop decreasing the stepsize when it is below this value.
  • didinform::Bool: Flag keeping track of whether the user was informed that min_stepsize was reached.
  • stepsize_a::T: See the stepsize schedule sketched below.
  • stepsize_b::T: See the stepsize schedule sketched below.
  • stepsize_γ::T: See the stepsize schedule sketched below.
  • maxnorm::T: Maximum gradient norm. Gradients are clipped if their norm exceeds this value.
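
The three stepsize fields presumably parameterise the usual Welling-Teh decreasing schedule ε_t = a(b + t)^(-γ). A sketch of a single SGLD step under that schedule, with names taken from the field list; the sign convention assumes g is a stochastic gradient of the log-posterior:

# One SGLD step, illustrating stepsize_a, stepsize_b, stepsize_γ and min_stepsize.
function sgld_step(θ, g, t; a = 0.1, b = 1.0, γ = 0.55, min_stepsize = 1e-6)
    ϵ = max(a * (b + t)^(-γ), min_stepsize)          # decreasing stepsize, floored
    return θ .+ (ϵ / 2) .* g .+ sqrt(ϵ) .* randn(length(θ))
end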
BayesFlux.SGNHTS (Type)

Stochastic Gradient Nosé-Hoover Thermostat as proposed in

Leimkuhler, B., & Shang, X. (2016). Adaptive thermostats for noisy gradient systems. SIAM Journal on Scientific Computing, 38(2), A712-A736.

This is similar to SGNHT as proposed in Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., & Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. Advances in neural information processing systems, 27.

Fields

  • samples::Matrix: Matrix containing the samples.
  • nsampled::Int: Number of samples taken so far. Can be smaller than size(samples, 2) if sampling was interrupted.
  • p::AbstractVector: Momentum
  • xi::Number: Thermostat
  • l::Number: Stepsize. This is often in the 0.001-0.1 range.
  • σA::Number: Diffusion factor. If the stepsize is small, this should be larger than 1.
  • μ::Number: Free parameter in thermostat. Defaults to 1.
  • t::Int: Current step count
  • kinetic::Vector: Keeps track of the kinetic energy. The goal of SGNHT is to keep its average close to one.
BayesFlux.SGNHT (Type)

Stochastic Gradient Nosé-Hoover Thermostat as proposed in

Ding, N., Fang, Y., Babbush, R., Chen, C., Skeel, R. D., & Neven, H. (2014). Bayesian sampling using stochastic gradient thermostats. Advances in neural information processing systems, 27.

Fields

  • samples::Matrix: Matrix containing the samples.
  • nsampled::Int: Number of samples taken so far. Can be smaller than size(samples, 2) if sampling was interrupted.
  • p::AbstractVector: Momentum
  • xi::Number: Thermostat
  • l::Number: Stepsize
  • A::Number: Diffusion factor
  • t::Int: Current step count
  • kinetic::Vector: Keeps track of the kinetic energy. The goal of SGNHT is to keep its average close to one (see the sketch below).
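
For intuition, one step of the Ding et al. (2014) dynamics with stepsize h and diffusion factor A reads roughly as follows; the thermostat variable ξ pulls the average kinetic energy p'p/n towards 1. This is an illustrative sketch, not the package's internal code, which may differ in scaling details.

using LinearAlgebra

# One SGNHT step (Ding et al., 2014), illustrative only.
# g is a stochastic gradient of the negative log-posterior at θ.
function sgnht_step(θ, p, ξ, g, h, A)
    n = length(θ)
    p = p .- ξ .* h .* p .- h .* g .+ sqrt(2A * h) .* randn(n)
    θ = θ .+ h .* p
    kinetic = dot(p, p) / n
    ξ = ξ + h * (kinetic - 1)        # thermostat: drives average kinetic energy to 1
    return θ, p, ξ, kinetic
end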
BayesFlux.GGMC (Type)

Gradient Guided Monte Carlo

Proposed in Garriga-Alonso, A., & Fortuin, V. (2021). Exact Langevin dynamics with stochastic gradients. arXiv preprint arXiv:2102.01691.

Fields

  • samples::Matrix: Matrix containing the samples. If sampling stopped early, then not all columns will actually correspond to samples. See nsampled to check how many samples were actually taken
  • nsampled::Int: Number of samples taken.
  • t::Int: Total number of steps taken.
  • accepted::Vector{Bool}: If true, sample was accepted; If false, proposed sample was rejected and previous sample was taken.
  • β::T: See paper.
  • l::T: Step-length; See paper.
  • sadapter::StepsizeAdapter: A StepsizeAdapter. Default is DualAveragingStepSize
  • M::AbstractMatrix: Mass Matrix
  • Mhalf::AbstractMatrix: Lower triangular Cholesky factor of M.
  • Minv::AbstractMatrix: Inverse mass matrix.
  • madapter::MassAdapter: A MassAdapter
  • momentum::AbstractVector: Last momentum vector
  • lMH::T: Log of the Metropolis-Hastings ratio, accumulated over steps.
  • steps::Int: Number of steps to take before calculating the MH ratio (see the block-acceptance sketch below).
  • current_step::Int: Current position in the repeating sequence 1, ..., steps.
  • maxnorm::T: Maximum gradient norm. Gradients are clipped if their norm exceeds this value.
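
The steps, current_step, and lMH fields implement a block Metropolis-Hastings correction: the log-ratio is accumulated over steps moves and a single accept/reject decision is made at the end of the block. The following toy shows only that accept/reject pattern with a symmetric random-walk proposal; GGMC's actual proposal is gradient-guided with momentum, and its log-ratio includes the corresponding kinetic terms.

using Random

# Toy illustration of block acceptance: accumulate the log-MH ratio over
# `steps` moves, then accept or reject the whole block at once.
function block_mh(θ, logp, propose, steps)
    θold, lMH = copy(θ), 0.0
    for _ in 1:steps
        θnew = propose(θ)
        lMH += logp(θnew) - logp(θ)   # symmetric proposal assumed
        θ = θnew
    end
    return log(rand()) < lMH ? θ : θold
end

# Example: standard normal target, Gaussian random-walk proposal.
logp(θ) = -sum(abs2, θ) / 2
propose(θ) = θ .+ 0.1 .* randn(length(θ))
θ = block_mh(randn(3), logp, propose, 3)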
BayesFlux.HMC (Type)

Standard Hamiltonian Monte Carlo (Hybrid Monte Carlo).

Allows for the use of stochastic gradients, but the validity of doing so is not clear.

This is motivated by parts of the discussion in Neal, R. M. (1996). Bayesian Learning for Neural Networks (Vol. 118). Springer New York. https://doi.org/10.1007/978-1-4612-0745-0

Code was partially adapted from https://colindcarroll.com/2019/04/11/hamiltonian-monte-carlo-from-scratch/

Fields

  • samples::Matrix: Samples taken
  • nsampled::Int: Number of samples taken. Might be smaller than size(samples, 2) if sampling was interrupted.
  • θold::AbstractVector: Old sample. Kept for rejection step.
  • momentum::AbstractVector: Momentum variables
  • momentumold::AbstractVector: Old momentum variables. Kept for rejection step.
  • t::Int: Current step.
  • path_len::Int: Number of leapfrog steps (see the leapfrog sketch below).
  • current_step::Int: Current leapfrog step.
  • accepted::Vector{Bool}: Whether a draw in samples was an accepted draw or a rejected one (in which case it equals the previous draw).
  • sadapter::StepsizeAdapter: Stepsize adapter giving the stepsize in each iteration.
  • l: Stepsize.
  • madapter::MassAdapter: Mass matrix adapter giving the inverse mass matrix in each iteration.
  • Minv::AbstractMatrix: Inverse mass matrix
  • maxnorm::T: Maximum gradient norm. Gradients are clipped if their norm exceeds this value.
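
A standard leapfrog integrator with path_len steps, stepsize l, and inverse mass matrix Minv looks roughly like the following sketch, in the spirit of the blog post linked above; this is not the package's internal function, and ∇U denotes the gradient of the negative log-posterior.

using LinearAlgebra

# Leapfrog integration of Hamiltonian dynamics, illustrative only.
function leapfrog(θ, p, ∇U, l, path_len, Minv)
    p = p .- (l / 2) .* ∇U(θ)            # initial half-step for momentum
    for _ in 1:(path_len - 1)
        θ = θ .+ l .* (Minv * p)         # full step for position
        p = p .- l .* ∇U(θ)              # full step for momentum
    end
    θ = θ .+ l .* (Minv * p)
    p = p .- (l / 2) .* ∇U(θ)            # final half-step for momentum
    return θ, -p                          # negate momentum to keep the proposal reversible
end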
BayesFlux.AdaptiveMH (Type)

Adaptive Metropolis Hastings as introduced in

Haario, H., Saksman, E., & Tamminen, J. (2001). An adaptive Metropolis algorithm. Bernoulli, 223-242.

Fields

  • samples::Matrix: Matrix holding the samples. If sampling was stopped early, not all columns will represent samples. To figure out how many columns represent samples, see nsampled.
  • nsampled::Int: Number of samples obtained.
  • C0::Matrix: Initial covariance matrix.
  • Ct::Matrix: Covariance matrix in iteration t
  • t::Int: Current time period
  • t0::Int: When to start adapting the covariance matrix. The covariance is adapted over a rolling window.
  • sd::T: See the paper.
  • ϵ::T: Will be added to the diagonal to prevent numerical non-positive-definiteness problems. If you run into numerical problems, try increasing this value.
  • accepted::Vector{Bool}: For each sample, indicates whether the sample was accepted (true) or the previous sample was kept (false).

Notes

  • Adaptive MH may be poorly suited when likelihood evaluations are costly, since the likelihood must be evaluated on the full dataset for every sample. Plans exist to make this faster.
  • Works best when started at a MAP estimate. The covariance adaptation rule is sketched below.
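
The proposal covariance of Haario et al. at step t is sd · Cov(past samples) + sd · ϵ · I, here computed over a rolling window as noted above. A sketch, with sd and ϵ as in the field list:

using Statistics, LinearAlgebra

# Adaptive-MH proposal covariance over a rolling window (Haario et al., 2001).
# X is a d × n matrix whose columns are the samples in the current window.
function adapt_covariance(X, sd, ϵ)
    d = size(X, 1)
    return sd .* cov(X; dims = 2) .+ (sd * ϵ) .* I(d)
end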

Mass Adaptation

BayesFlux.MassAdapter (Type)

Adapt the mass matrix in MCMC and especially dynamic MCMC methods such as HMC, GGMC, SGLD, SGNHT, ...

Mandatory Fields

  • Minv::AbstractMatrix: The inverse mass matrix used in HMC, GGMC, ...

Mandatory Functions

  • (madapter::MassAdapter)(s::MCMCState, θ::AbstractVector, bnn, ∇θ): Every mass adapter must be callable with the sampler state, the current sample, the BNN, and a gradient function as arguments. It must return the new Minv matrix. A minimal example follows below.
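
A minimal custom adapter satisfying this interface might look as follows: it never adapts and always returns the identity. The name IdentityMassAdapter is hypothetical.

using LinearAlgebra

# Hypothetical adapter that never adapts: always the identity inverse mass matrix.
mutable struct IdentityMassAdapter{T} <: MassAdapter
    Minv::Matrix{T}
end

IdentityMassAdapter(T = Float32) = IdentityMassAdapter(Matrix{T}(undef, 0, 0))

function (madapter::IdentityMassAdapter)(s::MCMCState, θ::AbstractVector, bnn, ∇θ)
    # Lazily size the identity to the parameter dimension, then keep it fixed.
    madapter.Minv = Matrix{eltype(θ)}(I, length(θ), length(θ))
    return madapter.Minv
end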
BayesFlux.DiagCovMassAdapter (Type)

Use the sample variances as the diagonal of the inverse mass matrix, as used in HMC, GGMC, ...

Fields

  • Minv: Inverse mass matrix as used in HMC, SGLD, GGMC, ...
  • adapt_steps: Number of adaptation steps.
  • windowlength: Lookback length for calculation of covariance.
  • t: Current step.
  • kappa: How much to shrink towards the identity.
  • epsilon: Small value to add to diagonal to avoid numerical instability.
BayesFlux.FullCovMassAdapter (Type)

Use the full covariance matrix of a moving average of samples as the mass matrix. This is similar to what is already done in Adaptive MH.

Fields

  • Minv: Inverse mass matrix as used in HMC, SGLD, GGMC, ...
  • adapt_steps: Number of adaptation steps.
  • windowlength: Lookback length for calculation of covariance.
  • t: Current step.
  • kappa: How much to shrink towards the identity (see the sketch below).
  • epsilon: Small value to add to the diagonal to avoid numerical instability.
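
Both covariance-based adapters shrink an empirical (co)variance estimate towards the identity by kappa and add epsilon to the diagonal for stability. Roughly, and with the exact weighting in the package possibly differing:

using Statistics, LinearAlgebra

# Shrinkage towards the identity, illustrative of both covariance adapters.
# X: d × n window of samples; kappa: shrinkage weight; epsilon: diagonal jitter.
shrunk_full(X, kappa, epsilon) =
    (1 - kappa) .* cov(X; dims = 2) .+ (kappa + epsilon) .* I(size(X, 1))
shrunk_diag(X, kappa, epsilon) =
    Diagonal((1 - kappa) .* vec(var(X; dims = 2)) .+ kappa .+ epsilon)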
BayesFlux.RMSPropMassAdapter (Type)

Use RMSProp as a preconditioner/mass-matrix adapter. This was proposed for use in SGLD and related methods in

Li, C., Chen, C., Carlson, D., & Carin, L. (2016). Preconditioned stochastic gradient Langevin dynamics for deep neural networks. In Thirtieth AAAI Conference on Artificial Intelligence.
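
The idea of Li et al. (2016) is to keep an exponential moving average of squared gradients and use its inverse square root as a diagonal preconditioner. Schematically, with the names α and λ being illustrative:

using LinearAlgebra

# RMSProp-style preconditioner, illustrative of Li et al. (2016).
# v is the running average of squared gradients; g a fresh stochastic gradient.
function rmsprop_precondition!(v, g; α = 0.99, λ = 1e-5)
    @. v = α * v + (1 - α) * g^2
    return Diagonal(1 ./ (sqrt.(v) .+ λ))   # diagonal inverse mass matrix
end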

Stepsize Adaptation

BayesFlux.StepsizeAdapter (Type)

Adapt the stepsize of MCMC algorithms.

Implementation Details

Mandatory Fields

  • l::Number: The stepsize to be used by the sampler.

Mandatory Functions

  • (sadapter::StepsizeAdapter)(s::MCMCState, mh_probability::Number): Every stepsize adapter must be callable with the sampler itself and the Metropolis-Hastings acceptance probability as arguments. The method must return the new stepsize. A minimal example follows below.
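
A minimal adapter satisfying this interface might simply keep the stepsize constant; the name ConstantStepsize is hypothetical.

# Hypothetical adapter implementing the callable interface: a constant stepsize.
mutable struct ConstantStepsize{T} <: StepsizeAdapter
    l::T
end

function (sadapter::ConstantStepsize)(s::MCMCState, mh_probability::Number)
    return sadapter.l   # never adapts, but still returns the (new) stepsize
end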
BayesFlux.DualAveragingStepSize (Type)

Use the Dual Average method to tune the stepsize.

The use of the Dual Average method was proposed in:

Hoffman, M. D., & Gelman, A. (2014). The No-U-Turn Sampler: Adaptively setting path lengths in Hamiltonian Monte Carlo. Journal of Machine Learning Research, 15(1), 1593-1623.
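
The dual averaging recursion (Hoffman & Gelman, 2014, Section 3.2) nudges the stepsize so that the average Metropolis-Hastings acceptance probability approaches a target δ. A self-contained sketch with the paper's default constants; the package's field names may differ.

# Dual-averaging stepsize adaptation (Hoffman & Gelman, 2014), illustrative.
mutable struct DualAvg
    μ::Float64       # log(10 * initial stepsize), the shrinkage target
    δ::Float64       # target acceptance probability
    γ::Float64       # shrinkage strength
    t0::Float64      # stabilises early iterations
    κ::Float64       # decay rate of the averaging weights
    Hbar::Float64    # running average of δ minus acceptance probability
    logϵbar::Float64
    t::Int
end

DualAvg(initial_ϵ; δ = 0.65, γ = 0.05, t0 = 10.0, κ = 0.75) =
    DualAvg(log(10 * initial_ϵ), δ, γ, t0, κ, 0.0, 0.0, 0)

function adapt!(da::DualAvg, accept_prob)
    da.t += 1
    η = 1 / (da.t + da.t0)
    da.Hbar = (1 - η) * da.Hbar + η * (da.δ - accept_prob)
    logϵ = da.μ - sqrt(da.t) / da.γ * da.Hbar
    w = da.t^(-da.κ)
    da.logϵbar = w * logϵ + (1 - w) * da.logϵbar
    return exp(logϵ)       # after adaptation ends, use exp(da.logϵbar)
end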