Technical description

Variational inference via the re-parametrisation trick

Our goal is to approximate the true posterior distribution $p(\theta|\mathcal{D})$, which we can evaluate only up to a normalising constant, with a Gaussian $q(\theta) = \mathcal{N}(\theta|\mu,\Sigma)$ by maximising the evidence lower bound (ELBO):

$\int q(\theta) \log p(\mathcal{D}, \theta) \, d\theta + \mathcal{H}[q]$,

where $\mathcal{H}[q]$ denotes the entropy of $q$. The above integral is approximated as a Monte Carlo average over $S$ samples $\theta_s \sim q(\theta)$:

$\frac{1}{S} \sum_{s=1}^S \log p(\mathcal{D}, \theta_s) + \mathcal{H}[q]$.
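As a concrete illustration, the Monte Carlo term alone might be computed as in the following minimal NumPy sketch (the toy log joint log_joint is an assumption made for the example and is not part of the package). Note that the dependence on $\mu$ and $\Sigma$ is hidden inside the sampling step, which is exactly what the re-parametrisation trick below makes explicit.

```python
import numpy as np

def log_joint(theta):
    # Assumed toy unnormalised log joint, log p(D, theta): a zero-mean correlated Gaussian.
    A = np.array([[2.0, 0.5], [0.5, 1.0]])
    return -0.5 * theta @ A @ theta

rng = np.random.default_rng(0)
mu, Sigma = np.zeros(2), np.eye(2)                     # current variational parameters
S = 200
thetas = rng.multivariate_normal(mu, Sigma, size=S)    # theta_s ~ q(theta)
mc_term = np.mean([log_joint(t) for t in thetas])      # (1/S) * sum_s log p(D, theta_s)
print(mc_term)
```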

Using the re-parametrisation trick, we write each sample as a deterministic function of the variational parameters, re-introducing them into the objective so that they can be optimised directly:

$\frac{1}{S} \sum_{s=1}^S \log p(\mathcal{D}, \mu + C z_s) + \mathcal{H}[q]$,

where $z_s\sim\mathcal{N}(0,I)$ and $C$ is a matrix root of $\Sigma$, i.e. $CC^T = \Sigma$ (for instance, its Cholesky factor). The entropy term has the closed form $\mathcal{H}[q] = \frac{1}{2}\log\det(2\pi e\, CC^T)$ and therefore also depends on $C$.

By maximising this approximate lower bound with respect to the variational parameters $\mu$ and $C$ we obtain the approximate posterior $q(\theta) = \mathcal{N}(\theta|\mu,CC^T)$, the Gaussian that best approximates the true posterior $p(\theta|\mathcal{D})$ in the sense of maximising the ELBO.
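To make the reparametrised objective concrete, here is a minimal NumPy sketch of the estimator, again with an assumed toy log joint; the function names (log_joint, gaussian_entropy, elbo) are illustrative and not the package's API.

```python
import numpy as np

def log_joint(theta):
    # Assumed toy unnormalised log joint, log p(D, theta).
    A = np.array([[2.0, 0.5], [0.5, 1.0]])
    return -0.5 * theta @ A @ theta

def gaussian_entropy(C):
    # Entropy of N(mu, C C^T): 0.5*log det(2*pi*e*C C^T) = 0.5*d*(1 + log 2*pi) + log|det C|.
    d = C.shape[0]
    return 0.5 * d * (1.0 + np.log(2.0 * np.pi)) + np.linalg.slogdet(C)[1]

def elbo(mu, C, zs):
    # Reparametrised Monte Carlo estimate: (1/S) sum_s log p(D, mu + C z_s) + H[q].
    mc_term = np.mean([log_joint(mu + C @ z) for z in zs])
    return mc_term + gaussian_entropy(C)

rng = np.random.default_rng(0)
S, d = 200, 2
zs = rng.standard_normal((S, d))    # z_s ~ N(0, I)
mu, C = np.zeros(d), np.eye(d)      # variational parameters to be optimised
print(elbo(mu, C, zs))
```

Because the samples $z_s$ no longer depend on the variational parameters, gradients of this expression with respect to $\mu$ and $C$ can be obtained straightforwardly, e.g. by automatic differentiation or finite differences.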

The number of samples $S$ in the above description can be controlled via the option S when calling VI.

In more detail

This package implements variational inference using the re-parametrisation trick. Contrary to other flavours of the method that repeatedly draw new samples $z_s$ at each iteration of the optimiser, here a large number of samples $z_s$ is drawn at the start and kept fixed throughout the execution of the algorithm[1]. This avoids the difficulty of working with noisy gradients and allows the use of optimisers like LBFGS. Using LBFGS does away with the typical requirement of tuning learning rates (step sizes). However, this comes at the expense of risking overfitting to the particular samples $z_s$ that happened to be drawn at the start. The package provides a mechanism for monitoring potential overfitting[2] via the options Stest and test_every; see the sketch below. Because the samples $z_s$ are fixed, the algorithm does not enjoy the speed of stochastic-gradient optimisation. As a consequence, the present package is recommended for problems with a relatively small number of parameters, e.g. roughly 2-20.
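The following sketch illustrates, in generic NumPy/SciPy rather than via the package's interface, the two ideas described above: the samples $z_s$ are drawn once and reused at every iteration, so the objective is deterministic and can be handed to an L-BFGS optimiser, and an independent, larger set of samples is used to monitor potential overfitting (playing the role that the options Stest and test_every play in the package). The toy log joint and all function names are assumptions made for the example.

```python
import numpy as np
from scipy.optimize import minimize

def log_joint(theta):
    # Assumed toy unnormalised log joint, log p(D, theta).
    A = np.array([[2.0, 0.5], [0.5, 1.0]])
    return -0.5 * theta @ A @ theta

def elbo(mu, C, zs):
    # Reparametrised ELBO estimate on a given, fixed set of samples zs.
    d = mu.size
    entropy = 0.5 * d * (1.0 + np.log(2.0 * np.pi)) + np.linalg.slogdet(C)[1]
    return np.mean([log_joint(mu + C @ z) for z in zs]) + entropy

d = 2
rng = np.random.default_rng(0)
zs_train = rng.standard_normal((200, d))    # drawn once, kept fixed (cf. option S)
zs_test = rng.standard_normal((2000, d))    # independent samples for monitoring (cf. Stest)

def unpack(x):
    return x[:d], x[d:].reshape(d, d)

def objective(x):
    mu, C = unpack(x)
    return -elbo(mu, C, zs_train)           # deterministic, so L-BFGS can be used

def monitor(x):
    # Called after each iteration (the package exposes the interval via test_every):
    # a test ELBO that stops improving while the train ELBO keeps rising signals
    # overfitting to the fixed samples zs_train.
    mu, C = unpack(x)
    print(f"train ELBO {elbo(mu, C, zs_train):9.4f}   test ELBO {elbo(mu, C, zs_test):9.4f}")

x0 = np.concatenate([np.zeros(d), np.eye(d).ravel()])
result = minimize(objective, x0, method="L-BFGS-B", callback=monitor)
mu_opt, C_opt = unpack(result.x)
print("approximate posterior mean:", mu_opt)
print("approximate posterior covariance:\n", C_opt @ C_opt.T)
```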

The work was independently developed and published here (Arxiv link). Of course, the method has been widely popularised by the works Doubly Stochastic Variational Bayes for non-Conjugate Inference and Auto-Encoding Variational Bayes. The method appeared independently earlier in Fixed-Form Variational Posterior Approximation through Stochastic Linear Regression and later in A comparison of variational approximations for fast inference in mixed logit models, and perhaps in other publications too.