Providing Datasets

Typically, one of the most difficult parts of any data science problem is to bring the data into a form which lends itself to the subsequent analysis. This section aims to describe the containers used by InformationGeometry.jl to store datasets and models in detail.

The data itself is stored using the DataSet container.

InformationGeometry.DataSetType

The DataSet type is a versatile container for storing data. Typically, it is constructed by passing it three vectors x, y, sigma where the components of sigma quantify the standard deviation associated with each y-value. Alternatively, a full covariance matrix can be supplied for the ydata instead of a vector of standard deviations. The contents of a DataSet DS can later be accessed via xdata(DS), ydata(DS), ysigma(DS).

Examples:

In the simplest case, where all data points are mutually independent and have a single $x$-component and a single $y$-component each, a DataSet consisting of four points can be constructed via

DataSet([1,2,3,4], [4,5,6.5,7.8], [0.5,0.45,0.6,0.8])

or alternatively by

using LinearAlgebra
DataSet([1,2,3,4], [4,5,6.5,7.8], Diagonal([0.5,0.45,0.6,0.8].^2))

where the diagonal covariance matrix in the second line is equivalent to the vector of standard deviations supplied in the first line.

For measurements with multiple components, it is also possible to enter them as a Matrix where the columns correspond to the respective components.

DataSet([0, 0.5, 1], [1 100; 2 103; 3 108], [0.5 8; 0.4 5; 0.6 10])

Note that if the uncertainty matrix is square, it may be falsely interpreted as a covariance matrix instead of as the columnwise specification of standard deviations.

More generally, if a dataset consists of $N$ points where each $x$-value has $n$ many components and each $y$-value has $m$ many components, this can be specified to the DataSet constructor via a tuple $(N,n,m)$ in addition to the vectors x, y and the covariance matrix. For example:

X = [0.9, 1.0, 1.1, 1.9, 2.0, 2.1, 2.9, 3.0, 3.1, 3.9, 4.0, 4.1]
Y = [1.0, 5.0, 4.0, 8.0, 9.0, 13.0, 16.0, 20.0]
Cov = Diagonal([2.0, 4.0, 2.0, 4.0, 2.0, 4.0, 2.0, 4.0])
dims = (4, 3, 2)
DS = DataSet(X, Y, Cov, dims)

In this case, X is a vector consisting of the concatenated x-values (with 3 components each) for 4 different data points. The values of Y are the corresponding concatenated y-values (with 2 components each) of said 4 data points. Clearly, the covariance matrix must therefore be a positive-definite $(m \cdot N) \times (m \cdot N)$ matrix.

To complete the specification of an inference problem, a model function which is assumed to be able to capture the relationship which is inherent in the data must be added.

InformationGeometry.DataModelType

In addition to storing a DataSet, a DataModel also contains a function model(x,θ) and its derivative dmodel(x,θ) where x denotes the x-value of the data and θ is a vector of parameters on which the model depends. Crucially, dmodel contains the derivatives of the model with respect to the parameters θ, not the x-values. For example

DS = DataSet([1,2,3,4], [4,5,6.5,7.8], [0.5,0.45,0.6,0.8])
model(x::Number, θ::AbstractVector{<:Number}) = θ[1] * x + θ[2]
DM = DataModel(DS, model)

In cases where the output of the model has more than one component (i.e. ydim > 1), it is advisable to define the model function in such a way that it outputs static vectors using StaticArrays.jl for increased performance. For ydim = 1, InformationGeometry.jl expects the model to output a number instead of a vector with one component. In contrast, the parameter configuration θ must always be supplied as a vector (even if it only has a single component).

An initial guess for the maximum likelihood parameters can optionally be passed to the DataModel as a vector via

DM = DataModel(DS, model, [1.0,2.5])

During the construction of a DataModel process which includes the search for the maximum likelihood estimate $\theta_\text{MLE}$, multiple tests are run. If necessary, these tests can be skipped by appending true as the last argument in the constructor:

DM = DataModel(DS, model, [-Inf,π,1], true)

If a DataModel is constructed as shown in the above examples, the gradient of the model with respect to the parameters θ (i.e. its "Jacobian") will be calculated using automatic differentiation. Alternatively, an explicit analytic expression for the Jacobian can be specified by hand:

using StaticArrays
function dmodel(x::Number, θ::AbstractVector{<:Number})
   @SMatrix [x  1.]     # ∂(model)/∂θ₁ and ∂(model)/∂θ₂
end
DM = DataModel(DS, model, dmodel)

The output of the Jacobian must be a matrix whose columns correspond to the partial derivatives with respect to different components of θ and whose rows correspond to evaluations at different components of x. Again, although it is not strictly required, outputting the Jacobian in form of a static matrix is typically beneficial for the overall performance.

It is also possible to specify a (logarithmized) prior distribution on the parameter space to the DataModel constructor after the initial guess for the MLE. For example:

using Distributions
Dist = MvNormal(ones(2), [1 0; 0 3.])
LogPriorFn(θ) = logpdf(Dist, θ)
DM = DataModel(DS, model, [1.0,2.5], LogPriorFn)

The DataSet contained in a DataModel named DM can be accessed via Data(DM), whereas the model and its Jacobian can be used via Predictor(DM) and dPredictor(DM) respectively. The MLE and the value of the log-likelihood at the MLE are accessible via MLE(DM) and LogLikeMLE(DM). The logarithmized prior can be accessed via LogPrior(DM).

"Simple" DataSets and DataModels can be visualized directly via plot(DM) using pre-written recipes for the Plots.jl package.