Providing Datasets

Typically, one of the most difficult parts of any data science problem is to bring the data into a form which lends itself to the subsequent analysis. This section aims to describe the containers used by InformationGeometry.jl to store datasets and models in detail.

The data itself is stored using the DataSet container.

InformationGeometry.DataSetType

The DataSet type is a versatile container for storing data. Typically, it is constructed by passing it three vectors x, y, sigma where the components of sigma quantify the standard deviation associated with each y-value. Alternatively, a full covariance matrix can be supplied for the ydata instead of a vector of standard deviations. The contents of a DataSetDS can later be accessed via xdata(DS), ydata(DS), sigma(DS).

Examples:

In the simplest case, where all data points are mutually independent and have a single $x$-component and a single $y$-component each, a DataSet consisting of four points can be constructed via

DataSet([1,2,3,4], [4,5,6.5,7.8], [0.5,0.45,0.6,0.8])

or alternatively by

using LinearAlgebra
DataSet([1,2,3,4], [4,5,6.5,7.8], Diagonal([0.5,0.45,0.6,0.8].^2))

where the diagonal covariance matrix in the second line is equivalent to the vector of standard deviations supplied in the first line.

More generally, if a dataset consists of $N$ points where each $x$-value has $n$ many components and each $y$-value has $m$ many components, this can be specified to the DataSet constructor via a tuple $(N,n,m)$ in addition to the vectors x, y and the covariance matrix. For example:

X = [0.9, 1.0, 1.1, 1.9, 2.0, 2.1, 2.9, 3.0, 3.1, 3.9, 4.0, 4.1]
Y = [1.0, 5.0, 4.0, 8.0, 9.0, 13.0, 16.0, 20.0]
Cov = Diagonal([2.0, 4.0, 2.0, 4.0, 2.0, 4.0, 2.0, 4.0])
dims = (4, 3, 2)
DS = DataSet(X, Y, Cov, dims)

In this case, X is a vector consisting of the concatenated x-values (with 3 components each) for 4 different data points. The values of Y are the corresponding concatenated y-values (with 2 components each) of said 4 data points. Clearly, the covariance matrix must therefore be a positive-definite $(m \cdot N) \times (m \cdot N)$ matrix.

To complete the specification of an inference problem, a model function which is assumed to be able to capture the relationship which is inherent in the data must be added.

InformationGeometry.DataModelType

In addition to storing a DataSet, a DataModel also contains a function model(x,θ) and its derivative dmodel(x,θ) where x denotes the x-value of the data and θ is a vector of parameters on which the model depends. Crucially, dmodel contains the derivatives of the model with respect to the parameters θ, not the x-values. For example

DS = DataSet([1,2,3,4], [4,5,6.5,7.8], [0.5,0.45,0.6,0.8])
model(x::Number, θ::AbstractVector{<:Number}) = θ[1] * x + θ[2]
DM = DataModel(DS, model)

In cases where the output of the model has more than one component (i.e. ydim > 1), it is advisable to define the model function in such a way that it outputs static vectors using StaticArrays.jl for increased performance. For ydim = 1, InformationGeometry.jl expects the model to output a number instead of a vector with one component. In contrast, the parameter configuration θ must always be supplied as a vector.

A starting value for the maximum likelihood estimation can be passed to the DataModel constructor by appending an appropriate vector, e.g.

DM = DataModel(DS, model, [1.0,2.5])

During the construction of a DataModel process which includes the search for the maximum likelihood estimate $\theta_\text{MLE}$, multiple tests are run. If necessary, these tests can be skipped by appending true as the last argument in the constructor:

DM = DataModel(DS, model, [-Inf,π,1+im], true)

If a DataModel is constructed as shown in the above examples, the gradient of the model with respect to the parameters θ (i.e. its "Jacobian") will be calculated using automatic differentiation. Alternatively, an explicit analytic expression for the Jacobian can be specified by hand:

using StaticArrays
function dmodel(x::Number, θ::AbstractVector{<:Number})
   @SMatrix [x  1.]     # ∂(model)/∂θ₁ and ∂(model)/∂θ₂
end
DM = DataModel(DS, model, dmodel)

The output of the Jacobian must be a matrix whose columns correspond to the partial derivatives with respect to different components of θ and whose rows correspond to evaluations at different components of x. Again, although it is not strictly required, outputting the Jacobian in form of a static matrix is typically beneficial for the overall performance.

The DataSet contained in a DataModel named DM can be accessed via Data(DM), whereas the model and its Jacobian can be used via DM.model and DM.dmodel respectively.

"Simple" DataSets and DataModels can be visualized directly via plot(DM) using pre-written recipes for the Plots.jl package.