Providing Datasets
Typically, one of the most difficult parts of any data science problem is bringing the data into a form which lends itself to the subsequent analysis. This section describes in detail the containers used by InformationGeometry.jl to store datasets and models.
The data itself is stored using the DataSet container.
InformationGeometry.DataSet - Type
The DataSet type is a versatile container for storing data. Typically, it is constructed by passing it three vectors x, y, sigma, where the components of sigma quantify the standard deviation associated with each y-value. Alternatively, a full covariance matrix can be supplied for the ydata instead of a vector of standard deviations. The contents of a DataSet DS can later be accessed via xdata(DS), ydata(DS), sigma(DS).
Examples:
In the simplest case, where all data points are mutually independent and have a single $x$-component and a single $y$-component each, a DataSet consisting of four points can be constructed via
DataSet([1,2,3,4], [4,5,6.5,7.8], [0.5,0.45,0.6,0.8])
or alternatively by
using LinearAlgebra
DataSet([1,2,3,4], [4,5,6.5,7.8], Diagonal([0.5,0.45,0.6,0.8].^2))
where the diagonal covariance matrix in the second line is equivalent to the vector of standard deviations supplied in the first line.
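The equivalence of the two constructor calls rests on the fact that independent errors with standard deviations $\sigma_i$ have the covariance matrix $\mathrm{diag}(\sigma_i^2)$. A minimal sketch of that relation, using only the LinearAlgebra standard library (the variable names are illustrative and no InformationGeometry.jl function is called):

```julia
using LinearAlgebra

σ = [0.5, 0.45, 0.6, 0.8]     # standard deviations of the four y-values
Σ = Diagonal(σ .^ 2)          # corresponding diagonal covariance matrix

# The standard deviations are recovered as square roots of the diagonal entries.
σ_recovered = sqrt.(diag(Σ))
@assert σ_recovered ≈ σ
```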
More generally, if a dataset consists of $N$ points where each $x$-value has $n$ components and each $y$-value has $m$ components, this can be specified to the DataSet constructor via a tuple $(N,n,m)$ in addition to the vectors x, y and the covariance matrix. For example:
X = [0.9, 1.0, 1.1, 1.9, 2.0, 2.1, 2.9, 3.0, 3.1, 3.9, 4.0, 4.1]
Y = [1.0, 5.0, 4.0, 8.0, 9.0, 13.0, 16.0, 20.0]
Cov = Diagonal([2.0, 4.0, 2.0, 4.0, 2.0, 4.0, 2.0, 4.0])
dims = (4, 3, 2)
DS = DataSet(X, Y, Cov, dims)
In this case, X is a vector consisting of the concatenated x-values (with 3 components each) of 4 different data points. The values of Y are the corresponding concatenated y-values (with 2 components each) of these 4 data points. The covariance matrix must therefore be a positive-definite $(m \cdot N) \times (m \cdot N)$ matrix, here $8 \times 8$.
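The concatenated layout can be checked by hand before a DataSet is constructed. The following sketch regroups the flat vectors into one column per data point and verifies the sizes implied by $(N,n,m)$. It only uses the LinearAlgebra standard library; the names Xmat and Ymat are hypothetical helpers for illustration and say nothing about how the package stores the data internally.

```julia
using LinearAlgebra

X = [0.9, 1.0, 1.1, 1.9, 2.0, 2.1, 2.9, 3.0, 3.1, 3.9, 4.0, 4.1]
Y = [1.0, 5.0, 4.0, 8.0, 9.0, 13.0, 16.0, 20.0]
Cov = Diagonal([2.0, 4.0, 2.0, 4.0, 2.0, 4.0, 2.0, 4.0])
N, n, m = 4, 3, 2

# Consistency of the flat vectors with the dimensions (N, n, m)
@assert length(X) == N * n
@assert length(Y) == N * m
@assert size(Cov) == (m * N, m * N)
@assert isposdef(Matrix(Cov))

# One column per data point (Julia reshapes in column-major order)
Xmat = reshape(X, n, N)
Ymat = reshape(Y, m, N)

Xmat[:, 1]   # x-value of the first point:  [0.9, 1.0, 1.1]
Ymat[:, 1]   # y-value of the first point:  [1.0, 5.0]
```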
To complete the specification of an inference problem, a model function must be added which is assumed to capture the relationship inherent in the data.
InformationGeometry.DataModel - Type
In addition to storing a DataSet, a DataModel also contains a function model(x,θ) and its derivative dmodel(x,θ), where x denotes the x-value of the data and θ is a vector of parameters on which the model depends. Crucially, dmodel contains the derivatives of the model with respect to the parameters θ, not the x-values. For example
DS = DataSet([1,2,3,4], [4,5,6.5,7.8], [0.5,0.45,0.6,0.8])
model(x::Number, θ::AbstractVector{<:Number}) = θ[1] * x + θ[2]
DM = DataModel(DS, model)
In cases where the output of the model has more than one component (i.e. ydim > 1), it is advisable to define the model function in such a way that it outputs static vectors using StaticArrays.jl for increased performance. For ydim = 1, InformationGeometry.jl expects the model to output a number instead of a vector with one component. In contrast, the parameter configuration θ must always be supplied as a vector.
A starting value for the maximum likelihood estimation can be passed to the DataModel constructor by appending an appropriate vector, e.g.
DM = DataModel(DS, model, [1.0,2.5])
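To get a feeling for what a reasonable starting value looks like, the estimate for this particular model can be worked out by hand. The sketch below assumes independent Gaussian errors with the given standard deviations, in which case the maximum likelihood estimate of a model that is linear in θ coincides with the weighted least-squares solution. It uses only the LinearAlgebra standard library and is not a description of how InformationGeometry.jl performs the optimization; the names A, W and θ_wls are illustrative.

```julia
using LinearAlgebra

x = [1.0, 2.0, 3.0, 4.0]
y = [4.0, 5.0, 6.5, 7.8]
σ = [0.5, 0.45, 0.6, 0.8]

A = hcat(x, ones(length(x)))    # design matrix of model(x, θ) = θ[1]*x + θ[2]
W = Diagonal(1 ./ σ .^ 2)       # inverse of the diagonal covariance matrix

# Weighted least squares: solve the normal equations (AᵀWA) θ = AᵀWy
θ_wls = (A' * W * A) \ (A' * W * y)

# At the solution, the weighted residuals are orthogonal to the columns of A.
residual = y - A * θ_wls
@assert norm(A' * W * residual) < 1e-8
```

A vector close to θ_wls would then be a natural candidate for the third argument of the constructor shown above.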
During the construction of a DataModel, which includes the search for the maximum likelihood estimate $\theta_\text{MLE}$, multiple tests are run. If necessary, these tests can be skipped by appending true as the last argument in the constructor:
DM = DataModel(DS, model, [-Inf,π,1+im], true)
If a DataModel is constructed as shown in the above examples, the gradient of the model with respect to the parameters θ (i.e. its "Jacobian") will be calculated using automatic differentiation. Alternatively, an explicit analytic expression for the Jacobian can be specified by hand:
using StaticArrays
function dmodel(x::Number, θ::AbstractVector{<:Number})
    @SMatrix [x 1.]    # ∂(model)/∂θ₁ and ∂(model)/∂θ₂
end
DM = DataModel(DS, model, dmodel)
The output of the Jacobian must be a matrix whose columns correspond to the partial derivatives with respect to the different components of θ and whose rows correspond to the different components of the model output evaluated at x. Again, although it is not strictly required, outputting the Jacobian in the form of a static matrix is typically beneficial for the overall performance.
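A hand-written Jacobian is easy to get wrong, so it can be worth comparing it against a numerical derivative before passing it to the constructor. The sketch below does this for the linear model from above with central finite differences in plain Julia. The helper fd_jacobian is hypothetical and not part of InformationGeometry.jl, and an ordinary 1×2 matrix stands in for the static matrix so that no extra package is needed.

```julia
model(x::Number, θ::AbstractVector{<:Number}) = θ[1] * x + θ[2]

# Analytic Jacobian: one row (single y-component), one column per parameter
dmodel(x::Number, θ::AbstractVector{<:Number}) = [x 1.0]

# Central finite-difference approximation of ∂model/∂θ for a scalar-valued model
function fd_jacobian(f, x, θ; h=1e-6)
    J = zeros(1, length(θ))
    for i in eachindex(θ)
        e = zeros(length(θ))
        e[i] = h                                  # perturb only the i-th parameter
        J[1, i] = (f(x, θ .+ e) - f(x, θ .- e)) / (2h)
    end
    return J
end

x, θ = 2.0, [1.0, 2.5]
@assert size(dmodel(x, θ)) == (1, length(θ))
@assert isapprox(dmodel(x, θ), fd_jacobian(model, x, θ); atol=1e-6)
```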
The DataSet contained in a DataModel named DM can be accessed via Data(DM), whereas the model and its Jacobian can be used via DM.model and DM.dmodel respectively.
"Simple" DataSets and DataModels can be visualized directly via plot(DM) using pre-written recipes for the Plots.jl package.