Providing Datasets

Typically, one of the most difficult parts of any data science problem is bringing the data into a form that lends itself to subsequent analysis. This section therefore describes in detail the containers that InformationGeometry.jl uses to store datasets and models.

The data itself is stored using the DataSet container.

InformationGeometry.DataSet (type)

The DataSet type is a container for data points. It holds three vectors x, y and sigma, where the components of sigma quantify the standard deviation associated with each measurement. Its fields can be obtained via xdata(DS), ydata(DS) and sigma(DS).

In the simplest case, where all data points are mutually independent and have a single $x$-component and a single $y$-component each, a four-point DataSet can be constructed via

DataSet([1,2,3,4],[4,5,6.5,7.8],[0.5,0.45,0.6,0.8])

where the three arguments are vectors of x-values, y-values and the 1σ uncertainties associated with the y-values, respectively.

Depending on the dimensionality of the dataset, that is, the number of components of each x-value and y-value, a DataSet can be constructed in several ways.

To complete the specification of the inference problem, a model function which takes an x-value and a parameter configuration $\theta$ must be added.
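As a rough sketch of what this inference problem amounts to, suppose that the measurement errors are independent and Gaussian with standard deviations $\sigma_i$. The Gaussian form is an assumption made here for illustration and is not stated in this section. The symbols $\ell$, $N$ and $y_\text{model}$ are notation introduced only for this sketch. Under that assumption, the log-likelihood of a parameter configuration $\theta$ given $N$ data points $(x_i, y_i, \sigma_i)$ would read:

```math
\ell(\theta) = -\frac{1}{2} \sum_{i=1}^{N} \left[ \left( \frac{y_i - y_\text{model}(x_i;\theta)}{\sigma_i} \right)^{2} + \ln\!\left( 2\pi \sigma_i^{2} \right) \right]
```

The first term is the usual sum of squared residuals weighted by the uncertainties, which is why both the dataset and a model function are needed before anything can be inferred about $\theta$.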

InformationGeometry.DataModel (type)

In addition to a DataSet, a DataModel contains the model as a function model(x,θ) and its derivative dmodel(x,θ), where x denotes the x-value of the data and θ is a vector of parameters on which the model depends. Crucially, dmodel contains the derivatives of the model with respect to the parameters θ, not the x-values. For example:

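using InformationGeometry
# Three data points with 1σ uncertainties on the y-values
DS = DataSet([1,2,3.],[4,5,6.5],[0.5,0.45,0.6])
# Linear model: slope θ[1], intercept θ[2]
model(x,θ::Vector) = θ[1] .* x .+ θ[2]
DM = DataModel(DS,model)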

If provided like this, the gradient of the model with respect to the parameters θ (i.e. its "Jacobian") will be calculated using automatic differentiation. Alternatively, an explicit analytic expression for the Jacobian can be specified by hand:

function dmodel(x,θ::Vector)
   J = Array{Float64}(undef, length(x), length(θ))
   @. J[:,1] = x        # ∂(model)/∂θ₁
   @. J[:,2] = 1.       # ∂(model)/∂θ₂
   return J
end
DM = DataModel(DS,model,dmodel)
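A hand-written Jacobian is easy to get wrong, so it can be worth checking it numerically before passing it to DataModel. The following is a minimal sketch of such a check using central finite differences in plain Julia. The helper fd_jacobian is hypothetical and not part of InformationGeometry.jl, and this is a different technique from the automatic differentiation the package uses. The definitions of model and dmodel are repeated from above so that the block runs on its own, without the package.

```julia
# Linear model and hand-written Jacobian, repeated from the example above.
model(x, θ::Vector) = θ[1] .* x .+ θ[2]

function dmodel(x, θ::Vector)
    J = Array{Float64}(undef, length(x), length(θ))
    @. J[:,1] = x        # ∂(model)/∂θ₁
    @. J[:,2] = 1.       # ∂(model)/∂θ₂
    return J
end

# Hypothetical helper (not provided by InformationGeometry.jl):
# approximates the Jacobian with respect to θ by central finite differences,
# filling one column per parameter and one row per x-value.
function fd_jacobian(f, x, θ::Vector; h=1e-6)
    J = Array{Float64}(undef, length(x), length(θ))
    for j in eachindex(θ)
        θplus = float.(θ)       # fresh copy, so θ itself is not modified
        θminus = float.(θ)
        θplus[j] += h
        θminus[j] -= h
        J[:, j] .= (f(x, θplus) .- f(x, θminus)) ./ (2h)
    end
    return J
end

x = [1, 2, 3.]          # x-values of the example dataset
θ = [2.0, 1.0]          # arbitrary test parameters (slope, intercept)

J_analytic = dmodel(x, θ)
J_numeric = fd_jacobian(model, x, θ)

size(J_analytic)                                # (3, 2)
isapprox(J_analytic, J_numeric; atol=1e-6)      # true when the two agree
```

Because this model is linear in θ, the finite-difference result matches the analytic one up to rounding. For nonlinear models the step size h and the tolerance may need adjusting.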

The Jacobian function must return a matrix whose columns correspond to the partial derivatives with respect to the different components of θ and whose rows correspond to evaluations at different values of x.
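To illustrate this layout, the Jacobian of the linear model above, $\theta_1 x + \theta_2$, can be written out for the three x-values of the example dataset. The index notation $J_{ij}$ is introduced here only for illustration:

```math
J_{ij} = \frac{\partial\, \mathrm{model}(x_i, \theta)}{\partial \theta_j},
\qquad
J = \begin{pmatrix} x_1 & 1 \\ x_2 & 1 \\ x_3 & 1 \end{pmatrix}
  = \begin{pmatrix} 1 & 1 \\ 2 & 1 \\ 3 & 1 \end{pmatrix}
```

Each row belongs to one x-value and each column to one parameter, so a dataset with three points and a model with two parameters yields a $3 \times 2$ matrix.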

To be continued...