Providing Datasets
Typically, one of the most difficult parts of any data science problem is bringing the data into a form which lends itself to the subsequent analysis. This section therefore describes in detail the containers used by InformationGeometry.jl to store datasets and models.
The data itself is stored using the DataSet container.
InformationGeometry.DataSet - Type
The DataSet type is a container for data points. It holds 3 vectors x, y, sigma, where the components of sigma quantify the standard deviation associated with each measurement. Its fields can be obtained via xdata(DS), ydata(DS), sigma(DS).
In the simplest case, where all data points are mutually independent and have a single $x$-component and a single $y$-component each, a four-point DataSet can be constructed via
DataSet([1,2,3,4],[4,5,6.5,7.8],[0.5,0.45,0.6,0.8])
where the three arguments are the vector of x-values, the vector of y-values and the vector of 1σ uncertainties associated with the y-values, respectively.
Depending on the dimensionality of the dataset, that is, the number of components of the respective x-values and y-values, there are multiple ways in which the DataSet can be constructed.
To complete the specification of the inference problem, a model function which takes an x-value and a parameter configuration $\theta$ must be added.
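As a rough illustration of how data and a model function combine into an inference problem, the following plain-Julia sketch evaluates a log-likelihood for the four-point dataset above. It assumes independent, normally distributed errors on the y-values, in line with reading the sigma entries as standard deviations. The names linmodel and loglikelihood_sketch are hypothetical helpers introduced here for illustration; they are not part of InformationGeometry.jl.

```julia
# Plain-Julia sketch; no packages required.
xs = [1, 2, 3, 4]
ys = [4, 5, 6.5, 7.8]
σs = [0.5, 0.45, 0.6, 0.8]

# Hypothetical model: straight line with slope θ[1] and intercept θ[2].
linmodel(x, θ) = θ[1] * x + θ[2]

# Log-likelihood under the ASSUMPTION of independent Gaussian errors.
function loglikelihood_sketch(θ, xs, ys, σs)
    total = 0.0
    for (x, y, σ) in zip(xs, ys, σs)
        r = (y - linmodel(x, θ)) / σ              # residual in units of σ
        total += -0.5 * r^2 - log(σ * sqrt(2π))   # log of a normal density
    end
    return total
end

# A parameter guess close to the data scores higher than a poor one.
good = loglikelihood_sketch([1.3, 2.6], xs, ys, σs)
poor = loglikelihood_sketch([0.0, 0.0], xs, ys, σs)
```

Larger values indicate parameter configurations that describe the data better, so comparing good and poor shows how the uncertainties weight each residual.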
InformationGeometry.DataModel - Type
In addition to a DataSet, a DataModel contains the model as a function model(x,θ) and its derivative dmodel(x,θ), where x denotes the x-value of the data and θ is a vector of parameters on which the model depends. Crucially, dmodel contains the derivatives of the model with respect to the parameters θ, not the x-values. For example:
DS = DataSet([1,2,3.],[4,5,6.5],[0.5,0.45,0.6])
model(x,θ::Vector) = θ[1] .* x .+ θ[2]
DM = DataModel(DS,model)
If provided like this, the gradient of the model with respect to the parameters θ (i.e. its "Jacobian") will be calculated using automatic differentiation. Alternatively, an explicit analytic expression for the Jacobian can be specified by hand:
function dmodel(x,θ::Vector)
    J = Array{Float64}(undef, length(x), length(θ))
    @. J[:,1] = x        # ∂(model)/∂θ₁
    @. J[:,2] = 1.       # ∂(model)/∂θ₂
    return J
end
DM = DataModel(DS,model,dmodel)
The output of the Jacobian must be a matrix whose columns correspond to the partial derivatives with respect to different components of θ and whose rows correspond to evaluations at different values of x.
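One way to check that a hand-written Jacobian follows this layout is to compare it against a finite-difference approximation. The sketch below does so in plain Julia for the linear model used above. The helper fd_jacobian is hypothetical and not part of InformationGeometry.jl, and central differences are used here only as a stand-in for whatever differentiation the package performs internally.

```julia
# Plain-Julia sketch; no packages required.
model(x, θ::Vector) = θ[1] .* x .+ θ[2]

function dmodel(x, θ::Vector)
    J = Array{Float64}(undef, length(x), length(θ))
    @. J[:, 1] = x       # ∂(model)/∂θ₁
    @. J[:, 2] = 1.0     # ∂(model)/∂θ₂
    return J
end

# Hypothetical helper: central finite differences with respect to θ.
function fd_jacobian(f, x, θ; h = 1e-6)
    J = Array{Float64}(undef, length(x), length(θ))
    for j in eachindex(θ)
        step = zeros(length(θ))
        step[j] = h                      # perturb only the j-th parameter
        J[:, j] = (f(x, θ .+ step) .- f(x, θ .- step)) ./ (2h)
    end
    return J
end

x = [1, 2, 3.0]
θ = [1.3, 2.6]

dims = size(dmodel(x, θ))    # (3, 2): one row per x-value, one column per parameter
maxdev = maximum(abs.(dmodel(x, θ) .- fd_jacobian(model, x, θ)))
```

If the rows and columns of dmodel were accidentally transposed, the size check would fail for any dataset whose number of points differs from the number of parameters, which makes it a cheap sanity test before constructing a DataModel.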
To be continued...