Advanced Datasets

The following table illustrates the capabilities of the various data types implemented by InformationGeometry.jl:

Container	allows non-Gaussian `y`-uncertainty	allows `x`-uncertainty	allows mixed `x`-`y` uncertainty	allows missing values
`DataSet`	✖	✖	✖	✖
`DataSetExact`	✔	✔	✖	✖
`GeneralizedDataSet`	✔	✔	✔	✖
`CompositeDataSet`	✖	✔	✖	✔

InformationGeometry.DataSetExact — Type

DataSetExact(x::AbstractArray, y::AbstractArray, Σ_y::AbstractArray)
DataSetExact(x::AbstractArray, Σ_x::AbstractArray, y::AbstractArray, Σ_y::AbstractArray)
DataSetExact(xd::Distribution, yd::Distribution, dims::Tuple{Int,Int,Int}=(length(xd),1,1))

A data container which allows for uncertainties in the independent variables, i.e. the $x$-variables. The observed data is stored as two probability distributions over the spaces $\mathcal{X}^N$ and $\mathcal{Y}^N$ respectively, which also allows for non-Gaussian uncertainties in the observations. For instance, the uncertainty associated with a given observation might follow a Cauchy, Student's t, log-normal or some other smooth distribution.

Examples:

using InformationGeometry, Distributions
X = product_distribution([Normal(0, 1), Cauchy(2, 0.5)])
Y = MvTDist(2, [3, 8.], [1 0.5; 0.5 3])
DataSetExact(X, Y, (2,1,1))

Note

Uncertainties in the independent $x$-variables are optional for DataSetExact, and can be set to zero by wrapping the x-data in an InformationGeometry.Dirac "distribution". The following illustrates numerically equivalent ways of encoding a dataset whose uncertainties in the $x$-variables are zero:

using InformationGeometry, Distributions, LinearAlgebra
DS1 = DataSetExact(InformationGeometry.Dirac([1,2]), MvNormal([5,6], Diagonal([0.1, 0.2].^2)))
DS2 = DataSetExact([1,2], [5,6], [0.1, 0.2])
DS3 = DataSet([1,2], [5,6], [0.1, 0.2])

where DS1 == DS2 == DS3 will evaluate to true.
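Conversely, nonzero $x$-uncertainties can be supplied through the four-argument constructor listed above. The following is a minimal sketch with made-up values; it assumes that a vector passed as Σ_x is read as $1\sigma$ standard deviations, in the same way the vector Σ_y is in DS2.

```julia
using InformationGeometry

# Illustrative values only; vectors are assumed to be read as 1σ uncertainties
x = [1.0, 2.0, 3.0];   σx = [0.05, 0.05, 0.10]
y = [4.0, 5.0, 6.5];   σy = [0.5, 0.45, 0.6]

# Argument order follows the signature DataSetExact(x, Σ_x, y, Σ_y)
DSx = DataSetExact(x, σx, y, σy)
```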

InformationGeometry.CompositeDataSet — Type

The CompositeDataSet type is a more elaborate (and typically less performant) container for storing data. Essentially, it splits observed data which has multiple y-components into separate data containers (e.g. of type DataSet), each of which corresponds to one of the components of the y-data. Crucially, each of the smaller data containers still shares the same "kind" of x-data, that is, the same xdim, units and so on, although they do not need to share the exact same particular x-data.

The main advantage of this approach is that it can be applied when there are missing y-components in some observations. A typical use case for a CompositeDataSet is a time series where multiple quantities are tracked but not every quantity is necessarily recorded at each time step. Example:

using InformationGeometry, DataFrames
t = [1,2,3,4]
y₁ = [2.5, 6, missing, 9];      y₂ = [missing, 5, 3.1, 1.4]
σ₁ = 0.3*ones(4);               σ₂ = [missing, 0.2, 0.1, 0.5]
df = DataFrame([t y₁ σ₁ y₂ σ₂], :auto)

xdim = 1;   ydim = 2
CompositeDataSet(df, xdim, ydim; xerrs=false, stripedYs=true)

The boolean-valued keywords stripedXs and stripedYs tell the constructor how the columns are laid out: either the values and their corresponding $1\sigma$ uncertainties alternate column by column, or the first ydim columns hold the values and the next ydim columns hold the corresponding uncertainties. In addition, xerrs=true indicates that the x-values also carry uncertainties. Nearly all functions which can be called on other data containers such as DataSet have been specialized to also work with CompositeDataSets.
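As a sketch of the second layout, the same data as in the example above can be arranged with all values first and all uncertainties afterwards. This assumes that stripedYs=false selects the block layout, as the description of the keyword suggests; the name df_block is chosen here for illustration only.

```julia
using InformationGeometry, DataFrames

t = [1, 2, 3, 4]
y₁ = [2.5, 6, missing, 9];      y₂ = [missing, 5, 3.1, 1.4]
σ₁ = 0.3*ones(4);               σ₂ = [missing, 0.2, 0.1, 0.5]

# Block layout: x-column | ydim value columns | ydim uncertainty columns
df_block = DataFrame([t y₁ y₂ σ₁ σ₂], :auto)

xdim = 1;   ydim = 2
# Assumption: stripedYs=false tells the constructor to expect the block layout
CompositeDataSet(df_block, xdim, ydim; xerrs=false, stripedYs=false)
```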

InformationGeometry.GeneralizedDataSet — Type

GeneralizedDataSet(dist::ContinuousMultivariateDistribution, dims::Tuple{Int,Int,Int}=(length(dist), 1, 1))

Data structure which can take general x-y-covariance into account, where dims=(Npoints, xdim, ydim) indicates the dimensionality of the data. dist should constitute a smooth distribution over the space $\mathcal{X}^N \times \mathcal{Y}^N$, where mean(dist) is interpreted as the concatenation of the (most likely values for the) observations $(x_1, ..., x_N, y_1, ..., y_N)$ and the width of dist specifies the uncertainty in the signal. Typically, dist is a multivariate Gaussian, but other distributions such as Cauchy or Student's t-distributions are also possible. Thus, arbitrary correlations between the dependent $y$ and independent $x$ variables can be encoded.
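A minimal sketch of how such a joint distribution might be assembled for three points with xdim = ydim = 1 is given below. All numerical values and the helper names Σxx, Σxy, Σyy and GDS are made up for illustration, and the call assumes that an MvNormal is accepted as dist, in line with the signature above.

```julia
using InformationGeometry, Distributions, LinearAlgebra

# Illustrative data: N = 3 observations
x = [1.0, 2.0, 3.0]
y = [4.0, 5.0, 6.5]
μ = vcat(x, y)                  # ordering (x_1, ..., x_N, y_1, ..., y_N)

# Covariance blocks (hypothetical values)
Σxx = diagm(fill(0.1^2, 3))     # x-uncertainties
Σyy = diagm(fill(0.5^2, 3))     # y-uncertainties
Σxy = diagm(fill(0.02, 3))      # covariance between x_i and y_i

# Joint covariance over (x, y); must be symmetric positive definite
Σ = [Σxx Σxy; Σxy' Σyy]
@assert isposdef(Σ)

GDS = GeneralizedDataSet(MvNormal(μ, Σ), (3, 1, 1))
```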

Note

If there is no correlation between the $x$ and $y$ variables (i.e. if the off-diagonal blocks of cov(dist) are zero), it can be more performant to use the type DataSetExact to encode the given data instead.
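With the ordering $(x_1, ..., x_N, y_1, ..., y_N)$ used for mean(dist), the block structure referred to in this note can be written out as follows. The symbols $\Sigma_{xx}$, $\Sigma_{xy}$ and $\Sigma_{yy}$ are notation introduced here for illustration and are not part of the package.

```latex
\[
\operatorname{cov}(\mathrm{dist}) =
\begin{pmatrix}
\Sigma_{xx} & \Sigma_{xy} \\
\Sigma_{xy}^{\top} & \Sigma_{yy}
\end{pmatrix}
\]
```

The condition in the note corresponds to $\Sigma_{xy} = 0$. For a Gaussian dist, the joint distribution then factorizes into one distribution over $\mathcal{X}^N$ and one over $\mathcal{Y}^N$, which is the form stored by DataSetExact.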