Advanced Datasets
The following table illustrates the capabilities of the various data types implemented by InformationGeometry.jl:
Container | allows non-Gaussian y-uncertainty | allows x-uncertainty | allows mixed x-y uncertainty | allows missing values |
---|---|---|---|---|
DataSet | ✖ | ✖ | ✖ | ✖ |
DataSetExact | ✔ | ✔ | ✖ | ✖ |
GeneralizedDataSet | ✔ | ✔ | ✔ | ✖ |
CompositeDataSet | ✖ | ✔ | ✖ | ✔ |
InformationGeometry.DataSetExact - Type
DataSetExact(x::AbstractArray, y::AbstractArray, Σ_y::AbstractArray)
DataSetExact(x::AbstractArray, Σ_x::AbstractArray, y::AbstractArray, Σ_y::AbstractArray)
DataSetExact(xd::Distribution, yd::Distribution, dims::Tuple{Int,Int,Int}=(length(xd),1,1))
A data container which allows for uncertainties in the independent variables, i.e. the $x$-variables. The observed data is stored in terms of two probability distributions over the spaces $\mathcal{X}^N$ and $\mathcal{Y}^N$ respectively, which also allows for non-Gaussian uncertainties in the observations. For instance, the uncertainties associated with a given observation might follow a Cauchy, Student's t, log-normal or some other smooth distribution.
Examples:
using InformationGeometry, Distributions
X = product_distribution([Normal(0, 1), Cauchy(2, 0.5)])
Y = MvTDist(2, [3, 8.], [1 0.5; 0.5 3])
DataSetExact(X, Y, (2,1,1))
Uncertainties in the independent $x$-variables are optional for DataSetExact and can be set to zero by wrapping the x-data in an InformationGeometry.Dirac "distribution". The following illustrates numerically equivalent ways of encoding a dataset whose uncertainties in the $x$-variables are zero:
using InformationGeometry, Distributions, LinearAlgebra
DS1 = DataSetExact(InformationGeometry.Dirac([1,2]), MvNormal([5,6], Diagonal([0.1, 0.2].^2)))
DS2 = DataSetExact([1,2], [5,6], [0.1, 0.2])
DS3 = DataSet([1,2], [5,6], [0.1, 0.2])
where DS1 == DS2 == DS3 will evaluate to true.
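For the opposite case, where the $x$-values do carry uncertainties, a minimal sketch using the four-argument constructor listed above might look as follows. The numbers are made up for illustration, and it is an assumption here that vectors passed for Σ_x are read as standard deviations in the same way as the y-uncertainties in the DS2 example.

```julia
using InformationGeometry

# Three observations with 1σ uncertainties on both x and y (illustrative values).
x  = [1.0, 2.0, 3.0]
σx = [0.1, 0.1, 0.2]    # assumption: interpreted as standard deviations
y  = [4.0, 5.0, 6.5]
σy = [0.5, 0.45, 0.6]

# Argument order follows the documented signature DataSetExact(x, Σ_x, y, Σ_y).
DSx = DataSetExact(x, σx, y, σy)
```

Note that the uncertainties come second and fourth in this form, directly after the values they belong to.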
InformationGeometry.CompositeDataSet - Type
The CompositeDataSet type is a more elaborate (and typically less performant) container for storing data. Essentially, it splits observed data with multiple y-components into separate data containers (e.g. of type DataSet), each of which corresponds to one component of the y-data. Crucially, each of the smaller data containers still shares the same "kind" of x-data, that is, the same xdim, units and so on, although they do not need to share exactly the same x-values.
The main advantage of this approach is that it can be applied when y-components are missing in some observations. A typical use case for CompositeDataSets is a time series in which multiple quantities are tracked but not every quantity is recorded at each time step. Example:
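using InformationGeometry, DataFrames
t = [1,2,3,4]
y₁ = [2.5, 6, missing, 9]; y₂ = [missing, 5, 3.1, 1.4]
σ₁ = 0.3*ones(4); σ₂ = [missing, 0.2, 0.1, 0.5]
# Columns alternate between values and uncertainties: t, y₁, σ₁, y₂, σ₂
df = DataFrame([t y₁ σ₁ y₂ σ₂], :auto)  # :auto generates the column names that DataFrames 1.x requires
xdim = 1; ydim = 2
CompositeDataSet(df, xdim, ydim; xerrs=false, stripedYs=true)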
The boolean-valued keywords stripedXs and stripedYs indicate to the constructor whether the values and corresponding $1\sigma$ uncertainties are given in alternating order, or whether the first ydim columns hold the values and the following ydim columns hold the corresponding uncertainties. In addition, xerrs=true indicates that the x-values also carry uncertainties. Essentially all functions which can be called on other data containers such as DataSet have been specialized to also work with CompositeDataSets.
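To make the splitting into per-component containers more concrete, the following plain-Julia sketch shows which rows each y-component would keep for the time series used in the example. It does not use the package and is not a description of its internals; the helper name `observed` is hypothetical and introduced purely for illustration.

```julia
t  = [1, 2, 3, 4]
y₁ = [2.5, 6, missing, 9]
y₂ = [missing, 5, 3.1, 1.4]

# Hypothetical helper: keep only the rows at which the component y was recorded.
function observed(t, y)
    keep = findall(!ismissing, y)          # indices of the non-missing entries
    return t[keep], collect(skipmissing(y))
end

t₁, v₁ = observed(t, y₁)   # ([1, 2, 4], [2.5, 6.0, 9.0])
t₂, v₂ = observed(t, y₂)   # ([2, 3, 4], [5.0, 3.1, 1.4])
```

Both components share the same kind of x-data (a single time variable), yet end up with different sets of time points, which is the situation the CompositeDataSet is designed for.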
InformationGeometry.GeneralizedDataSet - Type
GeneralizedDataSet(dist::ContinuousMultivariateDistribution, dims::Tuple{Int,Int,Int}=(length(dist), 1, 1))
A data structure which can take general x-y covariance into account, where dims=(Npoints, xdim, ydim) indicates the dimensionality of the data. dist should constitute a smooth distribution over the space $\mathcal{X}^N \times \mathcal{Y}^N$, where mean(dist) is interpreted as the concatenation of the (most likely values for the) observations $(x_1, \ldots, x_N, y_1, \ldots, y_N)$ and the width of dist specifies the uncertainty in the signal. Typically, dist is a multivariate Gaussian, but other distributions such as Cauchy or Student's t-distributions are also possible. Thus, arbitrary correlations between the dependent $y$-variables and the independent $x$-variables can be encoded.
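For a dist with finite covariance, such as a multivariate Gaussian, the ordering $(x_1, \ldots, x_N, y_1, \ldots, y_N)$ of the mean suggests reading the covariance in block form. The block labels $\Sigma_{xx}$, $\Sigma_{xy}$ and $\Sigma_{yy}$ are notation introduced here for illustration and do not appear in the package:

```latex
\operatorname{cov}(\mathrm{dist}) =
\begin{pmatrix}
\Sigma_{xx} & \Sigma_{xy} \\
\Sigma_{xy}^{\top} & \Sigma_{yy}
\end{pmatrix}
```

Here $\Sigma_{xx}$ and $\Sigma_{yy}$ describe the uncertainties within the $x$- and $y$-data respectively, while $\Sigma_{xy}$ collects the mixed x-y covariances.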
If there is no correlation between the $x$ and $y$ variables (i.e. if the off-diagonal blocks of cov(dist) are zero), it can be more performant to use the type DataSetExact to encode the given data instead.
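As a minimal sketch of a case where the off-diagonal blocks are not zero, the following builds a joint Gaussian for two points with xdim = ydim = 1 and passes it to the documented constructor. The numerical values are invented for illustration only.

```julia
using InformationGeometry, Distributions, LinearAlgebra

# Mean is the concatenation (x₁, x₂, y₁, y₂) for N = 2 points.
μ = [1.0, 2.0, 5.0, 6.0]

# Joint covariance: the upper-right and lower-left 2×2 blocks couple
# each xᵢ to the corresponding yᵢ (illustrative values).
Σ = [0.04  0.0   0.01  0.0
     0.0   0.04  0.0   0.01
     0.01  0.0   0.09  0.0
     0.0   0.01  0.0   0.09]

dist = MvNormal(μ, Σ)

# dims = (Npoints, xdim, ydim), as in the signature above.
GDS = GeneralizedDataSet(dist, (2, 1, 1))
```

If the 0.01 entries were set to zero, the same data could instead be encoded with DataSetExact, as noted above.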