Data Transformations

In general, data transformations change raw feature vectors into a representation that is more suitable for various estimators.

Standardization

Standardization of a dataset is a common requirement for many machine learning techniques. These techniques may perform poorly if the individual features do not look more or less like standard normally distributed data.

Standardization transforms data points into their corresponding standard scores by removing the mean and scaling to unit variance.

The standard score is the signed number of standard deviations by which the value of an observation or data point is above the mean value of what is being observed or measured.
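The definition above can be sketched by hand using only the Statistics standard library. This is an illustrative computation, not the package's implementation; it assumes each row is one feature and uses the sample standard deviation (the default of std), which matches the scale values shown in the examples of this section. The names mu, sigma and Z are chosen here for illustration.

```julia
using Statistics  # standard library: mean, std

# Same matrix as in the examples of this section; each row is one feature.
X = [0.0 -0.5 0.5; 0.0 1.0 2.0]

# Per-row mean and sample standard deviation, kept as 2×1 arrays for broadcasting.
mu = mean(X, dims=2)
sigma = std(X, dims=2)

# Standard score: signed number of standard deviations above the mean.
Z = (X .- mu) ./ sigma
```

Each row of Z then has zero mean and unit variance.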

Standardization can be performed using fit(ZScoreTransform, ...).

StatsBase.fit - Method
fit(ZScoreTransform, X; dims=nothing, center=true, scale=true)

Fit standardization parameters to vector or matrix X and return a ZScoreTransform transformation object.

Keyword arguments

  • dims: if 1 fit standardization parameters in column-wise fashion; if 2 fit in row-wise fashion. The default is nothing, which is equivalent to dims=2 with a deprecation warning.

  • center: if true (the default) center data so that its mean is zero.

  • scale: if true (the default) scale the data so that its variance is equal to one.

Examples

julia> using StatsBase

julia> X = [0.0 -0.5 0.5; 0.0 1.0 2.0]
2×3 Array{Float64,2}:
 0.0  -0.5  0.5
 0.0   1.0  2.0

julia> dt = fit(ZScoreTransform, X, dims=2)
ZScoreTransform{Float64}(2, 2, [0.0, 1.0], [0.5, 1.0])

julia> StatsBase.transform(dt, X)
2×3 Array{Float64,2}:
  0.0  -1.0  1.0
 -1.0   0.0  1.0
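The center and scale keywords can also be switched off individually. The following sketch reuses the matrix from the example above; the variable names are illustrative, and the comments describe the behavior expected from the keyword descriptions rather than verified output.

```julia
using StatsBase

X = [0.0 -0.5 0.5; 0.0 1.0 2.0]

# Center only: subtract each row's mean, leave the spread untouched.
dt_center = fit(ZScoreTransform, X, dims=2, scale=false)
Xc = StatsBase.transform(dt_center, X)

# Scale only: divide each row by its standard deviation, leave the mean in place.
dt_scale = fit(ZScoreTransform, X, dims=2, center=false)
Xs = StatsBase.transform(dt_scale, X)
```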

Unit range normalization

Unit range normalization is an alternative data transformation which scales features to lie in the interval [0, 1].
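As a rough sketch of what this rescaling amounts to, each feature can be shifted by its minimum and divided by its range. The code below is a hand-written illustration using only Base functions, not the package's implementation; it assumes each row is one feature, and the names lo, hi and U are chosen here for illustration.

```julia
# Same matrix as in the examples of this section; each row is one feature.
X = [0.0 -0.5 0.5; 0.0 1.0 2.0]

# Per-row minimum and maximum, kept as 2×1 arrays for broadcasting.
lo = minimum(X, dims=2)
hi = maximum(X, dims=2)

# Shift each row so its minimum is 0, then divide by its range so its maximum is 1.
U = (X .- lo) ./ (hi .- lo)
```

Note that in this sketch a constant feature has zero range, so the division yields NaN values.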

Unit range normalization can be performed using fit(UnitRangeTransform, ...).

StatsBase.fit - Method
fit(UnitRangeTransform, X; dims=nothing, unit=true)

Fit scaling parameters to vector or matrix X and return a UnitRangeTransform transformation object.

Keyword arguments

  • dims: if 1 fit scaling parameters in column-wise fashion; if 2 fit in row-wise fashion. The default is nothing.

  • unit: if true (the default) shift the data so that its minimum is zero.

Examples

julia> using StatsBase

julia> X = [0.0 -0.5 0.5; 0.0 1.0 2.0]
2×3 Array{Float64,2}:
 0.0  -0.5  0.5
 0.0   1.0  2.0

julia> dt = fit(UnitRangeTransform, X, dims=2)
UnitRangeTransform{Float64}(2, 2, true, [-0.5, 0.0], [1.0, 0.5])

julia> StatsBase.transform(dt, X)
2×3 Array{Float64,2}:
 0.5  0.0  1.0
 0.0  0.5  1.0

Additional methods

StatsBase.transform - Function
transform(t::AbstractDataTransform, x)

Return a standardized copy of vector or matrix x using transformation t.

StatsBase.transform! - Function
transform!(t::AbstractDataTransform, x)

Apply transformation t to vector or matrix x in place.

StatsBase.reconstruct - Function
reconstruct(t::AbstractDataTransform, y)

Return a reconstruction of the data on its original scale from a transformed vector or matrix y using transformation t.

StatsBase.reconstruct! - Function
reconstruct!(t::AbstractDataTransform, y)

Reconstruct the data on its original scale in place from a transformed vector or matrix y using transformation t.
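A short round trip may help show how these four methods fit together. The sketch below reuses the matrix from the earlier examples; the variable names are illustrative only.

```julia
using StatsBase

X = [0.0 -0.5 0.5; 0.0 1.0 2.0]
dt = fit(ZScoreTransform, X, dims=2)

# Copying variants: X itself is left unchanged.
Y = StatsBase.transform(dt, X)      # standardized copy of X
Xr = StatsBase.reconstruct(dt, Y)   # back on the original scale

# In-place variants overwrite their argument instead of allocating a copy.
W = copy(X)
StatsBase.transform!(dt, W)         # W now holds the standardized values
StatsBase.reconstruct!(dt, W)       # W is back on the original scale
```

Up to floating-point rounding, both Xr and W should equal X after the round trip.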

StatsBase.standardize - Function
standardize(DT, X; dims=nothing, kwargs...)

Return a standardized copy of vector or matrix X along dimensions dims using transformation DT, which is a subtype of AbstractDataTransform:

  • ZScoreTransform
  • UnitRangeTransform

Example

julia> using StatsBase

julia> standardize(ZScoreTransform, [0.0 -0.5 0.5; 0.0 1.0 2.0], dims=2)
2×3 Array{Float64,2}:
  0.0  -1.0  1.0
 -1.0   0.0  1.0

julia> standardize(UnitRangeTransform, [0.0 -0.5 0.5; 0.0 1.0 2.0], dims=2)
2×3 Array{Float64,2}:
 0.5  0.0  1.0
 0.0  0.5  1.0