# API/Reference

BiweightStatsModule
BiweightStats

A module for robust statistics based on the biweight transform.[1]

Biweight Statistics

The basis of the biweight transform is robust analysis, that is, statistics which are resilient to outliers while still efficiently representing a variety of underlying distributions. The biweight transform is based off the median and the median absolute deviation (MAD). The median is a robust estimator of location, and the MAD is a robust estimator of scale

$$$\mathrm{MAD}(X) = \mathrm{median}\left|X_i - \bar{X}\right|$$$

where $\bar{X}$ is the median.

The biweight transform improves upon these estimates by filtering out data beyond a critical cutoff. The analogy is doing a sigma-filter, but using these robust statistics instead of the standard deviation and mean.

$$$u_i = \frac{X_i - \bar{X}}{c \cdot \mathrm{MAD}}$$$$$$\forall i \quad\mathrm{where}\quad u_i^2 \le 1$$$

The cutoff factor, $c$, can be directly related to a Gaussian standard-deviation by multiplying by 1.4826[2]. So a typical value of $c=9$ means outliers further than $13.3\sigma$ are clipped (for residuals which are truly Gaussian-distributed). In addition, in BiweightStats, we also skip NaNs and Infs (but not missing or nothing).

References

Methods

BiweightStats.locationFunction
location(X; c=9, M=nothing)
location(X::AbstractArray; dims=:, kwargs...)

Calculate the biweight location, a robust measure of location.

$$$\hat{y} = \frac{\sum_{u_i^2 \le 1}{y_i(1 - u_i^2)^2}}{\sum_{u_i^2 \le 1}{(1 - u_i^2)^2}}$$$

Examples

julia> X = 10 .* randn(rng, 1000) .+ 50;

julia> location(X)
49.98008021468018

Iterative refinement

You can iteratively refine the location estimate by manually passing the median, like so-

X = 10 .* randn(rng, 1000) .+ 50
let ystar, ystar_old
ystar = ystar_old = location(X)
tol = 1e-6
maxiter = 10
for _ = 1:maxiter
ystar = location(X; M=ystar_old)
isapprox(ystar_old, ystar; atol=tol) && break
ystar_old = ystar
end
ystar
end

# output

49.991666155308145

References

1. NIST: biweight location
BiweightStats.scaleFunction
scale(X; c=9, M=nothing)
scale(X::AbstractArray; dims=:, kwargs...)

Compute the biweight scale of the variable. This is the same as the square-root of the midvariance.

$$$\hat{\sigma} = \frac{\sqrt{n\sum^n_{u_i^2 \le 1}{(y_i - \bar{y})^2(1 - u_i^2)^4}}}{\sum^n_{u_i^2 \le 1}{(1 - u_i^2)(1 - 5u_i^2)}}$$$

Examples

julia> X = 10 .* randn(rng, 1000) .+ 50;

julia> scale(X)
10.045813567765071

References

1. NIST: biweight scale

BiweightStats.midvarFunction
midvar(X; c=9, M=nothing)
midvar(X::AbstractArray; dims=:, kwargs...)

Compute the biweight midvariance of the variable.

$$$\hat{\sigma}^2 = \frac{n\sum^n_{u_i^2 \le 1}{(y_i - \bar{y})^2(1 - u_i^2)^4}}{\left[\sum_{u_i^2 \le 1}{(1 - u_i^2)(1 - 5u_i^2)}\right]^2}$$$

Examples

julia> X = 10 .* randn(rng, 1000) .+ 50;

julia> midvar(X)
100.9183702382928

References

1. NIST: biweight midvariance

BiweightStats.midcovFunction
midcov(X, [Y]; c=9)

Computes biweight midcovariance between the two vectors. If only one vector is provided the biweight midvariance will be calculated.

$$$\hat{\sigma}_{xy} = \frac{\sum_{u_i^2 \le 1,v_i^2 \le 1}{(x_i - \bar{x})(1 - u_i^2)^2(y_i - \bar{y})(1 - v_i^2)^2}}{\sum_{u_i^2 \le 1}{(1 - u_i^2)(1 - 5u_i^2)}\sum_{v_i^2 \le 1}{(1 - v_i^2)(1 - 5v_i^2)}}$$$
Warning

NaN and Inf cannot be removed in the covariance calculation, so if they are present the returned value will be NaN. To prevent this, consider imputing values for the non-finite data.

Examples

julia> X = 10 .* randn(rng, 1000, 2) .+ 50;

julia> midcov(X[:, 1], X[:, 2])
-1.058463590812247

julia> midcov(X[:, 1]) ≈ midvar(X[:, 1])
true

julia> X[3, 2] = NaN;

julia> midcov(X[:, 1], X[:, 2])
NaN

References

1. NIST: biweight midcovariance

midcov(X::AbstractMatrix; dims=1, c=9)

Computes the variance-covariance matrix using the biweight midcovariance. By default, each column is a separate variable, so an (M, N) matrix with dims=1 will create an (N, N) covariance matrix. If dims=2, though, each row will become a variable, leading to an (M, M) covariance matrix.

Warning

NaN and Inf cannot be removed in the covariance calculation, so if they are present the returned value will be NaN. To prevent this, consider imputing values for the non-finite data.

Examples

julia> X = 10 .* randn(rng, 1000, 3) .+ 50;

julia> C = midcov(X)
3×3 Matrix{Float64}:
100.918    -1.05846    -2.88515
-1.05846  94.702      -0.490742
-2.88515  -0.490742  100.699

julia> size(midcov(X; dims=2))
(1000, 1000)

References

1. NIST: biweight midcovariance

BiweightStats.midcorFunction
midcor(X, Y; c=9)

Compute the correlation between two variables using the midvariance and midcovariances.

$$$\frac{s_{xy}}{\sqrt{s_{xx} \cdot s_{yy}}}$$$

where $s_{xx},s_{yy}$ are the midvariances of each vector, and $s_{xy}$ is the midcovariance of the two vectors.

Examples

julia> X = 10 .* randn(rng, 1000, 2) .+ 50;

julia> midcor(X[:, 1], X[:, 2])
-0.010827077678217934

References

midcor(X::AbstractMatrix; dims=1, c=9)

Computes the correlation matrix using the biweight midcorrealtion. By default, each column of the matrix is a separate variable, so an (M, N) matrix with dims=1 will create an (N, N) correlation matrix. If dims=2, though, each row will become a variable, leading to an (M, M) correlation matrix. The diagonal will always be one.

Examples

julia> X = 10 .* randn(rng, 1000, 3) .+ 50;

julia> C = midcor(X)
3×3 Matrix{Float64}:
1.0        -0.0108271  -0.0286201
-0.0108271   1.0        -0.0050253
-0.0286201  -0.0050253   1.0

julia> size(midcor(X; dims=2))
(1000, 1000)

References

BiweightStats.BiweightTransformType
BiweightTransform(X; c=9, M=nothing)

Creates an iterator based on the biweight transform.[1] This iterator will first filter all input data so that only finite values remain. Then, the iteration will progress using a custom state, which includes a flag to indicate whether the value is within the cutoff, which is c times the median-absolute-deviation (MAD). The MAD is based on the deviation from M, which will default to the median of X if M is nothing.

This transform iterator is used for the internal calculations in BiweightStats.jl, which is why it has a somewhat complicated iterator implementation.

Examples

julia> X = randn(rng, 100);

julia> X[10] = 1e4 # add clear outlier
10000.0

julia> X[13] = NaN # add NaN
NaN

julia> X[25] = Inf # add Inf
Inf

julia> bt = BiweightTransform(X);


Lets confirm all the entries are finite. The iteration interface is divided into

(d, u2, flag), state = iterate(bt, [state])

where d is the data value minus M, u2 is (d / (c * MAD))^2, and flag is whether the value is within the transformed dataset.

julia> all(d -> isfinite(d[1]), bt)
true

and let's see how iteration differs between a normal sample and an outlier sample, which we manually inserted at index 10-

julia> (d, u2, flag), _ = iterate(bt, 9)
((-0.17093842061187192, 0.0009098761083851183, true), 10)

julia> (d, u2, flag), _ = iterate(bt, 10)
((0.0, 0.0, false), 11)

References