Metrics

Types

SequencerJ currently supports the four algorithms originally cited in the article and supported by the python version of the Sequencer. These are:

Distances.Euclidean — Type

Euclidean([thresh])

Create a euclidean metric.

When computing distances among large numbers of points, it can be much more efficient to exploit the formula

(x-y)^2 = x^2 - 2xy + y^2

However, this can introduce roundoff error. thresh (which defaults to 0) specifies the relative square-distance tolerance on 2xy compared to x^2 + y^2 to force recalculation of the distance using the more precise direct (elementwise-subtraction) formula.

Example:

julia> x = reshape([0.1, 0.3, -0.1], 3, 1);

julia> pairwise(Euclidean(), x, x)
1×1 Array{Float64,2}:
 7.45058e-9

julia> pairwise(Euclidean(1e-12), x, x)
1×1 Array{Float64,2}:
 0.0

Missing docstring.

Missing docstring for KLDivergence. Check Documenter's build log for details.

SequencerJ.EMD — Type

EMD

Earth Mover's Distance, a.k.a. 1-p Monge-Wasserstein distance. The algorithm as implemented here assumes balanced transport where both vectors have the same statistical mass (they both sum to 1). If they are not already, input vectors will be normalized to sum to 1.

SequencerJ.Energy — Type

Energy

Energy distance as defined by Székely. An explicit grid may be provided. Default is axes(u,1) and axes(v,1).

Wikipedia page on Energy distance

Each algorithm has associated to it a constant that represents a usable default type instance (with a default grid):

Constants

SequencerJ.L2 — Constant

Constant to use when specifying the Euclidean or L2 distance metric for a sequencing run. This sets the default threshold to the value of L2THR = 1e-7.

To set a different threshold level, use the type constructor Distances.Euclidean(<your value>) directly (i.e. instead of the L2 constant).

SequencerJ.KLD — Constant

Constant for specifiying the Kullbach-Leibler Divergence metric.

SequencerJ.WASS1D — Constant

Constant for specifiying the Monge-Wasserstein a.k.a. 1-p Wasserstein a.k.a. Earth Mover's Distance (EMD) metric. This metric is sensitive to the underlying grid. Default is unit grid taken from the axes of the data vector.

See [EMD]@ref

SequencerJ.ENERGY — Constant

Constant for specifiying the Energy distance as defined by Székely.

See Energy

SequencerJ.ALL_METRICS — Constant

Convenience constant to represent all of L2, WASS1D, KLD, and ENERGY.

Usage

Data Orientation

SequencerJ operates on pairwise columns of data in the given matrix (or vector of vectors). It builds an N x N distance matrix for further analysis, given an N-column input. It is therefore important to properly orient the data matrix so that M observations run downward along N columns, with each column representing a separate data set or series.

This can be performed efficiently with permutedims:

A = rand(10,5)
M,N = size(A)
A = M < N ? permutedims(A) : A

10×5 Array{Float64,2}:
 0.176648  0.78158   0.750312   0.956181  0.0589609
 0.448624  0.98266   0.0774158  0.514672  0.0699944
 0.418726  0.435687  0.832439   0.380973  0.232185
 0.490742  0.320006  0.11697    0.22096   0.673471
 0.660174  0.406789  0.0495403  0.141667  0.537355
 0.21999   0.91087   0.471607   0.274198  0.650757
 0.407952  0.502757  0.505704   0.374235  0.879488
 0.599885  0.153405  0.272207   0.870525  0.107887
 0.634633  0.364997  0.783727   0.335093  0.139262
 0.910134  0.376992  0.408011   0.652405  0.160124

While a 2-D data set such as a photograph may be shuffled and reassembled in either orientation, the Sequencer algorithm will work better with more rows and fewer columns (i.e. M > N).

Autoscaling

The Sequencer algorithm's processing time rises as N x N, posing a challenge for data exploration with larger data sets. It may require considerable experimentation with multiple scales and metrics before finding an optimum combination. To assist with finding an appropriate scale, SequencerJ offers an "autoscaling" option. This is, in fact, the default behavior with the keyword argument scales=nothing.

In autoscaling, SequencerJ first performs the sequencing function at several scales on a subset of columns. The scale producing the greatest elongation (see elong) is then used on the full dataset. This is a feature specific to the Julia port of the Sequencer algorithm.

At present, there is no "autometric" option to choose the best metric.

Metrics and Grids

Setting the metrics keyword option as

sequence(A, metrics=(L2, ENERGY))

means to use Distances.Euclidean() and SequencerJ.Energy().

With equal effect, you can write:

sequence(A, metrics=(Euclidean(), Energy())

You can specify a threshold for the Euclidean calculations. To use, e.g., 1e-7 means round to zero any distance calculation results smaller than 1e-7.

sequence(A, metrics=(Euclidean(1e-7), Energy())

To specify a non-unit data grid, set grid as a Vector{AbstractFloat} in the range [0.0-1.0) and use the Distance type, not the constant.

using SequencerJ #hide
A = rand(100,100);
M,N = size(A)
g = collect(0.0:0.5:N-1)
sequence(A, metrics=(Energy()), grid=g)

To these four metrics, the intention is to add spectral methods for complex (i.e. 2-D) inputs and any other Distances that make sense for sequencing 1-D or 2-D data. I would love community help with this!