Metrics

Types

SequencerJ currently supports the four algorithms originally cited in the article and supported by the python version of the Sequencer. These are:

Distances.EuclideanType
Euclidean([thresh])

Create a euclidean metric.

When computing distances among large numbers of points, it can be much more efficient to exploit the formula

(x-y)^2 = x^2 - 2xy + y^2

However, this can introduce roundoff error. thresh (which defaults to 0) specifies the relative square-distance tolerance on 2xy compared to x^2 + y^2 to force recalculation of the distance using the more precise direct (elementwise-subtraction) formula.

Example:

julia> x = reshape([0.1, 0.3, -0.1], 3, 1);

julia> pairwise(Euclidean(), x, x)
1×1 Array{Float64,2}:
 7.45058e-9

julia> pairwise(Euclidean(1e-12), x, x)
1×1 Array{Float64,2}:
 0.0
Missing docstring.

Missing docstring for KLDivergence. Check Documenter's build log for details.

SequencerJ.EMDType
EMD

Earth Mover's Distance, a.k.a. 1-p Monge-Wasserstein distance. The algorithm as implemented here assumes balanced transport where both vectors have the same statistical mass (they both sum to 1). If they are not already, input vectors will be normalized to sum to 1.

Each algorithm has associated to it a constant that represents a usable default type instance (with a default grid):

Constants

SequencerJ.L2Constant

Constant to use when specifying the Euclidean or L2 distance metric for a sequencing run. This sets the default threshold to the value of L2THR = 1e-7.

To set a different threshold level, use the type constructor Distances.Euclidean(<your value>) directly (i.e. instead of the L2 constant).

SequencerJ.KLDConstant

Constant for specifiying the Kullbach-Leibler Divergence metric.

SequencerJ.WASS1DConstant

Constant for specifiying the Monge-Wasserstein a.k.a. 1-p Wasserstein a.k.a. Earth Mover's Distance (EMD) metric. This metric is sensitive to the underlying grid. Default is unit grid taken from the axes of the data vector.

See [EMD]@ref

Usage

Data Orientation

SequencerJ operates on pairwise columns of data in the given matrix (or vector of vectors). It builds an N x N distance matrix for further analysis, given an N-column input. It is therefore important to properly orient the data matrix so that M observations run downward along N columns, with each column representing a separate data set or series.

This can be performed efficiently with permutedims:

A = rand(10,5)
M,N = size(A)
A = M < N ? permutedims(A) : A
10×5 Array{Float64,2}:
 0.176648  0.78158   0.750312   0.956181  0.0589609
 0.448624  0.98266   0.0774158  0.514672  0.0699944
 0.418726  0.435687  0.832439   0.380973  0.232185
 0.490742  0.320006  0.11697    0.22096   0.673471
 0.660174  0.406789  0.0495403  0.141667  0.537355
 0.21999   0.91087   0.471607   0.274198  0.650757
 0.407952  0.502757  0.505704   0.374235  0.879488
 0.599885  0.153405  0.272207   0.870525  0.107887
 0.634633  0.364997  0.783727   0.335093  0.139262
 0.910134  0.376992  0.408011   0.652405  0.160124

While a 2-D data set such as a photograph may be shuffled and reassembled in either orientation, the Sequencer algorithm will work better with more rows and fewer columns (i.e. M > N).

Autoscaling

The Sequencer algorithm's processing time rises as N x N, posing a challenge for data exploration with larger data sets. It may require considerable experimentation with multiple scales and metrics before finding an optimum combination. To assist with finding an appropriate scale, SequencerJ offers an "autoscaling" option. This is, in fact, the default behavior with the keyword argument scales=nothing.

In autoscaling, SequencerJ first performs the sequencing function at several scales on a subset of columns. The scale producing the greatest elongation (see elong) is then used on the full dataset. This is a feature specific to the Julia port of the Sequencer algorithm.

At present, there is no "autometric" option to choose the best metric.

Metrics and Grids

Setting the metrics keyword option as

sequence(A, metrics=(L2, ENERGY))

means to use Distances.Euclidean() and SequencerJ.Energy().

With equal effect, you can write:

sequence(A, metrics=(Euclidean(), Energy())

You can specify a threshold for the Euclidean calculations. To use, e.g., 1e-7 means round to zero any distance calculation results smaller than 1e-7.

sequence(A, metrics=(Euclidean(1e-7), Energy())

To specify a non-unit data grid, set grid as a Vector{AbstractFloat} in the range [0.0-1.0) and use the Distance type, not the constant.

using SequencerJ #hide
A = rand(100,100);
M,N = size(A)
g = collect(0.0:0.5:N-1)
sequence(A, metrics=(Energy()), grid=g)

To these four metrics, the intention is to add spectral methods for complex (i.e. 2-D) inputs and any other Distances that make sense for sequencing 1-D or 2-D data. I would love community help with this!