Metrics
Types
SequencerJ currently supports the four algorithms originally cited in the article and supported by the python
version of the Sequencer. These are:
Distances.Euclidean
— TypeEuclidean([thresh])
Create a euclidean metric.
When computing distances among large numbers of points, it can be much more efficient to exploit the formula
(x-y)^2 = x^2 - 2xy + y^2
However, this can introduce roundoff error. thresh
(which defaults to 0) specifies the relative square-distance tolerance on 2xy
compared to x^2 + y^2
to force recalculation of the distance using the more precise direct (elementwise-subtraction) formula.
Example:
julia> x = reshape([0.1, 0.3, -0.1], 3, 1);
julia> pairwise(Euclidean(), x, x)
1×1 Array{Float64,2}:
7.45058e-9
julia> pairwise(Euclidean(1e-12), x, x)
1×1 Array{Float64,2}:
0.0
Missing docstring for KLDivergence
. Check Documenter's build log for details.
SequencerJ.EMD
— TypeEMD
Earth Mover's Distance, a.k.a. 1-p Monge-Wasserstein distance. The algorithm as implemented here assumes balanced transport where both vectors have the same statistical mass (they both sum to 1). If they are not already, input vectors will be normalized to sum to 1.
SequencerJ.Energy
— TypeEnergy
Energy distance as defined by Székely. An explicit grid may be provided. Default is axes(u,1)
and axes(v,1)
.
Each algorithm has associated to it a constant that represents a usable default type instance (with a default grid):
Constants
SequencerJ.L2
— ConstantConstant to use when specifying the Euclidean or L2 distance metric for a sequencing run. This sets the default threshold to the value of L2THR
= 1e-7.
To set a different threshold level, use the type constructor Distances.Euclidean(<your value>)
directly (i.e. instead of the L2
constant).
SequencerJ.KLD
— ConstantConstant for specifiying the Kullbach-Leibler Divergence metric.
SequencerJ.WASS1D
— ConstantConstant for specifiying the Monge-Wasserstein a.k.a. 1-p Wasserstein a.k.a. Earth Mover's Distance (EMD) metric. This metric is sensitive to the underlying grid. Default is unit grid taken from the axes of the data vector.
See [EMD]@ref
SequencerJ.ENERGY
— ConstantConstant for specifiying the Energy distance as defined by Székely.
See Energy
SequencerJ.ALL_METRICS
— ConstantConvenience constant to represent all of L2, WASS1D, KLD, and ENERGY.
Usage
Data Orientation
SequencerJ operates on pairwise columns of data in the given matrix (or vector of vectors). It builds an N x N distance matrix for further analysis, given an N-column input. It is therefore important to properly orient the data matrix so that M observations run downward along N columns, with each column representing a separate data set or series.
This can be performed efficiently with permutedims
:
A = rand(10,5)
M,N = size(A)
A = M < N ? permutedims(A) : A
10×5 Array{Float64,2}: 0.176648 0.78158 0.750312 0.956181 0.0589609 0.448624 0.98266 0.0774158 0.514672 0.0699944 0.418726 0.435687 0.832439 0.380973 0.232185 0.490742 0.320006 0.11697 0.22096 0.673471 0.660174 0.406789 0.0495403 0.141667 0.537355 0.21999 0.91087 0.471607 0.274198 0.650757 0.407952 0.502757 0.505704 0.374235 0.879488 0.599885 0.153405 0.272207 0.870525 0.107887 0.634633 0.364997 0.783727 0.335093 0.139262 0.910134 0.376992 0.408011 0.652405 0.160124
While a 2-D data set such as a photograph may be shuffled and reassembled in either orientation, the Sequencer algorithm will work better with more rows and fewer columns (i.e. M > N).
Autoscaling
The Sequencer algorithm's processing time rises as N x N, posing a challenge for data exploration with larger data sets. It may require considerable experimentation with multiple scales and metrics before finding an optimum combination. To assist with finding an appropriate scale, SequencerJ offers an "autoscaling" option. This is, in fact, the default behavior with the keyword argument scales=nothing
.
In autoscaling, SequencerJ first performs the sequencing function at several scales on a subset of columns. The scale producing the greatest elongation (see elong
) is then used on the full dataset. This is a feature specific to the Julia port of the Sequencer algorithm.
At present, there is no "autometric" option to choose the best metric.
Metrics and Grids
Setting the metrics keyword option as
sequence(A, metrics=(L2, ENERGY))
means to use Distances.Euclidean()
and SequencerJ.Energy()
.
With equal effect, you can write:
sequence(A, metrics=(Euclidean(), Energy())
You can specify a threshold for the Euclidean
calculations. To use, e.g., 1e-7
means round to zero any distance calculation results smaller than 1e-7
.
sequence(A, metrics=(Euclidean(1e-7), Energy())
To specify a non-unit data grid, set grid
as a Vector{AbstractFloat}
in the range [0.0-1.0) and use the Distance type, not the constant.
using SequencerJ #hide
A = rand(100,100);
M,N = size(A)
g = collect(0.0:0.5:N-1)
sequence(A, metrics=(Energy()), grid=g)
To these four metrics, the intention is to add spectral methods for complex (i.e. 2-D) inputs and any other Distances that make sense for sequencing 1-D or 2-D data. I would love community help with this!