SequencerJ.jl

Installation

SequencerJ is not yet a registered package, but installing it requires only a few more keystrokes than a registered one:


`] add https://github.com/turingtest37/SequencerJ.jl/`

The `]` key switches the REPL into package manager mode, where the `add` command accepts a URL. The Backspace (delete) key returns you to the normal REPL prompt.

Alternatively you can stay in the REPL:

julia> using Pkg; Pkg.add(PackageSpec(url="https://github.com/turingtest37/SequencerJ.jl/"));
    Cloning git-repo `https://github.com/turingtest37/SequencerJ.jl`
   Updating git-repo `https://github.com/turingtest37/SequencerJ.jl`
  Resolving package versions...
Updating `~/.julia/packages/SequencerJ/CnHKe/docs/Project.toml`
  [348581b9] ~ SequencerJ v1.0.0 `~/.julia/packages/SequencerJ/CnHKe` ⇒ v1.0.0 `https://github.com/turingtest37/SequencerJ.jl/#master`
Updating `~/.julia/packages/SequencerJ/CnHKe/docs/Manifest.toml`
  [348581b9] ~ SequencerJ v1.0.0 `~/.julia/packages/SequencerJ/CnHKe` ⇒ v1.0.0 `https://github.com/turingtest37/SequencerJ.jl/#master`

You may see WARNING messages during compilation. You can safely ignore them for most purposes, but if you are developing SequencerJ locally and use the Revise package, note that you may have to restart your Julia session more often than usual.

Using SequencerJ

Getting started with SequencerJ is straightforward. First, we need to wrangle some data to analyze. For a quick (but unpromising) start, let's use a random array.

julia> using SequencerJ

julia> A = rand(100,50);

julia> r = sequence(A);
┌ Info: Sequencing data with
│     shape: (100, 50)
│     metric(s): (Euclidean(1.0e-7), EMD(nothing), KLDivergence(), Energy(nothing))
└     scale(s): nothing
[ Info: After autoscaling, best scale = 1
[ Info: Euclidean(1.0e-7) at scale 1: η = 3.834 (0.0027s)
[ Info: EMD(nothing) at scale 1: η = 13.05 (0.12s)
[ Info: KLDivergence() at scale 1: η = 3.4 (0.0093s)
[ Info: Energy(nothing) at scale 1: η = 11.75 (0.17s)
[ Info: Final: η = 6.965 sequence = 11,49,42,39,29...31,13,32,10,16

The sequence function is the primary interface to SequencerJ. Only the input matrix or vector A is required. All other options take their defaults: scales=nothing enables autoscaling to find the best row segmentation; metrics=ALL_METRICS enables all supported metrics; grid=nothing forces automatic grid creation. See sequence
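With the defaults named above written out explicitly, the call is equivalent to the following sketch (keyword names as described in this section):

```julia
using SequencerJ

A = rand(100, 50)

# Equivalent to sequence(A): every keyword at its documented default.
r = sequence(A;
    scales = nothing,       # autoscale to find the best row segmentation
    metrics = ALL_METRICS,  # apply all supported metrics
    grid = nothing)         # create the grid automatically
```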

Data and statistics from a sequencing run are encapsulated in a SequencerResult.

julia> r
Sequencer Result: η = 6.965, order = 11,49,42,39,29...31,13,32,10,16

Use SequencerJ's accessor functions to get the details of a run. The result data sequence (1-based column indices, not 0-based row indices as in Sequencer for python) can be obtained with the order function.

julia> order(r)
50-element Array{Int64,1}:
 11
 49
 42
 39
 29
 27
  7
 20
 25
  2
  ⋮
 12
  6
 33
 21
 31
 13
 32
 10
 16

A key feature of the Sequencer algorithm is the calculated elongation of the minimum spanning tree that describes the best fit sequence. See elong

julia> elong(r)
6.9648

A Better Example

Random data looks pretty boring from every direction, so let's apply the Sequencer to something more enjoyable: trying to put back together an image whose columns (of pixels) have been shuffled. This trick will even impress your mom!

The resources folder contains test images you can use to play with SequencerJ.

using SequencerJ
using Random
using Images

img = load(joinpath(@__DIR__, "..", "..","resources","bread.jpeg"))

Color is nice but grayscale is easier.

imgg = Gray.(img)

Let's slice up this loaf of bread with a different kind of knife...

A = convert(Matrix{Float32}, imgg);
shuff_idx = shuffle(axes(A,2));
prettyp(shuff_idx, 3)

Reordering the image with a shuffled column index creates a lot of bread crumbs!

As = A[:, shuff_idx]
colorview(Gray, As)

Now let's put back together the pieces. We run SequencerJ against the shuffled matrix, and gather the best sequence as a column index.

seqres = sequence(As, metrics=(KLD, L2))
bestguessidx = SequencerJ.order(seqres)

Drum roll, please!

colorview(Gray, As[:, bestguessidx])

Oops, the image is mirrored. That's easy to fix:

colorview(Gray, As[:, reverse(bestguessidx)])

Voilà!

Failure is an option

The Sequencer sometimes fails to reassemble images completely. Usually this is because an image is too small along the row dimension and therefore does not contain enough data to allow the algorithm to discriminate correctly between similar columns of data. To ensure best performance, images should be presented in "portrait" mode, with more rows than columns.
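One way to guard against this (a hypothetical helper, not part of the SequencerJ API) is to check the orientation before sequencing and transpose landscape data. Note that transposing changes what gets sequenced, since rows become columns, so only do this when that is the intent:

```julia
# Hypothetical helper: ensure the matrix has at least as many rows as
# columns ("portrait" orientation) before sequencing.
function ensure_portrait(A::AbstractMatrix)
    size(A, 1) >= size(A, 2) ? A : permutedims(A)
end

B = ensure_portrait(rand(50, 200))
size(B)  # more rows than columns
```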

See a full example of portrait mode here: Data in Portrait Mode

Metrics

To make sense of A, we must choose which statistical distance metrics to apply. Each metric compares A pairwise, column by column, to create a distance matrix. The matrix is analyzed using graph theory to identify an optimal ordering of columns in A.

This optimal sequence represents the least-cost path through the distance matrix and hence the closest affinities between columns, as proximity is the inverse of distance. See The Sequencer Algorithm
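As an illustration of the pairwise step (a sketch using the Distances.jl package, whose metric names SequencerJ's printed output mirrors; this is not SequencerJ's internal code), a column-by-column distance matrix can be built like this:

```julia
using Distances

A = rand(100, 50)
# Compare every pair of columns (dims=2) under one metric;
# the result is a symmetric 50×50 matrix of column-to-column distances.
D = pairwise(Euclidean(), A, dims=2)
size(D)  # (50, 50)
```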

SequencerJ currently supports four metrics: L2 (Euclidean distance), EMD (Earth Mover's Distance), KLD (Kullback-Leibler divergence), and ENERGY (energy distance).

With no other arguments, sequence applies all algorithms it knows to the data (metrics = ALL_METRICS). Distance algorithms may be specified individually or in groups with the metrics keyword.

r = sequence(A, scales=(1,3), metrics=(L2,ENERGY))

To specify a single metric, write it as a 1-tuple:

r = sequence(A, scales=(1,3), metrics=(L2,))

Scales

The default scale is chosen by an "autoscaling" algorithm that samples the columns, runs the Sequencer on this subset at several scales (currently, the Fibonacci sequence), and compares them. The scale producing the greatest elongation is chosen to run on the full data set. You may force autoscaling with (scales=nothing).

sequence(A, scales=nothing)

Scale means the number of parts into which the data is partitioned. Each section or chunk contains approximately size(A,1)/scale elements. For example, 100 rows at scale 3 will result in chunks of 33, 33, and 34 rows. Set scales manually using the scales keyword:
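As a quick sanity check of the arithmetic above (illustrative only; SequencerJ's internal partitioning may order the chunks differently):

```julia
n, scale = 100, 3
base, extra = divrem(n, scale)   # 33 rows per chunk, 1 row left over
# Distribute the leftover rows one per chunk until they run out.
sizes = [base + (i <= extra ? 1 : 0) for i in 1:scale]
sum(sizes) == n   # true: chunks of 34, 33, and 33 rows cover all 100
```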

r = sequence(A, scales=(1,3))

To choose only one scale, write it as a 1-tuple:

r = sequence(A, scales=(4,))

Output

Use accessor functions to get details from the SequencerResult.


julia> A = rand(2000, 100);

julia> r = sequence(A);
┌ Info: Sequencing data with
│     shape: (2000, 100)
│     metric(s): (Euclidean(1.0e-7), EMD(nothing), KLDivergence(), Energy(nothing))
└     scale(s): nothing
[ Info: After autoscaling, best scale = 144
[ Info: Euclidean(1.0e-7) at scale 144: η = 4.011 (1.5s)
[ Info: EMD(nothing) at scale 144: η = 4.712 (16s)
[ Info: KLDivergence() at scale 144: η = 2.997 (2.1s)
[ Info: Energy(nothing) at scale 144: η = 3.08 (16s)
[ Info: Final: η = 3.664 sequence = 75,68,53,43,27...59,15,87,14,89

Best column sequence: order

julia> order(r)
100-element Array{Int64,1}:
 75
 68
 53
 43
 27
 90
 45
 19
 13
 82
  ⋮
 67
  3
  7
  5
 59
 15
 87
 14
 89

Final elongation: elong

julia> elong(r)
3.664

Final minimum spanning tree: mst

julia> mst(r)
{100, 99} undirected simple Int64 graph with Float64 weights
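The returned graph prints like a LightGraphs.jl-compatible weighted graph, so (assuming LightGraphs.jl is the graph backend, as the printed form suggests) the usual accessors apply:

```julia
using LightGraphs

g = mst(r)
nv(g)  # number of vertices, one per column of A
ne(g)  # number of edges; a spanning tree has nv(g) - 1 of them
```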

Final distance matrix: D

julia> D(r)
100×100 SparseArrays.SparseMatrixCSC{Float64,Int64} with 10000 stored entries:
  [1  ,   1]  =  1.0e-6
  [2  ,   1]  =  Inf
  [3  ,   1]  =  Inf
  [4  ,   1]  =  Inf
  [5  ,   1]  =  Inf
  [6  ,   1]  =  1.20236
  [7  ,   1]  =  Inf
  [8  ,   1]  =  Inf
  [9  ,   1]  =  Inf
  ⋮
  [91 , 100]  =  Inf
  [92 , 100]  =  Inf
  [93 , 100]  =  Inf
  [94 , 100]  =  Inf
  [95 , 100]  =  Inf
  [96 , 100]  =  Inf
  [97 , 100]  =  1.20334
  [98 , 100]  =  Inf
  [99 , 100]  =  Inf
  [100, 100]  =  1.0e-6

Per-segment intermediate results:

julia> using Base.Iterators

julia> segDict = r.EOSeg;

julia> for (k,l) in Iterators.take(keys(segDict), 3)
           η, mst = segDict[(k,l)]
       end

Each value in the dictionary is a tuple of the elongation η and the minimum spanning tree computed for that key, as the destructuring in the loop above shows.

In a similar fashion, get final elongations and the MST for each metric+scale:

julia> rk = r.EOAlgScale
Dict{Tuple,Any} with 4 entries:
  (KLDivergence(), 144)    => (2.9968, {100, 99} directed simple Int64 graph)
  (EMD(nothing), 144)      => (4.7116, {100, 99} directed simple Int64 graph)
  (Euclidean(1.0e-7), 144) => (4.0112, {100, 99} directed simple Int64 graph)
  (Energy(nothing), 144)   => (3.08, {100, 99} directed simple Int64 graph)

The Sequencer Algorithm

The algorithm is described in detail in the paper available on arXiv:

@misc{baron2020extracting,
  title={Extracting the main trend in a dataset: the Sequencer algorithm},
  author={Dalya Baron and Brice Ménard},
  year={2020},
  eprint={2006.13948},
  archivePrefix={arXiv},
  primaryClass={cs.LG}
}

As another example of how the Sequencer algorithm may be used, see Mapping Earth's deepest secrets.