Interface for IB clustering computation. Input data 'x' must be 1D. β controls the amount of clustering : the smaller the beta, the greater the clustering (and information loss). Can be initalized directly with the data via IB(x), or if you already have the histograms/probability of co-occurences of 'x' and 'y' via IB(pxy). To use the deterministic IB algorithm, set algorithm to "DIB" like IB(x, "DIB").

A class whose instances contain all usefull informations about a given motif

found in a categorical time-series.

scans through the qt probabilities and merges clusters where values are too low.

Estimates the probability distribution of the categories in 'x' and computes its entropy. It characterizes the amount of information carried by the distribution. it tells how "disperse" the distribution is.

IB_optimize(model::IB; abs_thres = 10^-3, rel_thres = 0.01, ct_thres = 5, n_conv = 1000)

Optimizes IB model of the data. Keywords arguments are : - absthres : threshold for convergence in absoulte change of the optimization function (lagrangian) L. - relthres : threshold for convergence in relative change of L. - ctthres : number of steps where L has to be < to absthres or relthres to declare convergence. - nconv : limit of the number of iterations for the optimization algorithm. - scan (bool): when using 'DIB' algorithm, checks and merges clusters if it leads to reduction of cost function. - report : wether or not to say when and how convergence was reached, and indicates which cluster merged.

Computes the KL divergence for a collection of distributions in matrix form.

Given P of shape MxN and Q of shape MxL, returns a matrix of shape NxL.


Used in the computation of cramer's V and cohen's K. Returns the lagged bivariate probability of two given categories, Pij. Given i and j two categories, and l a lag, Pij is the probability to have the category j at time t + l, if we have i at time t. inputs : - time series : the data to analyse - lag : the value of lag at which one wants Pij - category1 : the first category - category2 : the second category output : - Pij : the probability to observe j at t + l if we observe i at time t.


If no categories are explicitely provided, will a CxC symmetric probability matrix, with C the number of != categories. the element [i,j] of this matrix corresponds to Pij at lag 'Lag'. Use this function if you want to have the contingency table at lag 'Lag'.

S_matrix(ts, window)

Returns S matrix containing all segments of length 'window' in 'ts'. Done with a sliding window.

apply_mapping(ts, mapping)

Transforms the input time-series 'ts' according to the mapping 'mapping'. Typically, 'mapping' would be the output of the 'get_mappings' function, although you can provide your own. 'mapping' should be a Dict specifying each value and what it should be mappied to. Example: {"A" : 0.54, "G" : 0.62, "T" : -0.57, "C" : 0.0} Returns an array which correspond to the mapped series according to 'mapping'

bootstrap_CI(Series, lags, coefficient, n_iter = 1000)

Provides 95% a confidence interval by shuffling 'Series' 'n_iter' times, computing the values of 'coefficient' then finding the value of the top and bottom 2,5% to get the interval. This is done for every point in 'lags' (can be costly if 'Series' is long).

Input : - Series : input array of categorical data - lags (Array{Int64,1}) : the lag values at which the analysis is conducted - coeffunc (function) : the function for which the CI needs to be computed. 'coefficient' can be one of the following functions : 'cramercoefficient, cohencoefficient, theilsU'. - niter (Int64) : how many iterations are run for the bootstrap procedure. - intervalsize (Float64) : the size of the desired confidence interval in percent. Defaults to 0.95 (i.e 95% confidence interval). returns : - top (Array{Float64,1}) : Array of values for the upper limit of the CI. - bottom (Array{Float64,1}) : Array of values for the lower limit of the CI.

brute_optimize!(m::IB, findbest = true; report = true)

Performs bute force optimization of IB model, looking at all possible cluster combinations and choosing the merges that reduce the cost function.


Computes the lagged cohen coefficient κ describing how much the different categories are correlated to each others at different lag values in the time-series. K is a measure of agreement. input: -Serie : a categorical time series -lag: an integer lag value output: -κ : cohen's coefficient

collision_matrix(ts, w, d)

Constructs and returns the collision matrix of time-series 'ts'. inputs (Int): ts : input time-series w : motif size (window size) d : # of allowed errors between motifs e : exclusion zone to get rid of trivial matches. t : length of projections after applying mask. Defaults to w - d. iters : the number of cycles used to construct the collision matrix. returns (Int): C : collision matrix


Computing the conditional entropy of y given x: H(Y|X), Given in bits or shannons (meaning using the log2 for the measurement). Wikipedia:

cramer_coefficient(Serie, Lags::Array{Int64,1})

Returns the cramer's V coefficient for the given 'Lags' values.

Cramer's v is used in categorical data analysis to test the degree of association between a given set of categorical variable and another set of variables (example : eye color and hair color). Here, it measures the association between the categorical values of a time series, and the values at a later lagged time.

V lies in [0,1]. 0 no association, 1 perfect association. It is symmetric, meaning v(A,B) = v(B,A), so the information contained in the asymmetry between the variables is lost.

input :
- Serie : the time series containing the data
- Lags : Array containing the lags at which V will be computed. If an int is given
            a single point value will be returned.

output :
- V : the values of v for the given lags.
detect_motifs(ts, w, d, t = w - d; iters = 1000, tolerance = 0.7)

Detects all motifs of length 'w' occuring more often than chance, being identical up to 'd' differences. Input:

w : length of motifs to look for
d : allowed errors (differences) between motifs
t : size of the masks to use for random projection in the detection
iters : the numbers of iterations for the random projection process (defaults to 1000)
tolerance : threshold of motif identification.

returns : motifs : a list of motif instances containing all usefull informations about the found motifs (positions, frequency, shapes ...)

Computes the entropy of a distribution. If distribution is of shape (MxN),

treats each column as an independant distribution and returns an 1xN entropy vector.

match_probability(ts, w, d)

Estimates and returns the expectation value of motif-pair of length 'w' matching up to 'd' errors in the provided time-series 'ts'.

find_motifs(ts, w, d)

Given a motif of shape 'shape' (array{any,1}), look for all the repetitions of it which differ only up to 'd' differences. Input:

ts : time-series in which to look for motifs
shape : shape (aray{any,1}) of the motif to look for.
d : allowed errors (differences) between motifs

returns : motif : an instance of 'pattern' containing the found repetition of the input 'shape'.

 get_mappings(data, freq; m = 3)

Returns the optimal mappings corresponding to the frequency 'freq'. Prints the position and strengh of peak at 'f' for control purposes. input : - data : input categorical time series - freq : the frequency where one wants the optimal mappings - m : smoothing parameters (see 'spectralenvelope') returns : - mappings : the mapping at the peak associated with the desired frequency. example: # we have some data, plotting the spectral envelope, we see a peak at 0.33 m = getmappings(data, 0.33)


Returns all possible masks of length d among the possible w positions. positive masks: the indexes represent the columns that are KEPT. w : motif length t : length of masked motifs


Returns joint probability of x and y from normalized 2D histogram. Automatically replaces 0 by 10^-16 to prevent Inf and NaN values during KL computation.

get_y(x, algorithm = "nn")

Returns the corresponding context vector y to the data vector x. Several types of context are possible, and are selected by the type argument. type = "nn" is the next neighbor to every point of x. type = "an" stands for "adjacent neighbor", or the previous and next element of every element of x.

init_values(x, y)

Initializes and return clustering porbability matrices qt_x, q and qy_t. 'qtx' refers to the probability of a category in x to be mapped to an element of t. 'q' is the overall probability of an element of t to occur. 'qyt' is the probability that an element of t corresponds to a given element of y.

least_occurence_threshold(w, d, p; confidence = 0.95)

Computes the least number of occurences that a motif has to have in order not to be considered a product of chance. input: ts : input time-series w : motif size (window size). d : # of allowed errors between motifs. confidence (optional) : the level of desired confidence. If set to 0.5, the expected number of motifs chains matching 'matchnumber' times by chance will be 0.5. (defaults to 0.5) Returns: matchnumber : match number by chance allowed by 'confidence'.


Maps an input array of any type (string, float, any...) to an array of Int usable by the algorithm. The structure of the array is preserved as the mapping is one-to-one.

masked_match(w, d, t)

Returns the probability of two random masked generated motifs of length 't' to match despite 'd' error. 'w' is the size of the full motifs. !! this is a probability PER ITERATION !! inputs (Int): w : motif size (window size). d : # of allowed errors between motifs. t : length of projections after applying mask. Defaults to w - d.

periodogram_matrix(ts::Array{Float64,2}, demean = false; m=2)

Computes the k x k periodogram matrix (k number of categories). If 'average' true, segments data and computes the average periodogram matrix over all segments. m is the smoothing parameter.

plot_motif(motif, ts; subplots = false)

Plots all instances of the given motif. If the corresponding time series 'ts' is provided, plots them on top of it, preserving the time-orderding. Otherwise, plots all instances of 'motif' on top of each other to facilitate their comparison. if subplots is given false, will only plot the motifs on top of the time-series.

print_results(m::IB, disp_thres = 0.1)

Displays the clustering of data for an optimized input model 'm'.

Since the IB algorithm is not deterministic, some probabilities in the clustering matrix 'm.qtx' are non-zero. For ease of readability, every probability under 'dispthres' is displayed as 0, everything else as 1. The result is a 2D matrix where the rows represent the clusters and the column the initial categories. The 1 tell to which cluster which category belongs. If several 1 are present for a single category, it means that the optimization wasn't able to assign a single cluster to this category for the chosen β value. You might try a diffent β, or use the DIB variant of the algorithm.


A way to test the stationarity of the input categorical time-series 'Series'. if it varies linearly, then the time series is more or less stationary.


Returns the relative frequency of a given category (of type ::Any). For example, if we have 3 category a,b and c, and a occurs 300 times in a time series of length 900, its relative frequency is 1/3.

replace_zeros(arr::Array{Float64,2}, cut = 10^-305)

Replaces zeros (values below 'cut') by 'cut' to avoid exponential term causing NaN at high beta values.

search_optima!(m::IB, n_iter = 10000)

The original IB algorithm can converge to a local optima. To find the global optima, searchoptima! performs 'niter' optimization of input model 'm', randomly re-initializing it at every iteration. The best optimization is kept and 'm' is modified in-place.


Computes the spectral envelope of the given time-series. input: - ts: the time series to analyse - m : the degree of smoothing wished. m corresponds to the number of neighbooring points that are mixed with given point to realize the smoothing. output: - Frequencies : list of points corresponding to the involved frequencies. contained in [0,0.5] - amplitude : values of the spectral envelope for each given frequency point. - eigenvectors : the optimal scaling for the different categories, for each frequency point. - categories : the list of categories present in the data.

theils_u(x, Lags)

Measures what portion of the information associated to the values in 'x' is know at t + lag if the value 'x' is known at t. Input : x : a categorical time-series lags : an array of lags onto U is computed. returns: U : an array of U for the given values in 'Lags'.

update_c_matrix!(collision_matrix, masked_S)

Updates the collision matrix by adding 1 in the row where repetitions of a motif are found. The row are the index of the repetitions, the column represent the first found motif of this type. An exclusion zone is applied to tackle trivial neighbours motifs.