Docstrings · CategoricalTimeSeries.jl

CategoricalTimeSeries.IB — Type

Interface for IB clustering computation. Input data 'x' must be 1D. β controls the amount of clustering : the smaller the beta, the greater the clustering (and information loss). Can be initalized directly with the data via IB(x), or if you already have the histograms/probability of co-occurences of 'x' and 'y' via IB(pxy). To use the deterministic IB algorithm, set algorithm to "DIB" like IB(x, "DIB").

CategoricalTimeSeries.pattern — Type

A class whose instances contain all usefull informations about a given motif

found in a categorical time-series.

Base.merge! — Function

scans through the qt probabilities and merges clusters where values are too low.

CategoricalTimeSeries.H — Method

H(x)

Estimates the probability distribution of the categories in 'x' and computes its entropy. It characterizes the amount of information carried by the distribution. it tells how "disperse" the distribution is.

CategoricalTimeSeries.IB_optimize! — Method

IB_optimize(model::IB; abs_thres = 10^-3, rel_thres = 0.01, ct_thres = 5, n_conv = 1000)

Optimizes IB model of the data. Keywords arguments are : - absthres : threshold for convergence in absoulte change of the optimization function (lagrangian) L. - relthres : threshold for convergence in relative change of L. - ctthres : number of steps where L has to be < to absthres or relthres to declare convergence. - nconv : limit of the number of iterations for the optimization algorithm. - scan (bool): when using 'DIB' algorithm, checks and merges clusters if it leads to reduction of cost function. - report : wether or not to say when and how convergence was reached, and indicates which cluster merged.

CategoricalTimeSeries.IB_step! — Method

Performs one iteration of the IB algorithm.

CategoricalTimeSeries.KL — Method

Computes the KL divergence for a collection of distributions in matrix form.

Given P of shape MxN and Q of shape MxL, returns a matrix of shape NxL.

CategoricalTimeSeries.KL_single — Method

Computes the KL divergence for two 1D probability distributions.

CategoricalTimeSeries.LaggedBivariateProbability — Method

Used in the computation of cramer's V and cohen's K. Returns the lagged bivariate probability of two given categories, Pij. Given i and j two categories, and l a lag, Pij is the probability to have the category j at time t + l, if we have i at time t. inputs : - time series : the data to analyse - lag : the value of lag at which one wants Pij - category1 : the first category - category2 : the second category output : - Pij : the probability to observe j at t + l if we observe i at time t.

CategoricalTimeSeries.LaggedBivariateProbability — Method

If no categories are explicitely provided, will a CxC symmetric probability matrix, with C the number of != categories. the element [i,j] of this matrix corresponds to Pij at lag 'Lag'. Use this function if you want to have the contingency table at lag 'Lag'.

CategoricalTimeSeries.LaggedBivariateProbability — Method

If an array of lags 'Lag' is provided instead of a single value, computes Pij for each lag value and returns an array.

CategoricalTimeSeries.S_matrix — Method

S_matrix(ts, window)

Returns S matrix containing all segments of length 'window' in 'ts'. Done with a sliding window.

CategoricalTimeSeries.apply_mapping — Method

apply_mapping(ts, mapping)

Transforms the input time-series 'ts' according to the mapping 'mapping'. Typically, 'mapping' would be the output of the 'get_mappings' function, although you can provide your own. 'mapping' should be a Dict specifying each value and what it should be mappied to. Example: {"A" : 0.54, "G" : 0.62, "T" : -0.57, "C" : 0.0} Returns an array which correspond to the mapped series according to 'mapping'

CategoricalTimeSeries.bootstrap_CI — Function

bootstrap_CI(Series, lags, coefficient, n_iter = 1000)

Provides 95% a confidence interval by shuffling 'Series' 'n_iter' times, computing the values of 'coefficient' then finding the value of the top and bottom 2,5% to get the interval. This is done for every point in 'lags' (can be costly if 'Series' is long).

Input : - Series : input array of categorical data - lags (Array{Int64,1}) : the lag values at which the analysis is conducted - coeffunc (function) : the function for which the CI needs to be computed. 'coefficient' can be one of the following functions : 'cramercoefficient, cohencoefficient, theilsU'. - niter (Int64) : how many iterations are run for the bootstrap procedure. - intervalsize (Float64) : the size of the desired confidence interval in percent. Defaults to 0.95 (i.e 95% confidence interval). returns : - top (Array{Float64,1}) : Array of values for the upper limit of the CI. - bottom (Array{Float64,1}) : Array of values for the lower limit of the CI.

CategoricalTimeSeries.brute_optimize! — Function

brute_optimize!(m::IB, findbest = true; report = true)

Performs bute force optimization of IB model, looking at all possible cluster combinations and choosing the merges that reduce the cost function.

CategoricalTimeSeries.check_conv — Method

Checks if absolute or relative convergence criterion on L function is satisfied.

CategoricalTimeSeries.cohen_coefficient — Function

Computes the lagged cohen coefficient κ describing how much the different categories are correlated to each others at different lag values in the time-series. K is a measure of agreement. input: -Serie : a categorical time series -lag: an integer lag value output: -κ : cohen's coefficient

CategoricalTimeSeries.cohen_coefficient — Method

If 'lag' is provided as an array of lags, computes and returns an array of K for each elements of 'lag'.

CategoricalTimeSeries.collision_matrix — Function

collision_matrix(ts, w, d)

Constructs and returns the collision matrix of time-series 'ts'. inputs (Int): ts : input time-series w : motif size (window size) d : # of allowed errors between motifs e : exclusion zone to get rid of trivial matches. t : length of projections after applying mask. Defaults to w - d. iters : the number of cycles used to construct the collision matrix. returns (Int): C : collision matrix

CategoricalTimeSeries.conditional_entropy — Method

Computing the conditional entropy of y given x: H(Y|X), Given in bits or shannons (meaning using the log2 for the measurement). Wikipedia: https://en.wikipedia.org/wiki/Conditional_entropy

CategoricalTimeSeries.cramer_coefficient — Method

cramer_coefficient(Serie, Lags::Array{Int64,1})

Returns the cramer's V coefficient for the given 'Lags' values.

Cramer's v is used in categorical data analysis to test the degree of association between a given set of categorical variable and another set of variables (example : eye color and hair color). Here, it measures the association between the categorical values of a time series, and the values at a later lagged time.

V lies in [0,1]. 0 no association, 1 perfect association. It is symmetric, meaning v(A,B) = v(B,A), so the information contained in the asymmetry between the variables is lost.

input :
- Serie : the time series containing the data
- Lags : Array containing the lags at which V will be computed. If an int is given
            a single point value will be returned.

output :
- V : the values of v for the given lags.

CategoricalTimeSeries.detect_motifs — Function

detect_motifs(ts, w, d, t = w - d; iters = 1000, tolerance = 0.7)

Detects all motifs of length 'w' occuring more often than chance, being identical up to 'd' differences. Input:

w : length of motifs to look for
d : allowed errors (differences) between motifs
t : size of the masks to use for random projection in the detection
iters : the numbers of iterations for the random projection process (defaults to 1000)
tolerance : threshold of motif identification.

returns : motifs : a list of motif instances containing all usefull informations about the found motifs (positions, frequency, shapes ...)

CategoricalTimeSeries.drop_rows — Method

Remove rows of 2D array 'value_arr'. Boolean array 'drop_arr' indicates which rows to drop.

CategoricalTimeSeries.entropy — Method

Computes the entropy of a distribution. If distribution is of shape (MxN),

treats each column as an independant distribution and returns an 1xN entropy vector.

CategoricalTimeSeries.error_dist — Method

error_dist(x1, x2)

Returns the number of differences between two sorted patterns 'x1' and 'x2'.

CategoricalTimeSeries.exclusion_mask! — Method

exclusion_mask!(c_matrix, exclusion_zone)

Applies exclusion zone to collision matrix (in place).

CategoricalTimeSeries.expected_matches — Method

match_probability(ts, w, d)

Estimates and returns the expectation value of motif-pair of length 'w' matching up to 'd' errors in the provided time-series 'ts'.

CategoricalTimeSeries.find_motifs — Method

find_motifs(ts, w, d)

Given a motif of shape 'shape' (array{any,1}), look for all the repetitions of it which differ only up to 'd' differences. Input:

ts : time-series in which to look for motifs
shape : shape (aray{any,1}) of the motif to look for.
d : allowed errors (differences) between motifs

returns : motif : an instance of 'pattern' containing the found repetition of the input 'shape'.

CategoricalTimeSeries.findall_fast — Method

A version of findall adapted for loops to go faster.

CategoricalTimeSeries.get_IB_curve — Function

Scans the IB plane with various vlaues of beta to get the optimal curve in the IB plane.

CategoricalTimeSeries.get_mappings — Method

 get_mappings(data, freq; m = 3)

Returns the optimal mappings corresponding to the frequency 'freq'. Prints the position and strengh of peak at 'f' for control purposes. input : - data : input categorical time series - freq : the frequency where one wants the optimal mappings - m : smoothing parameters (see 'spectralenvelope') returns : - mappings : the mapping at the peak associated with the desired frequency. example: # we have some data, plotting the spectral envelope, we see a peak at 0.33 m = getmappings(data, 0.33)

CategoricalTimeSeries.get_masks — Method

Returns all possible masks of length d among the possible w positions. positive masks: the indexes represent the columns that are KEPT. w : motif length t : length of masked motifs

CategoricalTimeSeries.get_px — Method

Computes marginal probability of 'x'.

CategoricalTimeSeries.get_pxy — Method

get_pxy(x,y)

Returns joint probability of x and y from normalized 2D histogram. Automatically replaces 0 by 10^-16 to prevent Inf and NaN values during KL computation.

CategoricalTimeSeries.get_py — Method

Computes marginal probability of 'y'.

CategoricalTimeSeries.get_py_x — Method

returns the conditional probability of y given x.

CategoricalTimeSeries.get_y — Function

get_y(x, algorithm = "nn")

Returns the corresponding context vector y to the data vector x. Several types of context are possible, and are selected by the type argument. type = "nn" is the next neighbor to every point of x. type = "an" stands for "adjacent neighbor", or the previous and next element of every element of x.

CategoricalTimeSeries.group_equivalent! — Method

Helper function that puts all similar clusters next to each other for ease of readability.

CategoricalTimeSeries.init! — Method

Re-initialies values of IB model

CategoricalTimeSeries.init_values — Function

init_values(x, y)

Initializes and return clustering porbability matrices qt_x, q and qy_t. 'qtx' refers to the probability of a category in x to be mapped to an element of t. 'q' is the overall probability of an element of t to occur. 'qyt' is the probability that an element of t corresponds to a given element of y.

CategoricalTimeSeries.least_occurence_threshold — Method

least_occurence_threshold(w, d, p; confidence = 0.95)

Computes the least number of occurences that a motif has to have in order not to be considered a product of chance. input: ts : input time-series w : motif size (window size). d : # of allowed errors between motifs. confidence (optional) : the level of desired confidence. If set to 0.5, the expected number of motifs chains matching 'matchnumber' times by chance will be 0.5. (defaults to 0.5) Returns: matchnumber : match number by chance allowed by 'confidence'.

CategoricalTimeSeries.map_to_integer — Method

map_to_integer(ts)

returns an integer mapping of input time-series 'ts'.

CategoricalTimeSeries.mapped_init_values — Method

Returns all initialization values for the IB struct in cases where input is not a real valued array.

CategoricalTimeSeries.mapping — Method

mapping(input)

Maps an input array of any type (string, float, any...) to an array of Int usable by the algorithm. The structure of the array is preserved as the mapping is one-to-one.

CategoricalTimeSeries.masked_match — Method

masked_match(w, d, t)

Returns the probability of two random masked generated motifs of length 't' to match despite 'd' error. 'w' is the size of the full motifs. !! this is a probability PER ITERATION !! inputs (Int): w : motif size (window size). d : # of allowed errors between motifs. t : length of projections after applying mask. Defaults to w - d.

CategoricalTimeSeries.p_motif_match — Method

p_motif_match(p, match_number)

probability of two motif matching 'match_number' times by chance.

CategoricalTimeSeries.periodogram_matrix — Function

periodogram_matrix(ts::Array{Float64,2}, demean = false; m=2)

Computes the k x k periodogram matrix (k number of categories). If 'average' true, segments data and computes the average periodogram matrix over all segments. m is the smoothing parameter.

CategoricalTimeSeries.plot_motif — Method

plot_motif(motif, ts; subplots = false)

Plots all instances of the given motif. If the corresponding time series 'ts' is provided, plots them on top of it, preserving the time-orderding. Otherwise, plots all instances of 'motif' on top of each other to facilitate their comparison. if subplots is given false, will only plot the motifs on top of the time-series.

CategoricalTimeSeries.print_results — Function

print_results(m::IB, disp_thres = 0.1)

Displays the clustering of data for an optimized input model 'm'.

Since the IB algorithm is not deterministic, some probabilities in the clustering matrix 'm.qtx' are non-zero. For ease of readability, every probability under 'dispthres' is displayed as 0, everything else as 1. The result is a 2D matrix where the rows represent the clusters and the column the initial categories. The 1 tell to which cluster which category belongs. If several 1 are present for a single category, it means that the optimization wasn't able to assign a single cluster to this category for the chosen β value. You might try a diffent β, or use the DIB variant of the algorithm.

CategoricalTimeSeries.qt_step! — Method

Peforms q(t) update step and drops unused clusters.

CategoricalTimeSeries.qt_x_step! — Method

Peforms q(t|x) update step.

CategoricalTimeSeries.qy_t_step! — Method

Performs q(y|t) update step.

CategoricalTimeSeries.rate_evolution — Method

rate_evolution(Series)

A way to test the stationarity of the input categorical time-series 'Series'. if it varies linearly, then the time series is more or less stationary.

CategoricalTimeSeries.relative_frequency — Method

Returns the relative frequency of a given category (of type ::Any). For example, if we have 3 category a,b and c, and a occurs 300 times in a time series of length 900, its relative frequency is 1/3.

CategoricalTimeSeries.replace_zeros — Function

replace_zeros(arr::Array{Float64,2}, cut = 10^-305)

Replaces zeros (values below 'cut') by 'cut' to avoid exponential term causing NaN at high beta values.

CategoricalTimeSeries.search_optima! — Function

search_optima!(m::IB, n_iter = 10000)

The original IB algorithm can converge to a local optima. To find the global optima, searchoptima! performs 'niter' optimization of input model 'm', randomly re-initializing it at every iteration. The best optimization is kept and 'm' is modified in-place.

CategoricalTimeSeries.sort_decreasing — Method

sort_decreasing(arr)

Returns the indices needed to sort input array in decreasing order.

CategoricalTimeSeries.spectral_envelope — Method

Computes the spectral envelope of the given time-series. input: - ts: the time series to analyse - m : the degree of smoothing wished. m corresponds to the number of neighbooring points that are mixed with given point to realize the smoothing. output: - Frequencies : list of points corresponding to the involved frequencies. contained in [0,0.5] - amplitude : values of the spectral envelope for each given frequency point. - eigenvectors : the optimal scaling for the different categories, for each frequency point. - categories : the list of categories present in the data.

CategoricalTimeSeries.symbol_probabilities — Method

Returns dictionnary containing the probability of each symbol found in ts.

CategoricalTimeSeries.theils_u — Method

theils_u(x, Lags)

Measures what portion of the information associated to the values in 'x' is know at t + lag if the value 'x' is known at t. Input : x : a categorical time-series lags : an array of lags onto U is computed. returns: U : an array of U for the given values in 'Lags'.

CategoricalTimeSeries.update_c_matrix! — Method

update_c_matrix!(collision_matrix, masked_S)

Updates the collision matrix by adding 1 in the row where repetitions of a motif are found. The row are the index of the repetitions, the column represent the first found motif of this type. An exclusion zone is applied to tackle trivial neighbours motifs.

CategoricalTimeSeries.varcov — Method

Computes the covariance-variance matrix of a given time-series.