The BetaML.Clustering Module

BetaML.Clustering — Module

Clustering module (WIP)

Provides clustering methods and collaborative filtering using clustering methods as a backend.

The em algorithm is a work in progress, as its API will likely change to account for different types of mixtures.

The module provides the following functions. Use ?[function] to access their full signature and detailed documentation:

Module Index

Detailed API

BetaML.Clustering.collFilteringGMM — Method

collFilteringGMM(X,K;p₀,μ₀,σ²₀,tol,msgStep,minVariance,missingValue)

Fill missing entries in a sparse matrix assuming an underlying Gaussian Mixture probabilistic Model (GMM) and implementing an Expectation-Maximisation algorithm.

Implemented in the log-domain for better numerical accuracy with many dimensions.

Parameters:

  • X : A (N x D) sparse matrix of data to fill according to a GMM model
  • K : Number of mixtures desired
  • p₀ : Initial probabilities of the categorical distribution (K x 1) [default: nothing]
  • μ₀ : Initial means (K x D) of the Gaussian [default: nothing]
  • σ²₀: Initial variance of the Gaussian (K x 1). We assume here that the Gaussian has the same variance across all dimensions [default: nothing]
  • tol: Tolerance to stop the algorithm [default: 10^(-6)]
  • msgStep : Iterations between update messages. Use 0 for no updates [default: 10]
  • minVariance: Minimum variance for the mixtures [default: 0.25]
  • missingValue: Value to be considered as missing in X [default: missing]

Returns:

  • A named tuple of:
    • X̂ : The filled matrix of size (N x D)
    • nFill: The number of items filled
    • lL : The log-likelihood (without considering the last mixture optimisation)
    • BIC : The Bayesian Information Criterion

Example:

julia>  cFOut = collFilteringGMM([1 10.5;1.5 0; 1.8 8; 1.7 15; 3.2 40; 0 0; 3.3 38; 0 -2.3; 5.2 -2.4],3,msgStep=1,missingValue=0)
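The one-line call above can be expanded into a short sketch that inspects the returned named tuple. This is only a sketch assuming BetaML is installed; the field accesses follow the return values documented above, and the variable names are illustrative.

```julia
using BetaML.Clustering

# A (9 x 2) matrix where 0 marks a missing entry
X = [1 10.5; 1.5 0; 1.8 8; 1.7 15; 3.2 40; 0 0; 3.3 38; 0 -2.3; 5.2 -2.4]

# msgStep=0 suppresses the per-iteration update messages
cFOut = collFilteringGMM(X, 3, msgStep=0, missingValue=0)

# The returned named tuple exposes the filled matrix and the fit statistics
println("Filled matrix:   ", cFOut.X̂)
println("Entries filled:  ", cFOut.nFill)
println("Log-likelihood:  ", cFOut.lL)
println("BIC:             ", cFOut.BIC)
```

Note that with `missingValue=0` every zero in X is treated as missing, including "legitimate" zeros, so a sentinel outside the data range is usually a safer choice.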
BetaML.Clustering.emGMM — Method

emGMM(X,K;p₀,μ₀,σ²₀,tol,msgStep,minVariance,missingValue)

Compute Expectation-Maximisation algorithm to identify K clusters of X data assuming a Gaussian Mixture probabilistic Model.

X can contain missing values in some or all of its dimensions. In such case the learning is done only with the available data. Implemented in the log-domain for better numerical accuracy with many dimensions.

Parameters:

  • X : A (N x D) matrix of data to clusterise
  • K : Number of clusters wanted
  • p₀ : Initial probabilities of the categorical distribution (K x 1) [default: nothing]
  • μ₀ : Initial means (K x d) of the Gaussian [default: nothing]
  • σ²₀: Initial variance of the Gaussian (K x 1). We assume here that the Gaussian has the same variance across all dimensions [default: nothing]
  • tol: Tolerance to stop the algorithm [default: 10^(-6)]
  • msgStep : Iterations between update messages. Use 0 for no updates [default: 10]
  • minVariance: Minimum variance for the mixtures [default: 0.25]
  • missingValue: Value to be considered as missing in X [default: missing]

Returns:

  • A named tuple of:
    • pⱼₓ: Matrix of size (N x K) of the probabilities of each point i to belong to cluster j
    • pⱼ : Probabilities of the categorical distribution (K x 1)
    • μ : Means (K x D) of the Gaussian
    • σ² : Variance of the Gaussian (K x 1). We assume here that the Gaussian has the same variance across all dimensions
    • ϵ : Vector of the discrepancy (matrix norm) between pⱼₓ and the lagged pⱼₓ at each iteration
    • lL : The log-likelihood (without considering the last mixture optimisation)
    • BIC : The Bayesian Information Criterion

Example:

julia> clusters = emGMM([1 10.5;1.5 0; 1.8 8; 1.7 15; 3.2 40; 0 0; 3.3 38; 0 -2.3; 5.2 -2.4],3,msgStep=1,missingValue=0)
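Since pⱼₓ holds soft (probabilistic) assignments, a common next step is to derive a hard clustering from it. The sketch below does this with the example data above; it assumes BetaML is installed, and hardIdx is an illustrative variable name, not part of the API.

```julia
using BetaML.Clustering

X = [1 10.5; 1.5 0; 1.8 8; 1.7 15; 3.2 40; 0 0; 3.3 38; 0 -2.3; 5.2 -2.4]

clusters = emGMM(X, 3, msgStep=0, missingValue=0)

# pⱼₓ is (N x K): each row gives the probability of that point belonging
# to each of the K mixtures
println("Mixture weights: ", clusters.pⱼ)
println("Mixture means:   ", clusters.μ)

# Hard assignment: for each point, pick the mixture with the highest probability
hardIdx = [argmax(clusters.pⱼₓ[i, :]) for i in 1:size(X, 1)]
```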
BetaML.Clustering.initRepresentatives — Method

initRepresentatives(X,K;initStrategy,Z₀)

Initialise the representatives for the K-Means or K-Medoids algorithms

Parameters:

  • X: a (N x D) matrix of data to clusterise
  • K: Number of clusters wanted
  • initStrategy: The strategy to select the initial representative vectors:
    • random: randomly in the X space
    • grid: using a grid approach [default]
    • shuffle: selecting randomly within the available points
    • given: using a provided set of initial representatives provided in the Z₀ parameter
  • Z₀: Provided (K x D) matrix of initial representatives (used only together with the given initStrategy) [default: nothing]

Returns:

  • A (K x D) matrix of initial representatives

Example:

julia> Z₀ = initRepresentatives([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.6 38],2,initStrategy="given",Z₀=[1.7 15; 3.6 40])
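To contrast two of the initialisation strategies, the sketch below repeats the call above with the given strategy and then with the default grid strategy. A minimal sketch, assuming BetaML is installed; the variable names are illustrative.

```julia
using BetaML.Clustering

X = [1 10.5; 1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.6 38]

# "given" strategy: the (K x D) matrix passed as Z₀ is used as the
# initial representatives
Zgiven = initRepresentatives(X, 2, initStrategy="given", Z₀=[1.7 15; 3.6 40])

# "grid" strategy (the default): representatives are spread over the range of X
Zgrid = initRepresentatives(X, 2, initStrategy="grid")
```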
BetaML.Clustering.kmeans — Method

kmeans(X,K;dist,initStrategy,Z₀)

Compute the K-Means algorithm to identify K clusters of X using the Euclidean distance

Parameters:

  • X: a (N x D) matrix of data to clusterise
  • K: Number of clusters wanted
  • dist: Function to employ as distance (see notes). Defaults to the Euclidean distance.
  • initStrategy: The strategy to select the initial representative vectors:
    • random: randomly in the X space
    • grid: using a grid approach [default]
    • shuffle: selecting randomly within the available points
    • given: using a provided set of initial representatives provided in the Z₀ parameter
  • Z₀: Provided (K x D) matrix of initial representatives (used only together with the given initStrategy) [default: nothing]

Returns:

  • A tuple of two items, the first one being a vector of size N of ids of the clusters associated to each point and the second one the (K x D) matrix of representatives

Notes:

  • Some returned clusters could be empty
  • The dist parameter can be:
    • Any user defined function accepting two vectors and returning a scalar
    • An anonymous function with the same characteristics (e.g. dist = (x,y) -> norm(x-y)^2)
    • One of the following predefined distances: l1_distance, l2_distance, l2²_distance, cosine_distance

Example:

julia> (clIdx,Z) = kmeans([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3)
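The notes above state that dist can be any function taking two vectors and returning a scalar. The sketch below shows the default call and a custom anonymous squared-Euclidean distance; it assumes BetaML is installed, and the variable names are illustrative.

```julia
using LinearAlgebra   # for norm
using BetaML.Clustering

X = [1 10.5; 1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4]

# Default: Euclidean distance with grid initialisation
(clIdx, Z) = kmeans(X, 3)

# Any function of two vectors returning a scalar works as a distance,
# e.g. the squared Euclidean distance as an anonymous function
(clIdx2, Z2) = kmeans(X, 3, dist=(x, y) -> norm(x - y)^2)
```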
BetaML.Clustering.kmedoids — Method

kmedoids(X,K;dist,initStrategy,Z₀)

Compute the K-Medoids algorithm to identify K clusters of X using the distance definition dist

Parameters:

  • X: a (N x D) matrix of data to clusterise
  • K: Number of clusters wanted
  • dist: Function to employ as distance (see notes). Defaults to the Euclidean distance.
  • initStrategy: The strategy to select the initial representative vectors:
    • random: randomly in the X space
    • grid: using a grid approach
    • shuffle: selecting randomly within the available points [default]
    • given: using a provided set of initial representatives provided in the Z₀ parameter
  • Z₀: Provided (K x D) matrix of initial representatives (used only together with the given initStrategy) [default: nothing]

Returns:

  • A tuple of two items, the first one being a vector of size N of ids of the clusters associated to each point and the second one the (K x D) matrix of representatives

Notes:

  • Some returned clusters could be empty
  • The dist parameter can be:
    • Any user defined function accepting two vectors and returning a scalar
    • An anonymous function with the same characteristics (e.g. dist = (x,y) -> norm(x-y)^2)
    • One of the following predefined distances: l1_distance, l2_distance, l2²_distance, cosine_distance

Example:

julia> (clIdx,Z) = kmedoids([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3,initStrategy="grid")
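The sketch below repeats the call above and then swaps in cosine_distance, one of the predefined distances listed in the notes. It assumes BetaML is installed; with K-Medoids the returned representatives are always actual points of X, unlike the centroids returned by kmeans.

```julia
using BetaML.Clustering

X = [1 10.5; 1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4]

# Grid initialisation with the default Euclidean distance
(clIdx, Z) = kmedoids(X, 3, initStrategy="grid")

# Same data, but clustered by direction rather than magnitude using the
# predefined cosine distance and the default shuffle initialisation
(clIdxCos, Zcos) = kmedoids(X, 3, dist=cosine_distance)
```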