The BetaML.Clustering Module
BetaML.Clustering — Module

Clustering module (WIP)

Provides clustering methods and collaborative filtering using clustering methods as backend.

The em algorithm is work in progress, as its API will likely change to account for different types of mixtures.

The module provides the following functions. Use `?[function]` to access their full signature and detailed documentation:
- `initRepresentatives(X,K;initStrategy,Z₀)`: Initialisation strategies for Kmeans and Kmedoids
- `kmeans(X,K;dist,initStrategy,Z₀)`: Classical KMeans algorithm
- `kmedoids(X,K;dist,initStrategy,Z₀)`: Kmedoids algorithm
- `em(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance,initStrategy)`: EM algorithm over GMM
- `predictMissing(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance)`: Fill missing values / collaborative filtering using GMM as backbone

{Spherical|Diagonal|Full}Gaussian mixtures for em()/predictMissing() are already provided. User-defined mixtures can be used by defining a struct as subtype of `Mixture` and implementing for that mixture the following functions:

- `initMixtures!(mixtures, X; minVariance, minCovariance, initStrategy)`
- `lpdf(m,x,mask)`
- `updateParameters!(mixtures, X, pₙₖ; minVariance, minCovariance)`
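As a sketch of the kind of log-density a custom mixture's `lpdf(m,x,mask)` must return, here is a plain-Julia spherical-Gaussian log-pdf restricted to the observed dimensions. The function name and struct-free signature are illustrative assumptions, not part of the module:

```julia
# Sketch (assumption, not the module's code): log-pdf of a spherical Gaussian
# evaluated only on the observed dimensions, as selected by the boolean `mask`.
function spherical_lpdf(μ, σ², x, mask)
    d  = sum(mask)          # number of observed dimensions
    xo = x[mask]            # observed coordinates of the point
    μo = μ[mask]            # corresponding mean coordinates
    return -d / 2 * log(2π * σ²) - sum(abs2, xo .- μo) / (2σ²)
end
```

With `μ == x` and `σ² = 1` in two dimensions this evaluates to `-log(2π)`, the log-density of a standard bivariate Gaussian at its mean; masking out a dimension reduces it to the one-dimensional value.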
Module Index
BetaML.Clustering.em
BetaML.Clustering.initMixtures!
BetaML.Clustering.initRepresentatives
BetaML.Clustering.kmeans
BetaML.Clustering.kmedoids
BetaML.Clustering.lpdf
BetaML.Clustering.lpdf
BetaML.Clustering.lpdf
BetaML.Clustering.predictMissing
Detailed API
BetaML.Clustering.em — Method

em(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance,initStrategy)

Compute the Expectation-Maximisation algorithm to identify K clusters of X data, i.e. employ a Generative Mixture Model as the underlying probabilistic model.

X can contain missing values in some or all of its dimensions. In such a case the learning is done only with the available data. Implemented in the log-domain for better numerical accuracy with many dimensions.
Parameters:
- `X`: An (n x d) data matrix to clusterise
- `K`: Number of clusters wanted
- `p₀`: Initial probabilities of the categorical distribution (K x 1) [default: `nothing`]
- `mixtures`: An array (of length K) of the mixtures to employ (see notes) [def: `[DiagonalGaussian() for i in 1:K]`]
- `tol`: Tolerance to stop the algorithm [default: 10^(-6)]
- `verbosity`: A verbosity parameter regulating the information messages frequency [def: `STD`]
- `minVariance`: Minimum variance for the mixtures [default: 0.05]
- `minCovariance`: Minimum covariance for the mixtures with full covariance matrix [default: 0]. This should be set different than minVariance (see notes).
- `initStrategy`: Mixture initialisation algorithm [def: `grid`]
Returns:
- A named tuple of:
  - `pⱼₓ`: Matrix of size (N x K) of the probabilities of each point i to belong to cluster j
  - `pⱼ`: Probabilities of the categorical distribution (K x 1)
  - `μ`: Means (K x d) of the Gaussians
  - `σ²`: Variance of the Gaussians (K x 1). We assume here that each Gaussian has the same variance across all the dimensions
  - `ϵ`: Vector of the discrepancy (matrix norm) between pⱼₓ and the lagged pⱼₓ at each iteration
  - `lL`: The log-likelihood (without considering the last mixture optimisation)
  - `BIC`: The Bayesian Information Criterion (lower is better)
  - `AIC`: The Akaike Information Criterion (lower is better)
Notes:
- The mixtures currently implemented are `SphericalGaussian(μ,σ²)`, `DiagonalGaussian(μ,σ²)` and `FullGaussian(μ,σ²)`
- Reasonable choices for minVariance/minCovariance depend on the mixture. For example 0.25 seems a reasonable value for the SphericalGaussian, 0.05 seems better for the DiagonalGaussian, and FullGaussian seems to prefer either very low values of variance/covariance (e.g. (0.05,0.05)) or very big but similar ones (e.g. (100,100)).
Example:
julia> clusters = em([1 10.5;1.5 0; 1.8 8; 1.7 15; 3.2 40; 0 0; 3.3 38; 0 -2.3; 5.2 -2.4],3,verbosity=HIGH)
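Since `em` returns soft memberships in `pⱼₓ`, a hard clustering can be recovered with a row-wise argmax. A minimal sketch on a stand-in membership matrix (not actual `em` output):

```julia
# Stand-in for the (N x K) membership matrix pⱼₓ returned by em()
pⱼₓ = [0.9 0.1; 0.2 0.8; 0.6 0.4]
# Hard assignment: for each point, pick the cluster with the highest probability
assignments = [argmax(pⱼₓ[i, :]) for i in 1:size(pⱼₓ, 1)]
# assignments == [1, 2, 1]
```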
BetaML.Clustering.initMixtures! — Method

initMixtures!(mixtures::Array{T,1}, X; minVariance=0.25, minCovariance=0.0, initStrategy="grid")
BetaML.Clustering.initRepresentatives — Method

initRepresentatives(X,K;initStrategy,Z₀)

Initialise the representatives for a K-Mean or K-Medoids algorithm
Parameters:
- `X`: a (N x D) data matrix to clusterise
- `K`: Number of clusters wanted
- `initStrategy`: How to select the initial representative vectors:
  - `random`: randomly in the X space
  - `grid`: using a grid approach [default]
  - `shuffle`: selecting randomly within the available points
  - `given`: using a set of initial representatives provided in the `Z₀` parameter
- `Z₀`: Provided (K x D) matrix of initial representatives (used only together with the `given` initStrategy) [default: `nothing`]
Returns:
- A (K x D) matrix of initial representatives
Example:
julia> Z₀ = initRepresentatives([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.6 38],2,initStrategy="given",Z₀=[1.7 15; 3.6 40])
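The `grid` strategy can be pictured as spacing the K initial representatives evenly between the per-dimension minima and maxima of X. The following plain-Julia sketch illustrates that idea; it is an assumption about the strategy, not the module's exact implementation:

```julia
# Sketch: place K representatives on an even grid between each column's extrema
function grid_init(X, K)
    D = size(X, 2)
    Z = zeros(K, D)
    for d in 1:D
        lo, hi = extrema(X[:, d])
        Z[:, d] = range(lo, hi, length = K)   # K evenly spaced values
    end
    return Z
end
# grid_init([0.0 0.0; 10.0 4.0], 3) == [0.0 0.0; 5.0 2.0; 10.0 4.0]
```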
BetaML.Clustering.kmeans — Method

kmeans(X,K;dist,initStrategy,Z₀)

Compute the K-Mean algorithm to identify K clusters of X using the distance definition `dist` (default: Euclidean distance)
Parameters:
- `X`: a (N x D) data matrix to clusterise
- `K`: Number of clusters wanted
- `dist`: Function to employ as distance (see notes). Defaults to the Euclidean distance.
- `initStrategy`: How to select the initial representative vectors:
  - `random`: randomly in the X space
  - `grid`: using a grid approach [default]
  - `shuffle`: selecting randomly within the available points
  - `given`: using a set of initial representatives provided in the `Z₀` parameter
- `Z₀`: Provided (K x D) matrix of initial representatives (used only together with the `given` initStrategy) [default: `nothing`]
Returns:
- A tuple of two items, the first one being a vector of size N of ids of the clusters associated to each point and the second one the (K x D) matrix of representatives
Notes:
- Some returned clusters could be empty
- The `dist` parameter can be:
  - Any user defined function accepting two vectors and returning a scalar
  - An anonymous function with the same characteristics (e.g. `dist = (x,y) -> norm(x-y)^2`)
  - One of the predefined distances: `l1_distance`, `l2_distance`, `l2²_distance`, `cosine_distance`
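A sketch of custom distance functions satisfying the `dist` contract (two vectors in, one scalar out); the function names below are illustrative, not exports of the module:

```julia
using LinearAlgebra  # norm, dot

manhattan(x, y)    = sum(abs, x .- y)                      # l1-style distance
euclidean(x, y)    = norm(x .- y)                          # l2-style distance
sq_euclidean(x, y) = sum(abs2, x .- y)                     # squared l2 distance
cosine_dist(x, y)  = 1 - dot(x, y) / (norm(x) * norm(y))   # cosine distance
```

Any such function can be passed directly, e.g. `kmeans(X, 3, dist = manhattan)`.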
Example:
julia> (clIdx,Z) = kmeans([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3)
BetaML.Clustering.kmedoids — Method

kmedoids(X,K;dist,initStrategy,Z₀)

Compute the K-Medoids algorithm to identify K clusters of X using the distance definition `dist`
Parameters:
- `X`: a (n x d) data matrix to clusterise
- `K`: Number of clusters wanted
- `dist`: Function to employ as distance (see notes). Defaults to the Euclidean distance.
- `initStrategy`: How to select the initial representative vectors:
  - `random`: randomly in the X space
  - `grid`: using a grid approach
  - `shuffle`: selecting randomly within the available points [default]
  - `given`: using a set of initial representatives provided in the `Z₀` parameter
- `Z₀`: Provided (K x D) matrix of initial representatives (used only together with the `given` initStrategy) [default: `nothing`]
Returns:
- A tuple of two items, the first one being a vector of size N of ids of the clusters associated to each point and the second one the (K x D) matrix of representatives
Notes:
- Some returned clusters could be empty
- The `dist` parameter can be:
  - Any user defined function accepting two vectors and returning a scalar
  - An anonymous function with the same characteristics (e.g. `dist = (x,y) -> norm(x-y)^2`)
  - One of the predefined distances: `l1_distance`, `l2_distance`, `l2²_distance`, `cosine_distance`
Example:
julia> (clIdx,Z) = kmedoids([1 10.5;1.5 10.8; 1.8 8; 1.7 15; 3.2 40; 3.6 32; 3.3 38; 5.1 -2.3; 5.2 -2.4],3,initStrategy="grid")
BetaML.Clustering.lpdf — Method

lpdf(m::DiagonalGaussian,x,mask) - Log PDF of the mixture given the observation x

BetaML.Clustering.lpdf — Method

lpdf(m::FullGaussian,x,mask) - Log PDF of the mixture given the observation x

BetaML.Clustering.lpdf — Method

lpdf(m::SphericalGaussian,x,mask) - Log PDF of the mixture given the observation x
BetaML.Clustering.predictMissing — Method

predictMissing(X,K;p₀,mixtures,tol,verbosity,minVariance,minCovariance)

Fill missing entries in a sparse matrix assuming an underlying Gaussian Mixture Model (GMM) and implementing an Expectation-Maximisation algorithm.

Implemented in the log-domain for better numerical accuracy with many dimensions.
Parameters:
- `X`: A (N x D) sparse matrix of data to fill according to a GMM model
- `K`: Number of mixtures desired
- `p₀`: Initial probabilities of the categorical distribution (K x 1) [default: `nothing`]
- `mixtures`: An array (of length K) of the mixtures to employ (see notes) [def: `[DiagonalGaussian() for i in 1:K]`]
- `tol`: Tolerance to stop the algorithm [default: 10^(-6)]
- `verbosity`: A verbosity parameter regulating the information messages frequency [def: `STD`]
- `minVariance`: Minimum variance for the mixtures [default: 0.05]
- `minCovariance`: Minimum covariance for the mixtures with full covariance matrix [default: 0]. This should be set different than minVariance (see notes).
- `initStrategy`: Mixture initialisation algorithm [def: `grid`]
Returns:
- A named tuple of:
  - `X̂`: The filled matrix of size (N x D)
  - `nFill`: The number of items filled
  - `lL`: The log-likelihood (without considering the last mixture optimisation)
  - `BIC`: The Bayesian Information Criterion (lower is better)
  - `AIC`: The Akaike Information Criterion (lower is better)
Notes:
- The mixtures currently implemented are `SphericalGaussian(μ,σ²)`, `DiagonalGaussian(μ,σ²)` and `FullGaussian(μ,σ²)`
- For mixtures with a full covariance matrix (i.e. `FullGaussian(μ,σ²)`) minCovariance should NOT be set equal to minVariance, otherwise when the covariance goes too low the covariance matrix becomes singular and not invertible.
Example:
julia> cFOut = predictMissing([1 10.5;1.5 missing; 1.8 8; 1.7 15; 3.2 40; missing missing; 3.3 38; missing -2.3; 5.2 -2.4],3)
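The input of `predictMissing` is an ordinary Julia matrix containing `missing` entries. A minimal sketch of building such a matrix and counting the entries to be filled (this count is what the returned `nFill` reports):

```julia
# A small data matrix with missing entries, as predictMissing expects
X = [1.0 10.5; 1.5 missing; 1.8 8.0; missing missing]
nMissing = count(ismissing, X)
# nMissing == 3
```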