Functions · ClusterAnalysis.jl

Documentation of ClusterAnalysis functions.

ClusterAnalysis.kmeans — Function

kmeans(table, K::Int; nstart::Int = 10, maxiter::Int = 10, init::Symbol = :kmpp)
kmeans(data::AbstractMatrix, K::Int; nstart::Int = 10, maxiter::Int = 10, init::Symbol = :kmpp)

Classify all data observations into K clusters by minimizing the total-variance-within-cluster.

Arguments (positional)

table or data: table or Matrix of data observations.
K: number of clusters.

Keyword

nstart: number of times the algorithm is restarted from a new centroid initialization.
maxiter: maximum number of iterations per start.
init: centroid initialization algorithm, either :kmpp (default) or :random.

Example

julia> using ClusterAnalysis
julia> using CSV, DataFrames

julia> iris = CSV.read(joinpath(pwd(), "path/to/iris.csv"), DataFrame);
julia> df = iris[:, 1:end-1];

julia> model = kmeans(df, 3)
KmeansResult{Float64}:
 K = 3
 centroids = [
     [5.932307692307693, 2.755384615384615, 4.42923076923077, 1.4384615384615382]
     [5.006, 3.4279999999999995, 1.462, 0.24599999999999997]
     [6.874285714285714, 3.088571428571429, 5.791428571428571, 2.117142857142857]
 ]
 cluster = [2, 2, 2, 2, 2, 2, 2, 2, 2, 2  …  3, 3, 1, 3, 3, 3, 1, 3, 3, 1]
 within-cluster sum of squares = 78.85144142614601
 iterations = 7

Pseudo-code of the algorithm:

Repeat nstart times:
1. Initialize the K cluster centroids using the KMeans++ algorithm or random init.
2. Estimate the clusters.
3. Repeat maxiter times:
  - Update the centroids using the mean().
  - Re-estimate the clusters.
  - Calculate the total-variance-within-cluster.
  - Evaluate the stop rule.
Keep the best result (minimum total-variance-within-cluster) of all nstart executions.
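
The steps above can be sketched in a few lines of Julia. This is only an illustrative sketch, not the package's implementation: sketch_kmeans, nearest and within_ss are hypothetical names, only random initialization is shown (no KMeans++), and the stop rule (cost no longer changing) is an assumption.

```julia
using Random, Statistics

# Hypothetical helper: index of the centroid nearest to observation x.
nearest(x, centroids) = argmin([sum(abs2, x .- c) for c in centroids])

# Hypothetical helper: total within-cluster sum of squares.
within_ss(data, centroids, cluster) =
    sum(sum(abs2, data[i, :] .- centroids[cluster[i]]) for i in axes(data, 1))

function sketch_kmeans(data::AbstractMatrix, K::Int; nstart::Int = 10,
                       maxiter::Int = 10, rng = Random.default_rng())
    n = size(data, 1)
    best_cluster, best_cost = Int[], Inf
    for _ in 1:nstart
        # 1. Random init: pick K distinct observations as centroids.
        centroids = [float.(data[i, :]) for i in randperm(rng, n)[1:K]]
        # 2. Estimate clusters.
        cluster = [nearest(data[i, :], centroids) for i in 1:n]
        cost = within_ss(data, centroids, cluster)
        # 3. Refine for at most maxiter iterations.
        for _ in 1:maxiter
            # Update each non-empty cluster's centroid using the mean.
            for k in 1:K
                members = findall(==(k), cluster)
                isempty(members) || (centroids[k] = vec(mean(data[members, :], dims = 1)))
            end
            # Re-estimate clusters and recompute the cost.
            cluster = [nearest(data[i, :], centroids) for i in 1:n]
            newcost = within_ss(data, centroids, cluster)
            # Assumed stop rule: the cost no longer changes.
            converged = newcost ≈ cost
            cost = newcost
            converged && break
        end
        # Keep the best result over all starts.
        if cost < best_cost
            best_cluster, best_cost = cluster, cost
        end
    end
    return best_cluster, best_cost
end
```

Restarting nstart times matters because each run only reaches a local minimum that depends on the initial centroids; keeping the lowest-cost run reduces the impact of a poor initialization.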

For a more detailed explanation of the algorithm, see the Algorithm's Overview of KMeans.

ClusterAnalysis.dbscan — Function

dbscan(df, ϵ::Real, min_pts::Int)

Classify data observations into clusters and noise using a density criterion defined by the input parameters (ϵ, min_pts).

The number of clusters is determined during the execution of the model, so the user does not know in advance how many clusters will be found. The algorithm uses the KDTree structure from NearestNeighbors.jl to perform the RangeQuery operation more efficiently.

For a more detailed explanation of the algorithm, see the Algorithm's Overview of DBSCAN.
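
To illustrate the density concept, here is a minimal sketch of the textbook DBSCAN procedure. It is not the package's code: sketch_dbscan and range_query are hypothetical names, the range query is brute force rather than a KDTree, and the label convention (0 for noise, 1, 2, … for clusters) as well as the use of `<=` for the ϵ test are assumptions.

```julia
# Hypothetical helper: brute-force range query, returning the indices of all
# observations within distance ϵ of observation i (including i itself).
range_query(data, i, ϵ) =
    [j for j in axes(data, 1) if sqrt(sum(abs2, data[i, :] .- data[j, :])) <= ϵ]

function sketch_dbscan(data::AbstractMatrix, ϵ::Real, min_pts::Int)
    n = size(data, 1)
    labels = fill(-1, n)          # -1 = unvisited, 0 = noise, k >= 1 = cluster id
    k = 0
    for i in 1:n
        labels[i] == -1 || continue
        neighbors = range_query(data, i, ϵ)
        if length(neighbors) < min_pts
            labels[i] = 0         # provisionally noise; may become a border point
            continue
        end
        k += 1                    # i is a core point: start a new cluster
        labels[i] = k
        queue = copy(neighbors)
        while !isempty(queue)
            j = popfirst!(queue)
            labels[j] == 0 && (labels[j] = k)   # noise reached from a core point
            labels[j] == -1 || continue
            labels[j] = k
            nb = range_query(data, j, ϵ)
            # Only core points extend the cluster further.
            length(nb) >= min_pts && append!(queue, nb)
        end
    end
    return labels
end
```

The number of clusters is simply the largest label at the end of the loop, which is why it cannot be known before running the algorithm.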

ClusterAnalysis.euclidean — Function

ClusterAnalysis.euclidean(a::AbstractVector, b::AbstractVector)

Calculate the Euclidean distance between two vectors: √(∑(aᵢ - bᵢ)²).

Arguments (positional)

a: First vector.
b: Second vector.

Example

julia> using ClusterAnalysis

julia> a = rand(100); b = rand(100);

julia> ClusterAnalysis.euclidean(a, b)
3.8625780213774954
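
Because the example above uses random vectors, its output is not reproducible. The formula itself can be checked on a fixed input with a one-line sketch; sketch_euclidean is a hypothetical name and not part of the package.

```julia
# Direct transcription of √(∑(aᵢ - bᵢ)²); a sketch, not the package's code.
sketch_euclidean(a::AbstractVector, b::AbstractVector) = sqrt(sum(abs2, a .- b))

sketch_euclidean([0.0, 0.0], [3.0, 4.0])  # 5.0 (the 3-4-5 right triangle)
```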

ClusterAnalysis.squared_error — Function

ClusterAnalysis.squared_error(data::AbstractMatrix)
ClusterAnalysis.squared_error(col::AbstractVector)

Evaluate a k-means clustering using the Sum of Squared Error (SSE).

Arguments (positional)

data or col: Matrix of data observations or a Vector which represents one column of data.

Example

julia> using ClusterAnalysis

julia> a = rand(100, 4);

julia> ClusterAnalysis.squared_error(a)
34.71086095943974

julia> ClusterAnalysis.squared_error(a[:, 1])
10.06029322934825
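
For intuition, the following sketch computes an SSE under the assumption that errors are measured around each column's mean and that the matrix method sums the per-column values. The definition is an assumption, not taken from the package source, and sketch_sse is a hypothetical name.

```julia
using Statistics

# Assumed definition: squared deviations from the column mean.
sketch_sse(col::AbstractVector) = sum(abs2, col .- mean(col))

# Assumed matrix method: sum of the per-column SSE values.
sketch_sse(data::AbstractMatrix) =
    sum(sketch_sse(view(data, :, j)) for j in axes(data, 2))

sketch_sse([1.0, 2.0, 3.0])                      # 2.0
sketch_sse([1.0 10.0; 2.0 20.0; 3.0 30.0])       # 202.0
```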

ClusterAnalysis.totalwithinss — Function

ClusterAnalysis.totalwithinss(data::AbstractMatrix, K::Int, cluster::Vector)

Calculate the total-variance-within-cluster using the squared_error() function.

Arguments (positional)

data: Matrix of data observations.
K: number of clusters.
cluster: Vector with the cluster assigned to each data observation.

Example

julia> using ClusterAnalysis
julia> using CSV, DataFrames

julia> iris = CSV.read(joinpath(pwd(), "path/to/iris.csv"), DataFrame);
julia> df = iris[:, 1:end-1];
julia> model = kmeans(df, 3);

julia> ClusterAnalysis.totalwithinss(Matrix(df), model.K, model.cluster)
78.85144142614601
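
Conceptually, this quantity is the sum of the per-cluster squared errors. A minimal sketch under that assumption follows; sketch_totalwithinss and sse are hypothetical names, and the per-cluster error is assumed to be measured around the cluster's column means.

```julia
using Statistics

# Assumed per-cluster error: squared deviations from the column means.
sse(m::AbstractMatrix) = sum(abs2, m .- mean(m, dims = 1))

function sketch_totalwithinss(data::AbstractMatrix, K::Int, cluster::Vector)
    total = 0.0
    for k in 1:K
        members = data[cluster .== k, :]   # rows assigned to cluster k
        isempty(members) && continue       # skip empty clusters
        total += sse(members)
    end
    return total
end

# Two tight groups of two points each.
data = [0.0 0.0; 0.1 0.0; 10.0 10.0; 10.1 10.0]
sketch_totalwithinss(data, 2, [1, 1, 2, 2])
```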