# K-means

K-means is a classical method for clustering or vector quantization. It produces a fixed number of clusters, each associated with a *center* (also known as a *prototype*), and each data point is assigned to the cluster with the nearest center.

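From a mathematical standpoint, K-means is a coordinate descent algorithm for the following optimization problem. It alternates between updating the assignments and updating the centers, and in general it converges to a local rather than a global minimum: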

\[\text{minimize} \ \sum_{i=1}^n \| \mathbf{x}_i - \boldsymbol{\mu}_{z_i} \|^2 \ \text{w.r.t.} \ (\boldsymbol{\mu}, z)\]

Here, $\boldsymbol{\mu}_k$ is the center of the $k$-th cluster, and $z_i$ is the index of the cluster to which the $i$-th point $\mathbf{x}_i$ is assigned.
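The alternation behind this objective can be sketched in plain Julia without the package. This is a minimal illustration, not the implementation used by `kmeans`: the helper names (`kmeans_objective`, `assign_points`, `update_centers`, `lloyd_sketch`), the choice of the first three points as seeds, and the fixed iteration count are all assumptions made for the example.

```julia
using Random

# Objective from the formula above: sum of squared distances from each
# point (a column of X) to the center of its assigned cluster.
function kmeans_objective(X, centers, z)
    return sum(sum(abs2, X[:, i] .- centers[:, z[i]]) for i in axes(X, 2))
end

# Assignment step: with the centers fixed, pick the nearest center for each point.
function assign_points(X, centers)
    return [argmin([sum(abs2, X[:, i] .- centers[:, k]) for k in axes(centers, 2)])
            for i in axes(X, 2)]
end

# Update step: with the assignments fixed, move each center to the mean of its members.
function update_centers(X, z, centers)
    new_centers = copy(centers)
    for k in axes(centers, 2)
        members = findall(==(k), z)
        if !isempty(members)  # an empty cluster keeps its previous center
            new_centers[:, k] = vec(sum(X[:, members], dims=2)) ./ length(members)
        end
    end
    return new_centers
end

# Alternate the two steps and record the objective after each round.
function lloyd_sketch(X, centers; iterations=10)
    z = assign_points(X, centers)
    costs = [kmeans_objective(X, centers, z)]
    for _ in 1:iterations
        centers = update_centers(X, z, centers)
        z = assign_points(X, centers)
        push!(costs, kmeans_objective(X, centers, z))
    end
    return centers, z, costs
end

Random.seed!(1)
X = rand(2, 200)
centers, z, costs = lloyd_sketch(X, X[:, 1:3])  # hypothetical seeding: first three points
```

Each step minimizes the objective over one block of variables while the other is held fixed, so the recorded `costs` should never increase from one round to the next.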

`Clustering.kmeans` — Function

`kmeans(X, k, [...]) -> KmeansResult`

K-means clustering of the $d×n$ data matrix `X` (each column of `X` is a $d$-dimensional data point) into `k` clusters.

**Arguments**

- `init` (defaults to `:kmpp`): how cluster seeds should be initialized; it can be one of the following:
  - a `Symbol`, the name of a seeding algorithm (see Seeding for a list of supported methods);
  - an instance of `SeedingAlgorithm`;
  - an integer vector of length $k$ that provides the indices of points to use as initial seeds.
- `weights`: $n$-element vector of point weights (the cluster centers are the weighted means of cluster members)
- `maxiter`, `tol`, `display`: see common options
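The keyword arguments above might be used as in the following sketch. The data, the seed indices, and the weight values are invented for illustration; only `init`, `weights`, and the `:kmpp` default come from the argument list.

```julia
using Clustering

X = rand(5, 300)  # 300 five-dimensional points, one per column

# Seeding chosen by name; :kmpp is the documented default.
R1 = kmeans(X, 4; init=:kmpp)

# Seeding from explicit point indices: columns 1, 50, 100 and 150 of X
# become the initial centers (the vector length must equal k).
R2 = kmeans(X, 4; init=[1, 50, 100, 150])

# Weighted clustering with hypothetical weights that count
# the first 100 points twice as much as the rest.
w = ones(300)
w[1:100] .= 2.0
R3 = kmeans(X, 4; weights=w)
```

Passing indices is a convenient way to make a run reproducible when the seeds are known in advance, since it removes the randomness of the seeding step.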

`Clustering.KmeansResult` — Type

If you already have a set of initial center vectors, `kmeans!` can be used:

`Clustering.kmeans!` — Function

`kmeans!(X, centers; [kwargs...]) -> KmeansResult`

Update the current cluster `centers` ($d×k$ matrix, where $d$ is the dimension and $k$ the number of centroids) using the $d×n$ data matrix `X` (each column of `X` is a $d$-dimensional data point).

See `kmeans` for the description of optional `kwargs`.
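A short sketch of calling `kmeans!` with your own starting centers follows. Using the first four data points as centers is an assumption made only to have something concrete to pass in; any $d×k$ floating-point matrix should work the same way.

```julia
using Clustering

X = rand(5, 300)     # 300 five-dimensional points, one per column
centers = X[:, 1:4]  # hypothetical initial centers; indexing makes a copy of the columns
initial = copy(centers)  # kept only to compare against the updated centers

# The `!` suffix signals that `centers` is updated by the call.
R = kmeans!(X, centers)

moved = centers != initial  # true if the centers changed during the run
```

Keep a copy of the starting matrix, as above, if the initial centers are needed after the call.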

## Examples

```
using Clustering
# make a dataset of 1000 random 5-dimensional points
X = rand(5, 1000)
# cluster X into 20 clusters using K-means
R = kmeans(X, 20; maxiter=200, display=:iter)
@assert nclusters(R) == 20 # verify the number of clusters
a = assignments(R) # get the assignments of points to clusters
c = counts(R) # get the cluster sizes
M = R.centers # get the cluster centers
```

```
5×20 Matrix{Float64}:
0.650222 0.353973 0.18825 0.73306 … 0.311224 0.579704 0.766426
0.81964 0.244866 0.296307 0.474165 0.314661 0.1691 0.79694
0.764112 0.208619 0.239778 0.812187 0.223546 0.712977 0.615925
0.28546 0.195898 0.370713 0.193935 0.812648 0.741689 0.713125
0.717969 0.729592 0.188958 0.201141 0.742873 0.781443 0.171877
```
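To connect the result back to the objective at the top of the page, the sum of squared distances can be recomputed from the centers and assignments alone. This is a sketch that uses only the accessors shown in the example above; the variable names `objective` and `sizes` are introduced here for illustration.

```julia
using Clustering

X = rand(5, 1000)
R = kmeans(X, 20)

a = assignments(R)  # cluster index of each point
M = R.centers       # 5×20 matrix of cluster centers

# Sum of squared distances from each point to the center of its cluster.
objective = sum(sum(abs2, X[:, i] .- M[:, a[i]]) for i in axes(X, 2))

# Cluster sizes recomputed from the assignments, for comparison with counts(R).
sizes = [count(==(k), a) for k in 1:nclusters(R)]
```

Because the data and the seeding are random, the value of `objective` differs from run to run.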

Scatter plot of the K-means clustering results:

```
using RDatasets, Clustering, Plots
iris = dataset("datasets", "iris"); # load the data
features = collect(Matrix(iris[:, 1:4])'); # features to use for clustering
result = kmeans(features, 3); # run K-means for the 3 clusters
# plot with the point color mapped to the assigned cluster index
scatter(iris.PetalLength, iris.PetalWidth, marker_z=result.assignments,
        color=:lightrainbow, legend=false)
```