# Hierarchical Clustering

Hierarchical clustering algorithms build a dendrogram of nested clusters by repeatedly merging or splitting clusters.

The `hclust`

function implements several classical algorithms for hierarchical clustering (the algorithm to use is defined by the `linkage`

parameter):

`Clustering.hclust`

— Function`hclust(d::AbstractMatrix; [linkage], [uplo], [branchorder]) -> Hclust`

Perform hierarchical clustering using the distance matrix `d`

and the cluster `linkage`

function.

Returns the dendrogram as a `Hclust`

object.

**Arguments**

`d::AbstractMatrix`

: the pairwise distance matrix. $d_{ij}$ is the distance between $i$-th and $j$-th points.`linkage::Symbol`

:*cluster linkage*function to use.`linkage`

defines how the distances between the data points are aggregated into the distances between the clusters. Naturally, it affects what clusters are merged on each iteration. The valid choices are:`:single`

(the default): use the minimum distance between any of the cluster members`:average`

: use the mean distance between any of the cluster members`:complete`

: use the maximum distance between any of the members`:ward`

: the distance is the increase of the average squared distance of a point to its cluster centroid after merging the two clusters`:ward_presquared`

: same as`:ward`

, but assumes that the distances in`d`

are already squared.

`uplo::Symbol`

(optional): specifies whether the upper (`:U`

) or the lower (`:L`

) triangle of`d`

should be used to get the distances. If not specified, the method expects`d`

to be symmetric.`branchorder::Symbol`

(optional): algorithm to order leaves and branches. The valid choices are:`:r`

(the default): ordering based on the node heights and the original elements order (compatible with R's`hclust`

)`:barjoseph`

(or`:optimal`

): branches are ordered to reduce the distance between neighboring leaves from separate branches using the "fast optimal leaf ordering" algorithm from Bar-Joseph et. al.*Bioinformatics*(2001)

`Clustering.Hclust`

— Type`Hclust{T<:Real}`

The output of `hclust`

, hierarchical clustering of data points.

Provides the bottom-up definition of the dendrogram as the sequence of merges of the two lower subtrees into a higher level subtree.

This type mostly follows R's `hclust`

class.

**Fields**

`merges::Matrix{Int}`

: $N×2$ matrix encoding subtree merges:- each row specifies the left and right subtrees (referenced by their $id$s) that are merged
- negative subtree $id$ denotes the leaf node and corresponds to the data point at position $-id$
- positive $id$ denotes nontrivial subtree (the row
`merges[id, :]`

specifies its left and right subtrees)

`linkage::Symbol`

: the name of*cluster linkage*function used to construct the hierarchy (see`hclust`

)`heights::Vector{T}`

: subtree heights, i.e. the distances between the left and right branches of each subtree calculated using the specified`linkage`

`order::Vector{Int}`

: the data point indices ordered so that there are no intersecting branches on the*dendrogram*plot. This ordering also puts the points of the same cluster close together.

See also: `hclust`

.

Single-linkage clustering using distance matrix:

```
using Clustering
D = rand(1000, 1000);
D += D'; # symmetric distance matrix (optional)
result = hclust(D, linkage=:single)
```

`Hclust{Float64}([-694 -728; -15 -678; … ; -710 997; -208 998], [0.0019407855241330152, 0.0020475745306123283, 0.0023304148178595607, 0.0025500784666548926, 0.0028398556955293586, 0.003048644855992766, 0.003554800697222582, 0.0037321446752289766, 0.004679267584573821, 0.005285454266108158 … 0.09972453365379408, 0.10004349874981466, 0.10050051245074743, 0.1009276708358513, 0.10188237494815844, 0.10337230023167143, 0.10628519117508228, 0.11774139137820017, 0.11883459645782435, 0.1389806774394966], [208, 710, 593, 521, 484, 654, 863, 926, 671, 16 … 428, 746, 993, 222, 231, 566, 24, 870, 223, 689], :single)`

The resulting dendrogram could be converted into disjoint clusters with the help of `cutree`

function.

`Clustering.cutree`

— Function`cutree(hclu::Hclust; [k], [h]) -> Vector{Int}`

Cut the `hclu`

dendrogram to produce clusters at the specified level of granularity.

Returns the cluster assignments vector $z$ ($z_i$ is the index of the cluster for the $i$-th data point).

**Arguments**

`k::Integer`

(optional) the number of desired clusters.`h::Real`

(optional) the height at which the tree is cut.

If both `k`

and `h`

are specified, it's guaranteed that the number of clusters is not less than `k`

and their height is not above `h`

.

See also: `hclust`