# DBSCAN

Density-based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm was proposed in:

Martin Ester, Hans-peter Kriegel, Jörg S, and Xiaowei Xu

A density-based algorithm for discovering clusters in large spatial databases with noise.1996.

## Density Reachability

DBSCAN's definition of a cluster is based on the concept of *density reachability*: a point $q$ is said to be *directly density reachable* by another point $p$ if the distance between them is below a specified threshold $\epsilon$ and $p$ is surrounded by sufficiently many points. Then, $q$ is considered to be *density reachable* by $p$ if there exists a sequence $p_1, p_2, \ldots, p_n$ such that $p_1 = p$ and $p_{i+1}$ is directly density reachable from $p_i$.

The points within DBSCAN clusters are categorized into *core* (or *seeds*) and *boundary*:

- All points of the cluster
*core*are mutually*density-connected*, meaning that for any two distinct points $p$ and $q$ in a core, there exists a point $o$ such that both $p$ and $q$ are*density reachable*from $o$. - If a point is
*density-connected*to any point of a cluster core, it is also part of the core. - All points within the $\epsilon$-neighborhood of any core point, but not belonging to that core (i.e. not
*density reachable*from the core), are considered cluster*boundary*.

## Interface

The implementation of *DBSCAN* algorithm provided by `dbscan`

function supports the two ways of specifying clustering data:

- The $d \times n$ matrix of point coordinates. This is the preferred method as it uses memory- and time-efficient neighboring points queries via NearestNeighbors.jl package.
- The $n\times n$ matrix of precalculated pairwise point distances. It requires $O(n^2)$ memory and time to run.

`Clustering.dbscan`

— Function```
dbscan(points::AbstractMatrix, radius::Real;
[metric=Euclidean()],
[min_neighbors=1], [min_cluster_size=1],
[nntree_kwargs...]) -> DbscanResult
```

Cluster `points`

using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.

**Arguments**

`points`

: when`metric`

is specified, the*d×n*matrix, where each column is a*d*-dimensional coordinate of a point; when`metric=nothing`

, the*n×n*matrix of pairwise distances between the points`radius::Real`

: neighborhood radius; points within this distance are considered neighbors

Optional keyword arguments to control the algorithm:

`metric`

(defaults to`Euclidean()`

): the points distance metric to use,`nothing`

means`points`

is the*n×n*precalculated distance matrix`min_neighbors::Integer`

(defaults to 1): the minimal number of neighbors required to assign a point to a cluster "core"`min_cluster_size::Integer`

(defaults to 1): the minimal number of points in a cluster; cluster candidates with fewer points are discarded`nntree_kwargs...`

: parameters (like`leafsize`

) for the`KDTree`

constructor

**Example**

```
points = randn(3, 10000)
# DBSCAN clustering, clusters with less than 20 points will be discarded:
clustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)
```

**References:**

- Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu,
*"A density-based algorithm for discovering clusters in large spatial databases with noise"*, KDD-1996, pp. 226–231. - Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu,
*"DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN"*, ACM Transactions on Database Systems, Vol.42(3)3, pp. 1–21, https://doi.org/10.1145/3068335

`Clustering.DbscanResult`

— Type`DbscanResult <: ClusteringResult`

The output of `dbscan`

function.

**Fields**

`clusters::Vector{DbscanCluster}`

: clusters, length*K*`seeds::Vector{Int}`

: indices of the first points of each cluster's*core*, length*K*`counts::Vector{Int}`

: cluster sizes (number of assigned points), length*K*`assignments::Vector{Int}`

: vector of clusters indices, where each point was assigned to, length*N*

`Clustering.DbscanCluster`

— Type`DbscanCluster`

DBSCAN cluster, part of `DbscanResult`

returned by `dbscan`

function.

**Fields**

`size::Int`

: number of points in a cluster (core + boundary)`core_indices::Vector{Int}`

: indices of points in the cluster*core*, a.k.a.*seeds*(have at least`min_neighbors`

neighbors in the cluster)`boundary_indices::Vector{Int}`

: indices of the cluster points outside of*core*