# DBSCAN

Density-based Spatial Clustering of Applications with Noise (DBSCAN) is a data clustering algorithm that finds clusters through density-based expansion of seed points. The algorithm was proposed in:

Martin Ester, Hans-peter Kriegel, Jörg S, and Xiaowei Xu A density-based algorithm for discovering clusters in large spatial databases with noise. 1996.

## Density Reachability

DBSCAN's definition of a cluster is based on the concept of density reachability: a point $q$ is said to be directly density reachable by another point $p$ if the distance between them is below a specified threshold $\epsilon$ and $p$ is surrounded by sufficiently many points. Then, $q$ is considered to be density reachable by $p$ if there exists a sequence $p_1, p_2, \ldots, p_n$ such that $p_1 = p$ and $p_{i+1}$ is directly density reachable from $p_i$.

The points within DBSCAN clusters are categorized into core (or seeds) and boundary:

1. All points of the cluster core are mutually density-connected, meaning that for any two distinct points $p$ and $q$ in a core, there exists a point $o$ such that both $p$ and $q$ are density reachable from $o$.
2. If a point is density-connected to any point of a cluster core, it is also part of the core.
3. All points within the $\epsilon$-neighborhood of any core point, but not belonging to that core (i.e. not density reachable from the core), are considered cluster boundary.

## Interface

The implementation of DBSCAN algorithm provided by dbscan function supports the two ways of specifying clustering data:

• The $d \times n$ matrix of point coordinates. This is the preferred method as it uses memory- and time-efficient neighboring points queries via NearestNeighbors.jl package.
• The $n\times n$ matrix of precalculated pairwise point distances. It requires $O(n^2)$ memory and time to run.
Clustering.dbscanFunction
dbscan(points::AbstractMatrix, radius::Real;
[metric=Euclidean()],
[min_neighbors=1], [min_cluster_size=1],
[nntree_kwargs...]) -> DbscanResult

Cluster points using the DBSCAN (Density-Based Spatial Clustering of Applications with Noise) algorithm.

Arguments

• points: when metric is specified, the d×n matrix, where each column is a d-dimensional coordinate of a point; when metric=nothing, the n×n matrix of pairwise distances between the points
• radius::Real: neighborhood radius; points within this distance are considered neighbors

Optional keyword arguments to control the algorithm:

• metric (defaults to Euclidean()): the points distance metric to use, nothing means points is the n×n precalculated distance matrix
• min_neighbors::Integer (defaults to 1): the minimal number of neighbors required to assign a point to a cluster "core"
• min_cluster_size::Integer (defaults to 1): the minimal number of points in a cluster; cluster candidates with fewer points are discarded
• nntree_kwargs...: parameters (like leafsize) for the KDTree constructor

Example

points = randn(3, 10000)
# DBSCAN clustering, clusters with less than 20 points will be discarded:
clustering = dbscan(points, 0.05, min_neighbors = 3, min_cluster_size = 20)

References:

• Martin Ester, Hans-Peter Kriegel, Jörg Sander, and Xiaowei Xu, "A density-based algorithm for discovering clusters in large spatial databases with noise", KDD-1996, pp. 226–231.
• Erich Schubert, Jörg Sander, Martin Ester, Hans Peter Kriegel, and Xiaowei Xu, "DBSCAN Revisited, Revisited: Why and How You Should (Still) Use DBSCAN", ACM Transactions on Database Systems, Vol.42(3)3, pp. 1–21, https://doi.org/10.1145/3068335
Clustering.DbscanResultType
DbscanResult <: ClusteringResult

The output of dbscan function.

Fields

• clusters::Vector{DbscanCluster}: clusters, length K
• seeds::Vector{Int}: indices of the first points of each cluster's core, length K
• counts::Vector{Int}: cluster sizes (number of assigned points), length K
• assignments::Vector{Int}: vector of clusters indices, where each point was assigned to, length N
Clustering.DbscanClusterType
DbscanCluster

DBSCAN cluster, part of DbscanResult returned by dbscan function.

Fields

• size::Int: number of points in a cluster (core + boundary)
• core_indices::Vector{Int}: indices of points in the cluster core, a.k.a. seeds (have at least min_neighbors neighbors in the cluster)
• boundary_indices::Vector{Int}: indices of the cluster points outside of core