UMAP.UMAP_ - Method
UMAP_(X::AbstractMatrix[, n_components=2]; <kwargs>) -> UMAP_ object
Create a model representing the embedding of data X into n_components-dimensional space. The returned model has the following fields:
- graph: the graph representing the fuzzy simplicial set of the manifold of X.
- embedding: the n_components-dimensional embedding of the data X.
- data: a reference to the input data X.
- knns: a matrix of indices of X representing each point's nearest neighbors according to metric. knns[j, i] is the index of point i's jth nearest neighbor.
- dists: the respective distances of the above neighbors. dists[j, i] is the distance of point i's jth nearest neighbor.
Keyword Arguments
- n_neighbors::Integer = 15: the number of neighbors to consider as locally connected. Larger values capture more global structure in the data, while smaller values capture more local structure.
- metric::{SemiMetric, Symbol} = Euclidean(): the metric to calculate distance in the input space. It is also possible to pass metric = :precomputed to treat X like a precomputed distance matrix.
- n_epochs::Integer = 300: the number of training epochs for embedding optimization.
- learning_rate::Real = 1: the initial learning rate during optimization.
- init::Symbol = :spectral: how to initialize the output embedding; valid options are :spectral and :random.
- min_dist::Real = 0.1: the minimum spacing of points in the output embedding.
- spread::Real = 1: the effective scale of embedded points. Together with min_dist, this determines how clustered the embedded points are.
- set_operation_ratio::Real = 1: interpolates between fuzzy set union and fuzzy set intersection when constructing the UMAP graph (global fuzzy simplicial set). The value of this parameter should be between 0.0 and 1.0: 1.0 indicates pure fuzzy union, while 0.0 indicates pure fuzzy intersection.
- local_connectivity::Integer = 1: the number of nearest neighbors that should be assumed to be locally connected. The higher this value, the more connected the manifold becomes. This should not be set higher than the intrinsic dimension of the manifold.
- repulsion_strength::Real = 1: the weighting of negative samples during the optimization process.
- neg_sample_rate::Integer = 5: the number of negative samples to select for each positive sample. Higher values increase computational cost but yield a slightly more accurate embedding.
- a = nothing: the parameter a of the smooth membership curve used in the embedding (see fit_ab). By default, this is determined automatically from min_dist and spread.
- b = nothing: the parameter b of the smooth membership curve used in the embedding (see fit_ab). By default, this is determined automatically from min_dist and spread.
UMAP.compute_membership_strengths - Method
compute_membership_strengths(knns, dists, σs, ρs) -> rows, cols, vals
Compute the membership strengths for the 1-skeleton of each fuzzy simplicial set.
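As a rough illustration of what the three returned vectors can hold, the sketch below builds coordinate-format entries from knns and dists. The helper name membership_strengths_sketch is hypothetical, and the exponential form of the weight (strength 1 within ρ, decay at scale σ beyond it) is an assumption based on the usual UMAP formulation, not a statement about this package's internals.

```julia
# Hypothetical helper (not the package's implementation): membership
# strengths in coordinate (rows, cols, vals) form.
function membership_strengths_sketch(knns::AbstractMatrix{<:Integer},
                                     dists::AbstractMatrix{<:Real},
                                     σs::AbstractVector{<:Real},
                                     ρs::AbstractVector{<:Real})
    rows, cols, vals = Int[], Int[], Float64[]
    for i in axes(knns, 2), j in axes(knns, 1)
        # Assumed form: distances within ρ of the point get strength 1,
        # and the strength decays exponentially at scale σ beyond that.
        w = exp(-max(dists[j, i] - ρs[i], 0.0) / σs[i])
        push!(rows, knns[j, i])   # row: the neighbor's index
        push!(cols, i)            # column: the point itself
        push!(vals, w)
    end
    return rows, cols, vals
end

knns  = [2 1; 3 3]          # knns[j, i]: jth neighbor of point i
dists = [1.0 1.0; 3.0 2.0]  # dists[j, i]: matching distances
rows, cols, vals = membership_strengths_sketch(knns, dists, [2.0, 1.0], [1.0, 1.0])
```

The triple can be passed to SparseArrays.sparse(rows, cols, vals, n_points, size(knns, 2)) to obtain a sparse matrix.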
UMAP.fit_ab - Method
fit_ab(min_dist, spread, _a, _b) -> a, b
Find a smooth approximation to the membership function of points embedded in ℜᵈ. This fits a smooth curve that approximates an exponential decay offset by min_dist, returning the parameters (a, b).
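To make the fit concrete, here is a minimal sketch that assumes the target is 1 up to min_dist and exp(-(d - min_dist) / spread) afterwards, and that the smooth family is 1 / (1 + a * d^(2b)). Both forms are assumptions taken from the usual UMAP description. The helper fit_ab_sketch is hypothetical and uses a brute-force grid search as a stand-in for whatever nonlinear least-squares routine the package calls.

```julia
# Target curve: flat at 1 up to min_dist, then exponential decay (assumed form).
target(d, min_dist, spread) = d <= min_dist ? 1.0 : exp(-(d - min_dist) / spread)

# Smooth family fitted to the target (assumed form).
smooth(d, a, b) = 1 / (1 + a * d^(2b))

# Hypothetical stand-in for the fit: least squares over a coarse (a, b) grid.
function fit_ab_sketch(min_dist, spread)
    ds = range(0, 3 * spread; length=300)
    ys = target.(ds, min_dist, spread)
    best = (Inf, NaN, NaN)                  # (error, a, b)
    for a in 0.5:0.01:3.0, b in 0.5:0.01:1.5
        err = sum(abs2, smooth.(ds, a, b) .- ys)
        if err < best[1]
            best = (err, a, b)
        end
    end
    return best[2], best[3]
end

a, b = fit_ab_sketch(0.1, 1.0)
```

The grid resolution limits the precision of the result; a real optimizer would refine a and b further.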
UMAP.fuzzy_simplicial_set - Function
fuzzy_simplicial_set(knns, dists, n_neighbors, n_points, local_connectivity, set_op_ratio, apply_fuzzy_combine=true) -> membership_graph::SparseMatrixCSC
Construct the local fuzzy simplicial sets of each point represented by its distances to its n_neighbors nearest neighbors, stored in knns and dists, normalizing the distances on the manifolds, and converting the metric space to a simplicial set. n_points indicates the total number of points of the original data, while knns contains indices of some subset of those points (i.e., some subset of 1:n_points). If knns represents neighbors of the elements of some set with itself, then knns should have n_points columns. Otherwise, these two values may differ. If apply_fuzzy_combine is true, use intersections and unions to combine fuzzy sets of neighbors (default true).
The returned graph will have size (n_points, size(knns, 2)).
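The combination step controlled by set_op_ratio can be sketched as follows for a square membership graph. The helper combine_fuzzy_sets_sketch is hypothetical. It assumes the fuzzy union is the probabilistic sum and the fuzzy intersection is the elementwise product, which is the standard UMAP choice but is not confirmed here for this package.

```julia
using SparseArrays

# Hypothetical helper: blend the fuzzy union and fuzzy intersection of a
# square membership graph with its transpose.
function combine_fuzzy_sets_sketch(fs_set::SparseMatrixCSC, set_op_ratio::Real)
    fs_t = sparse(transpose(fs_set))
    both = fs_set .* fs_t               # fuzzy intersection (product)
    either = fs_set .+ fs_t .- both     # fuzzy union (probabilistic sum)
    return set_op_ratio .* either .+ (1 - set_op_ratio) .* both
end

A = sparse([1, 2, 3], [2, 1, 1], [0.5, 0.8, 0.4], 3, 3)
U = combine_fuzzy_sets_sketch(A, 1.0)   # pure fuzzy union
N = combine_fuzzy_sets_sketch(A, 0.0)   # pure fuzzy intersection
```

With a ratio of 1.0 an edge present in only one direction is kept, while a ratio of 0.0 keeps only edges present in both directions.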
UMAP.initialize_embedding - Method
initialize_embedding(graph::AbstractMatrix{<:Real}, ref_embedding::AbstractMatrix{T<:AbstractFloat}) -> embedding
Initialize an embedding of points corresponding to the columns of the graph, by taking weighted means of the columns of ref_embedding, where weights are values from the rows of the graph.
The resulting embedding will have shape (size(ref_embedding, 1), size(graph, 2)), where size(ref_embedding, 1) is the number of components (dimensions) of the reference embedding, and size(graph, 2) is the number of samples in the resulting embedding. Its elements will have type T.
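The weighted mean described above reduces to one matrix product and a column-wise normalization. The sketch below uses a hypothetical helper name and assumes every column of the graph has a nonzero weight sum; it illustrates the arithmetic rather than reproducing the package code.

```julia
# Hypothetical helper: each output column is the weighted mean of the
# reference columns, weighted by the matching column of `graph`.
function weighted_mean_init_sketch(graph::AbstractMatrix{<:Real},
                                   ref_embedding::AbstractMatrix{T}) where {T<:AbstractFloat}
    weight_sums = sum(graph; dims=1)    # 1 × n_query, assumed nonzero
    return T.((ref_embedding * graph) ./ weight_sums)
end

ref_embedding = [0.0 2.0; 0.0 4.0]   # 2 components × 2 reference points
graph = [1.0 0.5; 0.0 0.5]           # 2 reference points × 2 query points
embedding = weighted_mean_init_sketch(graph, ref_embedding)
```

The first query point is attached only to the first reference point and lands on it; the second is attached equally to both and lands at their midpoint.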
UMAP.knn_search - Method
knn_search(X, Q, k, metric, knns, dists) -> knns, dists
Given a matrix X and a matrix Q, use the given metric to compute the k nearest neighbors out of the columns of X from the queries (columns in Q). If the matrices are large, reconstruct the approximate nearest neighbors graph of X using the given knns and dists, representing indices and distances of pairwise neighbors of X, and use this to search for approximate nearest neighbors of Q. If the matrices are small, search for exact nearest neighbors of Q by computing all pairwise distances with X.
metric may be of type:
- ::Symbol - knn_search is dispatched to one of the following based on the evaluation of metric:
- ::Val(:precomputed) - computes neighbors from X treated as a precomputed distance matrix.
- ::SemiMetric - computes neighbors from X treated as samples, using the given metric.
Returns
- knns: knns[j, i] is the index of node i's jth nearest neighbor.
- dists: dists[j, i] is the distance of node i's jth nearest neighbor.
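The exact branch for small matrices amounts to a brute-force search. The sketch below shows that idea with a hypothetical helper, exact_knn_sketch, fixed to Euclidean distance. It illustrates the knns[j, i] and dists[j, i] layout and is not the package's code path.

```julia
using LinearAlgebra

# Hypothetical helper: exact k-nearest-neighbor search of the columns of Q
# among the columns of X, using Euclidean distance.
function exact_knn_sketch(X::AbstractMatrix, Q::AbstractMatrix, k::Integer)
    n_queries = size(Q, 2)
    knns = Matrix{Int}(undef, k, n_queries)
    dists = Matrix{Float64}(undef, k, n_queries)
    for i in 1:n_queries
        d = [norm(view(X, :, p) .- view(Q, :, i)) for p in axes(X, 2)]
        nearest = partialsortperm(d, 1:k)   # indices of the k smallest distances
        knns[:, i] = nearest
        dists[:, i] = d[nearest]
    end
    return knns, dists
end

X = [0.0 1.0 5.0; 0.0 0.0 0.0]   # three reference points as columns
Q = [0.9 4.0; 0.0 0.0]           # two query points as columns
knns, dists = exact_knn_sketch(X, Q, 2)
```

Column i of each result describes query i, with neighbors ordered from nearest to farthest.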
UMAP.knn_search - Method
knn_search(X, k, metric) -> knns, dists
Find the k nearest neighbors of each point.
metric may be of type:
- ::Symbol - knn_search is dispatched to one of the following based on the evaluation of metric:
- ::Val(:precomputed) - computes neighbors from X treated as a precomputed distance matrix.
- ::SemiMetric - computes neighbors from X treated as samples, using the given metric.
Returns
- knns: knns[j, i] is the index of node i's jth nearest neighbor.
- dists: dists[j, i] is the distance of node i's jth nearest neighbor.
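For the :precomputed case, neighbors can be read directly off the columns of the distance matrix. The following sketch uses a hypothetical helper, knn_from_dists_sketch, and assumes that a point is not counted as its own neighbor; whether the package makes the same choice is not stated above.

```julia
# Hypothetical helper: k nearest neighbors of every point from a precomputed
# square distance matrix D, skipping each point's zero distance to itself.
function knn_from_dists_sketch(D::AbstractMatrix{<:Real}, k::Integer)
    n = size(D, 2)
    knns = Matrix{Int}(undef, k, n)
    dists = Matrix{Float64}(undef, k, n)
    for i in 1:n
        others = [p for p in 1:n if p != i]     # exclude self (assumption)
        order = sortperm(D[others, i])[1:k]     # k closest among the others
        knns[:, i] = others[order]
        dists[:, i] = D[others[order], i]
    end
    return knns, dists
end

D = [0.0 1.0 4.0;
     1.0 0.0 2.0;
     4.0 2.0 0.0]
knns, dists = knn_from_dists_sketch(D, 2)
```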
UMAP.optimize_embedding - Function
optimize_embedding(graph, query_embedding, ref_embedding, n_epochs, initial_alpha, min_dist, spread, gamma, neg_sample_rate, _a=nothing, _b=nothing; move_ref=false) -> embedding
Optimize an embedding by minimizing the fuzzy set cross entropy between the high and low dimensional simplicial sets using stochastic gradient descent. Optimize "query" samples with respect to "reference" samples.
Arguments
- graph: a sparse matrix of shape (n_samples, n_samples).
- query_embedding: a vector of length (n_samples,) of vectors representing the embedded data points to be optimized ("query" samples).
- ref_embedding: a vector of length (n_samples,) of vectors representing the embedded data points to optimize against ("reference" samples).
- n_epochs: the number of training epochs for optimization.
- initial_alpha: the initial learning rate.
- gamma: the repulsive strength of negative samples.
- neg_sample_rate: the number of negative samples per positive sample.
- _a: the parameter a of the smooth membership curve used in the embedding (see fit_ab). If the actual argument is nothing, this is determined automatically from min_dist and spread.
- _b: the parameter b of the smooth membership curve used in the embedding (see fit_ab). If the actual argument is nothing, this is determined automatically from min_dist and spread.
Keyword Arguments
- move_ref::Bool = false: if true, also improve the embeddings in ref_embedding; otherwise fix them and only improve the embeddings in query_embedding.
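The objective being minimized can be written down directly. The sketch below evaluates a fuzzy set cross entropy for a small graph and a vector-of-vectors embedding. It assumes the low-dimensional membership has the form 1 / (1 + a * d^(2b)); the helper names and the values a = b = 1 are illustrative assumptions. It computes the full objective rather than the sampled stochastic gradient steps the optimizer takes.

```julia
# Low-dimensional membership between two embedded points (assumed form);
# sum(abs2, ...)^b equals the Euclidean distance raised to the power 2b.
low_dim_weight(y1, y2, a, b) = 1 / (1 + a * sum(abs2, y1 .- y2)^b)

# Hypothetical helper: fuzzy set cross entropy between the graph weights and
# the memberships implied by the embedding.
function cross_entropy_sketch(graph::AbstractMatrix{<:Real}, embedding::AbstractVector, a, b)
    clamp01(x) = clamp(x, 1e-12, 1 - 1e-12)    # keep the logarithms finite
    total = 0.0
    n = length(embedding)
    for j in 1:n, i in 1:n
        i == j && continue
        v = clamp01(graph[i, j])
        w = clamp01(low_dim_weight(embedding[i], embedding[j], a, b))
        total += v * log(v / w) + (1 - v) * log((1 - v) / (1 - w))
    end
    return total
end

graph = [0.0 0.9 0.0; 0.9 0.0 0.0; 0.0 0.0 0.0]    # points 1 and 2 are connected
good = [[0.0, 0.0], [0.1, 0.0], [5.0, 0.0]]        # connected points placed together
bad  = [[0.0, 0.0], [5.0, 0.0], [0.1, 0.0]]        # connected points placed apart
good_ce = cross_entropy_sketch(graph, good, 1.0, 1.0)
bad_ce  = cross_entropy_sketch(graph, bad, 1.0, 1.0)
```

The layout that keeps the connected pair together scores a lower cross entropy, which is the direction the optimization pushes the query points.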
UMAP.smooth_knn_dists - Method
smooth_knn_dists(dists, k, local_connectivity; <kwargs>) -> knn_dists, nn_dists
Compute the distances to the nearest neighbors for a continuous value k. Returns the approximated distances to the kth nearest neighbor (knn_dists) and the nearest neighbor (nn_dists) from each point.
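One common way to obtain such a smoothed distance is a bisection search for a scale σ at which the total membership of a point's neighbors equals log2(k). The sketch below does this for a single point. The helper name, the log2(k) target, and taking ρ as the plain nearest-neighbor distance (the local_connectivity = 1 case) are assumptions drawn from the general UMAP algorithm, not from this package's source.

```julia
# Hypothetical helper for a single point: find σ such that
# sum(exp(-max(d - ρ, 0) / σ)) over the neighbor distances equals log2(k).
function smooth_knn_dist_sketch(dists::AbstractVector{<:Real}, k::Real; n_iter::Integer=64)
    target = log2(k)
    ρ = minimum(dists)                  # distance to the nearest neighbor
    lo, hi, σ = 0.0, Inf, 1.0
    for _ in 1:n_iter
        psum = sum(exp(-max(d - ρ, 0.0) / σ) for d in dists)
        if psum > target
            hi = σ                      # too much total membership: shrink σ
            σ = (lo + hi) / 2
        else
            lo = σ                      # too little: grow σ
            σ = isinf(hi) ? 2 * σ : (lo + hi) / 2
        end
    end
    return σ, ρ
end

dists = [1.0, 2.0, 3.0, 4.0]
σ, ρ = smooth_knn_dist_sketch(dists, 4)
```

Here σ plays the role of one entry of knn_dists and ρ of one entry of nn_dists.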
UMAP.spectral_layout - Method
spectral_layout(graph, embed_dim) -> embedding
Initialize the graph layout with spectral embedding.
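A spectral embedding can be sketched with a dense eigendecomposition of the symmetric normalized Laplacian. The helper spectral_layout_sketch is hypothetical. The choice of Laplacian and the dense solver are assumptions made for a small, self-contained example; a practical implementation would likely use a sparse eigensolver.

```julia
using LinearAlgebra

# Hypothetical helper: spectral embedding from the symmetric normalized
# Laplacian of a symmetric, nonnegative weight matrix.
function spectral_layout_sketch(graph::AbstractMatrix{<:Real}, embed_dim::Integer)
    A = Matrix{Float64}(graph)
    inv_sqrt_deg = 1 ./ sqrt.(vec(sum(A; dims=2)))
    L = I - Diagonal(inv_sqrt_deg) * A * Diagonal(inv_sqrt_deg)
    decomposition = eigen(Symmetric(L))     # eigenvalues in ascending order
    # Skip the first (trivial) eigenvector and keep the next embed_dim.
    return permutedims(decomposition.vectors[:, 2:embed_dim+1])
end

# Two triangles joined by one weak edge between nodes 3 and 4.
graph = zeros(6, 6)
for (i, j, w) in [(1, 2, 1.0), (1, 3, 1.0), (2, 3, 1.0),
                  (4, 5, 1.0), (4, 6, 1.0), (5, 6, 1.0), (3, 4, 0.1)]
    graph[i, j] = graph[j, i] = w
end
embedding = spectral_layout_sketch(graph, 2)   # size (embed_dim, n_points)
```

The first embedding coordinate separates the two triangles by sign, which is why a spectral layout makes a reasonable starting point for the optimization.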
UMAP.transform - Method
transform(model::UMAP_, Q::AbstractMatrix; <kwargs>) -> embedding
Use the given model to embed new points into an existing embedding. Q is a matrix of some number of points (columns) in the same space as model.data. The returned embedding is the embedding of these points in n-dimensional space, where n is the dimensionality of model.embedding. This embedding is created by finding neighbors of Q in model.data and optimizing cross entropy according to the membership strengths of these neighbors.
Keyword Arguments
- n_neighbors::Integer = 15: the number of neighbors to consider as locally connected. Larger values capture more global structure in the data, while smaller values capture more local structure.
- metric::{SemiMetric, Symbol} = Euclidean(): the metric to calculate distance in the input space. It is also possible to pass metric = :precomputed to treat X like a precomputed distance matrix.
- n_epochs::Integer = 300: the number of training epochs for embedding optimization.
- learning_rate::Real = 1: the initial learning rate during optimization.
- init::Symbol = :spectral: how to initialize the output embedding; valid options are :spectral and :random.
- min_dist::Real = 0.1: the minimum spacing of points in the output embedding.
- spread::Real = 1: the effective scale of embedded points. Together with min_dist, this determines how clustered the embedded points are.
- set_operation_ratio::Real = 1: interpolates between fuzzy set union and fuzzy set intersection when constructing the UMAP graph (global fuzzy simplicial set). The value of this parameter should be between 0.0 and 1.0: 1.0 indicates pure fuzzy union, while 0.0 indicates pure fuzzy intersection.
- local_connectivity::Integer = 1: the number of nearest neighbors that should be assumed to be locally connected. The higher this value, the more connected the manifold becomes. This should not be set higher than the intrinsic dimension of the manifold.
- repulsion_strength::Real = 1: the weighting of negative samples during the optimization process.
- neg_sample_rate::Integer = 5: the number of negative samples to select for each positive sample. Higher values increase computational cost but yield a slightly more accurate embedding.
- a = nothing: the parameter a of the smooth membership curve used in the embedding (see fit_ab). By default, this is determined automatically from min_dist and spread.
- b = nothing: the parameter b of the smooth membership curve used in the embedding (see fit_ab). By default, this is determined automatically from min_dist and spread.
UMAP.umap - Method
umap(X::AbstractMatrix[, n_components=2]; <kwargs>) -> embedding
Embed the data X into an n_components-dimensional space. n_neighbors controls how many neighbors to consider as locally connected.
See UMAP_ for a description of keyword arguments.
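A short usage sketch ties together umap, UMAP_, and transform as documented on this page. It assumes the UMAP package is installed, that points are stored as columns, and that the returned embeddings are matrices with one column per point; the random data and the chosen sizes are purely illustrative.

```julia
using UMAP
using Random

Random.seed!(0)
X = rand(Float64, 10, 200)   # 200 points (columns) in 10 dimensions
Q = rand(Float64, 10, 20)    # 20 new points in the same space

# One-shot embedding into 2 dimensions.
embedding = umap(X, 2; n_neighbors=15, min_dist=0.1)

# Keep the fitted model so that new points can be embedded later.
model = UMAP_(X, 2; n_neighbors=15)
new_embedding = transform(model, Q; n_neighbors=15)
```

Use umap when only the embedding is needed, and UMAP_ when the graph, neighbor indices, or a later call to transform are required.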