Docstrings · UMAP.jl

UMAP.UMAP_ — Method

UMAP_(X::AbstractMatrix[, n_components=2]; <kwargs>) -> UMAP_ object

Create a model representing the embedding of data X into n_components-dimensional space. The returned model has the following fields:

graph: the graph representing the fuzzy simplicial set of the manifold of X.
embedding: the n_components-dimensional embedding of the data X.
data: a reference to the input data X.
knns: a matrix of indices of X representing each point's nearest neighbors according to metric. knns[j, i] is the index of point i's jth nearest neighbor.
dists: the distances to the above neighbors. dists[j, i] is the distance from point i to its jth nearest neighbor.
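
The knns[j, i] / dists[j, i] layout can be illustrated with a brute-force search. This is a minimal sketch, not the package's implementation: brute_force_knn is a hypothetical helper, it assumes Euclidean distance, and it assumes a point is never counted as its own neighbor.

```julia
using LinearAlgebra

# Hypothetical helper (not part of UMAP.jl): exact k-nearest neighbors by brute force.
# Columns of X are points; column i of the outputs describes point i.
function brute_force_knn(X::AbstractMatrix, k::Integer)
    n = size(X, 2)
    knns = Matrix{Int}(undef, k, n)
    dists = Matrix{Float64}(undef, k, n)
    for i in 1:n
        d = [norm(X[:, i] - X[:, j]) for j in 1:n]
        d[i] = Inf                      # assumption: exclude the point itself
        order = sortperm(d)[1:k]        # indices of the k closest points
        knns[:, i] = order
        dists[:, i] = d[order]
    end
    return knns, dists
end

X = [0.0 1.0 3.0 7.0;
     0.0 0.0 0.0 0.0]                   # four points on a line
knns, dists = brute_force_knn(X, 2)
```

Here knns[1, 4] is the index of the point closest to the fourth point, and dists[1, 4] is the corresponding distance.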

Keyword Arguments

n_neighbors::Integer = 15: the number of neighbors to consider as locally connected. Larger values capture more global structure in the data, while smaller values capture more local structure.
metric::{SemiMetric, Symbol} = Euclidean(): the metric to calculate distance in the input space. It is also possible to pass metric = :precomputed to treat X like a precomputed distance matrix.
n_epochs::Integer = 300: the number of training epochs for embedding optimization
learning_rate::Real = 1: the initial learning rate during optimization
init::Symbol = :spectral: how to initialize the output embedding; valid options are :spectral and :random
min_dist::Real = 0.1: the minimum spacing of points in the output embedding
spread::Real = 1: the effective scale of embedded points. Determines how clustered embedded points are in combination with min_dist.
set_operation_ratio::Real = 1: interpolates between fuzzy set union and fuzzy set intersection when constructing the UMAP graph (global fuzzy simplicial set). The value should lie between 0.0 and 1.0: 1.0 indicates pure fuzzy union, while 0.0 indicates pure fuzzy intersection.
local_connectivity::Integer = 1: the number of nearest neighbors that should be assumed to be locally connected. The higher this value, the more connected the manifold becomes. This should not be set higher than the intrinsic dimension of the manifold.
repulsion_strength::Real = 1: the weighting of negative samples during the optimization process.
neg_sample_rate::Integer = 5: the number of negative samples to select for each positive sample. Higher values increase computational cost but yield slightly higher accuracy.
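a = nothing: the first parameter of the smooth membership curve used in the embedding space (see fit_ab). By default, it is determined automatically from min_dist and spread.
b = nothing: the second parameter of the smooth membership curve used in the embedding space (see fit_ab). By default, it is determined automatically from min_dist and spread.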

UMAP.compute_membership_strengths — Method

compute_membership_strengths(knns, dists, σs, ρs) -> rows, cols, vals

Compute the membership strengths for the 1-skeleton of each fuzzy simplicial set.
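
As a rough illustration, the strength of each neighbor edge can be written as an exponential decay of the distance beyond ρ, scaled by σ. The sketch below assumes that form, assumes σs and ρs hold one value per point, and uses a hypothetical function name; it returns the triplets of a sparse matrix as in the signature above.

```julia
# Hypothetical sketch: one (row, col, val) triplet per neighbor edge.
# Assumed strength formula: exp(-max(d - ρ, 0) / σ).
function membership_strengths_sketch(knns, dists, σs, ρs)
    rows, cols, vals = Int[], Int[], Float64[]
    for i in axes(knns, 2), j in axes(knns, 1)
        push!(rows, knns[j, i])         # neighbor index
        push!(cols, i)                  # point index
        push!(vals, exp(-max(dists[j, i] - ρs[i], 0.0) / σs[i]))
    end
    return rows, cols, vals
end

knns = [2 1 2; 3 3 1]
dists = [1.0 1.0 2.0; 3.0 2.0 3.0]
rows, cols, vals = membership_strengths_sketch(knns, dists, ones(3), [1.0, 1.0, 2.0])
```

With ρ equal to the nearest-neighbor distance, the nearest neighbor of every point gets strength 1.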

UMAP.fit_ab — Method

fit_ab(min_dist, spread, _a, _b) -> a, b

Find a smooth approximation to the membership function of points embedded in ℜᵈ. This fits a smooth curve that approximates an exponential decay offset by min_dist, returning the parameters (a, b).
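
The fit can be pictured as follows. This sketch assumes the smooth family 1 / (1 + a x^(2b)) and a target that equals 1 up to min_dist and decays exponentially with scale spread beyond it. A coarse grid search stands in for a proper least-squares solver, and all names here are hypothetical.

```julia
# Target curve: 1 inside min_dist, exponential decay (scale `spread`) beyond it.
target(x, min_dist, spread) = x <= min_dist ? 1.0 : exp(-(x - min_dist) / spread)

# Assumed smooth family; (a, b) are the parameters being sought.
smooth_curve(x, a, b) = 1 / (1 + a * x^(2b))

# Hypothetical stand-in for a least-squares fit: coarse grid search over (a, b).
function fit_ab_sketch(min_dist, spread)
    xs = range(0, 3spread; length=300)
    ys = target.(xs, min_dist, spread)
    best = (Inf, NaN, NaN)              # (sum of squared errors, a, b)
    for a in 0.5:0.01:3.0, b in 0.5:0.01:1.5
        sse = sum(abs2, smooth_curve.(xs, a, b) .- ys)
        if sse < best[1]
            best = (sse, a, b)
        end
    end
    return best[2], best[3]
end

a, b = fit_ab_sketch(0.1, 1.0)
```

The grid bounds and step are arbitrary choices for the sketch; a real fit would use a nonlinear least-squares routine.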

UMAP.fuzzy_simplicial_set — Function

fuzzy_simplicial_set(knns, dists, n_neighbors, n_points, local_connectivity, set_op_ratio, apply_fuzzy_combine=true) -> membership_graph::SparseMatrixCSC

Construct the local fuzzy simplicial set of each point from the distances to its n_neighbors nearest neighbors, stored in knns and dists, by normalizing the distances on the manifold and converting the metric space to a simplicial set. n_points is the total number of points in the original data, while knns contains indices of some subset of those points (i.e. some subset of 1:n_points). If knns represents the neighbors of a set's elements within that same set, then knns should have n_points columns; otherwise the two values may differ. If apply_fuzzy_combine is true (the default), the fuzzy sets of neighbors are combined using intersections and unions.

The returned graph will have size (n_points, size(knns, 2)).
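
The role of set_op_ratio in the combination step can be sketched on a small dense matrix. The sketch assumes the usual fuzzy-set operations (product for intersection, probabilistic sum for union) and a square membership matrix; the function name is hypothetical.

```julia
# A[i, j] is the directed membership strength between points i and j.
function combine_fuzzy_sets_sketch(A::AbstractMatrix, set_op_ratio::Real)
    At = transpose(A)
    fuzzy_and = A .* At                 # fuzzy intersection (product)
    fuzzy_or = A .+ At .- fuzzy_and     # fuzzy union (probabilistic sum)
    return set_op_ratio .* fuzzy_or .+ (1 - set_op_ratio) .* fuzzy_and
end

A = [0.0 0.5; 1.0 0.0]
pure_union = combine_fuzzy_sets_sketch(A, 1.0)
pure_intersection = combine_fuzzy_sets_sketch(A, 0.0)
```

Either extreme yields a symmetric matrix, so the combined graph no longer depends on the direction in which a neighbor relation was found.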

UMAP.initialize_embedding — Method

initialize_embedding(graph::AbstractMatrix{<:Real}, ref_embedding::AbstractMatrix{T<:AbstractFloat}) -> embedding

Initialize an embedding of points corresponding to the columns of the graph, by taking weighted means of the columns of ref_embedding, where weights are values from the rows of the graph.

The resulting embedding will have shape (size(ref_embedding, 1), size(graph, 2)), where size(ref_embedding, 1) is the number of components (dimensions) of the reference embedding, and size(graph, 2) is the number of samples in the resulting embedding. Its elements will have type T.
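
The weighted mean described above reduces to a matrix product followed by a column-wise normalization. The sketch below is a hypothetical helper under that reading; the small eps(T) term is an added assumption to avoid division by zero for columns without any weight.

```julia
function weighted_mean_init(graph::AbstractMatrix{<:Real},
                            ref_embedding::AbstractMatrix{T}) where {T<:AbstractFloat}
    # Column i of the result is the mean of the reference columns,
    # weighted by column i of the graph.
    embedding = (ref_embedding * graph) ./ (sum(graph, dims=1) .+ eps(T))
    return convert(Matrix{T}, embedding)
end

ref_embedding = [0.0 2.0;
                 0.0 4.0]               # two reference points: (0, 0) and (2, 4)
graph = [1.0 0.5;
         1.0 0.0]                       # rows: reference points, columns: new points
embedding = weighted_mean_init(graph, ref_embedding)
```

The first new point is weighted equally by both reference points and lands at their midpoint; the second only has weight on the first reference point and lands on it.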

UMAP.knn_search — Method

knn_search(X, Q, k, metric, knns, dists) -> knns, dists

Given a matrix X and a matrix Q, use the given metric to find, for each query (column of Q), its k nearest neighbors among the columns of X. If the matrices are large, reconstruct the approximate nearest neighbors graph of X using the given knns and dists, representing indices and distances of pairwise neighbors of X, and use this to search for approximate nearest neighbors of Q. If the matrices are small, search for exact nearest neighbors of Q by computing all pairwise distances with X.

metric may be of type:

::Symbol - knn_search is dispatched to one of the following based on the evaluation of metric:
::Val(:precomputed) - computes neighbors from X treated as a precomputed distance matrix.
::SemiMetric - computes neighbors from X treated as samples, using the given metric.

Returns

knns: knns[j, i] is the index of node i's jth nearest neighbor.
dists: dists[j, i] is the distance of node i's jth nearest neighbor.

UMAP.knn_search — Method

knn_search(X, k, metric) -> knns, dists

Find the k nearest neighbors of each point.

metric may be of type:

::Symbol - knn_search is dispatched to one of the following based on the evaluation of metric:
::Val(:precomputed) - computes neighbors from X treated as a precomputed distance matrix.
::SemiMetric - computes neighbors from X treated as samples, using the given metric.

Returns

knns: knns[j, i] is the index of node i's jth nearest neighbor.
dists: dists[j, i] is the distance of node i's jth nearest neighbor.
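
For the :precomputed case, the neighbors can be read directly off each column of the distance matrix. A minimal sketch with a hypothetical helper name, assuming the smallest entry of each column is the point's zero distance to itself:

```julia
function knn_from_distances(D::AbstractMatrix, k::Integer)
    n = size(D, 2)
    knns = Matrix{Int}(undef, k, n)
    dists = Matrix{eltype(D)}(undef, k, n)
    for i in 1:n
        # Ranks 2 to k+1: rank 1 is assumed to be the point itself.
        order = partialsortperm(D[:, i], 2:k+1)
        knns[:, i] = order
        dists[:, i] = D[order, i]
    end
    return knns, dists
end

D = [0.0 1.0 4.0;
     1.0 0.0 2.0;
     4.0 2.0 0.0]
knns, dists = knn_from_distances(D, 1)
```

If the data contain duplicate points, the self-distance is no longer the unique minimum and this shortcut would need an explicit check on the diagonal.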

UMAP.optimize_embedding — Function

optimize_embedding(graph, query_embedding, ref_embedding, n_epochs, initial_alpha, min_dist, spread, gamma, neg_sample_rate, _a=nothing, _b=nothing; move_ref=false) -> embedding

Optimize an embedding by minimizing the fuzzy set cross entropy between the high and low dimensional simplicial sets using stochastic gradient descent. Optimize "query" samples with respect to "reference" samples.

Arguments

graph: a sparse matrix of shape (n_samples, n_samples)
query_embedding: a vector of length (n_samples,) of vectors representing the embedded data points to be optimized ("query" samples)
ref_embedding: a vector of length (n_samples,) of vectors representing the embedded data points to optimize against ("reference" samples)
n_epochs: the number of training epochs for optimization
initial_alpha: the initial learning rate
min_dist: the minimum spacing of points in the output embedding
spread: the effective scale of embedded points
gamma: the repulsive strength of negative samples
neg_sample_rate: the number of negative samples per positive sample
_a: the first parameter of the smooth membership curve used in the embedding space (see fit_ab). If nothing, it is determined automatically from min_dist and spread.
_b: the second parameter of the smooth membership curve used in the embedding space (see fit_ab). If nothing, it is determined automatically from min_dist and spread.

Keyword Arguments

move_ref::Bool = false: if true, also improve the embeddings in ref_embedding, else fix them and only improve embeddings in query_embedding.
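
A single stochastic gradient step for one query point might look like the sketch below. It assumes the membership curve 1 / (1 + a d^(2b)) and the gradient coefficients used in the reference UMAP algorithm, with gradients clipped to [-4, 4]. The function name is hypothetical, and learning-rate decay and the sampling of edges are left out.

```julia
# One attractive and one repulsive update for a single query point `q`.
# `pos` is a graph neighbor, `neg` a randomly drawn negative sample.
function sgd_step_sketch!(q::Vector{Float64}, pos::Vector{Float64}, neg::Vector{Float64},
                          alpha, a, b, gamma)
    # Attraction toward the neighbor.
    d2 = sum(abs2, q .- pos)            # squared distance
    if d2 > 0
        coeff = -2a * b * d2^(b - 1) / (1 + a * d2^b)
        q .+= alpha .* clamp.(coeff .* (q .- pos), -4, 4)
    end
    # Repulsion from the negative sample, weighted by gamma.
    d2 = sum(abs2, q .- neg)
    if d2 > 0
        coeff = 2gamma * b / ((0.001 + d2) * (1 + a * d2^b))
        q .+= alpha .* clamp.(coeff .* (q .- neg), -4, 4)
    end
    return q
end

q = [1.0, 0.0]
sgd_step_sketch!(q, [0.0, 0.0], [5.0, 0.0], 0.1, 1.577, 0.895, 1.0)
```

Only q is modified here, which corresponds to move_ref = false: the reference points stay fixed while the query point moves toward its neighbor and away from the negative sample.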

UMAP.smooth_knn_dists — Method

smooth_knn_dists(dists, k, local_connectivity; <kwargs>) -> knn_dists, nn_dists

Compute the distances to the nearest neighbors for a continuous value k. Returns the approximated distances to the kth nearest neighbor (knn_dists) and the nearest neighbor (nn_dists) from each point.
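
One way to picture the smoothing is a bisection on a per-point scale σ so that the total membership of the neighbors equals log2(k). The sketch below assumes that criterion, with ρ taken as the nearest-neighbor distance; the helper name, tolerance, and iteration count are illustrative choices.

```julia
# Hypothetical helper: find σ for one point by bisection so that
# sum_j exp(-max(d_j - ρ, 0) / σ) ≈ log2(k).
function smooth_sigma_sketch(dists::AbstractVector, k::Real, ρ::Real; niter=64)
    goal = log2(k)
    lo, hi = 0.0, Inf
    σ = 1.0
    for _ in 1:niter
        psum = sum(exp(-max(d - ρ, 0.0) / σ) for d in dists)
        abs(psum - goal) < 1e-5 && break
        if psum > goal                  # too much total membership: shrink σ
            hi = σ
            σ = (lo + hi) / 2
        else                            # too little: grow σ
            lo = σ
            σ = isinf(hi) ? 2σ : (lo + hi) / 2
        end
    end
    return σ
end

dists = [1.0, 2.0, 3.0, 4.0, 5.0]
σ = smooth_sigma_sketch(dists, 5, 1.0)
```

Because the target is log2(k) rather than k itself, k does not need to be an integer, which is one reading of "a continuous value k" above.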

UMAP.spectral_layout — Method

spectral_layout(graph, embed_dim) -> embedding

Initialize the graph layout with spectral embedding.
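
A spectral embedding of this kind can be sketched with a dense eigendecomposition of the symmetric normalized Laplacian. This is an illustration under that assumption, not the package's implementation, which may use a sparse or iterative solver; the function name is hypothetical.

```julia
using LinearAlgebra

function spectral_layout_sketch(graph::AbstractMatrix, embed_dim::Integer)
    A = Matrix{Float64}(graph)
    dinv = 1 ./ sqrt.(vec(sum(A, dims=1)))          # D^(-1/2) as a vector
    L = Symmetric(I - Diagonal(dinv) * A * Diagonal(dinv))
    vecs = eigen(L).vectors                         # eigenvalues in ascending order
    # Skip the first (trivial) eigenvector and keep the next embed_dim.
    return permutedims(vecs[:, 2:embed_dim+1])
end

# A ring of six nodes as a small symmetric test graph.
A = zeros(6, 6)
for i in 1:6
    j = mod1(i + 1, 6)
    A[i, j] = A[j, i] = 1.0
end
layout = spectral_layout_sketch(A, 2)
```

The result has one column per node and embed_dim rows, matching the column-per-point layout used elsewhere on this page.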

UMAP.transform — Method

transform(model::UMAP_, Q::AbstractMatrix; <kwargs>) -> embedding

Use the given model to embed new points into an existing embedding. Q is a matrix of some number of points (columns) in the same space as model.data. The returned embedding is the embedding of these points in n-dimensional space, where n is the dimensionality of model.embedding. This embedding is created by finding the neighbors of Q in model.data and optimizing cross entropy with respect to the membership strengths of these neighbors.

Keyword Arguments

n_neighbors::Integer = 15: the number of neighbors to consider as locally connected. Larger values capture more global structure in the data, while smaller values capture more local structure.
metric::{SemiMetric, Symbol} = Euclidean(): the metric to calculate distance in the input space. It is also possible to pass metric = :precomputed to treat X like a precomputed distance matrix.
n_epochs::Integer = 300: the number of training epochs for embedding optimization
learning_rate::Real = 1: the initial learning rate during optimization
init::Symbol = :spectral: how to initialize the output embedding; valid options are :spectral and :random
min_dist::Real = 0.1: the minimum spacing of points in the output embedding
spread::Real = 1: the effective scale of embedded points. Determines how clustered embedded points are in combination with min_dist.
set_operation_ratio::Real = 1: interpolates between fuzzy set union and fuzzy set intersection when constructing the UMAP graph (global fuzzy simplicial set). The value should lie between 0.0 and 1.0: 1.0 indicates pure fuzzy union, while 0.0 indicates pure fuzzy intersection.
local_connectivity::Integer = 1: the number of nearest neighbors that should be assumed to be locally connected. The higher this value, the more connected the manifold becomes. This should not be set higher than the intrinsic dimension of the manifold.
repulsion_strength::Real = 1: the weighting of negative samples during the optimization process.
neg_sample_rate::Integer = 5: the number of negative samples to select for each positive sample. Higher values increase computational cost but yield slightly higher accuracy.
a = nothing: the first parameter of the smooth membership curve used in the embedding space (see fit_ab). By default, it is determined automatically from min_dist and spread.
b = nothing: the second parameter of the smooth membership curve used in the embedding space (see fit_ab). By default, it is determined automatically from min_dist and spread.

UMAP.umap — Method

umap(X::AbstractMatrix[, n_components=2]; <kwargs>) -> embedding

Embed the data X into an n_components-dimensional space. n_neighbors controls how many neighbors to consider as locally connected.

See UMAP_ for a description of keyword arguments.