CluGen.CluGenModule
CluGen

A Julia package for generating multidimensional clusters. It provides the clugen function for this purpose, as well as a number of auxiliary functions that clugen uses internally in a modular fashion. Users can swap these auxiliary functions for their own customized versions to fine-tune their cluster generation strategies, or even use them as the basis for their own generation algorithms.

CluGen.angle_btwMethod
angle_btw(v1::AbstractArray{<:Real, 1}, v2::AbstractArray{<:Real, 1}) -> Real

Angle between two $n$-dimensional vectors.

Typically, the angle between two vectors v1 and v2 can be obtained with:

acos(dot(v1, v2) / (norm(v1) * norm(v2)))

However, this approach is numerically unstable. The version provided here is numerically stable and based on the AngleBetweenVectors.jl package by Jeffrey Sarnoff (MIT license), implementing an algorithm provided by Prof. W. Kahan in these notes (see page 15).
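As a rough illustration of why a different formulation helps, the sketch below uses the half-angle form 2·atan(‖u1 − u2‖, ‖u1 + u2‖) on normalized inputs. The name stable_angle is hypothetical and this is not the package's actual code, only a sketch assuming that formulation.

```julia
using LinearAlgebra

# Hypothetical helper (not part of CluGen): angle via a half-angle formula,
# which avoids acos() and its loss of precision near 0 and pi.
function stable_angle(v1::AbstractVector{<:Real}, v2::AbstractVector{<:Real})
    u1 = v1 ./ norm(v1)   # normalize both inputs
    u2 = v2 ./ norm(v2)
    return 2 * atan(norm(u1 .- u2), norm(u1 .+ u2))
end

a = stable_angle([1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0])
println(rad2deg(a))  # close to 60 degrees
```

The two-argument atan also handles antiparallel vectors, where the denominator norm is zero.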

Examples

julia> rad2deg(angle_btw([1.0, 1.0, 1.0, 1.0], [1.0, 0.0, 0.0, 0.0]))
60.00000000000001
CluGen.angle_deltasMethod
angle_deltas(
    num_clusters::Integer,
    angle_disp::Real;
    rng::AbstractRNG = Random.GLOBAL_RNG
) -> AbstractArray{<:Real, 1}

Determine the angles between the average cluster direction and the cluster-supporting lines. These angles are obtained from a wrapped normal distribution (μ=0, σ=angle_disp) with support in the interval $\left[-\pi/2,\pi/2\right]$. Note this is different from the standard wrapped normal distribution, the support of which is given by the interval $\left[-\pi,\pi\right]$.

The angle_disp parameter must be specified in radians and results are given in radians in the interval $\left[-\pi/2,\pi/2\right]$.

This function is not exported by the package and must be prefixed with CluGen if invoked by user code.
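The sampling described above can be sketched as follows. This is an assumed approach (normal draws folded into the target interval), and the names fold_angle and angle_deltas_sketch are hypothetical, not part of the package.

```julia
using Random

# Hypothetical helper: fold any angle into [-pi/2, pi/2].
function fold_angle(a::Real)
    a = atan(sin(a), cos(a))   # first wrap into [-pi, pi]
    if a > pi / 2
        a -= pi
    elseif a < -pi / 2
        a += pi
    end
    return a
end

# Sketch: normal draws (mean 0, std angle_disp), then folded.
angle_deltas_sketch(n::Integer, angle_disp::Real; rng::AbstractRNG=Random.default_rng()) =
    fold_angle.(angle_disp .* randn(rng, n))

deltas = angle_deltas_sketch(4, pi / 128; rng=MersenneTwister(1))
```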

Examples

julia> CluGen.angle_deltas(4, pi/128)
4-element Vector{Float64}:
  0.01888791855096079
 -0.027851298321307266
  0.03274154825228485
 -0.004475798744567242

julia> CluGen.angle_deltas(3, pi/32; rng=MersenneTwister(987)) # Reproducible
3-element Vector{Float64}:
  0.08834204306583336
  0.014678748091943444
 -0.15202559427536264
CluGen.clucentersMethod
clucenters(
    num_clusters::Integer,
    clu_sep::AbstractArray{<:Real, 1},
    clu_offset::AbstractArray{<:Real, 1};
    rng::AbstractRNG = Random.GLOBAL_RNG
) ->  AbstractArray{<:Real}

Determine cluster centers using the uniform distribution, taking into account the number of clusters (num_clusters) and the average cluster separation (clu_sep).

More specifically, let $c=$ num_clusters, $\mathbf{s}=$ clu_sep, $\mathbf{o}=$ clu_offset, $n=$ length(clu_sep) (i.e., number of dimensions). Cluster centers are obtained according to the following equation:

\[\mathbf{C}=c\mathbf{U} \cdot \operatorname{diag}(\mathbf{s}) + \mathbf{1}\,\mathbf{o}^T\]

where $\mathbf{C}$ is the $c \times n$ matrix of cluster centers, $\mathbf{U}$ is a $c \times n$ matrix of random values drawn from the uniform distribution between -0.5 and 0.5, and $\mathbf{1}$ is a $c \times 1$ vector with all entries equal to 1.

This function is not exported by the package and must be prefixed with CluGen if invoked by user code.
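The equation above translates almost directly into array code. The following is a minimal sketch of that formula only; clucenters_sketch is a hypothetical name and the package's own implementation may differ in detail.

```julia
using LinearAlgebra, Random

# Sketch of C = c * U * diag(s) + 1 * o', with U uniform in [-0.5, 0.5).
function clucenters_sketch(c::Integer, s::AbstractVector{<:Real}, o::AbstractVector{<:Real};
                           rng::AbstractRNG=Random.default_rng())
    U = rand(rng, c, length(s)) .- 0.5   # c x n uniform matrix
    return c .* U * Diagonal(s) .+ o'    # adding o' broadcasts the offset to every row
end

C = clucenters_sketch(4, [10, 50], [0, 0]; rng=MersenneTwister(1))
```

With these arguments each coordinate is bounded by c * s / 2 in absolute value, i.e., 20 in the first dimension and 100 in the second.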

Examples

julia> CluGen.clucenters(4, [10, 50], [0, 0]) # 2D
4×2 Matrix{Float64}:
 10.7379   -37.3512
 17.6206    32.511
  6.95835   17.2044
 -4.18188  -89.5734

julia> CluGen.clucenters(5, [20, 10, 30], [10, 10, -10]) # 3D
5×3 Matrix{Float64}:
 -13.136    15.8746      2.34767
 -29.1129   -0.715105  -46.6028
 -23.6334    8.19236    20.879
   7.30168  -1.20904   -41.2033
  46.5412    7.3284    -42.8401

julia> CluGen.clucenters(3, [100], [0]; rng=MersenneTwister(121)) # 1D, reproducible
3×1 Matrix{Float64}:
  -91.3675026663759
  140.98964768714384
 -124.90981996579862
CluGen.clugenMethod
clugen(
    num_dims::Integer,
    num_clusters::Integer,
    num_points::Integer,
    direction::AbstractArray{<:Real},
    angle_disp::Real,
    cluster_sep::AbstractArray{<:Real, 1},
    llength::Real,
    llength_disp::Real,
    lateral_disp::Real;
    # Keyword arguments
    allow_empty::Bool = false,
    cluster_offset::Union{AbstractArray{<:Real, 1}, Nothing} = nothing,
    proj_dist_fn::Union{String, <:Function} = "norm",
    point_dist_fn::Union{String, <:Function} = "n-1",
    clusizes_fn::Union{<:Function, AbstractArray{<:Real, 1}} = CluGen.clusizes,
    clucenters_fn::Union{<:Function, AbstractArray{<:Real}} = CluGen.clucenters,
    llengths_fn::Union{<:Function, AbstractArray{<:Real, 1}} = CluGen.llengths,
    angle_deltas_fn::Union{<:Function, AbstractArray{<:Real, 1}} = CluGen.angle_deltas,
    rng::AbstractRNG = Random.GLOBAL_RNG
) -> NamedTuple{(
        :points,      # Array{<:Real,2}
        :clusters,    # Array{<:Integer,1}
        :projections, # Array{<:Real,2}
        :sizes,       # Array{<:Integer,1}
        :centers,     # Array{<:Real,2}
        :directions,  # Array{<:Real,2}
        :angles,      # Array{<:Real,1}
        :lengths      # Array{<:Real,1}
     )}

Generate multidimensional clusters.

This is the main function of the CluGen package, and possibly the only function most users will need.

Arguments (mandatory)

  • num_dims: Number of dimensions.
  • num_clusters: Number of clusters to generate.
  • num_points: Total number of points to generate.
  • direction: Average direction of the cluster-supporting lines. Can be a vector of length num_dims (same direction for all clusters) or a matrix of size num_clusters x num_dims (one direction per cluster).
  • angle_disp: Angle dispersion of cluster-supporting lines (radians).
  • cluster_sep: Average cluster separation in each dimension (num_dims x 1).
  • llength: Average length of cluster-supporting lines.
  • llength_disp: Length dispersion of cluster-supporting lines.
  • lateral_disp: Cluster lateral dispersion, i.e., dispersion of points from their projection on the cluster-supporting line.

Note that the terms "average" and "dispersion" refer to measures of central tendency and statistical dispersion, respectively. Their exact meaning depends on the optional arguments, described next.

Arguments (optional)

  • allow_empty: Allow empty clusters? false by default.
  • cluster_offset: Offset to add to all cluster centers. If set to nothing (the default), the offset will be equal to zeros(num_dims).
  • proj_dist_fn: Distribution of point projections along cluster-supporting lines, with three possible values:
    • "norm" (default): Distribute point projections along lines using a normal distribution (μ=line center, σ=llength/6).
    • "unif": Distribute points uniformly along the line.
    • User-defined function, which accepts three parameters: line length (float), number of points (integer), and a random number generator. It returns an array containing the distance of each point projection to the center of the line. For example, the "norm" option roughly corresponds to (len, n, rng) -> (1.0 / 6.0) * len .* randn(rng, n).
  • point_dist_fn: Controls how the final points are created from their projections on the cluster-supporting lines, with three possible values:
    • "n-1" (default): Final points are placed on a hyperplane orthogonal to the cluster-supporting line, centered at each point's projection, using the normal distribution (μ=0, σ=lateral_disp). This is done by the CluGen.clupoints_n_1() function.
    • "n": Final points are placed around their projection on the cluster-supporting line using the normal distribution (μ=0, σ=lateral_disp). This is done by the CluGen.clupoints_n() function.
    • User-defined function: The user can specify a custom point placement strategy by passing a function with the same signature as CluGen.clupoints_n_1() and CluGen.clupoints_n().
  • clusizes_fn: Distribution of cluster sizes. By default, cluster sizes are determined by the CluGen.clusizes() function, which uses the normal distribution (μ=num_points/num_clusters, σ=μ/3), and assures that the final cluster sizes add up to num_points. This parameter allows the user to specify a custom function for this purpose, which must follow CluGen.clusizes()'s signature. Note that custom functions are not required to strictly obey the num_points parameter. Alternatively, the user can specify an array of cluster sizes directly.
  • clucenters_fn: Distribution of cluster centers. By default, cluster centers are determined by the CluGen.clucenters() function, which uses the uniform distribution, and takes into account the num_clusters and cluster_sep parameters for generating well-distributed cluster centers. This parameter allows the user to specify a custom function for this purpose, which must follow CluGen.clucenters()'s signature. Alternatively, the user can specify a matrix of size num_clusters x num_dims with the exact cluster centers.
  • llengths_fn: Distribution of line lengths. By default, the lengths of cluster-supporting lines are determined by the CluGen.llengths() function, which uses the folded normal distribution (μ=llength, σ=llength_disp). This parameter allows the user to specify a custom function for this purpose, which must follow CluGen.llengths()'s signature. Alternatively, the user can specify an array of line lengths directly.
  • angle_deltas_fn: Distribution of line angle differences with respect to direction. By default, the angles between the main direction of each cluster and the final directions of their cluster-supporting lines are determined by the CluGen.angle_deltas() function, which uses the wrapped normal distribution (μ=0, σ=angle_disp) with support in the interval $\left[-\pi/2,\pi/2\right]$. This parameter allows the user to specify a custom function for this purpose, which must follow CluGen.angle_deltas()'s signature. Alternatively, the user can specify an array of angle deltas directly.
  • rng: A concrete instance of AbstractRNG for reproducible runs. Alternatively, the user can set the global RNG seed with Random.seed!() before invoking clugen().
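To make the customization points more concrete, here is a small sketch of user-supplied values for proj_dist_fn and clusizes_fn. The names proj_unif_sketch, proj_norm_sketch and sizes are hypothetical, and the clugen() call is shown only as a comment, since it assumes the package is loaded.

```julia
using Random

# Hypothetical custom projection distribution: uniform along the line,
# returning distances to the line center in [-len/2, len/2).
proj_unif_sketch(len, n, rng) = len .* rand(rng, n) .- len / 2

# Rough equivalent of the "norm" option, as given in the text above.
proj_norm_sketch(len, n, rng) = (1.0 / 6.0) * len .* randn(rng, n)

# Hypothetical fixed cluster sizes, to be passed directly as an array.
sizes = [200, 500, 300]

d = proj_unif_sketch(10.0, 1000, MersenneTwister(1))

# With CluGen loaded, these would be passed as keyword arguments, e.g.:
# clugen(2, 3, 1000, [1, 0], 0.1, [10, 10], 10, 1, 1;
#        proj_dist_fn=proj_unif_sketch, clusizes_fn=sizes)
```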

Return values

The function returns a NamedTuple with the following fields:

  • points: A num_points x num_dims matrix with the generated points for all clusters.
  • clusters: A num_points x 1 vector indicating which cluster each point in points belongs to.
  • projections: A num_points x num_dims matrix with the point projections on the cluster-supporting lines.
  • sizes: A num_clusters x 1 vector with the number of points in each cluster.
  • centers: A num_clusters x num_dims matrix with the coordinates of the cluster centers.
  • directions: A num_clusters x num_dims matrix with the final direction of each cluster-supporting line.
  • angles: A num_clusters x 1 vector with the angles between the cluster-supporting lines and the main direction.
  • lengths: A num_clusters x 1 vector with the lengths of the cluster-supporting lines.

Note that if a custom function is given in the clusizes_fn parameter, the total number of generated points (i.e., the number of rows in points) may differ from the value given in clugen's num_points parameter.

Examples

julia> # Create 5 clusters in 3D space with a total of 10000 points...

julia> out = clugen(3, 5, 10000, [0.5, 0.5, 0.5], pi/16, [10, 10, 10], 10, 1, 2);

julia> out.centers # What are the cluster centers?
5×3 Matrix{Float64}:
   8.12774  -16.8167    -1.80764
   4.30111   -1.34916  -11.209
 -22.3933    18.2706    -2.6716
 -11.568      5.87459    4.11589
 -19.5565   -10.7151   -12.2009

The following instruction displays a scatter plot of the clusters in 3D space:

julia> plot(out.points[:,1], out.points[:,2], out.points[:,3], seriestype = :scatter, group=out.clusters)

Check the Examples section for a number of illustrative examples on how to use the clugen() function. The Theory section provides more information on how the function works and the impact each parameter has on the final result.

CluGen.clumergeMethod
clumerge(
    data::Union{NamedTuple,Dict}...;
    fields::Tuple{Vararg{Symbol}}=(:points, :clusters),
    clusters_field::Union{Symbol,Nothing}=:clusters,
    output_type::Symbol=:NamedTuple
) -> Union{NamedTuple, Dict}

Merges the fields (specified in fields) of two or more data sets (named tuples or dictionaries). The fields to be merged need to have the same number of columns. The corresponding merged field will contain the rows of the fields to be merged, and will have a common supertype.

The clusters_field parameter specifies a field containing integers that identify the cluster to which each point belongs. If clusters_field is specified (by default it is set to :clusters), cluster assignments in individual data sets will be updated in the merged data set so that clusters are considered separate. This parameter can be set to nothing, in which case no field is treated as a special cluster assignments field.

This function can be used to merge data sets generated with the clugen() function, by default merging the :points and :clusters fields in those data sets. It also works with arbitrary data by specifying alternative fields in the fields parameter. It can be used, for example, to merge third-party data with clugen()-generated data.

The function returns a NamedTuple by default, but can return a dictionary by setting the output_type parameter to :Dict.
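The idea of keeping clusters separate while merging can be illustrated with a small stand-alone sketch. It handles only two data sets with points and clusters fields, and it offsets the second set's cluster ids by the largest id in the first; this is an assumed strategy for illustration, not necessarily what clumerge() does internally, and merge_sketch is a hypothetical name.

```julia
# Illustrative only: merge two data sets, relabeling the clusters of the
# second one so they do not collide with those of the first.
function merge_sketch(a::NamedTuple, b::NamedTuple)
    size(a.points, 2) == size(b.points, 2) ||
        throw(ArgumentError("`points` must have the same number of columns"))
    offset = maximum(a.clusters)
    return (points=vcat(a.points, b.points),
            clusters=vcat(a.clusters, b.clusters .+ offset))
end

a = (points=[0.0 0.0; 1.0 1.0], clusters=[1, 2])
b = (points=[5.0 5.0], clusters=Int32[1])
m = merge_sketch(a, b)
```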

Examples

julia> # Generate data with clugen()

julia> clu_data = clugen(2, 5, 1000, [1, 1], 0.01, [20, 20], 14, 1.2, 1.5);

julia> # Generate 500 points of random uniform noise

julia> noise = (points=120 * rand(500, 2) .- 60, clusters = ones(Int32, 500));

julia> # Create a new data set with the clugen()-generated data plus the noise

julia> clu_data_with_noise = clumerge(noise, clu_data);

The Examples section contains several illustrative examples on how to use the clumerge() function.

CluGen.clupoints_nMethod
CluGen.clupoints_n(
    projs::AbstractArray{<:Real, 2},
    lat_disp::Real,
    line_len::Real,
    clu_dir::AbstractArray{<:Real, 1},
    clu_ctr::AbstractArray{<:Real, 1};
    rng::AbstractRNG = Random.GLOBAL_RNG
) -> AbstractArray{<:Real}

Generate points from their $n$-dimensional projections on a cluster-supporting line, placing each point around its projection using the normal distribution (μ=0, σ=lat_disp).

This function's main intended use is by the clugen() function, generating the final points when the point_dist_fn parameter is set to "n".

This function is not exported by the package and must be prefixed with CluGen if invoked by user code.

Arguments

  • projs: Point projections on the cluster-supporting line.
  • lat_disp: Standard deviation for the normal distribution, i.e., cluster lateral dispersion.
  • line_len: Length of cluster-supporting line (ignored).
  • clu_dir: Direction of the cluster-supporting line.
  • clu_ctr: Center position of the cluster-supporting line (ignored).
  • rng: An optional pseudo-random number generator for reproducible executions.

Examples

julia> projs = points_on_line([5.0,5.0], [1.0,0.0], -4:2:4) # Get 5 point projections on a 2D line
5×2 Matrix{Float64}:
 1.0  5.0
 3.0  5.0
 5.0  5.0
 7.0  5.0
 9.0  5.0

julia> CluGen.clupoints_n(projs, 0.5, 1.0, [1,0], [0,0]; rng=MersenneTwister(123))
5×2 Matrix{Float64}:
 1.59513  4.66764
 4.02409  5.49048
 5.57133  4.96226
 7.22971  5.13691
 8.80166  4.90289
CluGen.clupoints_n_1Method
CluGen.clupoints_n_1(
    projs::AbstractArray{<:Real, 2},
    lat_disp::Real,
    line_len::Real,
    clu_dir::AbstractArray{<:Real, 1},
    clu_ctr::AbstractArray{<:Real, 1};
    rng::AbstractRNG = Random.GLOBAL_RNG
) -> AbstractArray{<:Real}

Generate points from their $n$-dimensional projections on a cluster-supporting line, placing each point on a hyperplane orthogonal to that line and centered at the point's projection, using the normal distribution (μ=0, σ=lat_disp).

This function's main intended use is by the clugen() function, generating the final points when the point_dist_fn parameter is set to "n-1".

This function is not exported by the package and must be prefixed with CluGen if invoked by user code.

Arguments

  • projs: Point projections on the cluster-supporting line.
  • lat_disp: Standard deviation for the normal distribution, i.e., cluster lateral dispersion.
  • line_len: Length of cluster-supporting line (ignored).
  • clu_dir: Direction of the cluster-supporting line (unit vector).
  • clu_ctr: Center position of the cluster-supporting line (ignored).
  • rng: An optional pseudo-random number generator for reproducible executions.

Examples

julia> projs = points_on_line([5.0,5.0], [1.0,0.0], -4:2:4) # Get 5 point projections on a 2D line
5×2 Matrix{Float64}:
 1.0  5.0
 3.0  5.0
 5.0  5.0
 7.0  5.0
 9.0  5.0

julia> CluGen.clupoints_n_1(projs, 0.5, 1.0, [1,0], [0,0]; rng=MersenneTwister(123))
5×2 Matrix{Float64}:
 1.0  5.59513
 3.0  3.97591
 5.0  4.42867
 7.0  5.22971
 9.0  4.80166
CluGen.clupoints_n_1_templateMethod
CluGen.clupoints_n_1_template(
    projs::AbstractArray{<:Real, 2},
    lat_disp::Real,
    clu_dir::AbstractArray{<:Real, 1},
    dist_fn::Function;
    rng::AbstractRNG = Random.GLOBAL_RNG
) -> AbstractArray{<:Real}

Generate points from their $n$-dimensional projections on a cluster-supporting line, placing each point on a hyperplane orthogonal to that line and centered at the point's projection. The function specified in dist_fn is used to perform the actual placement.

This function is used internally by CluGen.clupoints_n_1() and may be useful for constructing user-defined final point placement strategies for the point_dist_fn parameter of the main clugen() function.

This function is not exported by the package and must be prefixed with CluGen if invoked by user code.

Arguments

  • projs: Point projections on the cluster-supporting line.
  • lat_disp: Dispersion of points from their projection.
  • clu_dir: Direction of the cluster-supporting line (unit vector).
  • dist_fn: Function to place points on a second line, orthogonal to the first. The function accepts as parameters the number of points in the current cluster, the lateral_disp parameter (the same passed to the clugen() function), and a random number generator, returning a vector containing the distance of each point to its projection on the cluster-supporting line.
  • rng: An optional pseudo-random number generator for reproducible executions.
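The placement strategy can be sketched as follows, assuming at least two dimensions and a unit-length clu_dir. The name clupoints_n_1_sketch is hypothetical and the code is an approximation of the described behavior, not the package's implementation.

```julia
using LinearAlgebra, Random

# Sketch: offset each projection along a random unit vector orthogonal to
# clu_dir, by a signed distance returned by dist_fn(n, lat_disp, rng).
function clupoints_n_1_sketch(projs::AbstractMatrix{<:Real}, lat_disp::Real,
                              clu_dir::AbstractVector{<:Real}, dist_fn::Function;
                              rng::AbstractRNG=Random.default_rng())
    n, d = size(projs)
    dists = dist_fn(n, lat_disp, rng)
    points = float.(projs)                      # floating-point copy of the projections
    for i in 1:n
        r = randn(rng, d)
        w = r .- dot(r, clu_dir) .* clu_dir     # remove the component along clu_dir
        points[i, :] .+= dists[i] .* normalize(w)
    end
    return points
end

projs = [1.0 5.0; 3.0 5.0; 5.0 5.0]
pts = clupoints_n_1_sketch(projs, 0.5, [1.0, 0.0], (n, s, rng) -> s .* randn(rng, n);
                           rng=MersenneTwister(123))
```

Since every offset is orthogonal to the line direction, the first coordinate of each point in this 2D example stays equal to that of its projection.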
CluGen.clusizesMethod
clusizes(
    num_clusters::Integer,
    num_points::Integer,
    allow_empty::Bool;
    rng::AbstractRNG = Random.GLOBAL_RNG
) -> AbstractArray{<:Integer, 1}

Determine cluster sizes, i.e., the number of points in each cluster, using the normal distribution (μ=num_points/num_clusters, σ=μ/3), and then ensuring that the final cluster sizes add up to num_points via the CluGen.fix_num_points!() function.

This function is not exported by the package and must be prefixed with CluGen if invoked by user code.

Examples

julia> CluGen.clusizes(4, 6, true)
4-element Vector{Int64}:
 1
 0
 3
 2

julia> CluGen.clusizes(4, 100, false)
4-element Vector{Int64}:
 29
 26
 24
 21

julia> CluGen.clusizes(5, 500, true; rng=MersenneTwister(123)) # Reproducible
5-element Vector{Int64}:
 108
 129
 107
  89
  67
CluGen.fix_empty!Function
fix_empty!(
    clu_num_points::AbstractArray{<:Integer, 1},
    allow_empty::Bool = false
) -> AbstractArray{<:Integer, 1}

Ensures that, given enough points, no clusters are left empty. This is done by repeatedly removing a point from the largest cluster and adding it to an empty cluster until no empty clusters remain. If the total number of points is smaller than the number of clusters (or if the allow_empty parameter is set to true), this function does nothing.

This function is used internally by CluGen.clusizes() and might be useful for custom cluster sizing implementations given as the clusizes_fn parameter of the main clugen() function.

This function is not exported by the package and must be prefixed with CluGen if invoked by user code.
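The procedure described above can be sketched in a few lines. The name fix_empty_sketch! is hypothetical and the code only mirrors the textual description.

```julia
# Sketch: move one point from the largest cluster to each empty cluster,
# unless empty clusters are allowed or there are too few points.
function fix_empty_sketch!(sizes::AbstractVector{<:Integer}, allow_empty::Bool=false)
    if !allow_empty && sum(sizes) >= length(sizes)
        for i in findall(iszero, sizes)
            sizes[argmax(sizes)] -= 1   # take a point from the largest cluster
            sizes[i] += 1               # give it to the empty cluster
        end
    end
    return sizes
end

fixed = fix_empty_sketch!([0, 5, 0, 3])
```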

CluGen.fix_num_points!Method
fix_num_points!(
    clu_num_points::AbstractArray{<:Integer, 1},
    num_points::Integer
) -> AbstractArray{<:Integer, 1}

Ensures that the values in the clu_num_points array, i.e., the number of points in each cluster, add up to num_points. If this is not the case, the clu_num_points array is modified in-place, incrementing the value corresponding to the smallest cluster while sum(clu_num_points) < num_points, or decrementing the value corresponding to the largest cluster while sum(clu_num_points) > num_points.

This function is used internally by CluGen.clusizes() and might be useful for custom cluster sizing implementations given as the clusizes_fn parameter of the main clugen() function.

This function is not exported by the package and must be prefixed with CluGen if invoked by user code.
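A minimal sketch of this adjustment loop is shown below. The name fix_num_points_sketch! is hypothetical; ties are resolved by taking the first smallest or largest cluster, which is an assumption.

```julia
# Sketch: adjust cluster sizes in-place until they add up to num_points.
function fix_num_points_sketch!(sizes::AbstractVector{<:Integer}, num_points::Integer)
    while sum(sizes) < num_points
        sizes[argmin(sizes)] += 1   # grow the smallest cluster
    end
    while sum(sizes) > num_points
        sizes[argmax(sizes)] -= 1   # shrink the largest cluster
    end
    return sizes
end

grown = fix_num_points_sketch!([2, 3, 1], 8)
shrunk = fix_num_points_sketch!([5, 4, 3], 10)
```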

CluGen.llengthsMethod
llengths(
    num_clusters::Integer,
    llength::Real,
    llength_disp::Real;
    rng::AbstractRNG = Random.GLOBAL_RNG
) -> AbstractArray{<:Real, 1}

Determine length of cluster-supporting lines using the folded normal distribution (μ=llength, σ=llength_disp).

This function is not exported by the package and must be prefixed with CluGen if invoked by user code.
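A folded normal sample is the absolute value of a normal sample, so the description above can be sketched as follows (llengths_sketch is a hypothetical name, not the package's function):

```julia
using Random

# Sketch: absolute value of normal draws with mean llength and std llength_disp.
llengths_sketch(n::Integer, llength::Real, llength_disp::Real;
                rng::AbstractRNG=Random.default_rng()) =
    abs.(llength .+ llength_disp .* randn(rng, n))

lens = llengths_sketch(5, 10, 3; rng=MersenneTwister(1))
```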

Examples

julia> CluGen.llengths(5, 10, 3)
5-element Vector{Float64}:
 13.57080364295883
 16.14453912336772
 13.427952708601596
 11.37824686122124
  8.809962762114331

julia> CluGen.llengths(3, 100, 60; rng=MersenneTwister(111)) # Reproducible
3-element Vector{Float64}:
 146.1737820482947
  31.914161161783426
 180.04064126207396
CluGen.points_on_lineMethod
points_on_line(
    center::AbstractArray{<:Real, 1},
    direction::AbstractArray{<:Real, 1},
    dist_center::AbstractArray{<:Real, 1}
) -> AbstractArray{<:Real, 2}

Determine coordinates of points on a line with center and direction, based on the distances from the center given in dist_center.

This works by using the vector formulation of the line equation, assuming direction is an $n$-dimensional unit vector. In other words, considering $\mathbf{d}=$ direction ($n \times 1$), $\mathbf{c}=$ center ($n \times 1$), and $\mathbf{w}=$ dist_center ($p \times 1$), the coordinates of points on the line are given by:

\[\mathbf{P}=\mathbf{1}\,\mathbf{c}^T + \mathbf{w}\mathbf{d}^T\]

where $\mathbf{P}$ is the $p \times n$ matrix of point coordinates on the line, and $\mathbf{1}$ is a $p \times 1$ vector with all entries equal to 1.
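The equation maps onto a single broadcast expression. The sketch below is a direct transcription of the formula under a hypothetical name (points_on_line_sketch), without the argument type checks of the real function.

```julia
# Sketch of P = 1 * c' + w * d' using broadcasting:
# w (p-vector) times d' (1 x n) gives a p x n matrix; c' is added to every row.
points_on_line_sketch(center, direction, dist_center) =
    center' .+ dist_center .* direction'

P = points_on_line_sketch([5.0, 5.0], [1.0, 0.0], -4:2:4)
```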

Examples

julia> points_on_line([5.0,5.0], [1.0,0.0], -4:2:4) # 2D, 5 points
5×2 Matrix{Float64}:
 1.0  5.0
 3.0  5.0
 5.0  5.0
 7.0  5.0
 9.0  5.0

julia> points_on_line([-2.0,0,0,2.0], [0,0,-1.0,0], [10,-10]) # 4D, 2 points
2×4 Matrix{Float64}:
 -2.0  0.0  -10.0  2.0
 -2.0  0.0   10.0  2.0
CluGen.rand_ortho_vectorMethod
rand_ortho_vector(
    u::AbstractArray{<:Real, 1};
    rng::AbstractRNG = Random.GLOBAL_RNG
) -> AbstractArray{<:Real, 1}

Get a random unit vector orthogonal to u.

Note that u is expected to be a unit vector itself.
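One common way to obtain such a vector is to draw a random vector and remove its component along u. The sketch below assumes that approach, a unit-length u and at least two dimensions; rand_ortho_sketch is a hypothetical name and the package may proceed differently.

```julia
using LinearAlgebra, Random

# Sketch: random vector minus its projection on u, then normalized.
function rand_ortho_sketch(u::AbstractVector{<:Real}; rng::AbstractRNG=Random.default_rng())
    r = randn(rng, length(u))
    return normalize(r .- dot(r, u) .* u)
end

u = normalize([1, 2, 5.0, -3, -0.2])
v = rand_ortho_sketch(u; rng=MersenneTwister(42))
```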

Examples

julia> u = normalize([1,2,5.0,-3,-0.2]); # Define a 5D unit vector

julia> v = rand_ortho_vector(u);

julia> ≈(dot(u, v), 0; atol=1e-15) # Vectors orthogonal? (needs LinearAlgebra package)
true

julia> rand_ortho_vector([1,0,0]; rng=MersenneTwister(567)) # 3D, reproducible
3-element Vector{Float64}:
  0.0
 -0.717797705156548
  0.6962517177515569
CluGen.rand_unit_vectorMethod
rand_unit_vector(
    num_dims::Integer;
    rng::AbstractRNG = Random.GLOBAL_RNG
) ->  AbstractArray{<:Real, 1}

Get a random unit vector with num_dims dimensions.

Examples

julia> v = rand_unit_vector(4) # 4D
4-element Vector{Float64}:
 -0.24033021128704707
 -0.032103799230189585
  0.04223910709972599
 -0.9692402145232775

julia> norm(v) # Check vector magnitude is 1 (needs LinearAlgebra package)
1.0

julia> rand_unit_vector(2; rng=MersenneTwister(33)) # 2D, reproducible
2-element Vector{Float64}:
  0.8429232717309576
 -0.5380337888779647
CluGen.rand_vector_at_angleMethod
rand_vector_at_angle(
    u::AbstractArray{<:Real, 1},
    angle::Real;
    rng::AbstractRNG = Random.GLOBAL_RNG
) ->  AbstractArray{<:Real, 1}

Get a random unit vector which forms an angle of angle radians with vector u.

Note that u is expected to be a unit vector itself.
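For angles strictly between -π/2 and π/2, one possible construction is to tilt u towards a random orthogonal unit vector w by tan(angle), since the normalized sum u + tan(angle)·w has cosine cos(angle) with u. The sketch below assumes that construction, a unit-length u and at least two dimensions; rand_vector_at_angle_sketch is a hypothetical name, not the package's code.

```julia
using LinearAlgebra, Random

# Sketch for |angle| < pi/2, assuming u is a unit vector.
function rand_vector_at_angle_sketch(u::AbstractVector{<:Real}, angle::Real;
                                     rng::AbstractRNG=Random.default_rng())
    r = randn(rng, length(u))
    w = normalize(r .- dot(r, u) .* u)       # random unit vector orthogonal to u
    return normalize(u .+ tan(angle) .* w)   # tilt u towards w by the requested angle
end

u = normalize([1, 0.5, 0.3, -0.1])
v = rand_vector_at_angle_sketch(u, pi / 4; rng=MersenneTwister(7))
```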

Examples

julia> u = normalize([1,0.5,0.3,-0.1]); # Define a 4D unit vector

julia> v = rand_vector_at_angle(u, pi/4); # pi/4 = 0.7853981... radians = 45 degrees

julia> a = acos(dot(u, v) / (norm(u) * norm(v))) # Angle (radians) between u and v?
0.7853981633974483

julia> rand_vector_at_angle([0, 1], pi/6; rng=MersenneTwister(456)) # 2D, reproducible
2-element Vector{Float64}:
 -0.4999999999999999
  0.8660254037844387