Docstrings · DPMMSubClustersStreaming.jl

DPMMSubClustersStreaming.compact_mnm_hyper — Type

compact_mnm_hyper(α::AbstractArray{Float32,1})

Dirichlet Distribution

DPMMSubClustersStreaming.multinomial_dist — Type

multinomial_dist(α::AbstractArray{Float32,1})

Dirichlet Distribution

DPMMSubClustersStreaming.multinomial_hyper — Type

multinomial_hyper(α::AbstractArray{Float32,1})

Dirichlet Distribution

DPMMSubClustersStreaming.mv_gaussian — Type

mv_gaussian(μ::AbstractArray{Float32,1},
    Σ::AbstractArray{Float32,2},
    invΣ::AbstractArray{Float32,2},
    logdetΣ::Float32,
    invChol::UpperTriangular)

Multivariate Normal Distribution

DPMMSubClustersStreaming.niw_hyperparams — Type

niw_hyperparams(κ::Float32, m::AbstractArray{Float32}, ν::Float32, ψ::AbstractArray{Float32})

Normal Inverse Wishart

DPMMSubClustersStreaming.cluster_statistics — Method

cluster_statistics(points, labels, clusters)

Provide average statistics of probability and likelihood for the given points, labels, and clusters.

Args and Kwargs

points a DxN array containing the data
labels points labels
clusters vector of clusters distributions

Return values

avg_ll, avg_prob

avg_ll the average per-point log likelihood of each cluster
avg_prob the average per-point probability of each cluster

Example:

julia> dp = run_model_from_checkpoint("checkpoint__50.jld2")
Loading Model:
  1.073261 seconds (2.27 M allocations: 113.221 MiB, 2.60% gc time)
Including params
Loading data:
  0.000881 seconds (10.02 k allocations: 378.313 KiB)
Creating model:
Node Leaders:
Dict{Any,Any}(2=>Any[2, 3])
Running model:
...
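
As an illustration of the quantities described above, here is a minimal sketch of a per-cluster average log likelihood for Gaussian clusters. It is not the package's implementation: the helper names gaussian_loglik and avg_loglik_per_cluster, and the representation of each cluster as a (μ, Σ) tuple, are assumptions made for this example.

```julia
using LinearAlgebra, Statistics

# Log density of a multivariate normal N(μ, Σ) at the point x.
function gaussian_loglik(x::AbstractVector, μ::AbstractVector, Σ::AbstractMatrix)
    d = length(μ)
    C = cholesky(Symmetric(Σ))
    z = C.L \ (x .- μ)                 # whitened residual
    return -0.5 * (d * log(2π) + logdet(C) + dot(z, z))
end

# Average log likelihood of the points assigned to each cluster.
# points is D×N, labels has length N, clusters is a vector of (μ, Σ) tuples
# (a hypothetical representation, not the package's cluster type).
function avg_loglik_per_cluster(points, labels, clusters)
    avg_ll = fill(NaN, length(clusters))   # NaN marks clusters with no points
    for (k, (μ, Σ)) in enumerate(clusters)
        idx = findall(==(k), labels)
        isempty(idx) && continue
        avg_ll[k] = mean(gaussian_loglik(points[:, i], μ, Σ) for i in idx)
    end
    return avg_ll
end

points = [0.0 1.0 10.0 11.0; 0.0 1.0 10.0 11.0]
labels = [1, 1, 2, 2]
clusters = [([0.5, 0.5], Matrix(1.0I, 2, 2)), ([10.5, 10.5], Matrix(1.0I, 2, 2))]
avg_ll = avg_loglik_per_cluster(points, labels, clusters)
```

Every point in this toy input lies at the same distance from its cluster mean, so both entries of avg_ll are equal.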

DPMMSubClustersStreaming.dp_parallel — Function

dp_parallel(all_data::AbstractArray{Float32,2},
    local_hyper_params::distribution_hyper_params,
    α_param::Float32,
     iters::Int64 = 100,
     init_clusters::Int64 = 1,
     seed = nothing,
     verbose = true,
     save_model = false,
     burnout = 15,
     gt = nothing,
     max_clusters = Inf,
     outlier_weight = 0,
     outlier_params = nothing)

Run the model.

Args and Kwargs

all_data::AbstractArray{Float32,2} a DxN array containing the data
local_hyper_params::distribution_hyper_params the prior hyperparams
α_param::Float32 the concentration parameter
iters::Int64 number of iterations to run the model
init_clusters::Int64 number of initial clusters
seed a random seed to be used in all workers; if used, the call must be preceded by @everywhere using Random.
verbose will perform prints on every iteration.
save_model will save a checkpoint every 25 iterations.
burnout how many iterations to wait after a cluster is created before it is allowed to split or merge
gt ground-truth labels; when supplied, NMI and VI analysis is performed on every iteration.
max_clusters upper limit on the number of clusters
outlier_weight constant weight of an extra non-splitting component
outlier_params hyperparams for an extra non-splitting component

Return values

dp_model, iter_count, nmi_score_history, likelihood_history, cluster_count_history

dp_model The DPMM model inferred
iter_count Timing for each iteration
nmi_score_history NMI score per iteration (if gt supplied)
likelihood_history Log likelihood per iteration.
cluster_count_history Cluster counts per iteration.
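
The requirement on the seed argument can be sketched as follows. This is a minimal illustration using only Julia's standard library, under the assumption that the package seeds each process through Random; it shows how Random is loaded on all processes and why a fixed seed makes draws repeatable, and does not call the package itself.

```julia
using Distributed

# Load Random on the master process and on every worker process (workers
# exist only if they were added beforehand, e.g. with addprocs).
@everywhere using Random

# Reseeding with the same value reproduces the same random stream.
Random.seed!(1234)
first_draw = rand(3)
Random.seed!(1234)
second_draw = rand(3)
```

With Random loaded everywhere, a seed value can then be passed to the functions documented here.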

DPMMSubClustersStreaming.dp_parallel — Method

dp_parallel(model_params::String; verbose = true, save_model = true, burnout = 5, gt = nothing)

Run the model in advanced mode.

Args and Kwargs

model_params::String A path to a parameters file (see below)
verbose will perform prints on every iteration.
save_model will save a checkpoint every X iterations, where X is specified in the parameter file.
burnout how many iterations to wait after a cluster is created before it is allowed to split or merge
gt ground-truth labels; when supplied, NMI and VI analysis is performed on every iteration.

Return values

dp_model, iter_count, nmi_score_history, likelihood_history, cluster_count_history

dp_model The DPMM model inferred
iter_count Timing for each iteration
nmi_score_history NMI score per iteration (if gt supplied)
likelihood_history Log likelihood per iteration.
cluster_count_history Cluster counts per iteration.

DPMMSubClustersStreaming.dp_parallel_streaming — Function

dp_parallel_streaming(all_data::AbstractArray{Float32,2},
    local_hyper_params::distribution_hyper_params,
    α_param::Float32,
     iters::Int64 = 100,
     init_clusters::Int64 = 1,
     seed = nothing,
     verbose = true,
     save_model = false,
     burnout = 15,
     gt = nothing,
     max_clusters = Inf,
     outlier_weight = 0,
     outlier_params = nothing)

Run the model.

Args and Kwargs

all_data::AbstractArray{Float32,2} a DxN array containing the data
local_hyper_params::distribution_hyper_params the prior hyperparams
α_param::Float32 the concentration parameter
iters::Int64 number of iterations to run the model
init_clusters::Int64 number of initial clusters
seed a random seed to be used in all workers; if used, the call must be preceded by @everywhere using Random.
verbose will perform prints on every iteration.
save_model will save a checkpoint every 25 iterations.
burnout how many iterations to wait after a cluster is created before it is allowed to split or merge
gt ground-truth labels; when supplied, NMI and VI analysis is performed on every iteration.
max_clusters upper limit on the number of clusters
outlier_weight constant weight of an extra non-splitting component
outlier_params hyperparams for an extra non-splitting component

Return values

dp_model, iter_count, nmi_score_history, likelihood_history, cluster_count_history

dp_model The DPMM model inferred
iter_count Timing for each iteration
nmi_score_history NMI score per iteration (if gt supplied)
likelihood_history Log likelihood per iteration.
cluster_count_history Cluster counts per iteration.

DPMMSubClustersStreaming.fit — Method

fit(all_data::AbstractArray{Float32,2},local_hyper_params::distribution_hyper_params,α_param::Float32;
   iters::Int64 = 100, init_clusters::Int64 = 1,seed = nothing, verbose = true, save_model = false, burnout = 20, gt = nothing, max_clusters = Inf, outlier_weight = 0, outlier_params = nothing,smart_splits = false)

Run the model (basic mode).

Args and Kwargs

all_data::AbstractArray{Float32,2} a DxN array containing the data
local_hyper_params::distribution_hyper_params the prior hyperparams
α_param::Float32 the concentration parameter
iters::Int64 number of iterations to run the model
init_clusters::Int64 number of initial clusters
seed a random seed to be used in all workers; if used, the call must be preceded by @everywhere using Random.
verbose will perform prints on every iteration.
save_model will save a checkpoint every 25 iterations.
burnout how many iterations to wait after a cluster is created before it is allowed to split or merge
gt ground-truth labels; when supplied, NMI and VI analysis is performed on every iteration.
max_clusters upper limit on the number of clusters
outlier_weight constant weight of an extra non-splitting component
outlier_params hyperparams for an extra non-splitting component
smart_splits should use smart splits (Gaussian only, default is false)

Return Values

labels Labels assignments
clusters Cluster parameters
weights The cluster weights; they do not sum to 1, but to 1 minus the total weight of all uninstantiated clusters.
iter_count Timing for each iteration
nmi_score_history NMI score per iteration (if gt supplied)
likelihood_history Log likelihood per iteration.
cluster_count_history Cluster counts per iteration.
sub_labels Sub labels assignments

Example:

julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
...

julia> hyper_params = DPMMSubClusters.niw_hyperparams(1.0,
                  zeros(2),
                  5,
                  [1 0;0 1])
DPMMSubClusters.niw_hyperparams(1.0f0, Float32[0.0, 0.0], 5.0f0, Float32[1.0 0.0; 0.0 1.0])

julia> ret_values= fit(x,hyper_params,10.0, iters = 100, verbose=false)

...

julia> unique(ret_values[1])
6-element Array{Int64,1}:
 3
 6
 1
 2
 5
 4
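
The remark that the returned weights do not sum to 1 can be illustrated with a stick-breaking draw from a Dirichlet process prior. This is a sketch under stated assumptions, not the sampler the package uses: the function name stick_breaking_weights is hypothetical, and it only shows why the weights of a finite set of instantiated clusters leave some mass over.

```julia
using Random

# Stick-breaking draw of the first K mixture weights of a Dirichlet process
# with concentration α. Each v ~ Beta(1, α) is sampled by inverting its CDF.
function stick_breaking_weights(rng::AbstractRNG, α::Real, K::Integer)
    weights = zeros(K)
    remaining = 1.0                      # mass not yet assigned to a cluster
    for k in 1:K
        v = 1 - rand(rng)^(1 / α)        # v ~ Beta(1, α)
        weights[k] = v * remaining
        remaining *= 1 - v
    end
    return weights, remaining
end

weights, leftover = stick_breaking_weights(MersenneTwister(0), 10.0, 6)
```

Here sum(weights) + leftover equals 1 up to rounding, and leftover plays the role of the total weight of the clusters that were never instantiated.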

DPMMSubClustersStreaming.fit — Method

fit(all_data::AbstractArray{Float32,2},α_param::Float32;
    iters::Int64 = 100, init_clusters::Int64 = 1,seed = nothing, verbose = true, save_model = false,burnout = 20, gt = nothing, max_clusters = Inf, outlier_weight = 0, outlier_params = nothing,smart_splits = false)

Run the model (basic mode) with default NIW prior.

Args and Kwargs

all_data::AbstractArray{Float32,2} a DxN array containing the data
α_param::Float32 the concentration parameter
iters::Int64 number of iterations to run the model
init_clusters::Int64 number of initial clusters
seed a random seed to be used in all workers; if used, the call must be preceded by @everywhere using Random.
verbose will perform prints on every iteration.
save_model will save a checkpoint every 25 iterations.
burnout how many iterations to wait after a cluster is created before it is allowed to split or merge
gt ground-truth labels; when supplied, NMI and VI analysis is performed on every iteration.
max_clusters upper limit on the number of clusters
outlier_weight constant weight of an extra non-splitting component
outlier_params hyperparams for an extra non-splitting component
smart_splits should use smart splits (Gaussian only, default is false)

Return Values

labels Labels assignments
clusters Cluster parameters
weights The cluster weights; they do not sum to 1, but to 1 minus the total weight of all uninstantiated clusters.
iter_count Timing for each iteration
nmi_score_history NMI score per iteration (if gt supplied)
likelihood_history Log likelihood per iteration.
cluster_count_history Cluster counts per iteration.
sub_labels Sub labels assignments

Example:

julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
...

julia> ret_values= fit(x,10.0, iters = 100, verbose=false)

...

julia> unique(ret_values[1])
6-element Array{Int64,1}:
 3
 6
 1
 2
 5
 4

DPMMSubClustersStreaming.generate_gaussian_data — Method

generate_gaussian_data(N::Int64, D::Int64, K::Int64,MixtureVar::Number)

Generate N observations from K D-dimensional Gaussians, with the Gaussian means sampled from a normal distribution with mean 0 and variance MixtureVar.

Returns (Samples, Labels, Clusters_means, Clusters_cov)

Example

julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
[3644, 2880, 119, 154, 33, 3170]
...
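
The generation scheme described above can be sketched with the standard library alone. The function sketch_gaussian_data is a hypothetical stand-in, not the package's code: it assumes unit covariances and uniformly drawn labels, whereas the counts printed in the example above suggest the real generator produces unequal cluster sizes.

```julia
using Random, LinearAlgebra

# Hypothetical stand-in for the generator described above: K clusters in D
# dimensions, means drawn from N(0, mixture_var * I), unit covariances
# (an assumption), and labels drawn uniformly at random (also an assumption).
function sketch_gaussian_data(rng::AbstractRNG, N::Int, D::Int, K::Int, mixture_var::Real)
    means = sqrt(mixture_var) .* randn(rng, D, K)   # one column per cluster mean
    labels = rand(rng, 1:K, N)
    # Each sample is its cluster mean plus standard normal noise, stored D×N.
    samples = Float32.(means[:, labels] .+ randn(rng, D, N))
    return samples, labels, means
end

x, y, means = sketch_gaussian_data(MersenneTwister(1), 1000, 2, 6, 100.0)
```

The samples are returned as a DxN Float32 array, matching the layout that fit expects for all_data.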

DPMMSubClustersStreaming.generate_mnmm_data — Method

generate_mnmm_data(N::Int64, D::Int64, K::Int64, trials::Int64)

Generate N observations from K multinomial probability vectors over D features, with trials draws per observation.

Returns (Samples, Labels, Vectors)

Example

julia> generate_mnmm_data(10000, 10, 5, 100)
...
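
A minimal sketch of this kind of multinomial mixture data, again using only the standard library. The names sample_category and sketch_mnmm_data are hypothetical, and drawing the probability vectors by normalizing uniform random numbers is an assumption; the package may sample them differently.

```julia
using Random

# Draw one category index from the probability vector p by inverting its CDF.
function sample_category(rng::AbstractRNG, p::AbstractVector)
    u, acc = rand(rng), 0.0
    for (i, q) in enumerate(p)
        acc += q
        u < acc && return i
    end
    return length(p)   # guard against rounding error in the cumulative sum
end

# N observations, each a D-dimensional count vector built from `trials` draws
# of the probability vector belonging to the observation's cluster.
function sketch_mnmm_data(rng::AbstractRNG, N::Int, D::Int, K::Int, trials::Int)
    vectors = rand(rng, D, K)
    vectors ./= sum(vectors, dims = 1)            # each column sums to 1
    labels = rand(rng, 1:K, N)
    samples = zeros(Float32, D, N)
    for n in 1:N, _ in 1:trials
        samples[sample_category(rng, view(vectors, :, labels[n])), n] += 1
    end
    return samples, labels, vectors
end

samples, labels, vectors = sketch_mnmm_data(MersenneTwister(2), 200, 10, 5, 100)
```

Each column of samples is a count vector whose entries add up to trials.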

DPMMSubClustersStreaming.init_first_clusters! — Method

init_first_clusters!(dp_model::dp_parallel_sampling, initial_cluster_count::Int64)

Initialize the first clusters in the model, according to the number given by initial_cluster_count.

Mutates the model.

DPMMSubClustersStreaming.init_model — Method

init_model()

Initialize the model, loading the data from the external npy files specified in the params file. All prior data has been included previously and is accessed globally by the function.

Returns a dp_parallel_sampling (i.e. the main data structure) with the configured parameters and data.

DPMMSubClustersStreaming.init_model_from_data — Method

init_model_from_data(all_data)

Initialize the model from all_data, which should be a Dimensions x Samples array of type Float32. All prior data has been included previously and is accessed globally by the function.

Returns a dp_parallel_sampling (i.e. the main data structure) with the configured parameters and data.