DPMMSubClustersStreaming.cluster_statistics - Method
cluster_statistics(points, labels, clusters)

Provide average statistics of probability and likelihood for the given points, labels, and clusters.

Args and Kwargs

  • points a DxN array containing the data
  • labels the point labels
  • clusters a vector of cluster distributions

Return values

avg_ll, avg_prob

  • avg_ll average point log-likelihood for each cluster
  • avg_prob average point probability for each cluster

Example:

julia> dp = run_model_from_checkpoint("checkpoint__50.jld2")
Loading Model:
  1.073261 seconds (2.27 M allocations: 113.221 MiB, 2.60% gc time)
Including params
Loading data:
  0.000881 seconds (10.02 k allocations: 378.313 KiB)
Creating model:
Node Leaders:
Dict{Any,Any}(2=>Any[2, 3])
Running model:
...
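
As a rough illustration of the kind of per-cluster averages described above, the following self-contained sketch computes an average log-likelihood and an average density for the points assigned to each cluster. It is not the package's implementation: the names gaussian_logpdf and sketch_cluster_statistics are hypothetical, and clusters are assumed here to be plain (mean, covariance) tuples of Gaussians rather than the package's cluster distribution type.

```julia
using LinearAlgebra, Statistics

# Log-density of a multivariate Gaussian at x (hypothetical helper, not a package function).
function gaussian_logpdf(x::AbstractVector, μ::AbstractVector, Σ::AbstractMatrix)
    d = length(μ)
    C = cholesky(Symmetric(Σ))
    z = C.L \ (x .- μ)                 # whitened residual
    return -0.5 * (d * log(2π) + logdet(C) + dot(z, z))
end

# Average log-likelihood and average density of the points assigned to each cluster.
# Assumption: `clusters` is a vector of (mean, covariance) tuples and labels are 1..K.
function sketch_cluster_statistics(points::AbstractMatrix,
                                   labels::AbstractVector{<:Integer},
                                   clusters)
    avg_ll = zeros(length(clusters))
    avg_prob = zeros(length(clusters))
    for (k, (μ, Σ)) in enumerate(clusters)
        idx = findall(==(k), labels)
        isempty(idx) && continue       # empty clusters keep a value of 0
        lls = [gaussian_logpdf(view(points, :, i), μ, Σ) for i in idx]
        avg_ll[k] = mean(lls)
        avg_prob[k] = mean(exp.(lls))
    end
    return avg_ll, avg_prob
end

# Tiny DxN example (D = 2, N = 3): every point sits exactly on its cluster mean.
points = [0.0 0.0 5.0; 0.0 0.0 5.0]
labels = [1, 1, 2]
clusters = [([0.0, 0.0], Matrix(1.0I, 2, 2)), ([5.0, 5.0], Matrix(1.0I, 2, 2))]
avg_ll, avg_prob = sketch_cluster_statistics(points, labels, clusters)
```

Whether the package averages densities or some other probability for avg_prob is not stated in the docstring, so treat the second output of this sketch as one plausible reading.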
DPMMSubClustersStreaming.dp_parallel - Function
dp_parallel(all_data::AbstractArray{Float32,2},
    local_hyper_params::distribution_hyper_params,
    α_param::Float32,
     iters::Int64 = 100,
     init_clusters::Int64 = 1,
     seed = nothing,
     verbose = true,
     save_model = false,
     burnout = 15,
     gt = nothing,
     max_clusters = Inf,
     outlier_weight = 0,
     outlier_params = nothing)

Run the model.

Args and Kwargs

  • all_data::AbstractArray{Float32,2} a DxN array containing the data
  • local_hyper_params::distribution_hyper_params the prior hyperparams
  • α_param::Float32 the concentration parameter
  • iters::Int64 number of iterations to run the model
  • init_clusters::Int64 number of initial clusters
  • seed a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random.
  • verbose will perform prints on every iteration.
  • save_model will save a checkpoint every 25 iterations.
  • burnout how long to wait after creating a cluster before allowing it to split/merge
  • gt ground truth; when supplied, NMI and VI analysis is performed on every iteration.
  • max_clusters limit on the number of clusters
  • outlier_weight constant weight of an extra non-splitting component
  • outlier_params hyperparams for an extra non-splitting component
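
The seed requirement above can be sketched as follows. This is only an illustration using the Distributed and Random standard libraries; the seed value is arbitrary, and the snippet does not call the package itself. It loads Random on every process and shows that re-seeding reproduces the same draws, which is the property a shared seed is meant to provide.

```julia
using Distributed

# Load Random on the master process and on every worker (if any were added).
@everywhere using Random

# Illustration only: seeding the generator twice with the same value
# reproduces the same sequence of draws.
seed = 1234                      # arbitrary value chosen for this example
Random.seed!(seed)
first_draw = rand(Float32, 3)
Random.seed!(seed)
second_draw = rand(Float32, 3)
```

After this setup, the same value would be passed as the seed argument of the model call.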

Return values

dp_model, iter_count, nmi_score_history, likelihood_history, cluster_count_history

  • dp_model The DPMM model inferred
  • iter_count Timing for each iteration
  • nmi_score_history NMI score per iteration (if gt supplied)
  • likelihood_history Log likelihood per iteration.
  • cluster_count_history Cluster counts per iteration.
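
For readers unfamiliar with the NMI and VI scores tracked when gt is supplied, here is a minimal self-contained sketch of both quantities for two label vectors. The function names are hypothetical, and the normalization used for NMI (twice the mutual information divided by the sum of the two entropies) is an assumption; the package may normalize differently.

```julia
# Entropy (in nats) of a label vector.
function label_entropy(labels)
    n = length(labels)
    counts = Dict{eltype(labels),Int}()
    for l in labels
        counts[l] = get(counts, l, 0) + 1
    end
    return -sum(c / n * log(c / n) for c in values(counts))
end

# Mutual information (in nats) between two label vectors of equal length.
function mutual_information(a, b)
    n = length(a)
    joint = Dict{Tuple{eltype(a),eltype(b)},Int}()
    ca = Dict{eltype(a),Int}()
    cb = Dict{eltype(b),Int}()
    for (x, y) in zip(a, b)
        joint[(x, y)] = get(joint, (x, y), 0) + 1
        ca[x] = get(ca, x, 0) + 1
        cb[y] = get(cb, y, 0) + 1
    end
    return sum(c / n * log(c * n / (ca[x] * cb[y])) for ((x, y), c) in joint)
end

# NMI with arithmetic-mean normalization (assumed); undefined (NaN) if both
# labelings consist of a single cluster.
sketch_nmi(a, b) = 2 * mutual_information(a, b) / (label_entropy(a) + label_entropy(b))

# Variation of information: 0 when the two partitions are identical.
sketch_vi(a, b) = label_entropy(a) + label_entropy(b) - 2 * mutual_information(a, b)

gt = [1, 1, 2, 2]
predicted = [2, 2, 1, 1]          # same partition, different label names
nmi_value = sketch_nmi(gt, predicted)
vi_value = sketch_vi(gt, predicted)
```

Both scores depend only on the partition, not on the label names, which is why the relabeled prediction above scores as a perfect match.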
DPMMSubClustersStreaming.dp_parallel - Method
dp_parallel(model_params::String; verbose = true, save_model = true, burnout = 5, gt = nothing)

Run the model in advanced mode.

Args and Kwargs

  • model_params::String A path to a parameters file (see below)
  • verbose will perform prints on every iteration.
  • save_model will save a checkpoint every X iterations, where X is specified in the parameter file.
  • burnout how long to wait after creating a cluster before allowing it to split/merge
  • gt ground truth; when supplied, NMI and VI analysis is performed on every iteration.

Return values

dp_model, iter_count, nmi_score_history, likelihood_history, cluster_count_history

  • dp_model The DPMM model inferred
  • iter_count Timing for each iteration
  • nmi_score_history NMI score per iteration (if gt supplied)
  • likelihood_history Log likelihood per iteration.
  • cluster_count_history Cluster counts per iteration.
DPMMSubClustersStreaming.dp_parallel_streaming - Function
dp_parallel_streaming(all_data::AbstractArray{Float32,2},
    local_hyper_params::distribution_hyper_params,
    α_param::Float32,
     iters::Int64 = 100,
     init_clusters::Int64 = 1,
     seed = nothing,
     verbose = true,
     save_model = false,
     burnout = 15,
     gt = nothing,
     max_clusters = Inf,
     outlier_weight = 0,
     outlier_params = nothing)

Run the model.

Args and Kwargs

  • all_data::AbstractArray{Float32,2} a DxN array containing the data
  • local_hyper_params::distribution_hyper_params the prior hyperparams
  • α_param::Float32 the concentration parameter
  • iters::Int64 number of iterations to run the model
  • init_clusters::Int64 number of initial clusters
  • seed a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random.
  • verbose will perform prints on every iteration.
  • save_model will save a checkpoint every 25 iterations.
  • burnout how long to wait after creating a cluster before allowing it to split/merge
  • gt ground truth; when supplied, NMI and VI analysis is performed on every iteration.
  • max_clusters limit on the number of clusters
  • outlier_weight constant weight of an extra non-splitting component
  • outlier_params hyperparams for an extra non-splitting component

Return values

dp_model, iter_count, nmi_score_history, likelihood_history, cluster_count_history

  • dp_model The DPMM model inferred
  • iter_count Timing for each iteration
  • nmi_score_history NMI score per iteration (if gt supplied)
  • likelihood_history Log likelihood per iteration.
  • cluster_count_history Cluster counts per iteration.
DPMMSubClustersStreaming.fit - Method
fit(all_data::AbstractArray{Float32,2}, local_hyper_params::distribution_hyper_params, α_param::Float32;
    iters::Int64 = 100, init_clusters::Int64 = 1, seed = nothing, verbose = true, save_model = false, burnout = 20, gt = nothing, max_clusters = Inf, outlier_weight = 0, outlier_params = nothing, smart_splits = false)

Run the model (basic mode).

Args and Kwargs

  • all_data::AbstractArray{Float32,2} a DxN array containing the data
  • local_hyper_params::distribution_hyper_params the prior hyperparams
  • α_param::Float32 the concentration parameter
  • iters::Int64 number of iterations to run the model
  • init_clusters::Int64 number of initial clusters
  • seed a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random.
  • verbose will perform prints on every iteration.
  • save_model will save a checkpoint every 25 iterations.
  • burnout how long to wait after creating a cluster before allowing it to split/merge
  • gt ground truth; when supplied, NMI and VI analysis is performed on every iteration.
  • max_clusters limit on the number of clusters
  • outlier_weight constant weight of an extra non-splitting component
  • outlier_params hyperparams for an extra non-splitting component
  • smart_splits whether to use smart splits (Gaussian only, default is false)

Return values

  • labels Label assignments
  • clusters Cluster parameters
  • weights The cluster weights; they do not sum to 1, but to 1 minus the weight of all uninstantiated clusters.
  • iter_count Timing for each iteration
  • nmi_score_history NMI score per iteration (if gt supplied)
  • likelihood_history Log likelihood per iteration.
  • cluster_count_history Cluster counts per iteration.
  • sub_labels Sub-label assignments

Example:

julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
...

julia> hyper_params = DPMMSubClusters.niw_hyperparams(1.0,
                  zeros(2),
                  5,
                  [1 0;0 1])
DPMMSubClusters.niw_hyperparams(1.0f0, Float32[0.0, 0.0], 5.0f0, Float32[1.0 0.0; 0.0 1.0])

julia> ret_values= fit(x,hyper_params,10.0, iters = 100, verbose=false)

...

julia> unique(ret_values[1])
6-element Array{Int64,1}:
 3
 6
 1
 2
 5
 4
DPMMSubClustersStreaming.fit - Method
fit(all_data::AbstractArray{Float32,2}, α_param::Float32;
    iters::Int64 = 100, init_clusters::Int64 = 1, seed = nothing, verbose = true, save_model = false, burnout = 20, gt = nothing, max_clusters = Inf, outlier_weight = 0, outlier_params = nothing, smart_splits = false)

Run the model (basic mode) with default NIW prior.

Args and Kwargs

  • all_data::AbstractArray{Float32,2} a DxN array containing the data
  • α_param::Float32 the concentration parameter
  • iters::Int64 number of iterations to run the model
  • init_clusters::Int64 number of initial clusters
  • seed a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random.
  • verbose will perform prints on every iteration.
  • save_model will save a checkpoint every 25 iterations.
  • burnout how long to wait after creating a cluster before allowing it to split/merge
  • gt ground truth; when supplied, NMI and VI analysis is performed on every iteration.
  • outlier_weight constant weight of an extra non-splitting component
  • outlier_params hyperparams for an extra non-splitting component
  • smart_splits whether to use smart splits (Gaussian only, default is false)

Return values

  • labels Label assignments
  • clusters Cluster parameters
  • weights The cluster weights; they do not sum to 1, but to 1 minus the weight of all uninstantiated clusters.
  • iter_count Timing for each iteration
  • nmi_score_history NMI score per iteration (if gt supplied)
  • likelihood_history Log likelihood per iteration.
  • cluster_count_history Cluster counts per iteration.
  • sub_labels Sub-label assignments

Example:

julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
...

julia> ret_values= fit(x,10.0, iters = 100, verbose=false)

...

julia> unique(ret_values[1])
6-element Array{Int64,1}:
 3
 6
 1
 2
 5
 4
DPMMSubClustersStreaming.generate_gaussian_data - Method
generate_gaussian_data(N::Int64, D::Int64, K::Int64, MixtureVar::Number)

Generate N observations from K D-dimensional Gaussians, with the Gaussian means sampled from a Normal distribution with mean 0 and variance MixtureVar.

Returns (Samples, Labels, Clusters_means, Clusters_cov)

Example

julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
[3644, 2880, 119, 154, 33, 3170]
...
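
The generation scheme described above can be sketched in a few lines of plain Julia. This is a hypothetical re-implementation for illustration, not the package's code: sketch_gaussian_data is an invented name, every cluster is assumed to have identity covariance, and points are assigned to clusters uniformly at random (the counts printed in the example above suggest the package uses unequal cluster sizes).

```julia
using Random

# Hypothetical illustration of sampling from K D-dimensional Gaussians.
function sketch_gaussian_data(N::Int, D::Int, K::Int, mixture_var::Real;
                              rng = Random.default_rng())
    # One mean per column, each coordinate drawn from Normal(0, mixture_var).
    means = sqrt(mixture_var) .* randn(rng, D, K)
    # Assumption: uniform cluster assignment.
    labels = rand(rng, 1:K, N)
    samples = Matrix{Float32}(undef, D, N)
    for i in 1:N
        # Assumption: identity covariance for every cluster.
        samples[:, i] = means[:, labels[i]] .+ randn(rng, D)
    end
    return samples, labels, means
end

x, y, means = sketch_gaussian_data(1000, 2, 6, 100.0; rng = MersenneTwister(0))
```

The samples come back as a DxN Float32 matrix, matching the data layout the model functions on this page expect.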
DPMMSubClustersStreaming.generate_mnmm_data - Method
generate_mnmm_data(N::Int64, D::Int64, K::Int64, trials::Int64)

Generate N observations from K Multinomial vectors with D features each, with trials draws from each vector.

Returns (Samples, Labels, Vectors)

Example

julia> generate_mnmm_data(10000, 10, 5, 100)
...
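
A minimal sketch of the multinomial mixture generation described above, using only the Random standard library. All function names are hypothetical, and the choices of a flat prior over the probability vectors and uniform cluster assignment are assumptions made for the example, not facts about the package.

```julia
using Random

# Random probability vector of length D (normalized exponentials).
function random_simplex(rng, D::Int)
    w = randexp(rng, D)
    return w ./ sum(w)
end

# One multinomial count vector: `trials` categorical draws from probabilities p.
function sample_counts(rng, p::AbstractVector, trials::Int)
    counts = zeros(Int, length(p))
    cdf = cumsum(p)
    for _ in 1:trials
        j = searchsortedfirst(cdf, rand(rng) * cdf[end])
        counts[min(j, length(p))] += 1
    end
    return counts
end

# Hypothetical illustration of sampling N observations from K multinomial vectors.
function sketch_mnmm_data(N::Int, D::Int, K::Int, trials::Int;
                          rng = Random.default_rng())
    vectors = [random_simplex(rng, D) for _ in 1:K]
    labels = rand(rng, 1:K, N)       # assumption: uniform cluster assignment
    samples = Matrix{Float32}(undef, D, N)
    for i in 1:N
        samples[:, i] = sample_counts(rng, vectors[labels[i]], trials)
    end
    return samples, labels, vectors
end

samples, labels, vectors = sketch_mnmm_data(200, 10, 5, 100; rng = MersenneTwister(1))
```

Each column of samples is a count vector whose entries sum to trials.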
DPMMSubClustersStreaming.init_first_clusters! - Method
init_first_clusters!(dp_model::dp_parallel_sampling, initial_cluster_count::Int64)

Initialize the first clusters in the model, according to the number defined by initial_cluster_count.

Mutates the model.

DPMMSubClustersStreaming.init_model - Method
init_model()

Initialize the model, loading the data from external npy files specified in the params file. All prior data has been included previously and is globally accessed by the function.

Returns a dp_parallel_sampling (i.e. the main data structure) with the configured parameters and data.

DPMMSubClustersStreaming.init_model_from_data - Method
init_model_from_data(all_data)

Initialize the model from all_data, which should be a Dimensions x Samples array of type Float32. All prior data has been included previously and is globally accessed by the function.

Returns a dp_parallel_sampling (i.e. the main data structure) with the configured parameters and data.