DPMMSubClustersStreaming.compact_mnm_hyper
— Type
compact_mnm_hyper(α::AbstractArray{Float32,1})
DPMMSubClustersStreaming.multinomial_dist
— Type
multinomial_dist(α::AbstractArray{Float32,1})
DPMMSubClustersStreaming.multinomial_hyper
— Type
multinomial_hyper(α::AbstractArray{Float32,1})
DPMMSubClustersStreaming.mv_gaussian
— Type
mv_gaussian(μ::AbstractArray{Float32,1},
Σ::AbstractArray{Float32,2},
invΣ::AbstractArray{Float32,2},
logdetΣ::Float32,
invChol::UpperTriangular)
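The fields of mv_gaussian are all derived from the covariance Σ. The following is a minimal sketch of how such values can be computed with the LinearAlgebra standard library. It assumes that invChol holds the inverse of the upper Cholesky factor, which the field name and type suggest but this page does not confirm; the matrix values are illustrative only.

```julia
using LinearAlgebra

# Hypothetical 2x2 covariance; any symmetric positive-definite matrix works.
Σ = Float32[2.0 0.5; 0.5 1.0]
μ = zeros(Float32, 2)

chol = cholesky(Symmetric(Σ))   # factorization Σ = U'U
invΣ = inv(chol)                # inverse covariance
logdetΣ = logdet(chol)          # log-determinant of Σ
invChol = inv(chol.U)           # assumption: inverse of the upper Cholesky factor
```

Under this assumption, invChol * invChol' reproduces invΣ, so the cached factor is enough to evaluate the Gaussian density without another factorization.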
DPMMSubClustersStreaming.niw_hyperparams
— Type
niw_hyperparams(κ::Float32, m::AbstractArray{Float32}, ν::Float32, ψ::AbstractArray{Float32})
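For reference, the Normal-Inverse-Wishart prior in its usual textbook parameterization is sketched below. It assumes the four fields κ, m, ν and ψ map to that parameterization in the conventional way, which this page does not state explicitly.

```latex
\Sigma \sim \mathcal{W}^{-1}(\psi, \nu), \qquad
\mu \mid \Sigma \sim \mathcal{N}\!\left(m, \tfrac{1}{\kappa}\,\Sigma\right)
```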
DPMMSubClustersStreaming.cluster_statistics
— Method
cluster_statistics(points, labels, clusters)
Provide average statistics of probability and likelihood for the given points, labels and clusters.
Args and Kwargs
points: a DxN array containing the data
labels: points labels
clusters: vector of clusters distributions
Return values
avg_ll, avg_prob
avg_ll: each cluster's average point log likelihood
avg_prob: each cluster's average point probability
Example:
julia> dp = run_model_from_checkpoint("checkpoint__50.jld2")
Loading Model:
1.073261 seconds (2.27 M allocations: 113.221 MiB, 2.60% gc time)
Including params
Loading data:
0.000881 seconds (10.02 k allocations: 378.313 KiB)
Creating model:
Node Leaders:
Dict{Any,Any}(2=>Any[2, 3])
Running model:
...
DPMMSubClustersStreaming.dp_parallel
— Function
dp_parallel(all_data::AbstractArray{Float32,2},
local_hyper_params::distribution_hyper_params,
α_param::Float32,
iters::Int64 = 100,
init_clusters::Int64 = 1,
seed = nothing,
verbose = true,
save_model = false,
burnout = 15,
gt = nothing,
max_clusters = Inf,
outlier_weight = 0,
outlier_params = nothing)
Run the model.
Args and Kwargs
all_data::AbstractArray{Float32,2}: a DxN array containing the data
local_hyper_params::distribution_hyper_params: the prior hyperparams
α_param::Float32: the concentration parameter
iters::Int64: number of iterations to run the model
init_clusters::Int64: number of initial clusters
seed: a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random
verbose: will print on every iteration
save_model: will save a checkpoint every 25 iterations
burnout: how long to wait after creating a cluster before allowing it to split/merge
gt: ground truth; when supplied, NMI and VI analysis is performed on every iteration
max_clusters: limit on the number of clusters
outlier_weight: constant weight of an extra non-splitting component
outlier_params: hyperparams for an extra non-splitting component
Return values
dp_model, iter_count, nmi_score_history, likelihood_history, cluster_count_history
dp_model: the inferred DPMM model
iter_count: timing for each iteration
nmi_score_history: NMI score per iteration (if gt supplied)
likelihood_history: log likelihood per iteration
cluster_count_history: cluster counts per iteration
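The seed argument above requires the Random module to be loaded on every worker. A minimal sketch of that setup follows, using only the standard library. The seed value is arbitrary, and the explicit Random.seed! calls are here purely to illustrate reproducibility; how the package applies the seed internally is not shown on this page.

```julia
using Distributed
@everywhere using Random   # make Random available on every process

# Seed every process with the same (illustrative) value and draw once.
@everywhere Random.seed!(1234)
first_draw = rand()

# Re-seeding reproduces the same draw on the main process.
@everywhere Random.seed!(1234)
second_draw = rand()
```

With no extra workers added, @everywhere simply runs on the main process, so the snippet also works in a single-process session.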
DPMMSubClustersStreaming.dp_parallel
— Method
dp_parallel(model_params::String; verbose = true, save_model = true, burnout = 5, gt = nothing)
Run the model in advanced mode.
Args and Kwargs
model_params::String: a path to a parameters file (see below)
verbose: will print on every iteration
save_model: will save a checkpoint every X iterations, where X is specified in the parameter file
burnout: how long to wait after creating a cluster before allowing it to split/merge
gt: ground truth; when supplied, NMI and VI analysis is performed on every iteration
Return values
dp_model, iter_count, nmi_score_history, likelihood_history, cluster_count_history
dp_model: the inferred DPMM model
iter_count: timing for each iteration
nmi_score_history: NMI score per iteration (if gt supplied)
likelihood_history: log likelihood per iteration
cluster_count_history: cluster counts per iteration
DPMMSubClustersStreaming.dp_parallel_streaming
— Function
dp_parallel_streaming(all_data::AbstractArray{Float32,2},
local_hyper_params::distribution_hyper_params,
α_param::Float32,
iters::Int64 = 100,
init_clusters::Int64 = 1,
seed = nothing,
verbose = true,
save_model = false,
burnout = 15,
gt = nothing,
max_clusters = Inf,
outlier_weight = 0,
outlier_params = nothing)
Run the model.
Args and Kwargs
all_data::AbstractArray{Float32,2}: a DxN array containing the data
local_hyper_params::distribution_hyper_params: the prior hyperparams
α_param::Float32: the concentration parameter
iters::Int64: number of iterations to run the model
init_clusters::Int64: number of initial clusters
seed: a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random
verbose: will print on every iteration
save_model: will save a checkpoint every 25 iterations
burnout: how long to wait after creating a cluster before allowing it to split/merge
gt: ground truth; when supplied, NMI and VI analysis is performed on every iteration
max_clusters: limit on the number of clusters
outlier_weight: constant weight of an extra non-splitting component
outlier_params: hyperparams for an extra non-splitting component
Return values
dp_model, iter_count, nmi_score_history, likelihood_history, cluster_count_history
dp_model: the inferred DPMM model
iter_count: timing for each iteration
nmi_score_history: NMI score per iteration (if gt supplied)
likelihood_history: log likelihood per iteration
cluster_count_history: cluster counts per iteration
DPMMSubClustersStreaming.fit
— Method
fit(all_data::AbstractArray{Float32,2}, local_hyper_params::distribution_hyper_params, α_param::Float32;
iters::Int64 = 100, init_clusters::Int64 = 1, seed = nothing, verbose = true, save_model = false, burnout = 20, gt = nothing, max_clusters = Inf, outlier_weight = 0, outlier_params = nothing, smart_splits = false)
Run the model (basic mode).
Args and Kwargs
all_data::AbstractArray{Float32,2}: a DxN array containing the data
local_hyper_params::distribution_hyper_params: the prior hyperparams
α_param::Float32: the concentration parameter
iters::Int64: number of iterations to run the model
init_clusters::Int64: number of initial clusters
seed: a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random
verbose: will print on every iteration
save_model: will save a checkpoint every 25 iterations
burnout: how long to wait after creating a cluster before allowing it to split/merge
gt: ground truth; when supplied, NMI and VI analysis is performed on every iteration
max_clusters: limit on the number of clusters
outlier_weight: constant weight of an extra non-splitting component
outlier_params: hyperparams for an extra non-splitting component
smart_splits: whether to use smart splits (Gaussian only, default is false)
Return Values
labels: label assignments
clusters: cluster parameters
weights: the cluster weights; they do not sum to 1, but to 1 minus the weight of all uninstantiated clusters
iter_count: timing for each iteration
nmi_score_history: NMI score per iteration (if gt supplied)
likelihood_history: log likelihood per iteration
cluster_count_history: cluster counts per iteration
sub_labels: sub-label assignments
Example:
julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
...
julia> hyper_params = DPMMSubClusters.niw_hyperparams(1.0,
zeros(2),
5,
[1 0;0 1])
DPMMSubClusters.niw_hyperparams(1.0f0, Float32[0.0, 0.0], 5.0f0, Float32[1.0 0.0; 0.0 1.0])
julia> ret_values= fit(x,hyper_params,10.0, iters = 100, verbose=false)
...
julia> unique(ret_values[1])
6-element Array{Int64,1}:
3
6
1
2
5
4
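The returned weights deliberately leave out the mass of clusters that have not been instantiated. The sketch below shows how that leftover mass can be recovered from a weights vector. The values are made-up stand-ins for illustration, not output of fit.

```julia
# Hypothetical weights for three instantiated clusters (stand-in values).
weights = Float32[0.5, 0.3, 0.15]

# Mass assigned to instantiated clusters; by the description above this is below 1.
instantiated_mass = sum(weights)

# The remainder is the weight of all uninstantiated clusters.
uninstantiated_mass = 1 - instantiated_mass
```

Keep this in mind before treating the weights as a probability vector: they need the leftover mass added, or an explicit renormalization, to sum to 1.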
DPMMSubClustersStreaming.fit
— Method
fit(all_data::AbstractArray{Float32,2}, α_param::Float32;
iters::Int64 = 100, init_clusters::Int64 = 1, seed = nothing, verbose = true, save_model = false, burnout = 20, gt = nothing, max_clusters = Inf, outlier_weight = 0, outlier_params = nothing, smart_splits = false)
Run the model (basic mode) with the default NIW prior.
Args and Kwargs
all_data::AbstractArray{Float32,2}: a DxN array containing the data
α_param::Float32: the concentration parameter
iters::Int64: number of iterations to run the model
init_clusters::Int64: number of initial clusters
seed: a random seed to be used in all workers; if used, it must be preceded by @everywhere using Random
verbose: will print on every iteration
save_model: will save a checkpoint every 25 iterations
burnout: how long to wait after creating a cluster before allowing it to split/merge
gt: ground truth; when supplied, NMI and VI analysis is performed on every iteration
outlier_weight: constant weight of an extra non-splitting component
outlier_params: hyperparams for an extra non-splitting component
smart_splits: whether to use smart splits (Gaussian only, default is false)
Return Values
labels: label assignments
clusters: cluster parameters
weights: the cluster weights; they do not sum to 1, but to 1 minus the weight of all uninstantiated clusters
iter_count: timing for each iteration
nmi_score_history: NMI score per iteration (if gt supplied)
likelihood_history: log likelihood per iteration
cluster_count_history: cluster counts per iteration
sub_labels: sub-label assignments
Example:
julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
...
julia> ret_values= fit(x,10.0, iters = 100, verbose=false)
...
julia> unique(ret_values[1])
6-element Array{Int64,1}:
3
6
1
2
5
4
DPMMSubClustersStreaming.generate_gaussian_data
— Method
generate_gaussian_data(N::Int64, D::Int64, K::Int64, MixtureVar::Number)
Generate N observations from K D-dimensional Gaussians, with the Gaussian means sampled from a Normal distribution with mean 0 and variance MixtureVar.
Returns (Samples, Labels, Clusters_means, Clusters_cov)
Example
julia> x,y,clusters = generate_gaussian_data(10000,2,6,100.0)
[3644, 2880, 119, 154, 33, 3170]
...
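The generative process described above can be sketched with the standard library alone. This is not the package's implementation: the function name sketch_gaussian_mixture is hypothetical, and the sketch assumes identity covariances and uniformly sampled labels, neither of which this page specifies.

```julia
using Random

# Hypothetical stand-in for the process described above, not the package function.
function sketch_gaussian_mixture(N::Int, D::Int, K::Int, mixture_var::Real;
                                 rng = Random.default_rng())
    # Cluster means drawn from Normal(0, mixture_var), one column per cluster.
    means = sqrt(Float32(mixture_var)) .* randn(rng, Float32, D, K)
    # Assumption: labels are drawn uniformly over the K clusters.
    labels = rand(rng, 1:K, N)
    # Assumption: every cluster has identity covariance.
    samples = means[:, labels] .+ randn(rng, Float32, D, N)
    return samples, labels, means
end

x, y, μs = sketch_gaussian_mixture(1000, 2, 6, 100.0)
```

The samples come back as a DxN Float32 array, matching the layout that fit and dp_parallel expect for all_data.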
DPMMSubClustersStreaming.generate_mnmm_data
— Method
generate_mnmm_data(N::Int64, D::Int64, K::Int64, trials::Int64)
Generate N observations from K Multinomial vectors with D features, with trials draws from each vector.
Returns (Samples, Labels, Vectors)
Example
julia> generate_mnmm_data(10000, 10, 5, 100)
...
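A comparable sketch for the multinomial case follows, again using only the standard library. The function name sketch_multinomial_mixture is hypothetical, and the way the probability vectors and labels are drawn (normalized uniform values, uniform labels) is an assumption made for illustration.

```julia
using Random

# Hypothetical stand-in for the process described above, not the package function.
function sketch_multinomial_mixture(N::Int, D::Int, K::Int, trials::Int;
                                    rng = Random.default_rng())
    # Assumption: each cluster's probability vector is a normalized uniform draw.
    vectors = rand(rng, Float32, D, K)
    vectors ./= sum(vectors, dims = 1)
    # Assumption: labels are drawn uniformly over the K clusters.
    labels = rand(rng, 1:K, N)
    samples = zeros(Float32, D, N)
    for n in 1:N
        cdf = cumsum(vectors[:, labels[n]])
        for _ in 1:trials
            # Inverse-CDF draw of one category; min guards against rounding in cdf[end].
            idx = min(searchsortedfirst(cdf, rand(rng, Float32)), D)
            samples[idx, n] += 1
        end
    end
    return samples, labels, vectors
end

counts, z, probs = sketch_multinomial_mixture(200, 10, 5, 100)
```

Each column of counts is one observation whose entries sum to trials.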
DPMMSubClustersStreaming.init_first_clusters!
— Method
init_first_clusters!(dp_model::dp_parallel_sampling, initial_cluster_count::Int64)
Initialize the first clusters in the model, according to the number defined by initial_cluster_count.
Mutates the model.
DPMMSubClustersStreaming.init_model
— Method
init_model()
Initialize the model, loading the data from external npy files specified in the params file. All prior data has been included previously, and is globally accessed by the function.
Returns a dp_parallel_sampling (i.e. the main data structure) with the configured parameters and data.
DPMMSubClustersStreaming.init_model_from_data
— Method
init_model_from_data(all_data)
Initialize the model from all_data, which should be Dimensions X Samples, of type Float32. All prior data has been included previously, and is globally accessed by the function.
Returns a dp_parallel_sampling (i.e. the main data structure) with the configured parameters and data.