Discreet

Build Status Coverage status

Discreet is a small opinionated toolbox to estimate entropy and mutual information from discrete samples. It contains methods to adjust results and correct over- or under-estimations.

Estimating entropy

Discreet uses StatsBase's FrequencyWeights and ProbabilityWeights types.

using StatsBase: FrequencyWeights, ProbabilityWeights
using Discreet

dist = FrequencyWeights([1, 1, 1, 1, 1, 1])
entropy(dist)  # Naive method: log(6) ≈ 1.792

entropy(dist; method=:CS)  # Chao-Shen correction: ≈ 3.840

entropy(dist; method=:Shrink)  # Shrinkage correction: ≈ 1.792

dist = ProbabilityWeights([.5, .5])
entropy(dist)  # log(2) ≈ 0.693

Discreet can also estimate the entropy of a sample:

using Discreet

data = ["tomato", "apple", "apple", "banana", "tomato"]
estimate_entropy(data)  # == entropy(FrequencyWeights([2, 2, 1]))

Estimate mutual information

Discrete provides similar routines to estimate mutual information.

using Discreet

labels_a = [1, 1, 1, 1, 1, 1, 2, 2, 2, 2, 2, 2, 3, 3, 3, 3, 3]
labels_b = [1, 1, 1, 1, 2, 1, 2, 2, 2, 2, 3, 1, 3, 3, 3, 2, 2]
mutual_information(labels_a, labels_b)  # Naive method: ≈ 0.410

mutual_information(labels_a, labels_b; method=:CS)  # Chao-Shen correction: ≈ 0.148

mutual_information(labels_a, labels_b; normalize=true)  # Normalized score (between 0 and 1): ≈ 0.382