# Empirical Estimation

## Histograms

StatsBase.Histogram (Type)
Histogram <: AbstractHistogram

The Histogram type represents data that has been tabulated into intervals (known as bins) along the real line, or in higher dimensions, over a real space. Histograms can be fitted to data using the fit method.

Fields

• edges: An iterator that contains the boundaries of the bins in each dimension.
• weights: An array that contains the weight of each bin.
• closed: A symbol with value :right or :left indicating on which side bins (half-open intervals or higher-dimensional analogues thereof) are closed. See below for an example.
• isdensity: There are two interpretations of a Histogram. If isdensity=false the weight of a bin corresponds to the amount of a quantity in the bin. If isdensity=true then it corresponds to the density (amount / volume) of the quantity in the bin. See below for an example.

Examples

Example illustrating closed

julia> using StatsBase

julia> fit(Histogram, [2.],  1:3, closed=:left)
Histogram{Int64,1,Tuple{UnitRange{Int64}}}
edges:
1:3
weights: [0, 1]
closed: left
isdensity: false

julia> fit(Histogram, [2.],  1:3, closed=:right)
Histogram{Int64,1,Tuple{UnitRange{Int64}}}
edges:
1:3
weights: [1, 0]
closed: right
isdensity: false

Example illustrating isdensity

julia> using StatsBase, LinearAlgebra

julia> bins = [0,1,7]; # a small and a large bin

julia> obs = [0.5, 1.5, 1.5, 2.5]; # one observation in the small bin and three in the large

julia> h = fit(Histogram, obs, bins)
Histogram{Int64,1,Tuple{Array{Int64,1}}}
edges:
[0, 1, 7]
weights: [1, 3]
closed: left
isdensity: false

julia> # observe isdensity = false and the weights field records the number of observations in each bin

julia> normalize(h, mode=:density)
Histogram{Float64,1,Tuple{Array{Int64,1}}}
edges:
[0, 1, 7]
weights: [1.0, 0.5]
closed: left
isdensity: true

julia> # observe isdensity = true and the weights field gives the number of observations per unit of bin size in each bin

Histograms can be fitted to data using the fit method.

StatsBase.fit (Method)
fit(Histogram, data[, weight][, edges]; closed=:left, nbins)

Fit a histogram to data.

Arguments

• data: either a vector (for a 1-dimensional histogram), or a tuple of vectors of equal length (for an n-dimensional histogram).

• weight: an optional AbstractWeights (of the same length as the data vectors), denoting the weight each observation contributes to the bin. If no weight vector is supplied, each observation has weight 1.

• edges: a vector (typically an AbstractRange object), or tuple of vectors, that gives the edges of the bins along each dimension. If no edges are provided, these are determined from the data.

Keyword arguments

• closed: if :left (the default), the bin intervals are left-closed [a,b); if :right, intervals are right-closed (a,b].

• nbins: if no edges argument is supplied, the approximate number of bins to use along each dimension (can be either a single integer, or a tuple of integers).

Examples

# Univariate
h = fit(Histogram, rand(100))
h = fit(Histogram, rand(100), 0:0.1:1.0)
h = fit(Histogram, rand(100), nbins=10)
h = fit(Histogram, rand(100), weights(rand(100)), 0:0.1:1.0)
h = fit(Histogram, [20], 0:20:100)
h = fit(Histogram, [20], 0:20:100, closed=:right)

# Multivariate
h = fit(Histogram, (rand(100),rand(100)))
h = fit(Histogram, (rand(100),rand(100)),nbins=10)
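The examples above use random data, so their output varies from run to run. As a complement, here is a small sketch with hand-picked inputs whose bin weights can be checked by hand. The arrays `data`, `w`, `xs` and `ys` are illustrative values chosen for this sketch, not part of the library.

```julia
using StatsBase

# Weighted 1-D fit: each observation adds its weight to the bin it falls in.
data = [0.5, 1.5, 1.5, 2.5]
w = weights([1.0, 2.0, 2.0, 4.0])
h1 = fit(Histogram, data, w, 0:3)   # bins [0,1), [1,2), [2,3)
h1.weights                          # [1.0, 4.0, 4.0]

# 2-D fit: data and edges are tuples with one entry per dimension.
xs = [0.5, 1.5]
ys = [0.5, 0.5]
h2 = fit(Histogram, (xs, ys), (0:2, 0:2))
size(h2.weights)                    # (2, 2): first index is the x bin, second the y bin
h2.weights                          # [1 0; 1 0]
```

Both observations in the 2-D case fall in the first `y` bin, so only the first column of the weight matrix is populated.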

Base.merge! (Function)
merge!(target::Histogram, others::Histogram...)

Update histogram target by merging it with the histograms others. See merge(histogram::Histogram, others::Histogram...) for details.

Base.merge (Function)
merge(h::Histogram, others::Histogram...)

Construct a new histogram by merging h with others. All histograms must have the same binning, shape of weights and properties (closed and isdensity). The weights of all histograms are summed for each bin; the weights of the resulting histogram have the same type as those of h.
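A minimal sketch of merging, assuming two histograms fitted over the same edges. The variable names and data are illustrative.

```julia
using StatsBase

edges = 0:3                                   # shared binning: [0,1), [1,2), [2,3)
h1 = fit(Histogram, [0.5, 1.5], edges)        # weights [1, 1, 0]
h2 = fit(Histogram, [1.5, 2.5, 2.5], edges)   # weights [0, 1, 2]

hm = merge(h1, h2)     # new histogram; h1 and h2 are left untouched
hm.weights             # [1, 2, 2]

merge!(h1, h2)         # in-place variant: h1 now holds the summed weights
h1.weights             # [1, 2, 2]
```

This pattern is handy when data arrives in chunks: each chunk can be fitted separately against fixed edges and the results accumulated.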

LinearAlgebra.norm (Function)
norm(h::Histogram)

Calculate the norm of histogram h as the absolute value of its integral.
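As a hedged illustration, the sketch below evaluates `norm` on a density histogram, where the integral is the sum of density times bin width. It reuses the data from the isdensity example; only the density case is shown here.

```julia
using StatsBase, LinearAlgebra

h = fit(Histogram, [0.5, 1.5, 1.5, 2.5], [0, 1, 7])   # counts [1, 3]
hd = normalize(h, mode=:density)                      # densities [1.0, 0.5]

# Integral of the density: 1.0 * (1 - 0) + 0.5 * (7 - 1) = 4,
# which recovers the number of observations.
norm(hd)   # ≈ 4.0
```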

LinearAlgebra.normalize (Function)
normalize(h::Histogram{T,N}; mode::Symbol=:pdf) where {T,N}

Normalize the histogram h.

Valid values for mode are:

• :pdf: Normalize by sum of weights and bin sizes. Resulting histogram has norm 1 and represents a PDF.
• :density: Normalize by bin sizes only. Resulting histogram represents count density of input and does not have norm 1. Will not modify the histogram if it already represents a density (h.isdensity == true).
• :probability: Normalize by sum of weights only. Resulting histogram represents the fraction of probability mass for each bin and does not have norm 1.
• :none: Leaves histogram unchanged. Useful to simplify code that has to conditionally apply different modes of normalization.

Successive application of both :probability and :density normalization (in any order) is equivalent to :pdf normalization.
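The relationship between the modes can be checked on a small case. The following sketch reuses the two-bin data from the isdensity example (bin widths 1 and 6) and is only meant to illustrate the statement above.

```julia
using StatsBase, LinearAlgebra

h = fit(Histogram, [0.5, 1.5, 1.5, 2.5], [0, 1, 7])   # counts [1, 3]

h_pdf  = normalize(h, mode=:pdf)           # [1/(4*1), 3/(4*6)] = [0.25, 0.125]
h_prob = normalize(h, mode=:probability)   # [1/4, 3/4]         = [0.25, 0.75]
h_dens = normalize(h, mode=:density)       # [1/1, 3/6]         = [1.0, 0.5]

# Applying the two partial normalizations in either order matches :pdf.
a = normalize(h_prob, mode=:density)
b = normalize(h_dens, mode=:probability)
a.weights ≈ h_pdf.weights && b.weights ≈ h_pdf.weights   # true
```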

normalize(h::Histogram{T,N}, aux_weights::Array{T,N}...; mode::Symbol=:pdf) where {T,N}

Normalize the histogram h and rescale one or more auxiliary weight arrays at the same time (aux_weights may, e.g., contain estimated statistical uncertainties). The values of the auxiliary arrays are scaled by the same factor as the corresponding histogram weight values. Returns a tuple of the normalized histogram and scaled auxiliary weights.
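A sketch of the auxiliary-weights form, assuming Poisson-style uncertainties (square root of the counts) as the auxiliary array. The name `err` and the choice of uncertainty model are illustrative. The histogram is fitted with `Float64` weights so that its weight type matches the element type of `err`, as the signature requires.

```julia
using StatsBase, LinearAlgebra

# Unit Float64 weights give a histogram with Float64 bin weights [1.0, 3.0].
h = fit(Histogram, [0.5, 1.5, 1.5, 2.5], weights(ones(4)), [0, 1, 7])
err = sqrt.(h.weights)                        # illustrative per-bin uncertainties

hn, errn = normalize(h, err; mode=:density)   # returns (histogram, scaled aux array)
hn.weights                                    # [1.0, 0.5]: counts divided by bin widths 1 and 6
errn ≈ err ./ [1.0, 6.0]                      # true: scaled by the same per-bin factor
```

The inputs `h` and `err` are not modified by this non-mutating form.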

LinearAlgebra.normalize! (Function)
normalize!(h::Histogram{T,N}, aux_weights::Array{T,N}...; mode::Symbol=:pdf) where {T<:AbstractFloat,N}

Normalize the histogram h and optionally scale one or more auxiliary weight arrays appropriately. See description of normalize for details. Returns h.
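The in-place variant can be sketched in the same way. Per the signature, it requires floating-point weights, so this example fits with `Float64` unit weights. The names `err` and `err_before` are illustrative.

```julia
using StatsBase, LinearAlgebra

h = fit(Histogram, [0.5, 1.5, 1.5, 2.5], weights(ones(4)), [0, 1, 7])   # weights [1.0, 3.0]
err = sqrt.(h.weights)
err_before = copy(err)

ret = normalize!(h, err; mode=:pdf)   # mutates both h and err
ret === h                             # true: the same histogram object is returned
h.weights                             # ≈ [0.25, 0.125]
err ≈ err_before ./ [4.0, 24.0]       # true: divided by (sum of weights) * (bin width)
```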

Base.zero (Function)
zero(h::Histogram)

Create a new histogram with the same binning, type and shape of weights and the same properties (closed and isdensity) as h, with all weights set to zero.
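One possible use, sketched below, is to create an empty accumulator with the right binning and then fold other histograms into it with `merge!`. The data values are illustrative.

```julia
using StatsBase

h = fit(Histogram, [0.5, 1.5, 1.5, 2.5], [0, 1, 7])   # weights [1, 3]

acc = zero(h)           # same edges, closed and isdensity; all weights zero
acc.weights             # [0, 0]
acc.edges == h.edges    # true

merge!(acc, h, h)       # accumulate h twice into the empty histogram
acc.weights             # [2, 6]
```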

## Empirical Cumulative Distribution Function

StatsBase.ecdf (Function)
ecdf(X; weights::AbstractWeights)

Return an empirical cumulative distribution function (ECDF) based on a vector of samples given in X. Optionally providing weights returns a weighted ECDF.

Note: this function returns a callable composite type, which can then be applied to evaluate CDF values on other samples.

extrema, minimum, and maximum are supported for obtaining the range over which the function is inside the interval $(0,1)$; the function itself is defined on the whole real line.
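A short sketch of how the returned object might be used, with hand-picked sample values. The names `F` and `Fw` are illustrative.

```julia
using StatsBase

F = ecdf([1.0, 2.0, 2.0, 4.0])

F(0.0)        # 0.0   (below every sample)
F(1.0)        # 0.25  (1 of 4 samples are <= 1.0)
F(2.0)        # 0.75  (3 of 4 samples are <= 2.0)
F(10.0)       # 1.0   (above every sample)
extrema(F)    # (1.0, 4.0): smallest and largest sample

# Weighted variant: each sample contributes its weight instead of 1.
Fw = ecdf([1.0, 2.0, 3.0]; weights=weights([1.0, 1.0, 2.0]))
Fw(2.0)       # 0.5   (weight 2 out of a total of 4)
```

Evaluation at several points can be written with broadcasting, for example `F.([0.0, 2.0, 10.0])`.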