Sampling from Population

Sampling API

The package provides functions for sampling from a given population (with or without replacement).

StatsBase.sampleFunction
sample([rng], a, [wv::AbstractWeights])

Select a single random element of a. Sampling probabilities are proportional to the weights given in wv, if provided.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

sample([rng], a, [wv::AbstractWeights], n::Integer; replace=true, ordered=false)

Select a random, optionally weighted sample of size n from an array a using a polyalgorithm. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement and order dictates whether an ordered sample, also called a sequential sample, should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

sample([rng], a, [wv::AbstractWeights], dims::Dims; replace=true, ordered=false)

Select a random, optionally weighted sample from an array a specifying the dimensions dims of the output array. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement and order dictates whether an ordered sample, also called a sequential sample, should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

sample([rng], wv::AbstractWeights)

Select a single random integer in 1:length(wv) with probabilities proportional to the weights given in wv.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

StatsBase.sample!Function
sample!([rng], a, [wv::AbstractWeights], x; replace=true, ordered=false)

Draw a random sample of length(x) elements from an array a and store the result in x. A polyalgorithm is used for sampling. Sampling probabilities are proportional to the weights given in wv, if provided. replace dictates whether sampling is performed with replacement and order dictates whether an ordered sample, also called a sequential sample, should be taken.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

Algorithms

Internally, this package implements multiple algorithms, and the sample (and sample!) methods integrate them into a poly-algorithm, which chooses a specific algorithm based on inputs.

Note that the choices made in sample are decided based on extensive benchmarking (see perf/sampling.jl and perf/wsampling.jl). It performs reasonably fast for most cases. That being said, if you know that a certain algorithm is particularly suitable for your context, directly calling an internal algorithm function might be slightly more efficient.

Here are a list of algorithms implemented in the package. The functions below are not exported (one can still import them from StatsBase via using though).

Notations

  • a: source array representing the population
  • x: the destination array
  • wv: the weight vector (of type AbstractWeights), for weighted sampling
  • n: the length of a
  • k: the length of x. For sampling without replacement, k must not exceed n.
  • rng: optional random number generator (defaults to Random.GLOBAL_RNG)

All following functions write results to x (pre-allocated) and return x.

Sampling Algorithms (Non-Weighted)

StatsBase.direct_sample!Method
direct_sample!([rng], a::AbstractArray, x::AbstractArray)

Direct sampling: for each j in 1:k, randomly pick i from 1:n, and set x[j] = a[i], with n=length(a) and k=length(x).

This algorithm consumes k random numbers.

StatsBase.samplepairFunction
samplepair([rng], n)

Draw a pair of distinct integers between 1 and n without replacement.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

samplepair([rng], a)

Draw a pair of distinct elements from the array a without replacement.

Optionally specify a random number generator rng as the first argument (defaults to Random.GLOBAL_RNG).

StatsBase.knuths_sample!Function
knuths_sample!([rng], a, x)

Knuth's Algorithm S for random sampling without replacement.

Reference: D. Knuth. The Art of Computer Programming. Vol 2, 3.4.2, p.142.

This algorithm consumes length(a) random numbers. It requires no additional memory space. Suitable for the case where memory is tight.

StatsBase.fisher_yates_sample!Function
fisher_yates_sample!([rng], a::AbstractArray, x::AbstractArray)

Fisher-Yates shuffling (with early termination).

Pseudo-code:

n = length(a)
k = length(x)

# Create an array of the indices
inds = collect(1:n)

for i = 1:k
    # swap element `i` with another random element in inds[i:n]
    # set element `i` in `x`
end

This algorithm consumes k=length(x) random numbers. It uses an integer array of length n=length(a) internally to maintain the shuffled indices. It is considerably faster than Knuth's algorithm especially when n is greater than k. It is $O(n)$ for initialization, plus $O(k)$ for random shuffling

StatsBase.self_avoid_sample!Function
self_avoid_sample!([rng], a::AbstractArray, x::AbstractArray)

Self-avoid sampling: use a set to maintain the index that has been sampled. Each time draw a new index, if the index has already been sampled, redraw until it draws an unsampled one.

This algorithm consumes about (or slightly more than) k=length(x) random numbers, and requires $O(k)$ memory to store the set of sampled indices. Very fast when $n >> k$, with n=length(a).

However, if k is large and approaches $n$, the rejection rate would increase drastically, resulting in poorer performance.

StatsBase.seqsample_a!Function
seqsample_a!([rng], a::AbstractArray, x::AbstractArray)

Random subsequence sampling using algorithm A described in the following paper (page 714): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(n)$ random numbers, with n=length(a). The outputs are ordered.

StatsBase.seqsample_c!Function
seqsample_c!([rng], a::AbstractArray, x::AbstractArray)

Random subsequence sampling using algorithm C described in the following paper (page 715): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(k^2)$ random numbers, with k=length(x). The outputs are ordered.

StatsBase.seqsample_d!Function
seqsample_d!([rng], a::AbstractArray, x::AbstractArray)

Random subsequence sampling using algorithm D described in the following paper (page 716-17): Jeffrey Scott Vitter. "Faster Methods for Random Sampling". Communications of the ACM, 27 (7), July 1984.

This algorithm consumes $O(k)$ random numbers, with k=length(x). The outputs are ordered.

Weighted Sampling Algorithms

StatsBase.direct_sample!Method
direct_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Direct sampling.

Draw each sample by scanning the weight vector.

Noting k=length(x) and n=length(a), this algorithm:

  • consumes k random numbers
  • has time complexity $O(n k)$, as scanning the weight vector each time takes $O(n)$
  • requires no additional memory space.
StatsBase.alias_sample!Function
alias_sample!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Alias method.

Build an alias table, and sample therefrom.

Reference: Walker, A. J. "An Efficient Method for Generating Discrete Random Variables with General Distributions." ACM Transactions on Mathematical Software 3 (3): 253, 1977.

Noting k=length(x) and n=length(a), this algorithm takes $O(n \log n)$ time for building the alias table, and then $O(1)$ to draw each sample. It consumes $2 k$ random numbers.

StatsBase.naive_wsample_norep!Function
naive_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Naive implementation of weighted sampling without replacement.

It makes a copy of the weight vector at initialization, and sets the weight to zero when the corresponding sample is picked.

Noting k=length(x) and n=length(a), this algorithm consumes $O(k)$ random numbers, and has overall time complexity $O(n k)$.

StatsBase.efraimidis_a_wsample_norep!Function
efraimidis_a_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Weighted sampling without replacement using Efraimidis-Spirakis A algorithm.

Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.

Noting k=length(x) and n=length(a), this algorithm takes $O(n + k \log k)$ processing time to draw $k$ elements. It consumes $n$ random numbers.

StatsBase.efraimidis_ares_wsample_norep!Function
efraimidis_ares_wsample_norep!([rng], a::AbstractArray, wv::AbstractWeights, x::AbstractArray)

Implementation of weighted sampling without replacement using Efraimidis-Spirakis A-Res algorithm.

Reference: Efraimidis, P. S., Spirakis, P. G. "Weighted random sampling with a reservoir." Information Processing Letters, 97 (5), 181-185, 2006. doi:10.1016/j.ipl.2005.11.003.

Noting k=length(x) and n=length(a), this algorithm takes $O(k \log(k) \log(n / k))$ processing time to draw $k$ elements. It consumes $n$ random numbers.