Bag aggregation

A wrapper type Aggregation and all subtypes of AggregationOperator it wraps are structures responsible for mapping vector representations of multiple instances into a single vector. They all operate element-wise and independently in each dimension, so the output has the same size as the representations on the input, unless a Concatenation of multiple operators is used or Bag count is enabled.

Some setup:

julia> d = 2
2

julia> X = Float32.([1 2 3 4; 8 7 6 5])
2×4 Array{Float32,2}:
 1.0  2.0  3.0  4.0
 8.0  7.0  6.0  5.0

julia> n = ArrayNode(X)
2×4 ArrayNode{Array{Float32,2},Nothing}:
 1.0  2.0  3.0  4.0
 8.0  7.0  6.0  5.0

julia> bags = AlignedBags([1:1, 2:3, 4:4])
AlignedBags{Int64}(UnitRange{Int64}[1:1, 2:3, 4:4])

Different operators, or combinations of them, are suitable for different problems. Nevertheless, because the input is interpreted as an unordered bag of instances, every operator is invariant to permutation of the instances, and the size of its output does not grow with the size of the bag.
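Both properties can be checked without Mill.jl. The following is a minimal sketch in plain Julia; the helpers max_of and mean_of are hypothetical names introduced only for this illustration and are not part of the library:

```julia
using Statistics: mean

# Hypothetical helpers: reduce the columns (instances) of one bag
# to a single vector with one entry per row (dimension).
max_of(bag) = vec(maximum(bag; dims=2))
mean_of(bag) = vec(mean(bag; dims=2))

bag = Float32[2 3; 7 6]          # two instances of dimension 2
shuffled = bag[:, [2, 1]]        # the same instances in a different order
bigger = hcat(bag, bag, bag)     # a bag with six instances

# Reordering the instances does not change the result
same_result = max_of(bag) == max_of(shuffled) && mean_of(bag) == mean_of(shuffled)

# The output length depends on the dimension only, not on the bag size
same_size = length(mean_of(bigger)) == length(mean_of(bag))
```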

Non-parametric aggregation

Max aggregation

SegmentedMax is the most straightforward operator. In one dimension, it is defined as follows:

\[a_{\max}(\{x_1, \ldots, x_k\}) = \max_{i = 1, \ldots, k} x_i\]

where $\{x_1, \ldots, x_k\}$ are all instances of the given bag. In Mill.jl, the operator is constructed this way:

julia> a_max = max_aggregation(d)
Aggregation{Float32}:
 SegmentedMax(ψ = Float32[0.0, 0.0])

Dimension

The dimension of input is required so that the default parameters ψ can be properly instantiated (see Missing data for details).

Operator construction

It is also possible to get the operator by calling the constructor directly:

SegmentedMax(d)

However, it is recommended to use max_aggregation, which returns an Aggregation structure.

The application is straightforward and can be performed on both raw AbstractArrays and ArrayNodes:

julia> a_max(X, bags)
3×3 Array{Float32,2}:
 1.0       3.0      4.0
 8.0       7.0      5.0
 0.693147  1.09861  0.693147

julia> a_max(n, bags)
3×3 ArrayNode{Array{Float32,2},Nothing}:
 1.0        3.0        4.0
 8.0        7.0        5.0
 0.6931472  1.0986123  0.6931472

Since we have three bags, the output has three columns. The first two rows store the maximal element over all instances of the given bag in each dimension. The third row is appended by Bag count, which is on by default.

Mean aggregation

SegmentedMean is defined as:

\[a_{\operatorname{mean}}(\{x_1, \ldots, x_k\}) = \frac{1}{k} \sum_{i = 1}^{k} x_i\]

and used the same way:

julia> a_mean = mean_aggregation(d)
Aggregation{Float32}:
 SegmentedMean(ψ = Float32[0.0, 0.0])

julia> a_mean(X, bags)
3×3 Array{Float32,2}:
 1.0       2.5      4.0
 8.0       6.5      5.0
 0.693147  1.09861  0.693147

julia> a_mean(n, bags)
3×3 ArrayNode{Array{Float32,2},Nothing}:
 1.0        2.5        4.0
 8.0        6.5        5.0
 0.6931472  1.0986123  0.6931472

Sufficiency of the mean operator

In theory, mean aggregation is sufficient for approximation (Tomáš Pevný, Vojtěch Kovařík (2019)), but in practice, a combination of multiple operators performs better.

The max aggregation is suitable for cases when one instance in the bag may give evidence strong enough to predict the label. On the other side of the spectrum lies the mean aggregation function, which is good at detecting trends that are identifiable globally over the whole bag.

Sum aggregation

The last non-parametric operator is SegmentedSum, defined as:

\[a_{\operatorname{sum}}(\{x_1, \ldots, x_k\}) = \sum_{i = 1}^{k} x_i\]

and used the same way:

julia> a_sum = sum_aggregation(d)
Aggregation{Float32}:
 SegmentedSum(ψ = Float32[0.0, 0.0])

julia> a_sum(X, bags)
3×3 Array{Float32,2}:
 1.0        5.0      4.0
 8.0       13.0      5.0
 0.693147   1.09861  0.693147

julia> a_sum(n, bags)
3×3 ArrayNode{Array{Float32,2},Nothing}:
 1.0         5.0        4.0
 8.0        13.0        5.0
 0.6931472   1.0986123  0.6931472
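The three non-parametric operators can be reproduced in a few lines of plain Julia. This is only a reference sketch for checking the numbers above; the helper segmented is a hypothetical name, not the Mill.jl implementation, and it omits the bag count row and the handling of empty bags:

```julia
using Statistics: mean

X = Float32[1 2 3 4; 8 7 6 5]
bags = [1:1, 2:3, 4:4]           # plain ranges standing in for AlignedBags

# Hypothetical helper: apply the reduction f along the instances of each
# bag and put the per-bag results side by side, one column per bag.
segmented(f, X, bags) = reduce(hcat, [vec(f(X[:, b]; dims=2)) for b in bags])

maxes = segmented(maximum, X, bags)   # Float32[1 3 4; 8 7 5]
means = segmented(mean, X, bags)      # Float32[1 2.5 4; 8 6.5 5]
sums  = segmented(sum, X, bags)       # Float32[1 5 4; 8 13 5]
```

The results agree with the first two rows of the outputs printed for a_max, a_mean and a_sum.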

Parametric aggregation

Whereas non-parametric aggregations do not use any parameters, parametric aggregations represent an entire class of functions parametrized by one or more real vectors of parameters, which can even be learned during training.

Log-sum-exp (LSE) aggregation

SegmentedLSE (log-sum-exp) aggregation (Oren Z. Kraus, Lei Jimmy Ba, Brendan Frey (2015)) is parametrized by a vector of positive numbers $\bm{r} \in (\mathbb{R}^+)^d$ that specifies one real parameter for the computation in each output dimension:

\[a_{\operatorname{lse}}(\{x_1, \ldots, x_k\}; r) = \frac{1}{r}\log \left(\frac{1}{k} \sum_{i = 1}^{k} \exp({r\cdot x_i})\right)\]

LSE behaves differently for different values of $r$, and in fact both the max and the mean operators are limiting cases of LSE. As $r$ approaches zero, the output approaches the simple mean, whereas for large $r$, LSE becomes a smooth approximation of the max function. Naively implementing the definition above may lead to numerical instabilities; the Mill.jl implementation, however, is numerically stable.
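The limiting behaviour and the stability issue can be illustrated with a short sketch in plain Julia. The function lse below is a hypothetical reference implementation for one dimension of one bag, not the Mill.jl code. It uses the common trick of subtracting the largest exponent before calling exp, which is one way to avoid overflow; whether Mill.jl uses exactly this approach is an assumption:

```julia
# Hypothetical one-dimensional LSE over the instances x of a single bag.
# Subtracting m = maximum(r .* x) keeps every exponent at or below zero,
# so exp cannot overflow; adding m back restores the original value.
function lse(x, r)
    z = r .* x
    m = maximum(z)
    return (m + log(sum(exp.(z .- m)) / length(x))) / r
end

x = [2.0, 3.0]

small_r = lse(x, 1e-6)   # close to the mean of x, 2.5
large_r = lse(x, 1e3)    # close to the maximum of x, 3.0

# The naive formula would need exp(3000.0), which overflows in Float64
naive_overflows = isinf(exp(1e3 * 3.0))
```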

julia> a_lse = lse_aggregation(d)
Aggregation{Float32}:
 SegmentedLSE(ψ = Float32[0.0, 0.0], ρ = Float32[1.87284, -0.136322])

julia> a_lse(X, bags)
3×3 Array{Float32,2}:
 1.0       2.71818  4.0
 8.0       6.57716  5.0
 0.693147  1.09861  0.693147

$p$-norm aggregation

The (normalized) $p$-norm operator (Caglar Gulcehre, Kyunghyun Cho, Razvan Pascanu, Yoshua Bengio (2014)) is parametrized by a vector of real numbers $\bm{p} \in (\mathbb{R}^+)^d$, where $\forall i \in \{1, \ldots, d\} \colon p_i \geq 1$, and another vector $\bm{c} \in \mathbb{R}^d$. It is computed with the formula:

\[a_{\operatorname{pnorm}}(\{x_1, \ldots, x_k\}; p, c) = \left(\frac{1}{k} \sum_{i = 1}^{k} \vert x_i - c \vert ^ {p} \right)^{\frac{1}{p}}\]

Again, the Mill.jl implementation is numerically stable.

julia> a_pnorm = pnorm_aggregation(d)
Aggregation{Float32}:
 SegmentedPNorm(ψ = Float32[0.0, 0.0], ρ = Float32[-0.521878, -0.0447819], c = Float32[0.0019943, 0.925841])

julia> a_pnorm(X, bags)
3×3 Array{Float32,2}:
 0.998006  2.52133  3.99801
 7.07416   5.5892   4.07416
 0.693147  1.09861  0.693147

Because all parameter constraints are handled implicitly (the field ρ in both types is a vector of unconstrained real numbers that undergoes an appropriate transformation before being used), both parametric operators are easy to use and do not require any special treatment. Replacing the definition of aggregation operators while constructing a model (either manually or with reflectinmodel) is enough.
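The text does not say which transformation is applied to ρ. A common choice for enforcing positivity is softplus, and the sketch below assumes $r = \operatorname{softplus}(\rho)$ for LSE and $p = 1 + \operatorname{softplus}(\rho)$ for the $p$-norm. This is an assumption made for illustration, although with the first components of ρ and c printed above it appears to agree with the printed outputs for the second bag. The names softplus, naive_lse and naive_pnorm are hypothetical helpers in plain Julia, not Mill.jl functions:

```julia
# Assumed reparametrization: softplus maps any real number to a positive one
softplus(ρ) = log1p(exp(ρ))

r = softplus(1.87284)          # positive, as LSE requires
p = 1 + softplus(-0.521878)    # never below 1, as the p-norm requires
c = 0.0019943

# Naive one-dimensional versions of the two formulas (no stability tricks)
naive_lse(x, r) = log(sum(exp.(r .* x)) / length(x)) / r
naive_pnorm(x, p, c) = (sum(abs.(x .- c) .^ p) / length(x))^(1 / p)

x = [2.0, 3.0]                 # first dimension of the second bag

lse_value = naive_lse(x, r)            # ≈ 2.718, compare 2.71818 above
pnorm_value = naive_pnorm(x, p, c)     # ≈ 2.521, compare 2.52133 above
```

Because the optimizer only ever sees ρ, gradient steps cannot push $r$ or $p$ outside their valid ranges under such a reparametrization.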

Concatenation

To use a concatenation of two or more operators, one can use the Aggregation constructor:

julia> a = Aggregation(a_mean, a_max)
Aggregation{Float32}:
 SegmentedMean(ψ = Float32[0.0, 0.0])
 SegmentedMax(ψ = Float32[0.0, 0.0])

julia> a(X, bags)
5×3 Array{Float32,2}:
 1.0       2.5      4.0
 8.0       6.5      5.0
 1.0       3.0      4.0
 8.0       7.0      5.0
 0.693147  1.09861  0.693147
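Conceptually, concatenation stacks the outputs of the individual operators on top of each other. The following plain-Julia sketch reproduces the first four rows of the output above (the fifth row is the bag count); per_bag is a hypothetical helper, not a Mill.jl function:

```julia
using Statistics: mean

X = Float32[1 2 3 4; 8 7 6 5]
bags = [1:1, 2:3, 4:4]           # plain ranges standing in for AlignedBags

# Hypothetical helper: one column per bag, reduced with f over its instances
per_bag(f) = reduce(hcat, [vec(f(X[:, b]; dims=2)) for b in bags])

# Mean rows first, max rows second, in the order the operators were given
stacked = vcat(per_bag(mean), per_bag(maximum))
```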

For the most common combinations, Mill.jl provides some convenience definitions:

julia> meanmax_aggregation(d)
Aggregation{Float32}:
 SegmentedMean(ψ = Float32[0.0, 0.0])
 SegmentedMax(ψ = Float32[0.0, 0.0])

julia> pnormlse_aggregation(d)
Aggregation{Float32}:
 SegmentedPNorm(ψ = Float32[0.0, 0.0], ρ = Float32[0.600357, 0.386686], c = Float32[-0.771236, -1.79629])
 SegmentedLSE(ψ = Float32[0.0, 0.0], ρ = Float32[-1.13419, -0.188264])

Weighted aggregation

Sometimes, different instances in the bag are not equally important and contribute to the output to a different extent. For instance, this may come in handy when performing importance sampling over very large bags. SegmentedMean and SegmentedPNorm have definitions taking weights into account:

\[a_{\operatorname{mean}}(\{(x_i, w_i)\}_{i=1}^k) = \frac{1}{\sum_{i=1}^k w_i} \sum_{i = 1}^{k} w_i \cdot x_i\]

\[a_{\operatorname{pnorm}}(\{(x_i, w_i)\}_{i=1}^k; p, c) = \left(\frac{1}{\sum_{i=1}^k w_i} \sum_{i = 1}^{k} w_i \cdot \vert x_i - c \vert ^ {p} \right)^{\frac{1}{p}}\]

This is done in Mill.jl by passing an additional argument:

julia> w = Float32.([1.0, 0.2, 0.8, 0.5])
4-element Array{Float32,1}:
 1.0
 0.2
 0.8
 0.5

julia> a_mean(X, bags, w)
3×3 Array{Float32,2}:
 1.0       2.8      4.0
 8.0       6.2      5.0
 0.693147  1.09861  0.693147

julia> a_pnorm(X, bags, w)
3×3 Array{Float32,2}:
 0.998006  2.81189  3.99801
 7.07416   5.28421  4.07416
 0.693147  1.09861  0.693147
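The weighted mean of the second bag can be verified by hand. It contains instances 2 and 3 with weights 0.2 and 0.8, so the first dimension gives $(0.2 \cdot 2 + 0.8 \cdot 3) / (0.2 + 0.8) = 2.8$. The same computation as a small plain-Julia sketch, independent of Mill.jl:

```julia
X = Float32[1 2 3 4; 8 7 6 5]
w = Float32[1.0, 0.2, 0.8, 0.5]
b = 2:3                          # indices of the instances in the second bag

# Weighted sum of the instances divided by the total weight in the bag;
# the result has one entry per row (dimension) of X.
weighted_mean = X[:, b] * w[b] ./ sum(w[b])   # ≈ [2.8, 6.2]
```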

For SegmentedMax (and SegmentedLSE) it is possible to pass in weights, but they are ignored during computation:

julia> a_max(X, bags, w) == a_max(X, bags)
true

Weighted nodes

WeightedBagNode is used to store instance weights in a dataset. It accepts weights in the constructor:

julia> wbn = WeightedBagNode(n, bags, w)
WeightedBagNode with 3 obs
  └── ArrayNode(2×4 Array with Float32 elements) with 4 obs

and passes them to aggregation operators:

julia> m = reflectinmodel(wbn)
BagModel … ↦ ⟨SegmentedMean(10), SegmentedMax(10)⟩ ↦ ArrayModel(Dense(21, 10))
  └── ArrayModel(Dense(2, 10))

julia> m(wbn)
10×3 ArrayNode{Array{Float32,2},Nothing}:
 -0.08577247  -1.06538     -2.3497334
  0.18075764   0.13136475   0.50428295
 -3.3338616   -1.4941525    0.52783906
  1.354449     3.4286468    4.715983
 -1.1131008   -1.5880636   -2.3200634
  1.4746492    2.2399156    2.5541785
 -0.48083752   0.32533184   1.2635081
  2.3174403    2.4020107    1.9191833
  2.6593637    1.5681298    0.62422305
  2.1784558    1.6751916    0.8876517

Otherwise, WeightedBagNode behaves exactly like the standard BagNode.

Bag count

For some problems, it may be beneficial to use the size of the bag directly and feed it to subsequent layers. This is controlled by the Mill.bagcount! function (on by default).

In the aggregation phase, bag count appends one more element, which stores the bag size, to the output after all operators are applied. Furthermore, in Mill.jl, we opted to perform a mapping $x \mapsto \log(x + 1)$ on top of that:

julia> a_mean(X, bags)
3×3 Array{Float32,2}:
 1.0       2.5      4.0
 8.0       6.5      5.0
 0.693147  1.09861  0.693147

The matrix has three rows, the last one storing the transformed size of each bag.
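The values in the last row can be checked directly. The bags have sizes 1, 2 and 1, and the printed numbers agree with $\log(1 + 1) \approx 0.693$ and $\log(2 + 1) \approx 1.099$. A one-line sketch in plain Julia, with ordinary ranges standing in for AlignedBags:

```julia
bags = [1:1, 2:3, 4:4]

# Size of each bag, shifted by one and passed through the natural logarithm
counts = log.(Float32.(length.(bags)) .+ 1)   # ≈ [0.693, 1.099, 0.693]
```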

When the bag count is on, one needs to have a model accepting corresponding sizes:

julia> bn = BagNode(n, bags)
BagNode with 3 obs
  └── ArrayNode(2×4 Array with Float32 elements) with 4 obs

julia> bm = reflectinmodel(bn)
BagModel … ↦ ⟨SegmentedMean(10), SegmentedMax(10)⟩ ↦ ArrayModel(Dense(21, 10))
  └── ArrayModel(Dense(2, 10))

Note that the bm (sub)model field of the BagModel has size (21, 10): 20 for the aggregation output (10 from each of the two operators) and 1 for the sizes of bags.

julia> bm(bn)
10×3 ArrayNode{Array{Float32,2},Nothing}:
  1.5167183    1.1164814    1.2121471
 -0.45304257  -1.7157838   -3.3462124
  3.6544676    2.4747255    1.7322227
 -0.91553223  -1.8787678   -2.3789873
  3.5021553    2.8474615    1.7630692
 -2.9335008   -2.8740404   -2.4392025
 -0.45712265   0.665173     1.6547372
 -0.50642437  -0.86437464  -0.6096624
 -1.4723998   -1.062075    -0.47388336
 -0.2643367   -0.51483434   0.19282724

Model reflection takes the bag count toggle into account. If we disable it, the bm (sub)model has size (20, 10):

julia> Mill.bagcount!(false)
false

julia> bm = reflectinmodel(bn)
BagModel … ↦ ⟨SegmentedMean(10), SegmentedMax(10)⟩ ↦ ArrayModel(Dense(20, 10))
  └── ArrayModel(Dense(2, 10))

Aggregation bagcount

Only Aggregation supports bagcount. Lower-level plain AggregationOperators (SegmentedMean, SegmentedMax and others) are intended for internal use and thus do not support it.

Default aggregation values

When aggregation operators are printed, one may notice that all of them store one additional vector ψ. This is a vector of default values, initialized to all zeros, that is used as the output for empty bags:

julia> bags = AlignedBags([1:1, 0:-1, 2:3, 0:-1, 4:4])
AlignedBags{Int64}(UnitRange{Int64}[1:1, 0:-1, 2:3, 0:-1, 4:4])

julia> a_mean(X, bags)
3×5 Array{Float32,2}:
 1.0       0.0  2.5      0.0  4.0
 8.0       0.0  6.5      0.0  5.0
 0.693147  0.0  1.09861  0.0  0.693147
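The role of ψ can be sketched in plain Julia as follows. This is a hypothetical reference version of the mean with a default vector, not the Mill.jl implementation, and it leaves out the bag count row (which is $\log(0 + 1) = 0$ for an empty bag):

```julia
using Statistics: mean

X = Float32[1 2 3 4; 8 7 6 5]
bags = [1:1, 0:-1, 2:3, 0:-1, 4:4]   # 0:-1 is an empty range, i.e. an empty bag
ψ = zeros(Float32, 2)                # default output, one entry per dimension

# Use ψ for empty bags and the ordinary mean of the instances otherwise
columns = [isempty(b) ? ψ : vec(mean(X[:, b]; dims=2)) for b in bags]
result = reduce(hcat, columns)       # Float32[1 0 2.5 0 4; 8 0 6.5 0 5]
```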

See the Missing data page for more information.