
Mill – Multiple Instance Learning Library

Mill is a library built on top of Flux.jl, aimed at prototyping flexible multi-instance learning models as described in [1] and [2].

What is the multiple instance learning (MIL) problem?

In the prototypical machine learning problem, the input sample x is a vector or matrix of a fixed dimension, or a sequence. In MIL problems, the sample b is a set of vectors (or matrices) b = {x₁, …, xₗ}, which means that order does not matter; this is also the feature distinguishing MIL problems from sequences. Pevný and Somol [2] have proposed a simple way to solve MIL problems with neural networks. The network consists of two non-linear layers with a mean (or maximum) operation sandwiched between them. Denoting by f₁ and f₂ the two layers of the neural network, the output is calculated as f₂((1/l) ∑ᵢ f₁(xᵢ)). In [3], the authors have further extended the universal approximation theorem to MIL problems.
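
To make the formula concrete, here is a minimal sketch of this construction in plain Flux (the dimensions are hypothetical; the BagModel introduced below packages the same idea):

julia> using Flux, Statistics
julia> f1 = Dense(4, 3, tanh);              # instance-level non-linear layer
julia> f2 = Dense(3, 2);                    # bag-level layer
julia> bag = randn(Float32, 4, 7);          # one bag with 7 instances as columns
julia> ŷ = f2(mean(f1(bag), dims=2));       # f₂((1/l) ∑ᵢ f₁(xᵢ)), a 2×1 output for the bag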

Multiple instance learning on Musk 1

The Musk dataset is a classic problem of the field, used in publication [4], which gave this class of problems its name. Below is a short walk-through showing how to solve it using the Mill library. The full example is in example/musk.jl, together with a Julia environment to run the whole thing.

Let's start by importing all the libraries we need:

julia> using FileIO, JLD2, Statistics, Mill, Flux
julia> using Flux: throttle, @epochs
julia> using Mill: reflectinmodel
julia> using Base.Iterators: repeated

Loading the dataset from a file and wrapping it into Mill's data structures is done as follows. musk.jld2 contains a matrix with features, fMat; the id of the sample (called a bag in MIL terminology) to which each instance (a column of fMat) belongs, bagids; and finally a label for each instance, y. BagNode is a structure which holds the feature matrix together with the ranges of columns belonging to each bag. Finally, BagNodes can be concatenated (use catobs) and subsets can be taken with getindex.

julia> fMat = load("musk.jld2", "fMat");         # matrix with instances, each column is one sample
julia> bagids = load("musk.jld2", "bagids");     # ties instances to bags
julia> x = BagNode(ArrayNode(fMat), bagids);     # create the BagNode dataset
julia> y = load("musk.jld2", "y");               # load labels
julia> y = map(i -> maximum(y[i]) + 1, x.bags);  # create labels on bags
julia> y_oh = Flux.onehotbatch(y, 1:2);          # one-hot encoding
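
For instance, taking subsets and concatenating them back is as simple as this (a small sketch reusing the x created above):

julia> x12 = x[1:2];            # getindex: a BagNode with the first two bags
julia> x3 = x[3];               # a BagNode with the third bag only
julia> xcat = catobs(x12, x3);  # catobs: concatenate BagNodes into one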

Once we have the data, we can manually create a model. BagModel is designed to implement the basic multi-instance learning model described above. Below, we use a simple model in which instances are first passed through a single layer with 10 neurons (the input dimension is 166) with tanh non-linearity; then we apply the mean and max aggregation functions simultaneously (for some problems max works better than mean, so we use both); and finally we apply one layer with 10 neurons and tanh non-linearity, followed by a linear output layer with 2 neurons (the output dimension).

julia> model = BagModel(
    ArrayModel(Dense(166, 10, Flux.tanh)),                      # model on the level of instances
    SegmentedMeanMax(10),                                       # aggregation
    ArrayModel(Chain(Dense(20, 10, Flux.tanh), Dense(10, 2))))  # model on the level of bags
    
BagModel ↦ ⟨SegmentedMean(10), SegmentedMax(10)⟩ ↦ ArrayModel(Chain(Dense(20, 10, tanh), Dense(10, 2)))
  └── ArrayModel(Dense(166, 10, tanh))

The loss function is standard cross-entropy:

julia> loss(x, y_oh) = Flux.logitcrossentropy(model(x).data, y_oh);

Finally, we put everything together. The code below should resemble the training examples from the Flux.jl library.

julia> evalcb = () -> @show(loss(x, y_oh));
julia> opt = Flux.ADAM();
julia> @epochs 10 Flux.train!(loss, params(model), repeated((x, y_oh), 1000), opt, cb=throttle(evalcb, 1))

[ Info: Epoch 1
loss(x, y_oh) = 87.793724f0
[ Info: Epoch 2
loss(x, y_oh) = 4.3207192f0
[ Info: Epoch 3
loss(x, y_oh) = 4.2778687f0
[ Info: Epoch 4
loss(x, y_oh) = 0.662226f0
[ Info: Epoch 5
loss(x, y_oh) = 5.76351f-6
[ Info: Epoch 6
loss(x, y_oh) = 3.8146973f-6
[ Info: Epoch 7
loss(x, y_oh) = 2.8195589f-6
[ Info: Epoch 8
loss(x, y_oh) = 2.4878461f-6
[ Info: Epoch 9
loss(x, y_oh) = 2.1561332f-6
[ Info: Epoch 10
loss(x, y_oh) = 1.7414923f-6

Because we did not set aside any data for validation, we can only calculate the error on the training data, which is, not surprisingly, low.

julia> mean(mapslices(argmax, model(x).data, dims=1)' .!= y)

0.0
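
If a proper error estimate were needed, one could hold out a subset of bags using getindex before training. A rough sketch (the 80-bag split is an arbitrary choice, assuming the 92 bags of Musk 1):

julia> using Random
julia> perm = randperm(length(y));          # shuffle bag indices
julia> tr, te = perm[1:80], perm[81:end];   # hypothetical 80/12 split
julia> x_tr, y_tr = x[tr], y[tr];           # training bags and labels
julia> x_te, y_te = x[te], y[te];           # held-out bags and labels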

More complicated models

The main advantage of the Mill library is that it allows BagModels to be arbitrarily nested and combined in Cartesian products, as described in Theorem 5 of [3]. Let's start the demonstration by nesting two MIL problems. The outer MIL problem contains three samples. The first sample contains a single inner bag with two instances, the second sample contains two inner bags with a total of three instances, and the third sample contains two inner bags with a total of five instances.

julia> ds = BagNode(BagNode(ArrayNode(randn(4,10)),[1:2,3:4,5:5,6:7,8:10]),[1:1,2:3,4:5])
BagNode with 3 bag(s)
  └── BagNode with 5 bag(s)
        └── ArrayNode(4, 10)

As in the Musk case, we can create the model manually:

julia> m = BagModel(
    BagModel(
        ArrayModel(Dense(4, 3, Flux.relu)),   
        SegmentedMeanMax(3),
        ArrayModel(Dense(6, 3, Flux.relu))),
    SegmentedMeanMax(3),
    ArrayModel(Chain(Dense(6, 3, Flux.relu), Dense(3,2))))

BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Chain(Dense(6, 3, relu), Dense(3, 2)))
  └── BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Dense(6, 3, relu))
        └── ArrayModel(Dense(4, 3, relu))

and we can apply the model as follows:

julia> m(ds)

ArrayNode(2, 3)

Since constructing large models manually is error-prone, there is a function reflectinmodel which tries to automate the process while keeping track of dimensions. It accepts a sample ds as its first parameter. Using the function on the above example creates the following model:

julia> m = reflectinmodel(ds)

BagModel ↦ SegmentedMean(10) ↦ ArrayModel(Dense(10, 10))
  └── BagModel ↦ SegmentedMean(10) ↦ ArrayModel(Dense(10, 10))
        └── ArrayModel(Dense(4, 10))

To gain better control over the topology, reflectinmodel accepts up to four additional parameters. The second parameter is a function that, given an input dimension d, returns a layer (or a chain of layers), and the third parameter is a function returning the aggregation operator for BagModel nodes:

julia> m = reflectinmodel(ds, d -> Dense(d, 5, relu), d -> SegmentedMeanMax(d))

BagModel ↦ ⟨SegmentedMean(5), SegmentedMax(5)⟩ ↦ ArrayModel(Dense(10, 5, relu))
  └── BagModel ↦ ⟨SegmentedMean(5), SegmentedMax(5)⟩ ↦ ArrayModel(Dense(10, 5, relu))
        └── ArrayModel(Dense(4, 5, relu))

Let's test the model:

julia> m(ds).data

5×3 Array{Float32,2}:
 0.0542484   0.733629  0.553823
 0.062246    0.866254  1.03062 
 0.027454    1.04703   1.63135 
 0.00796955  0.36415   1.18108 
 0.034735    0.17383   0.0

Even more complicated models

As already mentioned above, datasets can contain Cartesian products of MIL and ordinary (non-MIL) data. Let's do a quick demo.

julia> ds = BagNode(
    ProductNode(
        (BagNode(ArrayNode(randn(4,10)),[1:2,3:4,5:5,6:7,8:10]),
        ArrayNode(randn(3,5)),
        BagNode(
            BagNode(ArrayNode(randn(2,30)),[i:i+1 for i in 1:2:30]),
            [1:3,4:6,7:9,10:12,13:15]),
        ArrayNode(randn(2,5)))),
    [1:1,2:3,4:5])

BagNode with 3 bag(s)
  └── ProductNode
        ├── BagNode with 5 bag(s)
        │     ⋮
        ├── ArrayNode(3, 5)
        ├── BagNode with 5 bag(s)
        │     ⋮
        └── ArrayNode(2, 5)

For a dataset like this, we really want to create the model automatically, even though the result may be sub-optimal:

julia> m = reflectinmodel(ds, d -> Dense(d, 3, relu), d -> SegmentedMeanMax(d))

BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Dense(6, 3, relu))
  └── ProductModel ↦ ArrayModel(Dense(12, 3, relu))
        ├── BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Dense(6, 3, relu))
        │     ⋮
        ├── ArrayModel(Dense(3, 3, relu))
        ├── BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Dense(6, 3, relu))
        │     ⋮
        └── ArrayModel(Dense(2, 3, relu))

Hierarchical utils

Mill.jl uses HierarchicalUtils.jl which brings a lot of additional features. For instance, if you want to print a non-truncated version of a model, call:

julia> printtree(m; trunc_level=Inf)

BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Dense(6, 3, relu))
  └── ProductModel ↦ ArrayModel(Dense(12, 3, relu))
        ├── BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Dense(6, 3, relu))
        │     └── ArrayModel(Dense(4, 3, relu))
        ├── ArrayModel(Dense(3, 3, relu))
        ├── BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Dense(6, 3, relu))
        │     └── BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Dense(6, 3, relu))
        │           └── ArrayModel(Dense(2, 3, relu))
        └── ArrayModel(Dense(2, 3, relu))

Calling with trav=true enables convenient traversal functionality with string indexing:

julia>  printtree(m; trunc_level=Inf, trav=true)

BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Dense(6, 3, relu)) [""]
  └── ProductModel ↦ ArrayModel(Dense(12, 3, relu)) ["U"]
        ├── BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Dense(6, 3, relu)) ["Y"]
        │     └── ArrayModel(Dense(4, 3, relu)) ["a"]
        ├── ArrayModel(Dense(3, 3, relu)) ["c"]
        ├── BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Dense(6, 3, relu)) ["g"]
        │     └── BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Dense(6, 3, relu)) ["i"]
        │           └── ArrayModel(Dense(2, 3, relu)) ["j"]
        └── ArrayModel(Dense(2, 3, relu)) ["k"]

This way, any node in the model tree is swiftly accessible, which may come in handy when inspecting model parameters or when deleting, replacing, or inserting nodes into the tree (for instance, when constructing adversarial samples). All tree nodes are accessible by indexing with the traversal code:

julia> m["Y"]

BagModel ↦ ⟨SegmentedMean(3), SegmentedMax(3)⟩ ↦ ArrayModel(Dense(6, 3, relu))
  └── ArrayModel(Dense(4, 3, relu))

The following two approaches give the same result:

julia> m["Y"] === m.im.ms[1]

true
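
Since the returned node is an ordinary Mill model, this composes with the rest of the ecosystem; for example, one can collect the Flux parameters of just that subtree:

julia> ps = params(m["Y"]);    # parameters of the inner BagModel only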

Other functions provided by HierarchicalUtils.jl:

julia> nnodes(m)

9

julia> nleafs(m)

4

julia> NodeIterator(m) |> collect

9-element Array{AbstractMillModel,1}:
 BagModel
 ProductModel
 BagModel
 ArrayModel
 ArrayModel
 BagModel
 BagModel
 ArrayModel
 ArrayModel

julia> LeafIterator(m) |> collect

4-element Array{ArrayModel{Dense{typeof(relu),Array{Float32,2},Array{Float32,1}}},1}:
 ArrayModel
 ArrayModel
 ArrayModel
 ArrayModel

julia> TypeIterator{BagModel}(m) |> collect

4-element Array{BagModel{T,Aggregation{2},ArrayModel{Dense{typeof(relu),Array{Float32,2},Array{Float32,1}}}} where T<:AbstractMillModel,1}:
 BagModel
 BagModel
 BagModel
 BagModel

... and many others.

Default aggregation values

With the latest version of Mill, it is also possible to work with missing data, replacing a missing bag with a default constant value, and even to learn this value. Everything is done automatically.

Representing missing values

The library currently supports two ways to represent bags with missing values: the first represents missing data with missing, as in a = BagNode(missing, [0:-1]), while the second uses an empty matrix, as in a = BagNode(ArrayNode(zeros(4, 0)), [0:-1]). Off the shelf, the library supports both approaches transparently; the difference manifests mainly when one uses getindex, and it is therefore controlled by the switch Mill.emptyismissing, which is false by default. Let's demonstrate the difference.

julia> a = BagNode(ArrayNode(rand(3,2)), [1:2, 0:-1])
BagNode with 2 bag(s)
  └── ArrayNode(3, 2)

julia> Mill.emptyismissing(false);

julia> a[2].data
ArrayNode(3, 0)

julia> Mill.emptyismissing(true)
true

julia> a[2].data
missing

The advantage of the default approach (empty arrays) is that the types are always the same, which is nice for the compiler (and Zygote). The advantage of missing is that it is more compact and explicit.
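
As a quick sanity check that a model copes with the empty bag, here is a sketch using the default representation (the layer size passed to reflectinmodel is arbitrary):

julia> Mill.emptyismissing(false);
julia> m = reflectinmodel(a, d -> Dense(d, 4));
julia> size(m(a).data)    # one output column per bag, including the empty one
(4, 2)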

References

[1] Tomáš Pevný, Petr Somol: Discriminative models for multi-instance problems with tree-structure, 2016, https://arxiv.org/abs/1703.02868

[2] Tomáš Pevný, Petr Somol: Using Neural Network Formalism to Solve Multiple-Instance Problems, 2016, https://arxiv.org/abs/1609.07257

[3] Tomáš Pevný, Vojtěch Kovařík: Approximation capability of neural networks on sets of probability measures and tree-structured data, 2019, https://openreview.net/forum?id=HklJV3A9Ym

[4] Thomas G. Dietterich, Richard H. Lathrop, Tomás Lozano-Pérez: Solving the multiple instance problem with axis-parallel rectangles, 1997