Data in leaves

In Mill.jl tree-like data representations, there are always some raw data on the leaf level, whereas on higher levels instances are grouped into bags (BagNodes), and different sets are joined together with Cartesion products (ProductNodes) and thus more abstract concepts are created. In this section we look into several examples how the lowest-level data can be represented.

For this purpose, let's assume that we would like to identify infected computers in a network from their HTTP traffic. Since one computer can make an arbitrary number of connections during the observation period, modelling the computer as a bag of connections seems like the most natural approach:

julia> connections = AlignedBags([1:2, 3:3, 4:7, 8:8, 9:10])
AlignedBags{Int64}(UnitRange{Int64}[1:2, 3:3, 4:7, 8:8, 9:10])

Thus, each of the ten connections becomes an instance in one of the bags. How to represent this instance? Each HTTP flow has properties that can be expressed as standard numerical features, categorical features or strings of characters.

Numerical features

We have already shown how to represent standard numerical features in previous parts of this manual. It is as simple as wrapping a type that behaves like a matrix into an ArrayNode:

julia> content_lengths = [4446, 1957, 4310, 11604, 17019, 13947, 13784, 15495, 3888, 11853]
10-element Array{Int64,1}:
  4446
  1957
  4310
 11604
 17019
 13947
 13784
 15495
  3888
 11853

julia> dates = [1435420950, 1376190532, 1316869962, 1302775198, 1555598383,
                1562237892, 1473173059, 1325242539, 1508048391, 1532722821]
10-element Array{Int64,1}:
 1435420950
 1376190532
 1316869962
 1302775198
 1555598383
 1562237892
 1473173059
 1325242539
 1508048391
 1532722821

julia> numerical_node = ArrayNode([content_lengths'; dates'])
2×10 ArrayNode{Array{Int64,2},Nothing}:
       4446        1957        4310  …       15495        3888       11853
 1435420950  1376190532  1316869962     1325242539  1508048391  1532722821

We use Content-Length and Date request headers, the latter converted to Unix timestamp.

Categorical features

For categorical variables, we proceed in the same way, but we use one-hot encoding implemented in Flux.jl. This way, we can encode for example a verb of the request:

julia> using Flux

julia> ALL_VERBS = ["GET", "HEAD", "POST", "PUT", "DELETE"] # etc...
5-element Array{String,1}:
 "GET"
 "HEAD"
 "POST"
 "PUT"
 "DELETE"

julia> verbs = ["GET", "GET", "POST", "HEAD", "HEAD", "HEAD", "HEAD", "PUT", "DELETE", "PUT"]
10-element Array{String,1}:
 "GET"
 "GET"
 "POST"
 "HEAD"
 "HEAD"
 "HEAD"
 "HEAD"
 "PUT"
 "DELETE"
 "PUT"

julia> verb_node = ArrayNode(Flux.onehotbatch(verbs, ALL_VERBS))
5×10 ArrayNode{Flux.OneHotMatrix{Array{Flux.OneHotVector,1}},Nothing}:
 1  1  0  0  0  0  0  0  0  0
 0  0  0  1  1  1  1  0  0  0
 0  0  1  0  0  0  0  0  0  0
 0  0  0  0  0  0  0  1  0  1
 0  0  0  0  0  0  0  0  1  0

or Content-Encoding header:

julia> ALL_ENCODINGS = ["bzip2", "gzip", "xz", "identity"] # etc...
4-element Array{String,1}:
 "bzip2"
 "gzip"
 "xz"
 "identity"

julia> encodings = ["xz", "gzip", "bzip2", "xz", "identity", "bzip2", "identity", "identity", "xz", "xz"]
10-element Array{String,1}:
 "xz"
 "gzip"
 "bzip2"
 "xz"
 "identity"
 "bzip2"
 "identity"
 "identity"
 "xz"
 "xz"

julia> encoding_node = ArrayNode(Flux.onehotbatch(encodings, ALL_ENCODINGS))
4×10 ArrayNode{Flux.OneHotMatrix{Array{Flux.OneHotVector,1}},Nothing}:
 0  0  1  0  0  1  0  0  0  0
 0  1  0  0  0  0  0  0  0  0
 1  0  0  1  0  0  0  0  1  1
 0  0  0  0  1  0  1  1  0  0

Because Flux.OneHotMatrix supports multiplication it is possible to wrap it into an ArrayNode.

Strings

The last example we will consider are string features. This could for example be the Host header:

julia> hosts = [
           "www.foo.com",
           "www.foo.com",
           "www.baz.com",
           "www.foo.com",
           "www.baz.com",
           "www.foo.com",
           "www.baz.com",
           "www.baz.com",
           "www.bar.com",
           "www.baz.com",
       ]
10-element Array{String,1}:
 "www.foo.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.baz.com"
 "www.bar.com"
 "www.baz.com"

Mill.jl offers ngram histogram-based representation for strings. To get started, we pass the vector of strings into the constructor of NGramMatrix:

julia> hosts_ngrams = NGramMatrix(hosts, 3, 256, 7)
7×10 NGramMatrix{String,Array{String,1},Int64}:
 "www.foo.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.baz.com"
 "www.bar.com"
 "www.baz.com"

Each string gets processed into ngrams (trigram in this case as specified in the first parameter). Then, each character is transformed into an integer via the codeunits function and the whole trigram is interpreted as a three digit number using a base b specified in the second parameter. Here, we use a base of 256, which is the most reasonable choice for ascii URLs. For example, for foo trigram, we obtain:

julia> c = codeunits("foo")
3-element Base.CodeUnits{UInt8,String}:
 0x66
 0x6f
 0x6f

julia> c[1] * 256^2 + c[2] * 256 + c[3]
6713199

The last step is taking the modulo of this result with respect to some prime modulo m, in this case 7 (last parameter in the constructor), leaving us with 3 as a result. Therefore, for this trigram foo, we would add 1 to the third row[1]. We can convert this NGramMatrix into a sparse array and then to the standard array:

using SparseArrays
julia> hosts_dense = hosts_ngrams |> SparseMatrixCSC |> Matrix
7×10 Array{Int64,2}:
 1  1  3  1  3  1  3  3  3  3
 1  1  0  1  0  1  0  0  0  0
 3  3  4  3  4  3  4  4  3  4
 2  2  1  2  1  2  1  1  2  1
 3  3  3  3  3  3  3  3  3  3
 2  2  1  2  1  2  1  1  2  1
 1  1  1  1  1  1  1  1  0  1

Again, we get one column for each string, and the matrix has the same number of rows as modulo m. For each string s, we get length(s) + n - 1ngrams:

julia> sum(hosts_dense; dims=1)
1×10 Array{Int64,2}:
 13  13  13  13  13  13  13  13  13  13

This is because we use special abstract characters (or tokens) for the start and the end of the string. If we denote these ^ and $, respectively, from string "foo", we get trigrams ^^f, ^fo, foo, oo$, o$$. Note that these special characters are purely abstract whereas ^ and $ used only for illustration purposes here are characters like any other. Both string start and string end special characters have a unique mapping to integers, which can be obtained as well as set:

julia> Mill.string_start_code()
0x02

julia> Mill.string_end_code()
0x03

julia> Mill.string_start_code!(42)
42

julia> Mill.string_start_code()
0x2a

NGramMatrix behaves like a matrix, implements an efficient left-multiplication and thus can be used in ArrayNode:

julia> hosts_ngrams::AbstractMatrix{Int64}
7×10 NGramMatrix{String,Array{String,1},Int64}:
 "www.foo.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.baz.com"
 "www.bar.com"
 "www.baz.com"

julia> host_node = ArrayNode(hosts_ngrams)
7×10 ArrayNode{NGramMatrix{String,Array{String,1},Int64},Nothing}:
 "www.foo.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.foo.com"
 "www.baz.com"
 "www.baz.com"
 "www.bar.com"
 "www.baz.com"

Adding custom nodes section shows one more slightly more complex way of processing strings, specifically Unix paths.

Putting it all together

Now, we can finally put wrap everything into one ProductNode:

julia> ds = ProductNode((
           numerical=numerical_node,
           verb=verb_node,
           encoding=encoding_node,
           hosts=host_node
       ))
ProductNode with 10 obs
  ├── encoding: ArrayNode(4×10 OneHotMatrix with Bool elements) with 10 obs
  ├───── hosts: ArrayNode(7×10 NGramMatrix with Int64 elements) with 10 obs
  ⋮
  └────── verb: ArrayNode(5×10 OneHotMatrix with Bool elements) with 10 obs

create a model for training and compute some gradients:

julia> m = reflectinmodel(ds)
ProductModel … ↦ ArrayModel(Dense(40, 10))
  ├── encoding: ArrayModel(Dense(4, 10))
  ├───── hosts: ArrayModel(Dense(7, 10))
  ⋮
  └────── verb: ArrayModel(Dense(5, 10))

julia> gradient(() -> sum(Mill.data(m(ds))), params(m))
Grads(...)
Numerical features

To put all numerical features into one ArrayNode is a design choice. We could as well introduce more keys in the final ProductNode. The model treats these two cases differently (see Nodes section).

This dummy example illustrates the versatility of Mill.jl. With little to no preprocessing we are able to process complex hierarchical structures and avoid manually designing feature extraction procedures. For a more involved study on processing Internet traffic with Mill.jl, see for example Tomáš Pevný , Marek Dědič (2020).

  • 1One appropriate value for modulo m for real problems is 2053