Data in leaves
In `Mill.jl` tree-like data representations, there is always some raw data at the leaf level, whereas on higher levels instances are grouped into bags (`BagNode`s) and different sets are joined together with Cartesian products (`ProductNode`s), creating more and more abstract concepts. In this section, we look into several examples of how the lowest-level data can be represented.
For this purpose, let's assume that we would like to identify infected computers in a network from their HTTP traffic. Since one computer can make an arbitrary number of connections during the observation period, modelling the computer as a bag of connections seems like the most natural approach:
julia> connections = AlignedBags([1:2, 3:3, 4:7, 8:8, 9:10])
AlignedBags{Int64}(UnitRange{Int64}[1:2, 3:3, 4:7, 8:8, 9:10])
Thus, each of the ten connections becomes an instance in one of the bags. How do we represent these instances? Each HTTP flow has properties that can be expressed as standard numerical features, categorical features, or strings of characters.
Numerical features
We have already shown how to represent standard numerical features in previous parts of this manual. It is as simple as wrapping a type that behaves like a matrix into an `ArrayNode`:
julia> content_lengths = [4446, 1957, 4310, 11604, 17019, 13947, 13784, 15495, 3888, 11853]
10-element Array{Int64,1}:
4446
1957
4310
11604
17019
13947
13784
15495
3888
11853
julia> dates = [1435420950, 1376190532, 1316869962, 1302775198, 1555598383,
1562237892, 1473173059, 1325242539, 1508048391, 1532722821]
10-element Array{Int64,1}:
1435420950
1376190532
1316869962
1302775198
1555598383
1562237892
1473173059
1325242539
1508048391
1532722821
julia> numerical_node = ArrayNode([content_lengths'; dates'])
2×10 ArrayNode{Array{Int64,2},Nothing}:
4446 1957 4310 … 15495 3888 11853
1435420950 1376190532 1316869962 1325242539 1508048391 1532722821
We use the `Content-Length` and `Date` request headers, the latter converted to a Unix timestamp.
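For illustration (this step is not required by `Mill.jl`), the first timestamp can be decoded back to a date with the `Dates` standard library:

```julia
julia> using Dates

julia> unix2datetime(1435420950)  # seconds since the Unix epoch, in UTC
2015-06-27T16:02:30
```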
Categorical features
For categorical variables, we proceed in the same way, but we use the one-hot encoding implemented in `Flux.jl`. This way, we can encode for example the verb of the request:
julia> using Flux
julia> ALL_VERBS = ["GET", "HEAD", "POST", "PUT", "DELETE"] # etc...
5-element Array{String,1}:
"GET"
"HEAD"
"POST"
"PUT"
"DELETE"
julia> verbs = ["GET", "GET", "POST", "HEAD", "HEAD", "HEAD", "HEAD", "PUT", "DELETE", "PUT"]
10-element Array{String,1}:
"GET"
"GET"
"POST"
"HEAD"
"HEAD"
"HEAD"
"HEAD"
"PUT"
"DELETE"
"PUT"
julia> verb_node = ArrayNode(Flux.onehotbatch(verbs, ALL_VERBS))
5×10 ArrayNode{Flux.OneHotMatrix{Array{Flux.OneHotVector,1}},Nothing}:
1 1 0 0 0 0 0 0 0 0
0 0 0 1 1 1 1 0 0 0
0 0 1 0 0 0 0 0 0 0
0 0 0 0 0 0 0 1 0 1
0 0 0 0 0 0 0 0 1 0
or the `Content-Encoding` header:
julia> ALL_ENCODINGS = ["bzip2", "gzip", "xz", "identity"] # etc...
4-element Array{String,1}:
"bzip2"
"gzip"
"xz"
"identity"
julia> encodings = ["xz", "gzip", "bzip2", "xz", "identity", "bzip2", "identity", "identity", "xz", "xz"]
10-element Array{String,1}:
"xz"
"gzip"
"bzip2"
"xz"
"identity"
"bzip2"
"identity"
"identity"
"xz"
"xz"
julia> encoding_node = ArrayNode(Flux.onehotbatch(encodings, ALL_ENCODINGS))
4×10 ArrayNode{Flux.OneHotMatrix{Array{Flux.OneHotVector,1}},Nothing}:
0 0 1 0 0 1 0 0 0 0
0 1 0 0 0 0 0 0 0 0
1 0 0 1 0 0 0 0 1 1
0 0 0 0 1 0 1 1 0 0
Because `Flux.OneHotMatrix` supports multiplication, it is possible to wrap it into an `ArrayNode`.
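In other words, multiplying by a one-hot matrix simply selects columns of the weight matrix, which makes this representation efficient. A small sketch (the matrix `W` below is made up for illustration):

```julia
julia> W = randn(Float32, 3, 5);  # e.g. the weights of a Dense(5, 3) layer

julia> W * Flux.onehotbatch(["GET", "POST"], ALL_VERBS) == W[:, [1, 3]]
true
```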
Strings
The last example we will consider is string features. These could for example come from the `Host` header:
julia> hosts = [
"www.foo.com",
"www.foo.com",
"www.baz.com",
"www.foo.com",
"www.baz.com",
"www.foo.com",
"www.baz.com",
"www.baz.com",
"www.bar.com",
"www.baz.com",
]
10-element Array{String,1}:
"www.foo.com"
"www.foo.com"
"www.baz.com"
"www.foo.com"
"www.baz.com"
"www.foo.com"
"www.baz.com"
"www.baz.com"
"www.bar.com"
"www.baz.com"
`Mill.jl` offers an n-gram histogram-based representation for strings. To get started, we pass the vector of strings into the `NGramMatrix` constructor:
julia> hosts_ngrams = NGramMatrix(hosts, 3, 256, 7)
7×10 NGramMatrix{String,Array{String,1},Int64}:
"www.foo.com"
"www.foo.com"
"www.baz.com"
"www.foo.com"
"www.baz.com"
"www.foo.com"
"www.baz.com"
"www.baz.com"
"www.bar.com"
"www.baz.com"
Each string is processed into n-grams (trigrams in this case, as specified by the second argument of the constructor). Then, each character is transformed into an integer via the `codeunits` function, and the whole trigram is interpreted as a three-digit number written in the base `b` given by the third argument. Here, we use a base of 256, which is the most reasonable choice for ASCII URLs. For example, for the trigram `foo`, we obtain:
julia> c = codeunits("foo")
3-element Base.CodeUnits{UInt8,String}:
0x66
0x6f
0x6f
julia> c[1] * 256^2 + c[2] * 256 + c[3]
6713199
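Taking this number modulo the last argument of the constructor, `m = 7`, gives the row of the histogram in which this trigram is counted:

```julia
julia> mod(6713199, 7)
3
```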
The last step is taking this result modulo some prime `m`, in this case 7 (the last argument of the constructor), leaving us with 3 as a result. Therefore, for the trigram `foo`, we would add 1 to the third row[1]. We can convert this `NGramMatrix` into a sparse matrix and then into a standard (dense) array:
julia> using SparseArrays
julia> hosts_dense = hosts_ngrams |> SparseMatrixCSC |> Matrix
7×10 Array{Int64,2}:
1 1 3 1 3 1 3 3 3 3
1 1 0 1 0 1 0 0 0 0
3 3 4 3 4 3 4 4 3 4
2 2 1 2 1 2 1 1 2 1
3 3 3 3 3 3 3 3 3 3
2 2 1 2 1 2 1 1 2 1
1 1 1 1 1 1 1 1 0 1
Again, we get one column for each string, and the matrix has the same number of rows as the modulo `m`. For each string `s`, we get `length(s) + n - 1` n-grams:
julia> sum(hosts_dense; dims=1)
1×10 Array{Int64,2}:
13 13 13 13 13 13 13 13 13 13
This is because we use special abstract characters (or tokens) for the start and the end of the string. If we denote these `^` and `$`, respectively, then from the string `"foo"` we get the trigrams `^^f`, `^fo`, `foo`, `oo$`, and `o$$`. Note that these special tokens are purely abstract, whereas the characters `^` and `$`, used here only for illustration, are characters like any other. Both the string start and string end tokens have a unique mapping to integers, which can be obtained as well as set:
julia> Mill.string_start_code()
0x02
julia> Mill.string_end_code()
0x03
julia> Mill.string_start_code!(42)
42
julia> Mill.string_start_code()
0x2a
`NGramMatrix` behaves like a matrix, implements an efficient left multiplication, and thus can be used in an `ArrayNode`:
julia> hosts_ngrams::AbstractMatrix{Int64}
7×10 NGramMatrix{String,Array{String,1},Int64}:
"www.foo.com"
"www.foo.com"
"www.baz.com"
"www.foo.com"
"www.baz.com"
"www.foo.com"
"www.baz.com"
"www.baz.com"
"www.bar.com"
"www.baz.com"
julia> host_node = ArrayNode(hosts_ngrams)
7×10 ArrayNode{NGramMatrix{String,Array{String,1},Int64},Nothing}:
"www.foo.com"
"www.foo.com"
"www.baz.com"
"www.foo.com"
"www.baz.com"
"www.foo.com"
"www.baz.com"
"www.baz.com"
"www.bar.com"
"www.baz.com"
The Adding custom nodes section shows a slightly more involved way of processing strings, specifically Unix paths.
Putting it all together
Now we can finally wrap everything into one `ProductNode`:
julia> ds = ProductNode((
numerical=numerical_node,
verb=verb_node,
encoding=encoding_node,
hosts=host_node
))
ProductNode with 10 obs
├── encoding: ArrayNode(4×10 OneHotMatrix with Bool elements) with 10 obs
├───── hosts: ArrayNode(7×10 NGramMatrix with Int64 elements) with 10 obs
⋮
└────── verb: ArrayNode(5×10 OneHotMatrix with Bool elements) with 10 obs
Then, we create a model for training and compute some gradients:
julia> m = reflectinmodel(ds)
ProductModel … ↦ ArrayModel(Dense(40, 10))
├── encoding: ArrayModel(Dense(4, 10))
├───── hosts: ArrayModel(Dense(7, 10))
⋮
└────── verb: ArrayModel(Dense(5, 10))
julia> gradient(() -> sum(Mill.data(m(ds))), params(m))
Grads(...)
Putting all numerical features into one `ArrayNode` is a design choice. We could as well introduce more keys in the final `ProductNode` and store each feature separately. The model treats these two cases differently (see the Nodes section).
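For illustration, the alternative with a separate key per numerical feature could look as follows (a sketch reusing the vectors defined above; the key names are arbitrary):

```julia
# each numerical feature under its own key instead of one shared matrix
ds2 = ProductNode((
    content_length=ArrayNode(collect(content_lengths')),
    date=ArrayNode(collect(dates')),
    verb=verb_node,
    encoding=encoding_node,
    hosts=host_node,
))
m2 = reflectinmodel(ds2)  # reflectinmodel now creates a submodel per key
```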
This dummy example illustrates the versatility of `Mill.jl`. With little to no preprocessing, we are able to process complex hierarchical structures and avoid manually designing feature extraction procedures. For a more involved study of processing Internet traffic with `Mill.jl`, see for example Tomáš Pevný and Marek Dědič (2020).
[1] One appropriate value of the modulo `m` for real problems is 2053.