JsonGrinder.jl

JsonGrinder is a collection of routines that facilitate the conversion of JSON documents into structures used by the Mill.jl project.

Motivation

Imagine that you want to train a classifier on data looking like this

{
  "services": [
    {
      "protocol": "tcp",
      "port": 80
    },
    {
      "protocol": "tcp",
      "port": 443
    }
  ],
  "ip": "192.168.1.109",
  "device_id": "2717684b-3937-4644-a33a-33f4226c43ec",
  "upnp": [
    {
      "device_type": "urn:schemas-upnp-org:device:MediaServer:1",
      "services": [
        "urn:upnp-org:serviceId:ContentDirectory",
        "urn:upnp-org:serviceId:ConnectionManager"
      ],
      "manufacturer": "ARRIS",
      "model_name": "Verizon Media Server",
      "model_description": "Media Server"
    }
  ],
  "device_class": "MEDIA_BOX",
  "ssdp": [
    {
      "st": "",
      "location": "http://192.168.1.109:9098/device_description.xml",
      "method": "",
      "nt": "upnp:rootdevice",
      "server": "ARRIS DIAL/1.7.2 UPnP/1.0 ARRIS Settop Box",
      "user_agent": ""
    },
    {
      "st": "",
      "location": "http://192.168.1.109:8091/XD/21e13e66-1dd2-11b2-9b87-44e137a2ec6a",
      "method": "",
      "nt": "upnp:rootdevice",
      "server": "Allegro-Software-RomPager/5.41 UPnP/1.0 ARRIS Settop Box",
      "user_agent": ""
    }
  ],
  "mac": "44:e1:37:a2:ec:c1"
}

and the task is to predict the value of the key device_class (in this sample it is MEDIA_BOX) from the rest of the JSON document.

Most machine learning libraries assume that your data comes as tensors of a fixed dimension or as sequences, and if it does not, you will have a bad time. JsonGrinder.jl, in contrast, assumes your data is stored in the flexible JSON format; it tries to automate most of the labor using reasonable defaults, while still giving you the option to control and tweak almost everything. JsonGrinder.jl is built on top of Mill.jl, which is itself built on top of Flux.jl (we do not reinvent the wheel). Although JsonGrinder was designed for JSON files, you can easily adapt it to XML, Protocol Buffers, MessagePack, and other similar structures.

Once you have loaded the data, there are four steps to creating a classifier:

  1. Create a schema of the JSON documents (sch = JsonGrinder.schema(samples)).
  2. Create an extractor converting JSONs to Mill structures (extractor = suggestextractor(sch)). The schema sch from the previous step is very helpful here, as it identifies how to convert nodes (Dict, Array) to Mill structures (Mill.ProductNode, Mill.BagNode) and how to convert the values in leaves (to Float32, Vector{Float32}, String, or categorical).
  3. Create a model for your JSONs (model = reflectinmodel(sch, extractor, ...)).
  4. Train the model using your favourite methods; it is 100% compatible with Flux.jl tooling.

The first two steps are handled by JsonGrinder.jl, the third by Mill.jl, and the fourth by a combination of Mill.jl and Flux.jl.

The authors see the biggest advantage in the model being hierarchical and reflecting the JSON structure. Thanks to Mill.jl, it can handle missing values at all levels.
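For instance, a document that lacks some of the keys still extracts into a valid Mill sample. Below is a minimal sketch, assuming sch and extractor are built as in the example that follows; the absent subtrees become missing or empty parts of the resulting node rather than causing an error:

# A minimal sketch, assuming `extractor` was built from a schema that has
# seen the full set of keys (as in the example below); the missing
# "services", "upnp", and "ssdp" subtrees are represented as missing/empty
# parts of the resulting Mill.ProductNode.
incomplete = JSON.parse("""{"ip": "192.168.1.46", "mac": "64:b5:c6:66:2b:ab"}""")
x = extractor(incomplete)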

Example

Our idealized workflow, demonstrated in examples/identification.jl on a device identification challenge, looks as follows (for many datasets that fit into memory, it should suffice to change the key holding the labels (device_class) and the names of the files):

using Flux, MLDataPattern, Mill, JsonGrinder, JSON, IterTools, Statistics, ThreadTools, StatsBase
using JsonGrinder: suggestextractor
using Mill: reflectinmodel

samples = map(readlines("/Users/tomas.pevny/Work/Presentations/JuliaMeetup/dataset/train.json")) do s
           JSON.parse(s)
       end;

labelkey = "device_class"
minibatchsize = 100
iterations = 10_000
neurons = 20 		# neurons per layer

targets = map(i -> i[labelkey], samples)
foreach(i -> delete!(i, labelkey), samples)
foreach(i -> delete!(i, "device_id"), samples)

#####
#  Create the schema and extractor
#####
sch = schema(samples)
extractor = suggestextractor(sch)

#####
#  Convert samples to Mill structure and extract targets
#####
data = tmap(extractor, samples)
labelnames = unique(targets)

#####
#  Create the model
#####
model = reflectinmodel(sch, extractor,
	k -> Dense(k, neurons, relu),
	d -> SegmentedMeanMax(d),
	fsm = Dict("" => k -> Dense(k, length(labelnames))),
)

#####
#  Train the model
#####
function minibatch()
	idx = sample(1:length(data), minibatchsize, replace = false)
	reduce(catobs, data[idx]), Flux.onehotbatch(targets[idx], labelnames)
end

accuracy(x, y) = map((xi, yi) -> labelnames[argmax(model(xi).data[:])] == yi, x, y) |> mean

cb = () -> println("accuracy = ", accuracy(data, targets))
ps = Flux.params(model)
loss = (x,y) -> Flux.logitcrossentropy(model(x).data, y)
Flux.Optimise.train!(loss, ps, repeatedly(minibatch, iterations), ADAM(), cb = Flux.throttle(cb, 2))

#####
#  Classify test data
#####
test_samples = map(JSON.parse, readlines("test.json"))
test_data = tmap(extractor, test_samples)
o = Flux.onecold(model(reduce(catobs, test_data)).data)
predicted_classes = labelnames[o]

A walkthrough of the example

Include libraries and load the data.

using Flux, MLDataPattern, Mill, JsonGrinder, JSON, IterTools, Statistics, ThreadTools, StatsBase
using JsonGrinder: suggestextractor
using Mill: reflectinmodel

samples = map(readlines("train.json")) do s
	JSON.parse(s)
end;
labelkey = "device_class"
minibatchsize = 100
iterations = 10_000
neurons = 20 		# neurons per layer
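Each element of samples is now a Dict parsed from one line of the JSON-lines file. A quick look at the data (a sketch; the exact values depend on your dataset):

length(samples)               # number of training documents
samples[1]["device_class"]    # e.g. "MEDIA_BOX"; the label is still present at this point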

Create the labels and remove them from the data, so that we do not use them as features. We also remove the device_id key, so that it is not used for prediction.

targets = map(i -> i[labelkey], samples)
foreach(i -> delete!(i, labelkey), samples)
foreach(i -> delete!(i, "device_id"), samples)

Create the schema of data

sch = JsonGrinder.schema(samples)
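The schema is a tree mirroring the structure of the documents, recording which keys were observed and how often. It can be inspected by printing it; the printtree function below comes from HierarchicalUtils.jl (assuming that package is installed), and a plain display(sch) works as well:

using HierarchicalUtils: printtree   # assumes HierarchicalUtils.jl is installed
printtree(sch)                       # shows the tree of keys with observation counts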

Create the extractor converting JSONs to Mill structures. suggestextractor is executed below with default settings, but it allows heavy customization.

extractor = suggestextractor(sch)
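To see what the extractor produces, apply it to a single parsed document: Dicts become Mill.ProductNodes, arrays become Mill.BagNodes, and leaf values become numeric, categorical, or string arrays.

x = extractor(samples[1])   # a Mill.ProductNode mirroring the schema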

Convert the JSONs to Mill data samples.

data = tmap(extractor, samples)
labelnames = unique(targets)

Create the model according to the data

model = reflectinmodel(sch, extractor,
	k -> Dense(k, neurons, relu),
	d -> SegmentedMeanMax(d),
	fsm = Dict("" => k -> Dense(k, length(labelnames))),
)

The individual arguments of reflectinmodel are explained in the Mill.jl documentation. Briefly: the first function builds a feed-forward layer for a node given its input dimension, the second builds the aggregation operator for bags, and fsm overrides the default layer at specific locations in the tree; here the empty string "" denotes the root, where we attach the final classification layer with one output per class.

Lastly, we define a few handy functions and then start training.

function minibatch()
	idx = sample(1:length(data), minibatchsize, replace = false)
	reduce(catobs, data[idx]), Flux.onehotbatch(targets[idx], labelnames)
end

accuracy(x, y) = map((xi, yi) -> labelnames[argmax(model(xi).data[:])] == yi, x, y) |> mean

cb = () -> println("accuracy = ", accuracy(data, targets))
ps = Flux.params(model)
loss = (x,y) -> Flux.logitcrossentropy(model(x).data, y)
Flux.Optimise.train!(loss, ps, repeatedly(minibatch, iterations), ADAM(), cb = Flux.throttle(cb, 2))

We should see something like

accuracy = 0.1104894138776638
accuracy = 0.45656754049666703
accuracy = 0.8238869892584534
accuracy = 0.893102614582254
accuracy = 0.9316651124235831
accuracy = 0.9554795703381342
accuracy = 0.9693468725175284
accuracy = 0.975166649397299
accuracy = 0.9758056159983421
accuracy = 0.978465098608089
accuracy = 0.9825752080958795
accuracy = 0.9840949124443062
accuracy = 0.9837495250923911
accuracy = 0.9853037681760094
accuracy = 0.9850965357648603
accuracy = 0.9861499671882016
accuracy = 0.9881359444617138
accuracy = 0.9886540254895866
accuracy = 0.9903118847787794
accuracy = 0.9901219217352261
accuracy = 0.9905363865575243
accuracy = 0.9911408144233759
accuracy = 0.9913135080993334
accuracy = 0.9903809622491624

The accuracy rises above 98% on the training set quite quickly.

The last part is inference on the test data.

test_samples = map(JSON.parse, readlines("test.json"))
test_data = tmap(extractor, test_samples)
o = Flux.onecold(model(reduce(catobs, test_data)).data)
predicted_classes = labelnames[o]

predicted_classes contains the predictions for our test set.

We can look at individual samples. For instance, test_samples[2] is

{
    "mac":"64:b5:c6:66:2b:ab",
    "ip":"192.168.1.46",
    "dhcp":[
        {
            "paramlist":"1,3,6,15,28,33",
            "classid":""
        }
    ],
    "device_id":"addb3142-6b4a-4aef-9d00-ce7ab250c05c"
}

and the corresponding classification is

julia> predicted_classes[2]
"GAME_CONSOLE"

If you want to see the probability distribution, apply softmax to the output of the network.

julia> softmax(model(test_data[2]).data)
13×1 Array{Float32,2}:
2.2447991f-6
0.0006994973
7.356086f-5
0.9131056
0.00015438742
2.277255f-6
1.2209773f-5
0.07608723
0.0024369168
0.0012505687
0.006140974
3.3941535f-5
3.9533225f-7

We can see that the probability that the given sample is a GAME_CONSOLE is ~91% (the 4th element of the array, corresponding to the 4th entry of labelnames).
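To pair each probability with its class name (the row order of the model output matches labelnames, which was used by onehotbatch during training), we can, for example, sort them:

probs = softmax(model(test_data[2]).data)[:]
sort(collect(zip(labelnames, probs)), by = last, rev = true)   # classes ordered by predicted probability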

This concludes a simple classifier for JSON data.

Keep in mind that the framework is general: thanks to its ability to embed hierarchical data into fixed-size vectors, it can be used for classification, regression, and various other ML tasks.