JsonGrinder.jl

Imagine that you want to train a classifier on data that looks like this:

{
  "services": [
    {
      "protocol": "tcp",
      "port": 80
    },
    {
      "protocol": "tcp",
      "port": 443
    }
  ],
  "ip": "192.168.1.109",
  "device_id": "2717684b-3937-4644-a33a-33f4226c43ec",
  "upnp": [
    {
      "device_type": "urn:schemas-upnp-org:device:MediaServer:1",
      "services": [
        "urn:upnp-org:serviceId:ContentDirectory",
        "urn:upnp-org:serviceId:ConnectionManager"
      ],
      "manufacturer": "ARRIS",
      "model_name": "Verizon Media Server",
      "model_description": "Media Server"
    }
  ],
  "device_class": "MEDIA_BOX",
  "ssdp": [
    {
      "st": "",
      "location": "http://192.168.1.109:9098/device_description.xml",
      "method": "",
      "nt": "upnp:rootdevice",
      "server": "ARRIS DIAL/1.7.2 UPnP/1.0 ARRIS Settop Box",
      "user_agent": ""
    },
    {
      "st": "",
      "location": "http://192.168.1.109:8091/XD/21e13e66-1dd2-11b2-9b87-44e137a2ec6a",
      "method": "",
      "nt": "upnp:rootdevice",
      "server": "Allegro-Software-RomPager/5.41 UPnP/1.0 ARRIS Settop Box",
      "user_agent": ""
    }
  ],
  "mac": "44:e1:37:a2:ec:c1"
}

Most machine learning libraries assume that your data comes as tensors of a fixed dimension or as sequences, which makes working with data like the above painful. JsonGrinder.jl instead assumes your data is stored in the flexible JSON format and tries to automate most of the labor using reasonable defaults, while still giving you the option to control and tweak almost everything. JsonGrinder.jl is built on top of Mill.jl, which itself is built on top of Flux.jl (we do not reinvent the wheel). Although JsonGrinder was designed for JSON files, you can easily adapt it to XML, Protocol Buffers, MessagePack, ...

There are four steps to creating a classifier once you have loaded the data.

  1. Create a schema of the JSON files (sch = JsonGrinder.schema(samples)).
  2. Create an extractor converting JSONs to Mill structures (extractor = suggestextractor(sch)). The schema sch from the previous step is very helpful here, as it identifies how to convert nodes (Dict, Array) to Mill structures (Mill.ProductNode and Mill.BagNode) and how to convert the values in leaves (Float32, Vector{Float32}, String, Categorical).
  3. Create a model for your JSONs (model = reflectinmodel(sch, extractor, ...)).
  4. Train the model with your favourite methods; it is 100% compatible with Flux.jl tooling.

The first two steps are handled by JsonGrinder.jl, the third by Mill.jl, and the fourth by a combination of Mill.jl and Flux.jl.

The authors see the biggest advantage in the model being hierarchical and reflecting the JSON structure. Thanks to Mill.jl, it can handle missing values at all levels.
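
In code, the whole pipeline condenses to a few lines (a sketch mirroring the full example below, where samples is a vector of parsed JSONs and labelnames the vector of unique labels):

sch = JsonGrinder.schema(samples)       # step 1: infer the schema
extractor = suggestextractor(sch)       # step 2: schema -> extractor
model = reflectinmodel(sch, extractor,  # step 3: model mirroring the schema
	k -> Dense(k, 20, relu),
	d -> SegmentedMeanMax(d),
	b = Dict("" => k -> Dense(k, length(labelnames))),
)
# step 4: train with Flux.jl as usual (see the training loop below)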

Example

Our idealized workflow, demonstrated in examples/identification.jl on the device identification challenge, looks as follows. For many datasets that fit into memory, adapting it amounts to changing the key holding the labels (:device_class) and the names of the input files:

using Flux, MLDataPattern, Mill, JsonGrinder, JSON, IterTools, Statistics, BenchmarkTools, ThreadTools, StatsBase
using JsonGrinder: suggestextractor
using Mill: reflectinmodel

samples = map(readlines("/Users/tomas.pevny/Work/Presentations/JuliaMeetup/dataset/train.json")) do s
           JSON.parse(s)
       end;

labelkey = "device_class"
minibatchsize = 100
iterations = 10_000
neurons = 20 		# neurons per layer

targets = map(i -> i[labelkey], samples)
foreach(i -> delete!(i, labelkey), samples)
foreach(i -> delete!(i, "id"), samples)

#####
#  Create the schema and extractor
#####
sch = JsonGrinder.schema(samples)
extractor = suggestextractor(sch)

#####
#  Convert samples to Mill structure and extract targets
#####
data = tmap(extractor, samples)
labelnames = unique(targets)

#####
#  Create the model
#####
model = reflectinmodel(sch, extractor,
	k -> Dense(k, neurons, relu),
	d -> SegmentedMeanMax(d),
	b = Dict("" => k -> Dense(k, length(labelnames))),
)

#####
#  Train the model
#####
# draw a random minibatch: concatenated observations and one-hot labels
function minibatch()
	idx = sample(1:length(data), minibatchsize, replace = false)
	reduce(catobs, data[idx]), Flux.onehotbatch(targets[idx], labelnames)
end

# fraction of samples whose highest logit matches the true label
accuracy(x, y) = mean(map(xy -> labelnames[argmax(model(xy[1]).data[:])] == xy[2], zip(x, y)))

cb = () -> println("accuracy = ", accuracy(data, targets))
ps = Flux.params(model)
loss = (x, y) -> Flux.logitcrossentropy(model(x).data, y)
Flux.Optimise.train!(loss, ps, repeatedly(minibatch, iterations), ADAM(), cb = Flux.throttle(cb, 2))

#####
#  Classify test data
#####
test_samples = map(readlines("test.json")) do s
	extractor(JSON.parse(s))
end
# map the logits back to label names
o = Flux.onecold(model(reduce(catobs, test_samples)).data, labelnames)

A walkthrough of the example

Include libraries and load the data.

using Flux, MLDataPattern, Mill, JsonGrinder, JSON, IterTools, Statistics, BenchmarkTools, ThreadTools, StatsBase
using JsonGrinder: suggestextractor
using Mill: reflectinmodel

samples = map(readlines("train.json")) do s
	JSON.parse(s)
end;
labelkey = "device_class"
minibatchsize = 100
iterations = 10_000
neurons = 20 		# neurons per layer

Create the labels and remove them from the data, so that we do not use them as features. We also remove the id key, so that we do not learn to predict it.

targets = map(i -> i[labelkey], samples)
foreach(i -> delete!(i, labelkey), samples)
foreach(i -> delete!(i, "id"), samples)

Create the schema of the data.

sch = JsonGrinder.schema(samples)
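
The schema is a tree mirroring the structure of the JSONs, recording the observed values and their frequencies. Displaying sch in the REPL prints this tree; HierarchicalUtils.jl, on which Mill.jl and JsonGrinder.jl build, gives finer control over the printout (a sketch, assuming HierarchicalUtils is installed):

using HierarchicalUtils
printtree(sch, htrunc = 5)	# show at most 5 children per node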

Create the extractor converting JSONs to Mill structures. suggestextractor is executed below with default settings, but it allows heavy customization.

extractor = suggestextractor(sch)
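
Before converting the whole dataset, it is worth sanity-checking the extractor on a single parsed JSON (illustrative, using only functions from this example):

x = extractor(samples[1])	# the Mill representation of the first sample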

Convert the JSONs to Mill data samples.

data = tmap(extractor, samples)
labelnames = unique(targets)
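
tmap from ThreadTools.jl extracts all samples in parallel across the available Julia threads; a plain map is a drop-in, single-threaded alternative:

data = map(extractor, samples)	# equivalent to tmap, but single-threaded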

Create the model according to the data.

model = reflectinmodel(sch, extractor,
	k -> Dense(k, neurons, relu),
	d -> SegmentedMeanMax(d),
	b = Dict("" => k -> Dense(k, length(labelnames))),
)
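
The builders passed to reflectinmodel control the layers: the first creates a Dense layer for every inner node, the second the aggregation for every bag, and the b keyword overrides the builder for selected paths, here giving the root "" the final classification layer. A quick shape check (illustrative, using only functions from this example) confirms that model and data fit together:

x = reduce(catobs, data[1:2])	# a batch of the first two observations
size(model(x).data)		# (length(labelnames), 2): one column of logits per observation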

After defining a few usual functions, we start training.

function minibatch()
	idx = sample(1:length(data), minibatchsize, replace = false)
	reduce(catobs, data[idx]), Flux.onehotbatch(targets[idx], labelnames)
end

accuracy(x,y) = mean(map(xy -> labelnames[argmax(model(xy[1]).data[:])] == xy[2], zip(x, y)))

cb = () -> println("accuracy = ", accuracy(data, targets))
ps = Flux.params(model)
loss = (x,y) -> Flux.logitcrossentropy(model(x).data, y)
Flux.Optimise.train!(loss, ps, repeatedly(minibatch, iterations), ADAM(), cb = Flux.throttle(cb, 2))
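
Finally, classify the test data exactly as in the full example above: extract the test JSONs, run the model on the concatenated batch, and use onecold with labelnames to map the logits back to label names:

test_samples = map(readlines("test.json")) do s
	extractor(JSON.parse(s))
end
o = Flux.onecold(model(reduce(catobs, test_samples)).data, labelnames)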