Extractors overview

Below, we first describe extractors of values (i.e. leaves of the JSON tree), then proceed to extractors of Array and Dict, and finish with some special extractors.

Extractors of scalar values are arguably the most important, but fortunately also the best understood ones. They control how values are converted to a Vector (or, more generally, a tensor) for the neural networks. For example, they control whether a number should be represented as a number or as a one-hot encoded categorical variable. Similarly, they control how a String should be treated, although we admit that only n-grams are supported natively.

Because the mapping from JSON (or a different hierarchical structure) to Mill structures can be non-trivial, extractors have a keyword argument store_input which, if true, causes input data to be stored as metadata of the respective Mill structure. By default it is false, because storing irregular input data can cause type instability and thus a performance loss. The store_input argument is propagated to leaves and is used primarily to store leaf values.
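The type-instability concern can be illustrated in plain Julia, independently of JsonGrinder. The following is a minimal sketch with hypothetical variable names: a container of regular inputs gets a concrete element type, whereas irregular inputs force an abstract one, which prevents Julia from generating specialized code.

```julia
# Hypothetical stored inputs; these are not produced by JsonGrinder itself.
regular_inputs = ["1", "2", "3"]      # every input is a String
irregular_inputs = ["1", 2, 3.0]      # inputs of mixed types

# A concrete element type lets Julia generate specialized code.
regular_eltype = eltype(regular_inputs)
# Mixed inputs fall back to the abstract type Any.
irregular_eltype = eltype(irregular_inputs)

println(regular_eltype, " is concrete: ", isconcretetype(regular_eltype))
println(irregular_eltype, " is concrete: ", isconcretetype(irregular_eltype))
```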


Numbers

struct ExtractScalar{T} <: AbstractExtractor
	c::T
	s::T
end

Extracts a numerical value, centered by subtracting c and scaled by multiplying by s. Strings are converted to numbers. The extractor returns ArrayNode{Matrix{T}} with a single row if uniontypes is false, and ArrayNode{Matrix{Union{Missing, T}}} with a single row if uniontypes is true.

e = ExtractScalar(Float32, 0.5, 4.0)
e("1").data

1×1 Array{Float32,2}:
 2.0
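The arithmetic behind this result can be reproduced in plain Julia. This is only a sketch of the centering and scaling described above; the helper center_and_scale is a hypothetical name and not part of JsonGrinder.

```julia
# Hypothetical helper mirroring the arithmetic only; not the JsonGrinder implementation.
function center_and_scale(x, c::T, s::T) where {T<:AbstractFloat}
    # Strings are parsed to numbers first, other numbers are converted to T.
    v = x isa AbstractString ? parse(T, x) : T(x)
    # Subtract the center c, multiply by the scale s, and return a 1×1 matrix.
    return fill((v - c) * s, 1, 1)
end

result = center_and_scale("1", 0.5f0, 4.0f0)   # (1 - 0.5) * 4
println(result)
```

With c = 0.5 and s = 4.0, the input "1" maps to (1 - 0.5) * 4 = 2, which agrees with the extractor output shown above.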

A missing value is extracted as a missing value, as it is automatically handled downstream by Mill.

e(missing)

1×1 Mill.ArrayNode{Array{Missing,2},Nothing}:
 missing

The call e("1") is equivalent to e("1", store_input=false). To see the input data in the metadata of the ArrayNode, we can run

e("1", store_input=true).metadata

1×1 Array{String,2}:
 "1"

The data remain unchanged

e("1", store_input=true).data

1×1 Array{Float32,2}:
 2.0

By default, metadata contains nothing

e("1").metadata

Strings

struct ExtractString{T} <: AbstractExtractor
	n::Int
	b::Int
	m::Int
end

Represents String as n-grams (NGramMatrix from Mill.jl) with base b and modulo m.
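To give an intuition for the roles of n, b, and m, here is a simplified sketch of hashing n-grams into row indices. It is not the Mill.jl implementation: the real NGramMatrix may also pad strings and computes indices lazily, so concrete indices can differ. The function ngram_indices is a hypothetical name, and the values n = 3 and b = 256 are assumptions chosen for illustration; m = 2053 matches the number of rows in the outputs of this section.

```julia
# Hypothetical, simplified n-gram hashing; the real NGramMatrix in Mill.jl may differ.
function ngram_indices(s::AbstractString, n::Int, b::Int, m::Int)
    bytes = codeunits(s)                  # raw bytes of the string
    idxs = Int[]
    for i in 1:(length(bytes) - n + 1)    # slide a window of n bytes
        h = 0
        for j in 0:(n - 1)
            h = h * b + Int(bytes[i + j]) # polynomial code with base b
        end
        push!(idxs, h % m + 1)            # fold into 1:m with modulo m
    end
    return idxs
end

idxs = ngram_indices("Hello", 3, 256, 2053)
println(length(idxs), " trigrams mapped to rows in 1:2053")
```

The modulo m bounds the dimension of the representation regardless of how many distinct n-grams occur in the data, at the cost of possible collisions.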

e = ExtractString()
e("Hello")

2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
 "Hello"

A missing value is extracted as a missing value, as it is automatically handled downstream by Mill.

e(missing)

2053×1 Mill.ArrayNode{Mill.NGramMatrix{Missing,Missing},Nothing}:
 missing

Storing input works in the same manner as for ExtractScalar, see

e("Hello", store_input=true).metadata

1-element Array{String,1}:
 "Hello"

It also works the same with missing values

e(missing, store_input=true).metadata

1-element Array{Missing,1}:
 missing

Categorical

struct ExtractCategorical{V,I} <: AbstractExtractor
	keyvalemap::Dict{V,I}
	n::Int
end

Converts a single item to a one-hot encoded vector. For safety, there is always an extra item reserved for an unknown value.

e = ExtractCategorical(["A","B","C"])
e(["A","B","C","D"]).data

4×4 Mill.MaybeHotMatrix{Int64,Int64,Bool}:
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1
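The role of the extra row reserved for unknown values can be sketched in plain Julia. The helper onehot_with_unknown is a hypothetical name and produces a dense Bool matrix rather than the MaybeHotMatrix used by Mill.

```julia
# Hypothetical helper; illustrates the encoding only, not the JsonGrinder implementation.
function onehot_with_unknown(items, vocabulary)
    index = Dict(v => i for (i, v) in enumerate(vocabulary))
    n = length(vocabulary) + 1            # one extra row for unknown values
    out = falses(n, length(items))
    for (col, item) in enumerate(items)
        # Items outside the vocabulary fall into the last row.
        out[get(index, item, n), col] = true
    end
    return out
end

encoded = onehot_with_unknown(["A", "B", "C", "D"], ["A", "B", "C"])
println(Int.(encoded))
```

Here "D" is not in the vocabulary, so it is encoded in the fourth, reserved row, as in the output above.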

A missing value is extracted as a missing value, as it is automatically handled downstream by Mill.

e(missing)

4×1 Mill.ArrayNode{Mill.MaybeHotMatrix{Missing,Int64,Missing},Nothing}:
 missing
 missing
 missing
 missing

Storing input in this case looks as follows

e(["A","B","C","D"], store_input=true).metadata

1-element Array{Array{String,1},1}:
 ["A", "B", "C", "D"]

Array (Lists / Sets)

struct ExtractArray{T}
	item::T
end

Converts an array of values to a Mill.BagNode with items converted by item. The entire array is assumed to be a single bag.

sc = ExtractArray(ExtractCategorical(["A","B","C"]))
sc(["A","B","C","D"])

BagNode with 1 obs
  └── ArrayNode(4×4 MaybeHotMatrix with Bool elements) with 4 obs

Empty arrays are represented as an empty bag.

sc([]).bags

Mill.AlignedBags{Int64}(UnitRange{Int64}[0:-1])
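The range 0:-1 above is an empty range marking a bag without instances. As a rough sketch of how such index ranges relate arrays to bags, the hypothetical helper bag_ranges below computes one range per input array in plain Julia; it only mimics the convention visible in the output and is not the Mill implementation.

```julia
# Hypothetical helper; computes index ranges of bags over concatenated instances.
function bag_ranges(arrays)
    ranges = UnitRange{Int}[]
    stop = 0                              # index of the last instance seen so far
    for a in arrays
        if isempty(a)
            push!(ranges, 0:-1)           # empty bag, same convention as above
        else
            push!(ranges, (stop + 1):(stop + length(a)))
            stop += length(a)
        end
    end
    return ranges
end

ranges = bag_ranges([["A", "B"], [], ["C"]])
println(ranges)
```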

The data of an empty bag can be either missing or an empty sample. The latter is more convenient, as it makes all samples of the same type, which is nicer to AD. This behavior is controlled by Mill.emptyismissing. The extractor of a BagNode can signal to child extractors to extract a sample with zero observations using a special singleton JsonGrinder.extractempty. For example

Mill.emptyismissing!(true)
sc([]).data

missing

Mill.emptyismissing!(false)
sc([]).data

4×0 Mill.ArrayNode{Mill.MaybeHotMatrix{Int64,Int64,Bool},Nothing}
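Why an empty sample keeps types uniform can be sketched with plain matrices, using hypothetical variable names. A matrix with zero columns has the same type as a populated one and concatenates with it cleanly, whereas mixing in missing widens the element type of a collection of samples.

```julia
# Plain-Julia illustration; the matrices stand in for the data of ArrayNodes.
empty_sample = zeros(Float32, 4, 0)   # 4 rows, zero observations (columns)
full_sample = ones(Float32, 4, 2)     # 4 rows, two observations

# Both samples share one concrete type and concatenate without special cases.
combined = hcat(empty_sample, full_sample)
uniform_eltype = eltype([empty_sample, full_sample])

# Representing the empty sample by `missing` widens the element type to a Union.
mixed_eltype = eltype([missing, full_sample])

println(size(combined), " ", uniform_eltype, " ", mixed_eltype)
```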

Storing input is delegated to leaf extractors, so the metadata of the bag itself is empty

sc(["A","B","C","D"], store_input=true).metadata

but the metadata of the underlying ArrayNode contains the inputs.

sc(["A","B","C","D"], store_input=true).data.metadata

4-element Array{String,1}:
 "A"
 "B"
 "C"
 "D"

In the case of empty arrays, the input is stored in the metadata of the BagNode itself, because there might not be any underlying ArrayNode.

sc([], store_input=true).metadata

1-element Array{Array{Any,1},1}:
 []

Dict

struct ExtractDict{S} <: AbstractExtractor
	dict::S
end

Extracts all items in dict and returns them as a ProductNode. Keys in dict correspond to keys in JSON.

ex = ExtractDict(Dict(:a => ExtractScalar(),
	:b => ExtractString(),
	:c => ExtractCategorical(["A","B"]),
	:d => ExtractArray(ExtractString())))
ex(Dict(:a => "1",
	:b => "Hello",
	:c => "A",
	:d => ["Hello", "world"]))

ProductNode with 1 obs
  ├── a: ArrayNode(1×1 Array with Float32 elements) with 1 obs
  ├── b: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
  ⋮
  └── d: BagNode with 1 obs
           └── ArrayNode(2053×2 NGramMatrix with Int64 elements) with 2 obs

Missing keys are replaced by missing and handled by child extractors.

ex(Dict(:a => "1",
	:c => "A"))

ProductNode with 1 obs
  ├── a: ArrayNode(1×1 Array with Float32 elements) with 1 obs
  ├── b: ArrayNode(2053×1 NGramMatrix with Missing elements) with 1 obs
  ⋮
  └── d: BagNode with 1 obs
           └── ArrayNode(2053×0 NGramMatrix with Int64 elements) with 0 obs
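The way missing keys are passed on to child extractors can be sketched in plain Julia. In this minimal example the names toy_extractors and extract_fields are hypothetical, and the children are ordinary functions rather than JsonGrinder extractors.

```julia
# Hypothetical per-key extractors: plain functions that accept `missing`.
toy_extractors = Dict(
    :a => x -> ismissing(x) ? missing : parse(Float32, x),
    :b => x -> ismissing(x) ? missing : uppercase(x),
)

# Look up every expected key; absent keys are passed on as `missing`.
extract_fields(extractors, sample) =
    Dict(k => f(get(sample, k, missing)) for (k, f) in extractors)

out = extract_fields(toy_extractors, Dict(:a => "1"))
println(out)
```

Each child decides for itself how to represent missing, which is why the parent needs no special handling beyond the lookup with a default.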

Storing input data works in a similar manner as for ExtractArray: input data are delegated to leaf extractors.

julia> ex(Dict(:a => "1",
       	:c => "A"), store_input=true).metadata

julia> ex(Dict(:a => "1",
       	:c => "A"), store_input=true)[:a].metadata
1×1 Array{String,2}:
 "1"

julia> ex(Dict(:a => "1",
       	:c => "A"), store_input=true)[:b].metadata
1-element Array{Nothing,1}:
 nothing

julia> ex(Dict(:a => "1",
       	:c => "A"), store_input=true)[:c].metadata
1-element Array{String,1}:
 "A"

julia> ex(Dict(:a => "1",
       	:c => "A"), store_input=true)[:d].metadata
1-element Array{Nothing,1}:
 nothing

julia> ex(Dict(:a => "1",
       	:b => "Hello",
       	:c => "A",
       	:d => ["Hello", "world"]), store_input=true).metadata

julia> ex(Dict(:a => "1",
       	:b => "Hello",
       	:c => "A",
       	:d => ["Hello", "world"]), store_input=true)[:a].metadata
1×1 Array{String,2}:
 "1"

julia> ex(Dict(:a => "1",
       	:b => "Hello",
       	:c => "A",
       	:d => ["Hello", "world"]), store_input=true)[:b].metadata
1-element Array{String,1}:
 "Hello"

julia> ex(Dict(:a => "1",
       	:b => "Hello",
       	:c => "A",
       	:d => ["Hello", "world"]), store_input=true)[:c].metadata
1-element Array{String,1}:
 "A"

julia> ex(Dict(:a => "1",
       	:b => "Hello",
       	:c => "A",
       	:d => ["Hello", "world"]), store_input=true)[:d].metadata

julia> ex(Dict(:a => "1",
       	:b => "Hello",
       	:c => "A",
       	:d => ["Hello", "world"]), store_input=true)[:d].data.metadata
2-element Array{String,1}:
 "Hello"
 "world"

Specials

ExtractKeyAsField

Some JSONs we have encountered use Dicts to hold an array of named lists (or other types). With our computer security background, a prototypical example is storing a list of DLLs with a corresponding list of imported functions in a single structure. For example, a JSON

{ "foo.dll" : ["print","write", "open","close"],
  "bar.dll" : ["send", "recv"]
}

would be better written as

[{"key": "foo.dll",
  "item": ["print","write", "open","close"]},
  {"key": "bar.dll",
  "item": ["send", "recv"]}
]
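The rewriting from the first form to the second can be sketched in plain Julia. The helper keys_as_fields is a hypothetical name used only for illustration; it is not how ExtractKeyAsField is implemented.

```julia
# Hypothetical helper: turn a Dict with values hidden in keys into a list of records.
function keys_as_fields(d::AbstractDict)
    # Keys are sorted only to make the output order deterministic.
    return [Dict("key" => k, "item" => d[k]) for k in sort(collect(keys(d)))]
end

imports = Dict("foo.dll" => ["print", "write", "open", "close"],
               "bar.dll" => ["send", "recv"])
records = keys_as_fields(imports)
println(records)
```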

JsonGrinder tries to detect these cases, as they typically manifest as Dicts with an excessively large number of keys in a schema. The detection logic in suggestextractor(e::DictEntry) is simple: if the number of unique keys in a specific Dict is greater than settings.key_as_field = 500, such a Dict is considered to hold values in keys and ExtractKeyAsField is used instead of ExtractDict. key_as_field can be set to any value based on the specific data or domain, but we have found 500 to be a reasonable default.
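The threshold rule can be sketched as follows. This is an illustration under stated assumptions: the function looks_like_key_as_field is a hypothetical name, and it counts keys over raw samples, whereas suggestextractor works on a schema.

```julia
# Hypothetical check; mimics the threshold rule, not the actual suggestextractor code.
function looks_like_key_as_field(samples; key_as_field = 500)
    unique_keys = Set{String}()
    for d in samples
        union!(unique_keys, keys(d))      # collect keys seen across all samples
    end
    return length(unique_keys) > key_as_field
end

many_keys = [Dict("file_$i.dll" => ["open"]) for i in 1:600]   # 600 distinct keys
few_keys = [Dict("name" => "a", "size" => 1) for _ in 1:600]   # 2 distinct keys

println(looks_like_key_as_field(many_keys), " ", looks_like_key_as_field(few_keys))
```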

The extractor itself is simple as well. For the case above, it would look like

s = JSON.parse("{ \"foo.dll\" : [\"print\",\"write\", \"open\",\"close\"],
  \"bar.dll\" : [\"send\", \"recv\"]
}")
ex = ExtractKeyAsField(ExtractString(),ExtractArray(ExtractString()))
ex(s)

BagNode with 1 obs
  └── ProductNode with 2 obs
        ├── item: BagNode with 2 obs
        │           ⋮
        └─── key: ArrayNode(2053×2 NGramMatrix with Int64 elements) with 2 obs

As you might expect, inputs are stored in leaf metadata if needed

julia> ex(s, store_input=true).metadata
1-element Array{Dict{String,Any},1}:
 Dict("bar.dll" => Any["send", "recv"],"foo.dll" => Any["print", "write", "open", "close"])

julia> ex(s, store_input=true).data[:key].metadata
2-element Array{String,1}:
 "bar.dll"
 "foo.dll"

julia> ex(s, store_input=true).data[:item].data.metadata
6-element Array{String,1}:
 "send"
 "recv"
 "print"
 "write"
 "open"
 "close"

Because it returns a BagNode, missing values are treated in a similar manner as in ExtractArray, and the setting of Mill.emptyismissing applies here too.

Mill.emptyismissing!(true)
ex(Dict()).data

missing

Mill.emptyismissing!(false)
ex(Dict()).data

ProductNode with 0 obs
  ├── item: BagNode with 0 obs
  │           └── ArrayNode(2053×0 NGramMatrix with Int64 elements) with 0 obs
  └─── key: ArrayNode(2053×0 NGramMatrix with Int64 elements) with 0 obs

MultipleRepresentation

Provides a way to have multiple representations for a single value or subtree in JSON. For example, imagine that you are extracting strings with some very frequently occurring values and a lot of clutter, which might be important without you knowing about it. MultipleRepresentation(extractors::Tuple) contains a Tuple or NamedTuple of extractors and applies them to a single sub-tree in a JSON. The corresponding Mill structure will contain a ProductNode of all representations.

For example, a String with Categorical and NGram representations will look like

ex = MultipleRepresentation((c = ExtractCategorical(["Hello","world"]), s = ExtractString()))
reduce(catobs,ex.(["Hello","world","from","Prague"]))

ProductNode with 4 obs
  ├── c: ArrayNode(3×4 MaybeHotMatrix with Bool elements) with 4 obs
  └── s: ArrayNode(2053×4 NGramMatrix with Int64 elements) with 4 obs
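The idea of applying several extractors to the same value can be sketched with a NamedTuple of plain functions. The names toy_representations and represent are hypothetical, and the functions are simple stand-ins, not JsonGrinder extractors.

```julia
# Hypothetical stand-ins for a categorical and a string representation.
toy_representations = (
    c = x -> x in ("Hello", "world") ? x : "unknown",  # known values or a catch-all
    s = x -> length(x),                                # any function of the raw string
)

# Apply every function to the same input; `map` keeps the NamedTuple keys.
represent(extractors::NamedTuple, x) = map(f -> f(x), extractors)

r = represent(toy_representations, "Prague")
println(r)
```

Frequent values get a dedicated categorical slot while the clutter is still captured by the second representation, which is the motivation given above.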

Because it produces a ProductNode, missing values are delegated to leaf extractors.

ex(missing)

ProductNode with 1 obs
  ├── c: ArrayNode(3×1 MaybeHotMatrix with Missing elements) with 1 obs
  └── s: ArrayNode(2053×1 NGramMatrix with Missing elements) with 1 obs

MultipleRepresentation together with handling of missing values enables JsonGrinder to deal with JSONs with a non-stable schema.

A minimalistic example of such a non-stable schema is a JSON which sometimes has a string and sometimes an array of numbers under the same key. Let's create an appropriate MultipleRepresentation (although in real-world usage the most suitable MultipleRepresentation is proposed by suggestextractor based on observed data):

julia> ex = MultipleRepresentation((ExtractString(), ExtractArray(ExtractScalar(Float32))));

julia> e_hello = ex("Hello")
ProductNode with 1 obs
  ├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
  └── e2: BagNode with 1 obs
            └── ArrayNode(1×0 Array with Float32 elements) with 0 obs

julia> e_hello[:e1].data
2053×1 Mill.NGramMatrix{String,Int64}:
 "Hello"

julia> e_hello[:e2].data
1×0 Mill.ArrayNode{Array{Float32,2},Nothing}

julia> e_123 = ex([1,2,3])
ProductNode with 1 obs
  ├── e1: ArrayNode(2053×1 NGramMatrix with Missing elements) with 1 obs
  └── e2: BagNode with 1 obs
            └── ArrayNode(1×3 Array with Float32 elements) with 3 obs

julia> e_123[:e1].data
2053×1 Mill.NGramMatrix{Missing,Missing}:
 missing

julia> e_123[:e2].data
1×3 Mill.ArrayNode{Array{Float32,2},Nothing}:
 1.0  2.0  3.0

julia> e_2 = ex([2])
ProductNode with 1 obs
  ├── e1: ArrayNode(2053×1 NGramMatrix with Missing elements) with 1 obs
  └── e2: BagNode with 1 obs
            └── ArrayNode(1×1 Array with Float32 elements) with 1 obs

julia> e_2[:e1].data
2053×1 Mill.NGramMatrix{Missing,Missing}:
 missing

julia> e_2[:e2].data
1×1 Mill.ArrayNode{Array{Float32,2},Nothing}:
 2.0

julia> e_world = ex("world")
ProductNode with 1 obs
  ├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
  └── e2: BagNode with 1 obs
            └── ArrayNode(1×0 Array with Float32 elements) with 0 obs

julia> e_world[:e1].data
2053×1 Mill.NGramMatrix{String,Int64}:
 "world"

julia> e_world[:e2].data
1×0 Mill.ArrayNode{Array{Float32,2},Nothing}

In this example we can see that each time one representation is missing or empty, and the other one contains data.
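The same behavior can be mimicked with two plain functions that each accept only one input type. This is a sketch with hypothetical names (as_string, as_numbers, represent_both); it mirrors the pattern of the outputs above and not the internals of the extractors.

```julia
# Hypothetical stand-ins for the two representations used above.
as_string(x) = x isa AbstractString ? x : missing               # missing unless a string
as_numbers(x) = x isa AbstractVector ? Float32.(x) : Float32[]  # empty unless an array

# Both representations are always produced; only one carries data.
represent_both(x) = (e1 = as_string(x), e2 = as_numbers(x))

h = represent_both("Hello")
n = represent_both([1, 2, 3])
println(h, " ", n)
```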

ExtractEmpty

As mentioned earlier, ExtractEmpty is a type used to extract a sample with zero observations. There is a singleton extractempty which can be used to obtain an instance of the ExtractEmpty type. StatsBase.nobs(ex(JsonGrinder.extractempty)) == 0 is required to hold for every extractor in order to work correctly.
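The singleton pattern and the zero-observations requirement can be sketched with multiple dispatch in plain Julia. The names EmptyMarker, emptymarker, and toy_scalar are hypothetical stand-ins and do not exist in JsonGrinder.

```julia
# Hypothetical stand-ins for ExtractEmpty and extractempty.
struct EmptyMarker end
const emptymarker = EmptyMarker()

# A toy extractor: one method for real inputs, one for the empty marker.
toy_scalar(x::Real) = fill(Float32(x), 1, 1)       # one row, one observation
toy_scalar(::EmptyMarker) = zeros(Float32, 1, 0)   # one row, zero observations

# Observations are stored as columns, so the empty case must have none.
n_empty = size(toy_scalar(emptymarker), 2)
n_single = size(toy_scalar(3), 2)
println(n_empty, " ", n_single)
```

Because the empty result keeps the same row count and element type as a regular one, it can be concatenated with other samples without special cases.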

All above-mentioned extractors are able to extract this, as we can see here

julia> ExtractString()(JsonGrinder.extractempty)
2053×0 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}

julia> ExtractString()(JsonGrinder.extractempty) |> nobs
0

julia> ExtractCategorical(["A","B"])(JsonGrinder.extractempty)
3×0 Mill.ArrayNode{Mill.MaybeHotMatrix{Int64,Int64,Bool},Nothing}

julia> ExtractCategorical(["A","B"])(JsonGrinder.extractempty) |> nobs
0

julia> ExtractScalar()(JsonGrinder.extractempty)
1×0 Mill.ArrayNode{Array{Float32,2},Nothing}

julia> ExtractScalar()(JsonGrinder.extractempty) |> nobs
0

julia> ExtractArray(ExtractString())(JsonGrinder.extractempty)
BagNode with 0 obs
  └── ArrayNode(2053×0 NGramMatrix with Int64 elements) with 0 obs

julia> ExtractArray(ExtractString())(JsonGrinder.extractempty) |> nobs
0

julia> ExtractDict(Dict(:a => ExtractScalar(),
       	:b => ExtractString(),
       	:c => ExtractCategorical(["A","B"]),
       	:d => ExtractArray(ExtractString())))(JsonGrinder.extractempty)
ProductNode with 0 obs
  ├── a: ArrayNode(1×0 Array with Float32 elements) with 0 obs
  ├── b: ArrayNode(2053×0 NGramMatrix with Int64 elements) with 0 obs
  ⋮
  └── d: BagNode with 0 obs
           └── ArrayNode(2053×0 NGramMatrix with Int64 elements) with 0 obs

julia> ExtractDict(Dict(:a => ExtractScalar(),
       	:b => ExtractString(),
       	:c => ExtractCategorical(["A","B"]),
       	:d => ExtractArray(ExtractString())))(JsonGrinder.extractempty) |> nobs
0