Extractors overview
Below, we first describe extractors of values (i.e. leaves of the JSON tree), then proceed to extractors of Array and Dict, and finish with some specials.
Extractors of scalar values are arguably the most important, but fortunately also the best understood ones. They control how values are converted to a Vector (or, more generally, a tensor) for the neural network. For example, they control whether a number should be represented as a number or as a one-hot encoded categorical variable. Similarly, they control how a String should be treated, although we admit to natively supporting only n-grams.
Because the mapping from JSON (or a different hierarchical structure) to Mill structures can be non-trivial, extractors accept a keyword argument store_input, which, if true, causes input data to be stored as metadata of the respective Mill structure. By default it is false, because storing irregular input data can cause type instability and thus a performance loss. The store_input argument is propagated to leaves and is used primarily to store leaf values.
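The effect of store_input on a leaf can be pictured with a minimal plain-Julia sketch. The function extract_leaf below is hypothetical and is not part of JsonGrinder; it only mimics the idea that the raw input is kept next to the converted data when the flag is set, and that nothing is stored otherwise.

```julia
# Hypothetical stand-in for a leaf extractor; not JsonGrinder's implementation.
function extract_leaf(x::AbstractString; store_input::Bool = false)
    data = reshape(Float32[parse(Float32, x)], 1, 1)   # converted value, 1×1 matrix
    metadata = store_input ? reshape([x], 1, 1) : nothing   # raw input only on request
    return (data = data, metadata = metadata)
end

plain = extract_leaf("1")
stored = extract_leaf("1", store_input = true)
```

Note that the return type of this sketch depends on the flag, which hints at why storing inputs can hurt type stability.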
Numbers
struct ExtractScalar{T} <: AbstractExtractor
c::T
s::T
end
Extracts a numerical value, centered by subtracting c and scaled by multiplying by s. Strings are converted to numbers. The extractor returns ArrayNode{Matrix{T}} with a single row if uniontypes is false, and ArrayNode{Matrix{Union{Missing, T}}} with a single row if uniontypes is true.
e = ExtractScalar(Float32, 0.5, 4.0)
e("1").data
1×1 Array{Float32,2}: 2.0
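The arithmetic behind this result can be checked with a small plain-Julia sketch. The helpers to_number and center_and_scale are hypothetical names introduced here for illustration; they only reproduce the "subtract c, then multiply by s" rule described above and do not use JsonGrinder.

```julia
# Hypothetical helpers; they mimic the centering and scaling rule only.
to_number(x::Number, ::Type{T}) where {T} = T(x)
to_number(x::AbstractString, ::Type{T}) where {T} = parse(T, x)

function center_and_scale(x, c::T, s::T) where {T<:AbstractFloat}
    v = to_number(x, T)                    # strings are converted to numbers
    return reshape([(v - c) * s], 1, 1)    # one row, one observation
end

result = center_and_scale("1", 0.5f0, 4.0f0)   # (1 - 0.5) * 4
```

With c = 0.5 and s = 4.0, the input "1" maps to (1 - 0.5) * 4 = 2.0, which agrees with the output shown above.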
A missing value is extracted as a missing value, as it is automatically handled downstream by Mill.
e(missing)
1×1 Mill.ArrayNode{Array{Missing,2},Nothing}: missing
The call e("1") is equivalent to e("1", store_input=false). To see the input data in the metadata of the ArrayNode, we can run
e("1", store_input=true).metadata
1×1 Array{String,2}: "1"
The data remain unchanged:
e("1", store_input=true).data
1×1 Array{Float32,2}: 2.0
By default, the metadata contain nothing:
e("1").metadata
Strings
struct ExtractString{T} <: AbstractExtractor
n::Int
b::Int
m::Int
end
Represents a String as n-grams (NGramMatrix from Mill.jl) with base b and modulo m.
e = ExtractString()
e("Hello")
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}: "Hello"
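To give an intuition for what "base b and modulo m" means, here is a minimal sketch of hashing byte n-grams into a fixed number of rows. It is an illustration only: the function ngram_rows is hypothetical, the defaults n = 3 and b = 256 are assumptions, and m = 2053 is chosen to match the row count shown above. Mill's NGramMatrix may pad strings and compute indices differently.

```julia
# Hypothetical sketch of n-gram hashing; not Mill's actual algorithm.
function ngram_rows(s::AbstractString; n::Int = 3, b::Int = 256, m::Int = 2053)
    bytes = codeunits(s)
    rows = Int[]
    for i in 1:(length(bytes) - n + 1)
        h = 0
        for j in 0:(n - 1)
            h = h * b + Int(bytes[i + j])   # read the n-gram as a number in base b
        end
        push!(rows, mod(h, m) + 1)          # fold into 1:m (1-based row index)
    end
    return rows
end

rows = ngram_rows("Hello")   # n-grams: "Hel", "ell", "llo"
```

Each string therefore touches only a handful of the m rows, which is why a sparse representation of the column is natural.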
A missing value is extracted as a missing value, as it is automatically handled downstream by Mill.
e(missing)
2053×1 Mill.ArrayNode{Mill.NGramMatrix{Missing,Missing},Nothing}: missing
Storing input works in the same manner as for ExtractScalar:
e("Hello", store_input=true).metadata
1-element Array{String,1}: "Hello"
It also works the same way with missing values:
e(missing, store_input=true).metadata
1-element Array{Missing,1}: missing
Categorical
struct ExtractCategorical{V,I} <: AbstractExtractor
keyvalemap::Dict{V,I}
n::Int
end
Converts a single item to a one-hot encoded vector. For safety, there is always an extra item reserved for an unknown value.
e = ExtractCategorical(["A","B","C"])
e(["A","B","C","D"]).data
4×4 Mill.MaybeHotMatrix{Int64,Int64,Bool}:
 1  0  0  0
 0  1  0  0
 0  0  1  0
 0  0  0  1
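The reserved slot for unknown values can be illustrated with a short plain-Julia sketch. The function onehot_with_unknown is hypothetical and keeps the keys in the order given, which is an assumption; the library may order keys differently.

```julia
# Hypothetical sketch of one-hot encoding with an extra "unknown" slot.
function onehot_with_unknown(keys::Vector{String}, x::String)
    index = Dict(k => i for (i, k) in enumerate(keys))
    n = length(keys) + 1            # one extra position for unknown values
    v = falses(n)
    v[get(index, x, n)] = true      # unseen values fall into the last slot
    return v
end

known = onehot_with_unknown(["A", "B", "C"], "B")
unknown = onehot_with_unknown(["A", "B", "C"], "D")
```

Here "D" is not among the known keys, so it activates the fourth position, which corresponds to the last column of the matrix above.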
A missing value is extracted as a missing value, as it is automatically handled downstream by Mill.
e(missing)
4×1 Mill.ArrayNode{Mill.MaybeHotMatrix{Missing,Int64,Missing},Nothing}:
 missing
 missing
 missing
 missing
Storing input in this case looks as follows
e(["A","B","C","D"], store_input=true).metadata
1-element Array{Array{String,1},1}: ["A", "B", "C", "D"]
Array (Lists / Sets)
struct ExtractArray{T}
item::T
end
Converts an array of values to a Mill.BagNode with items converted by item. The entire array is assumed to be a single bag.
sc = ExtractArray(ExtractCategorical(["A","B","C"]))
sc(["A","B","C","D"])
BagNode with 1 obs
  └── ArrayNode(4×4 MaybeHotMatrix with Bool elements) with 4 obs
Empty arrays are represented as an empty bag.
sc([]).bags
Mill.AlignedBags{Int64}(UnitRange{Int64}[0:-1])
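The range 0:-1 above is how an empty bag is recorded: it is an empty range, so it points at no observations. A minimal sketch of how bag lengths could be turned into such ranges follows. The function bag_ranges is hypothetical and only illustrates the bookkeeping, using the 0:-1 convention seen in the output; it is not Mill's implementation.

```julia
# Hypothetical sketch: map bag sizes to index ranges over stacked observations.
function bag_ranges(lengths::Vector{Int})
    ranges = UnitRange{Int}[]
    offset = 0
    for len in lengths
        if len == 0
            push!(ranges, 0:-1)                          # empty bag, no observations
        else
            push!(ranges, (offset + 1):(offset + len))   # contiguous block of items
            offset += len
        end
    end
    return ranges
end

single_empty = bag_ranges([0])
mixed = bag_ranges([2, 0, 3])
```

In the mixed case, the first bag owns observations 1 and 2, the second owns none, and the third owns observations 3 to 5.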
The data of an empty bag can be either missing or an empty sample. The latter is more convenient, because it makes all samples the same type, which is friendlier to automatic differentiation (AD). This behavior is controlled by Mill.emptyismissing. The extractor of a BagNode can signal to child extractors to extract a sample with zero observations using a special singleton JsonGrinder.extractempty. For example:
Mill.emptyismissing!(true)
sc([]).data
missing
Mill.emptyismissing!(false)
sc([]).data
4×0 Mill.ArrayNode{Mill.MaybeHotMatrix{Int64,Int64,Bool},Nothing}
Storing input is delegated to leaf extractors, so metadata of bag itself are empty
sc(["A","B","C","D"], store_input=true).metadata
but the metadata of the underlying ArrayNode contain the inputs.
sc(["A","B","C","D"], store_input=true).data.metadata
4-element Array{String,1}: "A" "B" "C" "D"
In the case of empty arrays, the input is stored in the metadata of the BagNode itself, because there might not be any underlying ArrayNode.
sc([], store_input=true).metadata
1-element Array{Array{Any,1},1}: []
Dict
struct ExtractDict{S} <: AbstractExtractor
dict::S
end
Extracts all items in dict and returns them as a ProductNode. Keys in dict correspond to keys in JSON.
ex = ExtractDict(Dict(:a => ExtractScalar(),
:b => ExtractString(),
:c => ExtractCategorical(["A","B"]),
:d => ExtractArray(ExtractString())))
ex(Dict(:a => "1",
:b => "Hello",
:c => "A",
:d => ["Hello", "world"]))
ProductNode with 1 obs
  ├── a: ArrayNode(1×1 Array with Float32 elements) with 1 obs
  ├── b: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
  ⋮
  └── d: BagNode with 1 obs
           └── ArrayNode(2053×2 NGramMatrix with Int64 elements) with 2 obs
Missing keys are replaced by missing and handled by child extractors.
ex(Dict(:a => "1",
:c => "A"))
ProductNode with 1 obs
  ├── a: ArrayNode(1×1 Array with Float32 elements) with 1 obs
  ├── b: ArrayNode(2053×1 NGramMatrix with Missing elements) with 1 obs
  ⋮
  └── d: BagNode with 1 obs
           └── ArrayNode(2053×0 NGramMatrix with Int64 elements) with 0 obs
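The way absent keys turn into missing before they reach the children can be sketched in plain Julia. The names leaf_extractors and extract_dict are hypothetical, and the leaves are simple closures rather than real extractors; the point is only the get(sample, key, missing) lookup.

```julia
# Hypothetical leaves: each one decides for itself what to do with `missing`.
leaf_extractors = Dict(
    :a => x -> ismissing(x) ? missing : parse(Float32, x),
    :b => x -> ismissing(x) ? missing : uppercase(x),
)

# Every configured key is visited; absent keys are passed down as `missing`.
extract_dict(extractors, sample) =
    Dict(k => f(get(sample, k, missing)) for (k, f) in extractors)

out = extract_dict(leaf_extractors, Dict(:a => "1"))
```

The result always has one entry per configured key, so samples with different sets of keys still produce structures of the same shape.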
Storing input data works in a similar manner as for ExtractArray: input data are delegated to leaf extractors.
julia> ex(Dict(:a => "1",
:c => "A"), store_input=true).metadata
julia> ex(Dict(:a => "1",
:c => "A"), store_input=true)[:a].metadata
1×1 Array{String,2}:
"1"
julia> ex(Dict(:a => "1",
:c => "A"), store_input=true)[:b].metadata
1-element Array{Nothing,1}:
nothing
julia> ex(Dict(:a => "1",
:c => "A"), store_input=true)[:c].metadata
1-element Array{String,1}:
"A"
julia> ex(Dict(:a => "1",
:c => "A"), store_input=true)[:d].metadata
1-element Array{Nothing,1}:
nothing
or
julia> ex(Dict(:a => "1",
:b => "Hello",
:c => "A",
:d => ["Hello", "world"]), store_input=true).metadata
julia> ex(Dict(:a => "1",
:b => "Hello",
:c => "A",
:d => ["Hello", "world"]), store_input=true)[:a].metadata
1×1 Array{String,2}:
"1"
julia> ex(Dict(:a => "1",
:b => "Hello",
:c => "A",
:d => ["Hello", "world"]), store_input=true)[:b].metadata
1-element Array{String,1}:
"Hello"
julia> ex(Dict(:a => "1",
:b => "Hello",
:c => "A",
:d => ["Hello", "world"]), store_input=true)[:c].metadata
1-element Array{String,1}:
"A"
julia> ex(Dict(:a => "1",
:b => "Hello",
:c => "A",
:d => ["Hello", "world"]), store_input=true)[:d].metadata
julia> ex(Dict(:a => "1",
:b => "Hello",
:c => "A",
:d => ["Hello", "world"]), store_input=true)[:d].data.metadata
2-element Array{String,1}:
"Hello"
"world"
Specials
ExtractKeyAsField
Some JSONs we have encountered use Dicts to hold an array of named lists (or other types). Coming from a computer security background, a prototypical example is storing a list of DLLs with a corresponding list of imported functions in a single structure. For example, a JSON
{ "foo.dll" : ["print","write", "open","close"],
"bar.dll" : ["send", "recv"]
}
should be better written as
[{"key": "foo.dll",
"item": ["print","write", "open","close"]},
{"key": "bar.dll",
"item": ["send", "recv"]}
]
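The rewriting shown above can be expressed as a small transformation. The function keys_to_fields below is a hypothetical helper written in plain Julia for illustration; it is not how ExtractKeyAsField is implemented, and it sorts the keys only to make the result deterministic.

```julia
# Hypothetical helper: turn a Dict with data in its keys into a list of records.
function keys_to_fields(d::AbstractDict)
    return [Dict("key" => k, "item" => d[k]) for k in sort(collect(keys(d)))]
end

imports = Dict(
    "foo.dll" => ["print", "write", "open", "close"],
    "bar.dll" => ["send", "recv"],
)
records = keys_to_fields(imports)
```

After this step the key is ordinary data, so it can be treated like any other string field.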
JsonGrinder tries to detect these cases, as they typically manifest as Dicts with an excessively large number of keys in a schema. The detection logic in suggestextractor(e::DictEntry) is simple: if the number of unique keys in a specific Dict is greater than settings.key_as_field = 500, the Dict is considered to hold values in its keys and ExtractKeyAsField is used instead of ExtractDict. key_as_field can be set to any value based on the specific data or domain, but we have found 500 to be a reasonable default.
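The threshold rule can be written down in one line. The function suggest_key_as_field is a hypothetical name used only for this sketch; the real decision is made inside suggestextractor on schema entries, whose internals are not shown here.

```julia
# Hypothetical sketch of the threshold rule; 500 is the default mentioned above.
suggest_key_as_field(observed_keys; key_as_field::Int = 500) =
    length(unique(observed_keys)) > key_as_field

few_keys = ["name", "size", "hash"]
many_keys = ["lib$(i).dll" for i in 1:501]
```

Lowering key_as_field makes the heuristic fire for smaller dictionaries, which may suit domains where key sets are known to be small.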
The extractor itself is simple as well. For the case above, it would look like
s = JSON.parse("{ \"foo.dll\" : [\"print\",\"write\", \"open\",\"close\"],
\"bar.dll\" : [\"send\", \"recv\"]
}")
ex = ExtractKeyAsField(ExtractString(),ExtractArray(ExtractString()))
ex(s)
BagNode with 1 obs
  └── ProductNode with 2 obs
        ├── item: BagNode with 2 obs
        │           ⋮
        └─── key: ArrayNode(2053×2 NGramMatrix with Int64 elements) with 2 obs
As you might expect, inputs are stored in leaf metadata if needed
julia> ex(s, store_input=true).metadata
1-element Array{Dict{String,Any},1}:
Dict("bar.dll" => Any["send", "recv"],"foo.dll" => Any["print", "write", "open", "close"])
julia> ex(s, store_input=true).data[:key].metadata
2-element Array{String,1}:
"bar.dll"
"foo.dll"
julia> ex(s, store_input=true).data[:item].data.metadata
6-element Array{String,1}:
"send"
"recv"
"print"
"write"
"open"
"close"
Because it returns a BagNode, missing values are treated in a similar manner as in ExtractArray, and the Mill.emptyismissing setting applies here too.
Mill.emptyismissing!(true)
ex(Dict()).data
missing
Mill.emptyismissing!(false)
ex(Dict()).data
ProductNode with 0 obs
  ├── item: BagNode with 0 obs
  │           └── ArrayNode(2053×0 NGramMatrix with Int64 elements) with 0 obs
  └─── key: ArrayNode(2053×0 NGramMatrix with Int64 elements) with 0 obs
MultipleRepresentation
Provides a way to have multiple representations for a single value or subtree in JSON. For example, imagine that you are extracting strings with some very frequently occurring values and a lot of clutter, which might be important without you knowing it. MultipleRepresentation(extractors::Tuple) contains a Tuple or NamedTuple of extractors and applies them to a single subtree in a JSON. The corresponding Mill structure will contain a ProductNode of all representations.
For example, a String with Categorical and NGram representations will look like
ex = MultipleRepresentation((c = ExtractCategorical(["Hello","world"]), s = ExtractString()))
reduce(catobs,ex.(["Hello","world","from","Prague"]))
ProductNode with 4 obs
  ├── c: ArrayNode(3×4 MaybeHotMatrix with Bool elements) with 4 obs
  └── s: ArrayNode(2053×4 NGramMatrix with Int64 elements) with 4 obs
Because it produces ProductNode
, missing values are delegated to leaf extractors.
ex(missing)
ProductNode with 1 obs
  ├── c: ArrayNode(3×1 MaybeHotMatrix with Missing elements) with 1 obs
  └── s: ArrayNode(2053×1 NGramMatrix with Missing elements) with 1 obs
MultipleRepresentation together with the handling of missing values enables JsonGrinder to deal with JSONs with a non-stable schema.
A minimalistic example of such a non-stable schema is a JSON which sometimes has a string and sometimes an array of numbers under the same key. Let's create an appropriate MultipleRepresentation (although in real-world usage the most suitable MultipleRepresentation is proposed by suggestextractor based on the observed data):
julia> ex = MultipleRepresentation((ExtractString(), ExtractArray(ExtractScalar(Float32))));
julia> e_hello = ex("Hello")
ProductNode with 1 obs
├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
└── e2: BagNode with 1 obs
└── ArrayNode(1×0 Array with Float32 elements) with 0 obs
julia> e_hello[:e1].data
2053×1 Mill.NGramMatrix{String,Int64}:
"Hello"
julia> e_hello[:e2].data
1×0 Mill.ArrayNode{Array{Float32,2},Nothing}
julia> e_123 = ex([1,2,3])
ProductNode with 1 obs
├── e1: ArrayNode(2053×1 NGramMatrix with Missing elements) with 1 obs
└── e2: BagNode with 1 obs
└── ArrayNode(1×3 Array with Float32 elements) with 3 obs
julia> e_123[:e1].data
2053×1 Mill.NGramMatrix{Missing,Missing}:
missing
julia> e_123[:e2].data
1×3 Mill.ArrayNode{Array{Float32,2},Nothing}:
1.0 2.0 3.0
julia> e_2 = ex([2])
ProductNode with 1 obs
├── e1: ArrayNode(2053×1 NGramMatrix with Missing elements) with 1 obs
└── e2: BagNode with 1 obs
└── ArrayNode(1×1 Array with Float32 elements) with 1 obs
julia> e_2[:e1].data
2053×1 Mill.NGramMatrix{Missing,Missing}:
missing
julia> e_2[:e2].data
1×1 Mill.ArrayNode{Array{Float32,2},Nothing}:
2.0
julia> e_world = ex("world")
ProductNode with 1 obs
├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
└── e2: BagNode with 1 obs
└── ArrayNode(1×0 Array with Float32 elements) with 0 obs
julia> e_world[:e1].data
2053×1 Mill.NGramMatrix{String,Int64}:
"world"
julia> e_world[:e2].data
1×0 Mill.ArrayNode{Array{Float32,2},Nothing}
In this example we can see that, for every input, one representation is missing or empty while the other one contains data.
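The same pattern can be mimicked without any library: apply several functions to one value and let each return missing when the value is not of the type it understands. The names apply_all and reps are hypothetical, and the two closures are stand-ins for real extractors, so this is only a sketch of the idea.

```julia
# Hypothetical sketch: apply every representation to the same value.
apply_all(extractors::NamedTuple, x) = map(f -> f(x), extractors)

reps = (
    s = x -> x isa AbstractString ? length(x) : missing,        # string view
    a = x -> x isa AbstractVector ? Float32.(x) : missing,      # numeric-array view
)

from_string = apply_all(reps, "Hello")
from_array = apply_all(reps, [1, 2, 3])
```

Because map over a NamedTuple keeps the names, both results have the same fields s and a regardless of the input type, which is what keeps the downstream structure stable.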
ExtractEmpty
As mentioned earlier, ExtractEmpty is a type used to extract a sample with zero observations. There is a singleton, extractempty, which can be used to obtain an instance of the ExtractEmpty type. StatsBase.nobs(ex(JsonGrinder.extractempty)) == 0 is required to hold for every extractor in order to work correctly.
All above-mentioned extractors are able to extract this, as we can see here
julia> ExtractString()(JsonGrinder.extractempty)
2053×0 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}
julia> ExtractString()(JsonGrinder.extractempty) |> nobs
0
julia> ExtractCategorical(["A","B"])(JsonGrinder.extractempty)
3×0 Mill.ArrayNode{Mill.MaybeHotMatrix{Int64,Int64,Bool},Nothing}
julia> ExtractCategorical(["A","B"])(JsonGrinder.extractempty) |> nobs
0
julia> ExtractScalar()(JsonGrinder.extractempty)
1×0 Mill.ArrayNode{Array{Float32,2},Nothing}
julia> ExtractScalar()(JsonGrinder.extractempty) |> nobs
0
julia> ExtractArray(ExtractString())(JsonGrinder.extractempty)
BagNode with 0 obs
└── ArrayNode(2053×0 NGramMatrix with Int64 elements) with 0 obs
julia> ExtractArray(ExtractString())(JsonGrinder.extractempty) |> nobs
0
julia> ExtractDict(Dict(:a => ExtractScalar(),
:b => ExtractString(),
:c => ExtractCategorical(["A","B"]),
:d => ExtractArray(ExtractString())))(JsonGrinder.extractempty)
ProductNode with 0 obs
├── a: ArrayNode(1×0 Array with Float32 elements) with 0 obs
├── b: ArrayNode(2053×0 NGramMatrix with Int64 elements) with 0 obs
⋮
└── d: BagNode with 0 obs
└── ArrayNode(2053×0 NGramMatrix with Int64 elements) with 0 obs
julia> ExtractDict(Dict(:a => ExtractScalar(),
:b => ExtractString(),
:c => ExtractCategorical(["A","B"]),
:d => ExtractArray(ExtractString())))(JsonGrinder.extractempty) |> nobs
0
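The zero-observation requirement can be pictured with plain Julia dispatch on a singleton type. Everything in the sketch below is hypothetical (EmptyMarker, emptymarker, toy_scalar, toy_nobs) and merely imitates the contract described in this section; it does not use JsonGrinder or StatsBase.

```julia
# Hypothetical singleton playing the role of the "extract nothing" signal.
struct EmptyMarker end
const emptymarker = EmptyMarker()

# A toy leaf: one row, one column per observation.
toy_scalar(x::Number) = reshape(Float32[x], 1, 1)
toy_scalar(::EmptyMarker) = zeros(Float32, 1, 0)   # same row count, zero observations

# Observations are stored as columns in this sketch.
toy_nobs(m::AbstractMatrix) = size(m, 2)

empty_sample = toy_scalar(emptymarker)
```

The empty result keeps the element type and the number of rows of a regular result, so it can be concatenated with non-empty samples without changing their type.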