API Documentation
Base.delete!
— Method
Deletes the field at the specified path from the schema sch. For instance, delete!(schema, ".field.subfield.[]", "x") deletes the field x from schema at schema.childs[:field].childs[:subfield].items.childs.
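The path can be read as a walk through the schema tree. The following is a minimal sketch on toy types, assuming that each dot-separated name selects a child of a dictionary node and that [] steps into the items of an array node. ToyDict, ToyArray, and toy_delete! are hypothetical names used only for illustration; they are not part of JsonGrinder.

```julia
# Toy stand-ins for schema nodes (hypothetical, not JsonGrinder types).
struct ToyDict
    childs::Dict{Symbol,Any}
end

struct ToyArray
    items
end

# Walk the path, then delete `field` from the dictionary node reached at the end.
function toy_delete!(schema, path, field)
    node = schema
    for part in split(path, "."; keepempty = false)
        # "[]" steps into array items, any other part selects a named child
        node = part == "[]" ? node.items : node.childs[Symbol(part)]
    end
    delete!(node.childs, Symbol(field))
    return schema
end

inner = ToyDict(Dict{Symbol,Any}(:x => 1, :y => 2))
subfield = ToyArray(inner)
field = ToyDict(Dict{Symbol,Any}(:subfield => subfield))
toy_schema = ToyDict(Dict{Symbol,Any}(:field => field))

toy_delete!(toy_schema, ".field.subfield.[]", "x")
```

After the call, the innermost dictionary node no longer holds x, while y is left untouched.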
Base.merge
— Method
Dispatch of Base.merge on JsonGrinder.JSONEntry structures. It merges multiple schemas into a single one.
merge(es::Entry...)
merge(es::DictEntry...)
merge(es::ArrayEntry...)
merge(es::MultiEntry...)
merge(es::JsonGrinder.JSONEntry...)
It can be used to distribute the calculation of a schema across multiple workers and then merge their partial results into a bigger one.
Example
To calculate a schema from, e.g., an array of JSONs in a distributed manner, given an array jsons, we can use
using ThreadsX
ThreadsX.mapreduce(schema, merge, Iterators.partition(jsons, length(jsons) ÷ Threads.nthreads()))
or
using ThreadTools
merge(tmap(schema, Threads.nthreads(), Iterators.partition(jsons, length(jsons) ÷ Threads.nthreads()))...)
or, if you would like to split the work into multiple jobs and have them processed by multiple threads, it can look like
using ThreadTools
merge(tmap(schema, Threads.nthreads(), Iterators.partition(jsons, 1_000))...)
where we split the array into smaller arrays of size 1k and let all available threads create partial schemas.
If your data is too large to fit into RAM, this approach also works well with filenames and similar ways of processing large data.
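The partition-then-merge idea can be tried out without any extra packages. The sketch below uses only Base Julia and a toy stand-in for a schema (a histogram of observed keys); toy_schema and toy_merge are hypothetical helpers, not JsonGrinder functions. It only illustrates why merging partial results gives the same answer as a single pass.

```julia
# Toy "schema": counts how many documents contain each key.
function toy_schema(batch)
    counts = Dict{String,Int}()
    for doc in batch, key in keys(doc)
        counts[key] = get(counts, key, 0) + 1
    end
    return counts
end

# Merging two partial results adds up the counts key by key.
toy_merge(a, b) = mergewith(+, a, b)

docs = [Dict("a" => 1), Dict("a" => 2, "b" => 3), Dict("b" => 4), Dict("c" => 5)]

# One task per chunk; each task builds a partial result independently.
tasks = [Threads.@spawn toy_schema(chunk) for chunk in Iterators.partition(docs, 2)]
partial_results = fetch.(tasks)

# Combine the partial results into one.
merged = reduce(toy_merge, partial_results)
```

Because the toy merge is associative, the chunk size affects only how the work is split, not the merged result.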
JsonGrinder.default_scalar_extractor
— Method
Default scalar extractor. It contains reasonable defaults, but sometimes it does not fulfill all needs. We advise using it as a base that you modify for your needs, or as a default that you append to your other rules.
JsonGrinder.extract_missing_bag
— Method
Returns a missing bag of 1 observation.
JsonGrinder.extractscalar
— Method
extractscalar(Type{String}, n = 3, b = 256, m = 2053)
Represents strings as n-grams with n (the degree of the n-gram), b (the base of the string), and m (the modulo applied to the index of the token to reduce dimension).
extractscalar(Type{Number}, m = 0, s = 1)
Extracts a number, subtracting m and multiplying by s.
Example
julia> JsonGrinder.extractscalar(String, 3, 256, 2053)("5")
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
"5"
julia> JsonGrinder.extractscalar(Int32, 3, 256)("5")
1×1 Mill.ArrayNode{Array{Int32,2},Nothing}:
512
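To give an intuition for the roles of n, b, and m in the string case, here is a simplified sketch of hashed n-gram counting in plain Julia. It assumes that each n-gram of code units is read as a number in base b and reduced modulo m. Mill's NGramMatrix may differ in details such as string padding, so ngram_counts is a hypothetical helper and not the library's implementation.

```julia
# Simplified hashed n-gram counts (assumed scheme, no padding of the string).
function ngram_counts(s::AbstractString; n = 3, b = 256, m = 2053)
    units = codeunits(s)
    counts = zeros(Int, m)
    for start in 1:(length(units) - n + 1)
        idx = 0
        for u in units[start:start + n - 1]
            # interpret the n-gram as a base-b number, kept modulo m
            idx = (idx * b + u) % m
        end
        counts[idx + 1] += 1   # shift to 1-based indexing
    end
    return counts
end

v = ngram_counts("hello")
```

The result always has m entries regardless of the string length, which is what makes the representation fixed-size.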
JsonGrinder.generate_html
— Method
generate_html(sch::DictEntry; max_vals=100, max_len=1_000)
generate_html(file_name, sch::DictEntry; max_vals=100, max_len=1_000)
Exports the schema to HTML, including CSS styles and JS that allows expanding and hiding sub-parts of the schema, countmaps, and lengthmaps.
Arguments
max_vals: controls the maximum number of exported values in a countmap
max_len: controls the maximum number of exported lengths of arrays
file_name: the name of the file to save the HTML with the schema to
Return
If a file name is provided, nothing is returned. Otherwise, the generated HTML+CSS+JS is returned as a String.
Example
You can either open the html file in any browser, or open it directly using ElectronDisplay
using ElectronDisplay
using ElectronDisplay: newdisplay
generated_html = generate_html(sch, max_vals = 100)
display(newdisplay(), MIME{Symbol("text/html")}(), generated_html)
JsonGrinder.make_empty_bag
— Method
Returns an empty bag of 0 observations.
JsonGrinder.newentry
— Method
newentry(v)
Creates a new entry describing JSON according to the type of v.
JsonGrinder.prune_json
— Method
prune_json(json, schema)
Removes keys from json which are not part of the schema.
Example
julia> using JSON
julia> j1 = JSON.parse("{\"a\": 4, \"b\": {\"a\":1, \"b\": 1}}");
julia> j2 = JSON.parse("{\"a\": 4, \"b\": {\"a\":1}}");
julia> sch = JsonGrinder.schema([j1,j2])
[Dict] (updated = 2)
├── a: [Scalar - Int64], 1 unique values, updated = 2
└── b: [Dict] (updated = 2)
├── a: [Scalar - Int64], 1 unique values, updated = 2
└── b: [Scalar - Int64], 1 unique values, updated = 1
julia> j3 = Dict("a" => 4, "b" => Dict("a"=>1), "c" => 1, "d" => 2)
Dict{String,Any} with 4 entries:
"c" => 1
"b" => Dict("a"=>1)
"a" => 4
"d" => 2
julia> JsonGrinder.prune_json(j3, sch)
Dict{Any,Any} with 2 entries:
"b" => Dict{Any,Any}("a"=>1)
"a" => 4
So JsonGrinder.prune_json removes the keys c and d.
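The pruning itself is a simple recursion. The sketch below reproduces the result above in plain Julia, using nested Dicts as a hypothetical stand-in for the schema (allowed keys map to nothing for leaves or to another Dict for nested objects). toy_prune is an illustrative helper, not the JsonGrinder implementation.

```julia
# Keep only keys present in `allowed`, descending into nested dictionaries.
function toy_prune(json::AbstractDict, allowed::AbstractDict)
    out = Dict{String,Any}()
    for (key, value) in json
        haskey(allowed, key) || continue   # drop keys unknown to the schema
        out[key] = toy_prune(value, allowed[key])
    end
    return out
end

# Leaves (numbers, strings, ...) are returned unchanged.
toy_prune(json, allowed) = json

allowed = Dict("a" => nothing, "b" => Dict("a" => nothing, "b" => nothing))
sample = Dict("a" => 4, "b" => Dict("a" => 1), "c" => 1, "d" => 2)

pruned = toy_prune(sample, allowed)
```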
JsonGrinder.schema
— Method
schema(samples::AbstractArray{<:Dict})
schema(samples::AbstractArray{<:AbstractString})
schema(samples::AbstractArray, map_fun::Function)
schema(map_fun::Function, samples::AbstractArray)
Creates a schema from an array of parsed or unparsed JSONs.
JsonGrinder.suggestextractor
— Function
suggestextractor(e::DictEntry, settings = NamedTuple())
Creates a converter of JSON to a tree structure of DataNode.
e: the top level of the JSON hierarchy, typically returned by invoking schema
settings: can be any container supporting the get function
settings.mincountkey: the minimum number of repetitions of a key for it to be included in the extractor (if missing, it is equal to zero)
settings.key_as_field: if the number of keys exceeds this value, it is assumed that the keys contain a value, which means that they will be treated as strings
settings.scalar_extractors: rules for determining which extractor to use for leaves. The default value is the return value of default_scalar_extractor(). It is an array of pairs where the first element is a predicate and, if it matches, the second element, a function which maps the schema to a specific extractor, is called.
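The predicate and builder pairs can be pictured with a small plain-Julia sketch. It assumes that rules are tried in order and that the first matching predicate wins, which is not stated explicitly above. The toy entries are plain vectors of observed values and the returned symbols are placeholders for real extractors; choose_extractor is a hypothetical helper.

```julia
# Each rule is a pair `predicate => builder`.
rules = [
    (observed -> all(v -> v isa Number, observed)) => (observed -> :numeric_extractor),
    (observed -> true) => (observed -> :string_extractor),   # catch-all default
]

# Assumption: rules are tried in order and the first match is used.
function choose_extractor(rules, observed)
    for (predicate, build) in rules
        predicate(observed) && return build(observed)
    end
    return nothing
end

numeric_choice = choose_extractor(rules, [1, 2.5, 3])
string_choice = choose_extractor(rules, ["tcp", "udp"])
```

Under this reading, custom rules placed before the defaults take precedence, which fits the advice to append the defaults to your own rules.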
JsonGrinder.update!
— Method
function update!(a::Entry, v)
Updates the entry when seeing value v.
JsonGrinder.updatemaxkeys!
— Method
updatemaxkeys!(n::Int)
Limits the maximum number of keys in statistics of leaves in JSON. The default value is 10_000.
JsonGrinder.updatemaxlen!
— Method
updatemaxlen!(n::Int)
Limits the maximum length of string values in statistics of nodes in JSON. The default value is 10_000. Longer strings will be trimmed, and their length and hash will be appended to retain uniqueness. This is because some strings are very long and can make the schema an order of magnitude larger than needed.
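A minimal sketch of the trimming idea in plain Julia follows. The exact format JsonGrinder uses for the appended length and hash is not given here, so the underscore-separated suffix is an assumption, and toy_shorten is a hypothetical helper.

```julia
# Trim strings longer than `max_len` and append their original length and hash,
# so that two long strings sharing a prefix remain distinguishable.
function toy_shorten(s::AbstractString, max_len::Int)
    length(s) <= max_len && return s
    return string(first(s, max_len), "_", length(s), "_", hash(s))
end

short = toy_shorten("abc", 10)
long_a = toy_shorten("abcdef", 3)
long_b = toy_shorten("abcxyz", 3)
```

Both long strings are cut to the same prefix, yet the appended hash keeps them apart in the statistics.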
JsonGrinder.ArrayEntry
— Type
mutable struct ArrayEntry <: JSONEntry
    items
    l::Dict{Int,Int}
    updated::Int
end
Keeps statistics about an array entry in JSON.
items: of type Entry or nothing; keeps statistics about the elements of the array
l: keeps a histogram of message lengths
updated: counts how many times the struct was updated
JsonGrinder.AuxiliaryExtractor
— Type
struct AuxiliaryExtractor <: AbstractExtractor
    extractor::AbstractExtractor
    extract_fun::Function
end
Universal extractor for applying any function, which lets you embed any transformation into the AbstractExtractor machinery. Useful, e.g., for extractors accompanying trained models, where you need to apply yet another transformation.
julia> e1 = ExtractDict(Dict(:a=>ExtractString(), :b=>ExtractString()));
julia> e2 = AuxiliaryExtractor(e1, (e,x)->e[:a](x["a"]))
Auxiliary extractor with
└── Dict
├── a: String
└── b: String
julia> e2(Dict("a"=>"Hello", "b"=>"World"))
ArrayNode{NGramMatrix{String,Array{String,1},Int64},Nothing}:
"Hello"
JsonGrinder.Entry
— Type
mutable struct Entry <: JSONEntry
    counts::Dict{Any,Int}
    updated::Int
end
Keeps statistics about scalar values of a single key and also about items inside a key.
counts: counts how many times a given value appeared (at most max_keys values are held)
updated: counts how many times the entry was updated
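The bookkeeping described above can be sketched with a toy type in plain Julia. It assumes that once max_keys distinct values are stored, new values are no longer recorded while the update counter still increases. ToyEntry and toy_update! are hypothetical names, not the JsonGrinder implementation.

```julia
# Toy counterpart of a scalar entry: value counts plus an update counter.
mutable struct ToyEntry
    counts::Dict{Any,Int}
    updated::Int
end

ToyEntry() = ToyEntry(Dict{Any,Int}(), 0)

# Assumption: unseen values are ignored once `max_keys` distinct values are held.
function toy_update!(e::ToyEntry, v; max_keys = 10_000)
    if haskey(e.counts, v) || length(e.counts) < max_keys
        e.counts[v] = get(e.counts, v, 0) + 1
    end
    e.updated += 1
    return e
end

entry = ToyEntry()
for v in ["a", "b", "a", "c"]
    toy_update!(entry, v; max_keys = 2)
end
```

With max_keys = 2, the value "c" is not stored, but all four observations are counted in updated.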
JsonGrinder.ExtractArray
— Type
struct ExtractArray{T}
    item::T
end
Converts an array of values to a Mill.BagNode with items converted by item. The entire array is assumed to be a single bag.
Examples
julia> ec = ExtractArray(ExtractCategorical(2:4));
julia> ec([2,3,1,4]).data
4×4 Mill.ArrayNode{Mill.MaybeHotMatrix{Int64,Int64,Bool},Nothing}:
1 0 0 0
0 1 0 0
0 0 0 1
0 0 1 0
julia> es = ExtractArray(ExtractScalar());
julia> es([2,3,4])
BagNode with 1 obs
└── ArrayNode(1×3 Array with Float32 elements) with 3 obs
julia> es([2,3,4]).data
1×3 Mill.ArrayNode{Array{Float32,2},Nothing}:
2.0 3.0 4.0
JsonGrinder.ExtractCategorical
— Type
ExtractCategorical(s::Entry)
ExtractCategorical(s::UnitRange)
ExtractCategorical(s::Vector)
Converts a single item to a one-hot encoded vector. Converts an array of items into a matrix of one-hot encoded columns. An extra element is always allocated for an unknown value. If passed missing, returns a column of missing values.
Examples
julia> e = ExtractCategorical(2:4);
julia> e([2,3,1,4]).data
4×4 Mill.MaybeHotMatrix{Int64,Int64,Bool}:
1 0 0 0
0 1 0 0
0 0 0 1
0 0 1 0
julia> e([1,missing,5]).data
4×3 Mill.MaybeHotMatrix{Union{Missing, Int64},Int64,Union{Missing, Bool}}:
false missing false
false missing false
false missing false
true missing true
julia> e(4).data
4×1 Mill.MaybeHotMatrix{Int64,Int64,Bool}:
0
0
1
0
julia> e(missing).data
4×1 Mill.MaybeHotMatrix{Missing,Int64,Missing}:
missing
missing
missing
missing
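The encoding with the extra slot for unknown values can be mimicked in a few lines of plain Julia. This is only a sketch of the behaviour seen in the examples above; onehot_with_unknown is a hypothetical helper and does not produce a Mill.MaybeHotMatrix.

```julia
# One-hot column with one extra, last position reserved for unknown values.
function onehot_with_unknown(value, vocabulary)
    n = length(vocabulary) + 1
    ismissing(value) && return fill(missing, n)
    idx = findfirst(isequal(value), vocabulary)
    column = falses(n)
    column[idx === nothing ? n : idx] = true   # unknown values hit the last slot
    return column
end

vocabulary = collect(2:4)
known = onehot_with_unknown(4, vocabulary)
unknown = onehot_with_unknown(1, vocabulary)
absent = onehot_with_unknown(missing, vocabulary)
```

Here known marks the third position, unknown marks the fourth (extra) position, and absent is a column of missing values, matching the outputs shown above.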
JsonGrinder.ExtractDict
— Type
struct ExtractDict{S} <: AbstractExtractor
    dict::S
end
Extracts all items in dict and returns them as a Mill.ProductNode. If a key is missing in the extracted dict, nothing is passed to the child extractors.
Examples
julia> e = ExtractDict(Dict(:a=>ExtractScalar(Float32, 2, 3), :b=>ExtractCategorical(1:5)))
Dict
├── a: Float32
└── b: Categorical d = 6
julia> res1 = e(Dict("a"=>1, "b"=>1))
ProductNode with 1 obs
├── a: ArrayNode(1×1 Array with Float32 elements) with 1 obs
└── b: ArrayNode(6×1 MaybeHotMatrix with Bool elements) with 1 obs
julia> res1[:a].data
1×1 Array{Float32,2}:
-3.0
julia> res1[:b].data
6×1 Mill.MaybeHotMatrix{Int64,Int64,Bool}:
1
0
0
0
0
0
julia> res2 = e(Dict("a"=>0))
ProductNode with 1 obs
├── a: ArrayNode(1×1 Array with Float32 elements) with 1 obs
└── b: ArrayNode(6×1 MaybeHotMatrix with Missing elements) with 1 obs
julia> res2[:a].data
1×1 Array{Float32,2}:
-6.0
julia> res2[:b].data
6×1 Mill.MaybeHotMatrix{Missing,Int64,Missing}:
missing
missing
missing
missing
missing
missing
JsonGrinder.ExtractEmpty
— Type
struct ExtractEmpty end
Concrete type to dispatch on for extraction of empty samples.
JsonGrinder.ExtractKeyAsField
— Type
struct ExtractKeyAsField{S,V} <: AbstractExtractor
    key::S
    item::V
end
Extracts all items in vec and in other and returns them as a ProductNode.
JsonGrinder.ExtractScalar
— Type
struct ExtractScalar{T} <: AbstractExtractor
    c::T
    s::T
end
Extracts a numerical value, centred by subtracting c and scaled by multiplying by s. Strings are converted to numbers.
The extractor returns ArrayNode{Matrix{Union{Missing, Int64}},Nothing} or its subtypes. If passed missing, it extracts missing values, which Mill understands and can work with.
It can also be created using extractscalar(Float32, 5, 2).
Example
julia> ExtractScalar(Float32, 2, 3)(1)
1×1 Mill.ArrayNode{Array{Float32,2},Nothing}:
-3.0
julia> ExtractScalar(Float32, 2, 3)(missing)
1×1 Mill.ArrayNode{Array{Missing,2},Nothing}:
missing
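The arithmetic behind these results is a subtraction followed by a multiplication. The plain-Julia sketch below mirrors it without the ArrayNode wrapper. center_scale is a hypothetical helper, and parsing strings with parse(Float64, ...) is an assumption about how strings are converted to numbers.

```julia
# Centre by subtracting `c`, then scale by multiplying by `s`.
center_scale(x::Number, c, s) = (x - c) * s

# Assumption: strings are parsed to a number first.
center_scale(x::AbstractString, c, s) = center_scale(parse(Float64, x), c, s)

# Missing values stay missing.
center_scale(::Missing, c, s) = missing

from_number = center_scale(1, 2, 3)        # (1 - 2) * 3
from_string = center_scale("5", 3, 256)    # (5 - 3) * 256
from_missing = center_scale(missing, 2, 3)
```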
JsonGrinder.ExtractString
— Type
struct ExtractString{T} <: AbstractExtractor
    n::Int
    b::Int
    m::Int
end
Represents String as n-grams (NGramMatrix from Mill.jl) with base b and modulo m.
Example
julia> ExtractString()("hello")
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
"hello"
julia> ExtractString()(missing)
2053×1 Mill.ArrayNode{Mill.NGramMatrix{Missing,Missing},Nothing}:
missing
julia> ExtractString()(["hello", "world"])
2053×2 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
"hello"
"world"
JsonGrinder.ExtractVector
— Type
struct ExtractVector{T}
    item::Int
end
Represents an array of a fixed length, typically a feature vector of numbers of type T.
julia> sc = ExtractVector(4);
julia> sc([2,3,1,4]).data
4×1 Array{Float32,2}:
2.0
3.0
1.0
4.0
JsonGrinder.MultiEntry
— Type
mutable struct MultiEntry <: JSONEntry
    childs::Vector{Any}
end
Support for JSON which does not adhere to a fixed type. A container for multiple types of entry which are observed at the same place in JSON.
JsonGrinder.MultipleRepresentation
— Type
MultipleRepresentation(extractors::Tuple)
Extracts an item to a ProductNode in which each child is produced by a different extractor; the item is extracted by all extractors in the multirepresentation.
Examples
Example of both categorical and string representation
One of the use cases is to use a string representation for strings and a categorical representation for the most frequent values. This allows the model to more easily learn frequent or otherwise significant values, while still creating a meaningful representation for previously unseen inputs.
julia> e = MultipleRepresentation((ExtractString(), ExtractCategorical(["tcp", "udp", "dhcp"])));
julia> s1 = e("tcp")
ProductNode with 1 obs
├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
└── e2: ArrayNode(4×1 MaybeHotMatrix with Bool elements) with 1 obs
julia> s1[:e1]
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
"tcp"
julia> s1[:e2]
4×1 Mill.ArrayNode{Mill.MaybeHotMatrix{Int64,Int64,Bool},Nothing}:
0
1
0
0
julia> s2 = e("http")
ProductNode with 1 obs
├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
└── e2: ArrayNode(4×1 MaybeHotMatrix with Bool elements) with 1 obs
julia> s2[:e1]
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
"http"
julia> s2[:e2]
4×1 Mill.ArrayNode{Mill.MaybeHotMatrix{Int64,Int64,Bool},Nothing}:
0
0
0
1
Example of irregular schema representation
The other use case is handling an irregular schema, where an extractor returns a missing representation if it is unable to extract the value properly. Of course, not all extractors have to be leaf value extractors; some may be ExtractDict while others extract leaves, etc.
julia> e = MultipleRepresentation((ExtractString(), ExtractScalar(Float32, 2, 3)));
julia> s1 = e(5)
ProductNode with 1 obs
├── e1: ArrayNode(2053×1 NGramMatrix with Missing elements) with 1 obs
└── e2: ArrayNode(1×1 Array with Float32 elements) with 1 obs
julia> s1[:e1]
2053×1 Mill.ArrayNode{Mill.NGramMatrix{Missing,Missing},Nothing}:
missing
julia> s1[:e2]
1×1 Mill.ArrayNode{Array{Float32,2},Nothing}:
9.0
julia> s2 = e("hi")
ProductNode with 1 obs
├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
└── e2: ArrayNode(1×1 Array with Missing elements) with 1 obs
julia> s2[:e1]
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
"hi"
julia> s2[:e2]
1×1 Mill.ArrayNode{Array{Missing,2},Nothing}:
missing
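The pattern of running every extractor on the same input and naming the results e1, e2, and so on can be sketched in plain Julia. The two toy extractors below return missing for inputs they cannot handle, mirroring the irregular schema example; toy_multi, as_string, and as_scaled are hypothetical helpers, and a NamedTuple stands in for the ProductNode.

```julia
# Apply all extractors to the same value and collect the results under e1, e2, ...
function toy_multi(extractors::Tuple, x)
    names = ntuple(i -> Symbol("e", i), length(extractors))
    return NamedTuple{names}(map(f -> f(x), extractors))
end

# Toy extractors: each handles one input type and yields `missing` otherwise.
as_string(x) = x isa AbstractString ? x : missing
as_scaled(x) = x isa Number ? Float32((x - 2) * 3) : missing

r1 = toy_multi((as_string, as_scaled), 5)
r2 = toy_multi((as_string, as_scaled), "hi")
```

For the input 5 only e2 carries a value, and for "hi" only e1 does, so a downstream model always receives the same structure.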
JsonGrinder.extractempty
— Constant
extractempty
A singleton of type ExtractEmpty, used to signal to downstream extractors that they should extract an empty sample.