API Documentation

Base.delete! - Method

Deletes the field at the specified path from the schema sch. For instance, delete!(schema, ".field.subfield.[]", "x") deletes the field x from the schema at schema.childs[:field].childs[:subfield].items.childs

Base.merge - Method

Dispatch of Base.merge on JsonGrinder.JSONEntry structures. Allows merging multiple schemas into a single one.

merge(es::Entry...)
merge(es::DictEntry...)
merge(es::ArrayEntry...)
merge(es::MultiEntry...)
merge(es::JsonGrinder.JSONEntry...)

It can be used to distribute the calculation of a schema across multiple workers and then merge their partial results into a larger one.

Example

If we want to calculate a schema from, e.g., an array of JSONs in a distributed manner, and we have the array jsons, we can do it using

using ThreadsX
ThreadsX.mapreduce(schema, merge, Iterators.partition(jsons, length(jsons) ÷ Threads.nthreads()))

or

using ThreadTools
merge(tmap(schema, Threads.nthreads(), Iterators.partition(jsons, length(jsons) ÷ Threads.nthreads()))...)

or, if you prefer to split the work into multiple jobs and have them processed by multiple threads, it can look like

using ThreadTools
merge(tmap(schema, Threads.nthreads(), Iterators.partition(jsons, 1_000))...)

where we split the array into smaller arrays of 1,000 elements each and let all available threads create partial schemas.

If your data is too large to fit into RAM, the same approach also works with filenames and similar ways of processing large data.
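
The partition-then-merge pattern above can be tried without JsonGrinder. The following is a minimal sketch using only Base (Julia 1.5 or later for mergewith); count_keys is a hypothetical stand-in for schema, and mergewith(+) plays the role of merge. It also guards the chunk size, since length(jsons) ÷ Threads.nthreads() becomes zero when there are fewer items than threads.

```julia
# Hypothetical stand-in for `schema`: counts in how many documents of a batch each key occurs.
count_keys(batch) = mapreduce(d -> Dict(k => 1 for k in keys(d)), mergewith(+), batch)

docs = [Dict("a" => 1), Dict("a" => 2, "b" => 3), Dict("b" => 4), Dict("a" => 5)]

# Keep the chunk size at least 1, even if there are more threads than documents.
chunk = max(1, cld(length(docs), Threads.nthreads()))

# Compute one partial result per chunk, then merge the partial results.
partials = map(count_keys, Iterators.partition(docs, chunk))
total = reduce(mergewith(+), partials)
```

Because the merge step is associative, the chunks can be processed in any order or on any worker before the final reduction.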

JsonGrinder.default_scalar_extractor - Method

Default scalar extractor. It contains reasonable defaults, but sometimes it does not fulfill all needs. We advise using it as a base that you modify for your needs, or as a default that you append to your other rules.

JsonGrinder.extractscalar - Method
extractscalar(Type{String}, n = 3, b = 256, m = 2053)

Represents strings as n-grams with

  • n the degree of the n-gram,
  • b the base of the string,
  • m the modulo applied to the index of the token to reduce dimension.

extractscalar(Type{Number}, m = 0, s = 1)

Extracts a number, subtracting m and multiplying by s.

Example

julia> JsonGrinder.extractscalar(String, 3, 256, 2053)("5")
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
 "5"

julia> JsonGrinder.extractscalar(Int32, 3, 256)("5")
1×1 Mill.ArrayNode{Array{Int32,2},Nothing}:
 512
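
The arithmetic behind the numeric variant can be checked by hand. This is a minimal sketch in plain Julia, not JsonGrinder's implementation; center_scale is a hypothetical helper, and parsing strings with parse(Float64, x) is an assumption made for illustration.

```julia
# Hypothetical helper: subtract m, then multiply by s.
center_scale(x::Number, m, s) = (x - m) * s

# Strings are first converted to numbers (assumed here to go through Float64).
center_scale(x::AbstractString, m, s) = center_scale(parse(Float64, x), m, s)

center_scale("5", 3, 256)   # (5 - 3) * 256
center_scale(1, 2, 3)       # (1 - 2) * 3
```

The first call reproduces the value 512 shown in the example above.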

JsonGrinder.generate_html - Method
generate_html(sch::DictEntry; max_vals=100, max_len=1_000)
generate_html(file_name, sch::DictEntry; max_vals=100, max_len=1_000)

Exports the schema to HTML, including CSS styles and JS that allow expanding and hiding sub-parts of the schema, countmaps, and lengthmaps.

Arguments

  • max_vals controls the maximum number of exported values in a countmap
  • max_len controls the maximum number of exported array lengths
  • file_name is the name of the file in which to save the HTML with the schema

Return

If file_name is provided, nothing is returned. Otherwise, the generated HTML+CSS+JS is returned as a String.

Example

You can either open the HTML file in any browser, or display the generated HTML directly using ElectronDisplay

using ElectronDisplay
using ElectronDisplay: newdisplay
generated_html = generate_html(sch, max_vals = 100)
display(newdisplay(), MIME{Symbol("text/html")}(), generated_html)

JsonGrinder.newentry - Method
newentry(v)

Creates a new entry describing JSON according to the type of v.

JsonGrinder.prune_json - Method
prune_json(json, schema)

Removes keys from json which are not part of the schema.

Example

julia> using JSON

julia> j1 = JSON.parse("{\"a\": 4, \"b\": {\"a\":1, \"b\": 1}}");

julia> j2 = JSON.parse("{\"a\": 4, \"b\": {\"a\":1}}");

julia> sch = JsonGrinder.schema([j1,j2])
[Dict] (updated = 2)
  ├── a: [Scalar - Int64], 1 unique values, updated = 2
  └── b: [Dict] (updated = 2)
           ├── a: [Scalar - Int64], 1 unique values, updated = 2
           └── b: [Scalar - Int64], 1 unique values, updated = 1

julia> j3 = Dict("a" => 4, "b" => Dict("a"=>1), "c" => 1, "d" => 2)
Dict{String,Any} with 4 entries:
  "c" => 1
  "b" => Dict("a"=>1)
  "a" => 4
  "d" => 2

julia> JsonGrinder.prune_json(j3, sch)
Dict{Any,Any} with 2 entries:
  "b" => Dict{Any,Any}("a"=>1)
  "a" => 4

So JsonGrinder.prune_json removes the keys c and d.
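
The idea of pruning against a schema can be illustrated without JsonGrinder. The sketch below is not the library's implementation: prune_sketch is a hypothetical function, and the schema is modelled, as an assumption, by nested Dicts whose keys are the allowed keys and whose leaves are marked with nothing.

```julia
# Hypothetical schema stand-in: allowed keys, with `nothing` marking a leaf.
allowed = Dict("a" => nothing, "b" => Dict("a" => nothing, "b" => nothing))

# A leaf in the schema: keep the value as it is.
prune_sketch(value, ::Nothing) = value

# A dictionary in the schema: keep only the known keys and recurse into them.
function prune_sketch(json::AbstractDict, allowed::AbstractDict)
    out = Dict{String,Any}()
    for (k, v) in json
        haskey(allowed, k) || continue
        out[k] = prune_sketch(v, allowed[k])
    end
    return out
end

j3 = Dict("a" => 4, "b" => Dict("a" => 1), "c" => 1, "d" => 2)
pruned = prune_sketch(j3, allowed)
```

As in the example above, the keys c and d are dropped because they do not occur in the schema stand-in.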

JsonGrinder.schema - Method
schema(samples::AbstractArray{<:Dict})
schema(samples::AbstractArray{<:AbstractString})
schema(samples::AbstractArray, map_fun::Function)
schema(map_fun::Function, samples::AbstractArray)

Creates a schema from an array of parsed or unparsed JSONs.

JsonGrinder.suggestextractor - Function
suggestextractor(e::DictEntry, settings = NamedTuple())

Creates a converter from JSON to a tree structure of DataNode.

  • e is the top level of the JSON hierarchy, typically returned by invoking schema
  • settings can be any container supporting the get function
  • settings.mincountkey is the minimum number of repetitions a key needs in order to be included in the extractor (if missing, it is equal to zero)
  • settings.key_as_field: if the number of keys exceeds this value, it is assumed that the keys contain values, which means that they will be treated as strings
  • settings.scalar_extractors contains the rules for determining which extractor to use for leaves. The default is the return value of default_scalar_extractor(): an array of pairs in which the first element is a predicate and, if it matches, the second element, a function mapping a schema to a specific extractor, is called.
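
The rule list described for settings.scalar_extractors is a "first matching predicate wins" structure. The following is a minimal sketch of that structure in plain Julia; pick_extractor, the predicates, and the returned symbols are hypothetical placeholders, not JsonGrinder's actual rules, and the rules here inspect a plain value rather than a schema entry.

```julia
# Each rule is a pair: predicate => function that builds the result.
rules = [
    (x -> x isa AbstractString) => (x -> :string_extractor),
    (x -> x isa Number)         => (x -> :scalar_extractor),
]

# Return the result of the first rule whose predicate matches, or `nothing`.
function pick_extractor(rules, x)
    for (pred, make) in rules
        pred(x) && return make(x)
    end
    return nothing
end

pick_extractor(rules, "tcp")
pick_extractor(rules, 1.5)
```

Because the order of the pairs decides which rule is used, custom rules placed before the defaults take precedence in this sketch.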

JsonGrinder.update! - Method
function update!(a::Entry, v)

Updates the entry upon seeing the value v.

JsonGrinder.updatemaxkeys! - Method
updatemaxkeys!(n::Int)

Limits the maximum number of keys in the statistics of leaves in JSON. The default value is 10_000.

JsonGrinder.updatemaxlen! - Method
updatemaxlen!(n::Int)

Limits the maximum length of string values in the statistics of nodes in JSON. The default value is 10_000. Longer strings are trimmed, and their length and hash are appended to retain uniqueness. This is because some strings are very long and can make the schema an order of magnitude larger than needed.
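
The trimming rule can be sketched in a few lines of plain Julia. This is an illustration under assumptions, not the library's code: shorten is a hypothetical helper, and the exact format of the appended length and hash (here separated by underscores) is made up for the example.

```julia
# Hypothetical helper: keep short strings as they are; trim long ones and
# append the original length and a hash so that distinct strings stay distinct.
function shorten(s::AbstractString, max_len::Int)
    length(s) <= max_len && return s
    return string(first(s, max_len), "_", length(s), "_", hash(s))
end

shorten("abc", 5)          # short enough, returned unchanged
shorten("abcdefgh", 3)     # trimmed to "abc" plus length and hash
```

Two long strings sharing the same prefix and length still map to different keys, because the hash is computed from the full original string.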

JsonGrinder.ArrayEntry - Type
mutable struct ArrayEntry <: JSONEntry
	items
	l::Dict{Int,Int}
	updated::Int
end

Keeps statistics about an array entry in JSON.

  • items is of type Entry or nothing and keeps statistics about the elements of the array
  • l keeps a histogram of the array lengths
  • updated counts how many times the struct was updated

JsonGrinder.AuxiliaryExtractor - Type
struct AuxiliaryExtractor <: AbstractExtractor
	extractor::AbstractExtractor
	extract_fun::Function
end

Universal extractor for applying any function, which lets you embed any transformation into the AbstractExtractor machinery. Useful, e.g., for extractors accompanying trained models, where you need to apply yet another transformation.

julia> e1 = ExtractDict(Dict(:a=>ExtractString(), :b=>ExtractString()));

julia> e2 = AuxiliaryExtractor(e1, (e,x)->e[:a](x["a"]))
Auxiliary extractor with
  └── Dict
        ├── a: String
        └── b: String

julia> e2(Dict("a"=>"Hello", "b"=>"World"))
ArrayNode{NGramMatrix{String,Array{String,1},Int64},Nothing}:
 "Hello"

JsonGrinder.Entry - Type
mutable struct Entry <: JSONEntry
	counts::Dict{Any,Int}
	updated::Int
end

Keeps statistics about the scalar values of one key and also about the items inside a key.

  • counts counts how many times given value appeared (at most max_keys is held)
  • updated counts how many times the entry was updated
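
How the two fields evolve can be pictured with a small stand-alone sketch. ToyEntry and toy_update! are hypothetical names, not JsonGrinder's types or functions, and the policy assumed here (new values are ignored once max_keys distinct values are held, while updated is always incremented) is an assumption made for illustration.

```julia
# Hypothetical miniature of an entry: value counts plus an update counter.
mutable struct ToyEntry
    counts::Dict{Any,Int}
    updated::Int
end
ToyEntry() = ToyEntry(Dict{Any,Int}(), 0)

function toy_update!(e::ToyEntry, v; max_keys = 10_000)
    # Count the value if it is already tracked or there is room for a new key.
    if haskey(e.counts, v) || length(e.counts) < max_keys
        e.counts[v] = get(e.counts, v, 0) + 1
    end
    e.updated += 1
    return e
end

e = ToyEntry()
foreach(v -> toy_update!(e, v; max_keys = 2), [1, 1, 2, 3])
```

With max_keys = 2 the value 3 is not recorded in counts, but it still contributes to updated.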

JsonGrinder.ExtractArray - Type
struct ExtractArray{T}
	item::T
end

Converts an array of values to a Mill.BagNode with items converted by item. The entire array is assumed to be a single bag.

Examples

julia> ec = ExtractArray(ExtractCategorical(2:4));

julia> ec([2,3,1,4]).data
4×4 Mill.ArrayNode{Mill.MaybeHotMatrix{Int64,Int64,Bool},Nothing}:
 1  0  0  0
 0  1  0  0
 0  0  0  1
 0  0  1  0

julia> es = ExtractArray(ExtractScalar());

julia> es([2,3,4])
BagNode with 1 obs
  └── ArrayNode(1×3 Array with Float32 elements) with 3 obs

julia> es([2,3,4]).data
1×3 Mill.ArrayNode{Array{Float32,2},Nothing}:
 2.0  3.0  4.0

JsonGrinder.ExtractCategorical - Type
ExtractCategorical(s::Entry)
ExtractCategorical(s::UnitRange)
ExtractCategorical(s::Vector)

Converts a single item to a one-hot encoded vector, and an array of items to a matrix of one-hot encoded columns. An extra element is always allocated for an unknown value. If passed missing, returns a column of missing values.

Examples

julia> e = ExtractCategorical(2:4);

julia> e([2,3,1,4]).data
4×4 Mill.MaybeHotMatrix{Int64,Int64,Bool}:
 1  0  0  0
 0  1  0  0
 0  0  0  1
 0  0  1  0

julia> e([1,missing,5]).data
4×3 Mill.MaybeHotMatrix{Union{Missing, Int64},Int64,Union{Missing, Bool}}:
 false  missing  false
 false  missing  false
 false  missing  false
  true  missing   true

julia> e(4).data
4×1 Mill.MaybeHotMatrix{Int64,Int64,Bool}:
 0
 0
 1
 0

julia> e(missing).data
4×1 Mill.MaybeHotMatrix{Missing,Int64,Missing}:
 missing
 missing
 missing
 missing
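
The encoding rule (one row per known category, one extra row for unknown values, and an all-missing column for missing input) can be sketched in plain Julia. onehot_with_unknown is a hypothetical helper, not part of JsonGrinder or Mill; it keeps the categories in the order given, which is an assumption and may differ from the ordering the library uses internally.

```julia
# Hypothetical helper: one-hot column with a trailing "unknown" slot.
function onehot_with_unknown(categories::AbstractVector, x)
    n = length(categories) + 1            # extra row for unknown values
    ismissing(x) && return fill(missing, n)
    idx = findfirst(isequal(x), categories)
    col = falses(n)
    col[idx === nothing ? n : idx] = true # unknown values go to the last row
    return col
end

onehot_with_unknown(2:4, 4)        # third row set
onehot_with_unknown(2:4, 1)        # unknown, last row set
onehot_with_unknown(2:4, missing)  # column of missing values
```

The three calls mirror e(4), the first column of e([1,missing,5]), and e(missing) from the examples above.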

JsonGrinder.ExtractDict - Type
struct ExtractDict{S} <: AbstractExtractor
	dict::S
end

Extracts all items in dict and returns them as a Mill.ProductNode. If a key is missing in the extracted dict, nothing is passed to the child extractors.

Examples

julia> e = ExtractDict(Dict(:a=>ExtractScalar(Float32, 2, 3), :b=>ExtractCategorical(1:5)))
Dict
  ├── a: Float32
  └── b: Categorical d = 6

julia> res1 = e(Dict("a"=>1, "b"=>1))
ProductNode with 1 obs
  ├── a: ArrayNode(1×1 Array with Float32 elements) with 1 obs
  └── b: ArrayNode(6×1 MaybeHotMatrix with Bool elements) with 1 obs

julia> res1[:a].data
1×1 Array{Float32,2}:
 -3.0

julia> res1[:b].data
6×1 Mill.MaybeHotMatrix{Int64,Int64,Bool}:
 1
 0
 0
 0
 0
 0

julia> res2 = e(Dict("a"=>0))
ProductNode with 1 obs
  ├── a: ArrayNode(1×1 Array with Float32 elements) with 1 obs
  └── b: ArrayNode(6×1 MaybeHotMatrix with Missing elements) with 1 obs

julia> res2[:a].data
1×1 Array{Float32,2}:
 -6.0

julia> res2[:b].data
6×1 Mill.MaybeHotMatrix{Missing,Int64,Missing}:
 missing
 missing
 missing
 missing
 missing
 missing

JsonGrinder.ExtractKeyAsField - Type
struct ExtractKeyAsField{S,V} <: AbstractExtractor
	key::S
	item::V
end

Extracts all items in vec and in other and returns them as a ProductNode.

JsonGrinder.ExtractScalar - Type
struct ExtractScalar{T} <: AbstractExtractor
	c::T
	s::T
end

Extracts a numerical value, centred by subtracting c and scaled by multiplying by s. Strings are converted to numbers.

The extractor returns ArrayNode{Matrix{Union{Missing, Int64}},Nothing} or its subtypes. If passed missing, it extracts missing values, which Mill understands and can work with.

It can also be created using extractscalar(Float32, 5, 2)

Example

julia> ExtractScalar(Float32, 2, 3)(1)
1×1 Mill.ArrayNode{Array{Float32,2},Nothing}:
 -3.0

julia> ExtractScalar(Float32, 2, 3)(missing)
1×1 Mill.ArrayNode{Array{Missing,2},Nothing}:
 missing

JsonGrinder.ExtractString - Type
struct ExtractString{T} <: AbstractExtractor
	n::Int
	b::Int
	m::Int
end

Represents String as n-grams (NGramMatrix from Mill.jl) with base b and modulo m.

Example

julia> ExtractString()("hello")
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
 "hello"

julia> ExtractString()(missing)
2053×1 Mill.ArrayNode{Mill.NGramMatrix{Missing,Missing},Nothing}:
 missing

julia> ExtractString()(["hello", "world"])
2053×2 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
 "hello"
 "world"
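
To give a feeling for what the parameters n, b, and m mean, the sketch below hashes byte n-grams into row indices. It is a simplified illustration, not Mill's NGramMatrix algorithm: ngram_indices is a hypothetical function, and it assumes no padding at the string boundaries and a plain polynomial code in base b, both of which may differ from the library.

```julia
# Hypothetical helper: map every n-gram of bytes to a row index in 1:m.
function ngram_indices(s::AbstractString; n = 3, b = 256, m = 2053)
    bytes = codeunits(s)
    idxs = Int[]
    for i in 1:(length(bytes) - n + 1)
        h = 0
        for j in i:(i + n - 1)
            h = h * b + Int(bytes[j])   # read the n-gram as a number in base b
        end
        push!(idxs, mod(h, m) + 1)      # fold into m rows (1-based)
    end
    return idxs
end

ngram_indices("hello")   # one index per trigram: "hel", "ell", "llo"
```

The modulo m is what keeps the representation at a fixed number of rows (2053 in the examples above), at the cost of occasional collisions between different n-grams.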

JsonGrinder.ExtractVector - Type
struct ExtractVector{T}
	item::Int
end

represents an array of a fixed length, typically a feature vector of numbers of type T

julia> sc = ExtractVector(4);

julia> sc([2,3,1,4]).data
4×1 Array{Float32,2}:
 2.0
 3.0
 1.0
 4.0

JsonGrinder.MultiEntry - Type
mutable struct MultiEntry <: JSONEntry
	childs::Vector{Any}
end

Support for JSON that does not adhere to a fixed type. A container for the multiple types of entries observed at the same place in JSON.

JsonGrinder.MultipleRepresentation - Type
MultipleRepresentation(extractors::Tuple)

Extracts an item into a ProductNode in which each child is produced by a different extractor; the item is extracted by all extractors in the multi-representation.

Examples

Example of both categorical and string representation

One use case is to combine a string representation for strings with a categorical representation for the most frequent values. This allows the model to learn frequent or otherwise significant values more easily, while still creating a meaningful representation for previously unseen inputs.

julia> e = MultipleRepresentation((ExtractString(), ExtractCategorical(["tcp", "udp", "dhcp"])));

julia> s1 = e("tcp")
ProductNode with 1 obs
  ├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
  └── e2: ArrayNode(4×1 MaybeHotMatrix with Bool elements) with 1 obs

julia> s1[:e1]
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
 "tcp"

julia> s1[:e2]
4×1 Mill.ArrayNode{Mill.MaybeHotMatrix{Int64,Int64,Bool},Nothing}:
 0
 1
 0
 0

julia> s2 = e("http")
ProductNode with 1 obs
  ├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
  └── e2: ArrayNode(4×1 MaybeHotMatrix with Bool elements) with 1 obs

julia> s2[:e1]
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
 "http"

julia> s2[:e2]
4×1 Mill.ArrayNode{Mill.MaybeHotMatrix{Int64,Int64,Bool},Nothing}:
 0
 0
 0
 1

Example of irregular schema representation

The other use case is to handle an irregular schema, where an extractor returns a missing representation if it is unable to extract the value properly. The extractors do not all have to be leaf value extractors: some may be ExtractDict, while others extract leaves, etc.

julia> e = MultipleRepresentation((ExtractString(), ExtractScalar(Float32, 2, 3)));

julia> s1 = e(5)
ProductNode with 1 obs
  ├── e1: ArrayNode(2053×1 NGramMatrix with Missing elements) with 1 obs
  └── e2: ArrayNode(1×1 Array with Float32 elements) with 1 obs

julia> s1[:e1]
2053×1 Mill.ArrayNode{Mill.NGramMatrix{Missing,Missing},Nothing}:
 missing

julia> s1[:e2]
1×1 Mill.ArrayNode{Array{Float32,2},Nothing}:
 9.0

julia> s2 = e("hi")
ProductNode with 1 obs
  ├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
  └── e2: ArrayNode(1×1 Array with Missing elements) with 1 obs

julia> s2[:e1]
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
 "hi"

julia> s2[:e2]
1×1 Mill.ArrayNode{Array{Missing,2},Nothing}:
 missing
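
The underlying pattern, applying several extractors to the same input and collecting the results under the names e1, e2, and so on, can be sketched in plain Julia. multi and the two toy extractors are hypothetical; a NamedTuple is used here as a stand-in for the ProductNode, which is an assumption made only for illustration.

```julia
# Hypothetical combinator: apply every function to x and name the results e1, e2, ...
multi(fs...) = x -> NamedTuple{ntuple(i -> Symbol("e", i), length(fs))}(map(f -> f(x), fs))

# Toy extractors that return `missing` when the input has the wrong type.
as_string(x) = x isa AbstractString ? x : missing
as_number(x) = x isa Number ? Float32(x) : missing

both = multi(as_string, as_number)
both(5)      # e1 is missing, e2 holds the number
both("hi")   # e1 holds the string, e2 is missing
```

As in the irregular-schema example above, each input yields one populated representation and one missing representation.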