API Documentation

Base.delete! — Method

Deletes the field at the specified path from the schema sch. For instance, delete!(schema, ".field.subfield.[]", "x") deletes the field x from schema at schema.childs[:field].childs[:subfield].items.childs

Base.merge — Method

Dispatch of Base.merge on JsonGrinder.JSONEntry structures. Allows merging multiple schemas into a single one.

merge(es::Entry...)
merge(es::DictEntry...)
merge(es::ArrayEntry...)
merge(es::MultiEntry...)
merge(es::JsonGrinder.JSONEntry...)

It can be used to distribute the calculation of a schema across multiple workers and then merge their partial results into a larger one.

Example

If we want to calculate a schema from, e.g., an array of JSONs in a distributed manner, and we have the array jsons, we can do it using

using ThreadsX
ThreadsX.mapreduce(schema, merge, Iterators.partition(jsons, length(jsons) ÷ Threads.nthreads()))

using ThreadTools
merge(tmap(schema, Threads.nthreads(), Iterators.partition(jsons, length(jsons) ÷ Threads.nthreads()))...)

or, if you would like to split the work into multiple jobs and have them processed by multiple threads, it can look like

using ThreadTools
merge(tmap(schema, Threads.nthreads(), Iterators.partition(jsons, 1_000))...)

where we split the array into smaller arrays of size 1,000 and let all available threads create partial schemas.

If your data is too large to fit into RAM, the same approach also works with filenames and similar ways of processing large data.
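The partition-then-merge pattern described above can be illustrated without any package. The following is a minimal sketch using only Base: count_keys is a hypothetical stand-in for schema (it only counts how often each key occurs), and mergewith(+) plays the role of merge on partial results. It is not how JsonGrinder computes schemas.

```julia
# Hypothetical stand-in for `schema`: count how often each key occurs in a chunk.
function count_keys(chunk)
    counts = Dict{String,Int}()
    for d in chunk, k in keys(d)
        counts[k] = get(counts, k, 0) + 1
    end
    return counts
end

docs = [Dict("a" => 1), Dict("a" => 2, "b" => 3), Dict("b" => 4), Dict("a" => 5)]

# Split the input into chunks of 2, compute partial statistics per chunk,
# and merge the partial results by summing the counts of equal keys.
chunks = Iterators.partition(docs, 2)
total = mapreduce(count_keys, mergewith(+), chunks)
```

Because the merge step is associative, the merged result equals what a single pass over all documents would produce, which is what makes it safe to distribute the work.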

JsonGrinder.default_scalar_extractor — Method

Default scalar extractor. It contains reasonable defaults, but sometimes it does not fulfill all needs. We advise using it as a base that you modify for your needs, or as a default that you append to your other rules.

JsonGrinder.extract_missing_bag — Method

returns missing bag of 1 observation

JsonGrinder.extractscalar — Method

extractscalar(Type{String}, n = 3, b = 256, m = 2053)

represents strings as ngrams with

n (the degree of ngram),
b base of string,
m modulo on index of the token to reduce dimension

extractscalar(Type{Number}, m = 0, s = 1)

extracts number subtracting m and multiplying by s

Example

julia> JsonGrinder.extractscalar(String, 3, 256, 2053)("5")
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
 "5"

julia> JsonGrinder.extractscalar(Int32, 3, 256)("5")
1×1 Mill.ArrayNode{Array{Int32,2},Nothing}:
 512
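The arithmetic behind the numeric example can be checked by hand: the value is centred by subtracting m and then scaled by multiplying by s, so (5 - 3) * 256 = 512. A minimal sketch in plain Julia follows; center_scale is a hypothetical helper used for illustration only and is not part of JsonGrinder.

```julia
# Hypothetical helper mirroring "subtract m, multiply by s".
center_scale(x::Number, m, s) = (x - m) * s
# Strings are parsed to a number first (assumption: they hold a valid number).
center_scale(x::AbstractString, m, s) = center_scale(parse(Float64, x), m, s)
# Missing input stays missing.
center_scale(::Missing, m, s) = missing

a = center_scale("5", 3, 256)   # (5 - 3) * 256
b = center_scale(1, 2, 3)       # (1 - 2) * 3
```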

JsonGrinder.generate_html — Method

generate_html(sch::DictEntry; max_vals=100, max_len=1_000)
generate_html(file_name, sch::DictEntry; max_vals=100, max_len=1_000)

exports schema to HTML including CSS style and JS allowing to expand / hide sub-parts of schema, countmaps, and lengthmaps.

Arguments

max_vals controls maximum number of exported values in countmap
max_len controls maximum number of exported lengths of arrays
file_name a name of file to save HTML with schema

Return

If file_name is provided, nothing is returned. Otherwise, the generated HTML+CSS+JS is returned as a String.

Example

You can either open the html file in any browser, or open it directly using ElectronDisplay

using ElectronDisplay
using ElectronDisplay: newdisplay
generated_html = generate_html(sch, max_vals = 100)
display(newdisplay(), MIME{Symbol("text/html")}(), generated_html)

JsonGrinder.make_empty_bag — Method

returns empty bag of 0 observations

JsonGrinder.newentry — Method

newentry(v)

creates new entry describing json according to the type of v

JsonGrinder.prune_json — Method

prune_json(json, schema)

Removes keys from json which are not part of the schema.

Example

julia> using JSON

julia> j1 = JSON.parse("{\"a\": 4, \"b\": {\"a\":1, \"b\": 1}}");

julia> j2 = JSON.parse("{\"a\": 4, \"b\": {\"a\":1}}");

julia> sch = JsonGrinder.schema([j1,j2])
[Dict] (updated = 2)
  ├── a: [Scalar - Int64], 1 unique values, updated = 2
  └── b: [Dict] (updated = 2)
           ├── a: [Scalar - Int64], 1 unique values, updated = 2
           └── b: [Scalar - Int64], 1 unique values, updated = 1

julia> j3 = Dict("a" => 4, "b" => Dict("a"=>1), "c" => 1, "d" => 2)
Dict{String,Any} with 4 entries:
  "c" => 1
  "b" => Dict("a"=>1)
  "a" => 4
  "d" => 2

julia> JsonGrinder.prune_json(j3, sch)
Dict{Any,Any} with 2 entries:
  "b" => Dict{Any,Any}("a"=>1)
  "a" => 4

so JsonGrinder.prune_json removes the keys c and d.
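The idea of pruning can be sketched with nested dictionaries alone. In the sketch below, the schema is replaced by a hypothetical nested Dict of allowed keys (nothing marks a leaf), and prune is an illustrative helper, not the library's implementation.

```julia
# Leaves (and anything that is not a dictionary) are returned unchanged.
prune(json, allowed) = json

# Dictionaries keep only the keys listed in `allowed`, recursively.
function prune(json::AbstractDict, allowed::AbstractDict)
    out = Dict{String,Any}()
    for (k, v) in json
        haskey(allowed, k) || continue
        out[k] = prune(v, allowed[k])
    end
    return out
end

# Hypothetical stand-in for the schema of the example above.
allowed = Dict("a" => nothing, "b" => Dict("a" => nothing, "b" => nothing))
j3 = Dict("a" => 4, "b" => Dict("a" => 1), "c" => 1, "d" => 2)
pruned = prune(j3, allowed)
```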

JsonGrinder.schema — Method

schema(samples::AbstractArray{<:Dict})
schema(samples::AbstractArray{<:AbstractString})
schema(samples::AbstractArray, map_fun::Function)
schema(map_fun::Function, samples::AbstractArray)

creates schema from an array of parsed or unparsed JSONs.

JsonGrinder.suggestextractor — Function

suggestextractor(e::DictEntry, settings = NamedTuple())

creates a converter of JSON to a tree structure of DataNode

e top-level of json hierarchy, typically returned by invoking schema
settings can be any container supporting get function
settings.mincountkey contains minimum repetition of the key to be included into the extractor (if missing it is equal to zero)
settings.key_as_field if the number of keys exceeds this value, it is assumed that the keys contain values, which means that they will be treated as strings.
settings.scalar_extractors contains rules for determining which extractor to use for leaves. The default value is the return value of default_scalar_extractor(). It is an array of pairs in which the first element is a predicate and the second element is a function that maps the schema to a specific extractor and is called if the predicate matches.
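The "array of pairs" rule format can be illustrated in isolation. The sketch below assumes that the first matching predicate wins; the rules, the symbols they return, and the helper pick are all hypothetical and only show the shape of such a rule list, not the contents of default_scalar_extractor().

```julia
# Each rule is `predicate => builder`; the builder runs if the predicate matches.
rules = [
    (x -> x isa Number)         => (x -> :numeric),
    (x -> x isa AbstractString) => (x -> :string),
    (x -> true)                 => (x -> :fallback),   # catch-all placed last
]

# Walk the rules in order and apply the builder of the first matching predicate.
function pick(rules, x)
    for (pred, build) in rules
        pred(x) && return build(x)
    end
    return nothing
end

r1 = pick(rules, 1)
r2 = pick(rules, "tcp")
r3 = pick(rules, nothing)
```

With this shape, custom rules can be prepended to a default list so that they take precedence over the defaults.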

JsonGrinder.update! — Method

function update!(a::Entry, v)

updates the entry when seeing value v

JsonGrinder.updatemaxkeys! — Method

updatemaxkeys!(n::Int)

limits the maximum number of keys in statistics of leaves in JSON. Default value is 10_000.

JsonGrinder.updatemaxlen! — Method

updatemaxlen!(n::Int)

limits the maximum length of string values in statistics of nodes in JSON. Default value is 10_000. Longer strings will be trimmed, and their length and hash will be appended to retain uniqueness. This is because some strings are very long and cause the schema to be even an order of magnitude larger than needed.
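The trimming described above can be sketched as follows. The helper shorten and the exact output format (prefix, length, and hash joined by underscores) are assumptions made for illustration; the format JsonGrinder actually uses may differ.

```julia
# Keep short strings as they are; trim long ones and append the original
# length and a hash so that different long strings remain distinguishable.
function shorten(s::AbstractString, maxlen::Int)
    length(s) <= maxlen && return s
    return string(first(s, maxlen), "_", length(s), "_", hash(s))
end

short = shorten("abc", 5)
long1 = shorten("abcdefgh", 3)
long2 = shorten("abcdefgx", 3)
```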

JsonGrinder.ArrayEntry — Type

mutable struct ArrayEntry <: JSONEntry
	items
	l::Dict{Int,Int}
	updated::Int
end

keeps statistics about an array entry in JSON.

items is of type Entry or nothing and keeps statistics about the elements of the array
l keeps histogram of message length
updated counts how many times the struct was updated.

JsonGrinder.AuxiliaryExtractor — Type

struct AuxiliaryExtractor <: AbstractExtractor
	extractor::AbstractExtractor
	extract_fun::Function
end

Universal extractor for applying any function, which lets you embed any transformation into the AbstractExtractor machinery. Useful e.g. for extractors accompanying trained models, where you need to apply yet another transformation.

julia> e1 = ExtractDict(Dict(:a=>ExtractString(), :b=>ExtractString()));

julia> e2 = AuxiliaryExtractor(e1, (e,x)->e[:a](x["a"]))
Auxiliary extractor with
  └── Dict
        ├── a: String
        └── b: String

julia> e2(Dict("a"=>"Hello", "b"=>"World"))
ArrayNode{NGramMatrix{String,Array{String,1},Int64},Nothing}:
 "Hello"

JsonGrinder.Entry — Type

mutable struct Entry <: JSONEntry
	counts::Dict{Any,Int}
	updated::Int
end

Keeps statistics about scalar values of a single key and also about items inside a key

counts counts how many times given value appeared (at most max_keys is held)
updated counts how many times the entry was updated
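The capped counting described for counts can be sketched as below. This assumes that once max_keys distinct values are held, new values are ignored while already known values keep being counted; observe! is a hypothetical helper and the actual update rule in JsonGrinder may differ.

```julia
# Count occurrences of `v`, but never hold more than `max_keys` distinct values.
function observe!(counts::Dict, v; max_keys = 10_000)
    if haskey(counts, v) || length(counts) < max_keys
        counts[v] = get(counts, v, 0) + 1
    end
    return counts
end

counts = Dict{Any,Int}()
for v in ["a", "b", "a", "c", "a"]
    observe!(counts, v; max_keys = 2)   # "c" arrives when the table is full
end
```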

JsonGrinder.ExtractArray — Type

struct ExtractArray{T}
	item::T
end

Convert array of values to a Mill.BagNode with items converted by item. The entire array is assumed to be a single bag.

Examples

julia> ec = ExtractArray(ExtractCategorical(2:4));

julia> ec([2,3,1,4]).data
4×4 Mill.ArrayNode{Mill.MaybeHotMatrix{Int64,Int64,Bool},Nothing}:
 1  0  0  0
 0  1  0  0
 0  0  0  1
 0  0  1  0

julia> es = ExtractArray(ExtractScalar());

julia> es([2,3,4])
BagNode with 1 obs
  └── ArrayNode(1×3 Array with Float32 elements) with 3 obs

julia> es([2,3,4]).data
1×3 Mill.ArrayNode{Array{Float32,2},Nothing}:
 2.0  3.0  4.0

JsonGrinder.ExtractCategorical — Type

ExtractCategorical(s::Entry)
ExtractCategorical(s::UnitRange)
ExtractCategorical(s::Vector)

Converts a single item to a one-hot encoded vector. Converts an array of items into a matrix of one-hot encoded columns. An extra element is always allocated for an unknown value. If passed missing, returns a column of missing values.

Examples

julia> e = ExtractCategorical(2:4);

julia> e([2,3,1,4]).data
4×4 Mill.MaybeHotMatrix{Int64,Int64,Bool}:
 1  0  0  0
 0  1  0  0
 0  0  0  1
 0  0  1  0

julia> e([1,missing,5]).data
4×3 Mill.MaybeHotMatrix{Union{Missing, Int64},Int64,Union{Missing, Bool}}:
 false  missing  false
 false  missing  false
 false  missing  false
  true  missing   true

julia> e(4).data
4×1 Mill.MaybeHotMatrix{Int64,Int64,Bool}:
 0
 0
 1
 0

julia> e(missing).data
4×1 Mill.MaybeHotMatrix{Missing,Int64,Missing}:
 missing
 missing
 missing
 missing
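The encoding shown in these examples, with one extra slot for unknown values, can be reproduced with a few lines of Base Julia. The helper onehot_with_unknown is hypothetical and returns plain vectors rather than a Mill.MaybeHotMatrix.

```julia
# Missing input yields a column of missing values.
onehot_with_unknown(::Missing, categories) = fill(missing, length(categories) + 1)

# Known values set their own slot; unknown values set the extra last slot.
function onehot_with_unknown(x, categories)
    col = falses(length(categories) + 1)
    idx = findfirst(isequal(x), categories)
    col[idx === nothing ? length(col) : idx] = true
    return col
end

known   = onehot_with_unknown(4, 2:4)
unknown = onehot_with_unknown(1, 2:4)
absent  = onehot_with_unknown(missing, 2:4)
```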

JsonGrinder.ExtractDict — Type

struct ExtractDict{S} <: AbstractExtractor
	dict::S
end

Extracts all items in dict and returns them as a Mill.ProductNode. If a key is missing in the extracted dict, nothing is passed to the child extractors.

Examples

julia> e = ExtractDict(Dict(:a=>ExtractScalar(Float32, 2, 3), :b=>ExtractCategorical(1:5)))
Dict
  ├── a: Float32
  └── b: Categorical d = 6

julia> res1 = e(Dict("a"=>1, "b"=>1))
ProductNode with 1 obs
  ├── a: ArrayNode(1×1 Array with Float32 elements) with 1 obs
  └── b: ArrayNode(6×1 MaybeHotMatrix with Bool elements) with 1 obs

julia> res1[:a].data
1×1 Array{Float32,2}:
 -3.0

julia> res1[:b].data
6×1 Mill.MaybeHotMatrix{Int64,Int64,Bool}:
 1
 0
 0
 0
 0
 0

julia> res2 = e(Dict("a"=>0))
ProductNode with 1 obs
  ├── a: ArrayNode(1×1 Array with Float32 elements) with 1 obs
  └── b: ArrayNode(6×1 MaybeHotMatrix with Missing elements) with 1 obs

julia> res2[:a].data
1×1 Array{Float32,2}:
 -6.0

julia> res2[:b].data
6×1 Mill.MaybeHotMatrix{Missing,Int64,Missing}:
 missing
 missing
 missing
 missing
 missing
 missing

JsonGrinder.ExtractEmpty — Type

struct ExtractEmpty end

Concrete type to dispatch on for extraction of empty samples.

JsonGrinder.ExtractKeyAsField — Type

struct ExtractKeyAsField{S,V} <: AbstractExtractor
	key::S
	item::V
end

Extracts all items in vec and in other and returns them as a ProductNode.

JsonGrinder.ExtractScalar — Type

struct ExtractScalar{T} <: AbstractExtractor
	c::T
	s::T
end

Extracts a numerical value, centred by subtracting c and scaled by multiplying by s. Strings are converted to numbers.

The extractor returns ArrayNode{Matrix{Union{Missing, Int64}},Nothing} or its subtypes. If passed missing, it extracts missing values, which Mill understands and can work with.

It can also be created using extractscalar(Float32, 5, 2)

Example

julia> ExtractScalar(Float32, 2, 3)(1)
1×1 Mill.ArrayNode{Array{Float32,2},Nothing}:
 -3.0

julia> ExtractScalar(Float32, 2, 3)(missing)
1×1 Mill.ArrayNode{Array{Missing,2},Nothing}:
 missing

JsonGrinder.ExtractString — Type

struct ExtractString{T} <: AbstractExtractor
	n::Int
	b::Int
	m::Int
end

Represents String as n-grams (NGramMatrix from Mill.jl) with base b and modulo m.

Example

julia> ExtractString()("hello")
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
 "hello"

julia> ExtractString()(missing)
2053×1 Mill.ArrayNode{Mill.NGramMatrix{Missing,Missing},Nothing}:
 missing

julia> ExtractString()(["hello", "world"])
2053×2 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
 "hello"
 "world"

JsonGrinder.ExtractVector — Type

struct ExtractVector{T}
	item::Int
end

represents an array of a fixed length, typically a feature vector of numbers of type T

julia> sc = ExtractVector(4)
julia> sc([2,3,1,4]).data
4×1 Array{Float32,2}:
 2.0
 3.0
 1.0
 4.0

JsonGrinder.MultiEntry — Type

mutable struct MultiEntry <: JSONEntry
	childs::Vector{Any}
end

Support for JSON which does not adhere to a fixed type. Container for multiple types of entry which are observed at the same place in JSON.

JsonGrinder.MultipleRepresentation — Type

MultipleRepresentation(extractors::Tuple)

Extracts an item to a ProductNode in which each child is produced by a different extractor; the item is extracted by all extractors in the multiple representation.

Examples

Example of both categorical and string representation

One use case is to use a string representation for strings and a categorical variable representation for the most frequent values. This allows the model to more easily learn frequent or otherwise significant values, while still creating a meaningful representation for previously unseen inputs.

julia> e = MultipleRepresentation((ExtractString(), ExtractCategorical(["tcp", "udp", "dhcp"])));

julia> s1 = e("tcp")
ProductNode with 1 obs
  ├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
  └── e2: ArrayNode(4×1 MaybeHotMatrix with Bool elements) with 1 obs

julia> s1[:e1]
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
 "tcp"

julia> s1[:e2]
4×1 Mill.ArrayNode{Mill.MaybeHotMatrix{Int64,Int64,Bool},Nothing}:
 0
 1
 0
 0

julia> s2 = e("http")
ProductNode with 1 obs
  ├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
  └── e2: ArrayNode(4×1 MaybeHotMatrix with Bool elements) with 1 obs

julia> s2[:e1]
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
 "http"

julia> s2[:e2]
4×1 Mill.ArrayNode{Mill.MaybeHotMatrix{Int64,Int64,Bool},Nothing}:
 0
 0
 0
 1

Example of irregular schema representation

The other use case is to handle an irregular schema, where an extractor returns a missing representation if it is unable to extract the value properly. The extractors do not all have to be leaf value extractors: some may be ExtractDict, while others extract leaves, etc.

julia> e = MultipleRepresentation((ExtractString(), ExtractScalar(Float32, 2, 3)));

julia> s1 = e(5)
ProductNode with 1 obs
  ├── e1: ArrayNode(2053×1 NGramMatrix with Missing elements) with 1 obs
  └── e2: ArrayNode(1×1 Array with Float32 elements) with 1 obs

julia> s1[:e1]
2053×1 Mill.ArrayNode{Mill.NGramMatrix{Missing,Missing},Nothing}:
 missing

julia> s1[:e2]
1×1 Mill.ArrayNode{Array{Float32,2},Nothing}:
 9.0

julia> s2 = e("hi")
ProductNode with 1 obs
  ├── e1: ArrayNode(2053×1 NGramMatrix with Int64 elements) with 1 obs
  └── e2: ArrayNode(1×1 Array with Missing elements) with 1 obs

julia> s2[:e1]
2053×1 Mill.ArrayNode{Mill.NGramMatrix{String,Int64},Nothing}:
 "hi"

julia> s2[:e2]
1×1 Mill.ArrayNode{Array{Missing,2},Nothing}:
 missing

JsonGrinder.extractempty — Constant

extractempty

A singleton of type ExtractEmpty is used to signal downstream extractors that they should extract an empty sample.