Schema

The schema helps to understand the structure of JSON files: the types of nodes (Dict, Array, Values), the frequency of values, and the lengths of arrays. The schema also records how many times each node has been present. All this information is taken into account by the suggestextractor function, which takes a schema and, using a few reasonable heuristics, suggests an extractor that converts JSONs to Mill structures. The schema may also be useful for formats with an enforced schema, to collect statistics on leaves.

The main function for creating a schema is schema, which accepts a list of (unparsed) JSONs and produces a schema. A schema can always be updated with the update! function to reflect new JSONs, which allows streaming. Moreover, schema accepts an optional argument, a function converting an element of an array to a JSON. For example, creating a schema from all JSON files in a directory can look like

schema(readdir("jsons", join = true)) do s
	open(s,"r") do fio
		read(fio, String)
	end |> JSON.parse
end

The schema function has the following default behavior: if it is passed an array of strings, it considers them to be JSON documents and passes each element as an argument to the JSON.parse function.

A schema can be further updated by calling update!(sch, json). Schemas can be merged using the overloaded merge function, which facilitates distributed creation of a schema following the map-reduce paradigm.
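
The map-reduce idea behind merge can be illustrated on plain value counts, independently of JsonGrinder. The sketch below uses only Base Julia (it assumes Julia 1.5 or newer for mergewith); countvalues and the sample data are hypothetical names made up for illustration, and the real merge operates on whole schemas rather than on a single dictionary.

```julia
# Toy illustration of map-reduce over counts; not JsonGrinder's implementation.
# `countvalues` is a hypothetical helper: it counts values in one partition.
function countvalues(part)
    counts = Dict{String,Int}()
    for v in part
        counts[v] = get(counts, v, 0) + 1
    end
    return counts
end

leaves = ["Hello", "World", "Hello", "Hi", "World", "Hello"]

# Map: count each partition separately. Reduce: add up counts key by key.
parts = Iterators.partition(leaves, 2)
merged = mapreduce(countvalues, mergewith(+), parts)

# The merged result agrees with counting everything in one pass.
println(merged == countvalues(leaves))
```

Because adding counts is associative and commutative, the partitions can be processed in any order or on different workers, which is presumably the property that makes merging schemas suitable for distributed use.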

A schema can be saved as HTML with generate_html, which allows interactive exploration. Calling generate_html(filename, sch) generates a self-contained file with HTML+CSS+JS. The generated visualization is interactive and implemented in vanilla JavaScript.

Schema assumes the root of each JSON is a dictionary.

Implementation details

Statistics are collected in a hierarchical structure composed of DictEntry, ArrayEntry, and Entry. These mirror the structures in JSON: Dict, Array, and Value (either a String or a Number). Sometimes data are stored in JSONs that do not adhere to a stable schema, which happens if one key has children of different types. An example would be

{"a": [1,2,3]}
{"a": {"b": 1}}
{"a": "hello"}

For such cases, we have introduced an additional JSONEntry, the MultiEntry, but we discourage relying on this feature and recommend adapting JSONs to have a stable schema (if possible). This can be achieved by modifying each sample before it is passed to the schema.
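
As a minimal sketch of such a modification, the function below wraps scalar values under a given key into one-element arrays, so that the key always holds an array. It covers only the simple scalar-versus-array case; wrap_scalar! is a hypothetical helper and not part of JsonGrinder, and parsed documents are written as Dict literals to keep the example self-contained.

```julia
# Hypothetical preprocessing step; `wrap_scalar!` is not part of JsonGrinder.
# It makes the value under `key` always an array by wrapping scalars.
function wrap_scalar!(doc::AbstractDict, key)
    if haskey(doc, key) && !(doc[key] isa AbstractVector)
        doc[key] = [doc[key]]
    end
    return doc
end

# Two parsed documents where "a" is once a string and once an array.
docs = [
    Dict{String,Any}("a" => "hello"),
    Dict{String,Any}("a" => ["hello", "world"]),
]

# Work on copies so the original documents stay untouched.
stable = [wrap_scalar!(copy(d), "a") for d in docs]
println(all(d -> d["a"] isa AbstractVector, stable))
```

After this step every document holds an array under "a", so a schema built from the modified samples would presumably see a single node type at that position.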

Each subtype of JSONEntry implements the update! function, which recursively updates the schema.

Entry

mutable struct Entry{T} <: JSONEntry
	counts::Dict{T,Int}
	updated::Int
end

Entry keeps information about leaf values (e.g. "a": 3), which are strings or numbers, in JSONs. It consists of two statistics:

updated counts how many times the leaf in a given position in JSON was observed,
counts counts how many times a particular value of that leaf was observed.

To keep the counts dictionary from becoming too large, new values are dropped once its length exceeds JsonGrinder.max_keys (default is 10_000). The limit can be changed with JsonGrinder.updatemaxkeys!(some_higher_value), but the new limit applies only to values processed afterwards, so it is advisable to set it at the beginning of your scripts.
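
The bookkeeping described above can be sketched with a toy model in plain Julia. ToyEntry, toy_update!, and the max_keys keyword are hypothetical stand-ins, not JsonGrinder's code; in particular, it is an assumption of this sketch that already tracked values keep being counted after the cap is reached and that the cap is checked before a new key is added.

```julia
# Toy model of a leaf entry; not JsonGrinder's actual implementation.
mutable struct ToyEntry{T}
    counts::Dict{T,Int}   # value => number of times it was observed
    updated::Int          # number of times the leaf itself was observed
end

ToyEntry{T}() where {T} = ToyEntry{T}(Dict{T,Int}(), 0)

# `max_keys` is a hypothetical keyword standing in for JsonGrinder.max_keys.
function toy_update!(e::ToyEntry{T}, v::T; max_keys::Int = 10_000) where {T}
    e.updated += 1
    if haskey(e.counts, v)
        e.counts[v] += 1              # known value: always counted (assumption)
    elseif length(e.counts) < max_keys
        e.counts[v] = 1               # new value: stored only below the cap
    end
    return e
end

e = ToyEntry{String}()
for v in ["a", "b", "a", "c"]
    toy_update!(e, v; max_keys = 2)
end
println(e.updated, " ", length(e.counts))
```

In this run the value "c" arrives after two distinct keys are already stored, so it is dropped from counts while updated still reflects all four observations.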

ArrayEntry

mutable struct ArrayEntry <: JSONEntry
	items
	l::Dict{Int,Int}
	updated::Int
end

ArrayEntry keeps information about arrays (e.g. "a": [1,2,3,4]). Statistics about individual items of the array are deferred to items, which can be any <:JSONEntry. l keeps a histogram of array lengths, and updated is the number of times an array has been observed in a particular place in the JSON.
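
To make the three statistics concrete, here is a small sketch in plain Julia that computes them for a handful of arrays. array_stats and the field name items_updated are hypothetical and only mimic what ArrayEntry and its items would record; this is not the library's code.

```julia
# Toy bookkeeping for arrays seen at one position; names are illustrative only.
function array_stats(arrays)
    l = Dict{Int,Int}()   # histogram: array length => number of arrays
    updated = 0           # number of arrays observed
    items_updated = 0     # number of individual items observed
    for a in arrays
        l[length(a)] = get(l, length(a), 0) + 1
        updated += 1
        items_updated += length(a)
    end
    return (l = l, updated = updated, items_updated = items_updated)
end

stats = array_stats([[1, 2, 3], [4, 5], [6, 7, 8]])
println(stats.updated, " arrays, ", stats.items_updated, " items")
```

Note that the array counter and the item counter differ: three arrays were observed, but the entry describing their items would be updated once per element.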

DictEntry

mutable struct DictEntry <: JSONEntry
	childs::Dict{Symbol, Any}
	updated::Int
end

DictEntry defers all statistics about its children to them; its only own statistic is again the counter updated, the number of observations. The field childs contains all keys that were observed in the specific dictionary, together with their corresponding <:JSONEntry values holding statistics about the values observed under the given key.

MultiEntry

mutable struct MultiEntry <: JSONEntry
	childs::Vector{JSONEntry}
	updated::Int
end

MultiEntry is a failsafe for cases where the schema is not stable. For example, in the following two JSONs

{"a": "Hello"}
{"a": ["Hello"," world"]}

the type of the value under key "a" is String in the first, whereas in the second it is Vector. JsonGrinder deals with this by first creating an Entry, since the value is scalar, and upon encountering the second JSON, it replaces the Entry with a MultiEntry having the Entry and an ArrayEntry as children (this is the reason why entries are declared mutable).

While JsonGrinder can deal with non-stable JSONs, relying on this is strongly discouraged, as it might have a negative effect on performance.

This feature is also useful when you don't know whether your schema is stable. In that case, you can calculate the schema and then search for MultiEntry nodes.
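
The notion of instability can also be checked by hand on parsed documents. The following sketch, in plain Julia without JsonGrinder, records which kinds of values (dict, array, scalar) appear under each top-level key and reports keys with more than one kind. kind and kinds_per_key are hypothetical helpers, and the sketch looks at the top level only.

```julia
# Classify a parsed JSON value; these helpers are illustrative, not library API.
kind(::AbstractDict) = :dict
kind(::AbstractVector) = :array
kind(::Any) = :scalar

# For every top-level key, count how often each kind of value was seen.
function kinds_per_key(docs)
    seen = Dict{String,Dict{Symbol,Int}}()
    for doc in docs
        for (k, v) in doc
            per_key = get!(seen, k, Dict{Symbol,Int}())
            per_key[kind(v)] = get(per_key, kind(v), 0) + 1
        end
    end
    return seen
end

docs = [
    Dict{String,Any}("a" => "Hello"),
    Dict{String,Any}("a" => ["Hello", " world"]),
]

seen = kinds_per_key(docs)
# Keys observed with more than one kind of value are the unstable ones.
unstable = sort([k for (k, per_key) in seen if length(per_key) > 1])
println(unstable)
```

A key reported here corresponds to a position where the schema would need a MultiEntry.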

Illustrative example

Let's say we have the following JSONs. We take them and create a schema.

using JSON, JsonGrinder
jsons = [
       """{"a": "Hello", "b":{"c":1, "d":1}}""",
       """{"a": ["Hi", "Julia"], "b":{"c":1, "d":[1,2,3]}}""",
       """{"a": "World", "b":{"c":2, "d":2}}""",
]
sch = schema(JSON.parse, jsons)

You can visualize the schema by

julia> display(sch)
[Dict] (updated = 3)
  ├── a: [MultiEntry] (updated = 3)
  │        ├── 1: [Scalar - String], 2 unique values, updated = 2
  │        └── 2: [List] (updated = 1)
  │                 ⋮
  └── b: [Dict] (updated = 3)
           ├── c: [Scalar - Int64], 2 unique values, updated = 3
           └── d: [MultiEntry] (updated = 3)
                    ⋮

which shows only a truncated part of the schema.

To see the whole schema, we can use printtree(sch; htrunc=Inf, vtrunc=Inf, trav=true) from HierarchicalUtils.jl, which prints the whole schema together with identifiers of individual nodes:

julia> printtree(sch; htrunc=Inf, vtrunc=Inf, trav=true)
[Dict] (updated = 3) [""]
  ├── a: [MultiEntry] (updated = 3) ["E"]
  │        ├── 1: [Scalar - String], 2 unique values, updated = 2 ["I"]
  │        └── 2: [List] (updated = 1) ["M"]
  │                 └── [Scalar - String], 2 unique values, updated = 2 ["O"]
  └── b: [Dict] (updated = 3) ["U"]
           ├── c: [Scalar - Int64], 2 unique values, updated = 3 ["Y"]
           └── d: [MultiEntry] (updated = 3) ["c"]
                    ├── 1: [Scalar - Int64], 2 unique values, updated = 2 ["d"]
                    └── 2: [List] (updated = 1) ["e"]
                             └── [Scalar - Int64], 3 unique values, updated = 3 ["eU"]

The strings at the end of each row can be used as keys to access individual elements of the schema. To learn more about HierarchicalUtils.jl, check its docs or the section about HierarchicalUtils.jl in the Mill.jl documentation.

Here we see 2 MultiEntry nodes, and thus 2 type instabilities in our JSONs. The first MultiEntry (key "E") has 2 children: an Entry and an ArrayEntry.

The value of sch["E"].updated is 3, because a value under key a has been observed 3 times. The value of sch["I"].updated is 2, because a string value was seen 2 times under a. As expected, we can see

julia> sch["I"].counts
Dict{String,Int64} with 2 entries:
  "Hello" => 1
  "World" => 1

and in the ArrayEntry we can see that sch["M"].updated is 1, because an array has been observed once under key a. The frequency of lengths is the following:

julia> sch["M"].l
Dict{Int64,Int64} with 1 entry:
  2 => 1

because we have observed one array of length 2. sch["M"].items is an Entry.

The Entry (accessible by sch["M"].items or by sch["O"]) has fields with the following values:

sch["O"].updated is 2, because we have observed 2 elements in array under key a.

counts is

julia> sch["O"].counts
Dict{String,Int64} with 2 entries:
  "Hi"    => 1
  "Julia" => 1

which corresponds to individual elements of an array we have observed.

Extra functions

While a schema can be printed to the REPL, it can contain quite a lot of information. Therefore JsonGrinder.generate_html exports it to HTML, where parts can be expanded at will.

JsonGrinder.generate_html — Function

generate_html(sch::DictEntry; max_vals=100, max_len=1_000)
generate_html(file_name, sch::DictEntry; max_vals=100, max_len=1_000)

Exports the schema to HTML, including CSS styles and JS that allow expanding / hiding sub-parts of the schema, countmaps, and lengthmaps.

Arguments

max_vals controls the maximum number of exported values in a countmap
max_len controls the maximum number of exported lengths of arrays
file_name is the name of the file to save the HTML with the schema to

Return

If a file name is provided, nothing is returned. Otherwise, the generated HTML+CSS+JS is returned as a String.

Example

You can either open the HTML file in any browser or display it directly using ElectronDisplay

using ElectronDisplay
using ElectronDisplay: newdisplay
generated_html = generate_html(sch, max_vals = 100)
display(newdisplay(), MIME{Symbol("text/html")}(), generated_html)

Schema supports merging using Base.merge, which facilitates parallel computation of schemas. An example might be

using ThreadsX
ThreadsX.mapreduce(schema, merge, Iterators.partition(jsons, max(1, cld(length(jsons), Threads.nthreads()))))

JsonGrinder.prune_json — Function

prune_json(json, schema)

Removes keys from json which are not part of the schema.

Example

julia> using JSON

julia> j1 = JSON.parse("{\"a\": 4, \"b\": {\"a\":1, \"b\": 1}}");

julia> j2 = JSON.parse("{\"a\": 4, \"b\": {\"a\":1}}");

julia> sch = JsonGrinder.schema([j1,j2])
[Dict] (updated = 2)
  ├── a: [Scalar - Int64], 1 unique values, updated = 2
  └── b: [Dict] (updated = 2)
           ├── a: [Scalar - Int64], 1 unique values, updated = 2
           └── b: [Scalar - Int64], 1 unique values, updated = 1

julia> j3 = Dict("a" => 4, "b" => Dict("a"=>1), "c" => 1, "d" => 2)
Dict{String,Any} with 4 entries:
  "c" => 1
  "b" => Dict("a"=>1)
  "a" => 4
  "d" => 2

julia> JsonGrinder.prune_json(j3, sch)
Dict{Any,Any} with 2 entries:
  "b" => Dict{Any,Any}("a"=>1)
  "a" => 4

so JsonGrinder.prune_json removes the keys c and d.
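
The pruning logic can be sketched without JsonGrinder as a recursive filter over nested dictionaries. In the toy version below, toy_prune is a hypothetical function, and allowed is a hypothetical nested Dict that stands in for the schema: each known key maps either to nothing (a leaf) or to another Dict of known keys. The real prune_json works on schema entries and may handle arrays and other cases differently.

```julia
# Toy pruning against a nested Dict of allowed keys; not JsonGrinder's code.
function toy_prune(doc::AbstractDict, allowed::AbstractDict)
    out = Dict{String,Any}()
    for (k, v) in doc
        haskey(allowed, k) || continue        # drop keys unknown to the "schema"
        sub = allowed[k]
        # Recurse only when both the value and its description are dictionaries.
        out[k] = (v isa AbstractDict && sub isa AbstractDict) ? toy_prune(v, sub) : v
    end
    return out
end

# Stand-in for the schema built from j1 and j2 above.
allowed = Dict("a" => nothing, "b" => Dict("a" => nothing, "b" => nothing))

j3 = Dict("a" => 4, "b" => Dict("a" => 1), "c" => 1, "d" => 2)
pruned = toy_prune(j3, allowed)
println(sort(collect(keys(pruned))))
```

As in the example above, the keys c and d are absent from the description and are therefore not copied to the result.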

JsonGrinder.updatemaxkeys! — Function

updatemaxkeys!(n::Int)

Limits the maximum number of keys in statistics of leaves in JSON. The default value is 10_000.

JsonGrinder.updatemaxlen! — Function

updatemaxlen!(n::Int)

Limits the maximum length of string values in statistics of nodes in JSON. The default value is 10_000. Longer strings are trimmed, and their length and hash are appended to retain uniqueness. This is because some strings are very long and can make the schema an order of magnitude larger than needed.
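
The trimming idea can be sketched in plain Julia as follows. toy_shorten, its max_len keyword, and the underscore-separated output format are assumptions made for illustration; the exact format and hash function used by JsonGrinder may differ.

```julia
# Hypothetical shortening rule; the exact format JsonGrinder uses may differ.
function toy_shorten(s::AbstractString; max_len::Int = 10_000)
    length(s) <= max_len && return s
    # Keep a prefix and append the original length and a hash of the full string.
    return string(first(s, max_len), "_", length(s), "_", hash(s))
end

# Two long strings that share their first 10 characters and their length.
a = toy_shorten("x"^20 * "A"; max_len = 10)
b = toy_shorten("x"^20 * "B"; max_len = 10)
println(a != b)
```

Appending the hash keeps the two shortened strings distinct even though their prefixes and lengths coincide, which is the purpose of retaining uniqueness described above.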