JsonGrinder
JsonGrinder is a collection of routines that facilitates conversion of JSON documents into structures used by Mill.jl
project.
The envisioned workflow is as follows:
1. Estimate schema of documents from a collection of JSON documents using a call schema
.
2. Create an extractor using extractor = suggestextractor(schema, settings)
3. Conver JSONs to Mill friendly structures using extractbatch(extractor, samples) = reduce(catobs, map(s-> extractor(s), samples))
A simplest example would looks as follows: Let's start by importing libraries and defining some JSONs.
using JsonGrinder, Flux, Mill, JSON
j1 = JSON.parse("""{"a": 1, "b": "hello works", "c":{ "a":1 ,"b": "hello world"}}""")
j2 = JSON.parse("""{"a": 2, "b": "hello world", "c":{ "a":2 ,"b": "hello"}}""")
Let's estimate a schema from those two documents
julia> sch = schema([j1,j2])
: [Dict]
├── a: [Scalar - Int64], 2 unique values, updated = 2
├── b: [Scalar - String], 2 unique values, updated = 2
└── c: [Dict]
├── a: [Scalar - Int64], 2 unique values, updated = 2
└── b: [Scalar - String], 2 unique values, updated = 2
Let's create a default extractor
julia> extractor = suggestextractor(sch)
: struct
├─── a: Int64
├─── b: String
└─── c: struct
├─── a: Int64
└─── b: String
Now we can convert the data to data either by
ds = map(s-> extractor(s), [j1,j2])
dss = reduce(catobs, ds)
or for convenience joined into a single command
julia> ds = extractbatch(extractor, [j1, j2])
ProductNode
├── scalars: ArrayNode(1, 2)
├── c: ProductNode
│ ├── scalars: ArrayNode(1, 2)
│ └── b: ArrayNode(2053, 2)
└── b: ArrayNode(2053, 2)
Now, we use a convenient function reflectinmodel
which creates a model that can process our dataset
julia> m = reflectinmodel(ds, d -> Chain(Dense(d,10, relu), Dense(10,4)))
ProductModel (
├── scalars: ArrayModel(Chain(Dense(1, 10, relu), Dense(10, 4)))
├── c: ProductModel (
│ ├── scalars: ArrayModel(Chain(Dense(1, 10, relu), Dense(10, 4)))
│ └── b: ArrayModel(Chain(Dense(2053, 10, relu), Dense(10, 4)))
│ ) ↦ ArrayModel(Chain(Dense(8, 10, relu), Dense(10, 4)))
└── b: ArrayModel(Chain(Dense(2053, 10, relu), Dense(10, 4)))
) ↦ ArrayModel(Chain(Dense(12, 10, relu), Dense(10, 4)))
and finally, we can do all the usual stuff with it
julia> m(ds).data
4×2 Array{Float32,2}:
0.102617 0.116041
0.0478762 0.133312
0.0357873 -0.0108712
-0.0197168 -0.0255238
- Customization of extractors:
While extractors of Dictionaries and Lists are straighforward, as the first one is converted to Mill.ProductNode
and the latter to Mill.BagNode
. The extractor of scalars can benefit from customization. This can be to some extent automatized by defining its own conversion rules in a list of [(criterion, extractor),...] where criterion is a function accepting JsonEntry
and outputing true
and false
and extractor is a function of JsonEntry
again returning a function extracting given entry. This list is passed to suggestextractor(schema, (scalar_extractors = [(criterion, extractor),...]))
For example a default list of extractors is
function default_scalar_extractor()
[(e -> (length(keys(e.counts)) / e.updated < 0.1 && length(keys(e.counts)) <= 10000),
e -> ExtractCategorical(collect(keys(e.counts)))),
(e -> true,
e -> extractscalar(promote_type(unique(typeof.(keys(e.counts)))...))),]
end
where the first entry checks sparsity e -> (length(keys(e.counts)) / e.updated < 0.1 && length(keys(e.counts)) <= 10000)
and if it is sufficiently sparse, it will suggest Categorical (one-hot) extractor. The second is a catch-all case, which extracts a scalar value, such as Float32.