Data.toml

Data collections are represented on-disk as Data.toml files. While DataToolkit can be used at a basic level without any knowledge of the structure of the file, a little knowledge goes a long way (for instance when editing a dataset).

Overall structure

See the TOML refresher below if you're a bit rusty, then come back to this.

A Data.toml file is broadly composed of three sections:

Global setup information
Configuration
Datasets

Here's what that structure looks like in practice:

data_config_version=0

name="data collection name"
uuid="a UUIDv4"
plugins=["plugin1", "plugin2", ...]

[config]
# [Properties of the data collection itself]

[[mydataset]]
uuid="a UUIDv4"
# other properties...

[[mydataset.TRANSFORMER]]
driver="transformer driver"
type=["a QualifiedType", ...]
priority=1 # (optional)
# other properties...

[[mydataset]]
# There may be multiple data sets by the same name,
# but they must be uniquely identifiable by their properties

[[exampledata]]
# Another data set
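As a rough illustration of the three sections, the sketch below parses a small, made-up Data.toml string with Julia's standard-library TOML parser and splits the result into global setup, configuration, and datasets. The collection name, UUIDs, plugin name, and `path` attribute are placeholders chosen for this example, and DataToolkit's own loading code may well work differently.

```julia
using TOML

# A made-up Data.toml; every name and UUID below is a placeholder.
data_toml = """
data_config_version = 0

name = "example"
uuid = "11111111-1111-4111-8111-111111111111"
plugins = ["defaults"]

[config]
mykey = "value"

[[mydataset]]
uuid = "22222222-2222-4222-8222-222222222222"

[[mydataset.storage]]
driver = "filesystem"
path = "data.csv"
"""

parsed = TOML.parse(data_toml)

# Global setup: the plain top-level keys.
global_keys = ["data_config_version", "name", "uuid", "plugins"]
setup = Dict(k => parsed[k] for k in global_keys)

# Configuration: the special "config" table.
config = parsed["config"]

# Datasets: every remaining top-level key, each holding an array of tables.
datasets = Dict(k => v for (k, v) in parsed
                if k ∉ global_keys && k != "config")
```

Here `datasets["mydataset"]` is a vector with one entry per `[[mydataset]]` header, which is what lets several datasets share a name.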

Global setup

The global setup must specify:

The Data.toml format version
The name and UUID of the data collection
The plugins used by the data collection

Configuration

The config TOML table is special, and is used to hold custom attributes of the data collection, for example:

[config]
mykey="value"

[config.defaults]
description="Ooops, somebody forgot to describe this."

[config.defaults.storage.filesystem]
priority=2

As a consequence of this, no dataset may be named "config".
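To see how the nested headers in that example map onto nested dictionaries, here is a minimal sketch that parses it with Julia's standard-library TOML parser. It only shows the raw parsed structure; how DataToolkit and its plugins interpret these values is not covered here.

```julia
using TOML

parsed = TOML.parse("""
[config]
mykey = "value"

[config.defaults]
description = "Ooops, somebody forgot to describe this."

[config.defaults.storage.filesystem]
priority = 2
""")

# Everything lives under the single top-level "config" key.
config = parsed["config"]

# Each dotted header segment becomes one level of nesting.
fs_priority = config["defaults"]["storage"]["filesystem"]["priority"]
```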

Datasets

All datasets are represented using an array of tables. This allows multiple datasets to share a name and be distinguished by other attributes (e.g. version information). Every dataset must have a uuid key, which provides a canonical unique reference to that particular dataset.

[[mydataset]]
uuid="a UUIDv4"
# other properties...
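The following sketch shows why the uuid matters when two datasets share a name. It parses two `[[mydataset]]` entries with Julia's standard-library TOML parser and picks one out by its UUID. The `version` attribute and both UUIDs are invented for illustration, and this manual lookup is not DataToolkit's own resolution mechanism.

```julia
using TOML

parsed = TOML.parse("""
[[mydataset]]
uuid = "22222222-2222-4222-8222-222222222222"
version = "1.0"

[[mydataset]]
uuid = "33333333-3333-4333-8333-333333333333"
version = "2.0"
""")

# Both entries are collected, in file order, under the shared name.
candidates = parsed["mydataset"]

# The UUID identifies exactly one of them; `only` errors otherwise.
target = "33333333-3333-4333-8333-333333333333"
found = only(filter(d -> d["uuid"] == target, candidates))
```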

The storage/loader/writer transformers of a dataset are specified using sub-tables, i.e.

[[mydataset.TRANSFORMER]]
driver="transformer driver"
# other properties...

All transformers must set the driver key. All attributes other than driver, type, and priority are free to be used by the transformer and plugins.
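The driver requirement can be checked by hand on the parsed file. The sketch below assumes that transformers sit under the sub-table names storage, loader, and writer, as described above. `missing_drivers` is a hypothetical helper written for this example and is not part of DataToolkit; the `type` value is likewise a placeholder.

```julia
using TOML

# Hypothetical helper: list the transformers of one dataset
# that lack the mandatory "driver" key.
function missing_drivers(dataset::AbstractDict)
    missing_in = String[]
    for kind in ("storage", "loader", "writer")
        for (i, transformer) in enumerate(get(dataset, kind, []))
            haskey(transformer, "driver") ||
                push!(missing_in, "$(kind)[$(i)]")
        end
    end
    return missing_in
end

parsed = TOML.parse("""
[[mydataset]]
uuid = "22222222-2222-4222-8222-222222222222"

[[mydataset.storage]]
driver = "filesystem"

[[mydataset.loader]]
type = ["String"]
""")

# The loader above deliberately omits its driver.
problems = missing_drivers(parsed["mydataset"][1])
```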

TOML refresher

TOML files are already widely used with Julia (for example, Project.toml and Manifest.toml files), as they strike a good compromise between capability and complexity. See the TOML documentation for a full description of the format, but here are the components most relevant to Data.toml files.

Key-value pairs

key = "value"

This sets the dictionary key "key" to the value "value". Strings, numbers, booleans, and date/time stamps are all valid values.

a = "value"
b = 2
c = 3.1e+12
d = true
e = 1979-05-27T07:32:00Z

Arrays are written using [ ] syntax, and can spread across multiple lines.

key = [1, 2, 3]
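As a quick check of how these values arrive in Julia, this sketch parses the examples above with the standard-library TOML parser and inspects a few of the results. The date/time entry is parsed too, but its exact Julia type is left unasserted here.

```julia
using TOML

parsed = TOML.parse("""
a = "value"
b = 2
c = 3.1e+12
d = true
e = 1979-05-27T07:32:00Z
key = [1, 2, 3]
""")

# Each TOML value is converted to a corresponding Julia value.
a_is_string = parsed["a"] isa AbstractString
b_is_integer = parsed["b"] isa Integer
c_is_float = parsed["c"] isa AbstractFloat
d_is_bool = parsed["d"] isa Bool

# Arrays become Julia vectors.
numbers = parsed["key"]
```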

Tables (Dictionaries)

A collection of key-value pairs within a certain scope forms a Julia Dict when parsed. TOML allows for nested dictionaries using tables. A new table is created with a bracketed header line, like so:

[new_table]

All key-value entries after such a table header, up to the next table header, belong to that table. For example:

[mytable]
a = 1
b = 2

This is parsed as

Dict("mytable" => Dict("a" => 1, "b" => 2))

It is also possible to represent this using dotted keys, e.g.

mytable.a = 1
mytable.b = 2

These two styles can be mixed to form nested tables.

[mytable.innertable.deeply_nested]
key = "value"
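To make the mixing concrete, here is a small sketch (parsed with Julia's standard-library TOML parser) that combines bracketed headers with a dotted key inside a table. The key names are invented for this example.

```julia
using TOML

parsed = TOML.parse("""
[mytable]
a = 1

[mytable.innertable]
deeply.nested = "value"
""")

# The dotted header and the dotted key each add levels of nesting.
inner = parsed["mytable"]["innertable"]
value = inner["deeply"]["nested"]
```

The result has the same shape as writing out `Dict("mytable" => Dict("a" => 1, "innertable" => Dict("deeply" => Dict("nested" => "value"))))` by hand.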

Arrays of tables

A list of dictionaries (array of tables in TOML terminology) can be formed using double-bracketed headers, e.g.

[[table_array]]

All double-bracketed tables will be collected together into an array, for example:

[[table_array]]
key = 1

[[table_array]]
key = 2

will be parsed as

Dict("table_array" => [Dict("key" => 1),
                       Dict("key" => 2)])