Data.toml
Data collections are represented on-disk as Data.toml
files. While DataToolkit
can be used at a basic level without any knowledge of the structure of the file, a little knowledge goes a long way (for instance when edit
ing a dataset).
Overall structure
See the TOML refresher below if you're a bit rusty, then come back to this.
A Data.toml
file is broadly composed of three sections:
- Global setup information
- Configuration
- Datasets
Here's what that structure looks like in practice:
data_config_version=0
name="data collection name"
uuid="a UUIDv4"
plugins=["plugin1", "plugin2", ...]
[config]
# [Properties of the data collection itself]
[[mydataset]]
uuid="a UUIDv4"
# other properties...
[[mydataset.TRANSFORMER]]
driver="transformer driver"
type=["a QualifiedType", ...]
priority=1 # (optional)
# other properties...
[[mydataset]]
# There may be multiple data sets by the same name,
# but they must be uniquely identifyable by their properties
[[exampledata]]
# Another data set
Global setup
The global setup must specify:
- The
Data.toml
format version - The name and UUID of the data collection
- The plugins used by the data collection
Configuration
The config
TOML table is special, and is used to hold custom attributes of the data collection, for example:
[config]
mykey="value"
[config.defaults]
description="Ooops, somebody forgot to describe this."
[config.defaults.storage.filesystem]
priority=2
As a consequence of this, no dataset may be named "config"
.
Datasets
All datasets are represented using an array of tables. This allows multiple datasets to have the same name, and be distinguished by other attributes (e.g. version information). All datasets must have a uuid
key, this is important for providing a canonical unique reference to a particular dataset.
[[mydataset]]
uuid="a UUIDv4"
# other properties...
The storage/loader/writer transformers of a dataset are specified using sub-tables, i.e.
[[mydataset.TRANSFORMER]]
driver="transformer driver"
# other properties...
All transformers must set the driver
key. All attributes other than driver
, type
, and priority
are free to be used by the transformer and plugins.
TOML refresher
TOML files are already widely used with Julia (for example, Project.toml
and Manifest.toml
) files, as they strike a good compromise between capability and complexity. See the TOML documentation for a full description of the format, but here are the components most relevant to Data.toml
files.
Key-value pairs
key = "value"
This represents a "key"
dictionary key having the value "value"
. Strings, numbers, booleans, and date/time stamps are all appropriate value forms.
a = "value"
b = 2
c = 3.1e+12
d = true
e = 1979-05-27T07:32:00Z
Arrays are written using [ ]
syntax, and can spread across multiple lines.
key = [1, 2, 3]
Tables (Dictionaries)
A collection of key-value pairs within a certain scope form a Julia Dict
when parsed. TOML allows for nested dictionaries using tables. A new table is created with a bracketed header line, like so:
[new_table]
All key-value entries after such a table header, up to the next table header, belong to that table. For example:
[mytable]
a = 1
b = 2
this is parsed as
Dict("mytable" => Dict("a" => 1, "b" => 2))
It is also possible to represent this using dotted keys, e.g.
mytable.a = 1
mytable.b = 2
These two styles can mixed to form nested tables.
[mytable.innertable.deeply_nested]
key = "value"
Arrays of tables
A list of dictionaries (array of tables in TOML terminology) can be formed using double-bracketed headers, e.g.
[[table_array]]
All double-bracketed tables will be collected together into an array, for example:
[[table_array]]
key = 1
[[table_array]]
key = 2
will be parsed as
Dict("table_array" => [Dict("key" => 1),
Dict("key" => 2)])