DataToolkitCommon.Store.CACHE_PLUGIN (Constant)

Cache the results of data loaders using the Serialization standard library. Cache keys are determined by the loader "recipe" and the type requested.

It is important to note that not all data types can be cached effectively; an IOStream, for example, cannot be.

Recipe hashing

The driver, parameters, and type(s) of a loader, together with the storage drivers of a data set, are all combined into the "recipe hash" of a loader.

╭─────────╮             ╭──────╮
│ Storage │             │ Type │
╰───┬─────╯             ╰───┬──╯
    │    ╭╌╌╌╌╌╌╌╌╌╮    ╭───┴────╮ ╭────────╮
    ├╌╌╌╌┤ DataSet ├╌╌╌╌┤ Loader ├─┤ Driver │
    │    ╰╌╌╌╌╌╌╌╌╌╯    ╰───┬────╯ ╰────────╯
╭───┴─────╮             ╭───┴────────╮
│ Storage ├─╼           │ Parameters ├─╼
╰─────┬───╯             ╰───────┬────╯
      ╽                         ╽

Since the parameters of the loader (and each storage backend) can reference other data sets (indicated with ─╼ and ╽ in the diagram above), this hash is computed recursively, forming a Merkle tree. In this manner the entire "recipe" leading to the final result is hashed.

                ╭───╮
                │ E │
        ╭───╮   ╰─┬─╯
        │ B ├──▶──┤
╭───╮   ╰─┬─╯   ╭─┴─╮
│ A ├──▶──┤     │ D │
╰───╯   ╭─┴─╮   ╰───╯
        │ C ├──▶──┐
        ╰───╯   ╭─┴─╮
                │ D │
                ╰───╯

In this example, the hash for a loader of data set "A" relies on the data sets "B" and "C", and so their hashes are calculated and included. "D" is required by both "B" and "C", and so is included in each. "E" is also used by "B".
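The recursive scheme can be sketched with a toy model (the deps table and toyrhash below are hypothetical illustrations of the idea, not part of DataToolkitCommon's API):

```julia
# Toy sketch of a Merkle-style recipe hash over the dependency graph above.
# Each node's hash folds in the hashes of the data sets it references,
# so a change anywhere upstream changes every downstream hash.
deps = Dict("A" => ["B", "C"], "B" => ["E", "D"], "C" => ["D"],
            "D" => String[], "E" => String[])

toyrhash(name::String, h::UInt=zero(UInt)) =
    foldl((acc, dep) -> toyrhash(dep, acc), deps[name]; init = hash(name, h))

# Any change to "D" alters the hashes of "B", "C", and hence "A".
```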

Configuration

Store path

This uses the same store.path configuration variable as the store plugin (see its documentation).

Disabling on a per-loader basis

Caching of individual loaders can be disabled by setting the "cache" parameter to false, i.e.

[[somedata.loader]]
cache = false
...

Store management

System-wide configuration can be set via the store config set REPL command, or by directly modifying the DataToolkitCommon.Store.getinventory().config struct.

A few (system-wide) settings determine garbage collection behaviour:

  • auto_gc (default 2): How often to automatically run garbage collection (in hours). Set to a non-positive value to disable.
  • max_age (default 30): The maximum number of days since a collection was last seen before it is removed from consideration.
  • max_size (default 53687091200, i.e. 50 GiB): The maximum (total) size of the store.
  • recency_beta (default 1): When removing items to avoid going over max_size, how much recency should be valued. Can be set to any value in (-∞, ∞). Larger (positive) values weight recency more, and negative values weight size more. -1 and 1 are equivalent.
  • store_dir (default store): The directory (either as an absolute path, or relative to the inventory file) that should be used for storage (IO) cache files.
  • cache_dir (default cache): The directory (either as an absolute path, or relative to the inventory file) that should be used for Julia cache files.
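For example, to cap the store at 50 GiB from the Data REPL (the exact argument syntax shown here is an assumption):

```
data> store config set max_size 50GiB
```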
DataToolkitCommon.Store.CHECKSUM_AUTO_LOG_SIZE (Constant)

The file size threshold (in bytes) above which an info message should be printed when calculating the checksum, no matter what the checksum log setting is.

1073741824 bytes = 1024³ bytes = 1 GiB

DataToolkitCommon.Store.STORE_PLUGIN (Constant)

Cache IO from data storage backends, by saving the contents to the disk.

Configuration

Store path

The directory that the store is maintained in can be set via the store.path configuration parameter.

config.store.path = "relative/to/datatoml"

The system default is ~/.cache/julia/datatoolkit, which can be overridden with the DATATOOLKIT_STORE environment variable.
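For example, to point the store elsewhere for the current shell session (the path shown is purely illustrative):

```shell
# Relocate the DataToolkit store for this session (illustrative path)
export DATATOOLKIT_STORE="$HOME/.local/share/datatoolkit"
```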

Disabling on a per-storage basis

Saving of individual storage sources can be disabled by setting the "save" parameter to false, i.e.

[[somedata.storage]]
save = false

Checksums

To ensure data integrity, a checksum can be specified, and checked when saving to the store. For example,

[[iris.storage]]
checksum = "k12:cfb9a6a302f58e5a9b0c815bb7e8efb4"

If you do not have a checksum, but wish for one to be calculated upon accessing the data, the checksum parameter can be set to the special value "auto". When the data is first accessed, a checksum will be generated and replace the "auto" value.

Instead of "auto", a particular checksum algorithm can be specified by name, e.g. "sha256". The currently supported algorithms are: k12 (KangarooTwelve), sha512, sha384, sha256, sha224, sha1, md5, and crc32c.

To explicitly specify no checksum, set the parameter to false.
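Putting the checksum options together (the data set names are illustrative):

```toml
[[iris.storage]]
checksum = "sha256"   # calculate a SHA-256 checksum on first access

[[scratch.storage]]
checksum = false      # explicitly opt out of checksumming
```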

Expiry/lifecycle

After a storage source is saved, the cache file can be made to expire after a certain period. This is done by setting the "lifetime" parameter of the storage, i.e.

[[updatingdata.storage]]
lifetime = "3 days"

The lifetime parameter accepts a few formats, namely:

ISO 8601 periods (with whole numbers only), both forms

  1. P[n]Y[n]M[n]DT[n]H[n]M[n]S, e.g.
    • P3Y6M4DT12H30M5S represents a duration of "3 years, 6 months, 4 days, 12 hours, 30 minutes, and 5 seconds"
    • P23DT23H represents a duration of "23 days, 23 hours"
    • P4Y represents a duration of "4 years"
  2. PYYYYMMDDThhmmss / P[YYYY]-[MM]-[DD]T[hh]:[mm]:[ss], e.g.
    • P0003-06-04T12:30:05
    • P00030604T123005

"Prose style" period strings, which are a repeated pattern of [number] [unit], where unit matches year|y|month|week|wk|w|day|d|hour|h|minute|min|second|sec, optionally followed by an "s", comma, or whitespace. E.g.

  • 3 years 6 months 4 days 12 hours 30 minutes 5 seconds
  • 23 days, 23 hours
  • 4d12h

By default, the first lifetime period begins at the Unix epoch. This means a daily lifetime will tick over at 00:00 UTC. The "lifetime_offset" parameter can be used to shift this. It can be set to a lifetime string, date/time-stamp, or number of seconds.

For example, to have the lifetime expire at 03:00 UTC instead, the lifetime offset could be set to three hours.

[[updatingdata.storage]]
lifetime = "1 day"
lifetime_offset = "3h"

We can produce the same effect by specifying a different reference point for the lifetime.

[[updatingdata.storage]]
lifetime = "1 day"
lifetime_offset = 1970-01-01T03:00:00

Store management

System-wide configuration can be set via the store config set REPL command, or by directly modifying the DataToolkitCommon.Store.getinventory().config struct.

A few (system-wide) settings determine garbage collection behaviour:

  • auto_gc (default 2): How often to automatically run garbage collection (in hours). Set to a non-positive value to disable.
  • max_age (default 30): The maximum number of days since a collection was last seen before it is removed from consideration.
  • max_size (default 53687091200, i.e. 50 GiB): The maximum (total) size of the store.
  • recency_beta (default 1): When removing items to avoid going over max_size, how much recency should be valued. Can be set to any value in (-∞, ∞). Larger (positive) values weight recency more, and negative values weight size more. -1 and 1 are equivalent.
  • store_dir (default store): The directory (either as an absolute path, or relative to the inventory file) that should be used for storage (IO) cache files.
  • cache_dir (default cache): The directory (either as an absolute path, or relative to the inventory file) that should be used for Julia cache files.
Base.parse (Method)
parse(Checksum, checksum::String)

Parse a string representation of a checksum in the format "type:value".

Example

julia> parse(Checksum, "k12:cfb9a6a302f58e5a9b0c815bb7e8efb4")
Checksum{16}(:k12, (0xcf, 0xb9, 0xa6, 0xa3, 0x02, 0xf5, 0x8e, 0x5a, 0x9b, 0x0c, 0x81, 0x5b, 0xb7, 0xe8, 0xef, 0xb4))
DataToolkitCommon.Store.__init__ (Method)
__init__()

Initialise the data store by:

  • Registering the plugins STORE_PLUGIN and CACHE_PLUGIN
  • Adding the "store" Data REPL command
  • Loading the user inventory
  • Registering the GC-on-exit hook
DataToolkitCommon.Store.epoch (Method)
epoch(storage::DataStorage, seconds::Real)

Return the epoch that seconds lies in, according to the lifetime specification of storage.

DataToolkitCommon.Store.expunge! (Method)
expunge!(inventory::Inventory, collection::CollectionInfo; dryrun::Bool=false)

Remove collection and all sources only used by collection from inventory.

If dryrun is set, no action is taken.

DataToolkitCommon.Store.fetch! (Method)
fetch!(collection::DataCollection)

When collection uses the store plugin, call fetch! on all of its data sets.

DataToolkitCommon.Store.fetch! (Method)
fetch!(storer::DataStorage)

If storer is storable (either by default, or explicitly enabled), open it, and presumably save it in the Store along the way.

DataToolkitCommon.Store.fileextension (Method)

fileextension(storage::DataStorage)

Determine the appropriate file extension for a file caching the contents of storage, "cache" by default.

DataToolkitCommon.Store.garbage_collect! (Method)
garbage_collect!(inv::Inventory; log::Bool=true, dryrun::Bool=false, trimmsg::Bool=false)

Examine inv, and garbage collect old entries.

If log is set, an informative message is printed giving an overview of actions taken.

If dryrun is set, no actions are taken.

If trimmsg is set, a message about any sources removed by trimming is emitted.

DataToolkitCommon.Store.garbage_trim_size! (Method)

garbage_trim_size!(inv::Inventory; dryrun::Bool=false)

If the sources in inv exceed the maximum size, remove sources in order of their size_recency_scores until inv falls below its maximum size.

If dryrun is set, no action is taken.

DataToolkitCommon.Store.getchecksum (Method)
getchecksum(storage::DataStorage, file::String)

Returns the Checksum for the file backing storage, or nothing if there is no checksum.

The checksum of file is checked against the recorded checksum in storage, if it exists.

DataToolkitCommon.Store.getchecksum (Method)
checksum(file::String, method::Symbol)

Calculate the checksum of file with method, returning the Unsigned result.

Method should be one of:

  • k12
  • sha512
  • sha384
  • sha256
  • sha224
  • sha1
  • md5
  • crc32c

Should method not be recognised, nothing is returned.

DataToolkitCommon.Store.getsource (Method)
getsource(inventory::Inventory, loader::DataLoader, as::Type)

Look for the source in inventory that backs the as form of loader, returning the source or nothing if none could be found.

DataToolkitCommon.Store.getsource (Method)
getsource(inventory::Inventory, storage::DataStorage)

Look for the source in inventory that backs storage, returning the source or nothing if none could be found.

DataToolkitCommon.Store.interpret_lifetime (Method)
interpret_lifetime(lifetime::String)

Return the number of seconds in the interval specified by lifetime, which is in one of two formats:

ISO 8601 periods (with whole numbers only), both forms

  1. P[n]Y[n]M[n]DT[n]H[n]M[n]S, e.g.
    • P3Y6M4DT12H30M5S represents a duration of "3 years, 6 months, 4 days, 12 hours, 30 minutes, and 5 seconds"
    • P23DT23H represents a duration of "23 days, 23 hours"
    • P4Y represents a duration of "4 years"
  2. PYYYYMMDDThhmmss / P[YYYY]-[MM]-[DD]T[hh]:[mm]:[ss], e.g.
    • P0003-06-04T12:30:05
    • P00030604T123005

"Prose style" period strings, which are a repeated pattern of [number] [unit], where unit matches year|y|month|week|wk|w|day|d|hour|h|minute|min|second|sec, optionally followed by an "s", comma, or whitespace. E.g.

  • 3 years 6 months 4 days 12 hours 30 minutes 5 seconds
  • 23 days, 23 hours
  • 4d12h
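For instance, with the package loaded, one would expect the following (the arithmetic follows from the fixed lengths of days and hours; the exact numeric return type is not shown here):

```julia
using DataToolkitCommon.Store: interpret_lifetime

interpret_lifetime("23 days, 23 hours")  # 23 * 86400 + 23 * 3600 = 2070000 seconds
interpret_lifetime("4d12h")              # 4 * 86400 + 12 * 3600 = 388800 seconds
```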
DataToolkitCommon.Store.load_inventory (Function)
load_inventory(path::String, create::Bool=true)

Load the inventory at path. If it does not exist, it will be created so long as create is set to true.

DataToolkitCommon.Store.modify_inventory! (Method)

modify_inventory!(modify_fn::Function (::Inventory) -> ::Any, inventory::Inventory)

Update inventory, modify it in-place with modify_fn, and then save the modified inventory.

DataToolkitCommon.Store.parsebytesize (Method)
parsebytesize(size::AbstractString)

Parse a string representation of size bytes into an integer.

This accepts any decimal value before an SI-prefixed "B" / "iB" unit (case-insensitive) with the "B" optionally omitted, separated and surrounded by any amount of whitespace.

Note that the SI prefixes are case sensitive, e.g. "kiB" and "MiB" are recognised, but "KiB" and "miB" are not.

Examples

julia> parsebytesize("123B")
123

julia> parsebytesize("44 kiB")
45056

julia> parsebytesize("1.2 Mb")
1200000
DataToolkitCommon.Store.refresh_sources! (Method)
refresh_sources!(inv::Inventory; inactive_collections::Set{UUID},
                 active_collections::Dict{UUID, Set{UInt64}})

Update the listed references of each source in inv, such that only references that are part of either inactive_collections or active_collections are retained.

References to active_collections are also checked against the given recipe hash and the known recipe hashes.

Sources with no references after this update are considered orphaned and removed.

The result is a named tuple giving a list of orphaned sources and the number of recipe checks that occurred.

DataToolkitCommon.Store.rhash (Function)

rhash(collection::DataCollection, dict::SmallDict, h::UInt=zero(UInt)) # Helper method

Individually hash each entry in dict, and then xor the results so the final value is independent of the ordering.

DataToolkitCommon.Store.rhash (Function)
rhash(collection::DataCollection, x, h::UInt)

Hash x with respect to collection, with special behaviour for the following types:

  • SmallDict
  • Dict
  • Vector
  • Pair
  • Type
  • QualifiedType
DataToolkitCommon.Store.rhash (Method)
rhash(loader::DataLoader{driver}, h::UInt=zero(UInt)) where {driver}

Hash the recipe specified by loader, or more specifically the various aspects of storage that could affect the loaded result.

The hash should be consistent across sessions, and unaffected by cosmetic changes.

DataToolkitCommon.Store.rhash (Method)
rhash(storage::DataStorage{driver}, h::UInt=zero(UInt)) where {driver}

Hash the recipe specified by storage, or more specifically the various aspects of storage that could affect the result.

DataToolkitCommon.Store.rhash (Method)

rhash(::Type{T}, h::UInt) # Helper method

Hash the field names and types of T recursively, or, if T is a primitive type, hash its name and parent module name.

DataToolkitCommon.Store.scan_collections (Method)
scan_collections(inv::Inventory)

Examine each collection in inv, and sort them into the following categories:

  • active_collections: data collections which are part of the current STACK
  • live_collections: data collections that still exist, but are not part of STACK
  • ghost_collections: collections that do not exist, but have been seen within the maximum age
  • dead_collections: collections that have not been seen within the maximum age

These categories are returned as a named tuple of the following form:

(; active_collections::Dict{UUID, Set{UInt64}},
   live_collections::Set{UUID},
   ghost_collections::Set{UUID},
   dead_collections::Vector{UUID})

The active_collections value gives both the data collection UUIDs and all known recipe hashes.

DataToolkitCommon.Store.shouldstore (Method)
shouldstore(storage::DataStorage)
shouldstore(loader::DataLoader, T::Type)

Returns true if storage/loader should be stored/cached, false otherwise.

DataToolkitCommon.Store.size_recency_scores (Function)
size_recency_scores(inventory::Inventory, sources::Vector{SourceInfo}, β::Number=1)

Produce a combined score for each of sources in inventory based on the size and (access) recency of the source, with small recent files scored higher than large older files. Files that do not exist are given a score of 0.0.

The combined score is a weighted harmonic mean, inspired by the F-score. More specifically, the combined score is $(1 + \beta^2) \cdot \frac{t \cdot s}{\beta^2 t + s}$ where $\beta$ is the recency factor, $t \in [0, 1]$ is the time score, and $s \in [0, 1]$ is the size score. When β is negative, the $\beta^2$ weighting is applied to $s$ instead.

DataToolkitCommon.Store.storefile (Method)
storefile(inventory::Inventory, storage::DataStorage)
storefile(inventory, loader::DataLoader, as::Type)

Returns a path for the source of storage/loader, or nothing if either the source or the path does not exist.

Should a source exist, but the file not, the source is removed from inventory.

DataToolkitCommon.Store.storefile (Method)
storefile(inventory::Inventory, source::SourceInfo)

Returns the full path for source in inventory, regardless of whether the path exists or not.

DataToolkitCommon.Store.storesave (Method)
storesave(inventory::Inventory, storage::DataStorage, ::Type{FilePath}, file::FilePath)

Save the file representing storage into inventory.

DataToolkitCommon.Store.storesave (Method)
storesave(inventory::Inventory, storage::DataStorage, ::Union{Type{IO}, Type{IOStream}}, from::IO)

Save the IO in from representing storage into inventory.

DataToolkitCommon.Store.update_inventory! (Method)
update_inventory!(path::String)
update_inventory!(inventory::Inventory)

Find the inventory specified by path/inventory in the INVENTORIES collection, and update it in-place. Should the inventory specified not be part of INVENTORIES, it is added.

Returns the up-to-date Inventory.

DataToolkitCommon.Store.update_source! (Method)
update_source!(inventory::Inventory,
               source::Union{StoreSource, CacheSource},
               collection::DataCollection)

Update the record for source in inventory, based on it having just been used by collection.

This will update the atime of the source, and add collection as a reference if it is not already listed.

Should the inventory file not be writable, nothing will be done.

DataToolkitCommon.ADDPKGS_PLUGIN (Constant)

Register the packages that the data collection requires.

Example usage

[config.packages]
CSV = "336ed68f-0bac-5ca0-87d4-7b16caf5d00b"

With the above configuration and this plugin, upon loading the data collection, the CSV package will be registered under the data collection's module.

DataToolkitCommon.DEFAULTS_PLUGIN (Constant)

Apply default values from the "defaults" data collection property. This works with both DataSets and AbstractDataTransformers.

Default DataSet property

[config.defaults]
description="Oh no, nobody bothered to describe this dataset."

Default AbstractDataTransformer property

This is scoped to a particular transformer, and a particular driver. One may also affect all drivers with the special "all drivers" key _. Specific-driver defaults always override all-driver defaults.

[config.defaults.storage._]
priority=0

[config.defaults.storage.filesystem]
priority=2
DataToolkitCommon.LOG_PLUGIN (Constant)

Log major data set events.

Settings

config.log.events = ["load", "save", "storage"] # the default

To log all event types unconditionally, simply set config.log.events to true.

Loggable events

  • load, when a loader is run
  • save, when a writer is run
  • storage, when storage is accessed, in read or write mode

Other transformers or plugins may extend the list of recognised events.

DataToolkitCommon.MEMORISE_PLUGIN (Constant)

Cache the results of data loaders in memory. This requires (dataset::DataSet, as::Type) to consistently identify the same loaded information.

Enabling caching of a dataset

[[mydata]]
memorise = true

memorise can be a boolean value, a type that should be memorised, or a list of types to be memorised.
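For example, to memorise only particular result types (the string form of the type names here is an assumption):

```toml
[[mydata]]
memorise = ["DataFrame", "Matrix"]
```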

DataToolkitCommon.VERSIONS_PLUGIN (Constant)

Give data sets versions, and identify them by version.

Giving data sets a version

Multiple editions of a data set can be described by using the same name, but setting the version parameter to differentiate them.

For instance, say that Ronald Fisher released a second version of the "Iris" data set, with more flowers. We could specify this as:

[[iris]]
version = "1"
...

[[iris]]
version = "2"
...

Matching by version

Version matching is done via the Identifier parameter "version". As shorthand, instead of providing the "version" parameter manually, the version can be tacked onto the end of an identifier with @, e.g. iris@1 or iris@2.

The version matching reuses machinery from Pkg, and so all Pkg-style version specifications are supported. In addition to this, one can simply request the "latest" version.

The following are all valid identifiers, using the @-shorthand:

iris@1
iris@~1
iris@>=2
iris@latest

When multiple data sets match the version specification, the one with the highest matching version is used.

DataToolkitCommon.dirof (Method)
dirof(collection::DataCollection)

Return the root directory for collection. In most cases, this will simply be the directory of the collection file, the two exceptions being:

  • When the directory is "Data.d", in which case the parent directory is given
  • When collection has no path, in which case the current working directory is used and a warning emitted (once only per collection).
DataToolkitCommon.getdefaults (Method)
getdefaults(collection::DataCollection)
getdefaults(dataset::DataSet)

Get the default parameters of the datasets of a certain data collection.

DataToolkitCommon.getdefaults (Method)

getdefaults(dataset::DataSet, ADT::Type{<:AbstractDataTransformer},
            driver::Symbol; resolvetype::Bool=true)

Get the default parameters of an AbstractDataTransformer of type ADT using driver attached to a certain dataset. The default type is resolved when resolvetype is set.

DataToolkitCommon.getdefaults (Method)

getdefaults(dataset::DataSet, ADT::Type{<:AbstractDataTransformer};
            spec::Dict, resolvetype::Bool=true)

Get the default parameters of an AbstractDataTransformer of type ADT, where the transformer driver is read from ADT if possible, and taken from spec otherwise. The default type is resolved when resolvetype is set.

DataToolkitCommon.humansize (Method)
humansize(bytes::Integer; digits::Int=2)

Determine the SI prefix for bytes, then give a tuple of the number of bytes with that prefix (rounded to digits), and the units as a string.

Examples

julia> humansize(123)
(123, "B")

julia> humansize(1234)
(1.2, "KiB")

julia> humansize(1000^3)
(954, "MiB")

julia> humansize(1024^3)
(1.0, "GiB")
DataToolkitCommon.loadtypepath (Method)
loadtypepath(subloaders::Vector{DataLoader}, targettype::Type)

Return the sequence of types that the subloaders must be asked for to finally produce targettype from an initial fromtype. If this is not possible, nothing is returned instead.

DataToolkitCommon.unzip (Function)
unzip(archive::IO, dir::String=pwd();
    recursive::Bool=false, log::Bool=false)

Unzip an archive to dir.

If recursive is set, nested zip files will be recursively unzipped too.

Set log to see unzipping progress.

DataToolkitCommon.REPLcmds.config_complete (Method)

config_complete(sofar::AbstractString; collection::DataCollection=first(STACK))

Provide completions for the existing TOML-style property path of collection starting with sofar.

DataToolkitCommon.REPLcmds.confirm_stack_first_writable (Method)
confirm_stack_first_writable(; quiet::Bool=false)

First call confirm_stack_nonempty, then return true if the first collection of STACK is writable.

Unless quiet is set, a warning message is emitted should this not be the case.

DataToolkitCommon.REPLcmds.init (Method)
init(input::AbstractString)

Parse and call the repl-format init command input. If required information is missing, the user will be interactively questioned.

input should be of the following form:

[NAME] [[at] PATH] [with [-n] [PLUGINS...]]
DataToolkitCommon.REPLcmds.plugin_list (Method)

plugin_list(input::AbstractString)

Parse and call the repl-format plugin list command input.

input should either be empty or '-a'/'--available'.

DataToolkitCommon.REPLcmds.stack_demote (Method)
stack_demote(input::AbstractString)

Parse and call the repl-format stack demote command input.

input should consist of a data collection identifier and optionally a demotion amount, either an integer or the character '*'.

DataToolkitCommon.REPLcmds.stack_list (Method)
stack_list(::AbstractString; maxwidth::Int=displaysize(stdout)[2])

Print a table listing all of the current data collections on the stack.

DataToolkitCommon.REPLcmds.stack_load (Method)
stack_load(input::AbstractString)

Parse and call the repl-format stack load command input.

input should consist of a path to a Data TOML file or a folder containing a Data.toml file. The path may be preceded by a position in the stack to be loaded to, either an integer or the character '*'.

input may also be the name of an existing data collection, in which case its path is substituted.

DataToolkitCommon.REPLcmds.stack_promote (Method)
stack_promote(input::AbstractString)

Parse and call the repl-format stack promotion command input.

input should consist of a data collection identifier and optionally a promotion amount, either an integer or the character '*'.

DataToolkitCommon.REPLcmds.stack_remove (Method)
stack_remove(input::AbstractString)

Parse and call the repl-format stack removal command input.

input should consist of a data collection identifier.

DataToolkitCommon.show_extra (Method)

show_extra(io::IO, dataset::DataSet)

Print extra information (namely its description) about dataset to io.

Advice point

This function call is advised within the repl_show invocation.