API Reference

Using datasets

The primary mechanism for loading datasets is the dataset function, coupled with open() to open the resulting DataSet as some Julia type. In addition, DataSets.jl provides two macros, @datafunc and @datarun, to help create program entry points and run them.

DataSets.dataset — Function
dataset(name)
dataset(project, name)

Returns the DataSet with the given name from project. If omitted, the global data environment DataSets.PROJECT will be used.

The DataSet is metadata, but to use the actual data in your program you need to use the open function to access the DataSet's content as a given Julia type.

Example

To open a dataset named "a_text_file" and read the whole content as a String,

content = open(String, dataset("a_text_file"))

To open the same dataset as an IO stream and read only the first line,

open(IO, dataset("a_text_file")) do io
    line = readline(io)
    @info "The first line is" line
end

To open a directory as a browsable tree object,

open(BlobTree, dataset("a_tree_example"))

DataSets.@datafunc — Macro
@datafunc function f(x::DT=>T, y::DS=>S...)
    ...
end

Define the function f(x::T, y::S, ...) and add data dispatch rules so that f(x::DataSet, y::DataSet) will open datasets matching the dataset types DT, DS as the Julia types T, S.
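
Example

A minimal sketch, assuming a dataset named "a_text_file" (as in the examples above) whose dataset type is Blob. This defines f so that it may be called directly with a DataSet, which is opened as a String:

@datafunc function f(x::Blob=>String)
    @info "File content" x
end

f(dataset("a_text_file"))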

DataSets.@datarun — Macro
@datarun [proj] func(args...)

Run func with the named DataSets from the list args.

Example

Load DataSets named a, b as defined in Data.toml, and pass them to f().

proj = DataSets.load_project("Data.toml")
@datarun proj f("a", "b")

Data environment

The global data environment for the session is defined by DataSets.PROJECT which is initialized from the JULIA_DATASETS_PATH environment variable. To load a data project from a particular TOML file, use DataSets.load_project.

DataSets.PROJECT — Constant

DataSets.PROJECT contains the default global data environment for the Julia process. It is created at initialization from the JULIA_DATASETS_PATH environment variable, which is a list of paths separated by : (or ; on Windows).

In analogy to Base.LOAD_PATH and Base.DEPOT_PATH, the path components are interpreted as follows:

  • @ means the path of the current active project as returned by Base.active_project(false). This can be useful when you're scripting and have a project-specific Data.toml which resides next to the Project.toml. This only applies to projects which are explicitly set with julia --project or Pkg.activate().
  • Explicit paths may be either directories or files in Data.toml format. For directories, the filename "Data.toml" is implicitly appended. expanduser() is used to expand the user's home directory.
  • As in DEPOT_PATH, an empty path component means the user's default Julia home directory, joinpath(homedir(), ".julia", "datasets")
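
For example, a hedged illustration combining all three kinds of component:

JULIA_DATASETS_PATH="@:~/datasets:"

would search the active project's Data.toml first, then ~/datasets/Data.toml (with expanduser() applied), then the default joinpath(homedir(), ".julia", "datasets").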

This simplified version of the code loading rules (LOAD_PATH/DEPOT_PATH) is used because it seems unlikely that we'll want data locations to be version-dependent in the same way that code is.

Unlike LOAD_PATH, JULIA_DATASETS_PATH is represented inside the program as a StackedDataProject, and users can add custom projects by defining their own AbstractDataProject subtypes.

Additional projects may be added or removed from the stack with pushfirst!, push! and empty!.
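
For example, a sketch which puts a project-local Data.toml in front of the default stack (assuming such a file exists in the current directory):

proj = DataSets.load_project("Data.toml")
pushfirst!(DataSets.PROJECT, proj)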

DataSets.load_project — Function
load_project(path; auto_update=false)
load_project(config_dict)

Load a data project from a system path referring to a TOML file. If auto_update is true, the returned project will monitor the file for updates and reload when necessary.

Alternatively, create a DataProject from an existing dictionary config_dict, which should be in the Data.toml format.

See also load_project!.
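
Example

A sketch, assuming a Data.toml file in the current directory which defines a dataset named "a_text_file":

proj = DataSets.load_project("Data.toml"; auto_update=true)
ds = dataset(proj, "a_text_file")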

DataSet metadata model

The DataSet is a holder for dataset metadata, including the type of the data and the method for access (the storage driver - see Storage Drivers). DataSets are managed in projects which may be stacked together. The library provides several subtypes of DataSets.AbstractDataProject for this purpose, which are listed below. (Most users will simply configure the global data project via DataSets.PROJECT.)

DataSets.DataSet — Type

A DataSet is a metadata overlay for data held locally or remotely which is unopinionated about the underlying storage mechanism.

The data in a DataSet has a type which implies an index; the index can be used to partition the data for processing.
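
For example, a sketch of how a DataSet may be declared as an entry in a Data.toml file, assuming the FileSystem storage driver; the uuid shown is only a placeholder:

[[datasets]]
description = "A text file"
name = "a_text_file"
uuid = "00000000-0000-0000-0000-000000000000"

    [datasets.storage]
    driver = "FileSystem"
    type = "Blob"
    path = "data/file.txt"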

DataSets.AbstractDataProject — Type

Subtypes of AbstractDataProject implement the following interface:

Must implement:

  • Base.get(project, dataset_name, default) — search
  • Base.keys(project) — get dataset names

Optional:

  • Base.iterate() — default implementation in terms of keys and get
  • Base.pairs() — default implementation in terms of keys and get
  • Base.haskey() — default implementation in terms of get
  • Base.getindex() — default implementation in terms of get
  • DataSets.project_name() — returns nothing by default.

Provided by AbstractDataProject (should not be overridden):

  • DataSets.dataset() — implemented in terms of get
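
For example, a minimal sketch of a custom project backed by a Dict (the name DictDataProject is hypothetical):

struct DictDataProject <: DataSets.AbstractDataProject
    datasets::Dict{String,DataSet}
end

# Required: look up a dataset by name, returning `default` if absent.
Base.get(proj::DictDataProject, name::AbstractString, default) =
    get(proj.datasets, name, default)

# Required: enumerate the dataset names in the project.
Base.keys(proj::DictDataProject) = keys(proj.datasets)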

DataSets.DataProject — Type
DataProject

A concrete data project is a collection of DataSets with associated names. Names are unique within the project.

DataSets.StackedDataProject — Type
StackedDataProject()
StackedDataProject(projects)

A stack of AbstractDataProjects, which are searched from the first to the last element of projects.

Additional projects may be added or removed from the stack with pushfirst!, push! and empty!.

See also DataSets.PROJECT.

DataSets.ActiveDataProject — Type

Data project, based on the location of the current explicitly selected Julia Project.toml, as reported by Base.active_project(false).

Several factors make the implementation a bit complicated:

  • The active project may change at any time without warning
  • The active project may be nothing when no explicit project is selected
  • There might be no Data.toml for the active project
  • The user can change Data.toml interactively and we'd like that to be reflected within the program.

Data Models for files and directories

DataSets provides some built-in data models, Blob and BlobTree, for accessing file- and directory-like data respectively. For modifying these, the functions newfile and newdir can be used, together with setindex! for BlobTree.

DataSets.Blob — Type
Blob(root)
Blob(root, relpath)

Blob represents the location of a collection of unstructured binary data. The location is a path relpath relative to some root data resource.

A Blob can naturally be open()ed as a Vector{UInt8}, but can also be mapped into the program as an IO byte stream, or interpreted as a String.

Blobs can be arranged into hierarchies ("directories") via the BlobTree type.
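
For example, a sketch of opening a Blob directly as raw bytes; the path here is hypothetical, and normally you would obtain a Blob via dataset instead:

blob = Blob(DataSets.FileSystemRoot("data/file.txt"))
bytes = open(Vector{UInt8}, blob)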

DataSets.BlobTree — Type
BlobTree(root)

BlobTree is a "directory tree" like hierarchy which may have Blobs and BlobTrees as children.

The tree implements the AbstractTrees.children() interface and may be indexed with paths to traverse the hierarchy down to the leaves ("files"), which are of type Blob. Individual leaves may be open()ed as various Julia types.

Example

Normally you'd construct these via the dataset function which takes care of constructing the correct root object. However, here's a direct demonstration:

julia> tree = BlobTree(DataSets.FileSystemRoot(dirname(pathof(DataSets))), path"../test/data")
📂 Tree ../test/data @ /home/chris/.julia/dev/DataSets/src
 📁 csvset
 📄 file.txt
 📄 foo.txt
 📄 people.csv.gz

julia> tree["csvset"]
📂 Tree ../test/data/csvset @ /home/chris/.julia/dev/DataSets/src
 📄 1.csv
 📄 2.csv

julia> tree[path"csvset"]
📂 Tree ../test/data/csvset @ /home/chris/.julia/dev/DataSets/src
 📄 1.csv
 📄 2.csv

DataSets.newfile — Function
newfile(func)
newfile(func, ctx)

Create a new temporary Blob object which may be later assigned to a permanent location in a BlobTree. If not assigned to a permanent location, the temporary file is cleaned up during garbage collection.

Example

tree[path"some/demo/path.txt"] = newfile() do io
    println(io, "Hi there!")
end

DataSets.newdir — Function
newdir()

Create a new temporary BlobTree which can have files assigned into it and may be assigned to a permanent location in a persistent BlobTree. If not assigned to a permanent location, the temporary tree is cleaned up during garbage collection.
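
Example

A sketch, assuming tree is a writable, persistent BlobTree (for instance, one opened from a dataset):

dir = newdir()
dir[path"file.txt"] = newfile() do io
    println(io, "Hello")
end
tree[path"some/demo"] = dir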

Storage Drivers

To add a new kind of data storage backend, register it with DataSets.add_storage_driver.

DataSets.add_storage_driver — Function
add_storage_driver(driver_name=>storage_opener)

Associate DataSet storage driver named driver_name with storage_opener. When a dataset with storage.driver == driver_name is opened, storage_opener(user_func, storage_config, dataset) will be called. Any existing storage driver registered to driver_name will be overwritten.

As a matter of convention, storage_opener should generally take configuration from storage_config, which is just dataset.storage. But to avoid configuration duplication it may also use the content of dataset (for example, dataset.uuid).

Packages which define new storage drivers should generally call add_storage_driver() within their __init__() functions.
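
For example, a minimal sketch of a driver; the name "MyStorage" and the function connect_mystorage are hypothetical, and the filesystem-backed root is reused purely for illustration:

function connect_mystorage(user_func, storage_config, dataset)
    # By convention, configuration comes from dataset.storage.
    path = storage_config["path"]
    # Hand the opened storage object to the user's function.
    user_func(Blob(DataSets.FileSystemRoot(path)))
end

function __init__()
    # Register the driver when the defining package is loaded.
    DataSets.add_storage_driver("MyStorage" => connect_mystorage)
end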