Introduction

The problem with the current state of affairs

Data is beguiling. It can initially seem simple to deal with: "here I have a file, and that's it". However as soon as you do things with the data you're prone to be asked tricky questions like:

  • where's the data?
  • how did you process that data?
  • how can I be sure I'm looking at the same data as you?

This is no small part of the replication crisis.

image

Further concerns arise as soon as you start dealing with large quantities of data, or computationally expensive derived data sets. For example:

  • Have I already computed this data set somewhere else?
  • Is my generated data up to date with its sources/dependencies?

Generic tools exist for many parts of this problem, but there are some benefits that can be realised by creating a Julia-specific system, namely:

  • Having all pertinent environmental information in the data processing contained in a single Project.toml
  • Improved convenience in data loading and management, compared to a generic solution
  • Allowing datasets to be easily shared with a Julia package

In addition, the Julia community seems to have a strong tendency to NIH[NIH] tools, so we may as well get ahead of this and try to make something good 😛.

Pre-existing solutions

DataLad

  • Does a lot of things well
  • Puts information on how to create data in git commit messages (bad)
  • No data file specification

Kedro data catalog

Snakemake

  • Workflow manager, with remote file support
  • Snakemake Remote Files
  • Good list of possible file locations to handle
  • Drawback is that you have to specify the location you expect(S3, http, FTP, etc.)
  • No data file specification

Nextflow

  • Workflow manager, with remote file support
  • Docs on files and IO
  • Docs on S3
  • You just call file() and nextflow figures out under the hood the protocol whether it should pull it from S3, http, FTP, or a local file.
  • No data file specification
  • NIHNot Invented Here, a tendency to "reinvent the wheel" to avoid using tools from external origins — it would of course be better if you (re)made it.