API Reference

Public API

DataDeps.DataDep - Type
DataDep(
    name::String,
    message::String,
    remote_path::Union{String,Vector{String}...},
    [hash::Union{String,Vector{String}...},]; # Optional; if not provided, the computed hash is reported in a warning
    # keyword args (Optional):
    fetch_method=fetch_default, # (remote_filepath, local_directory_path)->local_filepath
    post_fetch_method=identity # (local_filepath)->Any
)

Required Fields

  • name: the name used to refer to this datadep
    • Corresponds to a folder name where the datadep will be stored.
    • It can have spaces or any other character that is allowed in a Windows filename (the characters Windows allows are a strict subset of those allowed in Unix filenames).
  • message: a message displayed to the user before they are asked if they want to download it
    • This is normally used to give a link to the original source of the data, a paper to be cited etc.
  • remote_path: where to fetch the data from
    • This is usually a string, or a vector of strings (or a vector of vectors... see Recursive Structure in the documentation for developers).

Optional Fields

  • hash: used to check whether the files downloaded correctly
    • By far the most common use is to just provide a SHA256 sum as a hex-string for the files.
    • If not provided, then a warning message with the SHA256 sum is displayed. This is to help package devs work out the sum for their files, without using an external tool. You can also calculate it using Preupload Checking in the documentation for developers.
    • If you want to use a different hashing algorithm, then you can provide a tuple (hashfun, targethex). hashfun should be a function which takes an IOStream and returns a Vector{UInt8}, such as any of the functions from SHA.jl (e.g. sha3_384, sha2_512) or md5 from MD5.jl.
    • If you want to use a different hashing algorithm, but don't know the sum, you can provide just the hashfun and a warning message will be displayed, giving the correct tuple of (hashfun, targethex) that should be added to the registration block.
    • If you don't want to provide a checksum because your data can change, pass in the type Any, which will suppress the warning messages. (But see the warnings about "what if my data is dynamic".)
    • Can take a vector of checksums, being one for each file, or a single checksum in which case the per file hashes are xored to get the target hash. (See Recursive Structure in the documentation for developers).
  • fetch_method=fetch_default: a function to run to download the files
    • Function should take 2 parameters (remote_filepath, local_directorypath), and must return the local filepath to the file downloaded.
    • Default (fetch_default) can correctly handle strings containing HTTP[S] URLs, or any remote_path type which overloads Base.basename and Base.download, e.g. AWSS3.S3Path.
    • Can take a vector of methods, being one for each file, or a single method, in which case that method is used to download all of them. (See Recursive Structure in the documentation for developers).
    • Overloading this lets you change things about how the download is done – the transport protocol.
    • The default is suitable for HTTP[/S], without auth. Modifying it can add authentication or an entirely different protocol (e.g. git, Google Drive, etc.).
    • This function is also responsible for working out what the local file should be called (as this is protocol dependent).
  • post_fetch_method: a function to run after the files have been downloaded
    • Should take the local filepath as its first and only argument. Can return anything.
    • Default is to do nothing.
    • Can do what it wants from there, but most likely wants to extract the file into the data directory.
    • To this end, DataDeps.jl includes a function, unpack, which extracts a compressed archive, deleting the original.
    • It should be noted that post_fetch_method runs from within the data directory.
      • This means that operations that just write to the current working directory (like rm or mv or run(`SOMECMD`)) just work.
      • You can call pwd() to get the data directory for your own functions. (Or dirname(local_filepath)).
    • Can take a vector of methods, being one for each file, or a single method, in which case that same method is applied to all of the files. (See Recursive Structure in the documentation for developers).
    • You can check this as part of Preupload Checking in the documentation for developers.
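
The hash forms described above can be sketched as follows. The datadep names, message, URL and hex digests are placeholders invented for illustration, and the sketch assumes that constructing a DataDep does not download anything.

```julia
using DataDeps
using SHA  # standard library; provides sha2_256, sha3_384, ...

# All names, URLs and digests below are placeholders, not real data.
message = """
    Dataset: Example Corpus (hypothetical)
    Website: https://example.com/corpus
    Please cite the original source when using this data.
    """
url = "https://example.com/corpus/data.zip"

# 1. Plain hex string: interpreted as a SHA256 sum.
dep_sha256 = DataDep("ExampleCorpus", message, url, "0" ^ 64;
    post_fetch_method = unpack)  # extract the archive after download

# 2. Tuple (hashfun, targethex): use a different hashing algorithm.
dep_sha3 = DataDep("ExampleCorpusSHA3", message, url, (sha3_384, "00" ^ 48))

# 3. `Any`: no checksum and no warning, for data that may change.
dep_dynamic = DataDep("ExampleCorpusDynamic", message, url, Any)
```

None of these objects is usable through datadep"..." until it has been passed to register.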

DataDeps.ManualDataDep - Type
ManualDataDep(name, message)

A DataDep for cases where the installation needs to be handled manually. This can be done via Pkg/git if you put the dependency into the package repo's /deps/data directory. More generally, message should give instructions on how to set up the data.

DataDeps.register - Function
register(datadep::AbstractDataDep)

Registers the given datadep to be globally available to the program. This makes datadep"Name" work. register should be run within the __init__ function of your module.

DataDeps.@datadep_str - Macro
`datadep"Name"` or `datadep"Name/file"`

Use this just like you would a file path, except that you can refer to the datadep by name. The name alone will resolve to the corresponding folder, even if that means it has to be downloaded first. Adding a path within it functions as expected.
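
A minimal sketch of how registration and the string macro fit together is given below. The module name, datadep name, URL and inner file name are all hypothetical; only DataDep, register, unpack and datadep"..." come from this page.

```julia
module ExamplePackage  # hypothetical package

using DataDeps

function __init__()
    # Registration happens at load time so that datadep"..." can find the name.
    register(DataDep(
        "ExampleCorpus",
        "Example Corpus (placeholder). See https://example.com/corpus",
        "https://example.com/corpus/data.zip";   # placeholder URL
        post_fetch_method = unpack,
    ))
end

# Resolves to the datadep folder; may prompt for download on first use.
corpus_dir() = datadep"ExampleCorpus"

# A path inside the datadep folder (the file name is hypothetical).
train_file() = datadep"ExampleCorpus/train.txt"

end # module
```

Because the macro calls sit inside functions, nothing is resolved or downloaded until corpus_dir() or train_file() is actually called.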

Base.download - Function
Base.download(
    datadep::DataDep,
    localdir;
    remotepath=datadep.remotepath,
    skip_checksum=false,
    i_accept_the_terms_of_use=nothing)

A method to download a datadep. Normally, you do not have to download a data dependency manually. If you simply cause the string macro datadep"DepName" to be executed, it will be downloaded if not already present.

Invoking this download method manually is normally for purposes of debugging. As such, it includes a number of parameters that most people will not want to use.

  • localdir: this is the local directory to save to.
  • remotepath: the remote path to fetch the data from, use this e.g. if you can't access the normal path where the data should be, but have an alternative.
  • skip_checksum: setting this to true causes the checksum to not be checked. Use this if the data has changed since the checksum was set in the registry, or for some reason you want to download different data.
  • i_accept_the_terms_of_use: use this to bypass the I agree to terms screen. Useful if you are scripting the whole process, or using another system to get confirmation of acceptance.
    • For automation purposes you can set the environment variable DATADEPS_ALWAYS_ACCEPT.
    • If not set, and if DATADEPS_ALWAYS_ACCEPT is not set, then the user will be prompted.
    • Strictly speaking these are not always terms of use, it just refers to the message and permission to download.

If you need more control than this, then your best bet is to construct a new DataDep object, based on the original, and then invoke download on that.
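
For a debugging session, the keyword arguments above might be combined as in the sketch below. debug_download is a hypothetical helper, not part of DataDeps, and the value "true" for DATADEPS_ALWAYS_ACCEPT is an assumption about what the environment variable accepts.

```julia
using DataDeps

# Hypothetical helper: fetch a datadep into a scratch directory for inspection.
# `mirror` lets the caller substitute an alternative remote path.
function debug_download(dep::DataDep; mirror = dep.remotepath)
    localdir = mktempdir()
    download(dep, localdir;
        remotepath = mirror,
        skip_checksum = true,               # do not verify the checksum
        i_accept_the_terms_of_use = true)   # bypass the interactive prompt
    return localdir
end

# For fully scripted runs, the environment variable can be set instead
# (assumed here to accept "true"):
ENV["DATADEPS_ALWAYS_ACCEPT"] = "true"
```

The helper only defines the call; nothing is downloaded until debug_download is invoked with a DataDep.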

Helpers

DataDeps.unpack - Function
unpack(f; keep_originals=false)

Extracts the content of an archive in the current directory, deleting the original archive unless the keep_originals flag is set.

Internal

DataDeps.preupload_check - Method
preupload_check(datadep, local_filepath[s])::Bool

Performs preupload checks on local files, without having to download them. This is a tool for creating or updating DataDeps, allowing the author to check the files before they are uploaded (or when they have been downloaded directly). The checks include verifying the checksum and making sure the post_fetch_method runs without errors. It basically performs datadep resolution, but bypasses the step of downloading the files. The results of running the post_fetch_method are not kept. As usual, if the DataDep being checked does not have a checksum, or if the checksum does not match, a warning message is displayed. Similarly, if the post_fetch_method throws an exception, a warning is displayed.

Returns: true or false, depending on if the checks were all good, or not.

Arguments:

  • datadep: Either an instance of a DataDep type, or the name of a registered DataDep as an AbstractString
  • local_filepath: a filepath or (recursive) list of filepaths. This is what would be returned by fetch in normal datadep use.
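
A possible workflow for checking a file before uploading it is sketched below. The datadep name, message and URL are placeholders, the stand-in file is created only to keep the sketch self-contained, and the qualified name DataDeps.preupload_check is used because this page does not say whether the function is exported.

```julia
using DataDeps
using SHA  # standard library

# Create a small stand-in file so the sketch is self-contained.
local_file = joinpath(mktempdir(), "data.csv")
write(local_file, "a,b\n1,2\n")

# Compute the SHA256 hex digest of the file with SHA.jl.
digest = bytes2hex(open(sha2_256, local_file))

dep = DataDep(
    "ExampleTable",                         # hypothetical name
    "Example table (placeholder message).",
    "https://example.com/data.csv",         # placeholder URL, not contacted here
    digest,
)

# Runs the checksum and the post_fetch_method against the local file.
ok = DataDeps.preupload_check(dep, local_file)
```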
DataDeps.resolve - Method
resolve("name/path", @__FILE__)

Is the function that lives directly behind the datadep"name/path" macro. If you are working with the names of the datadeps programmatically, and don't want to download them by mistake, it can be easier to work with this function.

Note though that you must include @__FILE__ as the second argument, as DataDeps.jl uses this to allow reading the package-specific deps/data directory. Advanced usage could specify a different file or nothing, but at that point you are on your own.

DataDeps.resolve - Method
resolve(datadep, inner_filepath, calling_filepath)

Returns a path to the folder containing the datadep, even if that means downloading the dependency and putting it there.

  • inner_filepath is the path to the file within the data dir
  • calling_filepath is a path to the file where this is being invoked from

This is basically the function that lives behind the string macro datadep"DepName/inner_filepath".

DataDeps._resolve - Method

The core of the resolve function without any user-friendly file handling. Returns the directory.

DataDeps.accept_terms - Method
accept_terms(datadep, localpath, remotepath, i_accept_the_terms_of_use)

Ensures the user accepts the terms of use; otherwise errors out.

DataDeps.better_readline - Function
better_readline(stream = stdin)

A version of readline that does not immediately return an empty string if the stream is closed. It will attempt to reopen the stream, and if that fails it throws an error.

DataDeps.checksum - Method
checksum(hasher=sha2_256, filename[/s])

Executes the hasher on the file or files, and returns a UInt8 array of the hash. If there are multiple files, the per-file hashes are xored together.
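
The xor combination can be illustrated with the SHA standard library alone. This is a sketch of the idea, not DataDeps' own implementation; hash_file and xored_hash are names made up for the example.

```julia
using SHA  # standard library

# Hash one file: `hasher` receives an open IO stream and returns a Vector{UInt8}.
hash_file(hasher, path) = open(hasher, path)

# Combine per-file hashes with element-wise xor.
xored_hash(hasher, paths) =
    reduce((a, b) -> xor.(a, b), [hash_file(hasher, p) for p in paths])

# Two small stand-in files so the sketch is self-contained.
dir = mktempdir()
files = [joinpath(dir, "a.txt"), joinpath(dir, "b.txt")]
write(files[1], "first file")
write(files[2], "second file")

combined = xored_hash(sha2_256, files)
println(bytes2hex(combined))
```

Since xor is commutative, the combined value does not depend on the order in which the files are listed.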

DataDeps.checksum_pass - Method
checksum_pass(hash, fetched_path)

Ensures the checksum passes, and handles the dialog with the user when it fails.

DataDeps.ensure_download_permitted - Method
ensure_download_permitted()

This function will throw an error if download functionality has been disabled. Otherwise it does nothing.

DataDeps.env_bool - Function
env_bool(key)

Checks for an environment variable and fuzzy converts it to a bool.

DataDeps.env_list - Function
env_list(key)

Checks for an environment variable and converts it to a list of strings, separated with a colon.

DataDeps.fetch_base - Method
fetch_base(remote_path, local_dir)

Download from remote_path to local_dir, via Base mechanisms. The download is performed using Base.download, and Base.basename(remote_path) is used to determine the filename. This is very limited in the case of HTTP, as the filename is not always encoded in the URL. But it does work for simple paths like "http://myserver/files/data.csv". In general for those cases prefer http_download.

The more important feature is that this works for anything that has overloaded Base.basename and Base.download, e.g. AWSS3.S3Path. While this doesn't work for all transport mechanisms (so some datadeps will still need a custom fetch_method), it works for many.
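
Where a custom fetch_method is needed, it only has to follow the (remote_filepath, local_directory_path) -> local_filepath contract. The sketch below copies from a path on a mounted filesystem; fetch_from_mount is a hypothetical name, and a real transport (git, authenticated HTTP, and so on) would replace the cp call.

```julia
# Hypothetical fetch_method: "download" by copying from a mounted path.
function fetch_from_mount(remote_filepath, local_directory_path)
    # The fetch method decides what the local file is called.
    local_filepath = joinpath(local_directory_path, basename(remote_filepath))
    cp(remote_filepath, local_filepath; force = true)
    return local_filepath  # must return the path of the fetched file
end

# Usage with stand-in directories:
src = joinpath(mktempdir(), "sample.bin")
write(src, "payload")
fetched = fetch_from_mount(src, mktempdir())
```

Such a function would be passed as fetch_method=fetch_from_mount when constructing the DataDep.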

DataDeps.fetch_http - Method
fetch_http(remotepath, localdir; update_period=5)

Pass in an HTTP[/S] URL and a directory to save it to, and it downloads that file, returning the local path it was downloaded to. This uses the HTTP protocol's method of defining filenames in headers, if that information is present.

update_period controls how often to print the download progress to the log. It is expressed in seconds. Progress is printed at @info level in the log. Per the signature above, the default is once every 5 seconds, though this depends on configuration.

DataDeps.handle_missing - Method
handle_missing(datadep::DataDep, calling_filepath)::String

This function is called when the datadep is missing.

DataDeps.input_choice - Method
input_choice

Prompts the user for one of a list of options. Takes a vararg of tuples of Letter, Prompt, Action (0 argument function).

Example:

input_choice(
    ('A', "Abort -- errors out", ()->error("aborted")),
    ('X', "eXit -- exits normally", ()->exit()),
    ('C', "Continue -- continues running", ()->nothing),
)

DataDeps.is_valid_name - Method
is_valid_name(name)

This checks if a datadep name is valid. This basically means it must be a valid folder name on Windows.

DataDeps.list_local_paths - Method
list_local_paths(name|datadep, [calling_filepath|module|nothing])

Lists all the local paths to a given datadep. This may be an empty list.

DataDeps.postfetch_check - Method
postfetch_check(post_fetch_method, local_path)

Executes the post_fetch_method on the given local path, in a temporary directory. Returns true if there are no exceptions. Performs in (async) parallel if multiple paths are given.

DataDeps.preferred_paths - Function
preferred_paths(calling_filepath; use_package_dir=true)

Returns the datadeps load path. If calling_filepath is provided, use_package_dir=true, and the calling file is currently inside a package directory, then the result also includes the path to the datadeps in that folder.

DataDeps.progress_update_period - Method
progress_update_period()

Returns the period between progress updates being logged. This is used by the default fetch_method, and it is generally a good idea to use it in any custom fetch method, if possible.

DataDeps.run_checksum - Method

Providing only a hash string results in defaulting to sha2_256, with that string being the target.

DataDeps.run_checksum - Method

If a vector of paths is provided and a vector of hashing methods (of any form), then they are all required to match.

DataDeps.run_checksum - Method

If only a function is provided, then assume the user is a developer wanting to know what hash-line to add to the Registration line.

DataDeps.run_checksum - Method

If nothing is provided, then assume the user is a developer wanting to know what sha2_256 hash-line to add to the Registration line.

DataDeps.run_checksum - Method
run_checksum(checksum, path)

This runs the checksum on the files at the fetched path and returns true or false based on whether the checksum matches (always true if no target sum is given). It is flexible and accepts different kinds of input to give different kinds of results.

If path (the second parameter) is a Vector, then unless checksum is also a Vector, the result is the xor of all the file checksums.

DataDeps.run_checksum - Method

Use Any to mark as not caring about the hash. Use this for data that can change.

DataDeps.run_fetch - Method
run_fetch(fetch_method, remotepath, localdir)

Executes the fetch_method on the given remotepath, downloading into the local directory, and returns the local path(s). Performs in (async) parallel if multiple paths are given.

DataDeps.run_post_fetch - Method
run_post_fetch(post_fetch_method, fetched_path)

Executes the post_fetch_method on the given fetched path. Performs in (async) parallel if multiple paths are given.

DataDeps.splitpath - Method
splitpath(path)

The opposite of joinpath: splits a path into each of its directory names, with the filename as the last element.

DataDeps.try_determine_load_path - Method
try_determine_load_path(name)

Tries to find a local path to the datadep with the given name. If it fails then it returns nothing.

DataDeps.try_determine_package_datadeps_dir - Method
try_determine_package_datadeps_dir(filepath)

Takes a path to a file. If that path is in a package's folder, then this returns a path to the deps/data dir for that package (as a Nullable), which may or may not exist. If not in a package, returns null.

DataDeps.uv_access - Method
uv_access(path, mode)

Check access to a path. Returns 2 results: first an error code (0 for all good), and second an error message. See https://stackoverflow.com/a/47126837/179081