Description

FatDatasets is a package that provides access to a few datasets. It takes care of downloading, preprocessing, storing and loading the data. The datasets are treated as data dependencies: they are not contained in this repository but are downloaded when requested. A few methods are also provided to generate random synthetic data.

All the datasets provided here are numerical datasets, i.e. each is a collection of n numerical samples in dimension d. They are "unsupervised" datasets in the sense that no labels or any additional information are provided with them.

Usage

Load a dataset using load_dataset("dataset_name"). Depending on the dataset, the result is either a matrix (data stored in a JLD file and loaded into memory) or an H5Dataset object (data stored in an HDF5 file and loaded dynamically), whose values can be accessed using 2D indexing only.

Since Julia uses column-major ordering, we always use the convention columns = samples and rows = features.
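As a minimal sketch of this convention, the snippet below uses a small hand-built matrix as a stand-in for an in-memory dataset; it does not call the package. The variable names are illustrative only. The same 2D indexing pattern is the one described above for H5Dataset objects.

```julia
using Statistics

# Stand-in for an in-memory dataset: d = 3 features (rows), n = 5 samples (columns).
X = reshape(collect(1.0:15.0), 3, 5)

d, n = size(X)                         # rows = features, columns = samples
first_sample = X[:, 1]                 # one sample is one column, contiguous in memory
feature_means = vec(mean(X, dims=2))   # average each feature over all samples
block = X[:, 2:4]                      # 2D indexing: all features of samples 2 to 4
```

Reading whole columns (samples) at a time matches the memory layout, which is why the samples are stored as columns.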

Available datasets

| Dataset name | n | d | Format | Remarks |
| --- | --- | --- | --- | --- |
| kddcup99 | 4,898,431 | 145 | JLD | Between 0.0 and 1.0, contains converted categorical features. (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) |
| kddcup99_small | 494,021 | 141 | JLD | 10% subset of kddcup99. Between 0.0 and 1.0, contains converted categorical features. (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) |
| fma | 106,574 | 518 | JLD | Misc. aggregated audio features with various stats, e.g. min, max, mean, median, skew, kurtosis. (https://github.com/mdeff/fma) |
| fma_mfcc | 106,574 | 20 | JLD | Contains mean MFCC features from the FMA dataset. (https://github.com/mdeff/fma) |
| gowalla | 6,442,892 | 2 | JLD | Location data. (https://snap.stanford.edu/data/loc-gowalla.html) |
| fasttext_XX | 2,000,000 | 300 | HDF5 | XX should be one of these language codes: "en", "fr", "de", "eo". Lines starting with the character " are escaped, so the final dataset might contain only 1,999,999 entries. (https://fasttext.cc/docs/en/crawl-vectors.html) |
| lfw | 13,233 | 62,500 | HDF5 | Raw vectorized 250×250 images of celebrity faces from the "Labeled Faces in the Wild" dataset. (http://vis-www.cs.umass.edu/lfw/) |
| celeba | 202,599 | 38,804 | HDF5 | Raw vectorized 178×218 images of celebrity faces from the CelebA dataset. (link broken?) |
| intel_lab | 2,313,153 | 5 | HDF5 | Time and misc. sensor measurements in the Intel lab at Berkeley. (http://db.csail.mit.edu/labdata/labdata.html) |

Synthetic datasets

The following functions can be used to generate data drawn from a mixture of Gaussians or lying on a low-dimensional subspace.

  • GMM_dataset()
  • GMM_stream()
  • lowrank_dataset()
  • lowrank_stream()
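The arguments of these functions are not documented here, so the sketch below does not call them. It only illustrates, under simple assumptions, what the two kinds of synthetic data look like in the columns = samples convention. The helpers sample_gmm and sample_lowrank are hypothetical names, and the choices made (equal weights, isotropic components, a random orthonormal basis) are assumptions rather than the package's behaviour. Only batch generation is shown, not streaming.

```julia
using Random, LinearAlgebra

# Hypothetical helper, not the package's GMM_dataset: draws n samples in dimension d
# from k isotropic Gaussians with equal weights and a shared standard deviation sigma.
function sample_gmm(rng::AbstractRNG, d::Int, n::Int, k::Int; sigma::Float64 = 0.1)
    centers = randn(rng, d, k)     # one center per column
    labels = rand(rng, 1:k, n)     # mixture component of each sample
    X = centers[:, labels] .+ sigma .* randn(rng, d, n)
    return X, labels
end

# Hypothetical helper, not the package's lowrank_dataset: draws n samples in dimension d
# lying on a random r-dimensional subspace.
function sample_lowrank(rng::AbstractRNG, d::Int, n::Int, r::Int)
    basis = Matrix(qr(randn(rng, d, r)).Q)[:, 1:r]   # orthonormal basis, d × r
    return basis * randn(rng, r, n)                  # d × n matrix of rank at most r
end

rng = MersenneTwister(0)
X_gmm, labels = sample_gmm(rng, 10, 1000, 3)
X_low = sample_lowrank(rng, 10, 1000, 2)
```

Both results are d × n matrices, so they can be used wherever an in-memory dataset is expected.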

Notes for developers

Useful functions:

  • compute_stats(path_to_hdf5_file): computes the mean and standard deviation of the dataset and stores them in the HDF5 file. It should be called after writing the file, for every HDF5-stored dataset.
  • convert_categorical_columns!(df): removes the categorical values of a DataFrame by replacing them with new binary features.
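To make these two operations concrete, here is a rough sketch in plain Julia that does not depend on the package, HDF5 or DataFrames. The helpers blockwise_stats and one_hot are hypothetical names and are not the package's implementations. The sketch assumes that the statistics are computed per feature, that the standard deviation is the population (uncorrected) one, and that a categorical column is expanded into one binary feature per distinct category. The category values are illustrative.

```julia
# Hypothetical helper, not the package's compute_stats: per-feature mean and
# population standard deviation, reading the data one block of columns at a time
# (as one would for a dataset too large to load at once).
function blockwise_stats(X::AbstractMatrix; blocksize::Int = 1024)
    d, n = size(X)
    s = zeros(d)     # running sum of each feature
    s2 = zeros(d)    # running sum of squares of each feature
    for start in 1:blocksize:n
        block = X[:, start:min(start + blocksize - 1, n)]
        s .+= vec(sum(block, dims=2))
        s2 .+= vec(sum(abs2, block, dims=2))
    end
    mu = s ./ n
    sigma = sqrt.(max.(s2 ./ n .- mu .^ 2, 0.0))   # clamp tiny negative values
    return mu, sigma
end

# Hypothetical helper, not the package's convert_categorical_columns!:
# encodes one categorical column as one binary row per distinct category.
function one_hot(values::AbstractVector)
    categories = sort(unique(values))
    encoded = zeros(Float64, length(categories), length(values))
    for (j, v) in enumerate(values)
        i = searchsortedfirst(categories, v)   # row of the category of sample j
        encoded[i, j] = 1.0
    end
    return encoded, categories
end

X = [1.0 2.0 3.0 4.0; 10.0 10.0 10.0 10.0]
mu, sigma = blockwise_stats(X; blocksize = 3)

protocol = ["tcp", "udp", "tcp", "icmp"]
encoded, categories = one_hot(protocol)
```

The sum-of-squares accumulation is simple but can lose precision on data with a large mean; a more careful implementation might use a numerically stable update instead.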

To add a new dataset: the DataDep registration should be placed in a file in src/datasets/registrators and will be loaded at load time. Preprocessing functions can be placed directly in src/datasets (they are loaded with the module, provided the file is explicitly included).

To do

  • Fix the link to the celeba dataset, which seems to be broken.