Description

FatDatasets is a package that provides access to a few datasets. It takes care of downloading, preprocessing, storing and loading the data. The datasets are treated as data dependencies: they are not contained in this repository but are downloaded when requested. A few methods are also provided to generate random synthetic data.

All the datasets provided here are numerical datasets, i.e. each is a collection of n numerical samples in dimension d. They are "unsupervised" datasets in the sense that no labels or any other additional information are provided with them.

Usage

Load a dataset with load_dataset("dataset_name"). Depending on the dataset, the result is either a matrix (data stored in a JLD file and loaded into memory) or an H5Dataset object (data stored in an HDF5 file and loaded dynamically), whose values can be accessed using 2D indexing only.

Because Julia uses column-major ordering, we always use the convention columns = samples and rows = features.
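
As a minimal sketch of this convention, the snippet below uses a random stand-in matrix rather than a real dataset, so nothing is downloaded. The commented-out load_dataset call is only indicative, and the sizes and variable names are arbitrary assumptions.

```julia
using Random, Statistics

# Stand-in for a matrix-backed dataset. With the package installed one would
# instead write something like:
#   using FatDatasets
#   X = load_dataset("fma_mfcc")
Random.seed!(0)
d, n = 20, 1_000                 # d features, n samples (arbitrary sizes)
X = rand(d, n)                   # rows = features, columns = samples

x1 = X[:, 1]                     # first sample: a vector of length d
batch = @view X[:, 1:100]        # first 100 samples, without copying

# Per-feature statistics reduce over the sample axis (dims = 2).
feature_mean = vec(mean(X; dims = 2))
feature_std = vec(std(X; dims = 2))
```

Since each column is contiguous in memory, slicing by sample (X[:, i]) is the cheap access pattern.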

Available datasets

Dataset name	n	d	Format	Remarks
kddcup99	4,898,431	145	JLD	Values between 0.0 and 1.0; contains converted categorical features. (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)
kddcup99_small	494,021	141	JLD	10% subset of kddcup99. Values between 0.0 and 1.0; contains converted categorical features. (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html)
fma	106,574	518	JLD	Misc. aggregated audio features with various stats e.g. min, max, mean, median, skew, kurtosis. (https://github.com/mdeff/fma)
fma_mfcc	106,574	20	JLD	Contains mean MFCC features from FMA dataset. (https://github.com/mdeff/fma)
gowalla	6,442,892	2	JLD	Localization data. (https://snap.stanford.edu/data/loc-gowalla.html)
fasttext_XX	2,000,000	300	HDF5	XX should be one of these language codes: "en", "fr", "de", "eo". Lines starting with the character " are escaped, so the final dataset may contain only 1,999,999 entries. (https://fasttext.cc/docs/en/crawl-vectors.html)
lfw	13,233	62,500	HDF5	Raw vectorized 250×250 images of celebrity faces from the “Labeled Faces in the Wild” dataset. (http://vis-www.cs.umass.edu/lfw/)
celeba	202,599	38,804	HDF5	Raw vectorized 178×218 images of celebrity faces from the CelebA dataset. (link broken?)
intel_lab	2,313,153	5	HDF5	Time and misc. sensor measurements in the Intel lab at Berkeley. (http://db.csail.mit.edu/labdata/labdata.html)

Synthetic datasets

The following functions can be used to generate data drawn from a mixture of Gaussians or lying on a low-dimensional subspace.

GMM_dataset()
GMM_stream()
lowrank_dataset()
lowrank_stream()
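
The arguments of these functions are not documented here. To illustrate the two kinds of data involved, here is a self-contained sketch using only the standard library. The helpers sketch_gmm and sketch_lowrank, their parameters and their defaults are hypothetical and are not the package's API or implementation.

```julia
using Random, LinearAlgebra

# Hypothetical helper: n samples in dimension d drawn from a mixture of
# k isotropic Gaussians with uniform weights.
function sketch_gmm(rng::AbstractRNG, d::Int, n::Int, k::Int; spread = 5.0, noise = 1.0)
    centers = spread .* randn(rng, d, k)     # one center per column
    labels = rand(rng, 1:k, n)               # mixture component of each sample
    X = centers[:, labels] .+ noise .* randn(rng, d, n)
    return X, centers, labels
end

# Hypothetical helper: n samples lying on a random r-dimensional subspace of R^d.
function sketch_lowrank(rng::AbstractRNG, d::Int, n::Int, r::Int)
    Q = Matrix(qr(randn(rng, d, r)).Q)       # d×r matrix with orthonormal columns
    return Q * randn(rng, r, n)              # d×n matrix of rank at most r
end

rng = MersenneTwister(42)
X_gmm, centers, labels = sketch_gmm(rng, 10, 500, 3)
X_lr = sketch_lowrank(rng, 10, 500, 2)
```

Both helpers return d×n matrices, in keeping with the columns = samples convention.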

Notes for developers

Useful functions:

compute_stats(path_to_hdf5_file): computes the mean and standard deviation of the dataset and stores them in the HDF5 file. Should be called after storing the file, for all HDF5-stored datasets.
convert_categorical_columns!(df): removes the categorical columns of a DataFrame by replacing them with new binary features.
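
To make the idea behind the categorical conversion concrete, the sketch below one-hot encodes a single column stored as a plain vector. The helper sketch_onehot and the example values are hypothetical; the package function operates on a DataFrame and may differ in its details.

```julia
# Hypothetical helper: replace one categorical column by one binary indicator
# column per distinct category. Rows are records, as in a DataFrame.
function sketch_onehot(values::AbstractVector)
    categories = sort(unique(values))
    # n×k matrix: entry (i, j) is 1.0 if record i belongs to category j.
    indicators = [Float64(v == c) for v in values, c in categories]
    return indicators, categories
end

protocol = ["tcp", "udp", "tcp", "icmp"]     # illustrative values
indicators, categories = sketch_onehot(protocol)
```

Each record ends up with exactly one nonzero indicator, so the encoded features stay within the 0.0 to 1.0 range.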

To add a new dataset: the DataDep registration should be placed in a file in src/datasets/registrators, from where it will be loaded at load time. Preprocessing functions can be placed directly in src/datasets; they are loaded with the module, provided the file is explicitly included.

To do

Fix the link to the celeba dataset, which seems to be broken.