Description
FatDatasets is a package which gives access to a few datasets. It takes care of downloading, preprocessing, storing and loading the data. The datasets are considered as data dependencies, and are not contained in this repository but downloaded when requested. A few methods are also provided to generate random synthetic data.
All the datasets provided here are numerical datasets, i.e. each is a collection of n numerical samples in dimension d.
They are "unsupervised" datasets in the sense that no labels or any other additional information are provided with them.
Usage
Load datasets using load_dataset("dataset_name"). Depending on the dataset, the result will be either a matrix (data stored in a JLD file and loaded in memory) or an H5Dataset object, whose values can be accessed using (2D only) indexing (data stored in an HDF5 file and loaded dynamically).
Since Julia stores arrays in column-major order, we always use the convention columns = samples and rows = features.
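As a minimal sketch of this convention, the snippet below indexes samples as columns. The commented lines show how the call described above would look, assuming load_dataset is exported by the package. A random matrix stands in for the real data so that the example runs without downloading anything.

```julia
# With the package installed, the real call would be (per the text above):
#   using FatDatasets
#   X = load_dataset("fma_mfcc")   # expected size: 20 × 106,574 (d × n)
# A random stand-in with the same layout keeps this sketch runnable offline.
X = rand(20, 1000)

d, n = size(X)          # rows = features, columns = samples
x1 = X[:, 1]            # first sample, a vector of length d
batch = X[:, 1:100]     # first 100 samples, a d × 100 matrix
```

With this layout each sample is contiguous in memory, so slicing whole columns is the cheap access pattern.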
Available datasets
Dataset name | n | d | Format | Remarks |
---|---|---|---|---|
kddcup99 | 4,898,431 | 145 | JLD | Values between 0.0 and 1.0; contains converted categorical features. (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) |
kddcup99_small | 494,021 | 141 | JLD | 10% subset of kddcup99. Values between 0.0 and 1.0; contains converted categorical features. (http://kdd.ics.uci.edu/databases/kddcup99/kddcup99.html) |
fma | 106,574 | 518 | JLD | Misc. aggregated audio features with various statistics, e.g. min, max, mean, median, skew, kurtosis. (https://github.com/mdeff/fma) |
fma_mfcc | 106,574 | 20 | JLD | Contains mean MFCC features from the FMA dataset. (https://github.com/mdeff/fma) |
gowalla | 6,442,892 | 2 | JLD | Location data. (https://snap.stanford.edu/data/loc-gowalla.html) |
fasttext_XX | 2,000,000 | 300 | HDF5 | XX should be one of these language codes: "en", "fr", "de", "eo". Lines starting with the " character will be escaped, so the final dataset might contain only 1,999,999 entries. (https://fasttext.cc/docs/en/crawl-vectors.html) |
lfw | 13,233 | 62,500 | HDF5 | Raw vectorized 250×250 images of celebrity faces from the "Labeled Faces in the Wild" dataset. (http://vis-www.cs.umass.edu/lfw/) |
celeba (link broken?) | 202,599 | 38,804 | HDF5 | Raw vectorized 178×218 images of celebrity faces from the CelebA dataset. |
intel_lab | 2,313,153 | 5 | HDF5 | Time and misc. sensor measurements in the Intel lab at Berkeley. (http://db.csail.mit.edu/labdata/labdata.html) |
Synthetic datasets
The following functions can be used to generate data drawn from a mixture of Gaussians or lying on a low-dimensional subspace.
GMM_dataset()
GMM_stream()
lowrank_dataset()
lowrank_stream()
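The signatures of the functions above are not documented here, so the sketch below does not call them. It only illustrates the two kinds of data, using hypothetical helpers (sketch_gmm and sketch_lowrank, with assumed parameters such as k, r and σ) and the columns = samples convention.

```julia
using Random, LinearAlgebra

# Hypothetical helper (not the package's GMM_dataset): n samples in dimension d
# from a mixture of k isotropic Gaussians with uniform weights and std σ.
function sketch_gmm(d::Int, n::Int, k::Int; σ::Real = 0.1, rng = Random.default_rng())
    centers = randn(rng, d, k)       # one center per column
    labels = rand(rng, 1:k, n)       # mixture component of each sample
    return centers[:, labels] .+ σ .* randn(rng, d, n)
end

# Hypothetical helper (not the package's lowrank_dataset): n samples lying on
# a random r-dimensional subspace of the d-dimensional space.
function sketch_lowrank(d::Int, n::Int, r::Int; rng = Random.default_rng())
    basis = randn(rng, d, r)         # spans an r-dimensional subspace almost surely
    return basis * randn(rng, r, n)  # coordinates in that subspace, one sample per column
end

X = sketch_gmm(10, 500, 3)
Y = sketch_lowrank(10, 500, 2)
```

Both results are d × n matrices; rank(Y) can be used to check the dimension of the subspace.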
Notes for developers
Useful functions:
- compute_stats(path_to_hdf5_file): computes the mean and std of the dataset and stores them in the HDF5 file. Should be called after storing the file, for all HDF5-stored datasets.
- convert_categorical_columns!(df): removes categorical values of a DataFrame by adding new binary features.
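The two operations can be illustrated on plain in-memory arrays. This is a sketch of the ideas only, not the package's implementation: feature_stats and onehot are hypothetical names, the real functions work on an HDF5 file and a DataFrame respectively, and whether the package uses the corrected standard deviation is an assumption.

```julia
using Statistics

# Per-feature statistics under the columns = samples convention:
# reduce over dimension 2 (the samples), giving one value per row (feature).
# Note: std uses the corrected (n - 1) estimator by default.
feature_stats(X::AbstractMatrix) =
    (mean = vec(mean(X, dims = 2)), std = vec(std(X, dims = 2)))

# Hypothetical one-hot helper: replace one categorical column by binary
# features, one row per distinct category and one column per sample.
function onehot(values::AbstractVector)
    cats = sort(unique(values))
    return cats, Float64[v == c for c in cats, v in values]
end

X = [1.0 2.0 3.0; 10.0 10.0 10.0]        # 2 features, 3 samples
s = feature_stats(X)
cats, B = onehot(["tcp", "udp", "tcp"])  # B has 2 rows (tcp, udp) and 3 columns
```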
For new requests: the DataDep registration should be placed in a file in src/datasets/registrators and will be loaded at load time. Preprocessing functions can be placed directly in src/datasets (and are loaded with the module, provided they are explicitly included).
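A registration file could look roughly like the sketch below, which uses the register and DataDep calls from DataDeps.jl. The dataset name, message, URL and preprocessing function are all hypothetical placeholders, and the way this repository actually structures its registrators may differ.

```julia
using DataDeps

# Hypothetical preprocessing hook: DataDeps calls it with the local path
# of the downloaded file once the download has finished.
preprocess_my_dataset(path) = nothing   # placeholder, does nothing

# Registering does not download anything; the data is fetched the first
# time the dependency is resolved, e.g. with datadep"my_dataset".
register(DataDep(
    "my_dataset",                                        # hypothetical name
    "Hypothetical dataset, used only to illustrate the registration.",
    "https://example.org/my_dataset.csv",                # placeholder URL
    # a checksum can be given as an optional fourth positional argument
    post_fetch_method = preprocess_my_dataset,
))
```

In a package, DataDeps.jl recommends performing such registrations at load time (typically from the module's __init__ function), which matches the load-time behaviour described above.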
To do
- Fix the link to the celeba dataset, which seems to be broken.