CuratedDatasets

This package contains routines to load selected datasets from the OpenML and UCI repositories. It preprocesses and filters the data and serves them as a DataFrame or a Matrix. The median pairwise Euclidean distance between data points has also been precomputed for each of these datasets.

Usage

Load the package with using CuratedDatasets. UCIData is an optional dependency: to load UCI datasets, you must additionally run using UCIData yourself.

Main method:

  • X, y = load_dataset([T,] dataset_name), where T is either Matrix (the default if omitted; X is then d×n, and the keyword argument transpose_X=false can be set if needed) or DataFrame (X then has n rows). By default, both X and y are standardized. To disable this, set the keyword arguments standardize_X and standardize_y to false.
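To make the layout and standardization conventions above concrete, here is a small sketch on toy data. It does not call the package: standardize_features is a hypothetical helper, and treating "standardized" as zero mean and unit variance per feature is an assumption about what load_dataset does, not something taken from its source.

```julia
using Statistics

# Toy data in the n×d "table" layout: 4 points (rows), 2 features (columns).
X_rows = [1.0 10.0;
          2.0 20.0;
          3.0 30.0;
          4.0 40.0]

# d×n layout, as described for the Matrix output: each column is one data point.
X_cols = permutedims(X_rows)

# Hypothetical helper (assumption: zero mean, unit variance per feature).
# In the d×n layout a feature is a row, hence dims=2.
standardize_features(X) = (X .- mean(X; dims=2)) ./ std(X; dims=2)

Xs = standardize_features(X_cols)

println(size(X_cols))        # dimensions in the d×n layout
println(mean(Xs; dims=2))    # per-feature means after standardization
println(std(Xs; dims=2))     # per-feature standard deviations after standardization
```

With the d×n layout, the i-th data point is the column X[:, i], which is contiguous in memory for Julia's column-major arrays.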

Additional methods:

  • median_pairwise_Euclidean: median of the pairwise Euclidean distances between data points (precomputed on a random subset).
  • median_pairwise_Euclidean_standardized: same as above, but for the standardized data (also precomputed on a random subset).
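For reference, a quantity of this kind could be computed roughly as follows. This is a minimal sketch, not the package's implementation: the function name median_pairwise_distance, the subset size m, and the use of all pairs within the subset are assumptions made for illustration.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical helper (not the package's code): median Euclidean distance
# over all pairs within a random subset of at most `m` columns of X (d×n).
# Requires at least two data points.
function median_pairwise_distance(X::AbstractMatrix; m::Int=1000,
                                  rng=Random.default_rng())
    n = size(X, 2)
    idx = randperm(rng, n)[1:min(m, n)]   # random subset of column indices
    dists = Float64[]
    for a in 1:length(idx)-1, b in a+1:length(idx)
        push!(dists, norm(X[:, idx[a]] - X[:, idx[b]]))
    end
    return median(dists)
end

# Three 1-D points at 0, 3 and 4: the pairwise distances are 3, 4 and 1.
X = [0.0 3.0 4.0]
println(median_pairwise_distance(X))   # prints 3.0
```

Subsampling keeps the cost quadratic in m rather than in n, which matters for the larger datasets listed below.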

Supported Datasets

| Dataset name | n | d | Size | σ (no std / std) | Source | Remarks |
|---|---|---|---|---|---|---|
| fried | 40768 | 10 | 0.003 / 12.3 | 1.27 / 4.40 | OML #564 | |
| stock | 59049 | 9 | 0.003 / 27.9 | 30.82 / 4.13 | OML #1200 | |
| mlr_knn_rng | 111753 | 10 | 0.008 / 99.9 | 10280.0 / 3.57 | OML #42454 | "distance" inc. (one-hot), Ignoring "dataset"/"learner"/"cpo". |
| methane | 9199930 | 32 | 2.35 / 677 TB | 91.54 / 6.69 | OML #42701 | Ignoring "MM263", "MM256", "year" |
| diamonds | 53940 | 26 | 0.01 / 23.3 | 4.4 / 6.89 | OML #42225 | |
| protein | 45730 | 9 | 0.003 / 16.7 | 480978.0 / 3.046 | OML #42903 | |
| house_8L | 22784 | 8 | 1e-3 / 4.15 | 889.6 / 4.15 | OML #218 | |
| sulfur | 10081 | 6 | 4e-4 / 0.8 | 0.571 / 3.14 | OML #44145 | |
| elevators | 16599 | 18 | 0.002 / 2.2 | 267.7 / 4.994 | OML #216 | |
| sarcos | 44484 | 21 | 0.007 / 15.8 | 26.15 / 6.063 | OML #43873 | Ignoring V23 to V28 |
| year_prediction_msd | 515345 | 90 | 0.37 / 2.1 TB / 3.3 | 3069.3 / 10.95 | UCI | |