CuratedDatasets
This package contains routines to load selected datasets from OpenML and UCI repositories.
It takes care of preprocessing/filtering the data and serving them as `DataFrame` or `Matrix`.
The median pairwise Euclidean distance between data points has also been precomputed for these selected datasets.
Usage
Load the package with `using CuratedDatasets`. `UCIData` is an optional dependency: to load UCI datasets, you must also run `using UCIData` manually.
Main method:
X, y = load_dataset([T,] dataset_name)
where `T` can be either `Matrix` (the default if omitted; `X` is then d×n, but the keyword argument `transpose_X=false` can be used if needed) or `DataFrame` (`X` then has `n` rows). By default, both `X` and `y` are normalized. This can be avoided by setting the keyword arguments `standardize_X` and `standardize_y` to `false`.
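To make the layout and normalization conventions above concrete, here is a minimal sketch on synthetic data using only the standard library. The helpers `standardize_rows` and `standardize_vec` are hypothetical names introduced for illustration. The exact normalization the package applies (for instance, whether it uses the corrected sample standard deviation) is not stated here, so per-feature zero mean and unit standard deviation is an assumption.

```julia
using Statistics, Random

# Synthetic stand-in for a dataset: d = 3 features, n = 500 points, stored d×n
# (one data point per column), with very different scales per feature.
rng = MersenneTwister(0)
X = randn(rng, 3, 500) .* [1.0, 10.0, 100.0] .+ [0.0, 5.0, -20.0]
y = randn(rng, 500) .* 4.0 .+ 2.0

# Hypothetical helper (assumption, not the package's code): rescale each
# feature, i.e. each row of a d×n matrix, to zero mean and unit std.
function standardize_rows(X::AbstractMatrix)
    μ = mean(X; dims=2)
    σ = std(X; dims=2)
    return (X .- μ) ./ σ
end

# Hypothetical helper for the target vector.
standardize_vec(y::AbstractVector) = (y .- mean(y)) ./ std(y)

Xs = standardize_rows(X)
ys = standardize_vec(y)

# The same data with one point per row (n×d), as in a table-like layout.
Xt = permutedims(Xs)
```

With the package installed, the raw (non-normalized) data would instead be requested by passing `standardize_X=false` and `standardize_y=false` to `load_dataset`, as described above.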
Additional methods:
- `median_pairwise_Euclidean`: median value of the pairwise inter-point Euclidean distance (precomputed on a random subset).
- `median_pairwise_Euclidean_standardized`: same as above, but for standardized data (precomputed on a random subset).
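For readers who want to see what such a quantity looks like, the following is a rough sketch of a median pairwise Euclidean distance estimated on a random subset of points. The function `median_pairwise_distance`, its `max_points` keyword, and the subset size are all hypothetical and chosen for illustration. This is not the package's implementation, and the precomputed values may have been obtained differently.

```julia
using LinearAlgebra, Statistics, Random

# Hypothetical helper (assumption): X is d×n with one point per column and
# at least two points. At most `max_points` columns are sampled without
# replacement, and the median over all pairs in the sample is returned.
function median_pairwise_distance(X::AbstractMatrix; max_points::Int=1000,
                                  rng::AbstractRNG=Random.default_rng())
    n = size(X, 2)
    m = min(n, max_points)
    idx = randperm(rng, n)[1:m]          # random subset of column indices
    S = X[:, idx]
    dists = Float64[]
    sizehint!(dists, m * (m - 1) ÷ 2)    # number of unordered pairs
    for j in 2:m, i in 1:j-1
        push!(dists, norm(view(S, :, i) .- view(S, :, j)))
    end
    return median(dists)
end

# Three points (0,0), (3,0), (0,4): the pairwise distances are 3, 4 and 5.
P = [0.0 3.0 0.0;
     0.0 0.0 4.0]
σ_demo = median_pairwise_distance(P)
```

Sampling keeps the cost quadratic in the subset size rather than in `n`, which matters for the larger datasets, where computing all pairs would be impractical.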
Supported Datasets
Dataset name | n | d | Size | σ (no std/std) | Source | Remarks |
---|---|---|---|---|---|---|
fried | 40768 | 10 | 0.003 / 12.3 | 1.27 / 4.40 | OML #564 | |
stock | 59049 | 9 | 0.003 / 27.9 | 30.82 / 4.13 | OML #1200 | |
mlr_knn_rng | 111753 | 10 | 0.008 / 99.9 | 10280.0 / 3.57 | OML #42454 | "distance" inc. (one-hot), Ignoring "dataset"/"learner"/"cpo". |
methane | 9199930 | 32 | 2.35 / 677 TB | 91.54 / 6.69 | OML #42701 | Ignoring "MM263", "MM256", "year" |
diamonds | 53940 | 26 | 0.01 / 23.3 | 4.4 / 6.89 | OML #42225 | |
protein | 45730 | 9 | 0.003 / 16.7 | 480978.0 / 3.046 | OML #42903 | |
house_8L | 22784 | 8 | 1e-3 / 4.15 | 889.6 / 4.15 | OML #218 | |
sulfur | 10081 | 6 | 4e-4 / 0.8 | 0.571 / 3.14 | OML #44145 | |
elevators | 16599 | 18 | 0.002 / 2.2 | 267.7 / 4.994 | OML #216 | |
sarcos | 44484 | 21 | 0.007 / 15.8 | 26.15 / 6.063 | OML #43873 | Ignoring V23 to V28 |
year_prediction_msd | 515345 | 90 | 0.37 / 2.1 TB / 3.3 | 3069.3 / 10.95 | UCI | |