CuratedDatasets

This package provides routines to load selected datasets from the OpenML and UCI repositories. It takes care of preprocessing and filtering the data and serves them as a `DataFrame` or a `Matrix`. The median pairwise Euclidean distance between data points has also been precomputed for each of these datasets.

Usage

Load the package with `using CuratedDatasets`. UCIData is an optional dependency: to load UCI datasets, you must also run `using UCIData` yourself.

Main method:

`X, y = load_dataset([T,] dataset_name)`, where `T` is either `Matrix` (the default if omitted; `X` is then d×n, and the keyword argument `transpose_X=false` disables this transposition if needed) or `DataFrame` (`X` then has n rows). By default, both `X` and `y` are standardized. To disable this, set the keyword arguments `standardize_X` and `standardize_y` to `false`.
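To make the layout and standardization conventions concrete, here is a minimal, self-contained sketch on a toy matrix. It assumes that standardization means rescaling each feature to zero mean and unit standard deviation; the package's exact procedure may differ. `standardize_features` is a hypothetical helper written for this illustration, not part of CuratedDatasets.

```julia
using Statistics

# Hypothetical helper (not part of CuratedDatasets): standardize each feature
# (row) of a d×n matrix to zero mean and unit standard deviation.
function standardize_features(X::AbstractMatrix{<:Real})
    μ = mean(X; dims=2)                    # d×1 matrix of per-feature means
    s = std(X; dims=2)                     # d×1 matrix of per-feature standard deviations
    s = map(v -> v == 0 ? one(v) : v, s)   # leave constant features unscaled
    return (X .- μ) ./ s
end

# Toy data: d = 2 features (rows), n = 4 points (columns).
X = [1.0  2.0  3.0  4.0;
     10.0 20.0 30.0 40.0]

Z = standardize_features(X)    # still d×n, one column per data point
Zt = permutedims(Z)            # n×d layout, one row per data point
```

In the d×n layout each data point is a column, which suits Julia's column-major storage when iterating over points.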

Additional methods:

`median_pairwise_Euclidean`: median of the pairwise Euclidean distances between data points (precomputed on a random subset).
`median_pairwise_Euclidean_standardized`: same as above, but for standardized data (precomputed on a random subset).
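For reference, the quantity described above can be computed along the following lines. This is a sketch under stated assumptions, not the package's implementation: `median_pairwise_distance`, its `max_points` keyword, and the default subset size are all hypothetical, and the data is assumed to be a d×n matrix with one point per column.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical helper (not the package's code): median Euclidean distance over
# all pairs of columns in a random subset of a d×n matrix.
function median_pairwise_distance(X::AbstractMatrix{<:Real};
                                  max_points::Integer=1000,
                                  rng::AbstractRNG=Random.default_rng())
    n = size(X, 2)
    n >= 2 || throw(ArgumentError("need at least two data points"))
    m = min(n, max_points)
    idx = randperm(rng, n)[1:m]          # random subset of column indices
    dists = Float64[]
    sizehint!(dists, m * (m - 1) ÷ 2)
    for a in 1:m-1, b in a+1:m           # visit each unordered pair once
        push!(dists, norm(X[:, idx[a]] - X[:, idx[b]]))
    end
    return median(dists)
end

# Three 1-D points at 0, 3 and 7: the pairwise distances are 3, 4 and 7.
X = [0.0 3.0 7.0]
median_pairwise_distance(X)   # 4.0
```

Subsampling keeps the cost at O(m²) distance evaluations rather than O(n²), which matters for the larger datasets listed below.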

Supported Datasets

Dataset name	n	d	Size (GB unless noted)	σ: median pairwise distance (raw / standardized)	Source (OML = OpenML dataset ID)	Remarks
fried	40768	10	0.003 / 12.3	1.27 / 4.40	OML #564
stock	59049	9	0.003 / 27.9	30.82 / 4.13	OML #1200
mlr_knn_rng	111753	10	0.008 / 99.9	10280.0 / 3.57	OML #42454	"distance" included (one-hot encoded); ignoring "dataset", "learner", "cpo".
methane	9199930	32	2.35 / 677 TB	91.54 / 6.69	OML #42701	Ignoring "MM263", "MM256", "year"
diamonds	53940	26	0.01 / 23.3	4.4 / 6.89	OML #42225
protein	45730	9	0.003 / 16.7	480978.0 / 3.046	OML #42903
house_8L	22784	8	0.001 / 4.15	889.6 / 4.15	OML #218
sulfur	10081	6	0.0004 / 0.8	0.571 / 3.14	OML #44145
elevators	16599	18	0.002 / 2.2	267.7 / 4.994	OML #216
sarcos	44484	21	0.007 / 15.8	26.15 / 6.063	OML #43873	Ignoring V23 to V28
year_prediction_msd	515345	90	0.37 / 2.1 TB / 3.3	3069.3 / 10.95	UCI
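
The two figures in the Size column appear consistent with the memory needed for the n×d data matrix and for a full n×n matrix of pairwise values, assuming 8-byte `Float64` entries and decimal units. This reading is inferred from the numbers rather than stated in the table, and some entries look truncated rather than rounded. The sketch below reproduces the arithmetic for two rows; `data_size_gb` and `pairwise_size_gb` are hypothetical helper names.

```julia
# Assumption: Float64 storage (8 bytes per entry), sizes in decimal GB (1e9 bytes).
const BYTES_PER_ENTRY = sizeof(Float64)

data_size_gb(n, d) = n * d * BYTES_PER_ENTRY / 1e9    # n×d data matrix
pairwise_size_gb(n) = n^2 * BYTES_PER_ENTRY / 1e9     # n×n matrix of pairwise values

# n and d are taken from the diamonds and methane rows of the table.
diamonds_data = round(data_size_gb(53940, 26); digits=2)              # 0.01 GB
diamonds_pairs = round(pairwise_size_gb(53940); digits=1)             # 23.3 GB
methane_pairs_tb = round(pairwise_size_gb(9199930) / 1000; digits=0)  # 677.0 TB
```

The gap between the two figures shows why a full pairwise matrix quickly becomes impractical as n grows, even when the data itself is small.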