CuratedDatasets

This package provides routines to load selected datasets from the OpenML and UCI repositories. It takes care of downloading the data (using the OpenML and UCIData packages) and of separating feature and target variables according to hard-coded rules (beware that some datasets might provide more target variables than the one hard-coded in this package). No coercion to autotype is performed; however, all variables that are not treated as targets are converted to Continuous (using a one-hot encoder if needed). Datasets can be loaded as a DataFrame or a Matrix, and can be standardized. The median pairwise Euclidean distance between data points has also been precomputed for these datasets.

Usage

Load the package with using CuratedDatasets. UCIData is an optional dependency: to load UCI datasets, you must additionally run using UCIData yourself.

Main method:

X, y = load_dataset([T,] dataset_name), where T can be either DataFrame (the default if omitted; X then has n rows) or Matrix (X is then d×n; the keyword argument transpose_X=false can be used to avoid the transposition if needed). By default, X is standardized but y is not. This can be changed using the keyword arguments standardize_X and standardize_y. The keyword argument force_X_continuous=false can be used to avoid the automatic conversion to Continuous; the feature dimension of the resulting dataset may then differ from the values of d reported in the tables below.
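A short usage sketch of the calls described above. It only uses the function and keyword names given in this README; passing dataset names as strings is an assumption (the README does not state the argument type), and running it requires the package to be installed and network access for the download.

```julia
using CuratedDatasets
# using UCIData   # additionally needed for UCI datasets such as year_prediction_msd

# Assumption: dataset names are passed as strings.
# Default call: X is a DataFrame with n rows; X is standardized, y is not.
X, y = load_dataset("fried")

# Matrix output: X is d×n by default.
X, y = load_dataset(Matrix, "fried")

# Keyword arguments listed above.
X, y = load_dataset(Matrix, "fried"; transpose_X = false)
X, y = load_dataset("diamonds"; standardize_X = false, standardize_y = true)
X, y = load_dataset("diamonds"; force_X_continuous = false)
```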

Additional methods:

median_pairwise_Euclidean: median value of pairwise inter-point Euclidean distance (precomputed on a random subset).
median_pairwise_Euclidean_standardized: same as above but for standardized data (precomputed on a random subset).
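
For readers unfamiliar with the quantity, here is a minimal sketch of how a median pairwise Euclidean distance can be estimated on a random subset. The function median_pairwise_distance, its keyword n_sub and the fixed seed are hypothetical names chosen for illustration; this is not the code used to precompute the values shipped with the package.

```julia
using LinearAlgebra: norm
using Random: MersenneTwister, randperm
using Statistics: median

# Hypothetical helper (not part of CuratedDatasets): median pairwise Euclidean
# distance between the columns of a d×n matrix, estimated on a random subset.
function median_pairwise_distance(X::AbstractMatrix{<:Real};
                                  n_sub::Integer = 1000,
                                  rng = MersenneTwister(0))
    n = size(X, 2)
    m = min(n, n_sub)
    m >= 2 || throw(ArgumentError("need at least two points"))
    idx = randperm(rng, n)[1:m]            # random subset, without replacement
    dists = Float64[]
    sizehint!(dists, m * (m - 1) ÷ 2)
    for a in 1:m-1, b in a+1:m             # each unordered pair exactly once
        push!(dists, norm(view(X, :, idx[a]) .- view(X, :, idx[b])))
    end
    return median(dists)
end

# Toy data: three points in the plane, stored as columns (d = 2, n = 3).
X = [0.0 3.0 0.0;
     0.0 0.0 4.0]
d_med = median_pairwise_distance(X)        # pairwise distances are 3, 4 and 5
```

Such a median is commonly used as a default length scale for kernels (the "median heuristic"), which is presumably why it is provided for both raw and standardized data.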

Supported Datasets

Datasets with continuous target variable (regression)

The target variable is always scalar.

Dataset name	n	d	Target	Source	Remarks
fried	40768	10	"Y" (numeric)	OML #564
stock	59049	9	"company10" (numeric)	OML #1200
mlr_knn_rng	111753	10	"perf.logloss" (numeric)	OML #42454	"distance" included (one-hot encoded); ignoring "dataset", "learner", "cpo".
methane	9199930	32	"MM264" (numeric)	OML #42701	Ignoring "MM263", "MM256", "year"
diamonds	53940	26	"price" (numeric)	OML #42225
protein	45730	9	"RMSD" (numeric)	OML #42903
house_8L	22784	8	"price" (numeric)	OML #218
sulfur	10081	6	"y1" (numeric)	OML #44145
elevators	16599	18	"Goal" (numeric)	OML #216
sarcos	44484	21	"V22" (numeric)	OML #43873	Ignoring V23 to V28
year_prediction_msd	515345	90	"target"	UCI

Datasets with categorical target variable (classification/clustering)

For these datasets, the target is coerced to Multiclass even when it is not encoded as such.

Dataset name	n	d	Target	Source	Remarks
dna	3186	180	"class" (categorical)	OML #40670	Features are already one-hot encoded in the dataset, and thus treated as Continuous.
segment	2310	18	"class" (categorical, string)	OML #36
mushrooms	5644	125	"class" (categorical, string)	OML #24	Only one-hot encoded features; some features are constant and should not be normalized.
a8a	32561	123	"class" (categorical, ±1)	OML #32	Obtained from Adult UCI dataset. Features obtained with one-hot encoding and quantile-based discretization.
SVHN	99289	3072	"class" (categorical, 1-10)	OML #41081	Street View House Numbers (classes = digits), 32×32 RGB images
covertype-binary	581012	53	"Y" (categorical, ±1)	OML #293	Forest cover type (originally from UCI)
covertype	581012	54	"class" (categorical, 1-7)	OML #1596	Forest cover type (originally from UCI). Binary variables (Wilderness_AreaX and Soil_TypeX) coerced to Continuous.

Coercion rules

Data is processed as follows:

the dataset is loaded using either the OpenML or the UCIData package;
it is converted to a DataFrame and missing values are removed;
hard-coded coercion rules are applied (indicated in the "Remarks" column of the tables above);
unpack is called, and the target variable (y) corresponds to the column indicated in the "Target" column of the tables above.
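
To make these steps concrete, the sketch below mimics them on a toy table using only the standard library. It is an illustration under stated assumptions, not the package's implementation: the toy columns, the choice of the sample standard deviation, and the guard that leaves constant columns unscaled are all assumptions made here.

```julia
using Statistics: mean, std

# Toy table as a NamedTuple of columns; `color` is categorical and one row
# has a missing value (hypothetical data, for illustration only).
raw = (x1    = [1.0, 2.0, missing, 4.0],
       color = ["red", "blue", "red", "blue"],
       y     = [10.0, 20.0, 30.0, 40.0])

# Step 1: remove rows that contain a missing value in any column.
keep  = [!any(ismissing, (raw.x1[i], raw.color[i], raw.y[i])) for i in eachindex(raw.y)]
x1    = Float64.(raw.x1[keep])
color = raw.color[keep]
y     = raw.y[keep]                        # target column, kept separate from X

# Step 2: one-hot encode the categorical feature (one 0/1 column per level).
lvls   = sort(unique(color))
onehot = [Float64(c == l) for c in color, l in lvls]   # n × number of levels

# Step 3: assemble the n × d feature matrix and standardize each column.
X = hcat(x1, onehot)
μ = mean(X; dims = 1)
σ = std(X; dims = 1)
σ[σ .== 0] .= 1.0                          # assumption: leave constant columns unscaled
X_std = (X .- μ) ./ σ
```

Note how one-hot encoding changes the feature dimension: the toy table has two feature columns, but X has three, which is the same effect that makes d depend on force_X_continuous.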