CuratedDatasets

This package provides routines to load selected datasets from the OpenML and UCI repositories. It takes care of downloading the data (using the OpenML and UCIData packages) and of separating feature and target variables according to hard-coded rules (beware that some datasets might provide more target variables than the one hard-coded in this package). No coercion with autotype is performed; however, all variables which are not considered as targets are converted to Continuous (using a one-hot encoder if needed). Datasets can be loaded as a DataFrame or a Matrix, and can be standardized. The median pairwise Euclidean distance between data points has also been precomputed for these selected datasets.

Usage

Load the package with using CuratedDatasets. Since UCIData is an optional dependency, loading UCI datasets additionally requires running using UCIData manually.

Main method:

  • X, y = load_dataset([T,] dataset_name), where T can be either DataFrame (the default if omitted; X then has n rows) or Matrix (X is then d×n, but the keyword argument transpose_X=false can be used if the transpose is not wanted). By default, X is standardized but y is not; this can be changed using the keyword arguments standardize_X and standardize_y. The keyword argument force_X_continuous=false disables the automatic conversion of features to Continuous, in which case the feature dimension of the resulting dataset may differ from the numbers reported in the tables below.
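To make the standardization and the d×n layout concrete, here is a minimal sketch in plain Julia. The helper standardize_features is hypothetical and not part of the package. It assumes that standardizing means centering each feature and dividing by its (corrected) standard deviation, and that constant features are only centered; the package's actual rules may differ.

```julia
using Statistics

# Hypothetical helper (not part of the package): standardize each feature
# (row) of a d×n matrix to zero mean and unit standard deviation.
# Assumption: constant features are only centered, to avoid dividing by zero.
function standardize_features(X::AbstractMatrix{<:Real})
    μ = mean(X; dims=2)                        # d×1 per-feature means
    σ = std(X; dims=2)                         # d×1 per-feature standard deviations
    σ_safe = map(s -> s > 0 ? s : one(s), σ)   # replace zero deviations by 1
    return (X .- μ) ./ σ_safe
end

X = [1.0 2.0 3.0 4.0;    # feature 1
     5.0 5.0 5.0 5.0]    # feature 2 (constant)
Z = standardize_features(X)   # d×n, one column per data point
Zt = permutedims(Z)           # n×d, one row per data point
```

The constant-feature guard matters for datasets such as mushrooms, where some one-hot encoded features are constant.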

Additional methods:

  • median_pairwise_Euclidean: median value of pairwise inter-point Euclidean distance (precomputed on a random subset).
  • median_pairwise_Euclidean_standardized: same as above but for standardized data (precomputed on a random subset).
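For reference, the quantity these methods return can be sketched as follows. This is an illustration only, not the package's implementation: the helper median_pairwise_distance, its keyword arguments n_sub and rng, and the default subset size are all assumptions made for the example.

```julia
using LinearAlgebra, Random, Statistics

# Hypothetical helper (not the package's implementation): median Euclidean
# distance over all pairs of points in a random subset of the columns of X.
# X is d×n (one column per data point); at least two points are required.
function median_pairwise_distance(X::AbstractMatrix{<:Real};
                                  n_sub::Integer=1000,
                                  rng::AbstractRNG=Random.default_rng())
    n = size(X, 2)
    idx = randperm(rng, n)[1:min(n_sub, n)]   # subset sampled without replacement
    m = length(idx)
    dists = Float64[]
    sizehint!(dists, m * (m - 1) ÷ 2)
    for a in 1:m-1, b in a+1:m                # each unordered pair once
        push!(dists, norm(view(X, :, idx[a]) .- view(X, :, idx[b])))
    end
    return median(dists)
end

# Three points on a line: the pairwise distances are 1, 2 and 3.
X = [0.0 1.0 3.0;
     0.0 0.0 0.0]
d_med = median_pairwise_distance(X)
```

In this toy example the subset contains all three points, so d_med is 2.0. On large datasets the number of pairs grows quadratically with the subset size, which is a plausible reason for precomputing the value on a random subset.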

Supported Datasets

Datasets with continuous target variable (regression)

The target variable is always scalar.

| Dataset name | n | d | Target | Source | Remarks |
|---|---|---|---|---|---|
| fried | 40768 | 10 | "Y" (numeric) | OML #564 | |
| stock | 59049 | 9 | "company10" (numeric) | OML #1200 | |
| mlr_knn_rng | 111753 | 10 | "perf.logloss" (numeric) | OML #42454 | "distance" included (one-hot); ignoring "dataset"/"learner"/"cpo" |
| methane | 9199930 | 32 | "MM264" (numeric) | OML #42701 | Ignoring "MM263", "MM256", "year" |
| diamonds | 53940 | 26 | "price" (numeric) | OML #42225 | |
| protein | 45730 | 9 | "RMSD" (numeric) | OML #42903 | |
| house_8L | 22784 | 8 | "price" (numeric) | OML #218 | |
| sulfur | 10081 | 6 | "y1" (numeric) | OML #44145 | |
| elevators | 16599 | 18 | "Goal" (numeric) | OML #216 | |
| sarcos | 44484 | 21 | "V22" (numeric) | OML #43873 | Ignoring V23 to V28 |
| year_prediction_msd | 515345 | 90 | "target" | UCI | |

Datasets with categorical target variable (classification/clustering)

For these datasets, the target is coerced to Multiclass even when it is not encoded as such.

| Dataset name | n | d | Target | Source | Remarks |
|---|---|---|---|---|---|
| dna | 3186 | 180 | "class" (categorical) | OML #40670 | Features are already one-hot encoded in the dataset, and thus treated as Continuous. |
| segment | 2310 | 18 | "class" (categorical, string) | OML #36 | |
| mushrooms | 5644 | 125 | "class" (categorical, string) | OML #24 | Only one-hot encoded features; some features are constant and should not be normalized. |
| a8a | 32561 | 123 | "class" (categorical, ±1) | OML #32 | Obtained from the Adult UCI dataset. Features obtained with one-hot encoding and quantile-based discretization. |
| SVHN | 99289 | 3072 | "class" (categorical, 1-10) | OML #41081 | Street View House Numbers (classes = digits), 32×32 RGB images. |
| covertype-binary | 581012 | 53 | "Y" (categorical, ±1) | OML #293 | Forest cover type (originally from UCI). |
| covertype | 581012 | 54 | "class" (categorical, 1-7) | OML #1596 | Forest cover type (originally from UCI). Binary variables (Wilderness_AreaX and Soil_TypeX) coerced to Continuous. |

Coercion rules

Data is processed as follows:

  • the dataset is loaded using either the OpenML or the UCIData package;
  • it is converted to a DataFrame and missing values are removed;
  • hard-coded coercion rules are applied (indicated in the "Remarks" column of the above tables);
  • unpack is called, and the target variable (y) corresponds to the column indicated as "Target" in the above tables.
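
The steps above can be illustrated on a toy table in plain Julia. This sketch is a stand-in only: it uses a NamedTuple of columns instead of a DataFrame, a hand-written one-hot encoding instead of the package's coercion rules, and made-up column names (height, color, price) with price playing the role of the hard-coded target.

```julia
# Toy table stored as a NamedTuple of columns (a stand-in for a DataFrame).
# All column names and values here are invented for the illustration.
table = (
    height = [1.2, missing, 0.7, 2.1],
    color  = ["red", "blue", "red", "green"],
    price  = [10.0, 12.0, 8.0, 15.0],
)

# Step 1: drop every row that contains a missing value.
n_rows = length(first(table))
keep = [all(col -> !ismissing(col[i]), values(table)) for i in 1:n_rows]
complete = map(col -> col[keep], table)   # still a NamedTuple of columns

# Step 2: one-hot encode the categorical feature so that all features are numeric.
categories = sort(unique(complete.color))
onehot = [Float64(c == k) for k in categories, c in complete.color]  # (#categories)×n

# Step 3: separate the features from the target column.
X = vcat(permutedims(Float64.(complete.height)), onehot)   # d×n feature matrix
y = complete.price
```

Note how the feature dimension d grows from 2 columns to 3 rows of X after one-hot encoding, which is why the values of d reported in the tables can differ from the raw column counts.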