Fairness Datasets

To make it easy to try algorithms and metrics on various datasets, Fairness.jl provides several popular fairness datasets.

These datasets can be easily accessed using macros.

COMPAS Dataset


Macro to load the COMPAS dataset. It is a reduced version of the COMPAS dataset with 8 features and 6907 rows. The protected attributes are sex and race. The available features are used to predict whether a criminal defendant will recidivate (reoffend).

Returns (X, y)

julia> using Fairness

julia> X, y = @load_compas;

Adult Dataset


Macro to load the Adult dataset. It has 14 features and 32561 rows. The protected attributes are race and sex. This dataset is used to predict whether income exceeds 50K dollars per year.

Returns (X, y)

German Credit Dataset


Load the full version of the German credit dataset. This dataset has 20 features and 1000 rows. The protected attributes are gender_status and age (>25 is privileged). Using the 20 features, the task is to classify a person as a good or bad credit risk.

Returns (X, y)

Bank Marketing Dataset


The data relate to direct marketing campaigns of a Portuguese banking institution. The marketing campaigns were based on phone calls. Often, more than one contact with the same client was required in order to assess whether the product (bank term deposit) would be subscribed ('yes') or not ('no'). It has 20 features and 41188 rows. The protected attribute is marital.

Communities and Crime Dataset


The per capita violent crimes variable was calculated using population and the sum of crime variables considered violent crimes in the United States: murder, rape, robbery, and assault. It has 127 features and 1994 rows. The protected attributes are ....?

Student Performance Dataset


Student Performance Dataset. It has 395 rows and 30 features. The target attribute corresponds to grade G1. The target tells whether the student gets grade >= 12. The protected attribute is sex.
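The thresholding rule described above can be illustrated on a handful of made-up grades. This is only a sketch of the `grade >= 12` rule; the vector `g1` is hypothetical and is not taken from the dataset, and the macro itself may encode the target differently (for example as a categorical value).

```julia
# Hypothetical first-period grades (G1) for five students; not real data.
g1 = [10, 12, 15, 8, 12]

# Binary target: true when the grade is at least 12.
y = g1 .>= 12

println(y)
```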

Synthetic Datasets


genZafarData(n = 10000; d = pi/4)

Generate synthetic data from Zafar et al., 2017 Fairness Constraints: Mechanisms for Fair Classification.


  • n=10000 : number of samples
  • d=pi/4 : discrimination factor


  • X : DataFrame containing features and protected attribute z {"A", "B"} where z="B" is the protected group.
  • y : Binary Target variable {-1, 1}

genSubgroupData(n=10000, setting="B00")

Generate synthetic data from Loh et al., 2019 : Subgroup identification for precision medicine: A comparative review of 13 methods


  • n=10000 : number of samples
  • setting="B00" : Simulation data setting: one of "B00", ..., "B02", "B1", ... , "B8"

For "B00", ..., "B02" there is no "bias" in the data, i.e. group membership has no effect on y, whereas for "B1", ..., "B8" there is a direct effect of group membership z on y, usually mediated by one or more features.


  • X : DataFrame containing features and protected attribute z
  • y : Binary target variable

Generate synthetic data from Zafar et al., 2017 Fairness Beyond Disparate Treatment & Disparate Impact


  • n=10000 : number of samples


  • X : DataFrame containing features and protected attribute z
  • y : Binary Target variable

genBiasedSampleData(n=10000, sampling_bias=0.8)

Generate synthetic data: Biased sample


  • n=10000 : number of samples
  • sampling_bias=0.8 : Fraction of data belonging to the majority group.

The idea behind this simulation is that algorithms might fit the process in the majority group while disregarding the process in the minority group.

Two different processes for d1 and d2:

  • d1: logit(y) = 0.5(X1 + X2 + 0.3X4) + 2(I(X3 > 0))
  • d2: logit(y) = 0.5(0.3X1 + X2 + X4) + 2(I(X3 > 0.2))


  • X : DataFrame containing features and protected attribute z
  • y : Binary Target variable

Inspecting Datasets

To see the columns in a dataset, along with their types and scientific types, you can use schema from MLJ.

julia> using Fairness, MLJ

julia> X, y = @load_adult;

julia> schema(X)
│ _.names        │ _.types                          │ _.scitypes     │
│ age            │ Float64                          │ Continuous     │
│ workclass      │ CategoricalValue{String, UInt32} │ Multiclass{9}  │
│ fnlwgt         │ Float64                          │ Continuous     │
│ education      │ CategoricalValue{String, UInt32} │ Multiclass{16} │
│ education_num  │ Float64                          │ Continuous     │
│ marital_status │ CategoricalValue{String, UInt32} │ Multiclass{7}  │
│ occupation     │ CategoricalValue{String, UInt32} │ Multiclass{15} │
│ relationship   │ CategoricalValue{String, UInt32} │ Multiclass{6}  │
│ race           │ CategoricalValue{String, UInt32} │ Multiclass{5}  │
│ sex            │ CategoricalValue{String, UInt32} │ Multiclass{2}  │
│ capital_gain   │ Float64                          │ Continuous     │
│ capital_loss   │ Float64                          │ Continuous     │
│ hours_per_week │ Float64                          │ Continuous     │
│ native_country │ CategoricalValue{String, UInt32} │ Multiclass{42} │
_.nrows = 32561

Toy Data

This is a 10-row dataset that was used by the authors of the Reweighing algorithm. It is intended for testing ideas and evaluating metrics without having to compute predictions. It differs from the other macros in that it returns (X, y, ŷ) instead of (X, y).


Macro to read the CSV file of job data (data/jobs.csv) and convert its columns to categorical. Returns the tuple (X, y, ŷ).


Macro to create the fairness tensor for data/jobs.csv. The fairness tensor is created on the basis of the column Job Type, which has 3 different values.

julia> X, y, ŷ = @load_toydata;

julia> ft = @load_toyfairtensor
Fairness.FairTensor{3}([2 2; 0 0; 0 2]

[0 0; 2 1; 1 0], ["Board", "Education", "Healthcare"])
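The array printed above has one 2×2 slice per job type. The sketch below rebuilds it as a plain Julia array (without Fairness.jl) and sums each slice, to show how the 10 rows of the toy data are spread over the three groups. The variable names are illustrative; how the four cells of each slice map to prediction and ground-truth outcomes is not stated here, so the sketch only counts rows.

```julia
# Rebuild the 3×2×2 array shown in the REPL output: two 3×2 slices stacked along dim 3.
A = cat([2 2; 0 0; 0 2], [0 0; 2 1; 1 0]; dims = 3)
labels = ["Board", "Education", "Healthcare"]

# A[i, :, :] is the 2×2 count matrix for the i-th job type.
for (i, label) in enumerate(labels)
    println(label, " => ", A[i, :, :], " (", sum(A[i, :, :]), " rows)")
end

# Number of rows per group, and in total.
group_sizes = vec(sum(A; dims = (2, 3)))
println(group_sizes, " total = ", sum(A))
```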

Other Datasets

You can also work with the wide range of datasets available through OpenML. Refer to MLJ's OpenML documentation for the OpenML API. The id to be passed to OpenML.load can be found on the OpenML site.

julia> using MLJ, Fairness

julia> using DataFrames

julia> data = OpenML.load(1480); # load Indian Liver Patient Dataset

julia> df = DataFrame(data) ;

julia> y, X = unpack(df, ==(:Class), name->true); # Unpack the data into features and target

julia> y = coerce(y, Multiclass); # Specifies that the target y is of type Multiclass. It is otherwise a string.

julia> coerce!(X, :V2 => Multiclass, Count => Continuous); # Specifying which columns are Multiclass in nature. Converting from Count to Continuous enables use of more models.

Helper Functions


Checks whether the dataset is already present in the data directory and downloads it if it is not.


genGaussian(mean_in, cov_in, class_label, n)

Draw from a Gaussian distribution


  • mean_in : means
  • cov_in : covariances
  • class_label : class_label
  • n : number of samples to draw
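One common way to draw such samples is through a Cholesky factor of the covariance matrix. The following is a sketch under that assumption, not the package's actual implementation: `gaussian_sketch` is a hypothetical name, and returning the samples together with a vector of repeated labels is also an assumption.

```julia
using LinearAlgebra, Random

# Hypothetical stand-in for illustration; not the Fairness.jl implementation.
function gaussian_sketch(mean_in::AbstractVector, cov_in::AbstractMatrix,
                         class_label, n::Int;
                         rng::AbstractRNG = Random.default_rng())
    L = cholesky(Symmetric(cov_in)).L      # lower factor with cov_in = L * L'
    Z = randn(rng, n, length(mean_in))     # n rows of independent standard normals
    X = Z * L' .+ mean_in'                 # each row is now distributed N(mean_in, cov_in)
    labels = fill(class_label, n)          # every sample carries the same class label
    return X, labels
end

X, labels = gaussian_sketch([2.0, 2.0], [5.0 1.0; 1.0 5.0], 1, 5_000;
                            rng = MersenneTwister(42))
println("sample mean: ", vec(sum(X; dims = 1)) ./ size(X, 1))
```

With 5000 draws, the printed sample mean should land close to the requested mean of [2.0, 2.0].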

logit_fun(X, z, setting)

Compute y from X and z according to a setting provided in Loh et al., 2019: Subgroup identification for precision medicine: A comparative review of 13 methods


  • X : matrix of features
  • z : vector of group assignments
  • setting : Simulation data setting: one of "B00", ..., "B02", "B1", ... , "B8"