Module

Feature selection is an important step in machine learning. Its advantages include improved model performance and a lower risk of overfitting.

This technique selects features based on their importance, defined as how much impact each feature has on the target variable. This package helps select the important features based on correlation and p-value.

Selectors

Univariate feature selector

FeatureSelectors.UnivariateFeatureSelector — Type

UnivariateFeatureSelector has the following fields:

method::Function (required) - Method used to calculate feature importance. The chosen method determines the scoring. The scoring types, and the statistical methods available for each, are listed below.
- Correlation - higher score means more important
  - pearson_correlation
- P-value - lower score means more important
  - f_test
  - chisq_test
k::Union{Int64,Nothing} - Select the top k features with the highest correlation to the target variable. To ignore this, set k to nothing (the default).
threshold::Union{Float64,Nothing} - Select features with correlation greater than or equal to threshold. To ignore this, set threshold to nothing (the default).
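
The effect of k and threshold on a set of scores can be sketched in plain Julia. The helper pick_features below is hypothetical and not part of FeatureSelectors. It assumes correlation-style scores ranked by absolute value, and it assumes that both filters are applied when both are set; the package itself may behave differently.

```julia
# Hypothetical helper, not part of FeatureSelectors: filter a score table
# by an optional top-k cut and an optional minimum-score threshold.
# Assumption: scores are correlations, ranked by absolute value.
function pick_features(scores::Dict{String,Float64};
                       k::Union{Int,Nothing}=nothing,
                       threshold::Union{Float64,Nothing}=nothing)
    # Sort feature names from strongest to weakest absolute score.
    ranked = sort(collect(keys(scores)); by=name -> abs(scores[name]), rev=true)
    # Keep only features at or above the threshold, if one is given.
    if threshold !== nothing
        ranked = filter(name -> abs(scores[name]) >= threshold, ranked)
    end
    # Keep only the first k features, if k is given.
    if k !== nothing
        ranked = ranked[1:min(k, length(ranked))]
    end
    return ranked
end

scores = Dict("a" => 0.9, "b" => -0.7, "c" => 0.2)
top2 = pick_features(scores; k=2)              # two strongest features
strong = pick_features(scores; threshold=0.5)  # features with |score| >= 0.5
everything = pick_features(scores)             # no filter, strongest first
```

With these made-up scores, both the k=2 call and the threshold=0.5 call keep "a" and "b", while the unfiltered call returns all three names in ranked order.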

Supported methods

FeatureSelectors.pearson_correlation — Function

pearson_correlation(X_data::Matrix, y::Vector)

Calculate Pearson's correlation between the features in X_data and y.
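
As a rough illustration of what a per-feature Pearson score looks like, the sketch below uses cor from Julia's Statistics standard library. The name column_correlations is hypothetical, and this is not the package's implementation; it only assumes that each column of X_data is one feature.

```julia
using Statistics  # standard library; provides cor

# Illustrative only: one Pearson coefficient per column of X_data.
# column_correlations is a hypothetical name, not the package's function.
function column_correlations(X_data::Matrix{Float64}, y::Vector{Float64})
    return [cor(X_data[:, j], y) for j in 1:size(X_data, 2)]
end

X_data = [2.0 4.0;
          4.0 3.0;
          6.0 2.0;
          8.0 1.0]
y = [1.0, 2.0, 3.0, 4.0]

r = column_correlations(X_data, y)
# The first column rises with y and the second falls with y,
# so r[1] is close to 1 and r[2] is close to -1.
```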

FeatureSelectors.f_test — Function

f_test(X_data::Matrix, y::Vector)

Calculate p-values using the F-test.

FeatureSelectors.chisq_test — Function

chisq_test(X_data::Matrix, y::Vector)

Calculate p-values using the chi-square test.

Select feature function

FeatureSelectors.select_features — Function

select_features(selector,
                X::DataFrame,
                y::Vector;
                verbose::Bool=false)

Select features based on their importance to the target y, as defined by selector.method. If verbose is true, logs are printed; this defaults to false. The function returns only the names of the selected features.

If you have the feature data X_data as a Matrix and the feature names X_features as a Vector, you can replace X with X_data and X_features (in this order).

Example

julia> using RDatasets, FeatureSelectors, DataFrames

julia> boston = dataset("MASS", "Boston");

julia> selector = UnivariateFeatureSelector(method=pearson_correlation, k=5)
UnivariateFeatureSelector(FeatureSelectors.pearson_correlation, 5, nothing)

julia> select_features(
           selector,
           boston[:, Not(:MedV)],
           boston.MedV
       )
5-element Vector{String}:
 "LStat"
 "Rm"
 "PTRatio"
 "Indus"
 "Tax"

Other util functions

FeatureSelectors.calculate_feature_importance — Method

calculate_feature_importance(method::Function, X::DataFrame, y::Vector)

Calculate feature importance as defined by method. As with select_features, X can be passed as a DataFrame or split into X_data as a Matrix and X_features as a Vector.

This function returns a Dict with feature names as keys and scores as values.
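
The shape of that return value, for the Matrix plus names form, can be sketched as follows. Both importance_dict and column_scores are hypothetical names used only for illustration, and column_scores stands in for a scoring method under the assumption that a method returns one score per column.

```julia
using Statistics  # standard library; provides cor

# Hypothetical stand-in for a scoring method: one score per column.
column_scores(X_data, y) = [cor(X_data[:, j], y) for j in 1:size(X_data, 2)]

# Hypothetical helper, not the package's code: pair each feature name
# with the score computed for the matching column.
function importance_dict(method::Function, X_data::Matrix, X_features::Vector, y::Vector)
    scores = method(X_data, y)
    return Dict(zip(X_features, scores))
end

X_data = [1.0 3.0;
          2.0 2.0;
          3.0 1.0]
X_features = ["up", "down"]
y = [10.0, 20.0, 30.0]

importance = importance_dict(column_scores, X_data, X_features, y)
# importance["up"] is close to 1 and importance["down"] is close to -1.
```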

FeatureSelectors.one_hot_encode — Method

one_hot_encode(df::DataFrame;
               cols::Vector{Symbol}=Vector{Symbol}(),
               drop_original::Bool=false)

Utility function to perform one-hot encoding on a DataFrame. This adds new columns named <original_col_name>_<value>.

The following options can be passed to modify the behavior.

cols - Vector of Symbols specifying which columns to encode. Defaults to empty, which means all features are encoded.
drop_original - If true, the original (encoded) columns are dropped from the resulting DataFrame. Defaults to false.
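
The naming scheme can be illustrated on a single column without any packages. The function one_hot_column below is a hypothetical sketch, not the package's implementation; it returns a Dict of Bool vectors instead of a DataFrame to stay self-contained.

```julia
# Illustrative sketch, not the package's implementation: one-hot encode a
# single column of values into Bool columns named <col_name>_<value>.
function one_hot_column(col_name::String, column::Vector{String})
    encoded = Dict{String,Vector{Bool}}()
    for level in sort(unique(column))
        # Each new column marks the rows where the original value equals level.
        encoded["$(col_name)_$(level)"] = [v == level for v in column]
    end
    return encoded
end

sex = ["Male", "Female", "Male"]
encoded = one_hot_column("Sex", sex)
# encoded has the keys "Sex_Female" and "Sex_Male".
```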

Example

julia> using RDatasets, FeatureSelectors

julia> titanic = dataset("datasets", "Titanic");

julia> first(one_hot_encode(titanic[:, [:Class, :Sex, :Age]]), 3)
3×11 DataFrame
 Row │ Class    Sex      Age      Class_1st  Class_2nd  Class_3rd  Class_Crew  ⋯
     │ String7  String7  String7  Bool       Bool       Bool       Bool        ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ 1st      Male     Child         true      false      false       false  ⋯
   2 │ 2nd      Male     Child        false       true      false       false
   3 │ 3rd      Male     Child        false      false       true       false
                                                               4 columns omitted


julia> first(one_hot_encode(titanic[:, [:Class, :Sex, :Age]], cols=[:Class], drop_original=true), 3)
3×6 DataFrame
 Row │ Sex      Age      Class_1st  Class_2nd  Class_3rd  Class_Crew
     │ String7  String7  Bool       Bool       Bool       Bool
─────┼───────────────────────────────────────────────────────────────
   1 │ Male     Child         true      false      false       false
   2 │ Male     Child        false       true      false       false
   3 │ Male     Child        false      false       true       false