Module
Feature selection is an important step in machine learning. Its advantages include better model performance and a lower risk of overfitting.
This technique selects features based on their importance, defined as how much impact each feature has on the target variable. This package helps select the important features based on correlation and p-value.
Selectors
Univariate feature selector
FeatureSelectors.UnivariateFeatureSelector — Type
UnivariateFeatureSelector has the following fields:
method::Function (required) - Method used to calculate feature importance. The method chosen determines the kind of score. The score types and the statistical methods that produce them are:
Correlation - higher score means more important
- pearson_correlation
P-value - lower score means more important
- f_test
- chisq_test
k::Union{Int64,Nothing} - Select the top k features with the highest correlation to the target variable. You can ignore this by setting k to nothing. This defaults to nothing.
threshold::Union{Float64,Nothing} - Select features with correlation greater than or equal to threshold. To ignore, simply set threshold to nothing (default behavior).
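To make the roles of k and threshold concrete, here is a minimal sketch of score-based selection written with plain Julia only. It is not the package's implementation: pick_features, its higher_is_better keyword, and the choice to apply the threshold before k are all assumptions made for illustration, and the scores are made up.

```julia
# Hypothetical helper (not part of FeatureSelectors): choose feature names
# from precomputed scores. `higher_is_better=true` models correlation-style
# scores; set it to false for p-value-style scores.
function pick_features(feature_names::Vector{String}, scores::Vector{Float64};
                       k::Union{Int,Nothing}=nothing,
                       threshold::Union{Float64,Nothing}=nothing,
                       higher_is_better::Bool=true)
    # Rank feature indices from most to least important.
    order = sortperm(scores; rev=higher_is_better)
    # Assumption: the threshold filter is applied first, if one is given.
    if threshold !== nothing
        passes(s) = higher_is_better ? s >= threshold : s <= threshold
        order = filter(i -> passes(scores[i]), order)
    end
    # Then keep at most k of the remaining features.
    if k !== nothing
        order = order[1:min(k, length(order))]
    end
    return feature_names[order]
end

feature_names = ["a", "b", "c", "d"]
scores = [0.10, 0.80, 0.55, 0.30]   # made-up correlation-style scores

top2 = pick_features(feature_names, scores; k=2)               # ["b", "c"]
above_0_3 = pick_features(feature_names, scores; threshold=0.3) # ["b", "c", "d"]
```

With both k and threshold left as nothing, the sketch returns every feature, ordered by score.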
Supported methods
FeatureSelectors.pearson_correlation — Function
pearson_correlation(X_data::Matrix, y::Vector)
Calculate Pearson's correlation between the features in X_data and y.
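As a rough illustration of per-feature correlation scoring, the sketch below computes one Pearson coefficient per column using cor from Julia's Statistics standard library. The helper name columnwise_pearson and the toy data are assumptions for this example; the package's own pearson_correlation may differ in details such as return type or sign handling.

```julia
using Statistics  # standard library; provides `cor`

# Hypothetical helper: one Pearson coefficient per column of X_data,
# treating each column as a feature and each row as an observation.
columnwise_pearson(X_data::Matrix, y::Vector) =
    [cor(X_data[:, j], y) for j in 1:size(X_data, 2)]

# Toy data: column 1 rises with y, column 2 falls as y rises.
X_toy = [1.0 4.0;
         2.0 3.0;
         3.0 2.0;
         4.0 1.0]
y_toy = [2.0, 4.0, 6.0, 8.0]

r = columnwise_pearson(X_toy, y_toy)   # r ≈ [1.0, -1.0]
```

Both toy columns are equally informative about y even though their signs differ, which is why correlation scores are often compared by magnitude.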
FeatureSelectors.f_test — Function
f_test(X_data::Matrix, y::Vector)
Calculate p-value using the F-test method.
FeatureSelectors.chisq_test — Function
chisq_test(X_data::Matrix, y::Vector)
Calculate p-value using the chi-square test.
Select feature function
FeatureSelectors.select_features — Function
select_features(selector,
                X::DataFrame,
                y::Vector;
                verbose::Bool=false)
Select features based on their importance to the target y, as defined by selector.method. If verbose is true, logs will be printed; this defaults to false. This function returns only the names of the selected features.
If you have the feature data X_data as a Matrix and the feature names X_features as a Vector, you can replace X with X_data and X_features (in this order).
Example
julia> using RDatasets, FeatureSelectors, DataFrames
julia> boston = dataset("MASS", "Boston");
julia> selector = UnivariateFeatureSelector(method=pearson_correlation, k=5)
UnivariateFeatureSelector(FeatureSelectors.pearson_correlation, 5, nothing)
julia> select_features(
           selector,
           boston[:, Not(:MedV)],
           boston.MedV
       )
5-element Vector{String}:
 "LStat"
 "Rm"
 "PTRatio"
 "Indus"
 "Tax"
Other util functions
FeatureSelectors.calculate_feature_importance — Method
calculate_feature_importance(method::Function, X::DataFrame, y::Vector)
Calculate feature importance as defined by method. As with select_features, X can be passed as a DataFrame or split into X_data (a Matrix) and X_features (a Vector).
This function returns a Dict with feature names as keys and scores as values.
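A Dict is unordered, so a common next step is to rank its entries. The sketch below does this with plain Julia on a hand-written Dict shaped like the return value described above. The names and numbers are made up, and sorting by absolute value is an assumption that suits correlation-style scores (for p-values you would sort ascending instead).

```julia
# Hypothetical scores shaped like the described return value
# (feature name => score). The values are invented for illustration.
importance = Dict("a" => -0.74, "b" => 0.70, "c" => 0.12)

# Sort the (name => score) pairs by absolute score, largest first.
ranked = sort(collect(importance); by = p -> abs(p.second), rev = true)

# Print one feature per line, strongest first.
for (name, score) in ranked
    println(rpad(name, 4), score)
end
```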
FeatureSelectors.one_hot_encode — Method
one_hot_encode(df::DataFrame;
               cols::Vector{Symbol}=Vector{Symbol}(),
               drop_original::Bool=false)
Utility function to perform one-hot encoding on a DataFrame. This adds new columns named <original_col_name>_<value>.
The following options can be passed to modify the behavior:
cols - Vector of Symbol specifying which columns to encode. Defaults to empty, which means all features will be encoded.
drop_original - If true, the original columns that were encoded are dropped from the resulting DataFrame. This defaults to false.
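The <original_col_name>_<value> naming scheme can be illustrated without DataFrames. The sketch below one-hot encodes a single vector of strings into Bool indicator vectors. The helper one_hot_column and its Dict return type are assumptions for this example and are not how one_hot_encode is implemented.

```julia
# Hypothetical helper: one-hot encode a single column of string values.
# Returns a Dict mapping "<col>_<value>" to a Bool indicator vector.
function one_hot_column(col::String, vals::Vector{String})
    encoded = Dict{String,Vector{Bool}}()
    for level in sort(unique(vals))
        # true wherever the row holds this level, false elsewhere
        encoded["$(col)_$(level)"] = vals .== level
    end
    return encoded
end

class_values = ["1st", "2nd", "3rd", "1st"]
encoded = one_hot_column("Class", class_values)
# encoded["Class_1st"] == [true, false, false, true]
```

Each distinct value yields exactly one indicator column, so every row has a single true across the generated columns.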
Example
julia> using RDatasets, FeatureSelectors
julia> titanic = dataset("datasets", "Titanic");
julia> first(one_hot_encode(titanic[:, [:Class, :Sex, :Age]]), 3)
3×11 DataFrame
 Row │ Class    Sex      Age      Class_1st  Class_2nd  Class_3rd  Class_Crew  ⋯
     │ String7  String7  String7  Bool       Bool       Bool       Bool        ⋯
─────┼──────────────────────────────────────────────────────────────────────────
   1 │ 1st      Male     Child         true      false      false       false  ⋯
   2 │ 2nd      Male     Child        false       true      false       false
   3 │ 3rd      Male     Child        false      false       true       false
                                                               4 columns omitted
julia> first(one_hot_encode(titanic[:, [:Class, :Sex, :Age]], cols=[:Class], drop_original=true), 3)
3×6 DataFrame
 Row │ Sex      Age      Class_1st  Class_2nd  Class_3rd  Class_Crew
     │ String7  String7  Bool       Bool       Bool       Bool
─────┼───────────────────────────────────────────────────────────────
   1 │ Male     Child         true      false      false       false
   2 │ Male     Child        false       true      false       false
   3 │ Male     Child        false      false       true       false