FeatureSelection

FeatureSelection is a Julia package providing implementations of feature selection algorithms for use with the machine learning toolbox MLJ.

Installation

In a Julia session (version 1.6 or later), run

import Pkg;
Pkg.add("FeatureSelection")

Example Usage

Let's build a supervised recursive feature eliminator with RandomForestRegressor from DecisionTree.jl as our base model. But first we need a dataset to train on. We shall create a synthetic dataset popularly known in the R community as the Friedman #1 dataset. Notice how the target vector for this dataset depends on only the first five columns of the feature table, so we expect the recursive feature elimination to return those first five columns as the important features.

using MLJ, FeatureSelection, StableRNGs
rng = StableRNG(123)
A = rand(rng, 50, 10)
X = MLJ.table(A) # features
y = @views(
    10 .* sin.(
        pi .* A[:, 1] .* A[:, 2]
    ) .+ 20 .* (A[:, 3] .- 0.5).^2 .+ 10 .* A[:, 4] .+ 5 .* A[:, 5]
) # target
50-element Vector{Float64}:
 15.823421292367547
 11.300228454892402
 14.70281910203931
  5.771835160196897
 18.552879762728146
 20.78516621103614
 20.681427309506923
 21.326088995836216
 14.247147497721128
 13.537577529977188
  ⋮
 19.965258516245633
 19.364285908333393
 13.314083067474565
 19.297478118395937
 22.704030205168113
  8.23163352846279
 19.138707544262704
 10.856925348363083
 18.098524734814458

Now that we have our data, we can create our recursive feature elimination model and train it on our dataset:

RandomForestRegressor = @load RandomForestRegressor pkg=DecisionTree
forest = RandomForestRegressor(rng=rng)
rfe = RecursiveFeatureElimination(
    model = forest, n_features=5, step=1
) # see docstring for description of defaults
mach = machine(rfe, X, y)
fit!(mach)
trained Machine; caches model-specific representations of data
  model: DeterministicRecursiveFeatureElimination(model = RandomForestRegressor(max_depth = -1, …), …)
  args: 
    1:	Source @719 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
    2:	Source @838 ⏎ AbstractVector{ScientificTypesBase.Continuous}

We can inspect the feature importances in two ways:

julia> report(mach).scores
Dict{Symbol, Int64} with 10 entries:
  :x9  => 4
  :x2  => 6
  :x5  => 6
  :x6  => 3
  :x7  => 2
  :x3  => 6
  :x8  => 1
  :x4  => 6
  :x10 => 5
  :x1  => 6

julia> feature_importances(mach)
10-element Vector{Pair{Symbol, Int64}}:
  :x9 => 4
  :x2 => 6
  :x5 => 6
  :x6 => 3
  :x7 => 2
  :x3 => 6
  :x8 => 1
  :x4 => 6
 :x10 => 5
  :x1 => 6

We can view the important features used by our model by inspecting the fitted_params object.

julia> p = fitted_params(mach)
(features_left = [:x4, :x2, :x1, :x5, :x3],
 model_fitresult = (forest = Ensemble of Decision Trees
Trees:      100
Avg Leaves: 25.3
Avg Depth:  8.01,),)

julia> p.features_left
5-element Vector{Symbol}:
 :x4
 :x2
 :x1
 :x5
 :x3

We can also call the predict method on the fitted machine to predict using a random forest regressor trained on only the important features, or call the transform method to select just those features from some new table that includes all the original features. For more information, type ?RecursiveFeatureElimination at the Julia REPL.
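
For instance, here is a minimal sketch (the name Xfresh and the three-row table are illustrative assumptions; any table with the original ten columns will do, and a separate RNG is used so as not to disturb the random stream used below):

Xfresh = MLJ.table(rand(StableRNG(1), 3, 10)) # new table with all ten original features
transform(mach, Xfresh) # table containing only the five selected features
predict(mach, Xfresh)   # predictions from the forest trained on those features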

Okay, let's say that we didn't know that our synthetic dataset depends on only five columns of the feature table. We could apply cross-validation, StratifiedCV(nfolds=5), with our recursive feature elimination model to select the optimal value of n_features for our model. In this case we will use a simple grid search with root mean squared error (rms) as the measure.

rfe = RecursiveFeatureElimination(model = forest)
tuning_rfe_model = TunedModel(
    model = rfe,
    measure = rms,
    tuning = Grid(rng=rng),
    resampling = StratifiedCV(nfolds = 5),
    range = range(
        rfe, :n_features, values = 1:10
    )
)
self_tuning_rfe_mach = machine(tuning_rfe_model, X, y)
fit!(self_tuning_rfe_mach)
trained Machine; does not cache data
  model: ProbabilisticTunedModel(model = DeterministicRecursiveFeatureElimination(model = RandomForestRegressor(max_depth = -1, …), …), …)
  args: 
    1:	Source @561 ⏎ ScientificTypesBase.Table{AbstractVector{ScientificTypesBase.Continuous}}
    2:	Source @278 ⏎ AbstractVector{ScientificTypesBase.Continuous}

As before, we can inspect the important features via the object returned by fitted_params, or via feature_importances, as shown below.

julia> fitted_params(self_tuning_rfe_mach).best_fitted_params.features_left
5-element Vector{Symbol}:
 :x4
 :x2
 :x1
 :x5
 :x3

julia> feature_importances(self_tuning_rfe_mach)
10-element Vector{Pair{Symbol, Int64}}:
  :x9 => 2
  :x2 => 6
  :x5 => 6
  :x6 => 4
  :x7 => 1
  :x3 => 6
  :x8 => 5
  :x4 => 6
 :x10 => 3
  :x1 => 6

and call predict on the tuned machine as shown below:

Xnew = MLJ.table(rand(rng, 50, 10)) # create test data
predict(self_tuning_rfe_mach, Xnew)
50-element Vector{Float64}:
 14.612915980139839
 18.487917617909133
 13.618764198364357
 11.672276660630207
 14.002553975255033
 15.873693213080978
 13.441382659338421
 18.91285351506014
 12.339465903155357
 15.877906366769604
  ⋮
 15.782144419104077
 10.94908418407389
 11.859042543036969
 14.716854931815393
 13.547841255475241
 11.502891246322193
 14.093312357135664
 13.443435888734923
 16.061363024914666

In this case, prediction is done using the best recursive feature elimination model obtained from the tuning process above.
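
If you want to see which value of n_features the search settled on, you can query the tuning report; here is a minimal sketch using the standard TunedModel report fields:

best_rfe = report(self_tuning_rfe_mach).best_model # best RecursiveFeatureElimination model found
best_rfe.n_features # optimal number of features selected by the grid search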

For resampling methods other than cross-validation, and for other TunedModel options such as parallelization, see the Tuning Models section of the MLJ documentation.
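
For example, here is a sketch of enabling multithreaded tuning for the model above; the acceleration keyword is a standard TunedModel option, and everything else is unchanged:

tuning_rfe_model = TunedModel(
    model = rfe,
    measure = rms,
    tuning = Grid(rng=rng),
    resampling = StratifiedCV(nfolds = 5),
    acceleration = CPUThreads(), # evaluate grid points on separate threads
    range = range(rfe, :n_features, values = 1:10)
)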