Feature selection for Machine Learning in Python — Wrapper Methods


This post was originally published by Jack Tan at Towards Data Science

Introduction and Concept

Wrapper methods evaluate multiple models using procedures that add and/or remove predictors to find the optimal combination that maximizes model performance. [1] These procedures are normally built around the greedy search technique (or algorithm). [2] A greedy algorithm is any algorithm that follows the problem-solving heuristic of making the locally optimal choice at each stage. [3]

Generally, three directions of procedures are possible:

  • Forward selection — starts with one predictor and adds more iteratively. At each subsequent iteration, the best of the remaining original predictors is added based on a performance criterion.
  • Backward elimination — starts with all predictors and eliminates them one by one iteratively. One of the most popular algorithms is Recursive Feature Elimination (RFE), which eliminates less important predictors based on a feature-importance ranking.
  • Step-wise selection — bi-directional, based on a combination of forward selection and backward elimination. It is considered less greedy than the previous two procedures since it reconsiders adding predictors back into the model that have been removed (and vice versa). Nonetheless, the decisions are still made based on local optimization at any given iteration.
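To make the greedy idea concrete, the forward-selection loop can be sketched in plain scikit-learn (a minimal illustration only; the libraries discussed below implement this far more efficiently, and the diabetes dataset is used here as a stand-in):

```python
from sklearn.datasets import load_diabetes
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import cross_val_score

X, y = load_diabetes(return_X_y=True)
n_target = 5  # desired subset size (an arbitrary choice for this sketch)

selected = []
remaining = list(range(X.shape[1]))

while len(selected) < n_target:
    # Score each remaining feature when added to the current subset.
    scores = {
        j: cross_val_score(
            LinearRegression(), X[:, selected + [j]], y, scoring="r2", cv=5
        ).mean()
        for j in remaining
    }
    best = max(scores, key=scores.get)  # locally optimal choice (greedy step)
    selected.append(best)
    remaining.remove(best)

print("Selected feature indices:", selected)
```

Each iteration commits to whichever single feature helps most right now, which is exactly the "locally optimal choice at each stage" described above.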

Wrapper Methods in Python

There are two popular libraries in Python which can be used to perform wrapper style feature selection — Sequential Feature Selector from mlxtend and Recursive Feature Elimination from Scikit-learn.

The complete Python code can be found on GitHub. The data used is the Boston house-prices dataset from Scikit-learn.

1. Forward selection — SequentialFeatureSelector() from mlxtend

Printed output: the 5 most important features are iteratively added to the subset in a forward-selection manner based on R-squared scoring.

The SequentialFeatureSelector() class accepts the following major parameters:

  • LinearRegression() acts as the estimator for the feature selection process. Alternatively, it can be substituted with another regression- or classification-based algorithm.
  • k_features indicates the number of features to be selected. For demonstration purposes, 5 features are selected from the original 13. This value can be optimized by analyzing the scores for different numbers of features.
  • forward indicates the direction of the wrapper method used: forward = True for forward selection, forward = False for backward elimination.
  • The scoring argument specifies the evaluation criterion. For regression problems, r2 is the default, but other regression scorers (e.g. neg_mean_squared_error) can be passed. For classification, there are options such as accuracy, precision, recall and f1-score.
  • The cv argument controls k-fold cross-validation. By default, it is set to 5. Bear in mind that a larger number of folds can be time-consuming and compute-intensive.

For other possible parameters, please refer to the documentation.

2. Backward elimination — RFE() from Sklearn

Printed output: the 5 most important features are selected into the subset in a recursive backward-elimination manner based on the regressor's coefficients. Because of how RFE is implemented, intermediate scores for each iteration are not immediately available; however, score(X, y) can be used to output the score of the underlying estimator.
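A minimal RFE run matching the description above (again substituting the diabetes dataset for Boston):

```python
from sklearn.datasets import load_diabetes
from sklearn.feature_selection import RFE
from sklearn.linear_model import LinearRegression

X, y = load_diabetes(return_X_y=True)

# RFE fits the estimator, drops the weakest feature(s) by |coef_|,
# and repeats until n_features_to_select features remain.
rfe = RFE(estimator=LinearRegression(), n_features_to_select=5).fit(X, y)

print("Selected mask:", rfe.support_)
print("Feature ranking (1 = selected):", rfe.ranking_)
print("Score of the underlying estimator:", rfe.score(X, y))
```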

RFE(estimator, n_features_to_select), which stands for Recursive Feature Elimination, comes from the widely used sklearn library for machine learning. It accepts the following major parameters:

  • estimator (the model passed in the sample code) acts as the object for the feature selection process. For regression problems like the one shown, the coef_ attribute is used to determine feature importance. For tree-based algorithms, the feature_importances_ attribute is used instead.
  • n_features_to_select refers to the number of features to select. If None, half of the features are selected.
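Per the note on feature_importances_, the same call works unchanged with a tree-based estimator (a sketch, using a small random forest):

```python
from sklearn.datasets import load_diabetes
from sklearn.ensemble import RandomForestRegressor
from sklearn.feature_selection import RFE

X, y = load_diabetes(return_X_y=True)

# With a tree ensemble, RFE ranks features by feature_importances_
# instead of coef_.
rfe = RFE(
    estimator=RandomForestRegressor(n_estimators=50, random_state=0),
    n_features_to_select=5,
).fit(X, y)

print("Selected mask:", rfe.support_)
```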

For other possible parameters, please refer to the documentation.

3. Step-wise selection — SequentialFeatureSelector() with floating=True from mlxtend

Printed output: the 5 most important features are iteratively added to the subset in a step-wise manner based on R-squared scoring.

The class is the same as the one used in 1. Forward Selection, and the parameters are the same, with one exception:

  • The floating argument adds a conditional exclusion/inclusion step to create a bi-directional selection. When forward = True (Sequential Forward Floating Selection), every forward iteration also considers excluding feature(s) added in previous iterations to optimize performance. Vice versa for backward elimination, where previously removed feature(s) may be re-included into the existing subset.

Bonus — What is the ideal number of features?

In the last 3 examples, the number of features assigned to k_features is set at 5. But how are we supposed to know whether 5 is the best number? Fortunately, in the same mlxtend library, the plot_sequential_feature_selection function can be used to visualize the scores and make the decision easier. For more details, please refer to the documentation.

As seen in the graph, performance peaks at 7 features with 5-fold cross-validation (cv=5). Do not be confused by the negative sign: neg_mean_squared_error simply multiplies the mean squared error (MSE) by -1 to follow sklearn's convention that higher scores are better. Minimizing MSE is thus equivalent to maximizing negative MSE.
