Scikit-learn Cheatsheet: Feature Selection

Feature selection reduces the number of input variables, which can curb overfitting, improve accuracy, and lower computational cost.

Key Algorithms

  1. SelectKBest (Filter):
    • Selects high-scoring features based on f_classif (ANOVA), chi2, or mutual_info_classif.
  2. RFE (Recursive Feature Elimination - Wrapper):
    • Iteratively fits a model and removes the least important features.
    • RFECV: Automatically finds the optimal number of features using cross-validation.
  3. SelectFromModel (Embedded):
    • Uses feature_importances_ or coef_ attributes of a fitted estimator.
    • Works well with Lasso (L1 penalty) or RandomForest.
  4. SequentialFeatureSelector:
    • Greedy forward or backward selection based on cross-validated scores. Unlike RFE, it does not require the estimator to expose coef_ or feature_importances_, but it is slower.

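The RFECV variant mentioned above can be sketched as follows. This is a minimal illustration on a synthetic dataset (make_classification and the estimator choice are assumptions, not part of the original cheatsheet):

```python
from sklearn.datasets import make_classification
from sklearn.feature_selection import RFECV
from sklearn.linear_model import LogisticRegression

# Synthetic data: 20 features, only 5 of which are informative
X, y = make_classification(
    n_samples=200, n_features=20, n_informative=5, random_state=0
)

# RFECV recursively removes the weakest feature(s) and keeps the
# feature count that maximizes the cross-validated score.
rfecv = RFECV(estimator=LogisticRegression(max_iter=1000), step=1, cv=5)
rfecv.fit(X, y)

print("Optimal number of features:", rfecv.n_features_)
print("Selected mask:", rfecv.support_)
```

After fitting, `rfecv.transform(X)` returns only the selected columns, and `rfecv.support_` is a boolean mask over the original features.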
Theoretical Background

Computational Complexity

Code Snippet: Feature Selection Pipeline

from sklearn.pipeline import Pipeline
from sklearn.feature_selection import SelectKBest, f_classif, RFE
from sklearn.ensemble import RandomForestClassifier

# 1. Filter: Select top 10 features via ANOVA
selector = SelectKBest(score_func=f_classif, k=10)

# 2. Wrapper: Recursive Feature Elimination
# Requires an estimator that exposes coef_ or feature_importances_
# (e.g. RandomForest, linear models)
rfe = RFE(estimator=RandomForestClassifier(), n_features_to_select=5)

# 3. Embedded: Select from Model
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import Lasso
sfm = SelectFromModel(Lasso(alpha=0.1))

# Integrating into Pipeline
pipe = Pipeline([
    ('feature_selection', selector),
    ('classification', RandomForestClassifier())
])
pipe.fit(X_train, y_train)  # X_train, y_train: your training data
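SequentialFeatureSelector, listed above but not shown in the snippet, works like this. A minimal sketch on the iris dataset (the dataset and KNeighborsClassifier are illustrative choices, not from the original):

```python
from sklearn.datasets import load_iris
from sklearn.feature_selection import SequentialFeatureSelector
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)

# Forward selection: start with no features and greedily add the one
# that most improves the cross-validated score, until 2 are selected.
sfs = SequentialFeatureSelector(
    KNeighborsClassifier(n_neighbors=3),
    n_features_to_select=2,
    direction="forward",
    cv=5,
)
sfs.fit(X, y)

print(sfs.get_support())      # boolean mask of the chosen features
X_reduced = sfs.transform(X)  # keeps only the 2 selected columns
```

Setting direction="backward" instead starts from all features and greedily removes them, which costs more fits when only a few features are to be kept.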

Credits: This cheatsheet is based on the scikit-learn documentation and examples, which are licensed under the BSD 3-Clause License. Copyright (c) 2007 - 2026 The scikit-learn developers. All rights reserved.