
How to Select the Best Features for Machine Learning Using Scikit-learn

Struggling with too many features? Learn how to use mutual information, RFECV, and permutation importance to streamline your ML models.


Have you ever stared at a dataset with hundreds of columns and felt completely lost? I have. Last month, I was building a model to predict equipment failure. My initial dataset had nearly 300 potential signals—temperatures, pressures, vibration frequencies. The model was slow, confusing, and frankly, not very good. It was a classic case of “garbage in, garbage out,” but the garbage was dressed in fancy, technical clothing. That experience is why I’m writing this. I needed a systematic way to cut through the noise, and I found it by combining three powerful techniques.

Think of feature selection like packing for a trip. You don’t bring your entire closet; you choose versatile items that serve multiple purposes. In machine learning, the right features make your model accurate, fast, and understandable. The wrong ones add weight and confusion.

So, how do we separate the useful signals from the distracting noise? We’ll use three complementary tools from Scikit-learn. Each has a different strength, and together, they form a robust selection process.

First, let’s talk about mutual information. This is a filter method: it evaluates each feature’s relationship with the target variable before any model is built. The beauty is that it catches connections simple correlation would miss, including non-linear patterns.

Here’s how you can calculate it on a training set. Remember, always perform selection after splitting your data to avoid bias.
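If you haven’t done that split yet, here is a minimal sketch of the setup the snippets below assume. The synthetic data is a stand-in for your own DataFrame, and the validation set carved out here is what the permutation-importance check later will use.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Stand-in data: in practice, replace this with your own DataFrame and target
X_arr, y_arr = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(10)])
y = pd.Series(y_arr, name="target")

# Split before any selection so held-out rows never influence feature scores
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```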

from sklearn.feature_selection import mutual_info_regression
import pandas as pd

# Assuming X_train is a DataFrame and y_train is a Series
mi_scores = mutual_info_regression(X_train, y_train, random_state=42)
mi_series = pd.Series(mi_scores, index=X_train.columns, name="MI_Score")
top_features = mi_series.sort_values(ascending=False).head(20)
print(top_features)

This gives you a ranked list. But what threshold should you use? There’s no universal answer. I often look for a natural drop-off in scores or use the top N features that appear in the next method’s results.
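One simple heuristic for that drop-off is to keep features scoring at least some fraction of the top score. The 10% cutoff below is an arbitrary choice for illustration, not a rule, and the scores are made-up placeholders shaped like the `mi_series` output above.

```python
import pandas as pd

# Hypothetical MI scores, shaped like the output of mutual_info_regression
mi_series = pd.Series(
    {"temp": 0.85, "pressure": 0.42, "vibration": 0.40, "humidity": 0.03, "id": 0.0}
)

# Keep anything scoring at least 10% of the best feature (an arbitrary cutoff)
threshold = 0.1 * mi_series.max()
kept = mi_series[mi_series >= threshold].sort_values(ascending=False)
print(list(kept.index))  # ['temp', 'pressure', 'vibration']
```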

Now, what if your features are tangled together, with many telling the same story? Enter Recursive Feature Elimination with Cross-Validation, or RFECV. This is a wrapper method. It uses a model’s own logic to test different combinations of features, finding the optimal number for performance.

It works by training a model, removing the weakest feature, and repeating the process. Cross-validation ensures the result is stable. You can use any estimator that exposes feature importances (via coef_ or feature_importances_); tree-based models like RandomForest are a common choice.

from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Create the selector (use KFold, not StratifiedKFold, since this is regression)
estimator = RandomForestRegressor(n_estimators=100, random_state=42)
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)
selector = RFECV(estimator, step=1, cv=cv_strategy, scoring='r2', n_jobs=-1)

# Fit it to the training data
selector = selector.fit(X_train, y_train)

print(f"Optimal number of features: {selector.n_features_}")
selected_rfe = X_train.columns[selector.support_]
print("Selected Features:", list(selected_rfe))

You get a clear answer on how many features to keep and which ones. But here’s a question: does a feature that matters to a random forest also matter to a linear model? Not always. That’s why we need a third, model-agnostic perspective.

This is where permutation importance comes in. It’s beautifully simple. It measures how much a model’s performance drops when you randomly shuffle a single feature’s values. If shuffling breaks the model, that feature was important. If nothing changes, the feature was likely irrelevant.

from sklearn.inspection import permutation_importance

# First, train a final model on your selected features
final_model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train[selected_rfe], y_train)

# Calculate importance on a held-out validation set
result = permutation_importance(
    final_model, X_val[selected_rfe], y_val,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)

# Organize the results
perm_importance = pd.DataFrame({
    'feature': selected_rfe,
    'importance_mean': result.importances_mean,
    'importance_std': result.importances_std
}).sort_values('importance_mean', ascending=False)

This gives you a robust check. I’ve seen features that RFE loves get a near-zero score here, revealing they only seemed important due to a quirk in the training data.
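One way to act on that, sketched here as a judgment call rather than a fixed rule, is to drop features whose mean importance doesn’t clearly exceed its own standard deviation across repeats, since their effect is indistinguishable from shuffle noise. The numbers below are placeholders shaped like the `perm_importance` DataFrame above.

```python
import pandas as pd

# Hypothetical permutation-importance results, shaped like the DataFrame above
perm_importance = pd.DataFrame({
    "feature": ["vibration", "temp", "pressure", "humidity"],
    "importance_mean": [0.31, 0.22, 0.04, 0.001],
    "importance_std": [0.02, 0.03, 0.01, 0.015],
})

# Keep only features whose mean drop in score clearly exceeds shuffle noise
confident = perm_importance[
    perm_importance["importance_mean"] > 2 * perm_importance["importance_std"]
]
print(list(confident["feature"]))  # ['vibration', 'temp', 'pressure']
```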

So, what’s the final workflow? I start with mutual information for a broad filter. Then, I use RFECV to find a strong subset. Finally, I check that subset with permutation importance on a validation set. The features that consistently rank high across these tests are your true champions.
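That consistency check can be as simple as intersecting the survivors from each step. The sets below are hypothetical placeholders, assuming you stored each method’s selected feature names along the way.

```python
# Hypothetical survivor sets from the three methods above
mi_top = {"vibration", "temp", "pressure", "humidity"}
rfe_selected = {"vibration", "temp", "pressure", "rpm"}
perm_confident = {"vibration", "temp", "flow"}

# Features every method agrees on form the safest core set
champions = mi_top & rfe_selected & perm_confident
print(sorted(champions))  # ['temp', 'vibration']
```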

It’s a process that saves time, reduces overfitting, and makes your models profoundly easier to explain to stakeholders. The journey from a bloated, confusing dataset to a lean, powerful model isn’t just satisfying—it’s essential for real-world success.

I hope this clear, step-by-step approach helps you tackle your own feature-rich datasets. What’s the most features you’ve ever had to wrangle? Share your stories in the comments below—I’d love to hear about your challenges and solutions. If this guide cleared the fog for you, please like and share it with a colleague who might be facing the same struggle.


