
How to Select the Best Features for Machine Learning Using Scikit-learn

Struggling with too many features? Learn how to use mutual information, RFECV, and permutation importance to streamline your ML models.


Have you ever stared at a dataset with hundreds of columns and felt completely lost? I have. Last month, I was building a model to predict equipment failure. My initial dataset had nearly 300 potential signals—temperatures, pressures, vibration frequencies. The model was slow, confusing, and frankly, not very good. It was a classic case of “garbage in, garbage out,” but the garbage was dressed in fancy, technical clothing. That experience is why I’m writing this. I needed a systematic way to cut through the noise, and I found it by combining three powerful techniques.

Think of feature selection like packing for a trip. You don’t bring your entire closet; you choose versatile items that serve multiple purposes. In machine learning, the right features make your model accurate, fast, and understandable. The wrong ones add weight and confusion.

So, how do we separate the useful signals from the distracting noise? We’ll use three complementary tools from Scikit-learn. Each has a different strength, and together, they form a robust selection process.

First, let’s talk about mutual information. This is a filter method: it evaluates each feature’s relationship with the target variable before any model is built. The beauty is that it catches connections simple correlation would miss, including non-linear patterns.

Here’s how you can calculate it on a training set. Remember, always perform selection after splitting your data to avoid bias.
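If you haven’t done that split yet, here is a minimal sketch of the setup the snippets below assume. The synthetic data is a stand-in for your own DataFrame, and the validation set carved out here is what the permutation-importance check later will use.

```python
import pandas as pd
from sklearn.datasets import make_regression
from sklearn.model_selection import train_test_split

# Stand-in data: in practice, replace this with your own DataFrame and target
X_arr, y_arr = make_regression(n_samples=200, n_features=10, noise=0.1, random_state=42)
X = pd.DataFrame(X_arr, columns=[f"feature_{i}" for i in range(10)])
y = pd.Series(y_arr, name="target")

# Split before any selection so held-out rows never influence feature scores
X_train, X_val, y_train, y_val = train_test_split(
    X, y, test_size=0.2, random_state=42
)
```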

from sklearn.feature_selection import mutual_info_regression
import pandas as pd

# Assuming X_train is a DataFrame and y_train is a Series
mi_scores = mutual_info_regression(X_train, y_train, random_state=42)
mi_series = pd.Series(mi_scores, index=X_train.columns, name="MI_Score")
top_features = mi_series.sort_values(ascending=False).head(20)
print(top_features)

This gives you a ranked list. But what threshold should you use? There’s no universal answer. I often look for a natural drop-off in scores or use the top N features that appear in the next method’s results.
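One simple heuristic for that drop-off is to keep features scoring at least some fraction of the top score. The 10% cutoff below is an arbitrary choice for illustration, not a rule, and the scores are made-up placeholders shaped like the `mi_series` output above.

```python
import pandas as pd

# Hypothetical MI scores, shaped like the output of mutual_info_regression
mi_series = pd.Series(
    {"temp": 0.85, "pressure": 0.42, "vibration": 0.40, "humidity": 0.03, "id": 0.0}
)

# Keep anything scoring at least 10% of the best feature (an arbitrary cutoff)
threshold = 0.1 * mi_series.max()
kept = mi_series[mi_series >= threshold].sort_values(ascending=False)
print(list(kept.index))  # ['temp', 'pressure', 'vibration']
```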

Now, what if your features are tangled together, with many telling the same story? Enter Recursive Feature Elimination with Cross-Validation, or RFECV. This is a wrapper method. It uses a model’s own logic to test different combinations of features, finding the optimal number for performance.

It works by training a model, removing the weakest feature, and repeating the process. Cross-validation ensures the result is stable. You can use any estimator that exposes feature importances (via coef_ or feature_importances_); tree-based models like RandomForest are a common choice.

from sklearn.feature_selection import RFECV
from sklearn.ensemble import RandomForestRegressor
from sklearn.model_selection import KFold

# Create the selector (use KFold, not StratifiedKFold, since this is regression)
estimator = RandomForestRegressor(n_estimators=100, random_state=42)
cv_strategy = KFold(n_splits=5, shuffle=True, random_state=42)
selector = RFECV(estimator, step=1, cv=cv_strategy, scoring='r2', n_jobs=-1)

# Fit it to the training data
selector = selector.fit(X_train, y_train)

print(f"Optimal number of features: {selector.n_features_}")
selected_rfe = X_train.columns[selector.support_]
print("Selected Features:", list(selected_rfe))

You get a clear answer on how many features to keep and which ones. But here’s a question: does a feature that matters to a random forest also matter to a linear model? Not always. That’s why we need a third, model-agnostic perspective.

This is where permutation importance comes in. It’s beautifully simple. It measures how much a model’s performance drops when you randomly shuffle a single feature’s values. If shuffling breaks the model, that feature was important. If nothing changes, the feature was likely irrelevant.

from sklearn.inspection import permutation_importance

# First, train a final model on your selected features
final_model = RandomForestRegressor(n_estimators=100, random_state=42).fit(X_train[selected_rfe], y_train)

# Calculate importance on a held-out validation set
result = permutation_importance(
    final_model, X_val[selected_rfe], y_val,
    n_repeats=10,
    random_state=42,
    n_jobs=-1
)

# Organize the results
perm_importance = pd.DataFrame({
    'feature': selected_rfe,
    'importance_mean': result.importances_mean,
    'importance_std': result.importances_std
}).sort_values('importance_mean', ascending=False)

This gives you a robust check. I’ve seen features that RFE loves get a near-zero score here, revealing they only seemed important due to a quirk in the training data.
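One way to act on that, sketched here as a judgment call rather than a fixed rule, is to drop features whose mean importance doesn’t clearly exceed its own standard deviation across repeats, since their effect is indistinguishable from shuffle noise. The numbers below are placeholders shaped like the `perm_importance` DataFrame above.

```python
import pandas as pd

# Hypothetical permutation-importance results, shaped like the DataFrame above
perm_importance = pd.DataFrame({
    "feature": ["vibration", "temp", "pressure", "humidity"],
    "importance_mean": [0.31, 0.22, 0.04, 0.001],
    "importance_std": [0.02, 0.03, 0.01, 0.015],
})

# Keep only features whose mean drop in score clearly exceeds shuffle noise
confident = perm_importance[
    perm_importance["importance_mean"] > 2 * perm_importance["importance_std"]
]
print(list(confident["feature"]))  # ['vibration', 'temp', 'pressure']
```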

So, what’s the final workflow? I start with mutual information for a broad filter. Then, I use RFECV to find a strong subset. Finally, I check that subset with permutation importance on a validation set. The features that consistently rank high across these tests are your true champions.
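That consistency check can be as simple as intersecting the survivors from each step. The sets below are hypothetical placeholders, assuming you stored each method’s selected feature names along the way.

```python
# Hypothetical survivor sets from the three methods above
mi_top = {"vibration", "temp", "pressure", "humidity"}
rfe_selected = {"vibration", "temp", "pressure", "rpm"}
perm_confident = {"vibration", "temp", "flow"}

# Features every method agrees on form the safest core set
champions = mi_top & rfe_selected & perm_confident
print(sorted(champions))  # ['temp', 'vibration']
```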

It’s a process that saves time, reduces overfitting, and makes your models profoundly easier to explain to stakeholders. The journey from a bloated, confusing dataset to a lean, powerful model isn’t just satisfying—it’s essential for real-world success.

I hope this clear, step-by-step approach helps you tackle your own feature-rich datasets. What’s the most features you’ve ever had to wrangle? Share your stories in the comments below—I’d love to hear about your challenges and solutions. If this guide cleared the fog for you, please like and share it with a colleague who might be facing the same struggle.


