MLflow with Scikit-learn: Track Experiments, Version Models, Deploy Confidently
Learn MLflow with Scikit-learn to track experiments, version models, and deploy reliably. Build a reproducible classification pipeline today.
I’ve been deep in a machine learning project where every small change to my pipeline seemed to create a new, undocumented version of my model. I’d train something, tweak a hyperparameter, run it again, and within a week I had ten different models saved as final_model_v2_final_actually.pkl and no memory of which one used which settings. Sound familiar? That’s exactly why I started using MLflow – and why I’m writing this guide today. If you’ve ever wasted hours trying to reproduce a result or struggled to roll back to a previous model version, you’re in the right place. I’ll walk you through a complete MLflow setup using a real classification pipeline with Scikit-learn, so you can track experiments, version your models, and deploy with confidence.
Let’s start with the core idea: MLflow is a platform that covers the entire machine learning lifecycle. It’s not a new framework or a replacement for your favorite tools. Instead, it sits on top of them and records everything that matters – parameters, metrics, code versions, artifacts – in a structured way. Forget messy notebooks and scattered CSV files. With MLflow, every run is logged against a named experiment, and you can compare results visually in a web UI. More importantly, it gives you a central registry where models get versioned and staged from development to production. You can ask yourself: How often have I needed to revert to last week’s model but couldn’t find it? MLflow solves this by treating your models as first-class, versioned artifacts.
I’ll use the Adult Income dataset because it’s a classic binary classification problem with mixed data types – exactly the kind of scenario where tracking every preprocessing step becomes critical. But before diving into the code, I need to set up MLflow itself. You can install it with pip: pip install mlflow scikit-learn pandas numpy matplotlib seaborn. Then, to get the full model registry functionality, I recommend starting a local tracking server backed by SQLite instead of the default file store. Run this in a terminal:
mlflow server \
--backend-store-uri sqlite:///mlflow.db \
--default-artifact-root ./mlruns \
--host 0.0.0.0 \
--port 5000
Open http://localhost:5000 in your browser – there’s your UI, empty for now, but ready to capture every detail of your experiments. In your Python code, set the tracking URI: mlflow.set_tracking_uri("http://localhost:5000"). Without this, MLflow defaults to a local ./mlruns directory, which works but doesn’t support the model registry unless you switch to a database backend. Why does the registry need a database? Because it stores model metadata and version transitions as relational data – file stores simply can’t handle that reliably.
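In Python that's a single call; I usually add a quick sanity check that I'm pointed at the server and not a file path – a minimal sketch:
import mlflow

# Point this process at the tracking server started above
mlflow.set_tracking_uri("http://localhost:5000")
print(mlflow.get_tracking_uri())  # sanity check: should print the server URL, not ./mlruns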
Now let’s prepare the data. Load the Adult Income CSV, drop missing values, and split into train/test sets. I always use stratification on the target to maintain class balance. Here’s the minimal snippet:
import pandas as pd
from sklearn.model_selection import train_test_split
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income"
]
df = pd.read_csv(url, names=columns, na_values="?", skipinitialspace=True)
df.dropna(inplace=True)
df["income"] = (df["income"].str.strip() == ">50K").astype(int)
X = df.drop("income", axis=1)
y = df["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
Notice that I didn’t scale or encode anything yet – that belongs in a pipeline. I define numeric and categorical features separately and build a ColumnTransformer to handle both. For classification I’ll start with a random forest because it’s robust and doesn’t require extensive tuning for a baseline. The full pipeline looks like this:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
numeric_features = ["age", "fnlwgt", "education_num", "capital_gain",
                    "capital_loss", "hours_per_week"]
categorical_features = ["workclass", "education", "marital_status",
                        "occupation", "relationship", "race", "sex",
                        "native_country"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])
Now comes the part that changed my workflow forever: logging everything inside an MLflow run. I create an experiment first: mlflow.set_experiment("adult_income_classification"). Then I wrap my training code in a with mlflow.start_run(run_name="rf_baseline") as run: block. Inside that context, I log parameters, metrics, and – this is key – the entire pipeline as a model artifact. I also log a confusion matrix plot so I can visually compare runs later. Let me show you the complete training function I use:
import mlflow
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import tempfile
import os
def train_and_log(pipeline, X_train, y_train, X_test, y_test,
                  run_name="rf_baseline", tags=None):
    with mlflow.start_run(run_name=run_name) as run:
        # Tag the run for easy filtering
        mlflow.set_tags({
            "model_type": "RandomForest",
            "dataset": "adult_income",
            "engineer": "me",
            **(tags or {})
        })
        clf = pipeline.named_steps["classifier"]
        # Log hyperparameters manually
        mlflow.log_params({
            "n_estimators": clf.n_estimators,
            "max_depth": clf.max_depth,
            "min_samples_split": clf.min_samples_split,
            "min_samples_leaf": clf.min_samples_leaf
        })
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        # ROC AUC needs probability scores, not hard labels
        y_proba = pipeline.predict_proba(X_test)[:, 1]
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        roc_auc = roc_auc_score(y_test, y_proba)
        mlflow.log_metrics({
            "accuracy": accuracy,
            "f1_score": f1,
            "roc_auc": roc_auc
        })
        # Log confusion matrix plot as an artifact
        cm = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(6, 5))
        sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
        plt.xlabel("Predicted")
        plt.ylabel("Actual")
        plt.title("Confusion Matrix")
        with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
            plt.savefig(tmp.name, dpi=100, bbox_inches="tight")
        mlflow.log_artifact(tmp.name, "plots")
        os.unlink(tmp.name)  # clean up the temp file
        plt.close()
        # Log the entire pipeline as a model and register it
        mlflow.sklearn.log_model(pipeline, "model",
                                 registered_model_name="AdultIncomeClassifier")
        return run.info.run_id
Calling train_and_log(pipeline, X_train, y_train, X_test, y_test) will execute everything. Notice the line mlflow.sklearn.log_model(pipeline, "model", registered_model_name="AdultIncomeClassifier"). This does two things at once: it saves the serialized pipeline locally under the run’s artifact folder, and it registers it in the MLflow Model Registry under the name AdultIncomeClassifier. The first time you log a model with that name, MLflow creates a new entry with version 1. Every subsequent run with the same registered model name will produce version 2, 3, etc.
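If you want to see those versions from code rather than the UI, the registry client can list them – a quick sketch using MlflowClient:
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Each entry is one registered version of the pipeline, with its stage and source run
for mv in client.search_model_versions("name='AdultIncomeClassifier'"):
    print(mv.version, mv.current_stage, mv.run_id)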
After you run this, open the MLflow UI again. You’ll see a new experiment with one run. Click on it to see the logged parameters, metrics, and artifacts – including the confusion matrix plot. The Models tab will show your registered model with version 1 in the “None” stage – that’s where you can manually transition it. In my own projects, I often run multiple experiments with different hyperparameters, then compare their metrics in the UI to pick the best one. Which version should I push to production? The UI helps you decide.
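To generate those comparison runs, I just clone the pipeline, swap the hyperparameter, and call the same training function – a sketch of a small sweep (the run names are arbitrary):
from sklearn.base import clone

for n in [100, 300, 500]:
    candidate = clone(pipeline)
    candidate.set_params(classifier__n_estimators=n)
    train_and_log(candidate, X_train, y_train, X_test, y_test,
                  run_name=f"rf_{n}_trees", tags={"sweep": "n_estimators"})
Each call produces a new run under the same experiment and a new version of AdultIncomeClassifier, so the UI comparison comes for free.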
Once you have a champion model, you can transition it to “Staging” and then to “Production” using the MLflow UI or via the API. For script-based deployments, I use the mlflow.<flavor>.load_model API. For example, to load the latest production version of the AdultIncomeClassifier:
model_uri = "models:/AdultIncomeClassifier/Production"
model = mlflow.sklearn.load_model(model_uri)
predictions = model.predict(X_test)
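If you’d rather script the stage transition itself instead of clicking through the UI, the registry client exposes it directly – a sketch, assuming version 1 is your champion (note that recent MLflow releases are moving from stages toward model aliases):
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="AdultIncomeClassifier",
    version=1,  # swap in whichever version won your comparison
    stage="Production",
    archive_existing_versions=True  # retire whatever held the stage before
)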
You can also serve the model as a REST API using MLflow’s built-in scoring server: mlflow models serve -m "models:/AdultIncomeClassifier/Production" -p 1234 (set the MLFLOW_TRACKING_URI environment variable to http://localhost:5000 first so the models:/ URI resolves against your tracking server). This is incredibly handy for rapid prototyping or internal demos. For production, you’d more likely package the model into a Docker image with mlflow models build-docker.
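Once the scoring server is up, any HTTP client can hit its /invocations endpoint. Here’s a sketch using requests, assuming the MLflow 2.x dataframe_split payload format:
import json
import requests

# Send five test rows to the local scoring server started above
payload = {"dataframe_split": json.loads(X_test.head(5).to_json(orient="split"))}
resp = requests.post(
    "http://127.0.0.1:1234/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload)
)
print(resp.json())  # predictions for the five sample rows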
Throughout this process, I’ve found that the biggest wins come from discipline: always log a run, always tag it meaningfully, and never skip logging the full model artifact. The question I ask myself now is: can I reproduce this exact model six months from now? With MLflow, the answer is yes, because the preprocessing parameters, the code environment, and even the dataset (if you log it as an artifact) are all captured.
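One habit that makes that answer stick: attach the training data itself to the run. A sketch, assuming you kept the run_id returned by train_and_log (the snapshot filename is illustrative):
import hashlib

snapshot = "train_snapshot.csv"  # illustrative local path
X_train.assign(income=y_train).to_csv(snapshot, index=False)
with mlflow.start_run(run_id=run_id):  # reopen the finished run by id
    mlflow.log_artifact(snapshot, artifact_path="data")
    with open(snapshot, "rb") as f:
        mlflow.set_tag("train_data_md5", hashlib.md5(f.read()).hexdigest())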
To wrap up, I want you to try this yourself. Start with a simple pipeline, run two experiments with different n_estimators, and see how the UI highlights the differences. Then register the better model, transition it to staging, and serve it locally. You’ll quickly see why this tool has become the standard for ML teams. If this guide helped you understand MLflow a little better, I’d appreciate it if you share it with a colleague or leave a comment about your own tracking struggles. And if you have questions or tips, drop them below – I read every one.