MLflow with Scikit-Learn: Track Experiments, Register Models, and Deploy with Confidence

Learn MLflow with Scikit-learn to track experiments, register model versions, and deploy reliably. Follow this practical guide to streamline ML workflows.

I remember the day my team lost a production model. We had trained thirty versions over three weeks, and nobody could recall which hyperparameters gave us the 92% AUC. The lead engineer had logged everything in a text file that somehow got overwritten. That was the day I decided we needed something more reliable than human memory. That’s when I discovered MLflow.

Have you ever tried to reproduce a model from three months ago, only to find you have no idea which dataset or parameters were used? It’s frustrating, and worse, it wastes time that could be spent improving the model.

MLflow gives you a systematic way to track every experiment, log every metric, save every model, and deploy the exact version you want. It works with any machine learning library, but I use it heavily with Scikit-learn. In this article, I’ll walk you through how I set up experiment tracking, a model registry, and a serving layer for a credit risk classification project. The code examples are short and ready to run.

I’ll assume you have Python 3.9+ installed and know the basics of Scikit-learn. Let’s start by installing the core packages:

pip install mlflow scikit-learn pandas numpy optuna

MLflow’s tracking server is the heart of the system. You can run it locally for testing or on a shared server for team collaboration. I usually start with a SQLite backend because it persists runs even if the server restarts.

mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlartifacts --host 127.0.0.1 --port 5000

Point your browser to http://127.0.0.1:5000 and you’ll see an empty UI. Now let’s log our first experiment.

Before writing code, set the tracking URI in your script:

import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:5000")
mlflow.set_experiment("credit_risk_classification")

I use the UCI Credit Card Default dataset. Here’s how I load it and split the data:

import pandas as pd
from sklearn.model_selection import train_test_split

url = ("https://archive.ics.uci.edu/ml/machine-learning-databases/"
       "00350/default%20of%20credit%20card%20clients.xls")
df = pd.read_excel(url, header=1)
df.rename(columns={"default payment next month": "target"}, inplace=True)
df.drop(columns=["ID"], inplace=True)

X = df.drop("target", axis=1)
y = df["target"]
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Now I train a simple Random Forest and log everything with MLflow’s autologging feature. Autologging captures parameters, metrics, and the model automatically – a huge time saver.

from sklearn.ensemble import RandomForestClassifier
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

mlflow.sklearn.autolog()

with mlflow.start_run() as run:
    pipeline = Pipeline([
        ("scaler", StandardScaler()),
        ("rf", RandomForestClassifier(n_estimators=100, max_depth=10, random_state=42))
    ])

    pipeline.fit(X_train, y_train)
    score = pipeline.score(X_test, y_test)
    print(f"Test accuracy: {score:.4f}")

Run the script. In the MLflow UI you’ll see the run with parameters like n_estimators and max_depth, metrics like accuracy, and artifacts including the fitted model. Click the run to inspect everything it captured.

Do you ever wonder which experiment gave the best F1 score? MLflow’s UI lets you compare runs visually. Select two runs and click “Compare” to see side-by-side metrics.
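The same comparison works programmatically. Here’s a minimal sketch, assuming a logged metric named accuracy (swap in metrics.f1_score or whatever you actually logged): mlflow.search_runs pulls every run in the experiment into a pandas DataFrame.

# Fetch all runs in the experiment as a DataFrame, best accuracy first
runs = mlflow.search_runs(
    experiment_names=["credit_risk_classification"],
    order_by=["metrics.accuracy DESC"]
)
print(runs[["run_id", "metrics.accuracy"]].head())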

Autologging is great, but sometimes you need custom logging. For example, you might want to store dataset statistics or custom plots. Here’s how I log extra information in a separate, named run:

with mlflow.start_run(run_name="baseline_random_forest"):
    mlflow.log_param("model_type", "RandomForest")
    mlflow.log_metric("train_size", len(X_train))
    mlflow.log_metric("class_balance", y_train.mean())

    # Log a confusion-matrix plot as an artifact
    import matplotlib.pyplot as plt
    from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

    y_pred = pipeline.predict(X_test)  # reuse the pipeline fitted earlier
    cm = confusion_matrix(y_test, y_pred)
    disp = ConfusionMatrixDisplay(confusion_matrix=cm)
    disp.plot()
    plt.savefig("confusion_matrix.png")
    mlflow.log_artifact("confusion_matrix.png")
    plt.close()

Now I can see the confusion matrix directly in the run’s artifact folder. This makes debugging a lot easier.
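On the dataset-statistics side, mlflow.log_dict stores any dictionary as a JSON artifact. A minimal sketch (the particular statistics are just my example):

with mlflow.start_run(run_name="dataset_stats"):
    stats = {
        "n_rows": int(len(X_train)),
        "n_features": int(X_train.shape[1]),
        "positive_rate": float(y_train.mean())
    }
    # Stored under the run's artifacts as dataset_stats.json
    mlflow.log_dict(stats, "dataset_stats.json")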

When experimenting, I often run many trials with different hyperparameters. For that, I integrate Optuna. Each trial becomes a nested run under a parent experiment run. This keeps the UI tidy.

import optuna
from optuna.samplers import TPESampler

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 300),
        "max_depth": trial.suggest_int("max_depth", 5, 20),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 10)
    }

    with mlflow.start_run(nested=True):
        mlflow.log_params(params)

        pipeline = Pipeline([
            ("scaler", StandardScaler()),
            ("rf", RandomForestClassifier(**params, random_state=42))
        ])

        pipeline.fit(X_train, y_train)
        acc = pipeline.score(X_test, y_test)
        mlflow.log_metric("accuracy", acc)

    return acc

# Open the parent run first so each Optuna trial nests under it
with mlflow.start_run(run_name="optuna_hpo"):
    study = optuna.create_study(direction="maximize", sampler=TPESampler())
    study.optimize(objective, n_trials=20)

    # Log the best trial info in the parent run
    mlflow.log_params(study.best_params)
    mlflow.log_metric("best_accuracy", study.best_value)

After running, the UI shows a parent run with 20 nested children. You can expand the parent to see each trial. This structure saved me countless hours when I needed to revisit which hyperparameter combination produced the best result.

But tracking alone isn’t enough for production. You need a way to promote a model from experimentation to staging, then to production, and keep versions. That’s where the MLflow Model Registry comes in.

First, register the best model from the Optuna study:

best_params = study.best_params
best_model = Pipeline([
    ("scaler", StandardScaler()),
    ("rf", RandomForestClassifier(**best_params, random_state=42))
])
best_model.fit(X_train, y_train)

# Log and register the model
with mlflow.start_run(run_name="register_best") as run:
    mlflow.sklearn.log_model(
        best_model,
        artifact_path="model",
        input_example=X_train.head(1)  # lets MLflow infer a model signature
    )
    run_id = run.info.run_id

    model_uri = f"runs:/{run_id}/model"
    mlflow.register_model(model_uri, "credit_risk_model")

Go to the UI and click the “Models” tab. You’ll see credit_risk_model at version 1. From there you can move it through stages: Staging for offline testing, then Production for serving. I usually set a new version to Staging first, run offline validation, and only then promote it to Production.

How do you promote a model programmatically? Like this:

client = mlflow.tracking.MlflowClient()
client.transition_model_version_stage(
    name="credit_risk_model",
    version=1,
    stage="Production"
)

With that, version 1 is the Production model: the models:/credit_risk_model/Production URI now resolves to it, so anything that loads the model by stage picks up the promoted version.
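Worth knowing: recent MLflow releases (2.9+) deprecate stages in favor of model version aliases. A sketch of the newer style, with the alias name “champion” being my own choice:

client.set_registered_model_alias("credit_risk_model", "champion", "1")
# Load by alias instead of stage
model = mlflow.pyfunc.load_model("models:/credit_risk_model@champion")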

Serving a model is as simple as running:

mlflow models serve -m "models:/credit_risk_model/Production" -p 5001

This starts a REST API on port 5001. By default MLflow builds a fresh virtual environment for the model; pass --env-manager local if you’d rather reuse your current one. You can send a POST request with input features and get predictions back. Here’s a quick client example:

import requests

# dataframe_split expects rows as lists of values, not dicts
sample = X_test.iloc[0:1].values.tolist()
response = requests.post(
    "http://127.0.0.1:5001/invocations",
    json={"dataframe_split": {"columns": list(X_test.columns), "data": sample}}
)
print(response.json())

What if you need to load a specific version for batch inference? Use the model URI:

import mlflow.pyfunc

model = mlflow.pyfunc.load_model("models:/credit_risk_model/1")
predictions = model.predict(X_test)
print(predictions[:5])

The URI can reference a specific version (e.g., credit_risk_model/1) or a stage (credit_risk_model/Production). This gives you fine-grained control over which model runs where.
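For example, loading whatever is currently in Production is a one-line change:

prod_model = mlflow.pyfunc.load_model("models:/credit_risk_model/Production")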

After months of using this system, I can confidently say it transformed our workflow. No more lost configurations, no more manual deployment scripts. Every run is documented, every model is versioned, and deploying is a single command.

If you want to avoid the chaos I faced, I encourage you to implement MLflow tracking and registry in your next project. It takes an hour to set up, and it will save you weeks of headaches.

If you found this guide helpful, please like the article, share it with your team, and leave a comment with your biggest experiment tracking challenge. I read every comment and I’d love to hear how you manage your ML experiments.

