MLflow for Experiment Tracking and Model Versioning in Machine Learning

Learn how to use MLflow for experiment tracking and model versioning to build reproducible ML pipelines and deploy models with confidence.

I’ve been there—staring at a screen full of notebook cells, trying to remember which random seed gave us that fleeting moment of high accuracy. It was a Tuesday afternoon, and our team couldn’t recall the model version running in production. That moment of confusion is why I’m writing this. If you’ve ever lost track of a machine learning experiment, you know the pain. This article is my way of showing you a better path. Let’s build a system that remembers, so you don’t have to.

Machine learning projects often start with excitement. You write code, train models, and see promising results. But then, weeks later, someone asks, “Which features did we use for that model?” Without a clear system, finding the answer feels like searching for a needle in a haystack. Have you ever had to re-run experiments because the original settings were lost?

This is where experiment tracking comes in. It’s the practice of recording every detail of your ML work. Think of it as a lab notebook for your code. You note down what you did, what changed, and what happened. Why is this so critical? Because machine learning is not a one-time event. It’s a cycle of trying, failing, and improving. Without records, you’re flying blind.

Model versioning is the next piece. It’s about keeping track of different iterations of your models. Just like software version control, but for trained models. This ensures that you know exactly what is deployed and can roll back if needed. Can you imagine pushing a new model without knowing how it differs from the last?

To solve these problems, we use tools like MLflow. MLflow is an open-source platform that helps manage the ML lifecycle. It has parts for tracking experiments, packaging code, and managing models. I’ll show you how to use it in a real project. We’ll build a pipeline that tracks everything automatically.

Let’s start by setting things up. First, you need to install MLflow and other libraries. Open your terminal and run this command.

pip install mlflow scikit-learn pandas

Now, create a new directory for your project. Inside, make a Python script. I’ll call mine train.py. We’ll use a simple dataset for classification. The goal is to predict if a loan will default. This is a common problem in finance.

Here’s how to begin the script. We import the necessary libraries.

import mlflow
import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score
import pandas as pd

Next, load your data. For this example, I’m using a synthetic dataset. In a real project, you might load a CSV file.

# Generate sample data
from sklearn.datasets import make_classification
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

Before training, set up MLflow to track your experiments. You need to start a tracking server. Run this command in your terminal.

mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./artifacts --host 0.0.0.0 --port 5000

This starts a local server. Now, in your Python code, connect to it.

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("loan_default_prediction")

With MLflow ready, you can start logging. Each training run becomes a recorded experiment. Let’s train a model and log details.

with mlflow.start_run() as run:  # keep a handle on the run for later registration
    # Log parameters
    n_estimators = 100
    max_depth = 5
    mlflow.log_param("n_estimators", n_estimators)
    mlflow.log_param("max_depth", max_depth)
    
    # Train model
    model = RandomForestClassifier(n_estimators=n_estimators, max_depth=max_depth, random_state=42)
    model.fit(X_train, y_train)
    
    # Evaluate
    predictions = model.predict(X_test)
    acc = accuracy_score(y_test, predictions)
    mlflow.log_metric("accuracy", acc)
    
    # Log the model
    mlflow.sklearn.log_model(model, "model")

Run this script. Then, open your browser and go to http://localhost:5000. You’ll see a web interface showing your experiment. It lists the parameters, metrics, and the model file. Isn’t it satisfying to have everything in one place?

But what if you want to try different settings? You can run multiple experiments. Change the parameters and run the script again. MLflow will log each run separately. You can compare them side by side. Which combination of parameters gives the best accuracy?
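You don’t have to edit the script by hand for every variation, either. Here is a minimal sketch of a parameter sweep, reusing the data split from above, with one MLflow run per configuration:

# Try several settings, logging each as its own MLflow run
for n_estimators, max_depth in [(50, 3), (100, 5), (200, 8)]:
    with mlflow.start_run():
        mlflow.log_params({"n_estimators": n_estimators, "max_depth": max_depth})
        model = RandomForestClassifier(
            n_estimators=n_estimators, max_depth=max_depth, random_state=42
        )
        model.fit(X_train, y_train)
        mlflow.log_metric("accuracy", accuracy_score(y_test, model.predict(X_test)))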

Now, let’s talk about the model registry. After training, you need a place to store and manage models. MLflow has a model registry for this. It allows you to version models and track their stage, like “Staging” or “Production.”

First, register a model from one of your runs. You can do it through the UI or code. Here’s how in Python.

# Register the model from the training run. Note that mlflow.active_run()
# returns None once the with-block ends, so we use the run handle
# captured by "with mlflow.start_run() as run" above.
run_id = run.info.run_id
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "LoanDefaultModel")

Once registered, you can transition the model through stages, for example moving it to “Production” when it’s ready. (Recent MLflow releases favor model version aliases over stages, but stages still illustrate the workflow.)

from mlflow.tracking import MlflowClient
client = MlflowClient()
client.transition_model_version_stage(
    name="LoanDefaultModel",
    version=1,
    stage="Production"
)
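If you are on MLflow 2.9 or newer, the alias-based equivalent is a one-liner. A sketch, where the alias name “champion” is just our choice:

# Alias-based alternative to stages (MLflow 2.9+); the alias name is arbitrary
client.set_registered_model_alias("LoanDefaultModel", "champion", "1")

The model can then be loaded with the URI models:/LoanDefaultModel@champion.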

This helps keep your deployment process organized. How do you decide when a model is ready for production? Often, it’s based on metrics like accuracy or precision.
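One practical gate is to load the registered version back and check it against a held-out set before promoting it. A minimal sketch, reusing the test split from earlier (the version number is illustrative):

# Load a specific registered version and sanity-check it before promotion
candidate = mlflow.pyfunc.load_model("models:/LoanDefaultModel/1")
print("held-out accuracy:", accuracy_score(y_test, candidate.predict(X_test)))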

In a team setting, collaboration is key. MLflow allows multiple users to log experiments to the same server. Everyone can see what others are doing. This reduces duplication of effort. Have you ever worked on a project where two people trained the same model without knowing?
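In practice, everyone points at the same tracking server. Rather than hard-coding the address in every script, you can set the MLFLOW_TRACKING_URI environment variable (the hostname below is a placeholder):

export MLFLOW_TRACKING_URI=http://your-mlflow-server:5000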

To make this robust, you should structure your code as a pipeline. Break it into steps: data loading, preprocessing, training, and evaluation. Each step can be logged. Here’s a simplified example.

# Define a preprocessing step
from sklearn.preprocessing import StandardScaler
import joblib

scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

# Save and log the fitted scaler as an artifact; do this inside the
# same mlflow.start_run() block as the model so both land in one run
joblib.dump(scaler, "scaler.pkl")
mlflow.log_artifact("scaler.pkl")

Logging artifacts like this ensures that you capture everything needed to reproduce the model. What good is a model if you can’t preprocess new data the same way?
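Later, anyone can pull that scaler back from the tracking server and apply the exact same transform to new data. A sketch, assuming MLflow 2.x and the run_id captured from the training run:

# Download the logged scaler from the run and reuse it on new data
local_path = mlflow.artifacts.download_artifacts(
    run_id=run_id, artifact_path="scaler.pkl"
)
scaler = joblib.load(local_path)
X_new_scaled = scaler.transform(X_test)  # apply the identical transform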

As your project grows, you might add hyperparameter tuning. You can combine tools like GridSearchCV with MLflow: log the winning configuration, and optionally every candidate tried, so you build a history of what worked and what didn’t.

from sklearn.model_selection import GridSearchCV

param_grid = {'n_estimators': [50, 100], 'max_depth': [3, 5]}
grid_search = GridSearchCV(RandomForestClassifier(random_state=42), param_grid, cv=3)

with mlflow.start_run():
    grid_search.fit(X_train_scaled, y_train)

    # Log the best parameters and the cross-validated score
    mlflow.log_params(grid_search.best_params_)
    mlflow.log_metric("best_score", grid_search.best_score_)
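If you want every candidate on record rather than just the winner, you can replay the grid’s results into nested child runs afterwards. A minimal sketch, using the cv_results_ from the fit above (the run name is our choice):

# One nested child run per candidate configuration
with mlflow.start_run(run_name="grid_search_history"):
    results = grid_search.cv_results_
    for params, score in zip(results["params"], results["mean_test_score"]):
        with mlflow.start_run(nested=True):
            mlflow.log_params(params)
            mlflow.log_metric("mean_cv_accuracy", score)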

With all this data logged, you can analyze past experiments. Use the MLflow API to query runs. Find the best model based on your criteria.

# Search for runs with accuracy above 0.9
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Look up the experiment ID by name instead of hard-coding it
experiment = client.get_experiment_by_name("loan_default_prediction")
runs = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="metrics.accuracy > 0.9"
)
for run in runs:
    print(f"Run ID: {run.info.run_id}, Accuracy: {run.data.metrics['accuracy']}")

This programmatic access is powerful for automation. Imagine setting up a script that promotes the best model to production automatically. What thresholds would you use?
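Here is one way that automation could look. A sketch under a few assumptions: it reuses the client and experiment from the previous snippet, and the 0.9 threshold and model name are illustrative, not prescriptive.

# Find the best qualifying run, register its model, and promote it
candidates = client.search_runs(
    experiment_ids=[experiment.experiment_id],
    filter_string="metrics.accuracy > 0.9",
    order_by=["metrics.accuracy DESC"],
    max_results=1,
)
if candidates:
    best = candidates[0]
    version = mlflow.register_model(f"runs:/{best.info.run_id}/model", "LoanDefaultModel")
    client.transition_model_version_stage(
        name="LoanDefaultModel", version=version.version, stage="Production"
    )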

In production, you need to serve models. MLflow makes this straightforward: you can deploy a registered model as a REST API. Run this command, with MLFLOW_TRACKING_URI pointing at your tracking server so the models:/ URI can be resolved.

mlflow models serve -m "models:/LoanDefaultModel/Production" -p 1234

Now you have a scoring endpoint: send a POST request with your data to http://localhost:1234/invocations and it returns predictions. This simplifies integration with other systems.
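As a quick smoke test, you can hit the endpoint with curl. The ten feature values below are placeholders matching our synthetic dataset’s shape, and the payload uses the dataframe_split format that MLflow 2.x scoring servers accept:

curl -X POST http://localhost:1234/invocations \
  -H "Content-Type: application/json" \
  -d '{"dataframe_split": {"columns": [0,1,2,3,4,5,6,7,8,9], "data": [[0.1, -1.2, 0.5, 0.3, -0.7, 1.1, 0.0, -0.4, 0.9, -0.2]]}}'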

Throughout this process, documentation is vital. MLflow logs help, but you should also add comments and readme files. Share insights with your team. Why did a particular model fail? What patterns did you notice?

I’ve found that setting up these practices early saves countless hours later. Start small, with a single project, and expand as needed. The key is consistency. Make tracking a habit, not an afterthought.

What challenges have you faced in tracking ML experiments? How do you think a system like this could help your workflow?

To wrap up, experiment tracking and model versioning are not just nice-to-haves. They are essential for reliable machine learning. MLflow provides the tools to build this into your pipeline. From logging runs to managing deployments, it covers the lifecycle.

I hope this guide helps you avoid the chaos I experienced. If you found these ideas useful, please like, share, and comment below. Your feedback can help others in the community. Let’s build better ML systems together.

