MLflow Tutorial: Track, Version, and Serve Machine Learning Models Reliably

Learn how to use MLflow to track experiments, version models, and serve ML APIs reliably. Follow this practical workflow today.

I remember the moment clearly. I had just finished training the 47th version of my churn prediction model, each with slightly different hyperparameters, feature combinations, and random seeds. My notebook was a mess of commented-out parameters, and I had no idea which combination had produced the best AUC. Worse, when I tried to reproduce the “best” model three weeks later, the environment had changed, libraries had updated, and I couldn’t replicate the results. That’s when I realized that building a machine learning model is not the real challenge—keeping track of what you did, and more importantly, being able to recreate and serve that work reliably, is where the real engineering discipline begins.

MLflow is the tool that rescued me from this chaos. It’s an open-source platform that handles the entire lifecycle of machine learning: tracking experiments, versioning models, managing a registry, and serving models for inference. In this article, I’ll take you through my workflow using a churn prediction use case, showing you exactly how to set up each component. By the end, you’ll be able to instrument your own training code, compare runs visually, promote models through staging, and serve them as REST APIs—all without losing your sanity.

Have you ever lost track of which model you deployed to production? Let’s fix that.


Starting with the Experiment Tracker

The first thing I do when starting any ML project is configure MLflow’s tracking server. For local development, a simple SQLite backend and local artifact store work fine, but if you’re collaborating or planning to scale, use PostgreSQL and an S3-compatible object store like MinIO. Here’s my minimal docker-compose.yml that spins up both:

version: "3.8"
services:
  postgres:
    image: postgres:15
    environment:
      POSTGRES_USER: mlflow
      POSTGRES_PASSWORD: mlflow
      POSTGRES_DB: mlflow_db
    ports:
      - "5432:5432"
  minio:
    image: minio/minio
    command: server /data --console-address ":9001"
    environment:
      MINIO_ROOT_USER: admin
      MINIO_ROOT_PASSWORD: admin123
    ports:
      - "9000:9000"
      - "9001:9001"

Once the stack is up, create a bucket named mlflow-bucket (the MinIO console at http://localhost:9001 is the quickest way), export the MinIO credentials so MLflow can write artifacts to it, and start the MLflow server pointing to the PostgreSQL backend and MinIO:

export MLFLOW_S3_ENDPOINT_URL=http://localhost:9000
export AWS_ACCESS_KEY_ID=admin
export AWS_SECRET_ACCESS_KEY=admin123

mlflow server \
  --backend-store-uri postgresql://mlflow:mlflow@localhost:5432/mlflow_db \
  --default-artifact-root s3://mlflow-bucket/ \
  --host 0.0.0.0 --port 5000
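
If you prefer to create the bucket from code rather than the console, a short boto3 snippet does the job (a sketch assuming the MinIO credentials from the compose file above):

import boto3

# point the S3 client at the local MinIO instance
s3 = boto3.client(
    "s3",
    endpoint_url="http://localhost:9000",
    aws_access_key_id="admin",
    aws_secret_access_key="admin123",
)
s3.create_bucket(Bucket="mlflow-bucket")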

I set the tracking URI in my Python code (the training process also needs the same MinIO environment variables from above, since with --default-artifact-root clients upload artifacts directly to the object store):

import mlflow
mlflow.set_tracking_uri("http://localhost:5000")

Now every mlflow.start_run() call will log to the central server. You can view all runs at http://localhost:5000.


Instrumenting the Training Code

Let me show you how I log every detail of a training run—parameters, metrics, tags, and the model itself. I’ll use a synthetic churn dataset to keep the focus on the MLflow integration.

import pandas as pd
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

# generate dummy data (shown briefly)
np.random.seed(42)
n = 5000
df = pd.DataFrame({
    'tenure': np.random.randint(1, 72, n),
    'monthly_charges': np.random.uniform(20, 120, n),
    'support_calls': np.random.randint(0, 10, n),
    'contract': np.random.choice([0,1,2], n),  # 0: monthly, 1: annual, 2: two-year
    'churn': np.random.choice([0,1], n, p=[0.7, 0.3])
})

X = df.drop('churn', axis=1)
y = df['churn']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=42)

# Start an MLflow run
with mlflow.start_run(run_name="rf_baseline") as run:  # keep a handle for registering the model later
    # log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("max_depth", 6)
    mlflow.log_param("min_samples_split", 10)
    
    # train model
    model = RandomForestClassifier(n_estimators=100, max_depth=6, min_samples_split=10, random_state=42)
    model.fit(X_train, y_train)
    
    # evaluate
    y_pred = model.predict(X_test)
    acc = accuracy_score(y_test, y_pred)
    prec = precision_score(y_test, y_pred)
    rec = recall_score(y_test, y_pred)
    f1 = f1_score(y_test, y_pred)
    
    # log metrics
    mlflow.log_metric("accuracy", acc)
    mlflow.log_metric("precision", prec)
    mlflow.log_metric("recall", rec)
    mlflow.log_metric("f1", f1)
    
    # log the model (as sklearn flavor)
    mlflow.sklearn.log_model(model, "model")
    
    # log extra files (like feature importance plot)
    import matplotlib.pyplot as plt
    importance = pd.Series(model.feature_importances_, index=X.columns)
    importance.sort_values().plot(kind='barh')
    plt.tight_layout()
    plt.savefig("feature_importance.png")
    mlflow.log_artifact("feature_importance.png")

After running the script, open the MLflow UI and you'll find this run under the "Default" experiment, with its parameters, metrics, and a link to the logged model artifact.
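
If you'd rather not pile everything into the Default experiment, name one before starting runs; it's created on first use (the name below is just an example):

mlflow.set_experiment("churn_prediction")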

What if you want to compare multiple runs? Run the same code with different n_estimators values. The UI lets you select several runs and compare their metrics side-by-side. That visual comparison is where you truly appreciate experiment tracking—no more spreadsheets.
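
The same comparison is available programmatically: mlflow.search_runs returns the runs of the active experiment as a pandas DataFrame, which is handy for quick filtering (a small sketch, assuming the metrics logged above):

# pull all runs of the active experiment, best F1 first
runs = mlflow.search_runs(order_by=["metrics.f1 DESC"])
print(runs[["run_id", "params.n_estimators", "metrics.f1"]].head())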


Versioning Models with the Model Registry

Once you’ve identified the best run, you can register its model in the MLflow Model Registry. This adds a version number and supports lifecycle stages: None, Staging, Production, Archived.

From the UI, click on the best run → Artifacts → model → Register Model. Give it a name, say “churn_prediction”. Or do it programmatically:

# after the training run above; run is the handle returned by start_run
mlflow.register_model(f"runs:/{run.info.run_id}/model", "churn_prediction")

The registry now holds version 1 of your model. If you later train a better model and register it again, version 2 is created. You can then promote version 2 to “Staging” for testing, and later to “Production”. This gives you a clear history of which model is serving live traffic.
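
The promotion itself can also be scripted with the MlflowClient; a sketch (version 2 here is illustrative):

from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="churn_prediction",
    version=2,
    stage="Staging",  # later: "Production"
)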

I always add a description to each registered model version explaining what changed—feature set, algorithm, or data range. It pays off when your teammate asks why the new model regressed.
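
Reusing the client from the sketch above, the note can be attached programmatically (the description text is just an example):

client.update_model_version(
    name="churn_prediction",
    version=2,
    description="Switched to RandomForest with max_depth=6; added support_calls feature.",
)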


Serving the Model as a REST API

Deploying a model shouldn’t require writing a web server from scratch. MLflow includes a built-in serving capability that turns any logged model into a REST endpoint. The server loads the model automatically and exposes an /invocations endpoint for predictions.

To serve the Production model, point the serving process at the tracking server so the models:/ URI can be resolved (it also needs the MinIO credentials from earlier to download the artifact):

export MLFLOW_TRACKING_URI=http://localhost:5000
mlflow models serve -m "models:/churn_prediction/Production" --port 5001

Now you can send predictions via HTTP:

import requests

data = {
    "dataframe_split": {
        "columns": ["tenure", "monthly_charges", "support_calls", "contract"],
        "data": [[12, 65.3, 2, 0], [48, 95.0, 5, 2]]
    }
}

response = requests.post("http://localhost:5001/invocations", json=data)
print(response.json())

You get back a JSON object whose predictions field contains the model’s outputs (class labels for this classifier). The same endpoint works for scikit-learn, PyTorch, TensorFlow, and custom Python (pyfunc) models.

What if you need to serve a specific version, not the production stage? Use the version number in the URI: models:/churn_prediction/1.
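
For example (the port is arbitrary):

mlflow models serve -m "models:/churn_prediction/1" --port 5002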


Automating Hyperparameter Tuning with Optuna

I rarely train a single model—I usually run hundreds of trials with Optuna. MLflow integrates beautifully with Optuna if you log each trial as its own run.

import optuna
from sklearn.ensemble import RandomForestClassifier

def objective(trial):
    params = {
        "n_estimators": trial.suggest_int("n_estimators", 50, 200),
        "max_depth": trial.suggest_int("max_depth", 3, 10),
        "min_samples_split": trial.suggest_int("min_samples_split", 2, 20),
    }
    
    model = RandomForestClassifier(**params, random_state=42)
    model.fit(X_train, y_train)
    score = f1_score(y_test, model.predict(X_test))
    
    # Log this trial as an MLflow run
    with mlflow.start_run(nested=True):
        mlflow.log_params(params)
        mlflow.log_metric("f1", score)
        mlflow.sklearn.log_model(model, "model")
    
    return score

study = optuna.create_study(direction="maximize")

# wrap the search in a parent run so each trial's nested run groups underneath it
with mlflow.start_run(run_name="optuna_search"):
    study.optimize(objective, n_trials=50)

Each trial becomes a nested run under the parent optuna_search run. Select them in the MLflow UI and open the parallel coordinates plot to see how the hyperparameters affect F1.
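
Once the search finishes, I pick the best trial and register its model; a sketch that assumes F1 was logged for every trial as above:

# find the best trial by F1 and register its model
best = mlflow.search_runs(
    filter_string="metrics.f1 > 0",
    order_by=["metrics.f1 DESC"],
    max_results=1,
)
best_run_id = best.iloc[0]["run_id"]
mlflow.register_model(f"runs:/{best_run_id}/model", "churn_prediction")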


Batch Prediction from the Registry

REST APIs are great for online serving, but sometimes you need to score thousands of records offline. MLflow’s Python client can load any registered model and apply it to a DataFrame:

import mlflow.pyfunc

model_uri = "models:/churn_prediction/Production"
loaded_model = mlflow.pyfunc.load_model(model_uri)

# batch_score new data
new_data = pd.DataFrame({
    'tenure': [6, 24, 60],
    'monthly_charges': [80, 50, 110],
    'support_calls': [8, 1, 0],
    'contract': [0, 1, 2]
})

predictions = loaded_model.predict(new_data)
print(predictions)
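
To make the batch output useful downstream, attach the predictions to the input frame and write the result out (the file name is arbitrary):

scored = new_data.assign(churn_prediction=predictions)
scored.to_csv("churn_scores.csv", index=False)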

This gives you a reproducible way to run inference without managing model files manually. The version is fixed, so your batch pipeline always uses the approved model.


Wrapping Up

Machine learning is full of messy experiments and forgotten configurations. MLflow gives you a single, open-source platform to track, version, and serve models with confidence. I’ve seen teams reduce their model deployment time from days to minutes once they adopt this workflow.

Now, here’s my request to you: try this on your next ML project. Set up the tracking server, log a few runs, register a model, and serve it. Then come back and let me know how it went. If you found this guide useful, please like, share it with your teammates, and comment below with your own MLflow tips or questions. Your feedback helps me write better articles.



