MLflow Experiment Tracking and Model Registry: Reproducible ML From Training to Production
Learn MLflow experiment tracking and model registry to version models, improve reproducibility, and streamline ML deployment workflows.
I still remember the afternoon I spent trying to trace back which hyperparameters had given me that 0.92 AUC score two months earlier. I had six notebooks, a dozen CSV files, and a spreadsheet that was already out of date. That was the moment I decided I needed a proper experiment tracking system, and MLflow became the tool I couldn’t live without.
You see, in machine learning, the model itself is only part of the problem. The real challenge is keeping track of everything around it – the data version, the parameters, the metrics, the code. Without that, you’re flying blind. MLflow gives you a structured way to log every detail of your experiments, version your models, and manage their lifecycle from training to production.
Let’s start with the core of it: experiment tracking. In MLflow, you create an experiment – a logical grouping of runs. Each run corresponds to a single training session. Inside a run, you log parameters, metrics, artifacts, and tags. Run metadata is stored in a backend database (PostgreSQL in a production setup), while artifacts go into an object store like S3 or MinIO.
Have you ever tried to explain to a colleague why a model suddenly got worse after a deployment? Without tracking, it’s guesswork. With MLflow, you can point to the exact run, look at the parameters, and say “this is where we changed the learning rate from 0.01 to 0.001.” That clarity is priceless.
I personally set up my own instance using Docker Compose with PostgreSQL and MinIO. The configuration is straightforward – you wire a few environment variables, create a bucket, and you’re running. The tracking server then exposes a UI at localhost:5000 where you can see every experiment and run.
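To give a rough idea of the wiring, here is a minimal sketch of the server invocation; the bucket name, credentials, and connection string are placeholders for whatever your own Compose file defines:

export MLFLOW_S3_ENDPOINT_URL=http://localhost:9000   # point the S3 client at MinIO
export AWS_ACCESS_KEY_ID=minio                        # placeholder MinIO credentials
export AWS_SECRET_ACCESS_KEY=minio123
mlflow server \
  --backend-store-uri postgresql://mlflow:mlflow@localhost:5432/mlflow \
  --default-artifact-root s3://mlflow-artifacts \
  --host 0.0.0.0 \
  --port 5000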
Once the server is up, you interact with it via the MLflow client. Here’s a quick example of how I start a run:
import mlflow

mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("credit_default_prediction")

with mlflow.start_run(run_name="random_forest_v1"):
    mlflow.log_param("n_estimators", 200)             # hyperparameters
    mlflow.log_param("max_depth", 10)
    mlflow.log_metric("auc", 0.88)                    # evaluation metric
    mlflow.log_artifact("feature_importance.png")     # any local file becomes a run artifact
That simple block captures everything I need. After a few runs, you can compare them in the UI, pick the best one, and move on.
But tracking alone isn’t enough when you have dozens or hundreds of models. You need a way to register and version those models. That’s where the Model Registry comes in. Once you’ve logged a model as an artifact, you can register it with a name. Each new registration becomes a version – v1, v2, v3, and so on. Each version can have a stage: Staging, Production, Archived.
I use the registry to enforce governance. For example, before a model goes into production, it must pass a set of validation checks. Only then do I transition it from Staging to Production. Here’s how I do it programmatically:
from mlflow.tracking import MlflowClient

client = MlflowClient()

# Register the model name once; later registrations add new versions to it.
client.create_registered_model("CreditDefaultModel")

# Create a version from the model artifact logged by a specific run.
client.create_model_version(
    name="CreditDefaultModel",
    source="runs:/<run_id>/model",
    run_id="<run_id>",
)

# Promote the validated version.
client.transition_model_version_stage(
    name="CreditDefaultModel",
    version="1",
    stage="Production",
)
Now I can load the production model anywhere and be confident it’s the right one.
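Loading by stage is a one-liner; here is a minimal sketch, assuming the registry entry above and a model logged with a flavor MLflow knows how to load:

import mlflow

mlflow.set_tracking_uri("http://localhost:5000")

# "models:/<name>/<stage>" resolves to the latest version in that stage.
model = mlflow.pyfunc.load_model("models:/CreditDefaultModel/Production")
predictions = model.predict(X_new)  # X_new is a placeholder for your feature DataFrame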
What about reproducibility? MLflow lets you log the entire environment – including the Git commit hash and the conda.yaml. So even two years later, you can reproduce that exact experiment. I’ve done it myself, and it works.
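As an illustration, the commit hash comes back as a run tag; this sketch assumes the run was launched from inside a Git repository, which is when MLflow records it automatically:

from mlflow.tracking import MlflowClient

client = MlflowClient()
run = client.get_run("<run_id>")  # placeholder run ID

# MLflow stores the commit under this standard tag key.
print(run.data.tags.get("mlflow.source.git.commit"))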
I also like to log custom plots and metrics during training. For instance, I’ll log a confusion matrix or a feature importance bar chart as an artifact. During deployment, I can pull those artifacts to show stakeholders what the model actually learned.
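A minimal sketch of that, using scikit-learn and matplotlib; the label and prediction variables are placeholders:

import mlflow
import matplotlib.pyplot as plt
from sklearn.metrics import ConfusionMatrixDisplay

# y_val and y_pred stand in for your validation labels and model predictions.
fig, ax = plt.subplots()
ConfusionMatrixDisplay.from_predictions(y_val, y_pred, ax=ax)

with mlflow.start_run(run_name="random_forest_v1_eval"):
    mlflow.log_figure(fig, "confusion_matrix.png")  # saved next to the run's other artifacts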
One thing to watch: when you deploy a model, you need to include a model signature. This documents the expected input schema and output schema. It prevents a lot of runtime errors. You define it like this:
from mlflow.models import infer_signature

# y_pred is the model's predictions on X_train; together they define the input/output schema.
signature = infer_signature(X_train, y_pred)
mlflow.sklearn.log_model(model, "model", signature=signature)
Ask yourself: how many times have you deployed a model only to find it crashes because the input shape changed? A signature catches that at log time.
Now, about production pipelines – MLflow doesn’t replace your orchestrator (like Airflow or Kubeflow), but it integrates beautifully. I often run training jobs in parallel, each logging to a parent run that groups all child runs. That way, I can see the whole hyperparameter sweep as a single entity.
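The pattern looks roughly like this; the sweep values are made up for illustration:

import mlflow
from itertools import product

mlflow.set_experiment("credit_default_prediction")

# The parent run groups the sweep; each configuration becomes a nested child run.
with mlflow.start_run(run_name="rf_sweep"):
    for n_estimators, max_depth in product([100, 200], [5, 10]):
        with mlflow.start_run(run_name=f"rf_{n_estimators}_{max_depth}", nested=True):
            mlflow.log_param("n_estimators", n_estimators)
            mlflow.log_param("max_depth", max_depth)
            # train_and_evaluate is a placeholder for your own training code:
            # mlflow.log_metric("auc", train_and_evaluate(n_estimators, max_depth))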
Let’s talk about a real scenario. I was working on a credit default prediction model with a heavily imbalanced dataset. I ran 50 experiments with different resampling strategies and classifiers. Without MLflow, I would have been buried in spreadsheets. Instead, I created one experiment, launched multiple runs, and compared them side by side in the UI. The best run – a LightGBM with SMOTE – had an AUC of 0.94. I registered it, transitioned it to Staging, ran a shadow deployment for a week, then promoted it to Production.
Every step was logged. Every decision had a trace.
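The same comparison works programmatically too; a sketch assuming the experiment and metric names used earlier and a reasonably recent MLflow version:

import mlflow

# Returns a pandas DataFrame of runs, best AUC first.
runs = mlflow.search_runs(
    experiment_names=["credit_default_prediction"],
    order_by=["metrics.auc DESC"],
)
print(runs[["run_id", "metrics.auc"]].head())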
You might wonder: is it worth setting up a database and object storage just for model metadata? Absolutely. When you’re in a team where more than one person trains models, the registry becomes the single source of truth. It eliminates the “which model is live?” confusion. It also makes compliance audits straightforward because you can show the entire lineage of any production model.
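For example, the full lineage of a registered model is a short query away; a sketch using the client API and the model name from earlier:

from mlflow.tracking import MlflowClient

client = MlflowClient()

# Every version, its current stage, and the run that produced it.
for mv in client.search_model_versions("name='CreditDefaultModel'"):
    print(mv.version, mv.current_stage, mv.run_id, mv.source)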
MLflow also supports REST serving. Once a model is registered, you can deploy it as a REST API with one command:
mlflow models serve -m "models:/CreditDefaultModel/Production" --port 5002
That’s a fully functional endpoint running your latest production model. You can test it with a curl request in seconds.
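A quick smoke test might look like the following; the column names and values are placeholders, and the dataframe_split payload format is what recent MLflow scoring servers expect:

curl -X POST http://localhost:5002/invocations \
  -H "Content-Type: application/json" \
  -d '{"dataframe_split": {"columns": ["age", "income", "debt_ratio"], "data": [[35, 52000, 0.31]]}}'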
I’ve found that the combination of tracking and registry gives me confidence. I no longer fear making changes to the training pipeline because I know I can roll back, compare, or reproduce any past result. It’s not magic – it’s just good engineering.
If you’re still manually copying model files and renaming them “model_v3_final_actual”, please stop. Spend a weekend setting up MLflow. It will save you months of grief.
Now, here’s my final question to you: how many hours have you lost trying to figure out what changed between two model versions? If you’re like me, it’s too many.
If this article helped you, please like, share, and leave a comment below. I’d love to hear how you handle model tracking in your own projects.