MLflow Experiment Tracking Guide: Reproducible Machine Learning Without Notebook Chaos
Learn MLflow experiment tracking to log metrics, parameters, models, and artifacts for reproducible machine learning workflows.
I still remember the first time I built a machine learning model I was proud of. It took me three days of tweaking parameters and trying different preprocessing steps. By the third day I had a notebook full of cells and no idea which combination actually gave me that 0.97 ROC‑AUC score. Did I use StandardScaler or RobustScaler? Was it the Random Forest with 300 trees or the one with 500? The notebook was a mess. I felt stupid. And I was not alone. Every data scientist I spoke with had the same story. That is why I want to talk about experiment tracking. It sounds boring. It is anything but. Without it, you are basically throwing your best ideas into a black hole.
Experiment tracking is the habit of recording everything about a training run: the hyperparameters, the metrics, the exact pipeline, the versions of the libraries, the time it took, the artifacts like plots and models. You do it so that two weeks from now you can answer the simple question “What the hell did I do to get that result?” MLflow is the tool that makes this painless. It works with Scikit‑learn, PyTorch, TensorFlow, any Python library. I will show you how to use it from scratch.
First, install the packages. Open a terminal and run:
pip install mlflow scikit-learn pandas numpy matplotlib seaborn optuna
Then start the tracking server. I like SQLite for small projects because it is zero-config:
mlflow server --backend-store-uri sqlite:///mlflow.db --default-artifact-root ./mlflow-artifacts --host 0.0.0.0 --port 5000
Now open http://localhost:5000 in your browser. You will see an empty UI. Good. Now we connect our Python code to that server.
import mlflow
TRACKING_URI = "http://localhost:5000"
mlflow.set_tracking_uri(TRACKING_URI)
client = mlflow.MlflowClient()
print(mlflow.get_tracking_uri())
That prints the URI. If you see “http://localhost:5000”, you are connected. Now, what happens if you do not use a server? Everything gets stored in a local ./mlruns folder next to your script. That is fine for one person, but when you work with a team you need a central place, and the server gives you that. It also keeps your experiments safe if your laptop catches fire.
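If you do want the local route for a quick solo project, it is one line. A minimal sketch, reusing the mlflow import from above:
mlflow.set_tracking_uri("file:./mlruns")          # plain local folder, MLflow's default store
# mlflow.set_tracking_uri("sqlite:///mlflow.db")  # or a local SQLite file, no server needed
print(mlflow.get_tracking_uri())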
Why do we even need a registry on top of tracking? Because tracking records history, but a registry lets you say “this version is the one we trust for production.” It is like Git tags but for models. I will come back to that later.
Let us create an experiment. An experiment is just a bucket for related runs. For this example I will use a synthetic fraud detection dataset because it is realistic and imbalanced. That forces us to care about metrics like precision and recall.
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# 50,000 samples, 28 features, ~3% positives -- heavily imbalanced on purpose.
X, y = make_classification(n_samples=50000, n_features=28, n_informative=15,
                           n_redundant=5, weights=[0.97, 0.03], random_state=42)

X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2,
                                                    stratify=y, random_state=42)
Now I set up the MLflow experiment.
experiment_name = "fraud-detection-baseline"
mlflow.set_experiment(experiment_name)
You can also create it with tags: mlflow.create_experiment(name, tags={...}). Experiment tags help you search later. For example I tag every experiment with the project name and the team. When you have hundreds of experiments, tags save you.
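Here is a minimal sketch of that pattern. The experiment name and the tag keys ("project", "team") are just my own naming convention, and mlflow.search_experiments assumes a reasonably recent MLflow version:
# Create a tagged experiment (create_experiment raises if the name already exists).
exp_id = mlflow.create_experiment(
    "fraud-detection-tuning",
    tags={"project": "fraud", "team": "risk-ml"},
)

# Later, find experiments by tag instead of scrolling through the UI.
for exp in mlflow.search_experiments(filter_string="tags.project = 'fraud'"):
    print(exp.experiment_id, exp.name)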
Now the actual training run. I will log parameters, metrics, and the model itself. Watch carefully.
from sklearn.ensemble import RandomForestClassifier
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import Pipeline
from sklearn.metrics import roc_auc_score, average_precision_score, f1_score
with mlflow.start_run(run_name="rf-v1") as run:
    params = {"n_estimators": 200, "max_depth": 10,
              "min_samples_split": 20, "class_weight": "balanced"}

    pipe = Pipeline([("scaler", StandardScaler()),
                     ("clf", RandomForestClassifier(**params, random_state=42))])
    pipe.fit(X_train, y_train)

    y_pred = pipe.predict(X_test)
    y_proba = pipe.predict_proba(X_test)[:, 1]

    auc = roc_auc_score(y_test, y_proba)
    auprc = average_precision_score(y_test, y_proba)
    f1 = f1_score(y_test, y_pred)

    # Everything below ends up attached to this run in the MLflow UI.
    mlflow.log_params(params)
    mlflow.log_metrics({"roc_auc": auc, "auprc": auprc, "f1": f1})
    mlflow.sklearn.log_model(pipe, "model")

    print(f"Run {run.info.run_id} - auc: {auc:.4f}, auprc: {auprc:.4f}")
Every time you run this block, a new run appears in the UI. You see the metrics and parameters, and you can browse and download the logged model files. That is the bare minimum. Now I want to show you something that blew my mind: you can log plots too. Just save a matplotlib figure to a file and log it as an artifact.
import matplotlib.pyplot as plt
from sklearn.metrics import confusion_matrix, ConfusionMatrixDisplay

# Run this inside the same `with mlflow.start_run(...)` block so the image
# is attached to that run rather than to a new implicit one.
cm = confusion_matrix(y_test, y_pred)
disp = ConfusionMatrixDisplay(confusion_matrix=cm)
disp.plot()
plt.savefig("confusion_matrix.png")
mlflow.log_artifact("confusion_matrix.png")
plt.close()
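If you would rather skip the intermediate file, newer MLflow versions can log a matplotlib figure object directly. A small sketch of the same plot:
# Log the figure itself -- no file written to disk first.
fig, ax = plt.subplots()
ConfusionMatrixDisplay(confusion_matrix=cm).plot(ax=ax)
mlflow.log_figure(fig, "confusion_matrix.png")
plt.close(fig)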
Now when you open that run in the UI, you see the confusion matrix image right there. That kind of context saves hours of head‑scratching. “Oh, right, that run had great recall because I used class_weight.”
I remember a colleague who never tracked anything. He spent two weeks chasing the same configuration. I showed him MLflow. He looked at the UI and said “I could have saved two weeks of my life.” So why do so many people skip this step? Because setting it up feels like overhead. But the overhead is five lines of code. The benefit is that you never lose a great model again.
Now, what if you want to compare multiple runs? Open the experiment page in the MLflow UI. You can select several runs and click “Compare.” It shows you a table of all metrics side by side. You see that run A has higher AUC but lower precision. Run B is the opposite. Your business might need precision more. You can decide immediately. Without tracking, you would have to dig through logs or notebooks. Inevitably you find the wrong one.
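You can run the same comparison in code, which helps once the run count grows. A sketch assuming the experiment name from above and that search_runs returns a pandas DataFrame (the default):
# All runs in the experiment, best ROC-AUC first.
runs = mlflow.search_runs(
    experiment_names=["fraud-detection-baseline"],
    order_by=["metrics.roc_auc DESC"],
)
print(runs[["run_id", "metrics.roc_auc", "metrics.auprc", "params.n_estimators"]].head())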
Let us test the registry. After you have a few good runs, you want to register a model. The registry is separate from experiments. You create a registered model with a name, then you add versions of it from specific runs.
# Register the model from the run we just finished.
model_uri = f"runs:/{run.info.run_id}/model"

client.create_registered_model("FraudDetector")   # only needed once; raises if the name already exists
client.create_model_version("FraudDetector", source=model_uri,
                            run_id=run.info.run_id, description="baseline RF")
You can also use mlflow.register_model(model_uri, "FraudDetector") inside a run; it creates the registered model if needed and adds the version in one call. Then in the UI you can promote versions from “Staging” to “Production” to “Archived.” That gives you a clear promotion path from experiment to deployment.
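Promotion also works from code. Here is a minimal sketch using the classic stage API (newer MLflow versions prefer model aliases, but the idea is the same); the version number is illustrative:
# Promote a specific version of the registered model to Production.
client.transition_model_version_stage(
    name="FraudDetector",
    version=1,           # whichever version you want to promote
    stage="Production",
)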
Why does this matter? Because now when you deploy a model, you do not guess which pickle file to use. You query the registry for the latest production version.
prod_model = mlflow.pyfunc.load_model(model_uri="models:/FraudDetector/Production")
predictions = prod_model.predict(X_test)
That code loads the exact model that is in production. No more “I think I used the model from run 42, but maybe it was 43.” The registry removes ambiguity.
One more thing. Hyperparameter tuning with Optuna or GridSearchCV can also be tracked. You start a parent run, open a nested run inside the objective function, and log each trial. That way you have a historical record of every trial, not just the best one.
import optuna

def objective(trial):
    n_est = trial.suggest_int("n_estimators", 100, 500)
    max_depth = trial.suggest_int("max_depth", 5, 30)
    with mlflow.start_run(nested=True):
        pipe = Pipeline([("scaler", StandardScaler()),
                         ("clf", RandomForestClassifier(n_estimators=n_est,
                                                        max_depth=max_depth,
                                                        random_state=42))])
        pipe.fit(X_train, y_train)
        auc = roc_auc_score(y_test, pipe.predict_proba(X_test)[:, 1])
        mlflow.log_params({"n_estimators": n_est, "max_depth": max_depth})
        mlflow.log_metric("auc", auc)
    return auc

# The parent run groups all 20 trial runs underneath it.
with mlflow.start_run(run_name="optuna-tuning"):
    study = optuna.create_study(direction="maximize")
    study.optimize(objective, n_trials=20)
Notice nested=True. Because the parent run is still active while Optuna evaluates trials, each trial becomes a child run underneath it. The UI shows the hierarchy; you can collapse and expand it. It is beautiful.
Now I have to admit something. When I first used MLflow, I logged everything manually. Then I discovered mlflow.sklearn.autolog(). You call it once at the top of your script, and it automatically logs all parameters, metrics, and the model for any Scikit‑learn estimator. It is that simple.
mlflow.sklearn.autolog()
But be careful: autolog is great for quick experiments, but for production you often need fine control. For example, you might want to log a custom metric like average precision at a specific recall threshold. With autolog you get what it chooses. So I usually start with autolog for exploration, then switch to manual logging for the final runs.
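A small sketch of that hybrid approach, reusing the pipeline and metric imports from earlier: autolog handles the standard parameters, metrics, and the model, and I add the one custom metric it does not log for me.
mlflow.sklearn.autolog()

with mlflow.start_run(run_name="rf-autolog"):
    pipe.fit(X_train, y_train)  # params, training metrics, and the model are logged automatically
    proba = pipe.predict_proba(X_test)[:, 1]
    mlflow.log_metric("auprc_test", average_precision_score(y_test, proba))  # my extra metric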
Here is a question for you: have you ever had to re-run an experiment because you forgot to note which version of a library you used? When you log a model with mlflow.sklearn.log_model, MLflow captures the environment for you: it writes a conda.yaml and a requirements.txt next to the model artifacts. You can also log your own requirements file explicitly with mlflow.log_artifact("requirements.txt"). Another question: how do you handle data versioning? MLflow does not do that natively, but you can log a hash of your training data as a parameter. I do this:
import hashlib

# Hash the raw bytes of the training array and store it as a run parameter.
# (Run this inside the active run.)
data_str = X_train.tobytes()
hash_val = hashlib.md5(data_str).hexdigest()
mlflow.log_param("train_data_hash", hash_val)
Now you can trace a run back to the exact data version. Overkill? Maybe. But when your data changes and your model suddenly breaks, you will be grateful.
I want to leave you with this thought. A machine learning project without experiment tracking is like cooking without taking notes. You might create a masterpiece once, but you will never be able to repeat it. MLflow is your notebook. It costs nothing to set up. It takes minutes. And it saves you hours of frustration.
Start with a tiny project. Track three runs. Compare them. Then teach someone else. That is how the habit sticks. I guarantee you will never go back to the “notebook chaos” way of working.
If you found this useful, drop a like. Share it with a teammate who still thinks tracking is unnecessary. And comment: what is the biggest pain point in your current ML workflow? I read every reply and I might cover your question in a future guide.