MLflow with Scikit-learn: Track Experiments, Version Models, Deploy Confidently
Learn MLflow with Scikit-learn to track experiments, version models, and deploy reliably. Build a reproducible classification pipeline today.
I’ve been deep in a machine learning project where every small change to my pipeline seemed to create a new, undocumented version of my model. I’d train something, tweak a hyperparameter, run it again, and within a week I had ten different models saved as final_model_v2_final_actually.pkl and no memory of which one used which settings. Sound familiar? That’s exactly why I started using MLflow – and why I’m writing this guide today. If you’ve ever wasted hours trying to reproduce a result or struggled to roll back to a previous model version, you’re in the right place. I’ll walk you through a complete MLflow setup using a real classification pipeline with Scikit-learn, so you can track experiments, version your models, and deploy with confidence.
Let’s start with the core idea: MLflow is a platform that covers the entire machine learning lifecycle. It’s not a new framework or a replacement for your favorite tools. Instead, it sits on top of them and records everything that matters – parameters, metrics, code versions, artifacts – in a structured way. Forget messy notebooks and scattered CSV files. With MLflow, every run is logged against a named experiment, and you can compare results visually in a web UI. More importantly, it gives you a central registry where models get versioned and staged from development to production. You can ask yourself: How often have I needed to revert to last week’s model but couldn’t find it? MLflow solves this by treating your models as first-class, versioned artifacts.
I’ll use the Adult Income dataset because it’s a classic binary classification problem with mixed data types – exactly the kind of scenario where tracking every preprocessing step becomes critical. But before diving into the code, I need to set up MLflow itself. You can install it with pip: pip install mlflow scikit-learn pandas numpy matplotlib seaborn. Then, to get the full model registry functionality, I recommend starting a local tracking server backed by SQLite instead of the default file store. Run this in a terminal:
mlflow server \
--backend-store-uri sqlite:///mlflow.db \
--default-artifact-root ./mlruns \
--host 0.0.0.0 \
--port 5000
Open http://localhost:5000 in your browser – there’s your UI, empty for now, but ready to capture every detail of your experiments. In your Python code, set the tracking URI: mlflow.set_tracking_uri("http://localhost:5000"). Without this, MLflow defaults to a local ./mlruns directory, which works but doesn’t support the model registry unless you switch to a database backend. Why does the registry need a database? Because it stores model metadata and version transitions as relational data – file stores simply can’t handle that reliably.
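In Python that's a single call; I usually add a quick sanity check that I'm pointed at the server and not a file path – a minimal sketch:
import mlflow

# Point this process at the tracking server started above
mlflow.set_tracking_uri("http://localhost:5000")
print(mlflow.get_tracking_uri())  # sanity check: should print the server URL, not ./mlruns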
Now let’s prepare the data. Load the Adult Income CSV, drop missing values, and split into train/test sets. I always use stratification on the target to maintain class balance. Here’s the minimal snippet:
import pandas as pd
from sklearn.model_selection import train_test_split
url = "https://archive.ics.uci.edu/ml/machine-learning-databases/adult/adult.data"
columns = [
    "age", "workclass", "fnlwgt", "education", "education_num",
    "marital_status", "occupation", "relationship", "race", "sex",
    "capital_gain", "capital_loss", "hours_per_week", "native_country", "income"
]
df = pd.read_csv(url, names=columns, na_values="?", skipinitialspace=True)
df.dropna(inplace=True)
df["income"] = (df["income"].str.strip() == ">50K").astype(int)
X = df.drop("income", axis=1)
y = df["income"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)
Notice that I didn’t scale or encode anything yet – that belongs in a pipeline. I define numeric and categorical features separately and build a ColumnTransformer to handle both. For classification I’ll start with a random forest because it’s robust and doesn’t require extensive tuning for a baseline. The full pipeline looks like this:
from sklearn.pipeline import Pipeline
from sklearn.compose import ColumnTransformer
from sklearn.preprocessing import StandardScaler, OneHotEncoder
from sklearn.impute import SimpleImputer
from sklearn.ensemble import RandomForestClassifier
numeric_features = ["age", "fnlwgt", "education_num", "capital_gain",
                    "capital_loss", "hours_per_week"]
categorical_features = ["workclass", "education", "marital_status",
                        "occupation", "relationship", "race", "sex",
                        "native_country"]
numeric_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="median")),
    ("scaler", StandardScaler())
])
categorical_transformer = Pipeline(steps=[
    ("imputer", SimpleImputer(strategy="most_frequent")),
    ("encoder", OneHotEncoder(handle_unknown="ignore", sparse_output=False))
])
preprocessor = ColumnTransformer(transformers=[
    ("num", numeric_transformer, numeric_features),
    ("cat", categorical_transformer, categorical_features)
])
pipeline = Pipeline(steps=[
    ("preprocessor", preprocessor),
    ("classifier", RandomForestClassifier(n_estimators=100, random_state=42))
])
Now comes the part that changed my workflow forever: logging everything inside an MLflow run. I create an experiment first: mlflow.set_experiment("adult_income_classification"). Then I wrap my training code in a with mlflow.start_run(run_name="rf_baseline") as run: block. Inside that context, I log parameters, metrics, and – this is key – the entire pipeline as a model artifact. I also log a confusion matrix plot so I can visually compare runs later. Let me show you the complete training function I use:
import mlflow
from sklearn.metrics import accuracy_score, f1_score, roc_auc_score, confusion_matrix
import matplotlib.pyplot as plt
import seaborn as sns
import tempfile
import os
def train_and_log(pipeline, X_train, y_train, X_test, y_test,
                  run_name="rf_baseline", tags=None):
    with mlflow.start_run(run_name=run_name) as run:
        # Tag the run for easy filtering
        mlflow.set_tags({
            "model_type": "RandomForest",
            "dataset": "adult_income",
            "engineer": "me",
            **(tags or {})
        })
        clf = pipeline.named_steps["classifier"]
        # Log hyperparameters manually
        mlflow.log_params({
            "n_estimators": clf.n_estimators,
            "max_depth": clf.max_depth,
            "min_samples_split": clf.min_samples_split,
            "min_samples_leaf": clf.min_samples_leaf
        })
        pipeline.fit(X_train, y_train)
        y_pred = pipeline.predict(X_test)
        # ROC AUC needs probability scores, not hard labels
        y_proba = pipeline.predict_proba(X_test)[:, 1]
        accuracy = accuracy_score(y_test, y_pred)
        f1 = f1_score(y_test, y_pred)
        roc_auc = roc_auc_score(y_test, y_proba)
        mlflow.log_metrics({
            "accuracy": accuracy,
            "f1_score": f1,
            "roc_auc": roc_auc
        })
        # Log confusion matrix plot as an artifact
        cm = confusion_matrix(y_test, y_pred)
        plt.figure(figsize=(6, 5))
        sns.heatmap(cm, annot=True, fmt="d", cmap="Blues")
        plt.xlabel("Predicted")
        plt.ylabel("Actual")
        plt.title("Confusion Matrix")
        with tempfile.NamedTemporaryFile(suffix=".png", delete=False) as tmp:
            plt.savefig(tmp.name, dpi=100, bbox_inches="tight")
        mlflow.log_artifact(tmp.name, "plots")
        os.unlink(tmp.name)  # clean up the temp file
        plt.close()
        # Log the entire pipeline as a model and register it
        mlflow.sklearn.log_model(pipeline, "model",
                                 registered_model_name="AdultIncomeClassifier")
        return run.info.run_id
Calling train_and_log(pipeline, X_train, y_train, X_test, y_test) will execute everything. Notice the line mlflow.sklearn.log_model(pipeline, "model", registered_model_name="AdultIncomeClassifier"). This does two things at once: it saves the serialized pipeline locally under the run’s artifact folder, and it registers it in the MLflow Model Registry under the name AdultIncomeClassifier. The first time you log a model with that name, MLflow creates a new entry with version 1. Every subsequent run with the same registered model name will produce version 2, 3, etc.
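If you want to see those versions from code rather than the UI, the registry client can list them – a quick sketch using MlflowClient:
from mlflow.tracking import MlflowClient

client = MlflowClient()
# Each entry is one registered version of the pipeline, with its stage and source run
for mv in client.search_model_versions("name='AdultIncomeClassifier'"):
    print(mv.version, mv.current_stage, mv.run_id)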
After you run this, open the MLflow UI again. You’ll see a new experiment with one run. Click on it to see the logged parameters, metrics, and artifacts – including the confusion matrix plot. The Models tab will show your registered model with version 1 in the “None” stage – that’s where you can manually transition it. In my own projects, I often run multiple experiments with different hyperparameters, then compare their metrics in the UI to pick the best one. Which version should I push to production? The UI helps you decide.
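To generate those comparison runs, I just clone the pipeline, swap the hyperparameter, and call the same training function – a sketch of a small sweep (the run names are arbitrary):
from sklearn.base import clone

for n in [100, 300, 500]:
    candidate = clone(pipeline)
    candidate.set_params(classifier__n_estimators=n)
    train_and_log(candidate, X_train, y_train, X_test, y_test,
                  run_name=f"rf_{n}_trees", tags={"sweep": "n_estimators"})
Each call produces a new run under the same experiment and a new version of AdultIncomeClassifier, so the UI comparison comes for free.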
Once you have a champion model, you can transition it to “Staging” and then to “Production” using the MLflow UI or via the API. For script-based deployments, I use the mlflow.<flavor>.load_model API. For example, to load the latest production version of the AdultIncomeClassifier:
model_uri = "models:/AdultIncomeClassifier/Production"
model = mlflow.sklearn.load_model(model_uri)
predictions = model.predict(X_test)
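If you’d rather script the stage transition itself instead of clicking through the UI, the registry client exposes it directly – a sketch, assuming version 1 is your champion (note that recent MLflow releases are moving from stages toward model aliases):
from mlflow.tracking import MlflowClient

client = MlflowClient()
client.transition_model_version_stage(
    name="AdultIncomeClassifier",
    version=1,  # swap in whichever version won your comparison
    stage="Production",
    archive_existing_versions=True  # retire whatever held the stage before
)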
You can also serve the model as a REST API using MLflow’s built-in scoring server: mlflow models serve -m "models:/AdultIncomeClassifier/Production" -p 1234 (set the MLFLOW_TRACKING_URI environment variable to http://localhost:5000 first so the models:/ URI resolves against your tracking server). This is incredibly handy for rapid prototyping or internal demos. For production, you’d more likely package the model into a Docker image with mlflow models build-docker.
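Once the scoring server is up, any HTTP client can hit its /invocations endpoint. Here’s a sketch using requests, assuming the MLflow 2.x dataframe_split payload format:
import json
import requests

# Send five test rows to the local scoring server started above
payload = {"dataframe_split": json.loads(X_test.head(5).to_json(orient="split"))}
resp = requests.post(
    "http://127.0.0.1:1234/invocations",
    headers={"Content-Type": "application/json"},
    data=json.dumps(payload)
)
print(resp.json())  # predictions for the five sample rows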
Throughout this process, I’ve found that the biggest wins come from discipline: always log a run, always tag it meaningfully, and never skip logging the full model artifact. The question I ask myself now is: can I reproduce this exact model six months from now? With MLflow, the answer is yes, because the preprocessing parameters, the code environment, and even the dataset (if you log it as an artifact) are all captured.
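One habit that makes that answer stick: attach the training data itself to the run. A sketch, assuming you kept the run_id returned by train_and_log (the snapshot filename is illustrative):
import hashlib

snapshot = "train_snapshot.csv"  # illustrative local path
X_train.assign(income=y_train).to_csv(snapshot, index=False)
with mlflow.start_run(run_id=run_id):  # reopen the finished run by id
    mlflow.log_artifact(snapshot, artifact_path="data")
    with open(snapshot, "rb") as f:
        mlflow.set_tag("train_data_md5", hashlib.md5(f.read()).hexdigest())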
To wrap up, I want you to try this yourself. Start with a simple pipeline, run two experiments with different n_estimators, and see how the UI highlights the differences. Then register the better model, transition it to staging, and serve it locally. You’ll quickly see why this tool has become the standard for ML teams. If this guide helped you understand MLflow a little better, I’d appreciate it if you share it with a colleague or leave a comment about your own tracking struggles. And if you have questions or tips, drop them below – I read every one.