How to Detect Model Drift in Production with Evidently AI and MLflow

Learn how to monitor model drift in production using Evidently AI and MLflow, set up alerts, and catch silent ML failures early.


I remember the first time I deployed a machine learning model to production. I was proud. The accuracy on my holdout test set was 94%. I slept well that night. Two months later, a colleague asked why the model was approving transactions that were clearly fraudulent. I checked the logs. The accuracy had silently slipped to 71%. No alerts. No warnings. Just quiet failure.

That experience taught me a hard lesson: deploying a model is not the finish line — it’s the starting gun. The real world changes. Data distributions shift. User behavior evolves. And your model, frozen in time, slowly becomes obsolete. This article is about building a safety net for your models. I will show you how to detect when your model is drifting, how to measure it, and how to act before it fails.


You might be wondering: what exactly is drift? Let me break it down. There are three main types.

Data drift happens when the input features change. Imagine you trained a fraud detector on transactions with an average amount of $50. After a new payment method launches, the average jumps to $200. The model has never seen that — it starts making mistakes.

Concept drift is trickier. This is when the relationship between the input features and the target changes. For example, fraudsters used to make small test transactions before a big one. Now they go straight for the large amount. The model learned the small-test pattern, so it misses the new fraud.

Prediction drift means the output distribution shifts. Maybe your model used to predict “fraud” for 2% of transactions. Now it predicts 8%. Something has changed, even if you don’t know what.

These three types can happen separately or together. The only way to catch them early is to monitor continuously.
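
To make the idea concrete before we bring in any tooling, here is a quick sketch of the prediction-drift case: a chi-squared test asking whether the model's output rate really shifted. The counts are made up to mirror the 2% versus 8% example above.

from scipy.stats import chi2_contingency

# Made-up counts out of 10,000 predictions per window:
# the model used to flag ~2% as fraud, now it flags ~8%
counts = [[200, 9800],   # reference window: fraud, not fraud
          [800, 9200]]   # current window: fraud, not fraud
chi2, p_value, _, _ = chi2_contingency(counts)
print(f"p-value: {p_value:.2e}")  # a tiny p-value means the output distribution shifted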


Let me show you the tool I rely on most: Evidently AI. It’s an open-source library that compares your current production data to your training data (the reference dataset) and tells you exactly how much each feature has drifted. It uses statistical tests like the Kolmogorov-Smirnov test for continuous features and the Chi-Squared test for categorical ones. It also calculates the Population Stability Index, a classic metric used in credit risk.
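Before handing the work to the library, it helps to see what one of those metrics actually computes. Here is a minimal hand-rolled Population Stability Index on synthetic data shaped like the fraud example; treat it as a sketch, since Evidently computes this for you with more care around binning.

import numpy as np

def psi(reference, current, bins=10):
    # Population Stability Index: bin by the reference distribution,
    # then compare bin proportions between the two samples
    edges = np.histogram_bin_edges(reference, bins=bins)
    ref_counts, _ = np.histogram(reference, bins=edges)
    # Clip current values into the reference range so out-of-range
    # values land in the edge bins instead of being dropped
    cur_counts, _ = np.histogram(np.clip(current, edges[0], edges[-1]), bins=edges)
    ref_pct = np.clip(ref_counts / ref_counts.sum(), 1e-6, None)
    cur_pct = np.clip(cur_counts / cur_counts.sum(), 1e-6, None)
    return float(np.sum((cur_pct - ref_pct) * np.log(cur_pct / ref_pct)))

rng = np.random.default_rng(42)
reference = rng.normal(50, 10, 10_000)   # training-time amounts, mean ~$50
current = rng.normal(200, 40, 10_000)    # post-launch amounts, mean ~$200
print(f"PSI: {psi(reference, current):.2f}")  # > 0.25 is commonly read as major drift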

Here is how I set up a simple drift detector in Python.

import pandas as pd
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

# Load your reference dataset (the data your model was trained on)
reference = pd.read_csv("data/reference_data.csv")

# Load a batch of recent production data
current = pd.read_csv("data/production_batch.csv")

# Create a drift report
drift_report = Report(metrics=[
    DataDriftPreset()
])

drift_report.run(reference_data=reference, current_data=current)

# Save the report as HTML so you can open it in a browser
drift_report.save_html("reports/drift_report.html")

# Get the drift results programmatically; with DataDriftPreset,
# the first metric is the dataset-level drift summary
summary = drift_report.as_dict()["metrics"][0]["result"]
print(f"Number of drifted features: {summary['number_of_drifted_columns']}")

That block gives you a visual report with graphs for every feature. You can see exactly which columns are drifting and by how much. But you don’t want to manually run this every day. So I built an automated pipeline.


The second piece of the puzzle is MLflow. I use it to track every version of my model and to log performance metrics over time. Think of MLflow as a diary for your models — it records what you trained, when, and how well it performed. By combining Evidently with MLflow, I can create a dashboard that shows me drift scores alongside model accuracy.

Here is how I register a model with MLflow after training:

import mlflow
import mlflow.sklearn
from sklearn.ensemble import GradientBoostingClassifier

with mlflow.start_run(run_name="fraud_detector_v2"):
    model = GradientBoostingClassifier(n_estimators=100, learning_rate=0.1)
    model.fit(X_train, y_train)
    
    # Log model
    mlflow.sklearn.log_model(model, "model")
    
    # Log parameters
    mlflow.log_param("n_estimators", 100)
    mlflow.log_param("learning_rate", 0.1)
    
    # Log metrics
    accuracy = model.score(X_test, y_test)
    mlflow.log_metric("accuracy", accuracy)

Now every time I retrain, MLflow keeps a record. I can compare old models to new ones. But I also want to log drift metrics into MLflow so I can correlate model performance with drift. I do that by writing a scheduled job.

import pandas as pd
import mlflow
from evidently.report import Report
from evidently.metric_preset import DataDriftPreset

def run_drift_check():
    reference = pd.read_csv("data/reference_data.csv")
    current = pd.read_csv("data/production_last_hour.csv")
    
    report = Report(metrics=[DataDriftPreset()])
    report.run(reference_data=reference, current_data=current)
    
    results = report.as_dict()["metrics"]
    # With DataDriftPreset, the first metric is the dataset-level summary
    # and the second is the per-column drift table. Counting drift via the
    # library's own flags avoids mixing up stattest semantics (for the KS
    # test, drift_score is a p-value, so drift means a *small* value).
    num_drifted = results[0]["result"]["number_of_drifted_columns"]
    drift_by_columns = results[1]["result"]["drift_by_columns"]
    
    # Log drift metrics into MLflow under a separate experiment
    with mlflow.start_run(run_name="drift_check_prod"):
        mlflow.log_metric("num_drifted_features", num_drifted)
        mlflow.log_metric(
            "mean_drift_score",
            sum(col["drift_score"] for col in drift_by_columns.values())
            / len(drift_by_columns),
        )

Now I have a timeline of drift in MLflow. I can set a threshold — if the number of drifted features exceeds 3, I trigger an alert. That alert could be an email, a Slack message, or even an automatic retraining job.
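
For the alert leg, a minimal sketch might look like the following. SLACK_WEBHOOK_URL is a hypothetical environment variable holding a Slack incoming-webhook URL; swap in email or a retraining trigger as you prefer.

import os
import requests

DRIFT_THRESHOLD = 3  # alert once more than 3 features drift

def alert_if_drifting(num_drifted: int) -> None:
    if num_drifted <= DRIFT_THRESHOLD:
        return
    # SLACK_WEBHOOK_URL is a hypothetical env var pointing at an incoming webhook
    webhook_url = os.environ["SLACK_WEBHOOK_URL"]
    requests.post(
        webhook_url,
        json={"text": f"Drift alert: {num_drifted} features drifted beyond threshold. "
                      "Check the latest Evidently report."},
        timeout=10,
    )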


Let me walk you through my full monitoring pipeline. I use FastAPI to serve my model, and every prediction gets logged to a file. At the end of each hour, a scheduler (I use APScheduler) picks up the batch of predictions and runs the drift check.

Here is a simplified version of the FastAPI endpoint, with prediction logging built in:

from fastapi import FastAPI
from pydantic import BaseModel
import pandas as pd
import json

app = FastAPI()

class Transaction(BaseModel):
    transaction_amount: float
    account_age_days: float
    num_transactions_24h: float
    merchant_risk_score: float
    distance_from_home_km: float
    hour_of_day: int
    is_international: int
    card_present: int
    device_trust_score: float
    velocity_score: float

@app.post("/predict")
async def predict(transaction: Transaction):
    # Convert to dataframe
    df = pd.DataFrame([transaction.dict()])
    
    # Predict (using a loaded model, omitted for brevity)
    prediction = model.predict(df)[0]
    probability = float(model.predict_proba(df)[0][1])  # cast to a plain float so json.dumps can serialize it
    
    # Log prediction to a queue or file for later batch analysis
    log_entry = transaction.dict()
    log_entry["prediction"] = int(prediction)
    log_entry["probability"] = probability
    with open("data/production_log.jsonl", "a") as f:
        f.write(json.dumps(log_entry) + "\n")
    
    return {"fraud": bool(prediction), "probability": probability}

The log file becomes the input for the drift check. Every hour, my scheduler reads the latest 1000 records from that file and compares them to the reference data. If the drift threshold is breached, I trigger a retraining pipeline and send myself a Slack notification.
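
Here is a sketch of that wiring. load_recent_batch is a hypothetical helper I'm naming for illustration; in practice you would call it inside run_drift_check in place of the CSV read shown earlier.

import json
import pandas as pd
from apscheduler.schedulers.background import BackgroundScheduler

def load_recent_batch(path="data/production_log.jsonl", n=1000):
    # Hypothetical helper: read the last n logged predictions into a dataframe
    with open(path) as f:
        records = [json.loads(line) for line in f.readlines()[-n:]]
    return pd.DataFrame(records)

scheduler = BackgroundScheduler()
# Run the drift check from earlier once an hour
scheduler.add_job(run_drift_check, "interval", hours=1)
scheduler.start()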


I remember one incident where the drift check caught something I never expected. The feature hour_of_day was drifting. It turned out my data pipeline had started logging timestamps in UTC instead of local time. The model, trained on local times, suddenly saw a spike in nighttime transactions. Without the drift monitor, I would have blamed the model. With it, I fixed the data bug in an hour.

You might be thinking: isn’t this overkill for a small project? Maybe. But once you scale to even a handful of models, you cannot afford to wait for users to report problems. Trust me, users don’t complain — they just leave.


Now, I want you to take action. Start small. Pick one model, set up Evidently, and run one drift report. See what you learn. Then schedule it. Then add an alert. Over time, you will build a system that watches your models while you sleep. No more silent failures.

If this article helped you see the value of monitoring, please like it, share it with your team, and leave a comment telling me about your own drift horror story. I read every one. And remember: the model you deploy today is not the same model six months from now. Keep watching it.

