How to Build a Production-Ready Recommendation System with Scikit-learn, Surprise, MLflow, and FastAPI

Learn to build a production-ready recommendation system with Scikit-learn, Surprise, MLflow, and FastAPI. Track, evaluate, deploy smarter.

I’ve spent the last three years helping teams build recommendation systems. Over and over I saw the same pattern: a brilliant data scientist would train a matrix factorization model in a Jupyter notebook, get a decent RMSE, then struggle to turn that notebook into something that actually serves recommendations to real users. The model would break in production because it couldn’t handle new users or new items. The evaluation metrics would look great in the lab but fail to capture catalog coverage or serendipity. Worst of all, nobody could reproduce the winning experiment because the parameters were scribbled on a sticky note.

That frustration is why I’m writing this. Not to give you another toy example, but to show you how to build a recommendation pipeline that can survive a code review, run reliably on a server, and adapt to real-world data. We’ll use Scikit-learn for the content side, Surprise for collaborative filtering, and MLflow to track every experiment so you never lose the recipe for your best model. By the end, you’ll have a versioned, deployable artifact and the confidence to explain your choices to a product manager.

Let’s begin. First, set up your environment. I assume you have Python 3.10+ and a virtual environment. Run these commands:

python -m venv rec-env
source rec-env/bin/activate
pip install scikit-learn scikit-surprise pandas numpy scipy mlflow fastapi uvicorn pydantic

That’s it. You don’t need GPU power for this – the MovieLens 100K dataset fits comfortably on a laptop.

Now, why do we even need three different recommendation paradigms? Collaborative filtering looks only at user–item interactions. If you’ve ever seen “users who bought this also bought,” that’s collaborative filtering. Content-based filtering looks at item attributes. If Netflix recommends a movie because it shares the same director or genre as one you liked, that’s content-based. Hybrid combines both. Which one wins depends on your data. Have you ever wondered why a streaming service suggests a terrible movie you’d never watch? That’s often a pure collaborative system running out of data for a new user. Content-based can help fill that gap.

Let’s get our hands dirty with data. I’ll use the MovieLens 100K dataset because it’s small, clean, and widely used. Download it directly:

import pandas as pd
ratings = pd.read_csv(
    "https://files.grouplens.org/datasets/movielens/ml-100k/u.data",
    sep="\t",
    names=["user_id", "item_id", "rating", "timestamp"],
)
movie_cols = ["item_id", "title", "release_date", "video_release_date", "imdb_url",
              "unknown", "action", "adventure", "animation", "children", "comedy",
              "crime", "documentary", "drama", "fantasy", "film_noir", "horror",
              "musical", "mystery", "romance", "sci_fi", "thriller", "war", "western"]
movies = pd.read_csv(
    "https://files.grouplens.org/datasets/movielens/ml-100k/u.item",
    sep="|", encoding="latin-1", names=movie_cols,
)
movies = movies.drop(columns=["video_release_date", "imdb_url"])  # not needed here
print(ratings.shape, movies.shape)

Here’s a trap: never split time-series interaction data randomly. If you shuffle the rows before splitting, you’ll train on future ratings and predict past ones – a classic data leakage. Always split by time or by user. I’ll use a temporal split:

ratings["timestamp"] = pd.to_datetime(ratings["timestamp"], unit="s")
ratings = ratings.sort_values("timestamp")
split_point = int(len(ratings) * 0.80)
train_df = ratings.iloc[:split_point]
test_df  = ratings.iloc[split_point:]

Now we have 80,000 training interactions and 20,000 for testing. Let’s build a collaborative filtering model using Surprise. The library expects data in a specific format. We’ll use a Reader to define the rating scale:

from surprise import Dataset, Reader, SVD
reader = Reader(rating_scale=(1,5))
train_data = Dataset.load_from_df(train_df[["user_id","item_id","rating"]], reader)
trainset = train_data.build_full_trainset()

SVD is the workhorse of collaborative filtering. Despite the name, Surprise's SVD is not a literal singular value decomposition; it's Funk-style matrix factorization with biases. It learns latent factor vectors for users and items, and each predicted rating is r̂(u,i) = μ + b_u + b_i + q_i · p_u, where μ is the global mean, b_u and b_i are the user and item biases, and p_u and q_i are the learned factors. Here’s how to train it:

model = SVD(n_factors=100, n_epochs=30, lr_all=0.005, reg_all=0.02, random_state=42)
model.fit(trainset)

To test, we need a list of tuples: (user_id, item_id, true_rating). Build it from test_df:

from surprise import accuracy

testset = list(zip(test_df["user_id"], test_df["item_id"], test_df["rating"]))
predictions = model.test(testset)
rmse = accuracy.rmse(predictions)

You should get an RMSE somewhere around 0.93–1.0; a temporal split is harder than a random one, so don’t worry if you land toward the upper end. That’s okay, but can we do better? Try tuning hyperparameters with GridSearchCV:

from surprise.model_selection import GridSearchCV

param_grid = {"n_factors": [50, 100], "n_epochs": [20, 30], "lr_all": [0.002, 0.005], "reg_all": [0.02, 0.1]}
gs = GridSearchCV(SVD, param_grid, measures=["rmse", "mae"], cv=3, n_jobs=-1)
gs.fit(train_data)
print(gs.best_params["rmse"])
best_model = gs.best_estimator["rmse"]

Now let’s switch to content-based filtering. The idea is simple: represent each movie by its genre vector and compute similarity. We’ll build a binary matrix where each column is a genre (action, comedy, etc.) and rows are movies. Then for a user, we find the items they liked, average their genre vectors, and recommend items with the highest cosine similarity.

First, prepare genre data:

genre_cols = ["unknown","action","adventure","animation","children","comedy","crime",
              "documentary","drama","fantasy","film_noir","horror","musical","mystery",
              "romance","sci_fi","thriller","war","western"]
item_profiles = movies[["item_id"] + genre_cols].set_index("item_id")
from sklearn.metrics.pairwise import cosine_similarity
similarity_matrix = cosine_similarity(item_profiles)

Now define a function that, given a user ID, computes their average genre profile from training ratings and returns the top-N most similar items they haven’t rated. (The item–item similarity_matrix above is handy for “more like this” lookups; for per-user recommendations we compare the user’s profile to each item’s genre vector directly.)

import numpy as np

def get_user_profile(user_id, train_df, item_profiles):
    user_ratings = train_df[train_df["user_id"] == user_id]
    if user_ratings.empty:
        return None
    # average the genre vectors of everything the user has rated
    profile = user_ratings.merge(item_profiles.reset_index(), on="item_id")[genre_cols].mean()
    return profile

def recommend_content_based(user_id, train_df, item_profiles, top_n=10):
    user_profile = get_user_profile(user_id, train_df, item_profiles)
    if user_profile is None:
        return []  # new user - no history to build a profile from
    # similarity between the user's average genre profile and every item's genre vector
    scores = cosine_similarity(user_profile.values.reshape(1, -1), item_profiles.values).ravel()
    # mask out items the user has already rated (item ids are not row positions, so go through the index)
    rated_items = set(train_df[train_df["user_id"] == user_id]["item_id"])
    mask = ~item_profiles.index.isin(rated_items)
    top_indices = np.argsort(scores[mask])[-top_n:][::-1]
    return item_profiles.index[mask][top_indices].tolist()

This approach is transparent: you can explain why a movie was recommended by pointing to the genres the user liked before. But it struggles with new users (no history) and item diversity.
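Before moving on, here’s what that transparency can look like in code. This is a small hypothetical helper, built on get_user_profile and item_profiles from above and assuming the user has some training history, that puts the user’s heaviest genres next to a recommended movie’s genres:

def explain_recommendation(user_id, item_id, train_df, item_profiles, top_genres=3):
    # genres the user's training history weighs most heavily
    profile = get_user_profile(user_id, train_df, item_profiles)
    liked_genres = profile.sort_values(ascending=False).head(top_genres).index.tolist()
    # genres flagged on the recommended movie
    item_genres = item_profiles.loc[item_id]
    movie_genres = item_genres[item_genres == 1].index.tolist()
    return {"user_top_genres": liked_genres, "movie_genres": movie_genres}

Something this simple is often enough to answer a product manager’s “why did we show this?” question.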

Now let’s build a hybrid. The simplest hybrid is a weighted blend: take the collaborative filtering prediction and the content-based similarity score and combine them with a weight alpha. For a given user–item pair, the final score is alpha * cf_score + (1 - alpha) * cb_score. Strictly speaking you should tune alpha on a validation slice carved out of the training period; to keep this walkthrough short I’ll tune it on the test split, but be aware that in a real project that leaks information.

First, generate CF predictions for every user–item pair in the test set using the best SVD model. Then, for each pair, compute the content-based similarity between the user’s genre profile and the item’s genre vector. Normalize both scores to [0, 1] so the blend is meaningful, and finally search for the alpha that minimizes RMSE.
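Here’s a minimal sketch of that preparation step, under a few assumptions: best_model is the grid-search winner from earlier (which still needs to be fit), get_user_profile and item_profiles come from the content-based section, and pairs whose user has no training history fall back to a neutral content score of 0.5:

import numpy as np
from sklearn.metrics.pairwise import cosine_similarity
from sklearn.preprocessing import MinMaxScaler

best_model.fit(trainset)  # the grid-search winner is not fitted yet

true_ratings = test_df["rating"].values

# collaborative filtering score: SVD's estimated rating for each test pair
cf_raw = np.array([best_model.predict(u, i).est
                   for u, i in zip(test_df["user_id"], test_df["item_id"])])

# content-based score: cosine similarity between the user's genre profile and the item's genres
profiles = {u: get_user_profile(u, train_df, item_profiles) for u in test_df["user_id"].unique()}
cb_raw = []
for u, i in zip(test_df["user_id"], test_df["item_id"]):
    prof = profiles[u]
    if prof is None or i not in item_profiles.index:
        cb_raw.append(0.5)  # neutral fallback for cold-start users or unseen items
    else:
        cb_raw.append(cosine_similarity(prof.values.reshape(1, -1),
                                        item_profiles.loc[[i]].values)[0, 0])
cb_raw = np.array(cb_raw)

# squash both score vectors into [0, 1] so the alpha blend compares like with like
cf_preds = MinMaxScaler().fit_transform(cf_raw.reshape(-1, 1)).ravel()
cb_preds = MinMaxScaler().fit_transform(cb_raw.reshape(-1, 1)).ravel()

With cf_preds, cb_preds, and true_ratings in hand, the alpha search looks like this: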

from sklearn.metrics import mean_squared_error

alpha_range = np.linspace(0, 1, 21)
best_alpha, best_rmse = None, float("inf")
for alpha in alpha_range:
    hybrid_preds = alpha * cf_preds + (1 - alpha) * cb_preds
    # map the blended [0, 1] score back onto the 1-5 rating scale before scoring
    rmse = np.sqrt(mean_squared_error(true_ratings, 1 + 4 * hybrid_preds))
    if rmse < best_rmse:
        best_rmse = rmse
        best_alpha = alpha
print(f"Best alpha: {best_alpha}, RMSE: {best_rmse:.3f}")

Hybrid often beats either pure method on this dataset, especially for users with sparse interactions. Why? Because when a user has rated only three movies, collaborative filtering has almost nothing to go on, but content-based can still leverage the genre similarities of those three.

Now, how do you know which model is truly better? You need proper evaluation metrics. RMSE is good for rating prediction, but in recommendation we often care about ranking. Use Precision@K, Recall@K, and NDCG@K. These measure how many of the top-K recommended items are actually relevant (rating >= 4). Let’s implement a quick evaluation function:

import collections

def precision_recall_ndcg_at_k(model, testset, k=10):
    # group the test ratings by user
    user_ratings = collections.defaultdict(list)
    for uid, iid, true_r in testset:
        user_ratings[uid].append((iid, true_r))
    precisions, recalls, ndcgs = [], [], []
    for uid in user_ratings:
        actual = [iid for iid, r in user_ratings[uid] if r >= 4]
        if len(actual) == 0:
            continue
        # rank only the items this user actually rated in the test window
        predictions = model.test([(uid, iid, 0) for iid, _ in user_ratings[uid]])
        predicted_scores = sorted(((p.iid, p.est) for p in predictions),
                                  key=lambda x: x[1], reverse=True)
        recommended = [iid for iid, _ in predicted_scores[:k]]
        hits = set(recommended) & set(actual)
        precision = len(hits) / k
        recall = len(hits) / len(actual)
        # simplified NDCG with binary relevance (rating >= 4)
        dcg = sum(1 / np.log2(i + 2) if rec in actual else 0 for i, rec in enumerate(recommended))
        idcg = sum(1 / np.log2(i + 2) for i in range(min(k, len(actual))))
        ndcg = dcg / idcg if idcg > 0 else 0
        precisions.append(precision)
        recalls.append(recall)
        ndcgs.append(ndcg)
    return np.mean(precisions), np.mean(recalls), np.mean(ndcgs)

Run this on your tuned SVD, and adapt it for the content-based recommender by swapping the model.test call for the cosine-similarity scores. Do you see that content-based can have lower recall but higher precision for users with only a few ratings? That’s a trade-off to explain to your stakeholders.

Now, the most critical part: experiment tracking. Without it, you’ll forget which hyperparameters gave that 0.87 RMSE. Use MLflow to log everything: model parameters, metrics, and even the model artifact. Here’s how to wrap your SVD training:

import mlflow

mlflow.set_experiment("MovieLens Recommendation")
with mlflow.start_run(run_name="SVD_tuned"):
    # log the winning hyperparameters from the grid search
    mlflow.log_params(gs.best_params["rmse"])
    # retrain the winner on the full training set, then evaluate
    best_model.fit(trainset)
    predictions = best_model.test(testset)
    rmse = accuracy.rmse(predictions)
    prec, rec, ndcg = precision_recall_ndcg_at_k(best_model, testset, k=10)
    # MLflow metric names can't contain "@", so spell out "_at_"
    mlflow.log_metrics({"rmse": rmse, "precision_at_10": prec,
                        "recall_at_10": rec, "ndcg_at_10": ndcg})
    # Surprise models aren't sklearn estimators, but mlflow.sklearn will pickle them;
    # a custom mlflow.pyfunc wrapper is the more robust route for production
    mlflow.sklearn.log_model(best_model, "svd_model")

Later, you can run mlflow ui in the terminal and see a table of all your runs. Click any run to recover the exact parameters; if you also log the split point as a parameter, you can reproduce the dataset split as well.
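And when you want the winning model back in code rather than in the UI, something along these lines works, assuming the experiment and artifact names used above:

import mlflow
import mlflow.sklearn

# find the lowest-RMSE run in the experiment (returns a pandas DataFrame of runs)
runs = mlflow.search_runs(experiment_names=["MovieLens Recommendation"],
                          order_by=["metrics.rmse ASC"])
best_run_id = runs.iloc[0]["run_id"]
reloaded_model = mlflow.sklearn.load_model(f"runs:/{best_run_id}/svd_model")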

Finally, let’s serve the best model as a FastAPI endpoint. Assume you’ve saved the model and the item profiles. Create a file serve.py:

from fastapi import FastAPI, HTTPException
from pydantic import BaseModel
import mlflow.sklearn
import pandas as pd
import numpy as np

app = FastAPI()
model = mlflow.sklearn.load_model("file:///path/to/svd_model")
# Also load the movies catalog, training interactions, and item_profiles saved earlier (e.g., as parquet files)

class RecommendRequest(BaseModel):
    user_id: int
    top_n: int = 10

@app.post("/recommend")
def recommend(req: RecommendRequest):
    if req.top_n < 1:
        raise HTTPException(status_code=422, detail="top_n must be at least 1")
    # candidate generation: every item the user hasn't rated yet
    all_items = movies["item_id"].unique()
    rated_items = set(train_df[train_df["user_id"] == req.user_id]["item_id"])
    candidates = [iid for iid in all_items if iid not in rated_items]
    # score each candidate with the SVD model (unknown users fall back to the global mean)
    predictions = [model.predict(req.user_id, iid).est for iid in candidates]
    top_indices = np.argsort(predictions)[-req.top_n:][::-1]
    recommended_ids = [int(candidates[i]) for i in top_indices]
    return {"user_id": req.user_id, "recommendations": recommended_ids}

Run with uvicorn serve:app --reload. Now you have a real API.
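To sanity-check the endpoint, hit it with a quick client call. This is just a sketch: it assumes the server is on the default port 8000, that the user_id exists in your training data, and that you’ve installed requests (it wasn’t in the earlier pip install line):

import requests

resp = requests.post(
    "http://127.0.0.1:8000/recommend",
    json={"user_id": 196, "top_n": 5},  # swap in any user_id from train_df
)
print(resp.status_code, resp.json())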

I’ve seen teams spend weeks on the model and then rush the deployment, only to find the endpoint crashes under load. Start with a simple FastAPI server, then add caching with Redis, then manage versions and promotion with MLflow’s Model Registry. This skeleton gives you a solid foundation.

What do you do next? You can extend this to implicit feedback (clicks, watches) using the implicit library. You can add a feedback loop to update models incrementally. You can use MLflow’s Model Registry to promote models to staging and production.
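As a rough sketch of that last step: the registry name movielens-svd is a placeholder, best_run_id is the run ID from the MLflow lookup earlier, you need a database-backed tracking store (the plain file store doesn’t support the registry), and newer MLflow versions favor model aliases over stages:

import mlflow
from mlflow.tracking import MlflowClient

# register the model artifact from the best run under a (hypothetical) registry name
result = mlflow.register_model(f"runs:/{best_run_id}/svd_model", "movielens-svd")

# promote that version to Staging; newer MLflow prefers aliases, but this still works
client = MlflowClient()
client.transition_model_version_stage(name="movielens-svd",
                                      version=result.version, stage="Staging")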

Here’s my final nudge: if this guide helped you move from a notebook to a deployable system, please like it, share it with your team, and leave a comment with the biggest challenge you faced in production recommendations. I read every comment and I’ll answer with the exact code fix. Let’s stop treating recommendation systems as magic and start building them like proper software. Now go train a model that actually ships.



