How to Track Machine Learning Experiments with MLflow and Scikit-learn

Learn how to track machine learning experiments with MLflow and Scikit-learn to log parameters, metrics, and models for reproducible workflows.


I spent last week trying to figure out which version of a model we deployed to production. It was a mess of notebooks, spreadsheets, and vague commit messages. That’s why I’m writing this. If you’ve ever lost track of a model’s performance or configuration, you’re not alone. This is about fixing that problem for good. Let’s get started.

Think about the last time you trained a model. You changed a few parameters, ran the code, and got a score. A week later, you needed to know exactly what you did to get that score. Could you find it? For most teams, the answer is a painful “no.” Our work becomes a collection of forgotten experiments. What if you could track every detail automatically?

The solution is experiment tracking. It’s the practice of recording every aspect of your machine learning runs—the data, code, parameters, and results. It turns chaos into order. Why is this suddenly so critical? Because modern machine learning is built on iteration. We try, we fail, we adjust. Without a system, we’re just guessing.

I want to show you how to build this system. We’ll use two main tools: MLflow for tracking and Scikit-learn for modeling. MLflow is like a flight recorder for your machine learning projects. It logs everything. Scikit-learn is our reliable engine. Together, they create a workflow you can trust.

First, we set up our environment. You need MLflow and Scikit-learn installed, plus an MLflow tracking server running. Installing the libraries takes one command, and starting a local server takes one more. The server is a small web application on your machine that stores your experiment history.
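If you don't have the libraries yet, one command installs both:

pip install mlflow scikit-learn

With that done, start the tracking server: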

mlflow server --host 0.0.0.0 --port 5000

Now, let’s connect to it from our Python code. This is how we tell our scripts where to send the tracking data.

import mlflow
mlflow.set_tracking_uri("http://localhost:5000")
mlflow.set_experiment("My_First_Tracked_Experiment")

With the setup done, we can focus on the data. For this guide, let’s use a simple dataset. We’ll predict credit risk. It’s a common problem with clear outcomes. How do we prepare this data in a repeatable way? We build a pipeline. A pipeline ensures every preprocessing step is named, documented, and reusable.

from sklearn.pipeline import Pipeline
from sklearn.impute import SimpleImputer
from sklearn.preprocessing import StandardScaler

preprocessing_steps = [
    ('imputer', SimpleImputer(strategy='median')),
    ('scaler', StandardScaler())
]
preprocessor = Pipeline(steps=preprocessing_steps)

Notice how each step has a name. This clarity is vital for tracking. When MLflow logs this pipeline, it knows exactly what happened. Now, here’s a question for you: What’s more important, the algorithm you choose or the data you feed it? Most argue it’s the data. But without tracking, you can’t prove what data version led to your best model.
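The training snippets below assume X_train, X_test, y_train, and y_test already exist. Here is a minimal sketch that stands in for a real credit-risk table, using a synthetic dataset from make_classification purely for illustration and running it through the preprocessor we just built; in a real project you would load your own data here and note its source and version.

from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split

# Synthetic stand-in for a real credit-risk dataset (illustration only)
X, y = make_classification(n_samples=1000, n_features=10, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42, stratify=y
)

# Fit the preprocessing pipeline on the training split only,
# then apply the same transformation to the test split
X_train = preprocessor.fit_transform(X_train)
X_test = preprocessor.transform(X_test)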

Now we train a model. But we won’t just call .fit(). We’ll wrap it in an MLflow run. This is the magic. The mlflow.start_run() context manager opens a run under our experiment and records everything we log inside it.

import mlflow.sklearn
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score

with mlflow.start_run(run_name="Random_Forest_Baseline"):
    model = RandomForestClassifier(n_estimators=100, random_state=42)
    
    # Log the parameter
    mlflow.log_param("n_estimators", 100)
    
    model.fit(X_train, y_train)
    predictions = model.predict(X_test)
    
    # Calculate and log a metric
    accuracy = accuracy_score(y_test, predictions)
    mlflow.log_metric("accuracy", accuracy)
    
    # Log the model itself
    mlflow.sklearn.log_model(model, "model")

Look at that. With just a few extra lines, we logged the key parameter, the resulting accuracy, and saved the model artifact. All of this is now stored and searchable. Can you see how this changes everything? You can now compare this run to another.

Let’s try a different algorithm. What if a gradient boosting model works better? We create another run.

from sklearn.ensemble import GradientBoostingClassifier

with mlflow.start_run(run_name="Gradient_Boosting_Test"):
    model_gb = GradientBoostingClassifier(n_estimators=150, learning_rate=0.1, random_state=42)
    mlflow.log_params({"n_estimators": 150, "learning_rate": 0.1})
    
    model_gb.fit(X_train, y_train)
    predictions_gb = model_gb.predict(X_test)
    accuracy_gb = accuracy_score(y_test, predictions_gb)
    mlflow.log_metric("accuracy", accuracy_gb)
    
    mlflow.sklearn.log_model(model_gb, "model")

Both runs are now in your MLflow UI. Open your browser and go to http://localhost:5000. You’ll see a table comparing the accuracy, parameters, and even the time each run took. Which model is truly better? The evidence is right there. No more digging through notebooks.
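If you prefer to compare runs from code instead of the browser, recent MLflow versions expose the same table as a pandas DataFrame. A small sketch, assuming the experiment name we set earlier:

# Pull every run from the experiment, best accuracy first
runs = mlflow.search_runs(
    experiment_names=["My_First_Tracked_Experiment"],
    order_by=["metrics.accuracy DESC"],
)
print(runs[["run_id", "params.n_estimators", "metrics.accuracy"]])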

But what about the model you want to use next week or next month? This is where the MLflow Model Registry comes in. It’s a place to formally store, version, and stage models. Think of it as a library for your best work. You can promote a model from “Staging” to “Production” with a click.

# After a run, register the best model
# (the run ID is shown in the MLflow UI and returned by mlflow.search_runs())
run_id = "your-best-run-id-here"
model_uri = f"runs:/{run_id}/model"
mlflow.register_model(model_uri, "CreditRiskModel")
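The promotion itself can also be scripted rather than clicked. A sketch using the stage-based API (assuming the newly registered model landed as version 1; newer MLflow releases are shifting toward model aliases, so check your version’s docs):

from mlflow.tracking import MlflowClient

client = MlflowClient()
# Move version 1 of the registered model into the Production stage
client.transition_model_version_stage(
    name="CreditRiskModel", version="1", stage="Production"
)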

Once registered, you can load the model from anywhere, ensuring you always use the right version.

# new_data must be preprocessed the same way as the training features
model_production = mlflow.pyfunc.load_model("models:/CreditRiskModel/Production")
new_predictions = model_production.predict(new_data)

This approach solves the deployment mystery. You know exactly what is running. Your team can collaborate without confusion. New members can understand the project history instantly.

The true value isn’t just in the logging. It’s in the confidence it gives you. You can make changes knowing you have a complete record to roll back to. You can share results with stakeholders, backed by solid data. How much time could your team save by never again having to reconstruct how a result was produced?

I built this system after one too many late nights searching for a lost model. It has transformed how my team works. We experiment fearlessly because we have a map of everything we’ve tried. I encourage you to start your next project with this setup.

Try it. Take a simple script and add MLflow tracking. See the history build. Share your findings with your colleagues. Comment below with the first metric you tracked—was it accuracy, precision, or something else? If this guide helped you, please like and share it. Let’s help more teams move from chaos to clarity.



