How to Deploy a PyTorch Model with TorchServe and Docker
Learn how to deploy a PyTorch model with TorchServe and Docker, from .pth files to scalable APIs. Build a production-ready inference service.
Lately, I’ve been thinking a lot about what happens after the training loop ends. We spend so much time tuning hyperparameters and chasing accuracy scores, but a model in a Jupyter notebook doesn’t solve real problems. It hit me: the true test of a model isn’t its validation loss, but its ability to handle a real request at 2 AM. That’s why I want to talk about moving from a .pth file to a live, scalable service. Let’s get our models working for us.
The journey from a saved state dictionary to a production endpoint has several clear steps. It starts with preparing your model correctly. You can’t just pickle the entire training object and hope for the best. In PyTorch, that means saving the model’s state dict or exporting a TorchScript trace. Here’s a basic example of saving a simple image classifier.
import torch
import torchvision.models as models
# Assume a model is already defined and trained
model = models.resnet18(weights=None)  # use pretrained=False on torchvision older than 0.13
model.fc = torch.nn.Linear(model.fc.in_features, 10)
model.eval()
# Save the learned parameters
torch.save(model.state_dict(), "model_weights.pth")
This creates a portable file with just the weights, not the entire Python class structure. Why is this separation important for serving?
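If you’d rather ship a self-contained artifact that doesn’t need the original Python class at load time, the TorchScript route mentioned above works too. A minimal sketch, assuming a 224×224 RGB input (the shape is an assumption, not a requirement):
import torch
import torchvision.models as models
model = models.resnet18(weights=None)
model.eval()
# Trace the model with a representative dummy input and save the result
example_input = torch.randn(1, 3, 224, 224)
traced_model = torch.jit.trace(model, example_input)
traced_model.save("model_traced.pt")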
Now, we need a way to accept requests, run the model, and return results. You could write a Flask app, but then you’re responsible for threading, batching, and monitoring. A dedicated serving tool handles this infrastructure. TorchServe is built for this purpose. Its core concept is the handler, a Python class that manages the model’s lifecycle and the data transformation.
A handler defines three main stages: preprocessing the incoming data, running inference, and postprocessing the output. Think of it as a custom pipeline you build. Here is a skeleton for an image handler.
from ts.torch_handler.base_handler import BaseHandler
import torch

class ImageHandler(BaseHandler):
    def initialize(self, context):
        # Load your model here, once when the worker starts
        self.model = self._load_my_model()
        self.device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
        self.model.to(self.device)
        self.model.eval()
        self.initialized = True

    def preprocess(self, data):
        # Convert a list of request data into a model-ready tensor
        images = []
        for req in data:
            # Each 'req' is raw bytes from the network
            image_tensor = self._convert_bytes_to_tensor(req.get("body"))
            images.append(image_tensor)
        return torch.stack(images).to(self.device)

    def inference(self, preprocessed_data):
        # The forward pass
        with torch.no_grad():
            predictions = self.model(preprocessed_data)
        return predictions

    def postprocess(self, inference_output):
        # Convert tensors to a JSON-serializable format like a list of dicts
        results = []
        for pred in inference_output:
            results.append({"class_id": pred.argmax().item(), "score": pred.max().item()})
        return results
This structure keeps your business logic clean. The initialize method is called once, saving precious time on every request. What do you think happens if you put the model loading code inside the inference function instead?
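The two underscore-prefixed methods are placeholders the handler is assumed to provide. As a rough sketch of what the byte-to-tensor helper could look like, assuming Pillow and a standard torchvision transform pipeline (none of this is dictated by TorchServe itself):
import io
from PIL import Image
from torchvision import transforms

# Inside ImageHandler: a hypothetical implementation of the placeholder
def _convert_bytes_to_tensor(self, image_bytes):
    # Decode raw request bytes into an RGB image, then into a tensor
    image = Image.open(io.BytesIO(image_bytes)).convert("RGB")
    transform = transforms.Compose([
        transforms.Resize(256),
        transforms.CenterCrop(224),
        transforms.ToTensor(),
    ])
    return transform(image)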
With a handler written, you package everything into a single deployable unit. TorchServe uses a .mar file, a model archive. You create it with a command-line tool. This tool bundles your model weights, your handler code, and any other necessary files.
torch-model-archiver \
  --model-name mymodel \
  --version 1.0 \
  --serialized-file model_weights.pth \
  --handler image_handler.py \
  --export-path ./model_store \
  --force
After running this, you’ll have a file like mymodel.mar in the model_store directory. This file is self-contained. You can move it to any server with TorchServe installed and run it. How do you think this simplifies deployment across different machines?
Running the server locally is straightforward. You point it to the directory containing your .mar file.
torchserve --start --model-store ./model_store --models mymodel=mymodel.mar
By default, this starts a server on localhost port 8080. You now have a /predictions/mymodel endpoint. You can send a POST request with image data, and it will return the inference result. But local deployment is just the beginning. What about ensuring consistency across your team’s laptops, the testing server, and the cloud?
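A quick smoke test from the command line looks like this (the image filename is just an example):
curl http://localhost:8080/predictions/mymodel -T test_image.jpg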
This is where Docker becomes essential. A Docker container wraps your model server, its dependencies, and your model file into a single, runnable image. Anyone with Docker can run it identically. A Dockerfile for this setup is quite simple.
# Use the official TorchServe base image
FROM pytorch/torchserve:latest
# Copy your model archive into the container's model store
COPY ./model_store/mymodel.mar /home/model-server/model-store/
# Set the default command to start TorchServe
CMD ["torchserve", \
"--start", \
"--model-store", "/home/model-server/model-store", \
"--models", "mymodel=mymodel.mar"]
You build the image with docker build -t my-model-server . and run it with docker run -p 8080:8080 -p 8081:8081 my-model-server (publishing port 8081 as well keeps the management API reachable from the host). Suddenly, your model is isolated, portable, and ready to be scaled. If you need to handle more traffic, you can use Docker Compose or Kubernetes to run multiple copies of this container behind a load balancer.
But a good API is more than just running. You need to know how it’s performing. TorchServe provides management APIs out of the box. While 8080 is for predictions, port 8081 is for management. You can query http://localhost:8081/models/mymodel to see the model’s status. You can even scale the number of worker processes up or down without restarting the server.
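For example, assuming you published port 8081 when starting the container:
# Describe the model: version, worker count, status
curl http://localhost:8081/models/mymodel
# Ask TorchServe to scale this model to at least two workers
curl -X PUT "http://localhost:8081/models/mymodel?min_worker=2"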
Let’s test it. Here’s how you might send a request using Python’s requests library.
import requests
import json

# Open an image file
with open("test_image.jpg", "rb") as f:
    image_data = f.read()

# Send it to the prediction endpoint
response = requests.post(
    "http://localhost:8080/predictions/mymodel",
    data=image_data,
)
print(json.dumps(response.json(), indent=2))
You should get back a clean JSON response with the prediction. This is the moment of truth—your model is now a service. Are you checking the latency and success rate of these calls?
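If you want a rough read on latency, you can time the call itself. This is a sketch for a single request, not a proper load test:
import time
import requests

with open("test_image.jpg", "rb") as f:
    image_data = f.read()

start = time.perf_counter()
response = requests.post("http://localhost:8080/predictions/mymodel", data=image_data)
elapsed_ms = (time.perf_counter() - start) * 1000

# One call only; a real check would aggregate many requests
print(f"status={response.status_code}, latency={elapsed_ms:.1f} ms")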
Finally, think about the path ahead. You can configure TorchServe for production by adjusting its config file. You can enable request batching to improve throughput when many requests come in at once. You can set up logging and metrics collection to monitor health. The model archive also supports versioning, allowing you to roll out a new .mar file and update the server without downtime.
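To make that concrete, a config.properties along these lines enables batching for our model; the numbers are placeholders, and the exact keys are worth checking against the TorchServe batch-inference docs for your version:
inference_address=http://0.0.0.0:8080
management_address=http://0.0.0.0:8081
# Batch up to 8 requests, waiting at most 100 ms to fill a batch
models={"mymodel": {"1.0": {"defaultVersion": true, "marName": "mymodel.mar", "batchSize": 8, "maxBatchDelay": 100}}}
You then point the server at it with torchserve --start --ts-config config.properties --model-store ./model_store.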
This process transforms your project from a local experiment into a reliable component. It’s about building a bridge between your data science work and the applications that need it. The code you write for the handler and the Dockerfile is the foundation for that bridge.
I hope this guide helps you take that critical step. The feeling of seeing your model respond to an API call is worth the setup. If you found this walkthrough useful, please share it with a colleague who’s also working on deployment. What was the biggest hurdle you faced when serving your first model? Let me know in the comments below.