I’ve been working with deep learning models for years, and there’s a moment we all face: the training is done, the accuracy looks great, but the model is just too big and too slow. It happens when you try to put it on a phone or an embedded device, or when you simply need it to answer queries faster for thousands of users. The excitement of a high-performing model crashes into the reality of deployment. That’s why I’m writing this today. I want to show you a practical way to make your models smaller and faster, without starting from scratch. If you stick with me, I’ll show you how to shrink your model’s size and boost its speed, often by four times or more. Ready to begin? Let’s get into it.
Why does this work? Think of the numbers inside your model—the weights and activations. They are usually 32-bit floating-point numbers, which are very precise. But do we always need that much precision? Often, we don’t. By converting these numbers into 8-bit integers, we can store four times as many in the same space. It’s like swapping a heavy, detailed blueprint for a clear, efficient sketch that still gets the job done.
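To make this concrete, here is a tiny check you can run on its own (just an illustration of storage cost, not part of any quantization API): a float32 tensor spends four bytes per element, while an int8 tensor spends one.
import torch
# One million values stored at each precision
fp32_values = torch.zeros(1_000_000, dtype=torch.float32)
int8_values = torch.zeros(1_000_000, dtype=torch.int8)
print(fp32_values.nelement() * fp32_values.element_size())  # 4000000 bytes
print(int8_values.nelement() * int8_values.element_size())  # 1000000 bytes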
The core idea is called quantization. It’s a process of mapping a wide range of values to a smaller set. In our case, we map float32 numbers to int8 integers. This happens using a simple formula: you divide the number by a ‘scale’ factor, round it, and add a ‘zero_point’ offset. To get the value back, you do the reverse. The trick is in choosing the right scale and zero_point to lose as little information as possible.
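Written out as a small sketch in plain PyTorch (the symmetric scale choice below is only for illustration; PyTorch computes scale and zero_point for you):
import torch

def quantize(x, scale, zero_point):
    # float32 -> int8: divide by the scale, round, add the zero_point, clamp to the int8 range
    q = torch.clamp(torch.round(x / scale) + zero_point, -128, 127)
    return q.to(torch.int8)

def dequantize(q, scale, zero_point):
    # int8 -> float32: subtract the zero_point, multiply by the scale
    return (q.to(torch.float32) - zero_point) * scale

x = torch.tensor([-1.0, 0.0, 0.5, 1.0])
scale, zero_point = x.abs().max().item() / 127, 0  # map the largest magnitude to 127
q = quantize(x, scale, zero_point)
x_hat = dequantize(q, scale, zero_point)
print(q, x_hat)  # the round trip loses a little precision, which is the trade-off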
So, how do we actually do this in PyTorch? The framework provides straightforward tools. Let’s look at the simplest method first: dynamic quantization. Here, the model’s weights are converted to int8 ahead of time, while the activations—the numbers calculated during inference—are converted on the fly. This works very well for models dominated by Linear and LSTM layers.
import torch
from torch import nn
# Let's say we have a simple recurrent model
class SimpleLSTM(nn.Module):
    def __init__(self, input_dim=128, hidden_dim=256):
        super().__init__()
        self.lstm = nn.LSTM(input_dim, hidden_dim, num_layers=2)
        self.fc = nn.Linear(hidden_dim, 10)

    def forward(self, x):
        out, _ = self.lstm(x)
        out = self.fc(out[-1, :, :])
        return out
model_fp32 = SimpleLSTM()
# Load your trained weights here
# model_fp32.load_state_dict(torch.load('model.pth'))
# Apply dynamic quantization
model_int8 = torch.quantization.quantize_dynamic(
    model_fp32,            # the original model
    {nn.LSTM, nn.Linear},  # the layer types we want to quantize
    dtype=torch.qint8
)
# Compare serialized sizes to see the difference on disk
import io

def model_size_mb(model):
    buffer = io.BytesIO()
    torch.save(model.state_dict(), buffer)
    return buffer.getbuffer().nbytes / 1e6

print(f"Original model size: {model_size_mb(model_fp32):.2f} MB")
print(f"Quantized model size: {model_size_mb(model_int8):.2f} MB")
Dynamic quantization is great, but what if you want even more speed? For that, we use static quantization. This method is a bit more involved but gives better performance. It requires a calibration step. You run some sample data through the model to observe the range of activation values. This data is used to determine the optimal ‘scale’ and ‘zero_point’ for each layer, permanently fixing them. Have you considered what a small batch of your own data could tell your model about itself?
# Example setup for static quantization on a vision model
from torch.quantization import QuantStub, DeQuantStub
class QuantizableModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.quant = QuantStub()      # Marks where quantization starts
        self.conv = nn.Conv2d(3, 16, kernel_size=3)
        self.relu = nn.ReLU()
        self.fc = nn.Linear(16 * 26 * 26, 10)
        self.dequant = DeQuantStub()  # Marks where dequantization happens

    def forward(self, x):
        x = self.quant(x)
        x = self.relu(self.conv(x))
        x = x.view(x.size(0), -1)
        x = self.fc(x)
        x = self.dequant(x)
        return x
model = QuantizableModel()
model.eval() # Model must be in eval mode
# Specify the quantization configuration
model.qconfig = torch.quantization.get_default_qconfig('fbgemm') # For x86 CPUs
# Prepare the model for calibration
torch.quantization.prepare(model, inplace=True)
# Now, run calibration data through the model (no gradients are needed)
# This is typically 100-1000 representative images from your dataset
# with torch.no_grad():
#     for calibration_batch, _ in calibration_data_loader:
#         model(calibration_batch)
# Finally, convert the model
torch.quantization.convert(model, inplace=True)
# Your model is now statically quantized
Sometimes, converting a trained model leads to a noticeable drop in accuracy. What then? This is where Quantization-Aware Training (QAT) comes in. It’s a bit like training with ankle weights. You simulate the quantization process during training, so the model learns to adapt to the lower precision from the start. When you later convert it for real, it performs much better.
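Here is a rough sketch of what that looks like with PyTorch’s eager-mode API, reusing the QuantizableModel defined above (the commented-out training loop is only a placeholder; num_epochs, train_loader, criterion, and optimizer stand in for whatever you already use when fine-tuning):
qat_model = QuantizableModel()
qat_model.train()  # QAT runs in training mode
# A QAT config inserts fake-quantize modules that simulate int8 during training
qat_model.qconfig = torch.quantization.get_default_qat_qconfig('fbgemm')
torch.quantization.prepare_qat(qat_model, inplace=True)
# Fine-tune as usual; the model learns to tolerate the quantization noise
# for epoch in range(num_epochs):
#     for images, labels in train_loader:
#         loss = criterion(qat_model(images), labels)
#         optimizer.zero_grad()
#         loss.backward()
#         optimizer.step()
# After fine-tuning, convert to a real int8 model
qat_model.eval()
qat_model_int8 = torch.quantization.convert(qat_model)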
The result of all this work is a model that is fundamentally different. It’s not just a compressed file; it’s a network that performs calculations using integers. This makes it drastically faster on compatible hardware like most modern CPUs and specialized neural processing units. The model file on your disk will be about a quarter of the original size.
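You can see that shift directly by inspecting a converted layer. Printing the fc module from the dynamic example shows it has been swapped for an int8 counterpart (the exact repr varies a little between PyTorch versions):
print(model_int8.fc)
# Something like: DynamicQuantizedLinear(in_features=256, out_features=10, dtype=torch.qint8)
print(model_int8.fc.weight().dtype)  # torch.qint8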
How do you know it worked? You must measure. Compare the size of your model files before and after. More importantly, benchmark the inference time. Use the same input and time both models. Check the final accuracy on your test set. The goal is a tiny drop in accuracy for a massive gain in speed and size. Would a 1% accuracy trade for a 4x speedup be worth it for your application?
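A minimal way to run that comparison for the dynamic example above (the sequence length, batch size, and iteration count here are arbitrary; use shapes that match your real workload):
import time
example_input = torch.randn(50, 1, 128)  # (seq_len, batch, input_dim) for the LSTM

def benchmark(model, x, iterations=100):
    model.eval()
    with torch.no_grad():
        model(x)  # warm-up run
        start = time.perf_counter()
        for _ in range(iterations):
            model(x)
    return (time.perf_counter() - start) / iterations

print(f"FP32: {benchmark(model_fp32, example_input) * 1000:.2f} ms per inference")
print(f"INT8: {benchmark(model_int8, example_input) * 1000:.2f} ms per inference")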
Finally, you need to deploy it. The quantized PyTorch model can be used directly. For broader compatibility, especially on mobile or web, you can export it to the ONNX format with the quantization information preserved. This allows runtimes like ONNX Runtime to execute it efficiently across different platforms.
# Exporting a quantized model to ONNX (conceptual example)
dummy_input = torch.randn(1, 3, 28, 28)
# Note: actual export of quantized models requires careful handling of the model state,
# and the quantized operators need a reasonably recent opset version
torch.onnx.export(model, dummy_input, "quantized_model.onnx", opset_version=13)
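Once exported, a runtime such as ONNX Runtime can load and execute the graph. A quick smoke test might look like this (assuming the onnxruntime package is installed and the export above succeeded):
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession("quantized_model.onnx")
input_name = session.get_inputs()[0].name
# Run one dummy image through the exported graph to confirm it works end to end
dummy_image = np.random.randn(1, 3, 28, 28).astype(np.float32)
outputs = session.run(None, {input_name: dummy_image})
print(outputs[0].shape)  # (1, 10)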
This journey from a bulky, slow model to a lean, fast one is one of the most satisfying in machine learning engineering. It bridges the gap between theoretical achievement and practical application. You built something smart, and now you’ve made it accessible. I hope this guide helps you bring your models to life in the real world.
If you found this walkthrough useful, please share it with a colleague who’s battling with model size. Have you tried quantization before? What was your biggest hurdle? Let me know in the comments below—I read every one.