Mixed Precision Training in PyTorch: Faster AI Training With Less GPU Memory
Learn mixed precision training in PyTorch to speed up deep learning, cut GPU memory use, and scale models efficiently with AMP.
I’ve been thinking about speed. Not just any speed, but the raw, hungry pace of modern artificial intelligence. We build larger models with billions of parameters, but our hardware groans under the weight. The bottleneck isn’t just processing power; it’s memory. Every layer, every gradient, every weight stored as a 32-bit floating-point number consumes precious GPU resources, slowing everything down. This constant tug-of-war between ambition and hardware is what brought me to a powerful, often underutilized technique. What if you could train models almost twice as fast while using half the memory? This isn’t a hypothetical. It’s the practical reality of mixed precision training.
Why does precision matter so much? In deep learning, we traditionally use 32-bit floating-point numbers, known as FP32. They offer a wide range and fine precision. But modern GPUs have special hardware called Tensor Cores that are built for 16-bit math, or FP16. These cores can perform many more operations per second on these smaller numbers. The catch is that FP16 has a much smaller range. Very large numbers can overflow to infinity, and very small numbers, like tiny gradients, can underflow to zero. If your gradients become zeros, your model stops learning.
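You can see both failure modes in a couple of lines. Here is a tiny illustration; the cutoffs follow directly from the FP16 format, which bottoms out around 6e-8 and tops out at 65504:

import torch

tiny = torch.tensor(1e-8, dtype=torch.float16)     # below FP16's smallest representable value
huge = torch.tensor(70000.0, dtype=torch.float16)  # above FP16's maximum of 65504

print(tiny)  # tensor(0., dtype=torch.float16)  -> underflow: the value is simply gone
print(huge)  # tensor(inf, dtype=torch.float16) -> overflow: the value blows up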
So, do we just run everything in 16-bit and hope for the best? Absolutely not. That’s a direct path to failure. The clever solution is mixed precision training. The core idea is simple: use the right tool for each job. Use fast FP16 for the heavy computational lifting during the forward and backward passes where the GPU’s Tensor Cores shine. But safeguard the process by keeping a master copy of your model’s weights in stable, high-precision FP32. Think of the FP16 model as a fast, agile scout, and the FP32 master weights as the reliable, accurate map.
Here’s where the magic, and the necessity, of gradient scaling comes in. During the backward pass, those all-important gradients are often very small numbers. In FP16, these can vanish—they become zero, and learning halts. To prevent this, we artificially scale the loss value upward before the backward pass begins. This shifts the gradients into a safer range for FP16. After the gradients are calculated, we simply scale them back down before using them to update our master FP32 weights. It’s a simple trick with profound impact.
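Stripped of everything PyTorch automates for you, the idea looks roughly like this sketch. The scale factor of 1024 is an arbitrary illustrative constant; real implementations adjust it dynamically.

scale = 1024.0  # illustrative constant; AMP grows and shrinks this automatically

for data, target in dataloader:
    optimizer.zero_grad()
    output = model(data)                  # forward pass, FP16 where possible
    loss = criterion(output, target)

    (loss * scale).backward()             # scaled loss -> gradients stay above FP16's underflow floor

    for p in model.parameters():          # un-scale before touching the FP32 master weights
        if p.grad is not None:
            p.grad.div_(scale)

    optimizer.step()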
How do you actually implement this? PyTorch makes it straightforward with its Automatic Mixed Precision (AMP) module. You don’t need to manually cast tensors. You wrap parts of your training loop in a context manager. Let’s look at the heart of a training step.
First, the standard setup without AMP:
# Standard training step (FP32)
for data, target in dataloader:
    optimizer.zero_grad()
    output = model(data)
    loss = criterion(output, target)
    loss.backward()
    optimizer.step()
Now, let’s integrate AMP. Notice the introduction of a GradScaler.
from torch.cuda.amp import autocast, GradScaler

scaler = GradScaler()  # Handles the gradient scaling

for data, target in dataloader:
    optimizer.zero_grad()

    # Use autocast for the forward pass (runs in FP16 where beneficial)
    with autocast():
        output = model(data)
        loss = criterion(output, target)

    # Scale the loss and backprop; the gradients produced are in scaled form
    scaler.scale(loss).backward()

    # Step the optimizer (gradients are un-scaled inside; the step is skipped on inf/NaN)
    scaler.step(optimizer)

    # Update the scale factor for the next iteration
    scaler.update()
The autocast context manager automatically chooses the precision for operations inside it. The GradScaler object manages the entire scaling process, protecting those small gradients.
What about memory? The savings are significant. Since activations and gradients are stored in FP16 during the most memory-intensive part of training, you can often double your batch size. That translates directly into higher training throughput. Consider a model whose activations consume 10GB per batch in FP32. With mixed precision, that could drop to 5-6GB, letting you raise the batch size from, say, 32 to 64 samples. Larger batches mean more efficient hardware utilization and fewer steps per epoch.
But it’s not all automatic. You must be cautious with certain operations. Functions like softmax or layer normalization are sensitive to precision. Thankfully, autocast is aware of this. It has a list of operations that are best kept in FP32 for numerical stability, and it handles this for you. However, if you write custom kernels or use complex functions, you should test thoroughly. A good practice is to always keep the reduction part of your loss function inside the autocast context to ensure consistency.
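You can probe these decisions yourself. In this small experiment (run on a CUDA device), a matrix multiply is executed in FP16 while softmax is kept in FP32 by autocast:

import torch
from torch.cuda.amp import autocast

a = torch.randn(64, 64, device="cuda")
b = torch.randn(64, 64, device="cuda")

with autocast():
    mm = a @ b                        # matmul is on autocast's FP16 list
    sm = torch.softmax(mm, dim=-1)    # softmax is on autocast's FP32 list

print(mm.dtype)  # torch.float16
print(sm.dtype)  # torch.float32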
Can you visualize the benefit? Let’s write a small script to compare memory usage. This is where the personal payoff becomes clear.
import torch
import torch.nn as nn
from torch.cuda.amp import autocast, GradScaler

# A simple, bulky network
class BigNet(nn.Module):
    def __init__(self):
        super().__init__()
        self.layers = nn.Sequential(
            *[nn.Linear(1000, 1000) for _ in range(20)]
        )

    def forward(self, x):
        return self.layers(x)

model = BigNet().cuda()
data = torch.randn(8192, 1000).cuda()  # A large batch, so activations dominate memory

# Clear cache and measure
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

# Run a forward/backward pass in FP32
output = model(data)
loss = output.sum()
loss.backward()
print(f"FP32 Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")

# Clear and measure again for AMP
model.zero_grad(set_to_none=True)
del output, loss
torch.cuda.empty_cache()
torch.cuda.reset_peak_memory_stats()

scaler = GradScaler()
with autocast():
    output = model(data)
    loss = output.sum()
scaler.scale(loss).backward()
print(f"AMP Peak memory: {torch.cuda.max_memory_allocated() / 1e9:.2f} GB")
Running this, you should see a noticeably smaller peak for the AMP pass; on large, activation-heavy models the reduction often reaches 40-50%. This freed-up memory is your resource to use for bigger models or data.
Is there a newer, better format than FP16? Yes, meet BF16 (Brain Floating Point). It was designed by Google for machine learning. It has the same dynamic range as FP32 (so overflow/underflow is less of a worry) but the same memory footprint as FP16. If your hardware supports it (like newer NVIDIA A100s or TPUs), you can use it by setting the dtype in autocast.
with autocast(dtype=torch.bfloat16):
    output = model(data)
    loss = criterion(output, target)
With BF16, gradient scaling is usually unnecessary because the dynamic range matches FP32, so many training loops simply drop the GradScaler when using it.
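A quick way to see the trade-off between the two 16-bit formats is to print their numerical limits:

import torch

print(torch.finfo(torch.float16))   # max=65504, smallest normal ~6.1e-05: narrow range, more precision bits
print(torch.finfo(torch.bfloat16))  # max~3.4e+38, smallest normal ~1.2e-38: FP32-like range, fewer precision bits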
The final step is integration. This isn’t a laboratory trick. It’s used to train the largest models in the world. When you combine mixed precision with techniques like gradient accumulation (simulating a larger batch size over several steps) and distributed data parallel training, you unlock the true potential of your hardware stack. You move from waiting for experiments to finish to iterating on ideas rapidly.
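Here is a minimal sketch of how AMP composes with gradient accumulation. The accum_steps value is an illustrative choice, and the rest reuses the training-loop objects from earlier:

accum_steps = 4  # simulate a 4x larger effective batch size
scaler = GradScaler()

optimizer.zero_grad()
for i, (data, target) in enumerate(dataloader):
    with autocast():
        output = model(data)
        loss = criterion(output, target) / accum_steps  # average over the virtual batch

    scaler.scale(loss).backward()  # scaled gradients accumulate across iterations

    if (i + 1) % accum_steps == 0:
        scaler.step(optimizer)     # un-scales, then steps (skipped on inf/NaN gradients)
        scaler.update()
        optimizer.zero_grad()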
I started this by thinking about the constraints of hardware. I’m ending it by realizing the technique is really about removing constraints on creativity. By smartly managing precision, we can ask bigger questions of our models. We can experiment more freely. The code change is minimal—a few extra lines in your training loop—but the impact on your workflow can be transformative. Give it a try on your next project. Measure the time saved, the memory freed, and see how much further you can go.
Did you find this walkthrough helpful? Have you tried mixed precision training and seen interesting results or faced unexpected challenges? I’d love to hear about your experiences. Please share your thoughts in the comments below, and if this guide helped you train faster, consider sharing it with a colleague who might be pushing against similar limits.