
How to Build a Semantic Segmentation Model with PyTorch: Complete U-Net Implementation Tutorial

Learn to build semantic segmentation models with PyTorch and U-Net architecture. Complete guide covering data preprocessing, training strategies, and evaluation metrics for computer vision projects.


I spend a lot of time staring at images, not for their beauty, but for what they hide. In my work, I often need a machine to see the world not as a flat picture, but as a collection of distinct parts. Where is the road? Which pixels form a tree? This precise pixel-by-pixel understanding is the goal of semantic segmentation. This guide comes from a practical need: to move from theory to a working model you can build today. We’ll do that by creating a U-Net, a powerful and surprisingly elegant architecture, using PyTorch. This is the blueprint that brought advanced segmentation to many fields.

Think about a self-driving car’s camera feed. Object detection draws boxes around cars. But what about the road itself, the sidewalk, or a distant pedestrian? Semantic segmentation gives every single pixel a label, creating a detailed map of the scene. This level of detail is why it’s vital for medical imaging to isolate tumors, for satellite analysis to track deforestation, and for augmented reality to blend digital objects seamlessly into our world. Can you see how this changes everything?

Let’s start with the data, because a model is only as good as what it learns from. In segmentation, you have an image and its partner: a mask. This mask is a grayscale image where each shade of gray corresponds to a different class (like 0 for background, 1 for car, 2 for person). Loading them in sync is crucial. Here’s a simple way to create a dataset class in PyTorch that ensures every image is paired with its correct mask.

import os
import numpy as np
import torch
from torch.utils.data import Dataset
from PIL import Image

class SegmentationDataset(Dataset):
    def __init__(self, image_dir, mask_dir, transform=None):
        self.image_dir = image_dir
        self.mask_dir = mask_dir
        self.transform = transform
        self.images = sorted(os.listdir(image_dir))

    def __len__(self):
        return len(self.images)

    def __getitem__(self, idx):
        img_path = os.path.join(self.image_dir, self.images[idx])
        mask_path = os.path.join(self.mask_dir, self.images[idx].replace('.jpg', '_mask.png'))
        image = Image.open(img_path).convert("RGB")
        mask = Image.open(mask_path).convert("L")  # 'L' mode for grayscale labels

        if self.transform:
            image = self.transform(image)
            # Note: geometric transforms must be applied to the mask too. More on this below.
        # Convert the mask to a long tensor of class indices so the DataLoader can batch it.
        mask = torch.from_numpy(np.array(mask, dtype=np.int64))
        return image, mask

Did you notice the challenge? When you flip or rotate an image for augmentation, you must apply the exact same geometric change to its mask. You can’t just randomly change the colors of a mask—it would destroy the labels. This coordinated dance is key to effective training. Why do you think data augmentation is even more critical for segmentation than for simple image classification?
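To make that coordinated dance concrete, here is one minimal way to write a paired augmentation function. This is a sketch of the idea, not the only approach (libraries like Albumentations handle this for you): geometric operations hit both tensors identically, while photometric changes touch only the image. The function name `paired_augment` and the specific transforms are illustrative choices.

```python
import random
import torch

def paired_augment(image, mask):
    """Apply identical geometric transforms to an image tensor (C, H, W)
    and its label mask tensor (H, W)."""
    if random.random() < 0.5:
        image = torch.flip(image, dims=[2])   # horizontal flip
        mask = torch.flip(mask, dims=[1])     # ...applied to the mask too
    if random.random() < 0.5:
        image = torch.flip(image, dims=[1])   # vertical flip
        mask = torch.flip(mask, dims=[0])
    k = random.randint(0, 3)                  # random 90-degree rotation
    image = torch.rot90(image, k, dims=[1, 2])
    mask = torch.rot90(mask, k, dims=[0, 1])
    # Photometric changes go on the image ONLY; altering the mask's
    # values would corrupt the class labels.
    image = image * random.uniform(0.9, 1.1)
    return image, mask
```

Note that flips and 90-degree rotations are safe because they never interpolate; if you add arbitrary rotations or scaling, resample the mask with nearest-neighbor interpolation to keep labels discrete.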

Now, the star of our show: the U-Net. Its name comes from its symmetrical, U-shaped design. The left side is the encoder: a series of layers that compress the image, learning “what” is in the scene. The right side is the decoder: a path that expands this compressed knowledge back to full resolution, learning “where” things are. The genius is in the skip connections—bridges that connect layers from the encoder to the decoder. These bridges pass forward fine-grained spatial details that would otherwise be lost during compression, helping the decoder paint a precise output.

Building this in PyTorch is a satisfying exercise in module assembly. We start by defining the basic building block, a pair of convolutions that appears at every level of both the contracting and expanding paths.

import torch.nn as nn

class DoubleConv(nn.Module):
    """(Convolution => BatchNorm => ReLU) * 2"""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.double_conv = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, kernel_size=3, padding=1),
            nn.BatchNorm2d(out_channels),
            nn.ReLU(inplace=True)
        )
    def forward(self, x):
        return self.double_conv(x)

The encoder uses this block and then downsamples with a max-pooling layer. The decoder upsamples, concatenates the feature map with the corresponding skip connection from the encoder, and then processes it through another DoubleConv. This process repeats until we reach the original image size. The final layer is a 1x1 convolution that maps the learned features to the desired number of output classes.
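Putting that description together, here is one way the full network might be assembled. This is a compact sketch, not the canonical U-Net: the `UNet` class name, the default feature widths, and the use of `ConvTranspose2d` for upsampling are all implementation choices, and `DoubleConv` is repeated so the example stands alone. Input height and width should be divisible by 2 for each downsampling step.

```python
import torch
import torch.nn as nn

class DoubleConv(nn.Module):
    """(Convolution => BatchNorm => ReLU) * 2, as defined earlier."""
    def __init__(self, in_channels, out_channels):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(in_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
            nn.Conv2d(out_channels, out_channels, 3, padding=1),
            nn.BatchNorm2d(out_channels), nn.ReLU(inplace=True),
        )
    def forward(self, x):
        return self.net(x)

class UNet(nn.Module):
    def __init__(self, in_channels=3, num_classes=2, features=(64, 128, 256, 512)):
        super().__init__()
        self.downs = nn.ModuleList()
        self.ups = nn.ModuleList()
        self.pool = nn.MaxPool2d(2)
        ch = in_channels
        for f in features:                    # encoder: DoubleConv then pool
            self.downs.append(DoubleConv(ch, f))
            ch = f
        self.bottleneck = DoubleConv(features[-1], features[-1] * 2)
        for f in reversed(features):          # decoder: upsample, concat, DoubleConv
            self.ups.append(nn.ConvTranspose2d(f * 2, f, kernel_size=2, stride=2))
            self.ups.append(DoubleConv(f * 2, f))
        self.head = nn.Conv2d(features[0], num_classes, kernel_size=1)  # 1x1 conv to classes

    def forward(self, x):
        skips = []
        for down in self.downs:
            x = down(x)
            skips.append(x)                   # saved for the skip connection
            x = self.pool(x)
        x = self.bottleneck(x)
        for i in range(0, len(self.ups), 2):
            x = self.ups[i](x)                        # upsample
            skip = skips[-(i // 2 + 1)]               # matching encoder features
            x = torch.cat([skip, x], dim=1)           # skip connection: concat channels
            x = self.ups[i + 1](x)
        return self.head(x)
```

The output has shape `(batch, num_classes, H, W)`: one score per class at every pixel, ready for a pixel-wise loss.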

Training this model requires a special loss function. Since we’re making a prediction for every pixel, we need a loss that compares two whole images. The standard is a pixel-wise cross-entropy loss. However, if your classes are imbalanced (e.g., lots of background, very few pixels for a rare object), you might use Dice Loss or a combination. What happens if you ignore class imbalance during training?

# Dice loss, often combined with pixel-wise cross-entropy in practice
def dice_loss(pred, target, smooth=1e-6):
    pred = torch.softmax(pred, dim=1)
    target_one_hot = torch.nn.functional.one_hot(target, num_classes=pred.shape[1]).permute(0, 3, 1, 2).float()
    intersection = (pred * target_one_hot).sum(dim=(2, 3))
    union = pred.sum(dim=(2, 3)) + target_one_hot.sum(dim=(2, 3))
    dice = (2. * intersection + smooth) / (union + smooth)
    return 1 - dice.mean()
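A common way to combine the two is a simple weighted sum. The sketch below repeats `dice_loss` so it runs on its own; the 50/50 weighting in `combined_loss` is an arbitrary starting point you would tune for your dataset.

```python
import torch
import torch.nn.functional as F

def dice_loss(pred, target, smooth=1e-6):
    # Same dice_loss as above, repeated to keep this example self-contained.
    pred = torch.softmax(pred, dim=1)
    target_one_hot = F.one_hot(target, num_classes=pred.shape[1]).permute(0, 3, 1, 2).float()
    intersection = (pred * target_one_hot).sum(dim=(2, 3))
    union = pred.sum(dim=(2, 3)) + target_one_hot.sum(dim=(2, 3))
    dice = (2. * intersection + smooth) / (union + smooth)
    return 1 - dice.mean()

def combined_loss(pred, target, ce_weight=0.5):
    """Weighted sum of pixel-wise cross-entropy and Dice loss.
    pred: raw logits (N, C, H, W); target: class indices (N, H, W)."""
    ce = F.cross_entropy(pred, target)
    return ce_weight * ce + (1 - ce_weight) * dice_loss(pred, target)
```

Cross-entropy gives smooth per-pixel gradients; the Dice term directly rewards overlap, which keeps rare classes from being drowned out by the background.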

Measuring success isn’t just about a dropping loss number. We need metrics that understand image structure. The Intersection over Union (IoU), also called the Jaccard Index, is the gold standard. For each class, you measure the area of overlap between the predicted mask and the true mask, divided by the area of union. A high IoU means your predicted shape closely matches the ground truth shape. Tracking the mean IoU across all classes gives you a single, powerful number to judge your model’s performance.
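Computed from hard label maps, the metric might look like the sketch below. The choice to skip classes absent from both prediction and ground truth (rather than count them as a perfect score) is one common convention, not the only one.

```python
import torch

def mean_iou(pred_mask, true_mask, num_classes, eps=1e-6):
    """Mean Intersection over Union across classes.
    pred_mask, true_mask: integer label maps, e.g. shape (H, W) or (N, H, W)."""
    ious = []
    for cls in range(num_classes):
        pred_c = pred_mask == cls
        true_c = true_mask == cls
        union = (pred_c | true_c).sum().item()
        if union == 0:
            continue  # class absent from both masks: skip it entirely
        intersection = (pred_c & true_c).sum().item()
        ious.append(intersection / (union + eps))
    return sum(ious) / len(ious) if ious else float('nan')
```

To get predictions from the model's logits, take `pred_mask = logits.argmax(dim=1)` before calling this.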

The journey from a blank script to a model that can trace objects in an image is incredibly rewarding. You begin to see the world through a different lens—one of shapes, boundaries, and contexts. I built my first U-Net to help analyze microscopic images, and the moment it correctly outlined a specific cell structure was a revelation. The process teaches you not just about code, but about how machines learn to interpret visual space.

I encourage you to take this foundation and experiment. Start with a small dataset, like the Oxford-IIIT Pet dataset, where you segment pets from their background. Watch the model learn. The path from here involves exploring newer architectures, but U-Net remains a timeless and effective starting point. What will you build with it?

If this guide helped you see the pieces of the puzzle, please share it with others who might be starting their own journey. I’d love to hear about your projects or answer any questions in the comments below. Let’s keep building.
