
Build Multi-Modal Image Captioning with PyTorch: Vision Transformers and Language Models Tutorial

Learn to build a multi-modal image captioning system combining Vision Transformers and language models in PyTorch. Step-by-step guide with code examples.


Lately, I’ve been captivated by how AI interprets visual scenes and describes them in human language. This fascination led me to build an image captioning system that merges vision transformers with language models. I’ll share the journey of creating this multi-modal solution using PyTorch, from environment setup to deployment. If you’ve ever wondered how machines “see” and “describe” images, you’re in the right place.

Setting up our workspace is straightforward. We’ll use PyTorch with these essential packages:

# Install dependencies
!pip install torch torchvision transformers datasets pycocotools wandb

Our core imports establish the foundation:

import torch
from torch import nn
from transformers import ViTModel, GPT2Config, GPT2Tokenizer, GPT2LMHeadModel
import torchvision.transforms as T
from PIL import Image

For data, I chose MS COCO, which pairs each of its roughly 120k captioned images with five human-written captions. Here’s how we process an image sample:

# Image transformations
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], 
                std=[0.229, 0.224, 0.225])
])

# Load and preprocess image
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
image = Image.open("coco_image.jpg").convert("RGB")
tensor_image = transform(image).unsqueeze(0).to(device)
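
Training needs a steady stream of image-caption pairs. Below is a minimal dataset sketch built on pycocotools; the image directory and annotation file paths are placeholders for wherever your local COCO download lives, and each caption annotation becomes its own sample:

# Minimal COCO caption dataset: yields (image tensor, raw caption string)
import os
from torch.utils.data import Dataset, DataLoader
from pycocotools.coco import COCO

class CocoCaptionDataset(Dataset):
    def __init__(self, image_dir, annotation_file, transform):
        self.coco = COCO(annotation_file)           # parses captions_*.json
        self.ann_ids = list(self.coco.anns.keys())  # one sample per caption
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.ann_ids)

    def __getitem__(self, idx):
        ann = self.coco.anns[self.ann_ids[idx]]
        info = self.coco.loadImgs(ann['image_id'])[0]
        image = Image.open(os.path.join(self.image_dir, info['file_name'])).convert('RGB')
        return self.transform(image), ann['caption']

# Placeholder paths for a local COCO download
train_dataset = CocoCaptionDataset('train2017', 'annotations/captions_train2017.json', transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)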

The vision encoder wraps a pre-trained Vision Transformer. Notice that we keep the full sequence of patch embeddings rather than a single pooled vector, so the decoder can later attend to individual image regions:

class VisionEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.vit = ViTModel.from_pretrained('google/vit-base-patch16-224')
        self.projection = nn.Linear(768, 768)  # Match GPT-2's hidden size for cross-attention

    def forward(self, x):
        # Keep one embedding per image patch (plus the [CLS] token)
        features = self.vit(pixel_values=x).last_hidden_state
        return self.projection(features)

For language generation, we adapt GPT-2. The decoder needs padding, start, and end markers, so we register them as special tokens:

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': '[PAD]',
                              'bos_token': '<start>',
                              'eos_token': '<end>'})

# Encode sample caption
caption = "A dog playing in the park"
inputs = tokenizer(caption, return_tensors='pt', padding=True)
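
During training, each caption is framed by the start and end markers so the decoder learns where a description begins and stops. Building that target sequence is just a concatenation of token ids:

# Frame the caption with the start/end markers the decoder is trained on
bos = torch.tensor([[tokenizer.bos_token_id]])
eos = torch.tensor([[tokenizer.eos_token_id]])
target_ids = torch.cat([bos, inputs['input_ids'], eos], dim=1)
print(tokenizer.decode(target_ids[0]))  # decodes back to the framed caption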

How do we bridge these two distinct modalities? Through cross-attention layers in the decoder, which let text generation dynamically focus on relevant image regions. GPT-2 ships without cross-attention, so we enable it through its config when building the multi-modal module:

class MultiModalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = VisionEncoder()
        # Enable cross-attention so the decoder can attend to the visual features
        decoder_config = GPT2Config.from_pretrained('gpt2', add_cross_attention=True)
        self.decoder = GPT2LMHeadModel.from_pretrained('gpt2', config=decoder_config)
        self.decoder.resize_token_embeddings(len(tokenizer))  # Account for the added special tokens

    def forward(self, images, captions):
        visual_features = self.encoder(images)
        outputs = self.decoder(input_ids=captions, encoder_hidden_states=visual_features)
        return outputs.logits
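
Before training anything, a quick sanity check with random tensors confirms the two halves line up; the shapes below mirror the preprocessing used earlier:

model = MultiModalModel().to(device)

# Dummy batch: 2 images and 2 tokenized captions
dummy_images = torch.randn(2, 3, 224, 224).to(device)
dummy_captions = tokenizer(["a dog in the park", "a cat on a sofa"],
                           return_tensors='pt', padding=True)['input_ids'].to(device)
print(model(dummy_images, dummy_captions).shape)  # (batch_size, caption_length, vocab_size)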

During training, we use cross-entropy loss with label smoothing, paired with AdamW and a one-cycle schedule that warms the learning rate up before annealing it:

criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id, label_smoothing=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, 
    max_lr=5e-4,
    steps_per_epoch=len(train_loader),
    epochs=10
)
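
Putting the pieces together, a bare-bones training loop might look like the sketch below. It frames and tokenizes the raw captions coming out of the loader, then shifts them by one position so each step predicts the next token; treat it as a starting point rather than a tuned recipe:

model.train()
for epoch in range(10):
    for images, captions in train_loader:
        images = images.to(device)
        # Frame and tokenize the raw caption strings from the loader
        framed = [f"{tokenizer.bos_token} {c} {tokenizer.eos_token}" for c in captions]
        tokens = tokenizer(framed, return_tensors='pt', padding=True,
                           truncation=True, max_length=40)['input_ids'].to(device)

        logits = model(images, tokens)
        # Predict token t+1 from everything up to token t
        loss = criterion(logits[:, :-1].reshape(-1, logits.size(-1)),
                         tokens[:, 1:].reshape(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR steps once per batch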

Evaluation uses metrics like BLEU-4 and CIDEr. But do such scores capture whether captions truly match human perception? I found that qualitative inspection reveals nuances the metrics miss. For inference, beam search keeps several candidate captions in play and returns the highest-scoring one:

@torch.no_grad()
def generate_caption(image, beam_size=3, max_length=30):
    visual_features = model.encoder(image)
    # Each hypothesis is a (token sequence, cumulative log-probability) pair
    sequences = [(torch.tensor([tokenizer.bos_token_id], device=image.device), 0.0)]

    for _ in range(max_length):
        candidates = []
        for seq, score in sequences:
            if seq[-1].item() == tokenizer.eos_token_id:
                candidates.append((seq, score))  # finished hypothesis, keep as-is
                continue
            outputs = model.decoder(input_ids=seq.unsqueeze(0),
                                    encoder_hidden_states=visual_features)
            log_probs = outputs.logits[0, -1, :].log_softmax(dim=-1)
            top_log_probs, top_tokens = torch.topk(log_probs, beam_size)

            for log_prob, token in zip(top_log_probs, top_tokens):
                candidate = torch.cat([seq, token.unsqueeze(0)])
                candidates.append((candidate, score + log_prob.item()))

        # Keep the beam_size best hypotheses by cumulative log-probability
        candidates.sort(key=lambda x: x[1], reverse=True)
        sequences = candidates[:beam_size]

    return tokenizer.decode(sequences[0][0], skip_special_tokens=True)
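
For a quick quantitative check, corpus-level BLEU-4 can be computed with NLTK (an extra dependency beyond the packages installed earlier); CIDEr needs the separate pycocoevalcap toolkit. A sketch, assuming a val_loader built the same way as train_loader:

# Extra dependency: pip install nltk
from nltk.translate.bleu_score import corpus_bleu

references, hypotheses = [], []
for images, captions in val_loader:  # hypothetical validation loader
    for image, caption in zip(images, captions):
        generated = generate_caption(image.unsqueeze(0).to(device))
        hypotheses.append(generated.split())
        references.append([caption.split()])  # ideally all five reference captions per image

print(f"BLEU-4: {corpus_bleu(references, hypotheses):.3f}")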

For production, I export using TorchScript and optimize with ONNX Runtime. Quantization reduces model size by roughly 4x with minimal accuracy drop. The system now runs efficiently on mobile devices - imagine describing your surroundings in real time through your phone camera!
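
Here’s a minimal sketch of that packaging step under the setup above: dynamic quantization targets the Linear layers, which dominate both the ViT and GPT-2, and the vision encoder is traced to TorchScript while the decoder’s generation loop stays in Python. Depending on your transformers version, tracing may require loading the ViT with torchscript=True.

# Dynamic quantization: Linear weights stored as int8, activations quantized on the fly
quantized_model = torch.quantization.quantize_dynamic(
    model.cpu().eval(), {nn.Linear}, dtype=torch.qint8
)

# Trace the vision encoder to TorchScript for deployment
example_image = torch.randn(1, 3, 224, 224)
scripted_encoder = torch.jit.trace(model.encoder, example_image, strict=False)
scripted_encoder.save("vision_encoder.pt")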

This project transformed my understanding of how vision and language intersect in AI. What new applications can you envision with this technology? If you found this useful, give it a like and share it with others exploring multi-modal AI, and tell me in the comments about the challenges you’ve faced in similar projects.

Keywords: image captioning PyTorch, Vision Transformers tutorial, multi-modal machine learning, PyTorch image captioning system, Vision Transformers language models, deep learning computer vision, Transformer architecture PyTorch, image caption generation, multi-modal AI development, PyTorch neural networks


