
Build Multi-Modal Image Captioning with PyTorch: Vision Transformers and Language Models Tutorial

Learn to build a multi-modal image captioning system combining Vision Transformers and language models in PyTorch. Step-by-step guide with code examples.


Lately, I’ve been captivated by how AI interprets visual scenes and describes them in human language. This fascination led me to build an image captioning system that merges vision transformers with language models. I’ll share the journey of creating this multi-modal solution using PyTorch, from environment setup to deployment. If you’ve ever wondered how machines “see” and “describe” images, you’re in the right place.

Setting up our workspace is straightforward. We’ll use PyTorch with these essential packages:

# Install dependencies
!pip install torch torchvision transformers datasets pycocotools wandb

Our core imports establish the foundation:

import torch
from torch import nn
from transformers import ViTModel, GPT2Config, GPT2Tokenizer, GPT2LMHeadModel
import torchvision.transforms as T
from PIL import Image

For data, I chose MS COCO, which pairs each of its roughly 120k captioned images with five human-written captions. Here’s how we process an image sample:

# Image transformations
transform = T.Compose([
    T.Resize((224, 224)),
    T.ToTensor(),
    T.Normalize(mean=[0.485, 0.456, 0.406], 
                std=[0.229, 0.224, 0.225])
])

# Load and preprocess image
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
image = Image.open("coco_image.jpg").convert("RGB")
tensor_image = transform(image).unsqueeze(0).to(device)
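
Training needs a steady stream of image-caption pairs. Below is a minimal dataset sketch built on pycocotools; the image directory and annotation file paths are placeholders for wherever your local COCO download lives, and each caption annotation becomes its own sample:

# Minimal COCO caption dataset: yields (image tensor, raw caption string)
import os
from torch.utils.data import Dataset, DataLoader
from pycocotools.coco import COCO

class CocoCaptionDataset(Dataset):
    def __init__(self, image_dir, annotation_file, transform):
        self.coco = COCO(annotation_file)           # parses captions_*.json
        self.ann_ids = list(self.coco.anns.keys())  # one sample per caption
        self.image_dir = image_dir
        self.transform = transform

    def __len__(self):
        return len(self.ann_ids)

    def __getitem__(self, idx):
        ann = self.coco.anns[self.ann_ids[idx]]
        info = self.coco.loadImgs(ann['image_id'])[0]
        image = Image.open(os.path.join(self.image_dir, info['file_name'])).convert('RGB')
        return self.transform(image), ann['caption']

# Placeholder paths for a local COCO download
train_dataset = CocoCaptionDataset('train2017', 'annotations/captions_train2017.json', transform)
train_loader = DataLoader(train_dataset, batch_size=32, shuffle=True, num_workers=4)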

The vision encoder wraps a pre-trained Vision Transformer. Notice that we keep the full sequence of patch embeddings rather than a single pooled vector, so the decoder can later attend to individual image regions:

class VisionEncoder(nn.Module):
    def __init__(self):
        super().__init__()
        self.vit = ViTModel.from_pretrained('google/vit-base-patch16-224')
        self.projection = nn.Linear(768, 768)  # Match GPT-2's hidden size for cross-attention

    def forward(self, x):
        # Keep one embedding per image patch (plus the [CLS] token)
        features = self.vit(pixel_values=x).last_hidden_state
        return self.projection(features)

For language generation, we adapt GPT-2. The decoder needs padding, start, and end markers, so we register them as special tokens:

tokenizer = GPT2Tokenizer.from_pretrained('gpt2')
tokenizer.add_special_tokens({'pad_token': '[PAD]',
                              'bos_token': '<start>',
                              'eos_token': '<end>'})

# Encode sample caption
caption = "A dog playing in the park"
inputs = tokenizer(caption, return_tensors='pt', padding=True)
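
During training, each caption is framed by the start and end markers so the decoder learns where a description begins and stops. Building that target sequence is just a concatenation of token ids:

# Frame the caption with the start/end markers the decoder is trained on
bos = torch.tensor([[tokenizer.bos_token_id]])
eos = torch.tensor([[tokenizer.eos_token_id]])
target_ids = torch.cat([bos, inputs['input_ids'], eos], dim=1)
print(tokenizer.decode(target_ids[0]))  # decodes back to the framed caption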

How do we bridge these two distinct modalities? Through cross-attention layers in the decoder, which let text generation dynamically focus on relevant image regions. GPT-2 ships without cross-attention, so we enable it through its config when building the multi-modal module:

class MultiModalModel(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = VisionEncoder()
        # Enable cross-attention so the decoder can attend to the visual features
        decoder_config = GPT2Config.from_pretrained('gpt2', add_cross_attention=True)
        self.decoder = GPT2LMHeadModel.from_pretrained('gpt2', config=decoder_config)
        self.decoder.resize_token_embeddings(len(tokenizer))  # Account for the added special tokens

    def forward(self, images, captions):
        visual_features = self.encoder(images)
        outputs = self.decoder(input_ids=captions, encoder_hidden_states=visual_features)
        return outputs.logits
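
Before training anything, a quick sanity check with random tensors confirms the two halves line up; the shapes below mirror the preprocessing used earlier:

model = MultiModalModel().to(device)

# Dummy batch: 2 images and 2 tokenized captions
dummy_images = torch.randn(2, 3, 224, 224).to(device)
dummy_captions = tokenizer(["a dog in the park", "a cat on a sofa"],
                           return_tensors='pt', padding=True)['input_ids'].to(device)
print(model(dummy_images, dummy_captions).shape)  # (batch_size, caption_length, vocab_size)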

During training, we use cross-entropy loss with label smoothing, paired with AdamW and a one-cycle schedule that warms the learning rate up before annealing it:

criterion = nn.CrossEntropyLoss(ignore_index=tokenizer.pad_token_id, label_smoothing=0.1)
optimizer = torch.optim.AdamW(model.parameters(), lr=5e-5)

# Learning rate scheduler
scheduler = torch.optim.lr_scheduler.OneCycleLR(
    optimizer, 
    max_lr=5e-4,
    steps_per_epoch=len(train_loader),
    epochs=10
)
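
Putting the pieces together, a bare-bones training loop might look like the sketch below. It frames and tokenizes the raw captions coming out of the loader, then shifts them by one position so each step predicts the next token; treat it as a starting point rather than a tuned recipe:

model.train()
for epoch in range(10):
    for images, captions in train_loader:
        images = images.to(device)
        # Frame and tokenize the raw caption strings from the loader
        framed = [f"{tokenizer.bos_token} {c} {tokenizer.eos_token}" for c in captions]
        tokens = tokenizer(framed, return_tensors='pt', padding=True,
                           truncation=True, max_length=40)['input_ids'].to(device)

        logits = model(images, tokens)
        # Predict token t+1 from everything up to token t
        loss = criterion(logits[:, :-1].reshape(-1, logits.size(-1)),
                         tokens[:, 1:].reshape(-1))

        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
        scheduler.step()  # OneCycleLR steps once per batch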

Evaluation uses metrics like BLEU-4 and CIDEr. But do such scores capture whether captions truly match human perception? I found that qualitative inspection reveals nuances the metrics miss. For inference, beam search keeps several candidate captions in play and returns the highest-scoring one:

@torch.no_grad()
def generate_caption(image, beam_size=3, max_length=30):
    visual_features = model.encoder(image)
    # Each hypothesis is a (token sequence, cumulative log-probability) pair
    sequences = [(torch.tensor([tokenizer.bos_token_id], device=image.device), 0.0)]

    for _ in range(max_length):
        candidates = []
        for seq, score in sequences:
            if seq[-1].item() == tokenizer.eos_token_id:
                candidates.append((seq, score))  # finished hypothesis, keep as-is
                continue
            outputs = model.decoder(input_ids=seq.unsqueeze(0),
                                    encoder_hidden_states=visual_features)
            log_probs = outputs.logits[0, -1, :].log_softmax(dim=-1)
            top_log_probs, top_tokens = torch.topk(log_probs, beam_size)

            for log_prob, token in zip(top_log_probs, top_tokens):
                candidate = torch.cat([seq, token.unsqueeze(0)])
                candidates.append((candidate, score + log_prob.item()))

        # Keep the beam_size best hypotheses by cumulative log-probability
        candidates.sort(key=lambda x: x[1], reverse=True)
        sequences = candidates[:beam_size]

    return tokenizer.decode(sequences[0][0], skip_special_tokens=True)
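
For a quick quantitative check, corpus-level BLEU-4 can be computed with NLTK (an extra dependency beyond the packages installed earlier); CIDEr needs the separate pycocoevalcap toolkit. A sketch, assuming a val_loader built the same way as train_loader:

# Extra dependency: pip install nltk
from nltk.translate.bleu_score import corpus_bleu

references, hypotheses = [], []
for images, captions in val_loader:  # hypothetical validation loader
    for image, caption in zip(images, captions):
        generated = generate_caption(image.unsqueeze(0).to(device))
        hypotheses.append(generated.split())
        references.append([caption.split()])  # ideally all five reference captions per image

print(f"BLEU-4: {corpus_bleu(references, hypotheses):.3f}")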

For production, I export using TorchScript and optimize with ONNX Runtime. Quantization reduces model size by roughly 4x with minimal accuracy drop. The system now runs efficiently on mobile devices - imagine describing your surroundings in real time through your phone camera!
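
Here’s a minimal sketch of that packaging step under the setup above: dynamic quantization targets the Linear layers, which dominate both the ViT and GPT-2, and the vision encoder is traced to TorchScript while the decoder’s generation loop stays in Python. Depending on your transformers version, tracing may require loading the ViT with torchscript=True.

# Dynamic quantization: Linear weights stored as int8, activations quantized on the fly
quantized_model = torch.quantization.quantize_dynamic(
    model.cpu().eval(), {nn.Linear}, dtype=torch.qint8
)

# Trace the vision encoder to TorchScript for deployment
example_image = torch.randn(1, 3, 224, 224)
scripted_encoder = torch.jit.trace(model.encoder, example_image, strict=False)
scripted_encoder.save("vision_encoder.pt")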

This project transformed my understanding of how vision and language intersect in AI. What new applications can you envision with this technology? If you found this useful, give it a like and share it with others exploring multi-modal AI, and tell me in the comments about the challenges you’ve faced in similar projects.

Keywords: image captioning PyTorch, Vision Transformers tutorial, multi-modal machine learning, PyTorch image captioning system, Vision Transformers language models, deep learning computer vision, Transformer architecture PyTorch, image caption generation, multi-modal AI development, PyTorch neural networks


