
Build Multi-Modal Image Captioning System with CLIP and GPT-2 in PyTorch: Complete Tutorial

Learn to build an advanced multi-modal image captioning system using CLIP and GPT-2 with PyTorch. Complete tutorial with code, architecture design, and deployment tips.


I was looking at a photograph the other day, a simple picture of a dog in a park. My immediate thought was, “A golden retriever runs through green grass under a blue sky.” It struck me how effortless this translation from pixel to prose is for us, yet it remains a monumental task for machines. That gap between what we see and what we describe is the puzzle I wanted to solve. I decided to build a system that could bridge that gap, to teach a computer to look at a picture and speak about it. The result is a project combining two powerful models: CLIP, which understands images in the context of language, and GPT-2, which crafts coherent sentences. Let me show you how it works.

Think of it as a creative partnership. CLIP acts as the keen-eyed observer. You show it an image, and it doesn’t just see shapes and colors; it understands concepts. It knows the image contains a “dog,” “grass,” and “sunny weather” because it learned from millions of image-text pairs. We need to extract this understanding. First, we set up our tools.

import torch
import clip
from PIL import Image

# Load CLIP model and its image preprocessor
device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

# Prepare an image
image = Image.open("dog_park.jpg")
image_input = preprocess(image).unsqueeze(0).to(device)

# CLIP gives us a dense vector representing the image's meaning
with torch.no_grad():
    image_features = model.encode_image(image_input)
print(f"Image features shape: {image_features.shape}")  # Output: torch.Size([1, 512])

That 512-dimensional vector is the distilled essence of the photograph. But how do we turn this silent understanding into a flowing description? That’s where GPT-2, the storyteller, enters. It excels at generating text, but it needs a nudge, a starting point informed by the image. We can’t just hand it the raw vector; we need a translator.

We design a small adapter network. Its job is to convert CLIP’s vision-based features into a format GPT-2 can use as its initial context. This is the crucial link in our system. What do you think is the most effective way to connect these two different languages of vision and text?

import torch.nn as nn
from transformers import GPT2LMHeadModel, GPT2Tokenizer

class FeatureProjector(nn.Module):
    def __init__(self, clip_dim=512, gpt_dim=768):
        super().__init__()
        # A small two-layer MLP that maps CLIP's 512-dim features into GPT-2's 768-dim embedding space
        self.projection = nn.Sequential(
            nn.Linear(clip_dim, gpt_dim),
            nn.LayerNorm(gpt_dim),
            nn.GELU(),
            nn.Linear(gpt_dim, gpt_dim)
        )

    def forward(self, clip_features):
        return self.projection(clip_features)

# Initialize our models
tokenizer = GPT2Tokenizer.from_pretrained("gpt2")
tokenizer.pad_token = tokenizer.eos_token  # GPT-2 has no pad token by default, so reuse EOS
gpt_model = GPT2LMHeadModel.from_pretrained("gpt2").to(device)
projector = FeatureProjector().to(device)

# Project CLIP's features into GPT-2's embedding space
# (cast to float32 first, since CLIP returns float16 features on GPU)
prompt_features = projector(image_features.float())

Now we have a visual prompt in a shape GPT-2 can understand. The final step is generation. We seed GPT-2 with a start token and, in the full system, the projected features to guide its output, then let it predict the next word again and again. This is causal language modeling.

def generate_caption(image_features, model, tokenizer, max_length=30):
    # Start with a beginning-of-sentence token on the same device as the model
    start_token = tokenizer.encode(tokenizer.bos_token, return_tensors="pt").to(model.device)

    # Generate text, using the image features to influence the output
    outputs = model.generate(
        inputs=start_token,
        max_length=max_length,
        do_sample=True,        # Allows for creative, non-deterministic output
        top_p=0.95,            # Uses nucleus sampling for better coherence
        temperature=0.9,       # Controls randomness
        pad_token_id=tokenizer.pad_token_id,
        # In the full model, the projected image features are prepended to the token
        # embeddings as a prefix (via inputs_embeds), which is how they steer the output.
    )

    caption = tokenizer.decode(outputs[0], skip_special_tokens=True)
    return caption

# For illustration, let's simulate a caption generation step.
caption = generate_caption(prompt_features, gpt_model, tokenizer)
print(f"Generated Caption: {caption}")

The true magic happens during training. We show the system thousands of image-caption pairs, like those from the COCO dataset. We feed it the image together with the human-written caption and ask GPT-2 to predict each word of that caption in turn; the cross-entropy between its predictions and the real words is the loss, a signal we use to adjust the projector's weights. Over time, the projector learns to create prefixes that lead GPT-2 to produce more accurate and descriptive sentences.
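
Here is a minimal sketch of what one such training step could look like, with my own simplifications: a single image-caption pair instead of a batched COCO DataLoader, a one-token prefix, and a hypothetical training_step helper. The essential moves are prepending the projected feature as a prefix embedding, masking that position out of the loss with the label -100, and letting the optimizer update only the projector.

# A minimal sketch of one training step (single image-caption pair; a real
# setup would batch pairs from COCO with a DataLoader).
optimizer = torch.optim.AdamW(projector.parameters(), lr=1e-4)

def training_step(image_input, caption_text):
    # 1. Encode the image with frozen CLIP
    with torch.no_grad():
        feats = model.encode_image(image_input).float()

    # 2. Project into GPT-2's embedding space and treat it as a one-token prefix
    prefix = projector(feats).unsqueeze(1)                    # [1, 1, 768]

    # 3. Embed the reference caption tokens
    tokens = tokenizer(caption_text, return_tensors="pt").input_ids.to(device)
    token_embeds = gpt_model.get_input_embeddings()(tokens)   # [1, T, 768]

    # 4. Prepend the prefix; mask its position out of the loss with label -100
    inputs_embeds = torch.cat([prefix, token_embeds], dim=1)
    labels = torch.cat(
        [torch.full((tokens.size(0), 1), -100, device=device), tokens], dim=1
    )

    # 5. Cross-entropy over the caption tokens; the optimizer updates only the projector
    loss = gpt_model(inputs_embeds=inputs_embeds, labels=labels).loss
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()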

This approach has trade-offs. We’re using two large, pre-trained models, which is efficient but requires careful tuning to get them to collaborate effectively. The adapter layer, while simple, is the key component we train. Is this system creating new knowledge, or is it cleverly recombining what it has already seen in its training data?
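
In code, that division of labor is just a freeze: both pretrained experts keep their weights fixed, and only the small projector learns. A quick sketch to confirm how little of the system is actually trainable:

# Freeze the two pretrained experts; only the small projector is trainable.
for p in model.parameters():        # CLIP
    p.requires_grad = False
for p in gpt_model.parameters():    # GPT-2
    p.requires_grad = False

trainable = sum(p.numel() for p in projector.parameters())
total = trainable + sum(p.numel() for p in model.parameters()) \
        + sum(p.numel() for p in gpt_model.parameters())
print(f"Trainable parameters: {trainable:,} of {total:,}")  # the projector is well under 1%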

Building this was a lesson in connection. It’s about taking two experts—one in sight, one in speech—and facilitating a conversation between them. The process from loading an image to printing a caption involves several steps, each reliant on the last. When it works, it feels like introducing two friends who discover they have a lot to talk about.

The potential is vast. Imagine assistive technology that narrates the visual world, content creation tools, or enhanced search engines that understand photos as well as text. This project is a starting point, a demonstration that these separate modes of intelligence can be combined to create something new and useful.

I hope walking through this process has been insightful. It’s a fascinating area where creativity meets engineering. If you enjoyed this look under the hood, feel free to share your thoughts or your own experiments in the comments below. Let’s keep the conversation going.

Keywords: multi-modal image captioning, CLIP GPT-2 PyTorch, image captioning system, computer vision NLP fusion, CLIP visual features extraction, GPT-2 text generation, PyTorch deep learning tutorial, OpenAI CLIP integration, multi-modal architecture design, image to text generation


