Have you ever scrolled through social media and felt that the text alone doesn’t capture the whole story? Picture a sarcastic caption paired with a joyful image, or a neutral review attached to a photo of a broken product. I kept running into this disconnect in my work with AI: relying solely on text for sentiment analysis felt like trying to understand a conversation by hearing only every other word. That’s what pushed me to explore how we can teach machines to see and read together, and it led to this project on building a multi-modal sentiment analysis system with PyTorch. Join me as I walk you through creating a system that understands emotion by combining text and images.
Think about it: when you feel happy, you might post a bright photo with an excited caption. Your words and your picture tell the same story. But what if someone writes “Great job” under a picture of a messy desk? The text seems positive, but the image hints at frustration. A model that only reads the text would get it wrong. So, how can we build an AI that considers both clues? The answer lies in multi-modal learning, where we process different types of data—like text and images—simultaneously to make a single, smarter prediction.
To get started, we need to set up our toolbox. I’ll be using PyTorch because of its flexibility, along with a few key libraries. Here’s a quick look at the essential imports to kick things off.
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, models
from transformers import BertTokenizer, BertModel
from PIL import Image
import pandas as pd
import numpy as np
# Let's set up our device to use GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Working on: {device}")
Data is the foundation of any good model. In the real world, you might collect posts from Twitter or product reviews with photos. For this guide, I’ll create a synthetic dataset to simulate this. It will have text samples and corresponding image paths labeled with sentiment—negative, neutral, or positive. This approach lets us focus on the model mechanics first.
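To make that concrete, here is a minimal sketch of what such a toy dataset could look like. The column names ('text', 'image_path', 'sentiment') and the sample sentences are placeholders I'm assuming for this walkthrough, with labels encoded as 0 = negative, 1 = neutral, 2 = positive.
# A tiny synthetic dataset: 0 = negative, 1 = neutral, 2 = positive
train_df = pd.DataFrame({
    'text': [
        "Absolutely loving this new phone, the camera is stunning!",
        "The package arrived on time, nothing special to report.",
        "Great job... the desk is a complete mess again.",
        "Best vacation ever, the sunsets were unreal.",
        "Broke after two days, really disappointed."
    ],
    'image_path': [f"images/sample_{i}.jpg" for i in range(5)],
    'sentiment': [2, 1, 0, 2, 0]
})
print(train_df.head())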
But here’s a question: how do we handle two completely different types of data in one system? Text comes as sequences of words, while images are grids of pixels. We need a way to bring them together. The first step is building a custom dataset class in PyTorch that can load and preprocess both modalities at once.
class MultiModalDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length=128, image_size=(224, 224)):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        # Define basic image transformations (applied when loading real images from 'image_path')
        self.image_transform = transforms.Compose([
            transforms.Resize(image_size),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        text = row['text']
        # Tokenize the text
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        # For images, in practice you'd load from 'image_path', e.g.:
        # image = self.image_transform(Image.open(row['image_path']).convert('RGB'))
        # Here's a synthetic image placeholder for demonstration
        if row['sentiment'] == 2:  # positive
            image = torch.randn(3, 224, 224) * 0.1 + 0.8  # Simulate a bright image
        elif row['sentiment'] == 1:  # neutral
            image = torch.randn(3, 224, 224) * 0.2 + 0.5  # Mid-tones
        else:  # negative
            image = torch.randn(3, 224, 224) * 0.1 + 0.2  # Darker tones
        sentiment = torch.tensor(row['sentiment'], dtype=torch.long)
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'image': image,
            'sentiment': sentiment
        }
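Before moving on, it's worth seeing how this class is meant to be wired up. Here's a quick sketch using the toy DataFrame from earlier; the batch size is an arbitrary choice for demonstration, not a tuned value.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_dataset = MultiModalDataset(train_df, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)

# Peek at one batch to confirm the shapes line up
batch = next(iter(train_loader))
print(batch['input_ids'].shape)       # torch.Size([4, 128])
print(batch['attention_mask'].shape)  # torch.Size([4, 128])
print(batch['image'].shape)           # torch.Size([4, 3, 224, 224])
print(batch['sentiment'].shape)       # torch.Size([4])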
With our data ready, the next piece is the model itself. I chose to use BERT for understanding text because it captures context so well, and a pre-trained CNN like ResNet for images, which is great at spotting visual patterns. Now, the real magic happens when we merge these two streams of information. One common method is to simply concatenate the features from both models and pass them through a few neural network layers to make the final prediction.
Why not just average the results from two separate models? Because that misses the interaction between modalities. For instance, a picture of a sunset might make a vague text like “It’s over” feel more melancholic. By fusing features early, our model can learn these subtle connections.
class MultiModalSentimentModel(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        # Text branch using BERT
        self.text_model = BertModel.from_pretrained('bert-base-uncased')
        # Freeze BERT layers initially to speed up training
        for param in self.text_model.parameters():
            param.requires_grad = False
        # Image branch using ResNet (on older torchvision, use pretrained=True instead)
        self.image_model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Grab the feature size, then remove the classification layer
        num_features = self.image_model.fc.in_features
        self.image_model.fc = nn.Identity()
        # Fusion layers
        text_feature_size = 768  # BERT base output size
        image_feature_size = num_features  # ResNet18 feature size (512)
        combined_size = text_feature_size + image_feature_size
        self.classifier = nn.Sequential(
            nn.Linear(combined_size, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def forward(self, input_ids, attention_mask, image):
        # Process text
        text_outputs = self.text_model(input_ids=input_ids, attention_mask=attention_mask)
        text_features = text_outputs.pooler_output  # Use the pooled [CLS] output
        # Process image
        image_features = self.image_model(image)
        # Combine features by concatenation (feature-level fusion)
        combined = torch.cat((text_features, image_features), dim=1)
        # Predict sentiment
        logits = self.classifier(combined)
        return logits
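Before training, I find it helpful to run a quick sanity check: instantiate the model, push one batch from the loader we built earlier through it, and confirm the output shape. A minimal sketch, assuming the defaults above:
model = MultiModalSentimentModel(num_classes=3).to(device)

batch = next(iter(train_loader))
with torch.no_grad():
    logits = model(
        input_ids=batch['input_ids'].to(device),
        attention_mask=batch['attention_mask'].to(device),
        image=batch['image'].to(device)
    )
print(logits.shape)  # torch.Size([4, 3]) -- one score per sentiment class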
Training this model involves feeding it batches of text and images, comparing its predictions to the true labels with a cross-entropy loss, and adjusting the weights. I like to start by training only the newly added layers (in the code above, BERT is frozen; you can freeze the ResNet branch the same way), then gradually unfreeze parts of BERT and ResNet for fine-tuning. This staged approach helps prevent overfitting and keeps the early epochs fast. Have you considered what metrics to use? Accuracy is a start, but precision and recall per sentiment class give a clearer picture of performance.
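Here's a rough sketch of that first training stage. The optimizer, learning rate, and epoch count are arbitrary starting points I'm assuming for illustration, not tuned values.
criterion = nn.CrossEntropyLoss()
# Only update parameters that still require gradients (the frozen BERT branch is skipped)
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params, lr=1e-4)

num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    total_loss, correct, total = 0.0, 0, 0
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        images = batch['image'].to(device)
        labels = batch['sentiment'].to(device)

        optimizer.zero_grad()
        logits = model(input_ids=input_ids, attention_mask=attention_mask, image=images)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        correct += (logits.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)

    print(f"Epoch {epoch + 1}: loss={total_loss / len(train_loader):.4f}, accuracy={correct / total:.2%}")
For per-class precision and recall, running the trained model over a held-out validation set and passing the predictions to sklearn.metrics.classification_report is a convenient option.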
Once trained, you can test the model on new data. Imagine feeding in a tweet with a photo and checking whether the model catches the true emotion. The improvement over a text-only model can be significant, sometimes on the order of 10-15% in accuracy, because the model is drawing on more clues; how much you gain depends on how much extra signal the images actually carry.
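Here's what a single prediction could look like in practice. The predict_sentiment helper below is a hypothetical convenience function I'm sketching for illustration; it reuses the tokenizer and device from earlier, and the synthetic "bright" tensor stands in for a real, transformed photo.
def predict_sentiment(text, image_tensor, model, tokenizer):
    """Return the predicted sentiment label for one text + image pair."""
    model.eval()
    encoding = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    with torch.no_grad():
        logits = model(
            input_ids=encoding['input_ids'].to(device),
            attention_mask=encoding['attention_mask'].to(device),
            image=image_tensor.unsqueeze(0).to(device)  # add a batch dimension
        )
    labels = ['negative', 'neutral', 'positive']
    return labels[logits.argmax(dim=1).item()]

# Example: a synthetic bright image standing in for a real photo
fake_photo = torch.randn(3, 224, 224) * 0.1 + 0.8
print(predict_sentiment("Best day ever at the beach!", fake_photo, model, tokenizer))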
I hope this guide helps you see the power of combining text and images in AI. Building this system was a rewarding challenge that opened my eyes to how machines can better understand human expression. If you found this useful, try experimenting with different fusion strategies or adding audio for a three-modal approach! Please share your thoughts in the comments, and if you enjoyed this walkthrough, feel free to like and share it with others who might be interested. Let’s keep the conversation going on making AI more perceptive.