Build Multi-Modal Sentiment Analysis with PyTorch: Text-Image Fusion for Enhanced Opinion Mining Performance

Learn to build a multi-modal sentiment analysis system with PyTorch, combining text and image data using BERT and ResNet for enhanced opinion mining accuracy.

Ever wonder why a picture of a messy room with the caption “living my best life” makes you laugh? Or why a beautiful sunset photo paired with a sad quote feels so poignant? I was scrolling through my social feeds recently, caught in this exact puzzle. We naturally combine what we see with what we read to understand how someone truly feels. Yet, most AI sentiment tools only look at the words. What if we could teach a machine to do what we do—to read the room, and the text? That question led me straight into building a system that can do just that.

Let’s build a tool that can understand sentiment by looking at both text and images together. Think of a product review with a photo of a broken item, or a social media post with a sarcastic caption under a cheerful selfie. By combining these clues, we can get a much clearer picture of the true opinion being expressed.

First, we need to set the stage. We’ll use PyTorch as our foundation because of its flexibility. For understanding text, we’ll use a pre-trained model called BERT, which is excellent with language. For images, we’ll use a pre-trained CNN like ResNet, which is great at recognizing visual patterns. The real magic happens when we merge these two streams of information.

Here’s a basic look at how we start structuring our project. We define a configuration to keep all our settings in one place.

import torch
import torch.nn as nn
from transformers import AutoModel, AutoTokenizer

class ModelConfig:
    def __init__(self):
        # Text model setup
        self.text_model_name = 'bert-base-uncased'
        self.max_text_len = 128
        
        # Image model setup
        self.image_feature_size = 2048  # Output from ResNet
        
        # How we combine them
        self.fusion_output_size = 512
        self.num_classes = 3  # Positive, Neutral, Negative

config = ModelConfig()

Now, how do we actually feed data to this model? We need a custom dataset that loads a piece of text and its corresponding image at the same time.

from torch.utils.data import Dataset
from PIL import Image
import torchvision.transforms as T

class TextImageDataset(Dataset):
    def __init__(self, dataframe, tokenizer):
        self.data = dataframe
        self.tokenizer = tokenizer
        # Basic transformations for the image
        self.img_transform = T.Compose([
            T.Resize((224, 224)),
            T.ToTensor(),
            # Pre-trained ResNet expects ImageNet normalization
            T.Normalize(mean=[0.485, 0.456, 0.406],
                        std=[0.229, 0.224, 0.225]),
        ])

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        text = row['review_text']
        label = row['sentiment_label']  # e.g., 0, 1, 2
        
        # Process text
        inputs = self.tokenizer(text, padding='max_length', 
                                truncation=True, max_length=config.max_text_len, 
                                return_tensors='pt')
        
        # Process image
        img_path = row['image_path']
        image = Image.open(img_path).convert('RGB')
        image = self.img_transform(image)
        
        return {
            'input_ids': inputs['input_ids'].squeeze(),
            'attention_mask': inputs['attention_mask'].squeeze(),
            'image': image,
            'label': torch.tensor(label)
        }
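Feeding this dataset to a DataLoader gives us batches for free, because PyTorch's default collate function stacks matching dictionary keys into batch tensors. Here is a minimal sketch using a stand-in dataset that mimics TextImageDataset's output structure (the real class additionally needs a DataFrame, a tokenizer, and image files on disk):

```python
import torch
from torch.utils.data import Dataset, DataLoader

class DummyTextImageDataset(Dataset):
    """Stand-in that returns the same dict structure as TextImageDataset."""
    def __len__(self):
        return 8

    def __getitem__(self, idx):
        return {
            'input_ids': torch.randint(0, 30522, (128,)),        # token ids, max_text_len=128
            'attention_mask': torch.ones(128, dtype=torch.long),  # all tokens attended
            'image': torch.rand(3, 224, 224),                     # normalized image tensor
            'label': torch.tensor(idx % 3),                       # 0, 1, or 2
        }

loader = DataLoader(DummyTextImageDataset(), batch_size=4, shuffle=True)
batch = next(iter(loader))
print(batch['input_ids'].shape)   # torch.Size([4, 128])
print(batch['image'].shape)       # torch.Size([4, 3, 224, 224])
```

The default collate handles this automatically because every sample returns the same keys with same-shaped tensors; no custom collate_fn is needed.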

But here’s a crucial question: simply extracting features from text and images separately isn’t enough, is it? How do we make them talk to each other? One effective method is through a simple yet powerful fusion technique. We can concatenate the features and then run them through a neural network to find relationships.
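The concatenation idea itself fits in a few lines. A minimal sketch, using random tensors as stand-ins for the encoder outputs (768 dimensions for BERT's [CLS] vector, 2048 for ResNet-50's pooled features):

```python
import torch
import torch.nn as nn

batch_size = 4
text_feat = torch.randn(batch_size, 768)    # stand-in for BERT [CLS] features
img_feat = torch.randn(batch_size, 2048)    # stand-in for pooled ResNet-50 features

# Late fusion: concatenate along the feature dimension, then project
fusion = nn.Sequential(
    nn.Linear(768 + 2048, 512),
    nn.ReLU(),
    nn.Dropout(0.3),
)
combined = torch.cat((text_feat, img_feat), dim=1)   # shape (4, 2816)
fused = fusion(combined)                             # shape (4, 512)
print(fused.shape)   # torch.Size([4, 512])
```

The linear layer after concatenation is what lets the two modalities interact: every fused unit sees both text and image features, so the network can learn cross-modal patterns like "positive word plus gloomy image".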

This is the core of our model architecture. It has two encoders and a fusion module.

class MultiModalSentimentModel(nn.Module):
    def __init__(self, config):
        super().__init__()
        # Text Encoder
        self.text_encoder = AutoModel.from_pretrained(config.text_model_name)
        # Freeze early layers, fine-tune later ones (common practice)
        for param in self.text_encoder.parameters():
            param.requires_grad = False
        # Unfreeze the last few layers
        for param in self.text_encoder.encoder.layer[-2:].parameters():
            param.requires_grad = True
            
        # Image Encoder (using a pre-trained ResNet, removing its final layer)
        self.img_encoder = torch.hub.load('pytorch/vision:v0.10.0', 'resnet50', pretrained=True)
        self.img_encoder = nn.Sequential(*list(self.img_encoder.children())[:-1])
        # Freeze most of it, fine-tune the last residual stage
        for param in self.img_encoder.parameters():
            param.requires_grad = False
        # After dropping the FC head, index -1 is the parameter-free avgpool,
        # so unfreeze index -2, the final residual block (layer4)
        for param in self.img_encoder[-2].parameters():
            param.requires_grad = True
            
        # Fusion Layer
        self.fusion = nn.Sequential(
            nn.Linear(self.text_encoder.config.hidden_size + config.image_feature_size,
                      config.fusion_output_size),
            nn.ReLU(),
            nn.Dropout(0.3),
        )
        
        # Classifier Head
        self.classifier = nn.Linear(config.fusion_output_size, config.num_classes)

    def forward(self, input_ids, attention_mask, image):
        # Get text features
        text_outputs = self.text_encoder(input_ids=input_ids, attention_mask=attention_mask)
        # Use the [CLS] token's representation
        text_features = text_outputs.last_hidden_state[:, 0, :]
        
        # Get image features
        image_features = self.img_encoder(image)
        image_features = image_features.view(image_features.size(0), -1)
        
        # Combine them
        combined = torch.cat((text_features, image_features), dim=1)
        fused = self.fusion(combined)
        
        # Predict sentiment
        logits = self.classifier(fused)
        return logits

Training this model requires a good dataset. While large public datasets exist, you can start small: imagine collecting tweets with images and labeling each one's sentiment. The training loop itself is standard PyTorch; the only twist is that each step passes three inputs: token IDs, attention masks, and images.
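Here is a sketch of that loop, condensed to a single step. To keep it runnable without downloading BERT or ResNet weights, a tiny stand-in model with the same forward signature takes the place of MultiModalSentimentModel:

```python
import torch
import torch.nn as nn

# Tiny stand-in with the same forward signature as MultiModalSentimentModel,
# so the training step below runs without downloading pre-trained weights.
class TinyModel(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        self.txt_head = nn.Embedding(30522, 8)          # toy text encoder
        self.img_head = nn.Linear(3 * 224 * 224, 8)     # toy image encoder
        self.classifier = nn.Linear(16, num_classes)

    def forward(self, input_ids, attention_mask, image):
        # attention_mask is accepted for signature parity but unused here
        txt = self.txt_head(input_ids).mean(dim=1)            # (B, 8)
        img = self.img_head(image.flatten(1))                 # (B, 8)
        return self.classifier(torch.cat((txt, img), dim=1))  # (B, num_classes)

model = TinyModel()
optimizer = torch.optim.AdamW(model.parameters(), lr=2e-5)
criterion = nn.CrossEntropyLoss()

# One synthetic batch standing in for a DataLoader batch
batch = {
    'input_ids': torch.randint(0, 30522, (4, 128)),
    'attention_mask': torch.ones(4, 128, dtype=torch.long),
    'image': torch.rand(4, 3, 224, 224),
    'label': torch.tensor([0, 1, 2, 0]),
}

model.train()
optimizer.zero_grad()
logits = model(batch['input_ids'], batch['attention_mask'], batch['image'])
loss = criterion(logits, batch['label'])
loss.backward()
optimizer.step()
print(loss.item())  # a positive scalar
```

With the real model, only the construction line changes; the step itself (zero_grad, forward with three inputs, cross-entropy loss, backward, step) is identical.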

What’s fascinating is watching the model learn. At first, it might heavily rely on the text. But over time, it begins to weigh the image too. For instance, it might learn that the word “great” alongside a dark, blurry photo is often sarcasm, not a genuine positive statement.
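Once trained, reading off a prediction is just a softmax over the three logits followed by an argmax. A small sketch with dummy logits (the label order here follows the config comment, Positive/Neutral/Negative, and is an assumption):

```python
import torch

labels = ['positive', 'neutral', 'negative']  # hypothetical index-to-label mapping

# Dummy logits standing in for model output on a batch of 2 examples
logits = torch.tensor([[2.0, 0.5, -1.0],
                       [-0.5, 0.1, 3.0]])

probs = torch.softmax(logits, dim=1)   # each row sums to 1
preds = probs.argmax(dim=1)            # class index per example
print([labels[i] for i in preds])      # ['positive', 'negative']
```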

So, why does this matter? In a world overflowing with mixed-media content, a tool that understands the full context is incredibly powerful. It can help brands gauge genuine customer reactions, assist in moderating content more accurately, or even provide richer analytics for social research.

Building this was a reminder that the most meaningful understanding often lies at the intersection of different types of information. We process the world multi-modally, and now, our machines are starting to catch up.

Did you find this walk-through helpful? What kind of text and image combinations do you think would be most challenging for this system? Share your thoughts in the comments below—I’d love to hear your ideas and continue the conversation. If you enjoyed this article, please like and share it with others who might be curious about the future of AI and sentiment analysis.



