Building Multi-Modal Sentiment Analysis with BERT-CNN Fusion in PyTorch: Complete Implementation Guide

Learn to build a multi-modal sentiment analysis system combining BERT and CNN fusion in PyTorch. Complete guide with code examples and deployment tips.

Aug 29, 2025

I’ve been thinking about how we express emotions online lately. It’s never just text or just an image—it’s the combination that tells the real story. That’s why I decided to build a system that understands both. Let me show you how we can create something that reads between the lines, and between the pixels.

Have you ever wondered how machines can understand the emotional context of a social media post that mixes text and images? The challenge is teaching them to see the connections we naturally make.

We start by preparing our environment. Here’s what we need:

import torch
import torch.nn as nn
from transformers import BertTokenizer, BertModel
from torchvision.models import resnet50

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')

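The text processing uses BERT to capture linguistic nuances. Why BERT? Because it produces contextual embeddings: each token’s representation depends on the words around it, which bag-of-words features and static word vectors cannot capture.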

text_input = tokenizer(
    "This sunset is breathtaking!",
    padding='max_length',
    max_length=128,
    truncation=True,
    return_tensors="pt",
)
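The tokenizer only produces input IDs and an attention mask; BERT still has to turn them into a single vector per post before fusion. That step isn't shown here, so what follows is a minimal sketch of one common option, masked mean pooling over the last hidden state. The helper name `masked_mean_pool` is my own for illustration, and the random tensor stands in for the `last_hidden_state` that `BertModel` would return.

```python
import torch


def masked_mean_pool(last_hidden_state, attention_mask):
    """Average token embeddings, ignoring padding positions.

    last_hidden_state: (batch, seq_len, hidden) float tensor
    attention_mask:    (batch, seq_len) tensor of 0/1 values
    """
    mask = attention_mask.unsqueeze(-1).to(last_hidden_state.dtype)  # (batch, seq_len, 1)
    summed = (last_hidden_state * mask).sum(dim=1)                   # (batch, hidden)
    counts = mask.sum(dim=1).clamp(min=1.0)                          # avoid divide-by-zero
    return summed / counts


# Stand-in for BERT output: 2 posts, 128 tokens, 768 hidden units
hidden = torch.randn(2, 128, 768)
mask = torch.zeros(2, 128, dtype=torch.long)
mask[0, :6] = 1   # first post has 6 real tokens
mask[1, :10] = 1  # second post has 10 real tokens

text_features = masked_mean_pool(hidden, mask)
print(text_features.shape)  # torch.Size([2, 768])
```

Taking the [CLS] vector (`last_hidden_state[:, 0]`) is an equally reasonable choice; either way the result is one 768-dimensional vector per post for bert-base-uncased.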

For images, we use a CNN backbone. ResNet50 gives us rich visual features without starting from scratch.

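# weights="IMAGENET1K_V1" replaces the deprecated pretrained=True argument
image_model = resnet50(weights="IMAGENET1K_V1")
# Drop the final fc layer; the output is (batch, 2048, 1, 1) and needs flattening before fusion
image_model = nn.Sequential(*list(image_model.children())[:-1])
image_model = image_model.to(device).eval()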

Now comes the interesting part: fusion. How do we combine text and image features meaningfully? We use attention mechanisms that learn which features matter most.

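class FusionLayer(nn.Module):
    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512):
        super().__init__()
        # Project both modalities into a shared hidden space
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # batch_first=True so attention inputs are (batch, tokens, hidden_dim)
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads=8, batch_first=True)

    def forward(self, text_features, image_features):
        # Expects pooled features: text (batch, text_dim), image (batch, image_dim)
        text_proj = self.text_proj(text_features)
        image_proj = self.image_proj(image_features)
        # Stack as a two-token sequence; concatenating along dim=1 would give
        # (batch, 2 * hidden_dim), which does not match the attention embed_dim
        combined = torch.stack([text_proj, image_proj], dim=1)
        attended, _ = self.attention(combined, combined, combined)
        return attended  # (batch, 2, hidden_dim)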
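The fusion layer outputs attended features, but a classifier still has to map them to sentiment labels. Below is a minimal end-to-end sketch, assuming pooled (batch, dim) features from each modality and three sentiment classes. The class name `TwoTokenFusionClassifier`, the `num_classes` value, and the two-token formulation are my assumptions for illustration, and random tensors stand in for the BERT and ResNet outputs.

```python
import torch
import torch.nn as nn


class TwoTokenFusionClassifier(nn.Module):
    """Hypothetical head: project, attend across modalities, classify."""

    def __init__(self, text_dim=768, image_dim=2048, hidden_dim=512,
                 num_heads=8, num_classes=3):
        super().__init__()
        self.text_proj = nn.Linear(text_dim, hidden_dim)
        self.image_proj = nn.Linear(image_dim, hidden_dim)
        # batch_first=True so inputs are (batch, tokens, hidden)
        self.attention = nn.MultiheadAttention(hidden_dim, num_heads, batch_first=True)
        self.classifier = nn.Linear(hidden_dim, num_classes)

    def forward(self, text_features, image_features):
        # Treat each modality as one token in a length-2 sequence
        tokens = torch.stack(
            [self.text_proj(text_features), self.image_proj(image_features)], dim=1
        )                                           # (batch, 2, hidden)
        attended, weights = self.attention(tokens, tokens, tokens)
        fused = attended.mean(dim=1)                # (batch, hidden)
        return self.classifier(fused), weights      # logits, (batch, 2, 2)


model = TwoTokenFusionClassifier().eval()
text_features = torch.randn(4, 768)    # stand-in for pooled BERT output
image_features = torch.randn(4, 2048)  # stand-in for flattened ResNet output
logits, attn_weights = model(text_features, image_features)
print(logits.shape, attn_weights.shape)  # torch.Size([4, 3]) torch.Size([4, 2, 2])
```

The returned attention weights are averaged over heads, and each row shows how much one modality's token attends to the text versus the image token, which is a convenient thing to inspect on conflicting posts.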

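Training requires careful balancing. We don’t want one modality dominating the other, which can happen when one branch fits the labels faster and the fused model learns to lean on it alone. The loss function should give both streams comparable weight.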
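One way to keep both streams in play is to add auxiliary unimodal losses next to the fused loss. The sketch below assumes the model exposes extra per-modality classifier heads producing `text_logits` and `image_logits`; those heads, the function name, and the `aux_weight` value are hypothetical and not part of the code shown so far.

```python
import torch
import torch.nn.functional as F


def balanced_multimodal_loss(fused_logits, text_logits, image_logits,
                             labels, aux_weight=0.5):
    """Cross-entropy on the fused head plus equally weighted unimodal terms."""
    fused_loss = F.cross_entropy(fused_logits, labels)
    text_loss = F.cross_entropy(text_logits, labels)
    image_loss = F.cross_entropy(image_logits, labels)
    # Both auxiliary terms share one weight, so neither stream is favored
    return fused_loss + aux_weight * (text_loss + image_loss)


torch.manual_seed(0)
labels = torch.tensor([0, 2, 1, 1])
fused_logits = torch.randn(4, 3, requires_grad=True)
text_logits = torch.randn(4, 3, requires_grad=True)
image_logits = torch.randn(4, 3, requires_grad=True)

loss = balanced_multimodal_loss(fused_logits, text_logits, image_logits, labels)
loss.backward()  # gradients reach all three heads
print(loss.item() > 0)  # True
```

Randomly dropping one modality during training is another option people use for the same goal; which works better likely depends on the dataset.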

What happens when the text says one thing but the image suggests another? Our system learns to weigh the evidence.

During evaluation, we measure more than just accuracy. We check how the model performs on different sentiment intensities and modality combinations.

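def evaluate_model(model, dataloader, device):
    model.eval()
    total_correct = 0
    total_samples = 0

    with torch.no_grad():
        for texts, images, labels in dataloader:
            # Move the batch to the model's device; texts is assumed to be a
            # dict of tokenizer tensors (input_ids, attention_mask, ...)
            texts = {key: value.to(device) for key, value in texts.items()}
            images, labels = images.to(device), labels.to(device)
            # Forward pass, then pick the highest-scoring class per sample
            outputs = model(texts, images)
            predictions = torch.argmax(outputs, dim=1)
            total_correct += (predictions == labels).sum().item()
            total_samples += labels.size(0)

    # Guard against an empty dataloader
    return total_correct / max(total_samples, 1)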
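Overall accuracy hides how the model does on specific slices, such as posts where text and image agree versus conflict. Here is a small sketch of per-group accuracy, assuming each sample carries a group tag; the `accuracy_by_group` helper and the "agree"/"conflict" tags are made up for illustration.

```python
from collections import defaultdict

import torch


def accuracy_by_group(predictions, labels, groups):
    """Return {group_name: accuracy} for parallel predictions, labels, groups."""
    correct = defaultdict(int)
    total = defaultdict(int)
    for pred, label, group in zip(predictions.tolist(), labels.tolist(), groups):
        total[group] += 1
        correct[group] += int(pred == label)
    return {group: correct[group] / total[group] for group in total}


predictions = torch.tensor([2, 0, 1, 2, 0, 1])
labels = torch.tensor([2, 0, 0, 2, 1, 1])
# Hypothetical per-sample tags: do text and image agree or conflict?
groups = ["agree", "agree", "conflict", "agree", "conflict", "conflict"]

result = accuracy_by_group(predictions, labels, groups)
print(result)  # {'agree': 1.0, 'conflict': 0.3333333333333333}
```

The same helper works for sentiment-intensity buckets if you tag samples that way instead.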

The real test comes with ambiguous cases. A picture of a crowded beach with the text “Perfect solitude” creates interesting tension for the model.

Deployment considerations include optimizing for speed and memory. We can quantize the model and use ONNX for production readiness.
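As a rough illustration of the quantization step, PyTorch's dynamic quantization can convert `nn.Linear` layers to int8 for CPU inference. The small `head` below is only a stand-in for the real fusion model, and whether the accuracy trade-off is acceptable would need to be measured on your own data.

```python
import torch
import torch.nn as nn
from torch.ao.quantization import quantize_dynamic

# Stand-in for the fusion/classification head; the real model would go here
head = nn.Sequential(
    nn.Linear(512, 256),
    nn.ReLU(),
    nn.Linear(256, 3),
).eval()

# Replace nn.Linear layers with int8 dynamically quantized versions (CPU only)
quantized_head = quantize_dynamic(head, {nn.Linear}, dtype=torch.qint8)

fused = torch.randn(4, 512)  # stand-in for fused features
with torch.no_grad():
    logits = quantized_head(fused)
print(logits.shape)  # torch.Size([4, 3])
```

ONNX export would go through `torch.onnx.export`; I've left it out here because it needs extra dependencies installed.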

I’ve found that the most rewarding part is seeing the system correctly identify sarcasm and irony—those moments when text and image deliberately contradict each other.

What applications can you imagine for this technology? Customer service analysis, content moderation, or perhaps mental health monitoring?

Building this system taught me that human communication is wonderfully complex. Machines are getting better at understanding that complexity, but there’s always more to learn.

If you found this exploration helpful, I’d appreciate if you could share it with others who might benefit. Feel free to leave comments about your experiences with multi-modal systems—I read and respond to every one.
