Have you ever scrolled through social media and felt that the text alone doesn’t capture the whole story? Picture a sarcastic caption paired with a joyful image, or a neutral review attached to a photo of a broken product. I kept running into this disconnect in my work with AI: relying solely on text for sentiment analysis felt like trying to understand a conversation by hearing only every other word. That’s what pushed me to explore how we can teach machines to see and read together, and it led to this project on building a multi-modal sentiment analysis system with PyTorch. Join me as I walk you through creating a system that understands emotion by combining text and images.
Think about it: when you feel happy, you might post a bright photo with an excited caption. Your words and your picture tell the same story. But what if someone writes “Great job” under a picture of a messy desk? The text seems positive, but the image hints at frustration. A model that only reads the text would get it wrong. So, how can we build an AI that considers both clues? The answer lies in multi-modal learning, where we process different types of data—like text and images—simultaneously to make a single, smarter prediction.
To get started, we need to set up our toolbox. I’ll be using PyTorch because of its flexibility, along with a few key libraries. Here’s a quick look at the essential imports to kick things off.
import torch
import torch.nn as nn
from torch.utils.data import Dataset, DataLoader
from torchvision import transforms, models
from transformers import BertTokenizer, BertModel
from PIL import Image
import pandas as pd
import numpy as np
# Let's set up our device to use GPU if available
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
print(f"Working on: {device}")
Data is the foundation of any good model. In the real world, you might collect posts from Twitter or product reviews with photos. For this guide, I’ll create a synthetic dataset to simulate this. It will have text samples and corresponding image paths labeled with sentiment—negative, neutral, or positive. This approach lets us focus on the model mechanics first.
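To make that concrete, here is a minimal sketch of what such a toy dataset could look like. The column names ('text', 'image_path', 'sentiment') and the sample sentences are placeholders I'm assuming for this walkthrough, with labels encoded as 0 = negative, 1 = neutral, 2 = positive.
# A tiny synthetic dataset: 0 = negative, 1 = neutral, 2 = positive
train_df = pd.DataFrame({
    'text': [
        "Absolutely loving this new phone, the camera is stunning!",
        "The package arrived on time, nothing special to report.",
        "Great job... the desk is a complete mess again.",
        "Best vacation ever, the sunsets were unreal.",
        "Broke after two days, really disappointed."
    ],
    'image_path': [f"images/sample_{i}.jpg" for i in range(5)],
    'sentiment': [2, 1, 0, 2, 0]
})
print(train_df.head())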
But here’s a question: how do we handle two completely different types of data in one system? Text comes as sequences of words, while images are grids of pixels. We need a way to bring them together. The first step is building a custom dataset class in PyTorch that can load and preprocess both modalities at once.
class MultiModalDataset(Dataset):
    def __init__(self, dataframe, tokenizer, max_length=128, image_size=(224, 224)):
        self.data = dataframe
        self.tokenizer = tokenizer
        self.max_length = max_length
        # Define basic image transformations (applied when loading real images from 'image_path')
        self.image_transform = transforms.Compose([
            transforms.Resize(image_size),
            transforms.ToTensor(),
            transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225])
        ])

    def __len__(self):
        return len(self.data)

    def __getitem__(self, idx):
        row = self.data.iloc[idx]
        text = row['text']
        # Tokenize the text
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_length,
            padding='max_length',
            truncation=True,
            return_tensors='pt'
        )
        # For images, in practice you'd load from 'image_path', e.g.:
        # image = self.image_transform(Image.open(row['image_path']).convert('RGB'))
        # Here's a synthetic image placeholder for demonstration
        if row['sentiment'] == 2:  # positive
            image = torch.randn(3, 224, 224) * 0.1 + 0.8  # Simulate a bright image
        elif row['sentiment'] == 1:  # neutral
            image = torch.randn(3, 224, 224) * 0.2 + 0.5  # Mid-tones
        else:  # negative
            image = torch.randn(3, 224, 224) * 0.1 + 0.2  # Darker tones
        sentiment = torch.tensor(row['sentiment'], dtype=torch.long)
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'image': image,
            'sentiment': sentiment
        }
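Before moving on, it's worth seeing how this class is meant to be wired up. Here's a quick sketch using the toy DataFrame from earlier; the batch size is an arbitrary choice for demonstration, not a tuned value.
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
train_dataset = MultiModalDataset(train_df, tokenizer)
train_loader = DataLoader(train_dataset, batch_size=4, shuffle=True)

# Peek at one batch to confirm the shapes line up
batch = next(iter(train_loader))
print(batch['input_ids'].shape)       # torch.Size([4, 128])
print(batch['attention_mask'].shape)  # torch.Size([4, 128])
print(batch['image'].shape)           # torch.Size([4, 3, 224, 224])
print(batch['sentiment'].shape)       # torch.Size([4])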
With our data ready, the next piece is the model itself. I chose to use BERT for understanding text because it captures context so well, and a pre-trained CNN like ResNet for images, which is great at spotting visual patterns. Now, the real magic happens when we merge these two streams of information. One common method is to simply concatenate the features from both models and pass them through a few neural network layers to make the final prediction.
Why not just average the results from two separate models? Because that misses the interaction between modalities. For instance, a picture of a sunset might make a vague text like “It’s over” feel more melancholic. By fusing features early, our model can learn these subtle connections.
class MultiModalSentimentModel(nn.Module):
    def __init__(self, num_classes=3):
        super().__init__()
        # Text branch using BERT
        self.text_model = BertModel.from_pretrained('bert-base-uncased')
        # Freeze BERT layers initially to speed up training
        for param in self.text_model.parameters():
            param.requires_grad = False
        # Image branch using ResNet (on older torchvision, use pretrained=True instead)
        self.image_model = models.resnet18(weights=models.ResNet18_Weights.DEFAULT)
        # Grab the feature size, then remove the classification layer
        num_features = self.image_model.fc.in_features
        self.image_model.fc = nn.Identity()
        # Fusion layers
        text_feature_size = 768  # BERT base output size
        image_feature_size = num_features  # ResNet18 feature size (512)
        combined_size = text_feature_size + image_feature_size
        self.classifier = nn.Sequential(
            nn.Linear(combined_size, 512),
            nn.ReLU(),
            nn.Dropout(0.3),
            nn.Linear(512, num_classes)
        )

    def forward(self, input_ids, attention_mask, image):
        # Process text
        text_outputs = self.text_model(input_ids=input_ids, attention_mask=attention_mask)
        text_features = text_outputs.pooler_output  # Use the pooled [CLS] output
        # Process image
        image_features = self.image_model(image)
        # Combine features by concatenation (feature-level fusion)
        combined = torch.cat((text_features, image_features), dim=1)
        # Predict sentiment
        logits = self.classifier(combined)
        return logits
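Before training, I find it helpful to run a quick sanity check: instantiate the model, push one batch from the loader we built earlier through it, and confirm the output shape. A minimal sketch, assuming the defaults above:
model = MultiModalSentimentModel(num_classes=3).to(device)

batch = next(iter(train_loader))
with torch.no_grad():
    logits = model(
        input_ids=batch['input_ids'].to(device),
        attention_mask=batch['attention_mask'].to(device),
        image=batch['image'].to(device)
    )
print(logits.shape)  # torch.Size([4, 3]) -- one score per sentiment class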
Training this model involves feeding it batches of text and images, comparing its predictions to the true labels with a cross-entropy loss, and adjusting the weights. I like to start by training only the newly added layers (in the code above, BERT is frozen; you can freeze the ResNet branch the same way), then gradually unfreeze parts of BERT and ResNet for fine-tuning. This staged approach helps prevent overfitting and keeps the early epochs fast. Have you considered what metrics to use? Accuracy is a start, but precision and recall per sentiment class give a clearer picture of performance.
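Here's a rough sketch of that first training stage. The optimizer, learning rate, and epoch count are arbitrary starting points I'm assuming for illustration, not tuned values.
criterion = nn.CrossEntropyLoss()
# Only update parameters that still require gradients (the frozen BERT branch is skipped)
trainable_params = [p for p in model.parameters() if p.requires_grad]
optimizer = torch.optim.Adam(trainable_params, lr=1e-4)

num_epochs = 3
for epoch in range(num_epochs):
    model.train()
    total_loss, correct, total = 0.0, 0, 0
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        images = batch['image'].to(device)
        labels = batch['sentiment'].to(device)

        optimizer.zero_grad()
        logits = model(input_ids=input_ids, attention_mask=attention_mask, image=images)
        loss = criterion(logits, labels)
        loss.backward()
        optimizer.step()

        total_loss += loss.item()
        correct += (logits.argmax(dim=1) == labels).sum().item()
        total += labels.size(0)

    print(f"Epoch {epoch + 1}: loss={total_loss / len(train_loader):.4f}, accuracy={correct / total:.2%}")
For per-class precision and recall, running the trained model over a held-out validation set and passing the predictions to sklearn.metrics.classification_report is a convenient option.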
Once trained, you can test the model on new data. Imagine feeding in a tweet with a photo and checking whether the model catches the true emotion. The improvement over a text-only model can be significant, sometimes on the order of 10-15% in accuracy, because the model is drawing on more clues; how much you gain depends on how much extra signal the images actually carry.
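Here's what a single prediction could look like in practice. The predict_sentiment helper below is a hypothetical convenience function I'm sketching for illustration; it reuses the tokenizer and device from earlier, and the synthetic "bright" tensor stands in for a real, transformed photo.
def predict_sentiment(text, image_tensor, model, tokenizer):
    """Return the predicted sentiment label for one text + image pair."""
    model.eval()
    encoding = tokenizer.encode_plus(
        text,
        add_special_tokens=True,
        max_length=128,
        padding='max_length',
        truncation=True,
        return_tensors='pt'
    )
    with torch.no_grad():
        logits = model(
            input_ids=encoding['input_ids'].to(device),
            attention_mask=encoding['attention_mask'].to(device),
            image=image_tensor.unsqueeze(0).to(device)  # add a batch dimension
        )
    labels = ['negative', 'neutral', 'positive']
    return labels[logits.argmax(dim=1).item()]

# Example: a synthetic bright image standing in for a real photo
fake_photo = torch.randn(3, 224, 224) * 0.1 + 0.8
print(predict_sentiment("Best day ever at the beach!", fake_photo, model, tokenizer))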
I hope this guide helps you see the power of combining text and images in AI. Building this system was a rewarding challenge that opened my eyes to how machines can better understand human expression. If you found this useful, try experimenting with different fusion strategies or adding audio for a three-modal approach! Please share your thoughts in the comments, and if you enjoyed this walkthrough, feel free to like and share it with others who might be interested. Let’s keep the conversation going on making AI more perceptive.