Custom PyTorch Transformer for Text Classification: Implementing Multi-Head Attention from Scratch

Learn to build transformer-based text classification with custom attention mechanisms in PyTorch. Master multi-head attention, positional encoding & advanced training techniques for production-ready sentiment analysis models.

I’ve been working with text classification for years, but the rise of transformer architectures completely changed how we approach language tasks. When I first encountered the limitations of traditional models on complex sentiment analysis problems, I knew we needed a better solution. That’s what led me to explore custom transformer implementations - and today I’ll show you how to build one from scratch in PyTorch.

Transformers handle sequential data differently than RNNs or LSTMs. Instead of processing words in order, they examine relationships between all words simultaneously. This parallel processing makes them incredibly efficient. But how exactly do they understand context without sequential processing? The secret lies in attention mechanisms.

Let’s start by setting up our environment. We’ll need these essential libraries:

pip install torch torchvision torchaudio transformers datasets
pip install scikit-learn matplotlib seaborn
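
To keep the later snippets focused, I'll assume one shared set of imports for everything that follows:

import math
import re
from collections import Counter

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader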

Now, consider this fundamental question: How do we prepare text data for a transformer? Unlike images, text requires careful tokenization and encoding. Here’s a dataset class I’ve found effective:

class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        # Required by DataLoader to know how many examples we have
        return len(self.texts)
    
    def __getitem__(self, idx):
        # Tokenize one example, padding/truncating to a fixed length
        encoding = self.tokenizer(
            str(self.texts[idx]),
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }
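
As a quick illustration, here's how this dataset might be wired into a DataLoader. The pre-trained Hugging Face tokenizer and the two toy reviews are placeholders for your own data:

from transformers import AutoTokenizer

texts = ["A wonderful, moving film.", "A complete waste of two hours."]  # placeholder reviews
labels = [1, 0]                                                          # 1 = positive, 0 = negative

hf_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
dataset = TextClassificationDataset(texts, labels, hf_tokenizer, max_length=128)
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

batch = next(iter(train_loader))
print(batch['input_ids'].shape)  # torch.Size([2, 128]) for this toy data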

For our movie review sentiment analysis, we’ll use the IMDB dataset. But here’s something interesting: Did you know you can build your own tokenizer instead of relying on pre-trained ones? This gives finer control over vocabulary:

class SimpleTokenizer:
    def __init__(self, vocab_size=10000):
        self.vocab_size = vocab_size
        self.word2idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx2word = {0: '<PAD>', 1: '<UNK>'}
    
    def build_vocab(self, texts):
        # Count word frequencies across the whole corpus
        word_freq = Counter()
        for text in texts:
            tokens = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower()).split()
            word_freq.update(tokens)
        
        # Keep the most frequent words, reserving two ids for <PAD> and <UNK>
        for word, _ in word_freq.most_common(self.vocab_size - 2):
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word
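
To put the tokenizer to work on the IMDB reviews mentioned above, here's a quick sketch: load the data with the datasets library, build the vocabulary, and use a small encode helper of my own (not part of the original class) to map cleaned words to fixed-length id sequences:

from datasets import load_dataset

imdb = load_dataset('imdb')
train_texts = imdb['train']['text']
train_labels = imdb['train']['label']

tokenizer = SimpleTokenizer(vocab_size=10000)
tokenizer.build_vocab(train_texts)

def encode(tokenizer, text, max_length=256):
    # Mirror the cleaning used in build_vocab, then map words to ids
    tokens = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower()).split()
    ids = [tokenizer.word2idx.get(tok, 1) for tok in tokens[:max_length]]  # 1 = <UNK>
    ids += [0] * (max_length - len(ids))                                   # 0 = <PAD>
    return torch.tensor(ids, dtype=torch.long)

input_ids = encode(tokenizer, train_texts[0])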

Now for an essential ingredient: positional encoding. Since transformers don't process tokens in order, we need to inject position information explicitly. This sinusoidal approach works remarkably well:

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length=512):
        super().__init__()
        position = torch.arange(max_seq_length).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_seq_length, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # sine on even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # cosine on odd dimensions
        # Buffer: saved with the model and moved across devices, but never trained
        self.register_buffer('pe', pe.unsqueeze(0))
    
    def forward(self, x):
        # x: (batch, seq_len, d_model) -- add the encoding for the first seq_len positions
        return x + self.pe[:, :x.size(1)]

But what makes transformers truly powerful is multi-head attention. It allows the model to focus on different aspects of the input simultaneously. Here’s a simplified implementation:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Similarity of every query with every key, scaled to keep gradients stable
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # mask must broadcast to (batch, heads, seq_len, seq_len); padded positions get -1e9
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = F.softmax(attn_scores, dim=-1)
        return torch.matmul(attn_probs, V)
    
    def forward(self, x, mask=None):
        # Project the input, then split d_model into num_heads heads of size d_k
        Q = self.W_q(x).view(x.size(0), -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(x.size(0), -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(x.size(0), -1, self.num_heads, self.d_k).transpose(1, 2)
        
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        # Re-merge the heads and apply the output projection
        attn_output = attn_output.transpose(1, 2).contiguous().view(x.size(0), -1, self.num_heads * self.d_k)
        return self.W_o(attn_output)
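
The training loop below calls model(input_ids, attention_mask), but we haven't assembled a full model yet. Here is one minimal way the pieces above could fit together into a classifier; the encoder-layer layout, the mean-pooling over non-padded tokens, and the default hyperparameters are my own choices for illustration rather than a canonical architecture:

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Residual connection around attention, then around the feed-forward block
        x = self.norm1(x + self.dropout(self.attn(x, mask)))
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x


class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model=256, num_heads=8, num_layers=4,
                 num_classes=2, max_seq_length=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_length)
        self.layers = nn.ModuleList(
            TransformerEncoderLayer(d_model, num_heads) for _ in range(num_layers)
        )
        self.classifier = nn.Linear(d_model, num_classes)
    
    def forward(self, input_ids, attention_mask=None):
        x = self.pos_encoding(self.embedding(input_ids))
        # Reshape the padding mask so it broadcasts over heads and query positions
        mask = attention_mask[:, None, None, :] if attention_mask is not None else None
        for layer in self.layers:
            x = layer(x, mask)
        # Mean-pool over non-padded positions, then classify
        if attention_mask is not None:
            lengths = attention_mask.sum(dim=1, keepdim=True).clamp(min=1)
            pooled = (x * attention_mask.unsqueeze(-1)).sum(dim=1) / lengths
        else:
            pooled = x.mean(dim=1)
        return self.classifier(pooled)

With something like model = TransformerClassifier(vocab_size=len(tokenizer.word2idx)), the training loop below works as written.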

When training, I always use learning rate scheduling and gradient clipping - they significantly improve convergence. This training loop incorporates both:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(epochs):
    model.train()
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        outputs = model(batch['input_ids'], batch['attention_mask'])
        loss = F.cross_entropy(outputs, batch['labels'])
        loss.backward()
        # Clip gradients to keep updates stable before stepping the optimizer
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    # Decay the learning rate by 10x every 5 epochs
    scheduler.step()

After training, visualizing attention weights reveals fascinating insights. We can see which words the model considers important for classification. For example, in negative reviews, words like “disappointing” or “waste” often receive high attention scores.
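
One lightweight way to expose these weights (my own tweak, not something the attention class above does) is to cache them inside scaled_dot_product_attention with a line like self.last_attn = attn_probs.detach(). With the sketch classifier above, inspecting a batch might then look like this:

sample_input, sample_mask = batch['input_ids'], batch['attention_mask']  # any held-out batch

model.eval()
with torch.no_grad():
    _ = model(sample_input, sample_mask)

# last_attn: (batch, heads, seq_len, seq_len) from the final encoder layer
attn = model.layers[-1].attn.last_attn
# Total attention each token receives in the first example, averaged over heads
token_importance = attn[0].mean(dim=0).sum(dim=0)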

The real test comes when we deploy. I convert models to TorchScript for production:

traced_model = torch.jit.trace(model, example_inputs=(sample_input, sample_mask))
torch.jit.save(traced_model, "transformer_classifier.pt")
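
Loading the traced model back in a serving process is then straightforward; here's a quick inference sketch:

loaded = torch.jit.load("transformer_classifier.pt")
loaded.eval()
with torch.no_grad():
    logits = loaded(sample_input, sample_mask)
    predictions = logits.argmax(dim=-1)  # 0 = negative, 1 = positive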

Custom transformers can outperform traditional recurrent models on tasks like this, but they require thoughtful implementation. What aspects would you tweak for your specific use case? The flexibility to modify attention mechanisms or add custom layers makes this architecture incredibly powerful.

If you found this walkthrough helpful, please share it with others who might benefit. Have questions or suggestions? Let’s discuss in the comments - I’d love to hear about your experiences with custom transformer implementations!

Keywords: transformer text classification, pytorch custom attention mechanism, multi-head attention implementation, text classification transformer, pytorch nlp tutorial, custom transformer architecture, sentiment analysis pytorch, attention mechanism nlp, transformer model training, pytorch text processing


