Custom PyTorch Transformer for Text Classification: Implementing Multi-Head Attention from Scratch

Learn to build transformer-based text classification with custom attention mechanisms in PyTorch. Master multi-head attention, positional encoding & advanced training techniques for production-ready sentiment analysis models.

I’ve been working with text classification for years, but the rise of transformer architectures completely changed how we approach language tasks. When I first encountered the limitations of traditional models on complex sentiment analysis problems, I knew we needed a better solution. That’s what led me to explore custom transformer implementations - and today I’ll show you how to build one from scratch in PyTorch.

Transformers handle sequential data differently than RNNs or LSTMs. Instead of processing words in order, they examine relationships between all words simultaneously. This parallel processing makes them incredibly efficient. But how exactly do they understand context without sequential processing? The secret lies in attention mechanisms.

Let’s start by setting up our environment. We’ll need these essential libraries:

pip install torch torchvision torchaudio transformers datasets
pip install scikit-learn matplotlib seaborn
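
To keep the later snippets focused, I'll assume one shared set of imports for everything that follows:

import math
import re
from collections import Counter

import torch
import torch.nn as nn
import torch.nn.functional as F
import torch.optim as optim
from torch.utils.data import Dataset, DataLoader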

Now, consider this fundamental question: How do we prepare text data for a transformer? Unlike images, text requires careful tokenization and encoding. Here’s a dataset class I’ve found effective:

class TextClassificationDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_length=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_length = max_length
    
    def __len__(self):
        # Required by DataLoader to know how many examples we have
        return len(self.texts)
    
    def __getitem__(self, idx):
        # Tokenize one example, padding/truncating to a fixed length
        encoding = self.tokenizer(
            str(self.texts[idx]),
            truncation=True,
            padding='max_length',
            max_length=self.max_length,
            return_tensors='pt'
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(self.labels[idx], dtype=torch.long)
        }
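
As a quick illustration, here's how this dataset might be wired into a DataLoader. The pre-trained Hugging Face tokenizer and the two toy reviews are placeholders for your own data:

from transformers import AutoTokenizer

texts = ["A wonderful, moving film.", "A complete waste of two hours."]  # placeholder reviews
labels = [1, 0]                                                          # 1 = positive, 0 = negative

hf_tokenizer = AutoTokenizer.from_pretrained('bert-base-uncased')
dataset = TextClassificationDataset(texts, labels, hf_tokenizer, max_length=128)
train_loader = DataLoader(dataset, batch_size=32, shuffle=True)

batch = next(iter(train_loader))
print(batch['input_ids'].shape)  # torch.Size([2, 128]) for this toy data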

For our movie review sentiment analysis, we’ll use the IMDB dataset. But here’s something interesting: Did you know you can build your own tokenizer instead of relying on pre-trained ones? This gives finer control over vocabulary:

class SimpleTokenizer:
    def __init__(self, vocab_size=10000):
        self.vocab_size = vocab_size
        self.word2idx = {'<PAD>': 0, '<UNK>': 1}
        self.idx2word = {0: '<PAD>', 1: '<UNK>'}
    
    def build_vocab(self, texts):
        # Count word frequencies across the whole corpus
        word_freq = Counter()
        for text in texts:
            tokens = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower()).split()
            word_freq.update(tokens)
        
        # Keep the most frequent words, reserving two ids for <PAD> and <UNK>
        for word, _ in word_freq.most_common(self.vocab_size - 2):
            idx = len(self.word2idx)
            self.word2idx[word] = idx
            self.idx2word[idx] = word
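
To put the tokenizer to work on the IMDB reviews mentioned above, here's a quick sketch: load the data with the datasets library, build the vocabulary, and use a small encode helper of my own (not part of the original class) to map cleaned words to fixed-length id sequences:

from datasets import load_dataset

imdb = load_dataset('imdb')
train_texts = imdb['train']['text']
train_labels = imdb['train']['label']

tokenizer = SimpleTokenizer(vocab_size=10000)
tokenizer.build_vocab(train_texts)

def encode(tokenizer, text, max_length=256):
    # Mirror the cleaning used in build_vocab, then map words to ids
    tokens = re.sub(r'[^a-zA-Z0-9\s]', '', text.lower()).split()
    ids = [tokenizer.word2idx.get(tok, 1) for tok in tokens[:max_length]]  # 1 = <UNK>
    ids += [0] * (max_length - len(ids))                                   # 0 = <PAD>
    return torch.tensor(ids, dtype=torch.long)

input_ids = encode(tokenizer, train_texts[0])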

Now for an essential ingredient: positional encoding. Since transformers don't process tokens in order, we need to inject position information explicitly. This sinusoidal approach works remarkably well:

class PositionalEncoding(nn.Module):
    def __init__(self, d_model, max_seq_length=512):
        super().__init__()
        position = torch.arange(max_seq_length).unsqueeze(1)
        div_term = torch.exp(torch.arange(0, d_model, 2) * (-math.log(10000.0) / d_model))
        pe = torch.zeros(max_seq_length, d_model)
        pe[:, 0::2] = torch.sin(position * div_term)  # sine on even dimensions
        pe[:, 1::2] = torch.cos(position * div_term)  # cosine on odd dimensions
        # Buffer: saved with the model and moved across devices, but never trained
        self.register_buffer('pe', pe.unsqueeze(0))
    
    def forward(self, x):
        # x: (batch, seq_len, d_model) -- add the encoding for the first seq_len positions
        return x + self.pe[:, :x.size(1)]

But what makes transformers truly powerful is multi-head attention. It allows the model to focus on different aspects of the input simultaneously. Here’s a simplified implementation:

class MultiHeadAttention(nn.Module):
    def __init__(self, d_model, num_heads):
        super().__init__()
        assert d_model % num_heads == 0, "d_model must be divisible by num_heads"
        self.d_k = d_model // num_heads
        self.num_heads = num_heads
        self.W_q = nn.Linear(d_model, d_model)
        self.W_k = nn.Linear(d_model, d_model)
        self.W_v = nn.Linear(d_model, d_model)
        self.W_o = nn.Linear(d_model, d_model)
    
    def scaled_dot_product_attention(self, Q, K, V, mask=None):
        # Similarity of every query with every key, scaled to keep gradients stable
        attn_scores = torch.matmul(Q, K.transpose(-2, -1)) / math.sqrt(self.d_k)
        if mask is not None:
            # mask must broadcast to (batch, heads, seq_len, seq_len); padded positions get -1e9
            attn_scores = attn_scores.masked_fill(mask == 0, -1e9)
        attn_probs = F.softmax(attn_scores, dim=-1)
        return torch.matmul(attn_probs, V)
    
    def forward(self, x, mask=None):
        # Project the input, then split d_model into num_heads heads of size d_k
        Q = self.W_q(x).view(x.size(0), -1, self.num_heads, self.d_k).transpose(1, 2)
        K = self.W_k(x).view(x.size(0), -1, self.num_heads, self.d_k).transpose(1, 2)
        V = self.W_v(x).view(x.size(0), -1, self.num_heads, self.d_k).transpose(1, 2)
        
        attn_output = self.scaled_dot_product_attention(Q, K, V, mask)
        # Re-merge the heads and apply the output projection
        attn_output = attn_output.transpose(1, 2).contiguous().view(x.size(0), -1, self.num_heads * self.d_k)
        return self.W_o(attn_output)
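
The training loop below calls model(input_ids, attention_mask), but we haven't assembled a full model yet. Here is one minimal way the pieces above could fit together into a classifier; the encoder-layer layout, the mean-pooling over non-padded tokens, and the default hyperparameters are my own choices for illustration rather than a canonical architecture:

class TransformerEncoderLayer(nn.Module):
    def __init__(self, d_model, num_heads, d_ff=2048, dropout=0.1):
        super().__init__()
        self.attn = MultiHeadAttention(d_model, num_heads)
        self.ff = nn.Sequential(
            nn.Linear(d_model, d_ff),
            nn.ReLU(),
            nn.Linear(d_ff, d_model),
        )
        self.norm1 = nn.LayerNorm(d_model)
        self.norm2 = nn.LayerNorm(d_model)
        self.dropout = nn.Dropout(dropout)
    
    def forward(self, x, mask=None):
        # Residual connection around attention, then around the feed-forward block
        x = self.norm1(x + self.dropout(self.attn(x, mask)))
        x = self.norm2(x + self.dropout(self.ff(x)))
        return x


class TransformerClassifier(nn.Module):
    def __init__(self, vocab_size, d_model=256, num_heads=8, num_layers=4,
                 num_classes=2, max_seq_length=512):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, d_model, padding_idx=0)
        self.pos_encoding = PositionalEncoding(d_model, max_seq_length)
        self.layers = nn.ModuleList(
            TransformerEncoderLayer(d_model, num_heads) for _ in range(num_layers)
        )
        self.classifier = nn.Linear(d_model, num_classes)
    
    def forward(self, input_ids, attention_mask=None):
        x = self.pos_encoding(self.embedding(input_ids))
        # Reshape the padding mask so it broadcasts over heads and query positions
        mask = attention_mask[:, None, None, :] if attention_mask is not None else None
        for layer in self.layers:
            x = layer(x, mask)
        # Mean-pool over non-padded positions, then classify
        if attention_mask is not None:
            lengths = attention_mask.sum(dim=1, keepdim=True).clamp(min=1)
            pooled = (x * attention_mask.unsqueeze(-1)).sum(dim=1) / lengths
        else:
            pooled = x.mean(dim=1)
        return self.classifier(pooled)

With something like model = TransformerClassifier(vocab_size=len(tokenizer.word2idx)), the training loop below works as written.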

When training, I always use learning rate scheduling and gradient clipping - they significantly improve convergence. This training loop incorporates both:

device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)

optimizer = optim.Adam(model.parameters(), lr=1e-4)
scheduler = optim.lr_scheduler.StepLR(optimizer, step_size=5, gamma=0.1)

for epoch in range(epochs):
    model.train()
    for batch in train_loader:
        batch = {k: v.to(device) for k, v in batch.items()}
        optimizer.zero_grad()
        outputs = model(batch['input_ids'], batch['attention_mask'])
        loss = F.cross_entropy(outputs, batch['labels'])
        loss.backward()
        # Clip gradients to keep updates stable before stepping the optimizer
        torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
        optimizer.step()
    # Decay the learning rate by 10x every 5 epochs
    scheduler.step()

After training, visualizing attention weights reveals fascinating insights. We can see which words the model considers important for classification. For example, in negative reviews, words like “disappointing” or “waste” often receive high attention scores.
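
One lightweight way to expose these weights (my own tweak, not something the attention class above does) is to cache them inside scaled_dot_product_attention with a line like self.last_attn = attn_probs.detach(). With the sketch classifier above, inspecting a batch might then look like this:

sample_input, sample_mask = batch['input_ids'], batch['attention_mask']  # any held-out batch

model.eval()
with torch.no_grad():
    _ = model(sample_input, sample_mask)

# last_attn: (batch, heads, seq_len, seq_len) from the final encoder layer
attn = model.layers[-1].attn.last_attn
# Total attention each token receives in the first example, averaged over heads
token_importance = attn[0].mean(dim=0).sum(dim=0)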

The real test comes when we deploy. I convert models to TorchScript for production:

traced_model = torch.jit.trace(model, example_inputs=(sample_input, sample_mask))
torch.jit.save(traced_model, "transformer_classifier.pt")
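
Loading the traced model back in a serving process is then straightforward; here's a quick inference sketch:

loaded = torch.jit.load("transformer_classifier.pt")
loaded.eval()
with torch.no_grad():
    logits = loaded(sample_input, sample_mask)
    predictions = logits.argmax(dim=-1)  # 0 = negative, 1 = positive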

Custom transformers can outperform traditional recurrent models on tasks like this, but they require thoughtful implementation. What aspects would you tweak for your specific use case? The flexibility to modify attention mechanisms or add custom layers makes this architecture incredibly powerful.

If you found this walkthrough helpful, please share it with others who might benefit. Have questions or suggestions? Let’s discuss in the comments - I’d love to hear about your experiences with custom transformer implementations!

Keywords: transformer text classification, pytorch custom attention mechanism, multi-head attention implementation, text classification transformer, pytorch nlp tutorial, custom transformer architecture, sentiment analysis pytorch, attention mechanism nlp, transformer model training, pytorch text processing


