Ever looked at a mountain of customer reviews and wished a computer could sort them for you? Or scanned endless support tickets, hoping to spot the urgent ones faster? That’s where text classification comes in. It’s the engine behind spam filters, sentiment trackers, and content moderators. For years, getting good at this meant teaching machines very rigid rules. Then, BERT changed the game. Let’s build something real together—a custom text classifier you can adapt to your own projects. Think of this as your practical workshop.
I remember first training simpler models. They’d often get confused by sentences like “This movie is so bad it’s good.” The context was everything. When BERT arrived, with its ability to understand words from both sides, it felt like the right tool for the job. Why do we fine-tune it instead of training from scratch? Imagine being handed a library’s worth of language knowledge; you only need to teach it your specific cataloging system.
First, we set the stage. You’ll need PyTorch and the Hugging Face transformers library. These are the core tools.
# Installation
!pip install torch transformers datasets pandas scikit-learn
Let’s talk data. A model is only as good as what it learns from. We’ll use a classic: movie reviews labeled as positive or negative. Clean data matters. We’ll remove HTML tags and extra spaces.
import pandas as pd
from datasets import load_dataset
# Load the dataset
dataset = load_dataset('imdb')
df_train = pd.DataFrame(dataset['train'])
df_test = pd.DataFrame(dataset['test'])
# A quick peek
print(df_train['text'][0][:200]) # First review snippet
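If you peek at a few reviews, you’ll spot stray HTML line breaks and uneven spacing. Here’s a minimal cleaning pass, a sketch that only handles HTML tags and extra whitespace—the clean_text helper is something we introduce here, not part of the dataset:

import re

def clean_text(text):
    text = re.sub(r'<[^>]+>', ' ', text)   # strip HTML tags such as <br />
    text = re.sub(r'\s+', ' ', text)       # collapse runs of whitespace
    return text.strip()

df_train['text'] = df_train['text'].apply(clean_text)
df_test['text'] = df_test['text'].apply(clean_text)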
With the data loaded and cleaned, our next job is to prepare it for BERT. This involves a tokenizer, which breaks text into pieces BERT understands and adds special tokens. Have you considered how a single word can change a sentence’s meaning?
from transformers import BertTokenizer
tokenizer = BertTokenizer.from_pretrained('bert-base-uncased')
# See tokenization in action
sample_text = "A captivating, flawed masterpiece."
tokens = tokenizer.tokenize(sample_text)
print(tokens)
# Output: ['a', 'cap', '##tivat', '##ing', ',', 'flawed', 'masterpiece', '.']
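The split pieces are only half the story: tokenize() on its own does not add the special [CLS] and [SEP] tokens BERT expects around every sequence. A quick way to see them is to encode the text and convert the IDs back (the exact pieces may differ slightly across tokenizer versions):

# Encoding adds [CLS] at the start and [SEP] at the end
ids = tokenizer.encode(sample_text)
print(tokenizer.convert_ids_to_tokens(ids))
# Expect something like: ['[CLS]', 'a', 'cap', '##tivat', '##ing', ',', 'flawed', 'masterpiece', '.', '[SEP]']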
Notice how “captivating” is split? This is BERT’s WordPiece tokenization handling complex vocabulary, and the [CLS]/[SEP] pair marks where every sequence begins and ends. Next, we build a PyTorch Dataset to serve our data efficiently.
import torch
from torch.utils.data import Dataset
class ReviewDataset(Dataset):
    def __init__(self, texts, labels, tokenizer, max_len=512):
        self.texts = texts
        self.labels = labels
        self.tokenizer = tokenizer
        self.max_len = max_len

    def __len__(self):
        return len(self.texts)

    def __getitem__(self, idx):
        text = str(self.texts[idx])
        label = self.labels[idx]
        encoding = self.tokenizer.encode_plus(
            text,
            add_special_tokens=True,
            max_length=self.max_len,
            return_token_type_ids=False,
            padding='max_length',
            truncation=True,
            return_attention_mask=True,
            return_tensors='pt',
        )
        return {
            'input_ids': encoding['input_ids'].flatten(),
            'attention_mask': encoding['attention_mask'].flatten(),
            'labels': torch.tensor(label, dtype=torch.long)
        }
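Before wiring this into training, it helps to confirm that each item comes back as fixed-length tensors. This is just a throwaway spot check on a few rows; the names are ours:

# Spot check: every item should be a dict of fixed-length tensors
sample_ds = ReviewDataset(df_train['text'][:4].tolist(), df_train['label'][:4].tolist(), tokenizer)
item = sample_ds[0]
print(item['input_ids'].shape, item['attention_mask'].shape, item['labels'])
# Expect torch.Size([512]) for input_ids and attention_mask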
The model itself builds on a pre-trained BERT base. We add a simple classifier layer on top. This is the fine-tuning part.
from transformers import BertModel
import torch.nn as nn
class BertTextClassifier(nn.Module):
    def __init__(self, n_classes=2):
        super(BertTextClassifier, self).__init__()
        self.bert = BertModel.from_pretrained('bert-base-uncased')
        self.drop = nn.Dropout(p=0.3)
        self.out = nn.Linear(self.bert.config.hidden_size, n_classes)

    def forward(self, input_ids, attention_mask):
        _, pooled_output = self.bert(
            input_ids=input_ids,
            attention_mask=attention_mask,
            return_dict=False
        )
        output = self.drop(pooled_output)
        return self.out(output)
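If you’d like to confirm the wiring before committing to a full training run, a single forward pass on one encoded review should come back as a [1, 2] logits tensor. This is optional and just a sketch; the clf instance below is thrown away afterwards:

# Optional sanity check: a single forward pass
clf = BertTextClassifier()
enc = tokenizer(sample_text, padding='max_length', truncation=True, max_length=512, return_tensors='pt')
with torch.no_grad():
    logits = clf(input_ids=enc['input_ids'], attention_mask=enc['attention_mask'])
print(logits.shape)  # Expect torch.Size([1, 2])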
Training is where the magic happens. We use an optimizer designed for transformers and a standard loss function. How long do you think it takes for the model to start recognizing patterns?
from torch.optim import AdamW  # use PyTorch's AdamW; the copy in transformers is deprecated
from torch.utils.data import DataLoader
# Setup
model = BertTextClassifier()
model.train()
device = torch.device('cuda' if torch.cuda.is_available() else 'cpu')
model.to(device)
# DataLoader
train_dataset = ReviewDataset(df_train['text'].tolist(), df_train['label'].tolist(), tokenizer)
train_loader = DataLoader(train_dataset, batch_size=16, shuffle=True)
# Optimizer
optimizer = AdamW(model.parameters(), lr=2e-5)
loss_fn = nn.CrossEntropyLoss().to(device)
# Training loop
for epoch in range(3):  # Small number for demonstration
    for batch in train_loader:
        input_ids = batch['input_ids'].to(device)
        attention_mask = batch['attention_mask'].to(device)
        labels = batch['labels'].to(device)

        outputs = model(input_ids=input_ids, attention_mask=attention_mask)
        loss = loss_fn(outputs, labels)

        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    print(f'Epoch {epoch + 1} completed.')
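One common refinement: BERT fine-tuning recipes usually pair AdamW with a linear learning-rate decay and a short warmup. It isn’t required for this demo, but if training looks unstable you can add a scheduler next to the optimizer and call scheduler.step() right after optimizer.step() inside the loop. A sketch, assuming the same three epochs as above:

from transformers import get_linear_schedule_with_warmup

total_steps = len(train_loader) * 3  # batches per epoch * number of epochs
scheduler = get_linear_schedule_with_warmup(
    optimizer,
    num_warmup_steps=0,
    num_training_steps=total_steps
)
# Then, inside the inner loop: scheduler.step() immediately after optimizer.step()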
Finally, we should check its work. Evaluation tells us if our fine-tuning was effective.
from sklearn.metrics import accuracy_score, classification_report
def evaluate(model, data_loader, device):
    model.eval()
    predictions, true_labels = [], []
    with torch.no_grad():
        for batch in data_loader:
            input_ids = batch['input_ids'].to(device)
            attention_mask = batch['attention_mask'].to(device)
            labels = batch['labels'].to(device)

            outputs = model(input_ids=input_ids, attention_mask=attention_mask)
            _, preds = torch.max(outputs, dim=1)
            predictions.extend(preds.cpu().tolist())
            true_labels.extend(labels.cpu().tolist())

    print(f'Accuracy: {accuracy_score(true_labels, predictions):.4f}')
    print(classification_report(true_labels, predictions))
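The function isn’t called yet, so here is one way to run it on the held-out test split; the batch size mirrors training and is otherwise arbitrary:

# Evaluate on the test split
test_dataset = ReviewDataset(df_test['text'].tolist(), df_test['label'].tolist(), tokenizer)
test_loader = DataLoader(test_dataset, batch_size=16, shuffle=False)
evaluate(model, test_loader, device)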
And there you have it. You’ve just built a custom text classifier. This isn’t just about movie reviews: you can adapt the same pipeline to analyze product feedback, sort support emails, or filter content. The framework is yours to modify. What problem will you solve with it? I encourage you to take this code, run it, break it, and rebuild it for your own data. The real learning starts when you apply it. If this guide helped you connect the pieces, please share it with others who might be on a similar path. Feel free to comment below with your results or questions, and let’s keep the conversation going.