SimCLR Explained: Build Powerful Vision Models Without Labeled Data
Learn how SimCLR uses contrastive learning to train vision models on unlabeled images, cut labeling costs, and boost results with fewer labels.
I’ve spent countless hours labeling datasets. It’s tedious, expensive, and frankly, the biggest blocker to doing anything interesting in machine learning. What if we could teach a model to see the world without needing those labels first? That’s the promise I want to explore with you today. This isn’t just theory; it’s a practical path to building powerful models with the data you already have. Let’s get started.
Think about how you recognize a cat. You don’t need someone to point at ten thousand cats and say “cat.” You see one from different angles, in different lights, sometimes just an ear peeking from under a couch. Your brain learns the idea of “cat-ness” by comparing and contrasting. This is the simple, brilliant idea behind contrastive learning: teach a model by showing it what’s similar and what’s different.
SimCLR puts this idea into a clear, repeatable recipe. You take an image and create two altered versions of it. These are a “positive pair.” The model’s job is to learn that these two different-looking images are, at their core, the same. Meanwhile, it must learn that these images are not the same as altered versions of other images in your batch. The magic isn’t in the model architecture, but in how you create those altered views.
Why are the alterations so important? If you only changed images slightly, the model would learn trivial shortcuts, like matching exact color patterns. The alterations must be strong enough to force the model to look past surface-level noise and find the true, underlying subject. This is where we design the curriculum.
Let’s look at the code that creates this curriculum. We’ll build a transformation that makes two unique, randomly altered views from one input image.
import torch
from torchvision import transforms

class SimCLRTransform:
    """Creates two strongly augmented views for contrastive learning."""

    def __init__(self, size=96):
        # This is the core augmentation stack.
        self.transform = transforms.Compose([
            transforms.RandomResizedCrop(size=size, scale=(0.08, 1.0)),
            transforms.RandomHorizontalFlip(p=0.5),
            transforms.RandomApply([transforms.ColorJitter(0.8, 0.8, 0.8, 0.2)], p=0.8),
            transforms.RandomGrayscale(p=0.2),
            transforms.GaussianBlur(kernel_size=9, sigma=(0.1, 2.0)),
            transforms.ToTensor(),
        ])

    def __call__(self, x):
        # Apply the randomized transform twice, independently.
        return self.transform(x), self.transform(x)
Notice the RandomResizedCrop with a very wide scale. An image could be cropped to just 8% of its original area. The model might only see a tire, but it must still know it’s looking at the same car as in the other view, which might show the hood. This forces robust feature learning.
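To see the pipeline in action, here's a minimal sketch of feeding these paired views to a DataLoader. It assumes torchvision's STL-10 unlabeled split, chosen only because its 96x96 images match the default size above; swap in any unlabeled image dataset you have.

from torch.utils.data import DataLoader
from torchvision import datasets

# STL-10's unlabeled split is just an example; any unlabeled image set works.
unlabeled_data = datasets.STL10(root="./data", split="unlabeled", download=True,
                                transform=SimCLRTransform(size=96))
loader = DataLoader(unlabeled_data, batch_size=256, shuffle=True, drop_last=True)

for (view1, view2), _ in loader:
    # Each batch yields two independently augmented tensors of the same images.
    break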
With our data pipeline ready, we need a way to tell the model if it’s doing a good job. We need a loss function that measures similarity. This is where the NT-Xent loss comes in: for each view, its augmented partner is the single positive example, and every other view in the batch counts as a negative.
How do we measure “similarity” for the model? We use cosine similarity. Imagine the model’s output as an arrow in space. If two arrows point in the same direction, they are similar. The loss function then tries to pull the arrows for our positive pair closer together while pushing all other arrows farther away.
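To make that concrete, here's a tiny, self-contained example with made-up vectors, using PyTorch's built-in cosine similarity:

import torch
import torch.nn.functional as F

a = torch.tensor([1.0, 0.0])
b = torch.tensor([0.9, 0.1])  # points almost the same way as a
c = torch.tensor([0.0, 1.0])  # perpendicular to a

print(F.cosine_similarity(a, b, dim=0))  # ~0.99: nearly the same direction
print(F.cosine_similarity(a, c, dim=0))  # 0.0: unrelated directions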
Here is a clear, functional implementation of this loss.
import torch
import torch.nn.functional as F

def nt_xent_loss(features, temperature=0.5):
    """Computes the NT-Xent contrastive loss.

    Expects `features` of shape (2N, D), where consecutive rows are the
    two views of the same image: [img1_view1, img1_view2, img2_view1, ...].
    """
    batch_size = features.shape[0]  # 2N: two views per original image
    # Normalize the features so dot products become cosine similarities.
    features = F.normalize(features, dim=1)
    # Compute the similarity matrix between all pairs of views.
    similarity_matrix = torch.matmul(features, features.T)
    # Create labels: each view's positive is its partner view.
    # Maps 0->1, 1->0, 2->3, 3->2, ...
    labels = torch.arange(batch_size, device=features.device)
    labels = labels + 1 - labels % 2 * 2
    # Mask out the self-similarity diagonal so a view can't match itself.
    mask = torch.eye(batch_size, device=features.device, dtype=torch.bool)
    similarity_matrix[mask] = -1e9
    # Cross-entropy over each similarity row pulls the positive pair
    # together and pushes every other view in the batch away.
    loss = F.cross_entropy(similarity_matrix / temperature, labels)
    return loss
The temperature parameter is a knob: lowering it sharpens the softmax over similarities, so the loss concentrates on the hardest negatives. Getting this value right is often the key to good performance.
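You can sanity-check both the loss and the pairing convention with toy features (random values, purely illustrative). When the two views of each image nearly coincide, the loss should drop well below what unstructured features produce:

torch.manual_seed(0)
views = torch.randn(8, 128)                              # 4 images x 2 views, 128-dim
views[1::2] = views[0::2] + 0.01 * torch.randn(4, 128)   # make each pair nearly identical

print(nt_xent_loss(views, temperature=0.5))               # small: positives dominate
print(nt_xent_loss(torch.randn(8, 128), temperature=0.5)) # higher: no pair structure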
Now, what does the model itself look like? It has two parts. The first part is an “encoder” – a standard backbone like a ResNet that extracts features. The second part is a small “projection head,” usually just a couple of linear layers, that maps the features to the space where we apply the contrastive loss. After training, we throw away the projection head and use the encoder for real tasks.
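Here's a minimal sketch of that two-part design, assuming a ResNet-18 backbone from torchvision (the paper used larger ResNets, but the structure is identical):

import torch.nn as nn
from torchvision import models

class SimCLRModel(nn.Module):
    """ResNet-18 encoder plus a small MLP projection head."""

    def __init__(self, projection_dim=128):
        super().__init__()
        backbone = models.resnet18(weights=None)  # train from scratch, no labels needed
        feature_dim = backbone.fc.in_features     # 512 for ResNet-18
        backbone.fc = nn.Identity()               # strip the supervised classifier
        self.encoder = backbone
        self.projection_head = nn.Sequential(
            nn.Linear(feature_dim, feature_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feature_dim, projection_dim),
        )

    def forward(self, x):
        h = self.encoder(x)           # the representation we keep after training
        z = self.projection_head(h)   # used only for the contrastive loss
        return z

A training step then runs both augmented batches from the loader above through the model and interleaves the projections to match the pairing the loss expects:

model = SimCLRModel()
z1, z2 = model(view1), model(view2)
features = torch.stack([z1, z2], dim=1).flatten(0, 1)  # [img1_v1, img1_v2, img2_v1, ...]
loss = nt_xent_loss(features, temperature=0.5)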
So you’ve trained this model on 100,000 unlabeled images. How do you know it learned anything useful? This is the critical test. We perform “linear evaluation.” We freeze the weights of our pretrained encoder and attach a single, new linear layer on top. We then train only this new layer on a small set of labeled data, like 10 images per class.
If the representations are good, this simple linear classifier will achieve high accuracy very quickly. It proves the encoder has organized the visual world into a meaningful space where concepts like “cat” and “truck” are linearly separable. The hard work of seeing is already done.
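Here's a minimal sketch of that protocol, reusing the model from above; `labeled_loader` and the class count of 10 are hypothetical placeholders for your own labeled subset:

encoder = model.encoder
for p in encoder.parameters():
    p.requires_grad = False          # freeze the pretrained encoder
encoder.eval()

classifier = nn.Linear(512, 10)      # 512 = ResNet-18 feature dim; 10 classes as an example
optimizer = torch.optim.Adam(classifier.parameters(), lr=1e-3)

for images, labels in labeled_loader:    # hypothetical loader over the small labeled set
    with torch.no_grad():
        feats = encoder(images)          # frozen representations
    logits = classifier(feats)
    loss = F.cross_entropy(logits, labels)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()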
The results can be startling. A model pretrained this way on generic images often performs better on a specific medical imaging task with limited labels than a model trained from scratch on that medical data. It has learned a more general, robust way of seeing.
What does this mean for you? It means the barrier to applying deep learning just got lower. You can start with the piles of unlabeled data you already have. Use SimCLR to build a foundation model that understands your specific domain—be it satellite imagery, factory floor photos, or historical documents. Then, with a handful of labels, you can fine-tune it for your precise task with remarkable efficiency.
This shift from supervised to self-supervised learning is one of the most practical advances in our field. It turns data scarcity from a wall into a speed bump. I encourage you to take the code here, run it on a dataset you care about, and see what happens. What could you build if labeling wasn’t your first step?
If this approach to learning from data resonates with you, please share this article with a colleague who’s also wrestling with labeling costs. Have you tried a self-supervised method before? What was your experience? Let me know in the comments—I read every one and learn from your perspectives.