Build SimCLR in PyTorch: Self-Supervised Learning for Unlabeled Images

Learn to build SimCLR in PyTorch for unlabeled image datasets with contrastive learning, NT-Xent loss, and linear probing tips.

I remember the moment clearly. I was staring at a folder containing ten thousand chest X-rays, ready to build a pneumonia detection model. The dataset was pristine, but there were no labels. Zero. No radiologist had annotated them, and I didn’t have the budget to hire one. For any supervised approach, I was dead in the water. That night, I discovered contrastive learning, and specifically the SimCLR framework. It felt like stumbling onto a secret passage in a locked castle. The idea was elegantly simple: teach a model to recognize that two augmented versions of the same X‑ray are the same image, without ever telling it what “pneumonia” means. The model would then learn meaningful features on its own. This article is about that technique—building SimCLR from scratch in PyTorch, using only raw, unlabeled images to create visual representations that rival supervised learning.

Have you ever found yourself with thousands of images but zero labels? That pain is exactly what drove the development of self‑supervised learning. SimCLR, introduced by researchers at Google Brain in 2020, became a landmark because it proved that contrastive learning could match supervised performance when the right ingredients—augmentations, a projection head, and the NT‑Xent loss—were combined. The core principle is almost childlike in its logic: if you take an image and create two distorted versions of it, the model should pull those distorted versions close together in an embedding space, while pushing apart distortions from other images. This forces the encoder to ignore superficial differences like color shifts or cropping and focus on the actual content of the image.

To start, you need the right tools. I assume you have PyTorch installed, along with torchvision, matplotlib, scikit‑learn, and tqdm. The project is simple: a few Python files for augmentations, the model, the loss, training, and evaluation. But the real magic starts with data augmentation. SimCLR is ridiculously dependent on the quality of its augmentations. The paper showed that a random crop combined with color jitter produces far better representations than either alone. So I build a class that applies two separate, random transformations to the same image. Each transformation includes a random resized crop, a horizontal flip, and a color jitter, with an occasional random grayscale conversion. The result is a pair of images that look different to the human eye but share the same semantic content. For example, one view of a cat image might be a tight crop of the cat’s face with shifted colors, while the other shows the whole cat, flipped and desaturated.
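Below is a minimal sketch of that augmentation class, assuming 32×32 CIFAR‑style inputs; the jitter strengths, crop scale, and probabilities are common defaults I reach for, not values fixed by the discussion above.

```python
import torchvision.transforms as T

class SimCLRAugmentation:
    """Apply the same stochastic pipeline twice to get two distinct views."""
    def __init__(self, size=32):
        self.transform = T.Compose([
            T.RandomResizedCrop(size, scale=(0.2, 1.0)),
            T.RandomHorizontalFlip(),
            T.RandomApply([T.ColorJitter(0.4, 0.4, 0.4, 0.1)], p=0.8),
            T.RandomGrayscale(p=0.2),
            T.ToTensor(),
        ])

    def __call__(self, x):
        # Two independent draws from the same random pipeline: x_i and x_j.
        return self.transform(x), self.transform(x)
```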

Why such a specific set? Because the model needs to learn invariance to factors that don’t change the label. A cat is still a cat whether it’s cropped, flipped, or turned black‑and‑white. But if I gave the model two barely different crops—say, shifted by five pixels—it would learn to simply match pixel positions, not semantic concepts. That’s why the augmentations must be strong and diverse. The SimCLR authors ran exhaustive ablations: removing the color jitter dropped top‑1 accuracy on ImageNet by nearly 10%. So don’t skip it. I personally learned this the hard way when my first run produced embeddings that encoded each image’s identity rather than its content.

Once the augmentations are ready, I wrap the dataset to output these pairs. For CIFAR‑10, each call to __getitem__ returns (x_i, x_j, label). I keep the label only for later evaluation; during pre‑training it is completely ignored. The model itself is divided into two parts. First, an encoder—typically a ResNet‑18 or ResNet‑50—that takes an augmented image and produces a feature vector h. Second, a small projection head that maps h to a lower‑dimensional space z. This projection head is a two‑layer MLP with batch normalization and ReLU. Why a projection head? The paper discovered that if you compute the contrastive loss directly on h, the representations are worse. The projection head forces the model to refine features that are more invariant, and after training, you discard it—you use only h for downstream tasks. I find that deeply satisfying: you build a temporary structure to teach the encoder to see better, then tear it down.
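Here is one way to sketch both pieces. The wrapper and the model follow the description above; the 128‑dimensional projection output and the `weights=None` ResNet constructor (torchvision ≥ 0.13) are my assumptions, not requirements.

```python
import torch
import torch.nn as nn
import torchvision

class SimCLRPairDataset(torch.utils.data.Dataset):
    """Wrap a base dataset so each item yields (x_i, x_j, label)."""
    def __init__(self, base, augment):
        self.base = base          # e.g. CIFAR10(..., transform=None), returning PIL images
        self.augment = augment    # the SimCLRAugmentation instance above

    def __len__(self):
        return len(self.base)

    def __getitem__(self, idx):
        img, label = self.base[idx]
        x_i, x_j = self.augment(img)
        return x_i, x_j, label    # label is carried along but ignored in pre-training

class SimCLRModel(nn.Module):
    """ResNet-18 encoder producing h, plus a two-layer projection head producing z."""
    def __init__(self, proj_dim=128):
        super().__init__()
        backbone = torchvision.models.resnet18(weights=None)
        feat_dim = backbone.fc.in_features    # 512 for ResNet-18
        backbone.fc = nn.Identity()           # strip the classifier; keep pooled features
        self.encoder = backbone
        self.projector = nn.Sequential(
            nn.Linear(feat_dim, feat_dim),
            nn.BatchNorm1d(feat_dim),
            nn.ReLU(inplace=True),
            nn.Linear(feat_dim, proj_dim),
        )

    def forward(self, x):
        h = self.encoder(x)    # kept for downstream tasks
        z = self.projector(h)  # used only by the contrastive loss
        return h, z
```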

Now the loss function, NT‑Xent (Normalized Temperature‑scaled Cross‑Entropy). For a batch of N images, you get 2N augmented views. Each view has one positive (the other view of the same image) and 2(N‑1) negatives (the views from all other images in the batch). The loss for a positive pair (i, j) is the negative log of the probability that i is most similar to j among all other views. That probability is computed using temperature‑scaled cosine similarities. I implement it in a few lines of PyTorch: build the similarity matrix, divide by the temperature, mask out the diagonal (a view is trivially similar to itself), then compute cross‑entropy loss where the appropriately offset indices mark the positive pairs. The temperature is a critical hyperparameter. Too high and the model cannot differentiate; too low and it becomes overly confident and unstable. I typically start at 0.5 and tune. A collapsed model will output a nearly constant vector for all images—a sure sign that your temperature needs tuning or your augmentations are too weak.
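A compact version of that loss might look like this; the index bookkeeping assumes z_i and z_j are the projections of the two view batches in matching order.

```python
import torch
import torch.nn.functional as F

def nt_xent_loss(z_i, z_j, temperature=0.5):
    """NT-Xent over a batch of N positive pairs (2N views)."""
    N = z_i.size(0)
    z = F.normalize(torch.cat([z_i, z_j], dim=0), dim=1)  # (2N, d) unit vectors
    sim = z @ z.T / temperature                           # temperature-scaled cosine similarities
    sim.fill_diagonal_(float('-inf'))                     # mask self-similarity
    # Row k's positive sits N positions away in the concatenated batch.
    targets = torch.cat([torch.arange(N, 2 * N), torch.arange(0, N)]).to(sim.device)
    return F.cross_entropy(sim, targets)
```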

Training is straightforward: for each batch, you pass both views through the encoder and projection head, compute the loss, backpropagate, and step. But I add a small personal touch. After every epoch, I save the learned h features for a small validation set. This allows me to quickly check if the features are starting to cluster by the hidden labels (which I never show to the model during training). The first few epochs look like noise; then, around epoch 20, I see blobs forming. It is one of the most satisfying visual experiences in machine learning: watching the model spontaneously discover the ten classes of CIFAR‑10 without ever being told about them.
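The loop itself is short. One caveat: the original paper trains with LARS and very large batches; the Adam optimizer and learning rate below are what I use for small CIFAR‑scale runs, and `train_loader` is assumed to serve batches from the pair dataset above.

```python
device = "cuda" if torch.cuda.is_available() else "cpu"
model = SimCLRModel().to(device)
optimizer = torch.optim.Adam(model.parameters(), lr=3e-4, weight_decay=1e-6)

for epoch in range(100):
    model.train()
    for x_i, x_j, _ in train_loader:   # the label is deliberately discarded here
        x_i, x_j = x_i.to(device), x_j.to(device)
        _, z_i = model(x_i)
        _, z_j = model(x_j)
        loss = nt_xent_loss(z_i, z_j, temperature=0.5)
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    # Optional: snapshot h for a small validation set here and plot it (e.g. t-SNE);
    # the hidden labels color the plot for your eyes only and never touch the gradient.
```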

After a hundred epochs of pre‑training, I evaluate using linear probing. I freeze the encoder weights, train a single linear layer on top of the h features using the true labels, and measure accuracy. For CIFAR‑10, a properly trained SimCLR can reach 90%+ accuracy with a ResNet‑18, compared to 95% for fully supervised training. That gap of 5% is a tiny price to pay for not needing any labels. I have used this same pipeline on medical imaging datasets with fewer than 200 labeled examples and achieved results that surprised even domain experts. The key insight is that the pre‑trained encoder learned robust feature maps that transfer much better than random initialization.
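A linear probe in this setup is just the frozen encoder plus one trainable layer; `labeled_loader` (single-view images with labels) and the epoch count are assumptions for illustration.

```python
# Freeze the encoder; only the linear layer learns.
for p in model.parameters():
    p.requires_grad = False
model.eval()

probe = nn.Linear(512, 10).to(device)   # 512-dim h from ResNet-18, 10 CIFAR classes
probe_opt = torch.optim.Adam(probe.parameters(), lr=1e-3)

for epoch in range(30):
    for x, y in labeled_loader:         # ordinary (image, label) batches, one view each
        x, y = x.to(device), y.to(device)
        with torch.no_grad():
            h = model.encoder(x)        # h, not z: the projection head is discarded
        loss = F.cross_entropy(probe(h), y)
        probe_opt.zero_grad()
        loss.backward()
        probe_opt.step()
```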

You might wonder: can I apply SimCLR to any dataset? Absolutely. As long as you have a folder of images, you can use the same augmentation and model code. The only things you need to adjust are the image size and the normalization statistics. For non‑natural images, like satellite data or histopathology slides, I often modify the augmentations—for example, avoiding random grayscale on H&E stains, where color carries diagnostic information. SimCLR is a framework, not a fixed recipe.
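As one illustration, a histopathology-flavored variant of the pipeline might drop grayscale and soften the jitter; the normalization statistics below are placeholders you would compute from your own tiles.

```python
# Hypothetical H&E variant: stain color is diagnostic, so no random grayscale.
histo_transform = T.Compose([
    T.RandomResizedCrop(224, scale=(0.5, 1.0)),
    T.RandomHorizontalFlip(),
    T.RandomVerticalFlip(),   # slides have no canonical orientation
    T.RandomApply([T.ColorJitter(0.2, 0.2, 0.1, 0.02)], p=0.5),
    T.ToTensor(),
    T.Normalize(mean=[0.70, 0.55, 0.70], std=[0.15, 0.20, 0.15]),  # placeholder stats
])
```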

Before I wrap up, I want to share one debugging story. On my first attempt, the loss went down perfectly, but the linear probe accuracy was random. I had accidentally used the projection head’s output z for linear probing instead of the encoder’s h. Once I corrected that, the accuracy shot up. The SimCLR paper explicitly warns against this mistake. So, a friendly piece of advice: always test your evaluation pipeline by training a small supervised model first, then compare with the self‑supervised one. If your self‑supervised linear probe is far lower than the supervised baseline, check if you are using the right features.

If you found this guide helpful, like this article to help others discover it. Share it with a colleague who is drowning in unlabeled data. And comment below with your own experiences—what worked, what didn’t, or which dataset you applied contrastive learning to. I read every comment and often incorporate insights into future tutorials. Together, we can make self‑supervised learning the default, not the exception.

Now go take those unlabeled images and let the model teach itself. You already have everything you need.



