When you first started acquiring knowledge about deep learning methods, neural network classifiers were probably part of what you learned. These networks, trained on datasets with many samples for a limited number of classes, are strongly supervised.
But they don't generalize well to samples outside the distribution of your training set.
Nor can they be reused for different classification problems. When trained properly, they work well for the problem they were trained for - but that's it.
In 2021, researchers at OpenAI looked into methods to solve this. Specifically, they wondered if it would be possible to build a mechanism for zero-shot classification: in other words, a method that works on a variety of examples without explicit training on them, and without providing similar examples at inference time.
And yes, this turned out to be possible! In this article, we're taking a look at CLIP, which enables zero-shot behavior by being trained on text and images... at the same time.
Building a zero-shot classifier can be done in multiple ways; one architecture is not necessarily better than another. When designing such architectures, it is common to use the Transformer (Vaswani et al., 2017), which remains the state of the art in NLP. In fact, many modern language models can be traced back to (parts of) this architecture.
In one of their first attempts, the OpenAI researchers tried jointly training an image and text model to predict image captions:
Our initial approach, similar to VirTex, jointly trained an image CNN and text transformer from scratch to predict the caption of an image. However, we encountered difficulties efficiently scaling this method (Radford et al., 2021).
Unfortunately, that didn't work well. According to the authors, the problem the model attempted to solve - effectively a predictive problem - is a really difficult one:
[The model tries] to predict the exact words of the text accompanying each image. This is a difficult task due to the wide variety of descriptions, comments, and related text that co-occur with images (Radford et al., 2021).
Instead, they argue, a contrastive approach may work better. Imagine that you are teaching a child what a banana looks like. Often, you show them a banana, saying "banana!". Then, you alternate with other fruits ("other fruit!"), until the child eventually learns to distinguish bananas from the rest.
In other words, you're employing a contrastive learning strategy.
Contrastive learning is an approach to learning that focuses on extracting meaningful representations by contrasting positive and negative pairs of instances (Encord, 2023).
Now that we understand what contrastive learning is, let's take a look at CLIP itself.
CLIP, or Contrastive Language-Image Pretraining, is a model trained jointly on pairs of texts and images. There is a text encoder which encodes texts and an image encoder which encodes images, both into a shared embedding space with dimensionality d_e. Before looking at how CLIP is trained, let's take a look at these individual components.
On the text side, byte-pair encoded sequences with a maximum length of 76 are fed through a Transformer with the architecture modifications from Radford et al. (2019) - so a GPT-2 style model. The activations of the highest layer are taken as the textual feature representations, which are then linearly projected into the embedding space with dimensionality d_e.
The text encoder is a Transformer (Vaswani et al., 2017) with the architecture modifications described in Radford et al. (2019) (Radford et al., 2021)
Something similar is done with the image batches: they are also encoded and projected into the embedding space with dimensionality d_e. However, the authors were a bit more experimental with images than with text. Instead of using a single architecture, they tried two families to see which worked best: a series of modified ResNets and a series of Vision Transformers (as introduced by Dosovitskiy et al., 2020). The Vision Transformers worked better, in particular one trained at a higher image resolution. Radford et al. note that whenever they speak of CLIP, they mean the combination of the GPT-2 style text Transformer with that specific Vision Transformer.
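To make the shared embedding space a bit more concrete, here is a minimal PyTorch sketch of the projection step only. The feature widths and batch size are made up for illustration; the random tensors merely stand in for the real encoder activations described above.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch: project encoder outputs into the shared embedding space.
# The widths (512, 768) and batch size are illustrative, not taken from the paper.
N, d_text, d_image, d_e = 8, 512, 768, 256

text_features = torch.randn(N, d_text)    # placeholder for text encoder activations
image_features = torch.randn(N, d_image)  # placeholder for image encoder activations

# Learned linear projections into the shared space of dimensionality d_e
project_text = torch.nn.Linear(d_text, d_e, bias=False)
project_image = torch.nn.Linear(d_image, d_e, bias=False)

text_embeddings = F.normalize(project_text(text_features), dim=-1)
image_embeddings = F.normalize(project_image(image_features), dim=-1)

print(text_embeddings.shape, image_embeddings.shape)  # both torch.Size([8, 256]), i.e. [N, d_e]
```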
The magic of CLIP happens when the two modalities are combined in the forward pass, whose outcome is used for optimization. I must admit, it was slightly complex to understand at first, but its beauty became really clear later.
Let's take a look at the image above. At this point, you know that the two batches - when fed through both transformers - have the same shape: there are N images/texts of dimensionality d_e, so [N, d_e] is the shape of both the image batch and the text batch.
Visually, they are the purple row and green column in the left part of the above image.
Subsequently, the following happens:
Given a batch of N (image, text) pairs, CLIP is trained to predict which of the N × N possible (image, text) pairings across a batch actually occurred. To do this, CLIP learns a multi-modal embedding space by jointly training an image encoder and text encoder to maximize the cosine similarity of the image and text embeddings of the N real pairs in the batch while minimizing the cosine similarity of the embeddings of the N² − N incorrect pairings. We optimize a symmetric cross entropy loss over these similarity scores (Radford et al., 2021).
This is the pseudocode of CLIP's symmetric loss function (Radford et al., 2021):
# symmetric loss function
labels = np.arange(n)
loss_i = cross_entropy_loss(logits, labels, axis=0)
loss_t = cross_entropy_loss(logits, labels, axis=1)
loss = (loss_i + loss_t)/2
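The paper's pseudocode leaves out how the logits are computed and keeps the loss framework-agnostic. Below is a minimal, runnable PyTorch sketch of the same idea; the variable names and the fixed temperature value are my own (in the paper, the temperature is a learned parameter). The sections that follow unpack how the integer labels work.

```python
import torch
import torch.nn.functional as F

def clip_symmetric_loss(image_embeddings, text_embeddings, temperature=0.07):
    """Symmetric cross-entropy over an N x N similarity matrix (illustrative sketch)."""
    # Cosine similarities between every image and every text: both inputs are [N, d_e]
    image_embeddings = F.normalize(image_embeddings, dim=-1)
    text_embeddings = F.normalize(text_embeddings, dim=-1)
    logits = image_embeddings @ text_embeddings.t() / temperature  # shape [N, N]

    # The correct (image, text) pairs sit on the diagonal: indices 0..N-1
    labels = torch.arange(logits.size(0), device=logits.device)

    loss_i = F.cross_entropy(logits, labels)      # image -> text direction
    loss_t = F.cross_entropy(logits.t(), labels)  # text -> image direction
    return (loss_i + loss_t) / 2

# Quick check with random embeddings of dimensionality d_e = 256
print(clip_symmetric_loss(torch.randn(4, 256), torch.randn(4, 256)).item())
```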
In other words, symmetric here means that the loss is simply the average of the cross-entropy losses computed between the labels and the logits across the two axes.
But what confused me was how said loss can be computed over the logits while the labels are defined as np.arange(n).
I finally figured it out. Let me explain.
It begins with understanding what np.arange actually computes:
Return evenly spaced values within a given interval (NumPy, n.d.).
In other words, this gives us a list of integers: with n=3, for example, [0, 1, 2]. Then, I got even more confused. How can a single integer ever be compared with a logits vector (each row or column of the similarity matrix)?
After quite a bit of searching, I looked again at the image provided above. As you can see, the correct pairs of texts and images form a diagonal. Index-wise, this means that each pair can be found at [i, i], where i ranges from 0 to n - 1. In other words, np.arange(n) represents the true indices of the diagonal: np.arange(3) = np.array([0, 1, 2]), corresponding to the positions [(0, 0), (1, 1), (2, 2)]. The diagonal!
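A tiny NumPy check makes the correspondence explicit (just an illustration):

```python
import numpy as np

n = 3
labels = np.arange(n)
print(labels)  # [0 1 2]

# For row i, the correct column is labels[i] == i, so the true entries sit at (i, i): the diagonal
for i in labels:
    print(f"pair {i}: correct entry at position ({i}, {labels[i]})")
```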
When I understood that, I was a step closer. But what I still did not understand was how one integer value (for example, 0) could be used in computing the loss when compared with a vector.
Then I learned that cross-entropy loss in many deep learning frameworks (such as PyTorch) accepts an integer as the target, namely a class index:
The target that this criterion expects should contain either (...) class indices in the range [0, C) where C is the number of classes (PyTorch, n.d.).
Conceptually, such a class index corresponds to a one-hot encoded target: class index 3 with 5 possible classes corresponds to the label [0, 0, 0, 1, 0]. With that, it is possible to compute cross-entropy loss between the logits and the label - because the two can be compared.
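A quick PyTorch example with a single three-class row (illustrative numbers) shows this:

```python
import torch
import torch.nn.functional as F

logits = torch.tensor([[0.9, 0.2, 0.1]])  # one row of logits over three classes
target = torch.tensor([0])                # class index 0, i.e. conceptually the one-hot [1, 0, 0]

# Cross-entropy between the logits and the (implicit) one-hot target
print(F.cross_entropy(logits, target).item())
```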
Let's now make this even more clear with an example - it is the core of CLIP, after all, so it's important to understand it thoroughly.
Suppose that we have this similarity matrix, which contains the logits - the dot products produced as a similarity measure for every (text, image) combination:
|    | t1  | t2  | t3  |
|----|-----|-----|-----|
| i1 | 0.9 | 0.2 | 0.1 |
| i2 | 0.3 | 0.8 | 0.2 |
| i3 | 0.1 | 0.4 | 0.7 |
You can clearly see that the diagonal has the highest values. This makes sense, because the diagonal contains the true pairs (t1, i1), (t2, i2) and (t3, i3). And that is why the original paper defines labels as np.arange(n): this range contains the indices of the correct pairs, namely 0, 1 and 2.
Subsequently, the CLIP loss is computed across both dimensions: once for the image/text pairs (row-wise) and once for the text/image pairs (column-wise).
In both directions, the targets you'd expect to find are the one-hot vectors [1 0 0], [0 1 0] and [0 0 1]. Suppose you're computing the first loss, over axis=0 in the pseudocode, going through the matrix row by row. You're passing index = 0 for the first row, index = 1 for the second, and so on; in other words, the targets [1 0 0], [0 1 0], et cetera. When computing the loss, you'll therefore compare [0.9 0.2 0.1] with [1 0 0], [0.3 0.8 0.2] with [0 1 0], and so on. This leads to a loss: there are small but significant differences between the logits and the expected values. There is no perfect diagonal (i.e., an identity matrix) after all, especially at first, when the network is initialized pseudo-randomly (or with an initialization scheme resembling it). But we're trying to get there, so we compute the loss over axis=0!
Then, you do the same over axis=1, going through the matrix column by column. You finally combine both losses and divide by 2, so that the losses over the two directions contribute equally to how wrong the model is.
This is then used in the backwards pass leading to model optimization.
Eventually, when trained, the ideal similarity matrix for a batch is an identity matrix. That happens only in theory, though; in practice it will never be that perfect, but results have shown that it gets close enough for practical use!
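To tie things together, here is the worked example in code: a small sketch that computes both directional losses over the example matrix above and averages them (the variable names are my own).

```python
import torch
import torch.nn.functional as F

# The example similarity matrix: rows are images i1..i3, columns are texts t1..t3
logits = torch.tensor([[0.9, 0.2, 0.1],
                       [0.3, 0.8, 0.2],
                       [0.1, 0.4, 0.7]])
labels = torch.arange(3)  # the diagonal: positions (0, 0), (1, 1), (2, 2)

loss_images = F.cross_entropy(logits, labels)      # each row (image) vs. its one-hot target
loss_texts = F.cross_entropy(logits.t(), labels)   # each column (text) vs. its one-hot target
loss = (loss_images + loss_texts) / 2
print(loss_images.item(), loss_texts.item(), loss.item())
```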
Let's now take a look at a few use cases for CLIP:
Let's revisit the image. Above, we discussed what's visualized under (1) - contrastive pre-training. When OpenAI was done training, they had a model capable of understanding the relevance between a text and an image. This allows you to perform zero-shot classification, which is defined as follows:
Zero-shot (...) classification is a task in natural language processing where a model is trained on a set of labeled examples but is then able to classify new examples from previously unseen classes. (HuggingFace, n.d.)
Even better: because it was trained on pairs of texts and images, it's possible to create any classifier you want! Let's take a look.
What we're seeing here under (2) and (3) is the following. First, a set of class labels is defined, for example plane, car, dog, ..., bird. This set is subsequently used to format new strings, such as A photo of a car. These are passed to the trained text encoder, meaning that they are converted into vectors with the dimensionality of the embedding space. The image to be classified is passed through the trained image encoder in the same way, after which the similarity between the image embedding and each text embedding is computed. Indeed, in our case, the third textual input is the most likely: it's a photo of a dog.
What this means is that you'll be able to compose a textual set of any objects you'd like (for example, banana, apple and orange), pass it through the trained text encoder, pass an image through the trained image encoder, compute similarity and find the most probable class. A true zero-shot image classifier!
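As an illustration, here is a hedged sketch of such a classifier using the Hugging Face transformers implementation of CLIP; the checkpoint name is a commonly used public one and the image path is a placeholder.

```python
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

# Publicly available CLIP checkpoint on the Hugging Face hub
model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

labels = ["banana", "apple", "orange"]
prompts = [f"a photo of a {label}" for label in labels]
image = Image.open("fruit.jpg")  # placeholder path to the image you want to classify

inputs = processor(text=prompts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image-text similarities, turned into class probabilities
probs = outputs.logits_per_image.softmax(dim=-1)
print(dict(zip(labels, probs[0].tolist())))
```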
If you want to create such a classifier yourself, make sure to read this article.
There's another interesting CLIP use case: it can be used as an image search engine. Suppose that you have a dataset of images, say, 1000 photos that you'd like to search through. Using CLIP, it would not be too difficult to build a web page that takes a textual input (for example, "a yellow car in New York City") and have it return images which are similar.
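A rough sketch of the indexing and querying side could look like this; the image paths and the query string are placeholders, and a real search engine would precompute and store the image embeddings rather than encode them per query.

```python
import torch
import torch.nn.functional as F
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

image_paths = ["photos/001.jpg", "photos/002.jpg"]  # placeholder image collection
images = [Image.open(path) for path in image_paths]

with torch.no_grad():
    # Index step: embed every image once (in practice, store these embeddings)
    image_inputs = processor(images=images, return_tensors="pt")
    image_embeddings = F.normalize(model.get_image_features(**image_inputs), dim=-1)

    # Query step: embed the textual query
    text_inputs = processor(text=["a yellow car in New York City"], return_tensors="pt", padding=True)
    text_embedding = F.normalize(model.get_text_features(**text_inputs), dim=-1)

# Rank images by cosine similarity to the query
scores = (image_embeddings @ text_embedding.t()).squeeze(-1)
ranking = scores.argsort(descending=True)
print([image_paths[i] for i in ranking])
```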
CLIP was trained on a large-scale dataset, but the objects in the images were relatively common and photographed from typical angles (as you would expect with regular photos). It is hence difficult to use the model for use cases involving very different viewpoints, such as satellite imagery.
Similar to any large model, it's possible to fine-tune CLIP with your own dataset. Here is an example which demonstrates how to do that using satellite imagery.
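The linked article goes into detail; as a rough, hedged sketch of what such a fine-tuning loop can look like with the Hugging Face CLIPModel (which can return its contrastive loss via return_loss=True), assuming you provide your own image/caption batches:

```python
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-6)

# Placeholder: replace with your own iterable of (PIL images, captions) batches
dataloader = []

model.train()
for images, captions in dataloader:
    inputs = processor(text=captions, images=images,
                       return_tensors="pt", padding=True, truncation=True)
    outputs = model(**inputs, return_loss=True)  # symmetric contrastive loss
    outputs.loss.backward()
    optimizer.step()
    optimizer.zero_grad()
```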
Today, you've learned to understand CLIP better: what its components are, how it is trained, and how it can be used. I hope the article has helped you gain a deeper understanding about this model. Thanks for reading - and let's connect if you have any questions or remarks!
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International conference on machine learning (pp. 8748-8763). PMLR.
Encord. (2023, July 14). Full guide to contrastive learning. Data Engine for AI Model Development | Encord. https://encord.com/blog/guide-to-contrastive-learning
NumPy. (n.d.). Numpy.arange — NumPy v1.26 manual. https://numpy.org/doc/stable/reference/generated/numpy.arange.html
PyTorch. (n.d.). CrossEntropyLoss — PyTorch 2.1 documentation. https://pytorch.org/docs/stable/generated/torch.nn.CrossEntropyLoss.html
HuggingFace. (n.d.). Zero-shot classification. Hugging Face – The AI community building the future. https://huggingface.co/tasks/zero-shot-classification
Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., ... & Polosukhin, I. (2017). Attention is all you need. Advances in neural information processing systems, 30.
Radford, A., Wu, J., Child, R., Luan, D., Amodei, D., & Sutskever, I. (2019). Language models are unsupervised multitask learners. OpenAI blog, 1(8), 9.
Dosovitskiy, A., Beyer, L., Kolesnikov, A., Weissenborn, D., Zhai, X., Unterthiner, T., ... & Houlsby, N. (2020). An image is worth 16x16 words: Transformers for image recognition at scale. arXiv preprint arXiv:2010.11929.