The CLIP model by OpenAI really fascinates me. When it was released in 2021, it was the first approach that successfully paired images with text at scale. It does so by learning similarities between text-image pairs, which is why it can be used for zero-shot classification: classifying images into classes it was never explicitly trained on, without any labeled examples up front.
The article linked above gives you a deeper understanding of CLIP, its components and why it works. This article, instead, focuses on building something valuable with these tools: a zero-shot image classifier. Indeed, we're going to build a pipeline that is capable of classifying an image by text. That is, you provide the classes in text, you provide the image, and the model does the rest.
It is structured as follows:

- First, a brief recap of what CLIP is and why it can be used for zero-shot classification.
- Then, the imports and the configuration file for our classifier.
- Next, loading the CLIPProcessor and CLIPModel and using them to classify the loaded image.
- Finally, running the script on a few example images.

Let's take a look!
The CLIP model, which stands for Contrastive Language-Image Pre-training, is a 2021 OpenAI model. It was the first of its kind capable of predicting the similarity of text-image pairs at scale. Contrary to previous approaches, which attempted the same with predictive modelling (i.e. predicting the text label for an image that serves as model input), CLIP uses a contrastive approach, which contrasts correct pairs with incorrect ones.
As a consequence of training this model at scale, with the text Transformer and the vision Transformer trained concurrently, the embeddings of correct text/image pairs end up with high similarity, whereas those of incorrect pairs have low similarity. This can be used for classifying images in a zero-shot fashion (Radford et al., 2021).
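To make this concrete, here is a minimal, self-contained sketch (separate from the script we'll build below) of how these similarities could be computed directly with transformers. It assumes the openai/clip-vit-base-patch32 checkpoint, an example image path of my own choosing, and uses the get_text_features and get_image_features helpers of CLIPModel:

from PIL import Image
import torch
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

texts = ["a photo of a boat", "a photo of a spaceship"]
image = Image.open("some_image.jpg")  # hypothetical example path

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)

with torch.no_grad():
    # Embed the texts and the image into the shared latent space
    text_embeds = model.get_text_features(
        input_ids=inputs["input_ids"], attention_mask=inputs["attention_mask"]
    )
    image_embeds = model.get_image_features(pixel_values=inputs["pixel_values"])

# Cosine similarity between the image and each text: the best-matching text wins
similarity = torch.nn.functional.cosine_similarity(image_embeds, text_embeds)
print(similarity)

The classifier we build next rests on exactly this idea, except that the CLIPModel forward pass conveniently returns these (scaled) similarities as logits for us.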
Let's now build the actual zero-shot classifier! Open your IDE or a Jupyter Notebook and create a new file (e.g. clipzeroshot.py, or its Notebook equivalent) so that we can get to work.
Of course, building anything with Python requires specifying a set of imports. Let's specify all of them - and why we need them - and then add the first lines of code to our script:
- PIL is required for loading the image (as Image) and providing a format that can be passed to the CLIP model.
- transformers is used for loading the CLIP model (CLIPModel) and the CLIP processor (CLIPProcessor), which converts the inputs into an encoded representation. BatchEncoding is also imported, because it is the return type of one of the definitions that follows.
- torch is imported for defining a return type.
- logging is used for logging, os for specifying paths, and json for parsing the configuration file as JSON.

Here's everything in code:
from PIL import Image
from transformers import CLIPProcessor, CLIPModel
from transformers.tokenization_utils_base import BatchEncoding
import torch
import logging
import os
import json
logging.basicConfig(level=logging.INFO)
I'm a big fan of separating the configuration of a program from its code. That is, whenever possible, I try to avoid hardcoding details into a script. Even though the benefits show mostly in large software programs, it's nice in smaller ones too. Hence, let's specify a configuration (save it as configuration.json in the same folder as your .py script) that specifies:
- What processor and what model to use. We're using clip-vit-base-patch32:

  The model uses a ViT-B/32 Transformer architecture as an image encoder and uses a masked self-attention Transformer as a text encoder (Radford et al., 2021).

- What classes must be used in generating the prediction. Indeed, the beauty of CLIP is that through learning (text, image) similarity pairs, it is possible to generate a classifier by text only. The authors discuss this in their paper by arguing that the CLIP framework essentially produces a hypernetwork; it is one of the things about CLIP that I find most fascinating:

  [The] image encoder is the computer vision backbone which computes a feature representation for the image and the text encoder is a hypernetwork (Ha et al., 2016) which generates the weights of a linear classifier based on the text specifying the visual concepts that the classes represent (Radford et al., 2021).

- Next, you'll specify the image path.
In other words,
{
    "processor": "openai/clip-vit-base-patch32",
    "model": "openai/clip-vit-base-patch32",
    "classes": [
        "spaceship",
        "fisherman's boat",
        "tanker"
    ],
    "image": "./image_to_classify.jpg"
}
is your configuration for today.
Feel free to make some adaptations, such as specifying more or fewer classes, changing them, and so forth!
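For instance, a hypothetical configuration for telling a few animals apart (the class names below are purely illustrative) would only require changing the classes list:

{
    "processor": "openai/clip-vit-base-patch32",
    "model": "openai/clip-vit-base-patch32",
    "classes": [
        "cat",
        "dog",
        "horse",
        "bird"
    ],
    "image": "./image_to_classify.jpg"
}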
In your Python script, add a def which opens the configuration file, parses its JSON contents and returns a dictionary:
def load_configuration(path: str = os.path.join(os.getcwd(), "configuration.json")) -> dict:
    """
    Loads the configuration file.

    Args:
        path (str, optional): The path to the configuration file. Defaults to "configuration.json" in the current working directory.

    Returns:
        dict: The configuration
    """
    with open(path, "r") as f:
        return json.load(f)
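If you want to quickly verify this def in isolation, a small (hypothetical) check could look like this, assuming the configuration.json from above sits next to your script:

config = load_configuration()
print(config["classes"])  # e.g. ["spaceship", "fisherman's boat", "tanker"]
print(config["image"])    # e.g. "./image_to_classify.jpg"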
CLIPModel
Next step: getting an instance of the CLIP model!
With transformers, this is really easy: you just specify the class, tell it to use a pretrained model, et voilà. That's the minimum amount of work necessary to load the CLIP model.

Indeed, as suggested before, we're using the openai/clip-vit-base-patch32 pretrained CLIP model. You will see that it is passed from the configuration file later, but to ensure the def keeps working out of the box, it is also set as the default value (so it's used whenever no value is passed).
def get_model(model: str = "openai/clip-vit-base-patch32") -> CLIPModel:
    """
    Returns the CLIP model, which uses a vision transformer and a language transformer to map images and text to a common latent space.
    Subsequently, the similarity between the two can be measured using cosine similarity and used to classify images.

    Args:
        model (str, optional): The model to use. Defaults to "openai/clip-vit-base-patch32".

    Returns:
        CLIPModel: The CLIP model
    """
    return CLIPModel.from_pretrained(model)
CLIPProcessor
The same is true for the processor, which essentially wraps the image processor and text processor for easy generation of encoded texts and images:
The CLIPProcessor wraps CLIPImageProcessor and CLIPTokenizer into a single instance to both encode the text and prepare the images (HuggingFace, n.d.).
Just call from_pretrained on the CLIPProcessor class with the pretrained processor (the OpenAI one in our case, also as the default) and that's it:
def get_processor(processor: str = "openai/clip-vit-base-patch32") -> CLIPProcessor:
    """
    Returns the CLIP processor which uses the CLIP tokenizer to convert text and images to features.

    Args:
        processor (str, optional): The processor to use. Defaults to "openai/clip-vit-base-patch32".

    Returns:
        CLIPProcessor: The CLIP processor
    """
    return CLIPProcessor.from_pretrained(processor)
Let's now focus on the image that must be classified.
I've asked DALL-E 3 to generate an image with the following prompt:
Show me what a fisherman's boat looks like.
We'll save it as image_to_classify.jpg into the same folder as the Python script, because that's what we've configured in the configuration JSON above. If you wish to store the image in a different location, feel free to adapt the configured path yourself.
Clearly, out of the 3 possible classes in the configuration, this is a fisherman's boat.
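For completeness: in the final script the image is opened in the main def using the configured path, but loading it boils down to a single PIL call. The convert("RGB") step below is an optional safeguard of my own (not part of the script) for images that aren't stored as three-channel RGB:

image = Image.open("./image_to_classify.jpg")
# Optional: make sure we have a three-channel RGB image (e.g. for PNGs with an alpha channel)
image = image.convert("RGB")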
Now that we have an image, it's time to use the instantiated processor to encode the textual and image inputs. We create another definition for this: it accepts the processor, the loaded image and a list of classes, which are just strings. It passes them to the processor (specifying that the returned tensors must be torch ones) and returns the result.
def get_encoded_input(processor: CLIPProcessor, image: Image.Image, classes: list[str]) -> BatchEncoding:
    """
    Returns the encoded input for the CLIP model.

    Args:
        processor (CLIPProcessor): The CLIP processor
        image (Image.Image): The image
        classes (list[str]): The classes

    Returns:
        BatchEncoding: The encoded input.
    """
    return processor(text=classes, images=image, return_tensors="pt", padding=True)
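If you're curious what the processor actually produces, you can inspect the returned BatchEncoding. Assuming you already have a processor and an image, and with our three classes, I'd expect something along these lines (treat the exact shapes as an illustration):

encoded_input = get_encoded_input(processor, image, ["spaceship", "fisherman's boat", "tanker"])
for key, value in encoded_input.items():
    print(key, tuple(value.shape))
# input_ids and attention_mask hold the tokenized class texts, e.g. shape (3, sequence_length);
# pixel_values holds the preprocessed image, e.g. shape (1, 3, 224, 224).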
Okay, now it's time for the actual magic. If we pass the encoded_input to the model, it generates a prediction. That's what we do in the def below, where we subsequently extract the image logits. Depending on whether we want to apply Softmax (not strictly necessary, but it gives you a nice distribution whose components sum to 1), we apply it and then return the prediction.
def get_prediction(model: CLIPModel, encoded_input: BatchEncoding, use_softmax: bool = True) -> torch.Tensor:
    """
    Returns the prediction for the CLIP model.

    Args:
        model (CLIPModel): The CLIP model
        encoded_input (BatchEncoding): The encoded input
        use_softmax (bool, optional): Whether to use softmax. Defaults to True.

    Returns:
        torch.Tensor: The prediction
    """
    prediction = model(**encoded_input).logits_per_image
    if use_softmax:
        prediction = prediction.softmax(dim=-1)
    return prediction
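One optional refinement, which is not part of the script in this article but generally a good idea at inference time, is to disable gradient tracking when calling the model. A minimal sketch of what that could look like, with the same logic wrapped in torch.no_grad():

def get_prediction_no_grad(model: CLIPModel, encoded_input: BatchEncoding, use_softmax: bool = True) -> torch.Tensor:
    # Same as get_prediction, but without building a computation graph,
    # which saves memory and a bit of time during inference.
    with torch.no_grad():
        prediction = model(**encoded_input).logits_per_image
        if use_softmax:
            prediction = prediction.softmax(dim=-1)
    return prediction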
We now have all the components ready! Time to merge them together into a main def which we use when running the script. Here's what is happening:

- The configuration is loaded and used to instantiate the model and the processor.
- The image is opened with PIL from the configured path.
- The processor encodes the image together with the configured classes, and the model generates a prediction over those classes.
- The class with the highest probability is looked up, and both the full prediction and the predicted class are logged.

That's it!
def main() -> None:
    """
    The main function which runs the script.

    Returns:
        None
    """
    config = load_configuration()
    model = get_model(config["model"])
    processor = get_processor(config["processor"])
    image = Image.open(config["image"])
    encoded_input = get_encoded_input(processor, image, config["classes"])
    prediction = get_prediction(model, encoded_input)
    predicted_class = config["classes"][prediction.argmax()]
    logging.info(f"Prediction over classes: {prediction.tolist()}")
    logging.info(f"Predicted class: {predicted_class}")


if __name__ == "__main__":
    main()
Let's now run our script. It's very easy: just run python clipzeroshot.py (ensure that you have the dependencies installed: transformers, torch and Pillow). What follows are results from a few examples; some simple ones, some difficult ones, and a different set of classes, showing you how easy it is to change classes and reuse the model for something else.
First, the fisherman's boat.
INFO:root:Prediction over classes: [[0.00018303414981346577, 0.9983105659484863, 0.0015064344042912126]]
INFO:root:Predicted class: fisherman's boat
Now a spaceship:
INFO:root:Prediction over classes: [[0.9909811615943909, 0.001415808335877955, 0.007602964993566275]]
INFO:root:Predicted class: spaceship
Let's make it a bit more difficult now.
Mix the spaceship and the fisherman's boat, with a ratio of approximately 70% to 30%.
I have never seen such a strange object before, but here goes:
INFO:root:Prediction over classes: [[0.7287755012512207, 0.26701247692108154, 0.004212076310068369]]
INFO:root:Predicted class: spaceship
Indeed, as expected, it is still classified as a spaceship, but the fisherman's boat has a 26.7% probability, suggesting that it's indeed a mix (and interestingly, the ~73% / ~27% / ~0% distribution is quite close to the 70%/30% mix from the prompt!).
Before, I suggested how easy it is to swap the classes. Let's do so by changing the JSON:
{
    "processor": "openai/clip-vit-base-patch32",
    "model": "openai/clip-vit-base-patch32",
    "classes": [
        "apple",
        "not an apple"
    ],
    "image": "./image_to_classify.jpg"
}
Time to classify this apple. And indeed:
INFO:root:Prediction over classes: [[0.5734764337539673, 0.4265235960483551]]
INFO:root:Predicted class: apple
(However, you can see that the model is not really certain, probably because of the large empty background. If we had used apple and banana instead, the class probabilities would have been 0.999/0.001.)
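As a final, optional tweak that is not part of the script above: the CLIP authors report that wrapping the raw labels in a prompt template such as "a photo of a {label}" tends to help zero-shot performance (Radford et al., 2021). With our setup, you could either put such phrases directly into the classes list in configuration.json, or apply the template in code, for example:

# Hypothetical tweak: turn bare labels into prompt-style captions before encoding
classes = ["spaceship", "fisherman's boat", "tanker"]
prompts = [f"a photo of a {label}" for label in classes]
encoded_input = get_encoded_input(processor, image, prompts)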
HuggingFace. (n.d.). openai/clip-vit-base-patch32. Hugging Face. https://huggingface.co/openai/clip-vit-base-patch32
Radford, A., Kim, J. W., Hallacy, C., Ramesh, A., Goh, G., Agarwal, S., ... & Sutskever, I. (2021, July). Learning transferable visual models from natural language supervision. In International Conference on Machine Learning (pp. 8748-8763). PMLR.
HuggingFace. (n.d.). CLIP. Hugging Face. https://huggingface.co/docs/transformers/model_doc/clip#usage-tips-and-example
MachineCurve. (2023, December 22). CLIP: How it works, how it's trained and how to use it. MachineCurve.com. https://machinecurve.com/index.php/2023/12/22/clip-how-it-works-is-trained-and-used