Released in January 2021, the source code for OpenAI’s Contrastive Language-Image Pre-Training (CLIP) framework has, at the time of writing, been forked some 1,700 times and garnered 11,200 stars on GitHub. CLIP crops up increasingly in computer vision research papers, particularly in research related to image and video synthesis – but also as a tool for a variety of related tasks, such as auto-captioning.
CLIP was trained on 400 million image/text pairs, mostly obtained from minimally-curated web-scraped data. This means that the quality and appropriateness of each pairing depend on how appositely and accurately the image was captioned, and on the extent to which the caption reflects any biases or agenda of the captioner.
Since manually checking those captions (and the quality of their relationship to the associated images) is a logistically and financially prohibitive task, CLIP is not a perfect system, and reflects some of the biases inherent in this ‘free’ data. But because it solves so many traditional problems related to image synthesis, and in such an adroit way, CLIP has captured the imagination of the research community, and now represents a notable transformative force in generative frameworks.
CLIP has been adapted into non-English languages, including Chinese; has been described as a ‘strong yet embarrassingly simple baseline’ for many of the thorniest problems in continual learning; has been used for NeRF-based 3D model generation; is being leveraged across various projects for zero-shot semantic segmentation; is proving a useful resource in robotics, bridging the semantic gap between what a machine sees and related natural-language concepts; has, as mentioned, been adapted into a text-only image captioning system; and has become a mainstay and quality-driver in the hugely popular Stable Diffusion text-to-image latent diffusion system (through the open-source OpenCLIP).
Let’s take a look at what CLIP can do – and some of the things that it can’t do, at least not yet.
How CLIP Works
CLIP attempts to form relationships between images and text by learning text/image associations from a very large number of such pairs – i.e., images drawn from public internet resources that have associated text, either in the form of metadata (such as alt-tags) or, more expressly, explicit captions.
Because it is trained on such a vast and diverse corpus of data, CLIP is able to make zero-shot predictions relating to user queries; which is to say that it has a good chance of associating the user-submitted query ‘cat’ with a picture of a cat (see image above), or of selecting an appropriate image in response to the text query ‘cat’ – without having been explicitly trained to detect cats, animals, or any related higher-level category.
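The zero-shot pattern can be sketched as follows, using toy NumPy vectors in place of real CLIP embeddings. Everything here is hypothetical illustration: in practice the vectors would come from CLIP’s text and image encoders (and are typically 512-dimensional), but the ranking step – pick the image whose embedding is closest to the query embedding – works the same way:

```python
import numpy as np

def normalize(v):
    # Project embeddings onto the unit hypersphere so dot products equal cosine similarity
    return v / np.linalg.norm(v, axis=-1, keepdims=True)

# Toy 4-d vectors standing in for real CLIP embeddings, which would come
# from CLIP's text and image encoders respectively (hypothetical values).
text_query = normalize(np.array([0.9, 0.1, 0.0, 0.1]))    # embedding of the prompt 'cat'
image_embeddings = normalize(np.array([
    [0.8, 0.2, 0.1, 0.0],   # embedding of a cat photo
    [0.0, 0.9, 0.3, 0.1],   # ...of a bridge photo
    [0.1, 0.0, 0.9, 0.2],   # ...of a salad photo
]))

# Zero-shot retrieval: rank the candidate images by cosine similarity to the query
similarities = image_embeddings @ text_query
best_match = int(np.argmax(similarities))
print(best_match)  # the cat photo (index 0) scores highest
```

Zero-shot *classification* is the same computation run in reverse: embed a set of candidate captions (‘a photo of a cat’, ‘a photo of a dog’, …) and rank them against a single image embedding.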
Since the prior standard in computer vision research was for a model to be trained on task-specific data (for instance, a recognition model exclusively designed to detect intruders in a security system), CLIP represents a notable departure from, and augmentation of, the existing research culture.
Besides being a multimodal (in this case, text+image) system, CLIP’s utility mainly lies in the sheer breadth of the data that it was trained on. This means that CLIP is capable of performing useful operations on out-of-distribution (OOD) data, i.e., images/concepts that it has never been exposed to.
In this way, CLIP can act as a functional intermediary for far more limited computer vision and image synthesis systems, since their specific interests (such as faces, bridges, churches, etc.) are likely to already have been well-incorporated into CLIP.
During training, image and associated text data are fed in volume into the system until common features are identified. In this sense, a feature is a persistent impression of a concept such as ‘dog’, where the most common visual canine traits coalesce into an impressionistic embedding. This embedding becomes correlated with the words that accompanied the pictures that created it.
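As a rough illustration of the kind of objective involved, the following NumPy sketch computes a CLIP-style symmetric contrastive loss over a small batch of hypothetical embedding pairs. This is a simplified sketch of the idea, not CLIP’s actual implementation: matched image/text pairs sit on the diagonal of the similarity matrix, and the loss rewards pulling those pairings together relative to all mismatched combinations in the batch:

```python
import numpy as np

def logsumexp(x, axis):
    # Numerically stable log-sum-exp, used as the softmax normalizer
    m = np.max(x, axis=axis, keepdims=True)
    return m + np.log(np.sum(np.exp(x - m), axis=axis, keepdims=True))

def clip_style_loss(image_emb, text_emb, temperature=0.07):
    """Symmetric contrastive loss in the style of CLIP (a simplified sketch).

    Each image should match its paired caption (the diagonal of the
    similarity matrix) against every other caption in the batch, and vice versa.
    """
    img = image_emb / np.linalg.norm(image_emb, axis=1, keepdims=True)
    txt = text_emb / np.linalg.norm(text_emb, axis=1, keepdims=True)
    logits = (img @ txt.T) / temperature          # pairwise cosine similarities
    n = logits.shape[0]
    # Cross-entropy in both directions: image -> text (rows) and text -> image (columns)
    i2t = logits - logsumexp(logits, axis=1)
    t2i = logits - logsumexp(logits, axis=0)
    return -(np.trace(i2t) + np.trace(t2i)) / (2 * n)

rng = np.random.default_rng(0)
batch = rng.normal(size=(4, 8))                       # 4 toy embedding pairs, 8-d each
aligned_loss = clip_style_loss(batch, batch)          # correctly paired batch
shuffled_loss = clip_style_loss(batch, batch[::-1])   # deliberately mismatched pairs
print(aligned_loss, shuffled_loss)
```

With correctly paired embeddings the diagonal dominates and the loss is low; shuffling the pairings raises it, and that gap is the signal that drives training. (In the real model, the temperature is itself a learned parameter.)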
Working With Available Material
For SEO purposes, much of the text data for the images on which CLIP was trained features ‘keyword stuffing’ – an old trick, whose effectiveness has diminished greatly over the years, in which the available text space is packed with as many ‘related words’ and concepts as possible. The hope is that the image, when indexed by search engines, will crop up in as wide a range of results as possible – which the uploader also hopes will lead to additional clicks, traffic, and a better search-engine ranking.
Since there is little scope to manually edit so many captions, inaccurate or tangential captions remained in place during CLIP training. In the right-most image above, we see some typical ‘black hat’ SEO tricks, where the uploader has appended the entirely unrelated ‘home design ideas’ and other similar tags to a popular gaming/outdoors image.
If this kind of unrelated alt-tag is infrequent, it won’t become damagingly embedded into the core concept of what the image is truly about – but it does represent ‘noisy’ data, and can affect CLIP’s accuracy in some edge cases, and even in general usage.
Likewise, captions or alt-tags may be overly minimal or unhelpfully lyrical – for example, the middle image above contains information about what the depicted eagle is doing (‘soaring’), but does not contain the word ‘eagle’ (or the higher-level concept/class ‘bird’), leaving CLIP to re-associate that concept by itself, based on whatever other ‘eagle’ or avian images it has assimilated.
For this reason, among others, CLIP-dependent text-to-image generative systems such as Stable Diffusion may exhibit strange or unexpected associations with particular phrases, or present bizarre conjunctions of concepts that most people would never have associated with the text prompt.
Ironically, since CLIP is currently becoming popular as a potential automated way of generating image captions, any such inaccuracies in its text/image associations risk becoming multiplied and truly ‘ingrained’ over time, as future versions of CLIP and its many derived forks begin to ‘feed’ on web-based images that were themselves captioned by CLIP.
Limitations of Prediction Capabilities in CLIP
CLIP is so impressive, and so useful, that it’s easy to forget that it is only regurgitating human-annotated images; that the ‘intelligence’ on display is actually human intelligence, iterated systematically; and that, by itself, CLIP can make some rather unhelpful assumptions, associations and predictions regarding image data.
One recent paper, a collaboration between Columbia University and Microsoft Research, observes: ‘We find that state-of-the-art visual models, such as CLIP, often provide incorrect rationales for their categorical predictions.’
In other words, CLIP cannot easily explain why it is presenting an association or an image, because it is simply a very complex web of image/word associations, albeit one containing detailed class hierarchies (e.g., animal > dog > golden retriever).
The aforementioned paper offers an approach missing in CLIP – a method of justifying the way that contributing elements can lead a visual recognition system to make a prediction:
In the above example from the paper, it’s possible that CLIP is capable of recognizing every single constituent part of the image, such as individual vegetables. But it is more likely to make a correct ‘Greek salad’ prediction based on its trained knowledge of whole pictures of Greek salads (i.e., based on pixel-derived features of a particular formation of colors and shapes falling into a ‘food’ class hierarchy) than by identifying the individual facets of the salad and recognizing the associations that lead to ‘Greek salad’.
This is due, again, to the nature of the training data, which is more likely to feature broad, all-encompassing captions than fine-grained multiple captions for the constituent parts of any particular image. Whether such part-level relationships are noticed and embedded in their own right falls to CLIP, and depends on the number of available images for any given concept.
Where the data is scant, such relationships are unlikely to form. If there are twenty ‘Greek salad’ images captioned ‘Greek salad’, and only one that actually describes and annotates the ingredients, the latter will probably be treated as outlier data (unless the ‘inner objects’ described correspond enough to other points in the training data).
Reading the Room
Further to the challenge of inferring objects/concepts from their context and relationships to other objects, another recent paper, this time a collaboration between KAUST and Snap Inc., explores the extent to which an object’s relationship with other objects may help to define what that object is – or, put more simply, the extent to which we more easily recognize things in a familiar rather than an abstract context.
However, inference through context can be a trap rather than a benefit for CLIP, which tends to entangle objects and concepts with their environments and/or adjacent concepts; and which, operating as a component in a generative model such as Stable Diffusion, will usually produce ‘obvious’ contexts.
For example, people in swimwear will normally be on the beach, and food will normally be on a table (and, due to the large amount of Instagramming of food that’s making its way into AI-facing datasets, will often be represented in an aerial view).
In the example (non-cherry-picked) images below, from a vanilla local Stable Diffusion installation, not one of the men appears outside a beach context, and not one of the ‘meals’ is presented in the sense of a ‘family meal’. Instead, dominant trends in the training data seem to have dictated the overriding ambience and style of, respectively, popular ‘holiday’ images and Instagram/menu content.
Additional limitations, outlined by OpenAI in the original presentation of the framework, include the fact that CLIP is not resilient enough to interpret imagery not covered in its original training data. This does not mean that such images were necessarily absent from the training data; it could instead indicate that the captioning was not of sufficient quality to categorize the imagery correctly.
One example is that CLIP achieves only 88% accuracy on character recognition – crucial functionality for interpreting text that may appear in images, such as signs, menus or instructions.
OpenAI also notes that CLIP’s categories and classes are not necessarily granular enough for every type of object it may encounter, such as particular models of car, other types of vehicle, or species of flower.