Badly-Compressed Images Affect CLIP’s Performance, New Research Contends

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

The historical quality of images scraped from the internet to train computer vision models has a notable impact on current research.

Partly, this is because the academic sector needs continuous benchmarking schemes against which the latest approaches can be comparatively tested. This provides year-on-year continuity and reliable standards, and allows researchers to demonstrate advances on the state-of-the-art in a manner that agrees with common scientific consensus.

Though this scrupulous adherence to much older standards causes little friction in fields where all the variables were settled long ago (such as basic chemistry, or the study of molecules in general), in computer vision the ground has shifted considerably since the governing standards were established.

Prior to the broadband revolution, images on the internet were saved with much higher levels of compression than they currently are, and with codecs that were notably less efficient and effective than modern image compression methods.

Even eleven years ago, when the highly influential ImageNet photo dataset was first scraped from the web, image quality was lower than it is today, with more efficient and better-performing codecs such as WebP yet to make an impact on the overall quality of images.

A historic look at the original ImageNet dataset, whose influence persists in computer vision, reveals the extent to which low-quality photos pervade the collection – and such photos will not always be filtered out by pretraining methods. Source: https://huggingface.co/datasets/imagenet-1k/viewer/default/train?p=438

Besides the slowly-easing resource constraints on image uploaders after the turn of the century, the growth of social networks tended to subject images to a ‘photocopier effect’, converting user-uploaded photos to the sizes and/or formats that the platforms preferred:

Sometimes this was done not necessarily for a better user experience, or even to conform to the sites’ diverse layouts, but rather to more effectively write user-specific metadata into photos that could later be exploited by advertising schemas and statistical systems (and, years later, this metadata, self-serving or not, would be sucked into major generative AI networks as a form of ‘cheap’ annotation).

Though it is quite possible to add such metadata to lossy formats like JPEG non-destructively (i.e., to leave the original image data intact), popular ‘plug and play’ server-side libraries such as Magick.NET did not support this. Reencoding would therefore occur, because it was cheaper and easier to use massively popular open source libraries with a few shortcomings than to concoct costly bespoke solutions.

Thus the same image could well get worse over the years, as the larger industrialized sites stayed online with their degraded versions of the source image, while the smaller, often hobbyist domains that had first provided a higher-quality original gradually fell into disuse and disappeared – a phenomenon known as link rot.

For these and sundry other reasons, a great deal of the data fed into modern transformative computer vision systems is old or degraded in some way. In the case of core loss functions – algorithms that tell a system how well it is learning from its data – the problem is most entrenched, since these functions reflect the state of data at the time they were devised or trained, and are rarely updated. Even if they could be updated, this would, again, break like-for-like continuity of results for the computer vision sector over time.

The loss function Structural Similarity Index (SSIM) here evaluates the difference between source imagery and possible perturbations – but is it living in the past? Source: https://www.nsf.gov/news/mmg/mmg_disp.jsp?med_id=79419

Living With Degraded Data

The net result is that the current crop of generative systems (just one example of affected systems) is tinged with nostalgia, and notably influenced by the state of internet images between roughly 2000 and 2013, despite the trend in recent years towards larger images, and the subsequent utter dominance of smartphone photography, which has reportedly led to 1.3 billion new images being shared on Instagram alone each day.

But if you need a non-watermarked picture of Ben Affleck from 1996, not only are the available versions going to be celluloid-sourced, with an ‘analog’ quality, but they will also be rarer, because film was more expensive than the reusable memory cards that decimated chemical photography from the early 2000s onwards. The smaller sites that may have provided a better original are either gone or, in any case, were likewise subject to resource constraints, and uploaded only a low-res or badly compressed image.

Somewhere in storage, in private .RAW files, in slide positives and in developed negative reels, are the better versions that would be such a boon to computer vision; as it stands, the sector must accommodate itself not only to the adulterated versions that were made available, but to accept the negative general effects of ‘legacy’ datasets, algorithms and practices.

In the case of the LAION dataset that powers the original and most influential V1.5 Stable Diffusion generative text-to-image model, the contributing images for any given concept are likely to include at least a handful of poor-quality examples.

Exploring the LAION database that powers Stable Diffusion, we can see images of the actor Ben Affleck, from over ten years ago, that are subject to notable compression, but which will not necessarily be filtered out through computer vision training processes. Source: https://rom1504.github.io/clip-retrieval/

In an ideal world, the pretraining routines that datasets are subject to before being passed into a training cycle would weed out the low-quality data. But what should the criteria be when low-quality data is all there is? If you’re looking for a real photo of Rasputin, a wide-reaching ‘quality’ filter would probably exclude every single one that’s available; and with hundreds of millions of images and sub-concepts to process, there is no feasible way to give each concept specialized attention and custom criteria.

CLIP's Robustness to Image Compression

In the case of the original Stable Diffusion model, a version of OpenAI's Contrastive Language-Image Pre-training (CLIP) framework was used to form connections between concepts (i.e., words associated with a web-scraped internet image, such as captions and file names) and their associated images. These learned relationships power the formidable generative capabilities of the system.

To prevent low-quality outlier data from adversely affecting CLIP’s ability to form these associations, the system has a number of filtering procedures that rank and prioritize data quality more intelligently. In the original work, the authors claimed that CLIP was remarkably robust to novel datasets with varying distributions (i.e., it was not overfitted, and could generalize well to new data).

Subsequently, CLIP has been widely adopted into a broad range of computer vision applications, as a state-of-the-art semantic image/text classifier.

However, a new work from the UK claims to have proved what many would feel to be logical, given the aforementioned constraints around data quality: that CLIP is indeed adversely affected by poor image compression in the source data of the 400 million text/image pairs on which it was trained.

The paper states:

‘[We] find that CLIP’s zero-shot prediction is sensitive to the quality of the input images. For example, the predicted text label for the same image can differ significantly when the image has been compressed using the discrete cosine [transform].

‘This is surprising because CLIP has been trained on over 400 million image-text pairs with images of various qualities and we would therefore expect it to be robust against degradation of the quality of the input images.’

The researchers used a mathematical technique called Integrated Gradients to test their hypothesis, and in doing so offer a method for analyzing similar vulnerabilities in other foundation models.

If a sector-wide consensus forms that CLIP’s performance is affected by badly compressed images, this implies a wider need to develop improved preprocessing methodologies. The current (perhaps slightly desperate) hope of the scene is that entirely automated and algorithmic methods can perform the filtering and pre-processing.

But so long as scant historical data – which must be accepted on its own terms, even if low quality – remains an essential value proposition in hyperscale datasets, there is no obvious easy or cheap way to perform the kind of filtering that the paper’s contentions suggest.

The new paper is titled Understanding the Vulnerability of CLIP to Image Compression, and comes from three researchers at the University of Bath.

Method

Integrated Gradients, a 2017 method used here to probe CLIP’s performance, offered a new way of evaluating the input features of a neural network, which are distinct from features in the accepted sense of the term. Input features are the raw values fed into the network (in the case of an image, its pixels), while learned features are individual characteristics extracted from the source data during training.

From the original 2017 paper, a comparison of standard image-based gradients and the greater clarity obtained from analysis through Integrated Gradients. Source: http://proceedings.mlr.press/v70/sundararajan17a/sundararajan17a.pdf

Polling the gradients of a model’s output (with respect to its input) is a standard practitioner method of estimating how the coefficients learned in the otherwise relatively opaque latent space of a trained model respond to particular inputs.
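As a minimal illustration (not drawn from the paper), the gradient of a single output logit with respect to the input pixels can be polled in PyTorch as follows; `model` and `target_class` are placeholders for any trained image classifier and any class index:

```python
import torch

def input_gradients(model, image, target_class):
    """Return the gradient of one output logit with respect to the input pixels."""
    image = image.clone().detach().requires_grad_(True)  # shape: (1, C, H, W)
    logits = model(image)                                # forward pass
    logits[0, target_class].backward()                   # backpropagate from a single logit
    return image.grad.detach()                           # per-pixel sensitivities, same shape as input
```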

The paper explains:

‘The Integrated Gradients method satisfies two axioms ‘sensitivity’ and ‘implementation invariance’ and can be used for most deep learning models. ‘Sensitivity’ refers to the property that when the outputs of the network are different at two features, the attributes should also be different. ‘Implementation invariance’ means that the attributes are the same for two functionally equivalent networks, that is networks having the same outputs given the same inputs.

‘Integrated gradients satisfy these axioms because it is defined as a path integral of the deep network (as a function) from the baseline input to the target.’
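In practice, the path integral is approximated with a Riemann sum over images interpolated between the baseline and the target. The sketch below is a minimal, generic PyTorch implementation of that idea (it is not the authors’ code), following the standard definition of the method:

```python
import torch

def integrated_gradients(model, x, baseline, target_class, steps=50):
    """Approximate IG_i(x) = (x_i - b_i) * mean of dF/dx_i along the straight path from b to x."""
    alphas = torch.linspace(0.0, 1.0, steps).view(-1, 1, 1, 1)  # interpolation coefficients
    path = baseline + alphas * (x - baseline)                   # (steps, C, H, W) images along the path
    path.requires_grad_(True)
    scores = model(path)[:, target_class].sum()                 # sum so one backward pass covers all steps
    grads, = torch.autograd.grad(scores, path)
    avg_grad = grads.mean(dim=0, keepdim=True)                  # average gradient along the path
    return (x - baseline) * avg_grad                            # per-pixel attributions
```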

Data and Tests

To validate the approach, the researchers tested it on image classification tasks. This is an apt choice, since classification covers the core functionality of CLIP, and straddles multiple possible and current applications of OpenAI’s approach (and its derivatives), from text-to-image generation to the use of CLIP in Large Language Models (LLMs) and various multimodal systems.

The two datasets used were CIFAR-10 and STL-10. Four image groups were created from CIFAR-10: one at original quality, and three incrementally degraded in quality using the Image.save function of the Python Pillow (PIL) imaging library.
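The exact compression settings are not reproduced here, but a minimal sketch of this kind of Pillow-based degradation – re-encoding each image as JPEG at progressively lower quality via Image.save – might look like the following; the quality values and file name are illustrative assumptions:

```python
from io import BytesIO
from PIL import Image

def degrade(img: Image.Image, quality: int) -> Image.Image:
    """Re-encode an image as JPEG at the given quality setting and decode it again."""
    buf = BytesIO()
    img.convert("RGB").save(buf, format="JPEG", quality=quality)  # lossy DCT-based compression
    buf.seek(0)
    return Image.open(buf).copy()

original = Image.open("example.png")        # placeholder for any source image
qualities = [75, 30, 10]                    # illustrative, increasingly aggressive settings
groups = [original] + [degrade(original, q) for q in qualities]
```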

Each group contained 10,000 images, and for each image CLIP was fed the prompt This is an image of [*] (where the token represents one of the CIFAR-10 classes airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck). Average precision was computed over all ten classes.
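A minimal sketch of this kind of zero-shot evaluation, using OpenAI’s reference clip package (whether the authors used this exact codebase is an assumption), scores each image against the ten prompts and takes the highest-scoring class; per-class scores of this kind can then be aggregated into average precision:

```python
import clip
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"
model, preprocess = clip.load("ViT-B/32", device=device)

classes = ["airplane", "automobile", "bird", "cat", "deer",
           "dog", "frog", "horse", "ship", "truck"]
text = clip.tokenize([f"This is an image of {c}" for c in classes]).to(device)

def predict(pil_image):
    """Zero-shot classification: return the class whose prompt best matches the image."""
    image = preprocess(pil_image).unsqueeze(0).to(device)
    with torch.no_grad():
        logits_per_image, _ = model(image, text)   # image-text similarity scores
        probs = logits_per_image.softmax(dim=-1)
    return classes[probs.argmax().item()]
```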

CLIP was then tested with all thirty of the pretrained image encoders employed in the 2021 work Learning Transferable Visual Models From Natural Language Supervision – the foundation of CLIP itself – and the resulting performance evaluated:

Results for average precision across the two tested datasets.

The authors comment:

‘We can observe that in the CIFAR-10 test, the precision scores decrease significantly as the image quality degrades in each case of the image encoder.

‘In the STL-10 test, we also observe a decrease in precision scores for all image encoders, although the amount of decrease is much smaller.’

Next, the authors used Integrated Gradients to probe CLIP, and to ascertain the extent to which image quality affects CLIP’s predictions. The baseline was set at original image quality, and the images resized to 224x224x3 – the input dimensions expected by CLIP’s image encoders.

Integrated Gradients were computed across various levels of compression degradation, and the impact overlaid on the original images in order to provide a visual representation of the effects.
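Wiring the earlier Integrated Gradients sketch to CLIP might look like the following; this is hypothetical plumbing rather than the paper’s code, and it reuses names assumed from the sketches above (model, text, preprocess, original, groups, and the integrated_gradients helper), with the class prompts acting as the classifier head:

```python
import torch

with torch.no_grad():
    text_features = model.encode_text(text)                # features for the ten class prompts
    text_features = text_features / text_features.norm(dim=-1, keepdim=True)

def clip_image_logits(images):
    """Hypothetical wrapper: map a batch of images to per-class logits against the fixed prompts."""
    feats = model.encode_image(images)
    feats = feats / feats.norm(dim=-1, keepdim=True)
    return 100.0 * feats @ text_features.T                  # scaled cosine similarities, CLIP-style

baseline = preprocess(original).unsqueeze(0).to(device)     # the clean image serves as the IG baseline
for degraded in groups[1:]:                                 # the three compressed versions
    x = preprocess(degraded).unsqueeze(0).to(device)
    attr = integrated_gradients(clip_image_logits, x, baseline, target_class=0)  # class index is illustrative
    heatmap = attr.abs().sum(dim=1).squeeze()               # collapse channels to a 224x224 map
    heatmap = heatmap / (heatmap.max() + 1e-8)              # normalize to [0, 1] before overlaying
```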

Experiments were performed on CLIP using ResNet50 and ViT-B/32 image encoders. Each encoder was presented with two examples with different baselines, and integrated gradients were plotted with negative, positive and both polarities – and labels were also predicted.

Results across ResNet50 and ViT-B/32.

The paper states:

‘We can observe [that] the integrated gradients provide accurate approximations to changes in the loss (which can be computed by taking difference of minus of the logarithm of the predicted scores). This shows integrated gradients serves as a good attribute for CLIP.’

The difference between the baseline quality and degraded quality is easily visualized by probing CLIP. Though we do not have space here to reproduce all the results of this type presented in the new paper, some examples appear in the image below:

Visualizations of integrated gradients over ResNet50. See source paper for better resolution.

Conclusion

Since the life expectancy of standards and benchmarks in computer vision research often far exceeds their currency and relevance to the modern scene (and to ever-developing standards and practices in data and data-gathering), questioning the efficacy of a new and highly popular library or methodology is a worthwhile pursuit – and even very recent history demonstrates that set-in-stone standards can be a downstream hazard for new work.

The findings of the new paper re-illustrate that the curation conundrum is not going away – and that the automated approaches that would allow better filtering of very high-volume datasets seem set to depend on tools and methodologies that continue to hark back to increasingly irrelevant data standards and practices.
