The historic quality of images scraped from the internet to train computer vision models has a notable impact on current research.
Partly, this is because the academic sector needs continuous benchmarking schemes against which the latest approaches can be comparatively tested. This allows year-on-year continuity and reliable standards – and lets researchers demonstrate advances on the state of the art in a manner that accords with common scientific consensus.
Though this scrupulous rigor and adherence to much older standards suffers little entropy in fields where the variables were settled long ago (such as foundational chemistry, or the study of molecules in general), in computer vision the ground has shifted considerably since the governing standards were established.
Prior to the broadband revolution, images on the internet were saved at much higher levels of compression than they are today, using codecs notably inferior in efficiency and effectiveness to modern image compression methods.
Even eleven years ago, when the highly influential ImageNet photo dataset was first scraped from the web, image quality was lower than it is today, with more efficient and better-performing codecs such as WebP yet to make an impact on the overall quality of internet images.
Besides the slowly improving resource constraints on image uploaders after the turn of the century, the growth of social networks tended to subject images to the 'photocopier effect', by converting user-uploaded photos to whatever sizes and/or formats the platforms preferred:
Sometimes this was done not for a better user experience, or even to conform to the sites' diverse layouts, but rather to write user-specific metadata into photos more effectively – metadata that could later be exploited by advertising schemas and statistical systems (and, years later, self-serving or not, would be sucked into major generative AI networks as a form of 'cheap' annotation).
Though it is quite possible to add such metadata to lossy formats like JPEG non-destructively (i.e., leaving the original compressed image data intact), popular 'plug and play' server-side libraries such as Magick.NET did not support this. Re-encoding would therefore occur, because it was cheaper and easier to use massively popular open source libraries with a few shortcomings than to concoct costly bespoke solutions.
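As a minimal sketch of why such tagging can be non-destructive, the snippet below (a hypothetical helper operating on stand-in bytes, not a real photo) inserts an APP1 metadata segment directly after a JPEG's SOI marker – the compressed scan data is never decoded or re-encoded:

```python
import struct

def insert_app1(jpeg_bytes: bytes, payload: bytes) -> bytes:
    """Insert an APP1 metadata segment straight after the SOI marker,
    leaving the compressed image data untouched (no re-encoding)."""
    assert jpeg_bytes[:2] == b"\xff\xd8", "not a JPEG (missing SOI marker)"
    # The segment length field counts itself (2 bytes) plus the payload.
    segment = b"\xff\xe1" + struct.pack(">H", len(payload) + 2) + payload
    return jpeg_bytes[:2] + segment + jpeg_bytes[2:]

# Minimal stand-in for a JPEG: SOI marker followed by (fake) scan data.
original = b"\xff\xd8" + b"<compressed scan data>"
tagged = insert_app1(original, b"Exif\x00\x00user-id=12345")

# The original compressed data survives byte-for-byte.
assert tagged.endswith(b"<compressed scan data>")
```

A real implementation would also have to respect existing APP segments and JFIF ordering rules, but the principle – splice metadata between markers, never touch the entropy-coded data – is the same.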
Thus the same image could quite possibly get worse over the years: the larger, industrialized sites stayed online with their degraded versions of the source image, while the smaller, often hobbyist domains that had first provided a higher-quality original gradually fell into disuse and disappeared – a phenomenon known as link rot.
For these, and sundry other reasons, a great deal of the data that gets fed into modern transformative computer vision systems is old or degraded in some way. In the case of core loss functions – the algorithms that guide how effectively a system learns from its data – the entropy is most indelible, since these functions reflect the state of data at the time they were developed, and are rarely updated. Even if they could be updated, this would, again, break like-for-like continuity of results for the computer vision sector over time.
Living With Degraded Data
The net result of this is that the current crop of generative systems (just one example of affected systems) is tinged with nostalgia, and notably influenced by the state of internet images between 2000 and 2013, more or less, despite the trend in recent years towards larger images, and the subsequent utter dominance of smartphone photography, which has led to 1.3 billion new images being shared on Instagram alone each day.
But if you need a non-watermarked picture of Ben Affleck from 1996, not only will the available versions be celluloid-based, with an 'analog' quality, but they will be rarer, because film was more expensive than the reusable memory cards that decimated chemical photography from the early 2000s. The smaller sites that may have provided a better original are either gone, or were in any case likewise subject to resource constraints, and uploaded only a low-res or badly compressed image.
Somewhere in storage, in private .RAW files, in slide positives and in developed negative reels, are the better versions that would be such a boon to computer vision; as it stands, the sector must accommodate itself not only to the adulterated versions that were made available, but to accept the negative general effects of ‘legacy’ datasets, algorithms and practices.
In the case of the LAION dataset that powers the most influential V1.5 release of the Stable Diffusion generative text-to-image model, the contributing images for any given concept are likely to include at least a handful of poor-quality images.
In an ideal world, the pretraining routines that datasets are subject to before being passed into a training cycle would weed out the low-quality data. But what should be your criteria when low-quality data is all there is? If you’re looking for a real photo of Rasputin, a wide-reaching ‘quality’ filter would probably filter out every single one that’s available; and with hundreds of millions of images and sub-concepts to process, there is no feasible way to give each concept specialized attention and custom criteria.
CLIP's Robustness to Image Compression
In the case of the original Stable Diffusion model, a version of OpenAI's Contrastive Language-Image Pre-training (CLIP) framework was used to form connections between concepts (i.e., words associated with a web-scraped image, such as captions and file names) and the images themselves. These learned relationships power the formidable generative capabilities of the system.
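In rough outline, the inference side of this mechanism can be sketched as a cosine-similarity lookup between embeddings. The toy three-dimensional vectors below stand in for CLIP's real high-dimensional outputs, and the function names are illustrative, not OpenAI's API:

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def zero_shot_classify(image_embedding, prompt_embeddings):
    """Pick the caption whose embedding lies closest (by cosine
    similarity) to the image embedding - the essence of CLIP's
    zero-shot classification mode."""
    scores = {label: cosine(image_embedding, emb)
              for label, emb in prompt_embeddings.items()}
    return max(scores, key=scores.get), scores

# Toy 3-d embeddings standing in for CLIP's real embedding space.
prompts = {
    "This is an image of a cat": [0.9, 0.1, 0.0],
    "This is an image of a dog": [0.1, 0.9, 0.0],
}
image = [0.8, 0.2, 0.1]
label, _ = zero_shot_classify(image, prompts)
# label == "This is an image of a cat"
```

In the real system, both encoders are deep networks trained jointly so that matching image/text pairs score higher than mismatched ones; the lookup step itself is this simple.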
To prevent low-quality outlier data from adversely affecting CLIP’s ability to form these associations, the system has a number of filtering procedures that rank and prioritize data quality more intelligently. In the original work, the authors claimed that CLIP was remarkably robust to novel datasets with varying distributions (i.e., it was not overfitted, and could generalize well to new data).
Subsequently, CLIP has been widely adopted into a broad range of computer vision applications, as a state-of-the-art semantic image/text classifier.
However, a new work from the UK claims to have proved what many would feel to be logical, given the aforementioned constraints around data quality: that CLIP is indeed adversely affected by poor image compression in the source data of the 400 million text/image pairs on which it was trained.
The paper states:
‘[We] find that CLIP’s zero-shot prediction is sensitive to the quality of the input images. For example, the predicted text label for the same image can differ significantly when the image has been compressed using the discrete cosine [transform].
‘This is surprising because CLIP has been trained on over 400 million image-text pairs with images of various qualities and we would therefore expect it to be robust against degradation of the quality of the input images.’
The researchers have used a mathematical technique called Integrated Gradients to support their hypothesis, and in doing so offer a method of analyzing similar vulnerabilities in other foundation models.
If a sector consensus forms that CLIP’s performance is affected by badly compressed images, it has some implications in terms of wider recognition of the need to develop improved preprocessing methodologies. The current (perhaps slightly desperate) hope of the scene is that entirely automated and algorithmic ways can be used to perform filtering and pre-processing.
But so long as scant historical data – which must be accepted on its own terms, even if low quality – remains an essential value proposition in hyperscale datasets, there is no obvious easy or cheap way to perform the kind of filtering that the paper’s contentions suggest.
The new paper is titled Understanding the Vulnerability of CLIP to Image Compression, and comes from three researchers at the University of Bath.
The 2017 method Integrated Gradients, used here to probe CLIP's performance, offered a new way of evaluating the input features of a neural network, which are distinct from features in the accepted sense of the term: input features represent the raw data in its entirety, while features proper are individual characteristics extracted from that data during training.
‘The Integrated Gradients method satisfies two axioms ‘sensitivity’ and ‘implementation invariance’ and can be used for most deep learning models. ‘Sensitivity’ refers to the property that when the outputs of the network are different at two features, the attributes should also be different. ‘Implementation invariance’ means that the attributes are the same for two functionally equivalent networks, that is networks having the same outputs given the same inputs.
‘Integrated gradients satisfy these axioms because it is defined as a path integral of the deep network (as a function) from the baseline input to the target.’
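A minimal numerical sketch of the technique follows, with a toy function standing in for a deep network and finite differences standing in for backpropagated gradients. It also checks that attributions sum to the difference between the outputs at the target and the baseline (a property the original Integrated Gradients paper calls 'completeness'):

```python
def grad(f, x, eps=1e-6):
    """Central-difference approximation to the gradient of f at x."""
    g = []
    for i in range(len(x)):
        hi, lo = x[:], x[:]
        hi[i] += eps
        lo[i] -= eps
        g.append((f(hi) - f(lo)) / (2 * eps))
    return g

def integrated_gradients(f, x, baseline, steps=200):
    """Riemann-sum (midpoint rule) approximation of the path integral
    from the baseline input to the target input, per feature."""
    attributions = [0.0] * len(x)
    for k in range(steps):
        alpha = (k + 0.5) / steps
        point = [b + alpha * (xi - b) for xi, b in zip(x, baseline)]
        g = grad(f, point)
        for i in range(len(x)):
            attributions[i] += (x[i] - baseline[i]) * g[i] / steps
    return attributions

# Toy "network": f(x) = x0^2 + 3*x1
f = lambda x: x[0] ** 2 + 3 * x[1]
x, baseline = [2.0, 1.0], [0.0, 0.0]
attr = integrated_gradients(f, x, baseline)

# Completeness: attributions sum to f(x) - f(baseline) = 7.
assert abs(sum(attr) - (f(x) - f(baseline))) < 1e-3
```

In the paper's setting, `f` is CLIP's prediction score, `x` is the original image, and the baseline is varied to study compression; the path-integral machinery is identical.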
Data and Tests
To prove the approach, the researchers tested it on image classification tasks. This is an apt approach, since it covers the core functionality of CLIP, and straddles multiple possible and current applications of OpenAI’s approach (and its derivatives), from text-to-image generation to the use of CLIP in Large Language Models (LLMs) and various multimodal systems.
The two datasets used were CIFAR-10 and STL-10. Four image groups were created from CIFAR-10: one at original quality, and the other three incrementally degraded in quality using the Image.save function of the Python Pillow (PIL) imaging library.
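This kind of degradation step can be sketched with Pillow as follows; the noise image and the quality settings are illustrative stand-ins, not the paper's exact values:

```python
import random
from io import BytesIO
from PIL import Image

def degrade(img, quality):
    """Round-trip an image through JPEG compression at a given quality
    setting, returning the degraded image and its compressed size."""
    buf = BytesIO()
    img.save(buf, format="JPEG", quality=quality)
    buf.seek(0)
    return Image.open(buf), buf.getbuffer().nbytes

# A noisy stand-in image (real CIFAR-10 images are 32x32 RGB).
random.seed(0)
img = Image.new("RGB", (32, 32))
img.putdata([tuple(random.randrange(256) for _ in range(3))
             for _ in range(32 * 32)])

# Three illustrative degradation levels, from mild to severe.
sizes = {q: degrade(img, q)[1] for q in (90, 50, 10)}
```

Lower `quality` values discard more high-frequency (discrete cosine transform) detail, which is precisely the degradation the paper finds CLIP to be sensitive to.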
Each group contained 10,000 images. For each, CLIP was fed the prompt This is an image of [*] (where the token represents one of the CIFAR-10 classes airplane, automobile, bird, cat, deer, dog, frog, horse, ship, and truck). Average precision was computed over all ten classes.
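Average precision for a single class can be sketched as below; the scores and labels are invented for illustration, with the final metric being the mean of this quantity over the ten classes:

```python
def average_precision(scores, labels):
    """AP for one class: rank items by score, then average the
    precision measured at each position where a true positive sits."""
    ranked = sorted(zip(scores, labels), key=lambda t: -t[0])
    hits, precisions = 0, []
    for rank, (_, is_positive) in enumerate(ranked, start=1):
        if is_positive:
            hits += 1
            precisions.append(hits / rank)
    return sum(precisions) / max(hits, 1)

# Toy similarity scores for the prompt "This is an image of a cat",
# one score per image; True marks images that really are cats.
scores = [0.92, 0.85, 0.40, 0.35, 0.10]
labels = [True, False, True, False, False]
ap = average_precision(scores, labels)
# Precision at the two hits is 1/1 and 2/3, so AP = (1 + 2/3) / 2
```

If compression pushes true positives down the ranking, their precision-at-hit values fall, and AP falls with them, which is the pattern the paper reports.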
CLIP was then evaluated with all thirty of the pretrained image encoders employed in the 2021 work Learning Transferable Visual Models From Natural Language Supervision – the work that introduced CLIP itself – and the subsequent performance assessed:
The authors comment:
‘We can observe that in the CIFAR-10 test, the precision scores decrease significantly as the image quality degrades in each case of the image encoder.
‘In the STL-10 test, we also observe a decrease in precision scores for all image encoders, although the amount of decrease is much smaller.’
Next, the authors used Integrated Gradients to probe CLIP, and to ascertain the extent to which image quality affects CLIP's predictions. The baseline was set at original image quality, and the images resized to 224x224x3 – the only input dimensions that CLIP accepts.
Integrated Gradients were computed across various levels of compression degradation, and the impact overlaid on the original images in order to provide a visual representation of the effects.
Experiments were performed on CLIP using the ResNet50 and ViT-B/32 image encoders. Each encoder was presented with two examples with different baselines; integrated gradients were plotted at negative, positive and both polarities, and labels were predicted for each case.
The paper states:
‘We can observe [that] the integrated gradients provide accurate approximations to changes in the loss (which can be computed by taking difference of minus of the logarithm of the predicted scores). This shows integrated gradients serves as a good attribute for CLIP.’
The difference between the baseline quality and the degraded quality is easily visualized by probing CLIP. Though there is not space here to reproduce all the results of this type presented in the new paper, some example results appear in the image below:
Since the life expectancy of standards and benchmarks in computer vision research often far exceeds their currency and relevance to the modern scene (and to ever-developing standards and practices in data and data-gathering), questioning the efficacy of a new and highly popular library or methodology is a worthwhile pursuit – and even very recent history demonstrates that set-in-stone standards can be a downstream hazard for new works.
The findings of the new paper re-illustrate that the curation conundrum is not going away – and that the automated approaches that would allow better filtering of very high-volume datasets seem set to depend on tools and methodologies that continue to hark back to increasingly irrelevant data standards and practices.