Not all fake news requires AI; in many cases, it's enough to present a real image together with a misleading caption (a so-called 'cheapfake' or 'shallowfake'), or to perform simple Photoshop manipulations.

Though the ability to abuse the photography-equals-reality conceit has existed since the invention of the photographic medium in the 1820s, a number of initiatives have sprung up in recent years that leverage machine learning to identify instances of mislabeling – or 'malicious association' – in which otherwise innocuous, non-deceptive photographs are misrepresented through captioning or context.
Catching Out-of-Context Misinformation with Self-Supervised Learning (COSMOS) is one of the most substantial such projects, and offered the first viable cheapfake detection routine, together with a large-scale dataset of 200,000 images and 450,000 text captions, with each caption labeled as 'Out-Of-Context' (OOC) or 'Not-Out-Of-Context' (NOOC).
The COSMOS on Steroids project subsequently built on the original work by offering an additional methodology, titled Differential Sensing, which adds a negative and positive switch to each caption (i.e., 'true' or 'not true').

A later work used a bottom-up detection model with visual-semantic reasoning to improve further on the COSMOS baseline results.

Though these approaches use recent and sophisticated methods of evaluating the semantic relationship between images and captions/annotations, they don’t touch on the possibilities of using the new breed of generative text-to-image systems, including Latent Diffusion Models such as Stable Diffusion, to aid in recognizing these deceptive associations.
However, a new paper from the University of Bergen in Norway takes exactly this approach, using multiple text-to-image generations from DALL-E 2 and Stable Diffusion to determine what a more plausible image result would be for captions associated with cheapfakes – and, in this way, to provide a method for evaluating the shortfall in alignment between cheapfaked image/caption pairs.

The new paper is titled Detecting Out-of-Context Image-Caption Pairs in News: A Counter-Intuitive Method, and comes from three researchers at Bergen.
Approach
The new work detects OOC caption/image pairs by comparing the perceptual similarity of images generated from the captions in the original COSMOS dataset.

The system utilizes a feature-based approach for determining image similarity (i.e., between the original and generated images), and uses object detectors and object encoders to extract feature representations from the images. The OOC/NOOC results are then compared against the gold labels from the COSMOS dataset.
Using DALL-E 2 for this particular purpose is quite challenging, since many of the captions in the COSMOS dataset are, naturally, provocative, and DALL-E 2 is likely to refuse to generate images based on such content. One example is that DALL-E 2 will not generate material related to COVID-19, and such material (due to the period in which these earlier projects took place, and the central motivations of this line of research) is present in quite large quantities in COSMOS.
The researchers therefore had to pre-process the captions using Named Entity Recognition (NER), which, for example, replaces the proper noun 'Obama' with the generic noun 'person'. They also note that this approach helps to generalize the method:
‘NER also helps decrease the abstraction between the caption and the images, as neither the image generation models nor the object detection models will distinguish between types of persons or locations. We believe this will make similarity comparisons easier, thus increasing the accuracy of our model.’
The NER augmentations were re-checked using the spaCy library (as used in the COSMOS project), and an additional list of potentially problematic words was compiled before prompts were created for Stable Diffusion and DALL-E 2. The basis for this word list was supplied by the open source List of Dirty, Naughty, Obscene, and Otherwise Bad Words (LDNOOBW) from Shutterstock.
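As an illustration, the caption pre-processing might look something like the sketch below, which uses spaCy for NER and an LDNOOBW-style word list; the entity-to-noun mapping and function names here are our assumptions, not the authors' exact code:

```python
# Minimal sketch of the caption pre-processing step (not the authors' exact code):
# spaCy NER swaps proper nouns for generic nouns, and an LDNOOBW-style blocklist
# filters out words likely to trip DALL-E 2's content policy.
import spacy

nlp = spacy.load("en_core_web_sm")

# Hypothetical mapping from spaCy entity labels to generic nouns
GENERIC_NOUNS = {
    "PERSON": "person",
    "GPE": "location",
    "LOC": "location",
    "ORG": "organization",
}

def anonymize_caption(caption: str) -> str:
    """Replace named entities with generic nouns, e.g. 'Obama' -> 'person'."""
    doc = nlp(caption)
    result = caption
    # Replace from the last entity backwards so character offsets stay valid
    for ent in reversed(doc.ents):
        generic = GENERIC_NOUNS.get(ent.label_)
        if generic:
            result = result[:ent.start_char] + generic + result[ent.end_char:]
    return result

def remove_blocked_words(caption: str, blocklist: set) -> str:
    """Drop any word that appears on the (LDNOOBW-derived) blocklist."""
    return " ".join(w for w in caption.split() if w.lower() not in blocklist)

print(anonymize_caption("Obama greets supporters in Chicago"))
# e.g. -> "person greets supporters in location"
```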
The project uses the test set from COSMOS, comprising 1,700 images with 3,400 associated captions. One image is generated for each caption that was paired with the original image, eventually producing 3,400 synthetic images in each dataset, for a total of 6,800 generated images.
For compatibility, and to conserve computing resources, the images were generated at the native Stable Diffusion V1.5 resolution of 512x512px, and the generation process was automated via Python. Each Stable Diffusion image took approximately 15 seconds to create on a Google Colab GPU, while each DALL-E 2 image took around seven seconds to create via a more burdensome API pipeline.
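The paper only specifies that generation was scripted in Python; a plausible sketch of the Stable Diffusion side of that pipeline, using the Hugging Face diffusers library (our assumption, not necessarily the authors' tooling), might look like this:

```python
# Hedged sketch of automated 512x512 Stable Diffusion V1.5 generation; the use of
# the 'diffusers' library and this checkpoint are assumptions, since the paper only
# states that generation was automated via Python.
import os
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5",
    torch_dtype=torch.float16,
).to("cuda")

def generate_images(captions, out_dir="sd_generated"):
    """Generate one 512x512 image per pre-processed caption."""
    os.makedirs(out_dir, exist_ok=True)
    for i, caption in enumerate(captions):
        image = pipe(caption, height=512, width=512).images[0]
        image.save(os.path.join(out_dir, f"{i:05d}.png"))
```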
The choice of metrics to evaluate how well the pipeline is working is tricky when synthetic images are central to the workflow, since generative models introduce an inherent randomness that can undermine the process. Traditional approaches such as the Structural Similarity Index (SSIM), the authors state, are therefore not ideal for the use case.
Instead, they used newer feature-based techniques for comparison purposes, testing various object detection models, such as YOLO and Mask R-CNN, which identify objects within images. These extracted objects can then be treated as sub-images in their own right for the metric evaluation process.
The feature vectors extracted from these in-image segments are compared to the feature representations obtained from the entire image, with the aforementioned Mask R-CNN tested along with the v5 and v7 iterations of YOLO.
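A minimal sketch of that detection-and-cropping step, using the YOLOv5 release available through torch.hub (one of the detectors the paper tests; the loading route and crop extraction shown here are our assumptions, not the paper's code):

```python
# Sketch: detect objects in an image and extract each detection as a crop, so that
# the crops can be encoded as sub-images alongside the whole image.
import torch

# Small YOLOv5 variant via torch.hub; the paper does not specify which weights were used
detector = torch.hub.load("ultralytics/yolov5", "yolov5s")

def extract_object_crops(image_path: str):
    """Return a list of cropped object regions (as numpy arrays) for one image."""
    results = detector(image_path)
    # crop(save=False) returns one dict per detection, with the pixels under 'im'
    return [c["im"] for c in results.crop(save=False)]
```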
In a second run at the challenge, the authors also tested various object encoders, including ResNet, DenseNet, EfficientNet and CLIP. Across all these approaches, similarity is estimated via cosine similarity, and the scores are used to predict OOC/NOOC captions.
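As a concrete illustration of the encoder-plus-cosine-similarity scoring, the sketch below uses a torchvision ResNet-50 as the object encoder (one of several encoders the authors test); the 0.5 decision threshold is purely illustrative and not taken from the paper:

```python
# Sketch of feature extraction and cosine-similarity scoring; ResNet-50 stands in
# for whichever object encoder is used, and the threshold is an illustrative guess.
import torch
import torch.nn.functional as F
from PIL import Image
from torchvision import models, transforms

encoder = models.resnet50(weights=models.ResNet50_Weights.IMAGENET1K_V2)
encoder.fc = torch.nn.Identity()   # strip the classifier to get a 2048-d feature vector
encoder.eval()

preprocess = transforms.Compose([
    transforms.Resize(256),
    transforms.CenterCrop(224),
    transforms.ToTensor(),
    transforms.Normalize(mean=[0.485, 0.456, 0.406], std=[0.229, 0.224, 0.225]),
])

@torch.no_grad()
def embed(image_path: str) -> torch.Tensor:
    """Extract a feature vector for one image (or one object crop saved to disk)."""
    img = preprocess(Image.open(image_path).convert("RGB")).unsqueeze(0)
    return encoder(img).squeeze(0)

def cosine_score(image_a: str, image_b: str) -> float:
    """Cosine similarity between the feature vectors of two images."""
    return F.cosine_similarity(embed(image_a), embed(image_b), dim=0).item()

def predict_ooc(score: float, threshold: float = 0.5) -> bool:
    """Illustrative decision rule: a low similarity score between the images
    generated for a caption pair suggests the captions diverge, i.e. OOC."""
    return score < threshold
```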
Data and Tests
To test the new datasets and approach, the researchers conducted qualitative and quantitative tests. For the former, a survey was devised in which users were asked to rate the perceptual similarity of 24 generated image pairs on a 1-10 scale; one group of pairs was drawn from each dataset, for a total of 48 generated pairs, with an even distribution of OOC and NOOC labels.

Regarding these results, the authors state:
‘Our survey shows that rating the similarity between the images is a difficult task even for humans. The rating distribution shows a high variation in similarity scores for the same images. [The image above shows] an image pair where the variation of ratings is high, showing that the perceptual similarity of images highly varies from viewer to viewer.
‘The caption pair used to generate the image pair in [the image above] is NOOC. The average rating indicates that participants correctly identify this. This shows that the similarity in the image pair correlates to the similarity in the caption pair, and the text-to-image model effectively captures the semantic similarity.’
The researchers also note that the average ratings for the corresponding OOC/NOOC labels can be mapped onto the gold labels from the COSMOS dataset, for the caption sets used to generate the images.

The authors mention that the human-annotated scores allow them to know whether their own prediction model is in step with human perception – an objective further demonstrated in comparisons across diverse models during the survey phase, involving variations of DenseNet and ResNet, as well as CLIP:

For the quantitative round, an automated prediction model is used, cycling through eight different object encoders, in different phases of the tests. The encoders are paired with three different object detection models, testing (in the phase shown below) for accuracy, precision, recall and F1 score.
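A bare-bones sketch of that quantitative scoring against the COSMOS gold labels, using scikit-learn (our choice of library; the variable names are illustrative):

```python
# Sketch: score OOC/NOOC predictions against the COSMOS gold labels.
from sklearn.metrics import accuracy_score, precision_score, recall_score, f1_score

def evaluate(predicted_ooc, gold_ooc):
    """Both arguments are lists of booleans, where True means Out-Of-Context."""
    return {
        "accuracy": accuracy_score(gold_ooc, predicted_ooc),
        "precision": precision_score(gold_ooc, predicted_ooc),
        "recall": recall_score(gold_ooc, predicted_ooc),
        "f1": f1_score(gold_ooc, predicted_ooc),
    }
```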

The authors comment:
‘YOLOv7 produces a slightly better detection accuracy when paired with ResNet50 than YOLOv5 and MASK-RCNN. However, the slight increase in accuracy comes with a huge increase in runtime when utilizing normal GPUs. While the other variations use around 30 minutes on prediction on the entire dataset, YOLOv7 uses around 1 hour and 30 minutes. MASK-RCNN, despite boasting better detection accuracy than YOLOv5, actually performs worse than YOLOv5 paired with all object encoders except for EfficientNet on the DALL-E 2 dataset, where it returns a 1% better accuracy.
‘However, on the Stable Diffusion dataset, utilizing MASK-RCNN is superior to YOLOv5 and provides a 5-7% boost in accuracy in general.’
The best-performing combination turned out to be Mask R-CNN in tandem with the EfficientNet-B5 object encoder, yielding a detection accuracy of 0.57.
However, the researchers state that none of the tests based around the use of object detection models outperform methods that use object encoders, such as CLIP:
‘Paired with our best performing model, YOLOv5 decreases the accuracy score of the CLIP model by 16% on the Stable Diffusion dataset. We see a general 10% accuracy decrease when utilizing object detection models, versus only utilizing object encoders. Therefore, there is a clear advantage to utilizing only object encoders for this task, both in terms of accuracy and runtime.
‘Our study shows that several object encoders are able to accurately capture the perceptual similarity between images without the need for additional detection methods.’

The authors state that the quality of the captions greatly affects the results, in terms of how accurately feature representations can be extracted. Since there are so many variables at play, they believe it will be difficult to establish cogent optimizations of the approach going forward. They also note that the system cannot catch contradictions within caption pairs – a problem noted in prior and related works.
However, it is notable how closely the captions alone can approximate the original image in certain cases, via Stable Diffusion or DALL-E 2:

The authors assert:
‘We are confident that the method proposed in this paper can further research on generative models in the field of cheapfake detection, and that the resulting datasets can be used to train and evaluate new models aimed at detecting cheapfakes.’
Conclusion
In the future, it’s possible that attributive systems such as C2PA will be able to demonstrate provenance and history for any digitally distributed photograph. If such a scheme were widely adopted, images on the internet would become divided into those with pedigree and ‘papers’, and those where…well, where you’ll need to assess your own credulity when evaluating them. Under such auspices, it could become technically impossible to mis-associate photos with mendacious captions, because the dates and initial metadata (such as GPS coordinates) would not line up.
Until this kind of oversight becomes available, the new approach makes innovative use of technologies that are already associated with deception and fake news – and in the process demonstrates quite starkly how close the semantic relationship is between images and their captions.
However, it has to be noted, as we have observed before, that this is an arbitrary relationship, in which specious captioning or metadata tends to become ingrained and aggrandized, even if it is inaccurate or poorly phrased. Therefore, solutions such as these do not address the wider problem of a lack of cohesive and broadly applicable semantic principles in annotation and labeling, even for English-language labels alone.