Detecting Cheapfakes With Deepfakes

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

Not all fake news requires AI; in many cases, it’s enough to present a real image together with a misleading caption – a practice known as a ‘cheapfake’ or ‘shallowfake’ – or to perform simple Photoshop manipulations.

From the new paper, examples of out-of-context illustrative images. Source: https://arxiv.org/pdf/2308.16611.pdf

Though this ability to abuse the photography=reality conceit has been in existence since the invention of the photographic medium in the 1820s, a number of initiatives have sprung up in recent years that leverage machine learning to attempt to identify instances of mislabeling – or ‘malicious association’ – of otherwise innocuous or non-deceptive photographs through captioning or context.

Catching Out-of-Context Misinformation with Self-Supervised Learning (COSMOS) is one of the most effortful such projects, and offered the first viable cheapfake detection routine, together with a hyperscale dataset of 200,000 images with 450,000 text captions, with captions labeled as ‘Out-Of-Context’ (OOC) or ‘Not-Out-Of-Context’ (NOOC).

The COSMOS on Steroids project subsequently provided further innovations on the original work, by offering an additional methodology titled Differential Sensing, which adds a negative and positive switch to each caption (i.e., ‘true’ or ‘not true’).

From the Cosmos on Steroids project: Differential Sensing detects inaccuracy in a source caption. Source: https://www.semanticscholar.org/paper/COSMOS-on-Steroids%3A-a-Cheap-Detector-for-Cheapfakes-Akgul-Civelek/f3d3394964439e460085d7141ea4924af801a8c5/figure/0

A later work used a bottom-up detection model with visual-semantic reasoning to improve further on the COSMOS baseline results.

Misleading captions caught by the 2022 initiative 'A Combination of Visual-Semantic Reasoning and Text Entailment-based Boosting Algorithm for Cheapfake Detection'. Source: https://dl.acm.org/doi/abs/10.1145/3503161.3551595

Though these approaches use recent and sophisticated methods of evaluating the semantic relationship between images and captions/annotations, they don’t touch on the possibilities of using the new breed of generative text-to-image systems, including Latent Diffusion Models such as Stable Diffusion, to aid in recognizing these deceptive associations.

However, a new paper from the University of Bergen in Norway takes exactly this approach, using multiple text-to-image generations from DALL-E 2 and Stable Diffusion to estimate what a more likely image result would be for captions associated with cheapfakes – and in this way to provide a method of evaluating the shortfall in alignment between cheapfaked image/caption pairs.

The original dataset image is on the left, and on the right we see a generated image that uses the caption which first accompanied the source image. The caption in this case was 'I’m primarily the dog walker, but usually the kids come with me.'

The new paper is titled Detecting Out-of-Context Image-Caption Pairs in News: A Counter-Intuitive Method, and comes from three researchers at Bergen.

Approach

The new work detects OOC caption/image pairs by comparing the perceptual similarity of images generated from the captions in the original COSMOS dataset.

Conceptual architecture for the new approach.

The system utilizes a feature-based approach for determining image similarity (i.e., between the original and generated images), and uses an object encoder and decoder to extract feature representations from the images. The OOC/NOOC results are then compared against the gold labels from the COSMOS dataset.
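In outline, this comparison pipeline can be sketched as follows; `generate_image`, `extract_features` and the 0.5 threshold are hypothetical placeholders standing in for the text-to-image models and feature encoders discussed below, not the authors' actual implementation:

```python
# Hypothetical sketch of the detection pipeline: generate one image per
# caption in the pair, extract feature vectors, and flag the pair as
# out-of-context (OOC) when the generated images are too dissimilar.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm = (sum(x * x for x in a) * sum(y * y for y in b)) ** 0.5
    return dot / norm

def predict_ooc(caption_1, caption_2, generate_image, extract_features,
                threshold=0.5):
    """generate_image and extract_features are injected stand-ins for a
    text-to-image model and a feature encoder respectively."""
    feats_1 = extract_features(generate_image(caption_1))
    feats_2 = extract_features(generate_image(caption_2))
    sim = cosine_similarity(feats_1, feats_2)
    return "OOC" if sim < threshold else "NOOC"
```

The point of the feature-based comparison is that it tolerates the superficial variation between two renderings of the same concept, where pixel-level metrics would not.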

Using DALL-E 2 for this particular purpose is quite challenging, since many of the captions in the COSMOS dataset are, naturally, provocative, and DALL-E 2 is likely to refuse to generate images based on such content. One example is that DALL-E 2 will not generate material related to COVID-19, and such material (due to the time at which these earlier projects were taking place, and the central motivations of this line of research) is present in quite large quantities in COSMOS.

Therefore the researchers had to pre-process the captions using Named Entity Recognition (NER), which, for example, will replace the proper noun ‘Obama’ with the noun ‘Person’. The researchers also note that this approach helps to generalize the method:

‘NER also helps decrease the abstraction between the caption and the images, as neither of the image generation models or object detection models will distinguish between types of persons or locations. We believe this will make similarity comparisons easier, thus increasing the accuracy of our model.’
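As a rough illustration of this sanitization step, the sketch below replaces pre-computed entity spans with generic type names; the paper performs the recognition itself with spaCy, whereas here the entity spans are assumed to be supplied, and the `TYPE_LABELS` mapping is an assumption for illustration:

```python
# Sketch of NER-based caption generalization: named entities are
# swapped for their generic type, as in 'Obama' -> 'Person'.
# Entity spans (start, end, label) are assumed pre-computed.

TYPE_LABELS = {"PERSON": "Person", "GPE": "Location", "ORG": "Organization"}

def generalize_caption(caption, entities):
    """entities: list of (start, end, label) character spans, where
    `end` is exclusive, as in spaCy's character offsets."""
    out, cursor = [], 0
    for start, end, label in sorted(entities):
        out.append(caption[cursor:start])        # text before the entity
        out.append(TYPE_LABELS.get(label, "Entity"))  # generic substitute
        cursor = end
    out.append(caption[cursor:])                 # trailing text
    return "".join(out)
```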

The NER augmentations are re-checked using the spaCy library from the COSMOS project, and an additional list of potentially problematic words was compiled before prompts were created for Stable Diffusion and DALL-E 2. The basis for the prompt list was supplied by the open source List of Dirty, Naughty, Obscene, and Otherwise Bad Words (LDNOOBW) from Shutterstock.
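A minimal sketch of such a blocklist check might look like this; the placeholder entry stands in for the actual LDNOOBW list, which in practice would be loaded from file:

```python
# Minimal sketch of prompt filtering against a word blocklist such as
# LDNOOBW. The single placeholder entry is illustrative only.

BLOCKLIST = {"exampleblockedword"}

def is_prompt_safe(prompt, blocklist=BLOCKLIST):
    """Return True if no token in the prompt matches the blocklist,
    after stripping punctuation and lowercasing."""
    tokens = {t.strip(".,!?;:\"'").lower() for t in prompt.split()}
    return not (tokens & blocklist)
```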

The project uses the test set from COSMOS, comprising 1,700 images with 3,400 associated captions. One image is generated for each caption that was paired with the original image, producing 3,400 synthetic images per generative model, for a total of 6,800 generated images.

For compatibility, and to conserve computing resources, the images were generated at the native Stable Diffusion V1.5 resolution of 512x512px. The generation process was automated via Python. For Stable Diffusion, each image took approximately 15 seconds to create on a Google Colab GPU, while the DALL-E 2 generations took seven seconds each, via a more burdensome API pipeline.

The choice of loss metrics, to evaluate how well the pipeline is working, is tricky when synthetic images are central to the workflow, since generative models introduce a systematic randomness that can undermine the process. Therefore, traditional approaches such as Structural Similarity Index (SSIM), the authors state, are not ideal for the use case.

Instead, they used diverse and more novel feature extraction techniques for comparison purposes, testing various object detection models such as YOLO and MASK R-CNN, which identify objects within images. These extracted objects can then be treated as sub-images in their own right for the metric evaluation process.

The feature vectors extracted from these in-image segments are compared to the feature representations obtained from the entire image, with the aforementioned MASK R-CNN tested, along with the V5 and V7 iterations of YOLO.

In a second run at the challenge, the authors also tested various object encoders, including ResNet, DenseNet, EfficientNet and CLIP. Across all these approaches, parity is estimated via Cosine Similarity, and the scores used to predict OOC/NOOC captions.
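As a hedged sketch of this scoring step, the code below compares two sets of feature vectors – one vector per detected object crop, plus optionally one for the whole image – via Cosine Similarity; the best-match-averaging strategy used to aggregate over crops is an assumption for illustration, not necessarily the paper's exact method:

```python
import math

# Sketch of feature-vector comparison via cosine similarity,
# aggregated over per-object crops treated as sub-images.

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def pairwise_max_similarity(feats_a, feats_b):
    """feats_a, feats_b: lists of feature vectors (one per object crop,
    optionally plus the whole image). For each vector in A, take its
    best match in B, then average the best-match scores."""
    scores = [max(cosine_similarity(fa, fb) for fb in feats_b)
              for fa in feats_a]
    return sum(scores) / len(scores)
```

The resulting score would then be thresholded to produce the OOC/NOOC prediction, as described above.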

Data and Tests

To test the new datasets and approach, the researchers conducted qualitative and quantitative tests. For the first, a survey was devised in which users were asked to rate the perceptual similarity of 24 caption/image pairs on a 1-10 scale. One group was drawn from each dataset, for a total of 48 generated pairs, with an even distribution of OOC and NOOC labels.

Examples and statistics from the qualitative round, with the distribution score indicating a high level of variability between respondents.

Regarding these results, the authors state:

‘Our survey shows that rating the similarity between the images is a difficult task even for humans. The rating distributing shows a high variation in similarity scores for the same images. [The image above shows] an image pair where the variation of ratings is high, showing that the perceptual similarity of images highly varies from viewer to viewer.

‘The caption pair used to generate the image pair in [the image above] is NOOC. The average rating indicates that participants correctly identify this. This shows that the similarity in the image pair correlates to the similarity in the caption pair, and the text-to-image model effectively captures the semantic similarity.’

The researchers also note that the average ratings can be converted into OOC/NOOC predictions and compared against the gold labels from the COSMOS dataset, for the caption sets used to generate the images.

The average ratings for image pairs in the survey, converted to OOC/NOOC predictions, compared to predictions from the models themselves, and from the COSMOS gold labels.
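The conversion of average survey ratings into OOC/NOOC predictions can be sketched as a simple thresholding step; the midpoint threshold on the 1-10 scale below is an assumption for illustration, not a value from the paper:

```python
# Hypothetical conversion of 1-10 average similarity ratings into
# OOC/NOOC predictions: low similarity between the two generated
# images suggests the caption pair is out-of-context.

def rating_to_label(avg_rating, threshold=5.5):
    return "OOC" if avg_rating < threshold else "NOOC"

def agreement_with_gold(avg_ratings, gold_labels, threshold=5.5):
    """Fraction of thresholded predictions that match the gold labels."""
    preds = [rating_to_label(r, threshold) for r in avg_ratings]
    hits = sum(p == g for p, g in zip(preds, gold_labels))
    return hits / len(gold_labels)
```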

The authors mention that the human-annotated scores allow them to know whether their own prediction model is in step with human perception – an objective further demonstrated in tests across diverse models for the survey phase of the tests, involving variations of DenseNet and ResNet, as well as CLIP:

Findings indicate that the model's own estimation of accuracy is in line with human perception, across diverse encoders.

For the quantitative round, an automated prediction model is used, cycling through eight different object encoders, in different phases of the tests. The encoders are paired with three different object detection models, testing (in the phase shown below) for accuracy, precision, recall and F1 score.

Results for object detection model performance, with YOLO V7 outperforming rivals, with the predictions based on gen vs gen similarity from Stable Diffusion generations.
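The four reported metrics follow the standard definitions for binary classification, treating OOC as the positive class; a minimal sketch:

```python
# Standard accuracy, precision, recall and F1 for binary OOC/NOOC
# predictions, with OOC treated as the positive class.

def classification_metrics(preds, golds, positive="OOC"):
    tp = sum(p == positive and g == positive for p, g in zip(preds, golds))
    fp = sum(p == positive and g != positive for p, g in zip(preds, golds))
    fn = sum(p != positive and g == positive for p, g in zip(preds, golds))
    tn = sum(p != positive and g != positive for p, g in zip(preds, golds))
    accuracy = (tp + tn) / len(golds)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"accuracy": accuracy, "precision": precision,
            "recall": recall, "f1": f1}
```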

The authors comment:

‘YOLOv7 produces a slightly better detection accuracy when paired with ResNet50 than YOlOv5 and MASK-RCNN. However, the slight increase in accuracy comes with a huge increase in runtime when utilizing normal GPUs. While the other variations use around 30 minutes on prediction on the entire dataset, YOLOv7 uses around 1hour and 30 minutes. MASK-RCNN, despite boosting better detection accuracy than YOLOv5, actually performs worse than YOLOv5 paired with all object encoders expect for EfficentNet on the DALL-E 2 dataset, where it returns a 1% better accuracy.

‘However, on the Stable Diffusion dataset, utilizing MASK-RCNN is superior to YOLOv5 and provides a 5-7% boost in accuracy in general.’

The best-performing combination of this type turned out to be MASK R-CNN in tandem with the EfficientNet-B5 object encoder, yielding a detection accuracy of 0.57.

However, the researchers state that none of the tests based around the use of object detection models outperform methods that use object encoders, such as CLIP:

‘Paired with our best performing model, YOLOv5 decreases the accuracy score of the CLIP model by 16% on the Stable Diffusion dataset. We see a general 10% accuracy decrease when utilizing object detection models, versus only utilizing object encoders. Therefore, it is a clear advantage of utilizing only object encoders for this task, both in terms of accuracy and runtime.

‘Our study shows that several object encoders are able to accurately capture the perceptual similarity between images without the need for additional detection methods.’

Left, results for predictions using only CLIP for feature vector extractions, gaining the best overall performance. Right, accuracy scores for each combination of object encoder and object detection model.

The authors state that the quality of captions will greatly affect the results, in terms of how accurately feature representations can be extracted. Since there are so many variables at play, they believe it will be difficult to establish cogent optimizations of the approach going forward. They note also that the system cannot catch contradictions within caption pairs – a problem noted in prior and related works.

However, the extent to which the captions alone can re-create the original image in certain cases, via Stable Diffusion or DALL-E 2, is notable:

The original image is on the right, labeled, with generative versions on the left and middle, informed entirely by captions as a text prompt.

The authors assert:

‘We are confident that the method proposed in this paper can further research on generative models in the field of cheapfake detection, and that the resulting datasets can be used to train and evaluate new models aimed at detecting cheapfakes.’

Conclusion

In the future, it’s possible that attributive systems such as C2PA will be able to demonstrate provenance and history for any digitally distributed photograph. If such a schema should be widely adopted, images on the internet will become divided into those with pedigree and ‘papers’, and those where…well, where you’ll need to assess your own credulity when evaluating them. Under such auspices, it could technically become impossible to mis-associate photos with mendacious captions, because the dates and initial metadata (such as GPS coordinates) won’t line up.

Until this kind of oversight becomes available, this new approach is making an innovative use of technologies that are already associated with deception and fake news – and in the process demonstrating quite starkly how close the semantic relationship is between images and their captions.

However, it has to be noted, as we have observed before, that this is an arbitrary relationship, where specious captioning or composition of metadata tends to become ingrained and aggrandized, even if it’s inaccurate or poorly-phrased. Therefore solutions such as these don’t address the wider problem of a lack of cohesive and broadly applicable semantic principles in annotation and labeling, even for English-language labels alone.
