Detecting AI-Generated Images With Inverted Stable Diffusion Images – and Reverse Image Search

Main image derived from https://unsplash.com/photos/mens-blue-and-white-button-up-collared-top-DItYlc26zVI

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

A new project from MIT and Google Research offers a way of detecting AI-generated images by deconstructing a Stable Diffusion-based image, examining relevant images from the training dataset behind Stable Diffusion, and using reverse image search to find real-world images that are similar to the AI-generated ones.

By deconstructing Stable Diffusion images back into the noise from which they originated, and using that noise pattern to make a near-identical image, the central feature logic is exposed at a very low level, and these tendencies can subsequently be identified in unseen, novel images. Both real and fake images are used, and (image right, above) generated images are also 'reverse-searched' to find real-world analogs, a process that comprises a new evaluation protocol for the project. Source: https://arxiv.org/pdf/2406.08603

The fake-detection model obtained with this method, the researchers state, generalizes well to a wide range of the most popular AI image-generation frameworks, including API-based systems to which researchers have no direct access, such as DALL-E 3.

The paper states:

‘[We] are the first to show that text-conditioned DDIM inversion feature maps extracted from one diffusion model improve the ability of a detector to identify images generated by other higher-fidelity diffusion models. Moreover, we are the first to propose an evaluation procedure for GenAI detectors that ensures that the learned detector is not biased towards any style or theme, and to quantitatively verify that the resulting evaluation is more reliable.’

The decoded latent noise map for an image is one of the contributing data points for the new system, which aims to find low-level fundamental commonalities between the output of diverse generative systems.

The traditional approach (which has been the central thrust of fake-detection research for some years) instead attempts to use model-specific artifacts from generative frameworks.

The problem with this method is that it leads to a kind of tacit ‘arms race’ between developers and detectors, since the generative systems will inevitably evolve and no longer produce the same traces, leaving fake detectors that rely on them bereft of signals to key on.

The authors observe:

‘[We] propose that a model that has access to all signals required to internally perform some form of likelihood testing on input data against a particular text-to-image model (the image, its imperfect reconstruction, and the intermediate noise map) would generalize better to detect images generated via other diffusion models.

‘Text [conditioning] further amplifies differences between loglikelihoods of distributions of fake and real images, making the corresponding test more powerful and consequently making inversions even more useful for detecting fake images.

‘To sum up, GenAI detectors find discrepancies between the real data distribution and the approximation learned by the GenAI model.’

In tests, the system, which is titled FakeInversion, consistently obtained superior scores at detecting AI-generated images from both open and closed-source systems. The new verification protocol uses reverse image search from Google Lens, and filters out any images published after the advent of the original DALL-E model, to ensure that the comparative images returned are not themselves AI-generated.

Comparisons of generated images to real images are accomplished via Google Lens reverse search, and only permit images that were online prior to the 'AI-generated' era.

The new paper is titled FakeInversion: Learning to Detect Images from Unseen Text-to-Image Models by Inverting Stable Diffusion, and comes from four researchers across MIT and Google Research. The work also has an accompanying project site.

Method

Text-to-image Latent Diffusion Models (LDMs) such as Stable Diffusion take a user-supplied text prompt, convert it into embeddings via a trained image/text encoder such as CLIP, and then generate an image by iteratively denoising visual 'static' (raw latent noise) under the guidance of those embeddings. To ensure that each image is unique, the initial noise is determined by a random seed.

These factors make the image reproducible: all you need is the same text prompt (which will inevitably resolve back into the identical embeddings of the original generation) and the same random seed (which will reproduce the identical noise pattern of the original), running in an environment identical to the original.
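
As a minimal illustration of this determinism (my own sketch, not code from the paper; the model ID and prompt are placeholders), the Hugging Face diffusers library makes the role of the seed explicit:

```python
# Minimal sketch of seed-based reproducibility in Stable Diffusion, using the
# Hugging Face 'diffusers' library. Model ID and prompt are illustrative only.
import torch
from diffusers import StableDiffusionPipeline

pipe = StableDiffusionPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", torch_dtype=torch.float16
).to("cuda")

prompt = "a man in a blue and white button-up shirt"  # any fixed prompt
seed = 1234                                           # any fixed seed

# Two runs with the same prompt, seed and environment yield the same image,
# because the initial latent noise is fully determined by the seed.
image_a = pipe(prompt, generator=torch.Generator("cuda").manual_seed(seed)).images[0]
image_b = pipe(prompt, generator=torch.Generator("cuda").manual_seed(seed)).images[0]

image_a.save("run_a.png")
image_b.save("run_b.png")  # pixel-identical to run_a.png on the same setup
```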

From the schema of the new paper (reproduced in full below), we see the system reconstructing an image from the known latent noise.

Though the latent noise can be reproduced via a numeric seed, the new approach outlined in the paper trains a detector partially on the visual representation of the noise (center in image above), rather than programmatically reproducing the noise from the random seed – and the authors have found that the original image can be closely (though not identically) reproduced by interpreting the noise in this way.

The reliability and reproducibility of these signals are key to the new detection process. The visual representation of the image, called a noise map, represents an inversion of the original image, and is obtained through a deterministic conditional forward DDIM inversion process – rather than by just plugging in the original variables that created the image.

Essentially, DDIM forward inversion adds noise to an image until it is pure 'static', following the logic of the image's derived features. Though the result of this cannot possibly yield exactly the same pixel-for-pixel image as the purely programmatic process that originated the image, tests show (see image example above) that a near-perfect replica is obtainable, with the high-level features preserved.

The replica is produced by the conditional reverse process, which essentially operates in the same way as the denoising stage of the original pipeline, but starting from a noise map that has been generated from a completed image rather than from a random seed.
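
For readers who want the mechanics, the following is a conceptual sketch of the two deterministic DDIM updates involved. It is not the paper's code: the eps_model noise predictor and alphas_cumprod schedule are stand-ins, and guidance and scheduler details are omitted.

```python
# Conceptual sketch of text-conditioned DDIM inversion (image -> noise map) and
# reconstruction (noise map -> image). 'eps_model' stands in for the diffusion
# UNet's noise prediction eps(x_t, t, text_embedding); 'alphas_cumprod' is the
# usual tensor of cumulative noise-schedule products. Simplified for illustration.
import torch

def ddim_invert_step(x_t, t, t_next, eps_model, text_emb, alphas_cumprod):
    """One deterministic forward (inversion) step x_t -> x_{t_next}, with t_next > t."""
    a_t, a_next = alphas_cumprod[t], alphas_cumprod[t_next]
    eps = eps_model(x_t, t, text_emb)                        # predicted noise
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()    # implied clean latent
    return a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps

def ddim_denoise_step(x_t, t, t_prev, eps_model, text_emb, alphas_cumprod):
    """One deterministic reverse (denoising) step x_t -> x_{t_prev}, with t_prev < t."""
    a_t, a_prev = alphas_cumprod[t], alphas_cumprod[t_prev]
    eps = eps_model(x_t, t, text_emb)
    x0_pred = (x_t - (1 - a_t).sqrt() * eps) / a_t.sqrt()
    return a_prev.sqrt() * x0_pred + (1 - a_prev).sqrt() * eps

# Iterating ddim_invert_step over increasing timesteps turns an image's latent into
# the noise map; iterating ddim_denoise_step back down yields the near-identical
# reconstruction described above.
```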

The new process, having made explicit all these procedures, can then intervene and analyze the data at each stage, using the BLIP and CLIP models:

Full schema for FakeInversion.

The text prompt originally used to create the image is not needed in this process; instead, a caption is generated by BLIP, and the final embedding is refined by CLIP. The central idea here is to jettison user logic (which is ad hoc and unpredictable) in favor of architecture logic, which is reproducible at a low enough level that a detector keying on such traits can generalize to similar types of model, rather than just the model that was originally used.
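
A rough sketch of that caption-then-encode step follows, using common public BLIP and CLIP checkpoints from the transformers library; the specific model IDs are my illustrative choices, not necessarily those used in the paper.

```python
# Sketch of captioning an arbitrary image with BLIP and encoding the caption with
# CLIP's text encoder, so that no knowledge of the original prompt is required.
from PIL import Image
from transformers import (BlipProcessor, BlipForConditionalGeneration,
                          CLIPTokenizer, CLIPTextModel)

image = Image.open("candidate.png").convert("RGB")  # hypothetical input image

# 1. Generate a caption for the image with BLIP.
blip_proc = BlipProcessor.from_pretrained("Salesforce/blip-image-captioning-base")
blip = BlipForConditionalGeneration.from_pretrained("Salesforce/blip-image-captioning-base")
blip_inputs = blip_proc(image, return_tensors="pt")
caption = blip_proc.decode(blip.generate(**blip_inputs)[0], skip_special_tokens=True)

# 2. Encode the caption with CLIP to obtain the text conditioning for DDIM inversion.
clip_tok = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
clip_text = CLIPTextModel.from_pretrained("openai/clip-vit-large-patch14")
tokens = clip_tok(caption, padding="max_length", truncation=True, return_tensors="pt")
text_embedding = clip_text(**tokens).last_hidden_state  # shape (1, 77, 768)
```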

The authors comment*:

‘But why would a diffusion detector benefit from having access to DDIM inversion of an image if it already has access to the image itself? Recent works showed that DDIM can be viewed as a first-order discretization of a neural probability-flow ODE. The bijection between [observations] and [noise maps] induced by this ODE can be used to evaluate the likelihood of the data via the change of variable.

‘If we view the forward DDIM [mapping] as an approximation of that true bijective mapping between [observation and noise maps] that introduces a discretization [error] into the inverted noise [maps], causing the resampled [image] to deviate from the original [image], it can be [shown] that, in the first-order approximation, the loglikelihood of the data given that underlying model can be estimated from the input [image], its imperfect [reconstruction], and the noise [map] alone.’

Therefore, the input image, the decoded latent noise map, and the decoded image reconstruction are used to train a ResNet50 encoder (the aforementioned reverse image search stage is a post-inference operation, which we will take a look at in ‘Data and Tests’, below).
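
One plausible way to wire this up (an assumption on my part about the exact input format, since the paper may fuse the three signals differently) is to stack the three RGB tensors channel-wise and widen the ResNet50 stem accordingly:

```python
# Sketch of a ResNet50 detector fed with the input image, its DDIM reconstruction
# and the decoded noise map. Channel-wise concatenation is an illustrative
# assumption about how the three signals are combined.
import torch
import torch.nn as nn
from torchvision.models import resnet50

class FakeDetector(nn.Module):
    def __init__(self):
        super().__init__()
        self.backbone = resnet50(weights=None)  # trained from scratch
        # Widen the stem to accept 9 channels (image + reconstruction + noise map).
        self.backbone.conv1 = nn.Conv2d(9, 64, kernel_size=7, stride=2,
                                        padding=3, bias=False)
        # Single logit: real (0) vs. AI-generated (1).
        self.backbone.fc = nn.Linear(self.backbone.fc.in_features, 1)

    def forward(self, image, reconstruction, noise_map):
        x = torch.cat([image, reconstruction, noise_map], dim=1)  # (B, 9, H, W)
        return self.backbone(x)

# Example with dummy 256x256 inputs:
detector = FakeDetector()
dummy = torch.randn(2, 3, 256, 256)
logits = detector(dummy, dummy, dummy)  # shape (2, 1)
```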

As the authors observe, the resulting Gen-AI detector essentially seeks out discrepancies between the real data distribution and the approximation trained into the system.

They further note that the detector can get a ‘rough estimate’ of whether any particular image falls into the category of ‘likely generated by AI’ under the approximate distribution generated by the Stable Diffusion samples that feature in the workflow – and they dub this signal (which is not specific to Stable Diffusion) ‘SD-likelihood’.

Data and Tests

To test FakeInversion, the researchers selected a particularly recent group of baselines: DMDetect, a state-of-the-art method that operates only in the RGB space, and for which only an inference checkpoint is available; UFD, which detects fake images by training a linear head over CLIP embeddings of real and AI-generated images; DIRE, which also uses image-space DDIM reconstruction, whose sole released checkpoint has a known issue which required special attention in the new study; and CNNDet, a 2020 convolutional baseline network, regarding which the authors were able to retrain an official checkpoint release with their own data.

Training data used included the ProGAN+LSUN dataset from the CNNDet project, which comprises 350,000 images from the class-conditioned pre-trained ProGAN initiative from 2018, merged with data from the LSUN project (although LSUN is a fairly venerable collection at this point, the authors observe that it continues to be a useful dataset for the emerging gamut of diffusion models).

The core training data itself was provided by Stable Diffusion generations, together with original images from the LAION dataset on which Stable Diffusion was trained.

Initially, detectors were trained on 300,000 fake Stable Diffusion (V1+) images from DiffusionDB, and 300,000 randomly-selected images from LAION. The authors note that although early-release Stable Diffusion models could now be considered outdated, they found that training on V1-era images still produces detector models that can identify fakes from much newer versions (and it should be noted that the 1.4/1.5 releases remain robustly popular in the community, as later models tend to be bowdlerized or even crippled to some extent or other, for legal reasons).

To obtain non-SD fake data, the authors turned to images generated by the closed-source Imagen API, and through the use of existing datasets of fakes made available at HuggingFace, including examples generated by MidJourney and DALL-E 3. Further examples, from DALL-E 2, were taken from the DMDetect database, and in addition to this the authors generated ‘several thousand’ images conditioned on MidJourney prompts.

Further frameworks involved in fake image generation were Kandinsky 2 and 3, PixArt-α, SDXL-DPO, SDXL, SegMoE, SSD-1B, Stable-Cascade, Segmind-Vega, and Wurstchen 2.

Examples from the above-mentioned sources for fake and real images.

In order to make certain that detectors are not biased towards preferring any particular theme or style, or any other characteristics particular to a generative framework, the researchers developed a Synthetic Reverse Image Search (SynRIS) evaluation method.

The researchers used the Google Lens API to systematically obtain images from the internet that are thematically equivalent to the produced fake samples, restricting obtained results to 2020 or earlier, a period that predates the age of DALL-E, Stable Diffusion, and any kind of generative AI that could be said to be ‘photorealistic’.

Examples (right-most column of each section) of images found by reverse image search that accord with fake, AI-generated images, under SynRIS.

Such methods have been used periodically to ascertain whether a generative system has memorized an image that occurs too frequently in its training dataset; but even where the process reveals this to be the case, that is not the purpose of SynRIS, which instead seeks to establish real/fake distribution characteristics.

The evaluation methodology is based on two prior works, and the approach obviates the possibility that the detector may focus on styles or themes that pertain to a particular generative framework.

The ResNet50 detector backbone was trained from scratch, with each checkpoint validated against the held-out part of the dataset split of each respective training set.

Images were augmented with a standard array of methods, including flip, crop, rotate and jitter, among others.
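
A representative torchvision augmentation stack along those lines might look as follows; the exact parameters are illustrative rather than those used in the study.

```python
# Illustrative training-time augmentations: flip, crop, rotation and color jitter,
# applied after resizing the shortest side to 256px.
from torchvision import transforms

train_transforms = transforms.Compose([
    transforms.Resize(256),                  # shortest side to 256px
    transforms.RandomCrop(224),
    transforms.RandomHorizontalFlip(p=0.5),
    transforms.RandomRotation(degrees=10),
    transforms.ColorJitter(brightness=0.1, contrast=0.1, saturation=0.1),
    transforms.ToTensor(),
])
```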

Metrics used were Area Under the Receiver Operating Characteristic curve (AUROC ††), average precision and average accuracy, together with PR, ROC and DET curves. To ensure that differences in image size did not bias the results, all images were resized to 256px on the shortest side, and saved without compression. Fréchet Inception Distance (FID) was also used.
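
For reference, the headline detection metrics can be computed from per-image detector scores with scikit-learn; the labels and scores below are dummy values purely for illustration.

```python
# Computing AUROC, average precision and accuracy from per-image detector scores.
import numpy as np
from sklearn.metrics import roc_auc_score, average_precision_score, accuracy_score

y_true = np.array([0, 0, 1, 1, 1, 0])                # 0 = real, 1 = AI-generated
y_score = np.array([0.1, 0.4, 0.8, 0.7, 0.9, 0.3])   # detector output probabilities

auroc = roc_auc_score(y_true, y_score)
ap = average_precision_score(y_true, y_score)
acc = accuracy_score(y_true, y_score >= 0.5)         # accuracy at a 0.5 threshold

print(f"AUROC={auroc:.3f}  AP={ap:.3f}  Acc={acc:.3f}")
```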

The researchers tested the False Positive Rate (FPR) of the state-of-the-art UniversalFakeDetect system.

Comparison of the performance of FakeInversion and UniversalFakeDetect on the test sets.

The researchers comment*:

‘Results show that LAION-based evaluation significantly underestimates the false positive rate of the detector when evaluating its ability to discriminate fakes from closed-source text-to-image models (Imagen, DALL·E 2/3) across both training sets.

‘We also obtained real examples from the multimodal dataset used to train [Imagen], and evaluated the detector against these real examples and these results closely align with our RIS-based [eval]. The [FID] and [KID] between real and fake images is also lower for RIS eval, and matches FID/KID between WebLI and Imagen fakes, suggesting better stylistic and thematic alignment.

‘Similar trends can be seen on open-source models ([Kandinsky], SDXL) and across both training sets. These results suggest that our RIS-based eval is a more reliable way to estimate a model’s ability to detect images from closed-sourced text-to-image models trained on unknown data.’

FakeInversion, according to the authors, also consistently scores best at the detection of both open and closed-source generative systems across the diversity of training datasets tested.

Primary results against prior frameworks for across-the-board detection of fakes created via both open and closed-source generative systems.

Here the authors comment:

‘[Our] method consistently scores best at detecting both closed and open-source methods across various training sets. It also matches the performance of prior work on academic benchmarks. On average, our method outperforms prior work by at least 4pp on both training sets.’

In ablation studies (not covered here), the authors found that DDIM inversion was crucial for the generalization that makes the system effective.

Conclusion

Over the last 5-6 years, the deepfake detection research sector has resolved into a search for low-level traits that persist across architectures – and the more authentic the new generation of LDM-based generative systems becomes, the more such traces evaporate, making it very difficult to discern artificial or artificially-enhanced images.

The use of noise maps in FakeInversion training is one of the most innovative wrinkles I’ve seen in a while. While it remains specific to LDMs, that category is currently the dominant one in the literature, and in the slew of new frameworks revealed each year. If noise analysis proves to be a new and consistent key, the new solution may remain viable a little longer than many previous offerings.

* My conversion of the authors’ inline citations to hyperlinks. The frequent square brackets are necessary to remove the references to formulae in the project’s equations, as these are beyond the scope of this article. I also included an explanatory link regarding bijection. I have omitted repeated citations/links.

A trainable checkpoint was available, but the authors used this in both its off-the-shelf state, and retrained from scratch on the authors’ data, via the release code

†† The paper refers to AUCROC, which I have presumed to be an error, or else an author-specific abbreviation. Also known as AUC.
