Combating Stable Diffusion Face Forgery Through Frequency Analysis

Fake faces broken down into the spectral ranges that may reveal them

About the author


Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.


As we have observed before, the deepfake-detection research community has been relatively laggard in coming up with methods to identify artificial faces created by modern methods such as Latent Diffusion Models (LDMs) – exemplified by the runaway success of the open source Stable Diffusion framework, which has become an AI hobbyist’s go-to system for creating realistic models of celebrities (and which is equally capable of producing hyper-realistic representations of non-famous people).

The civit.ai website features hundreds of freely-available user-contributed AI models, mostly trained on domestic consumer hardware, that can allow Stable Diffusion to render a famous personage (or an unknown person) with extraordinary realism, depending on the skill of the uploader. Most of these depictions are more 'stylized' than the models' default output, as this is something of a vogue in the community – but all are capable of genuinely 'photorealistic' depiction. Source: https://civitai.com/models?tag=celebrity

In effect, the considerable gap between the emergence of autoencoder deepfakes in late 2017 and the advent of Stable Diffusion, nearly five years later, caused the research community to entrench itself in the study of the older method – notwithstanding that the state of the art in 2017-era deepfakes has been stagnating for years (exemplified by the fact that the authors of DeepFaceLab and DeepFaceLive, the two most widely-used video deepfake systems, archived them last November).

With only 17 months having passed since the emergence of Stable Diffusion, and rather less than that since methods such as DreamBooth and LoRA allowed the hobbyist community to create bespoke models of (real) personalities, the deepfake detection sector has been caught on the back foot.

Rats in a Barrel

Much like the generation of autoencoder deepfake detectors that preceded it, the slowly-emerging new generation of ‘AI face detectors’, aimed at finding ways to distinguish even the best Stable Diffusion-style fake faces from real data, has tended to key on forensic traces of whatever the latest system happens to be.

Naturally, this doesn’t work if the system evolves, and such traces either disappear or else diminish to an indistinguishable level. Thus, these efforts risk getting caught up in a game of whack-a-mole with AI developers, producing detection systems that are specific to certain generation systems, which then evolve beyond current detection capabilities.

Soft localization maps visualized in a recent deepfake detection paper. Papers of this type are always in search of platform-agnostic features that could potentially detect deepfaked content even in unseen, future systems, based on high-level principles, rather than characteristics of any particular system. Source: https://arxiv.org/pdf/2311.04584.pdf

What’s devoutly wanted (and has been since the first emergence of autoencoder deepfakes) is to find a high-level trait in generative images that can be discerned by a detection system, and which is platform-agnostic.

Stationary Target

Stable Diffusion-era detection research has, perhaps, been hamstrung by LDMs’ difficulty in creating temporally stable video. If Stable Diffusion had been able to create authentic deepfake celebrity videos from the outset, one can only assume that the current state of the art in AI-based face detection might have been more enthusiastically funded, and might now be more advanced.

Though new systems such as Stable Video Diffusion, and tertiary frameworks such as AnimateDiff and Roop-Unleashed, have not yet caught up with the ability of 2017-era autoencoder systems to produce temporally stable output (well, not in anything you’ll actually get to use without restriction), this apparently inevitable development seems to be drawing closer.

Only this week, a new open source system titled Magic-Me is offering a Stable Diffusion-based ‘Identity-Specific Video Customized Diffusion’, with a clear accent on smooth representation of celebrity likeness in video:

A new video-generation system published legitimately via Arxiv seems to know exactly what its target demographic is. Source: https://magic-me-webpage.github.io/

Therefore it would seem that a 2017-style imperative towards more aggressive research into detecting newer types of deepfake is creeping up on the community.

Frequent Offenders

A new collaboration between institutions in France and Switzerland offers what may be the latest attempt to find a ‘common thread’ among AI-generated fake faces, regardless of whether they are produced by current or future generative platforms – by analyzing the frequency representation of facial images.

On the left, a frequency distribution map for a real photo of a person and on the right, the altered characteristics of a Stable Diffusion equivalent. Source: https://arxiv.org/pdf/2402.08750.pdf

The researchers behind the new work, titled Towards the Detection of AI-Synthesized Human Face Images, actually trained a model on frequency-based images such as those depicted above, and found that applying the results to a variety of older AI-face detection models notably improved their accuracy.

The work offers no direct alternate framework of its own, since the aim of the project is to advance the state-of-the-art by establishing a benchmark dataset and methodology, using popular LDM and Generative Adversarial Network (GAN) frameworks.

The authors state:

‘The paper [aims] to draw new insights for developing more generalizable detectors. To that end, a frequency domain analysis on the synthetic face images is carried out, examining the deviation of their spectra from that of real images.

‘Consequently, our experimental results demonstrate that training a learning-based detector using frequency representations yields outstanding performance and generalization ability in the benchmark.’

Method

In the first instance, the researchers curated a new dataset containing real images from CelebA-HQ, a subset of the original CelebA dataset that comprises 30,000 high-resolution images drawn from the original collection.

Then, a collection of synthetic faces was generated using diverse GAN and diffusion methods. The GAN frameworks used were ProGAN, StyleGAN2, and VQGAN; the diffusion-based methods were Stable Diffusion (a latent diffusion model), DDPM, DDIM, and PNDM.

Six example facial generations (i.e., non-real people) created for the new benchmark dataset. Clockwise from upper left, faces created by ProGAN; StyleGAN2; DDPM; DDIM; PNDM; and LDM (Stable Diffusion).

The StyleGAN2 framework was trained on the FFHQ dataset, while all the other architectures were pretrained on CelebA-HQ.

Since 256x256px is the most common size among the real collections, this was chosen as the target resolution in all cases, and models that naturally produce larger images (such as Stable Diffusion, whose minimum native output is 512x512px, the resolution at which it was trained) had their output downsized to this scale.

For each framework, 40,000 images were created, and then split into training, validation and test sets of 38,000, 1,000 and 1,000, respectively.
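As an illustration, a minimal sketch of such a per-framework split might look like the following; the file names and the use of a simple random shuffle are assumptions on my part, not details taken from the paper.

```python
# A hypothetical sketch of the 38,000 / 1,000 / 1,000 per-framework split
# described above; file names and the simple random shuffle are assumptions.
import random

images = [f"synthetic_{i:05d}.png" for i in range(40_000)]  # 40,000 generated faces
random.seed(0)                                              # reproducible shuffle
random.shuffle(images)

train, val, test = images[:38_000], images[38_000:39_000], images[39_000:]
assert len(train) == 38_000 and len(val) == 1_000 and len(test) == 1_000
```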

A number of earlier methods were chosen as experiments for the benchmark, including the approach put forward in 2019 by UC Berkeley and Adobe, for the paper CNN-generated images are surprisingly easy to spot…for now (denoted as ‘Wang2020’ in results); the method from the 2021 Italian offering Are GAN generated images easy to detect? (denoted as ‘Grag2021’ in results); the method from the 2022 Italian paper Detecting GAN-generated Images by Orthogonal Training of Multiple CNNs (denoted as ‘Mandelli2022’ in results); and the method from the 2023 University of Wisconsin publication Towards Universal Fake Image Detectors that Generalize Across Generative Models (denoted as ‘Ojha2023’ in results*).

Though output from the various architectures obtained differing Fréchet Inception Distance (FID) scores, this accorded with the need for a variegated dataset of fakes of differing quality, since the detection method being considered here is intended to operate at a high level, rather than being specific to the eccentricities of any one generative approach:

Diverse FID scores (lower is better) for each of the generative methods used.
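For reference, a rough sketch of how FID might be computed with the torchmetrics library follows; this is not the authors' own tooling, and the tensor shapes and feature setting are illustrative assumptions.

```python
# A rough sketch of computing FID with torchmetrics (not the authors' tooling).
# Random tensors stand in for real and generated images; in practice each set
# would contain thousands of face images for a stable estimate.
import torch
from torchmetrics.image.fid import FrechetInceptionDistance

fid = FrechetInceptionDistance(feature=2048)  # 2048-dim Inception features

real_images = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)
fake_images = torch.randint(0, 256, (100, 3, 299, 299), dtype=torch.uint8)

fid.update(real_images, real=True)   # accumulate features of real images
fid.update(fake_images, real=False)  # accumulate features of generated images
print(float(fid.compute()))          # lower scores indicate closer distributions
```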

The new benchmark has two ambits: to see whether the generalization capabilities of a detector can extend to synthetic human facial images; and to see whether a detector trained on a specific model can remain performant when asked to evaluate the output of GAN and LDM models that it was not trained on.

In the first case, the new work uses prior open source detection methods trained, rather traditionally, on the various classes (‘bridge’, ‘church’, etc.) of the LSUN dataset.

Examples from the now-venerable LSUN dataset, used as one source in the new project. Source: https://www.tensorflow.org/datasets/catalog/lsun

For the latter case, the older models were tested with out-of-distribution AI face images synthesized by alternate GANs and LDMs.

Besides the ability to generalize to high-level perception of ‘falseness’ in output from unseen or unspecified systems, the proposed benchmark also analyzes the impact of JPEG compression at various levels of intensity; blur effects, rendered through a Gaussian blur kernel; Gaussian noise; and resizing operations, such as downsampling to smaller resolutions via bicubic interpolation.

(These factors are assessed separately in the new benchmarking schema, rather than in tandem or collectively.)
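A hedged sketch of the four perturbation types, as they might be applied with Pillow and NumPy, is given below; the parameter values (JPEG quality, blur sigma, noise strength, resize factor) are illustrative assumptions rather than the benchmark's actual settings.

```python
# A hedged sketch of the four perturbation types, implemented with Pillow and
# NumPy. The default parameter values are illustrative assumptions only.
import io
import numpy as np
from PIL import Image, ImageFilter

def jpeg_compress(img: Image.Image, quality: int = 75) -> Image.Image:
    buf = io.BytesIO()
    img.save(buf, format="JPEG", quality=quality)   # lossy re-encode
    buf.seek(0)
    return Image.open(buf).convert("RGB")

def gaussian_blur(img: Image.Image, sigma: float = 1.0) -> Image.Image:
    return img.filter(ImageFilter.GaussianBlur(radius=sigma))

def gaussian_noise(img: Image.Image, std: float = 5.0) -> Image.Image:
    arr = np.asarray(img, dtype=np.float32)
    noisy = arr + np.random.normal(0.0, std, arr.shape)
    return Image.fromarray(np.clip(noisy, 0, 255).astype(np.uint8))

def downsample(img: Image.Image, factor: int = 2) -> Image.Image:
    w, h = img.size
    return img.resize((w // factor, h // factor), Image.BICUBIC)  # bicubic resize
```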

The frequency images illustrated earlier in the article are obtained by converting images to grayscale (by averaging the RGB color channels), before applying a high-pass filter, i.e., subtracting a smoothed, low-pass copy of the image from the image itself (a technique originally proposed by Wang2020).

Next, a Fast Fourier Transform (FFT) is applied, yielding an image that represents the frequency spectrum. Below we see examples of the average frequency spectrum across 1,000 images randomly taken from the CelebA-HQ dataset and the seven aforementioned fake image datasets:

Above, the mean frequency spectrum for real images sourced from CelebA-HQ (left), followed by the spectra for the three annotated GAN methods; below, the same scenario for four diffusion models (Stable Diffusion is 'LDM').
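A minimal sketch of the conversion pipeline described above might look like the following; the median-filter kernel size and the use of SciPy and Pillow are assumptions on my part, not details taken from the paper.

```python
# A minimal sketch (not the authors' code) of the frequency-representation
# pipeline: grayscale via channel averaging, a high-pass residual, then a
# log-scaled FFT magnitude spectrum. Kernel size and libraries are assumptions.
import numpy as np
from PIL import Image
from scipy.ndimage import median_filter

def frequency_representation(path: str, kernel: int = 3) -> np.ndarray:
    """Return the log-scaled FFT magnitude spectrum of a high-pass residual."""
    rgb = np.asarray(Image.open(path).convert("RGB"), dtype=np.float32)
    gray = rgb.mean(axis=2)                             # average the RGB channels
    residual = gray - median_filter(gray, size=kernel)  # high-pass residual
    spectrum = np.fft.fftshift(np.fft.fft2(residual))   # centre the zero frequency
    return np.log(np.abs(spectrum) + 1e-8)              # log magnitude

# Averaging the spectra of many same-sized images (e.g. 256x256px) makes
# systematic generator artifacts such as grids and impulse lines easier to see.
def mean_spectrum(paths: list[str]) -> np.ndarray:
    return np.mean([frequency_representation(p) for p in paths], axis=0)
```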

The authors comment**:

‘[The] common grid-like artifacts found in [previous studies] are notably absent in our GAN-generated face datasets. However, datasets created by ProGAN and VQGAN exhibit numerous high-frequency noises.

‘The more advanced StyleGAN2 contains relatively fewer such artifacts but remains distinguishable from real image spectra. On the other hand, [the lower row] shows that the FFT spectra of [LDM]-created face images closely resemble the real spectrum, except for [Stable Diffusion] which contains both high-frequency noise and grid-form artifacts.

‘While images produced by DDPM, DDIM, and PNDM exhibit fewer visible artifacts in the frequency domain, they tend to have higher spectra density and contain low-frequency artifacts along the vertical and horizontal impulse sequence, deviating from that of real image spectra.’

The differences observed thus inspired the researchers to investigate the potential creation of a ‘generic’ detector capable of identifying AI-bred faces from arbitrary GANs and LDMs.

The method trains ResNet-50, XceptionNet and EfficientNetB4 on frequency representations of both real and synthetic images (i.e., the system trains on the kind of spectral images seen directly above, rather than on the source RGB images, similar to the way that one recent text-to-video project trained on images that actually included bounding boxes, in order to directly insert trace signals into the model).
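To make the idea concrete, here is a hedged sketch of training one such classifier (ResNet-50) on single-channel frequency maps; the input shape, optimizer and learning rate are illustrative assumptions, not the paper's settings.

```python
# A hedged sketch of training a binary real/fake classifier on frequency
# representations rather than RGB images. Hyperparameters are assumptions.
import torch
import torch.nn as nn
from torchvision import models

# The spectra are single-channel, so ResNet-50's first convolution is adapted.
model = models.resnet50(weights=None)
model.conv1 = nn.Conv2d(1, 64, kernel_size=7, stride=2, padding=3, bias=False)
model.fc = nn.Linear(model.fc.in_features, 2)   # two classes: real vs. synthetic

criterion = nn.CrossEntropyLoss()
optimizer = torch.optim.Adam(model.parameters(), lr=1e-4)

def train_step(spectra: torch.Tensor, labels: torch.Tensor) -> float:
    """spectra: (B, 1, 256, 256) frequency maps; labels: 0 = real, 1 = fake."""
    optimizer.zero_grad()
    loss = criterion(model(spectra), labels)
    loss.backward()
    optimizer.step()
    return loss.item()
```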

Tests

The tests conducted for the new benchmark were divided into three phases. In the first, the aforementioned methods* Wang2020, Grag2021, Ojha2023 and Mandelli2022 were used, with the pre-trained weights from the original releases (i.e., the systems were being tested in the state in which they were first published).

Due to the different sources used across these projects, only Mandelli2022 was trained on the Alias-Free Generative Adversarial Networks synthetic face dataset, while the others were trained on LSUN.

For the second phase, Ojha2023 and Wang2020 alone were tested for generalization ability across diverse generative models. Further, the aforementioned three CNN-based classifiers (XceptionNet, ResNet-50 and EfficientNetB4) were trained with rasterized frequency representations, and evaluated in the same context.

The real source images used came from CelebA-HQ, and the fake images from the ProGAN model and the DDIM diffusion model.

For phase three, the detectors were tested for robustness against the four above-mentioned image perturbation methods, using the four pre-trained detectors from phase one, tested against ‘distorted’ face images from DDIM and ProGAN. Additionally, Ojha2023 and Wang2020 were retrained on the relevant training set and tested under the same conditions.

Evaluation metrics were average precision (AP) and Area Under Receiver Operating Characteristic Curve (AUC).
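Both metrics are standard, and might be computed with scikit-learn as in this small sketch; the scores shown are dummy values standing in for a detector's estimated probability that an image is synthetic.

```python
# A small sketch of the two reported metrics, using scikit-learn and dummy scores.
from sklearn.metrics import average_precision_score, roc_auc_score

y_true  = [0, 0, 1, 1, 1]             # 0 = real, 1 = AI-generated
y_score = [0.1, 0.4, 0.35, 0.8, 0.9]  # detector confidence for 'AI-generated'

ap  = average_precision_score(y_true, y_score)  # average precision (AP)
auc = roc_auc_score(y_true, y_score)            # area under the ROC curve
print(f"AP={ap:.3f}  AUC={auc:.3f}")
```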

Detection performance evaluated against four pre-trained detectors, used with the original weights from the source projects.

The authors conclude that detectors that were not trained on fake AI face images ‘struggle to adapt to synthetic face images’ – by which they mean that these detectors do not generalize well beyond the kind of output that they are actually expecting.

Next, the authors tested detection performance against the benchmark (spectral) images:

Results for generalization testing across diverse techniques. All results are based on training with face images generated by ProGAN and DDIM, and are tested across all seven generative models used in the study.

The authors comment:

‘[The] retrained version of Wang2020 shows the potential to generalize to images created by VQGAN, DDPM, and PNDM, yet struggles to adapt to StyleGAN2 and LDM.

‘Conversely, Ojha2023 achieves nearly flawless detection across all the GANs and [LDMs]. Notably, after training with frequency representations of these face images, the three CNN detectors achieve much better generalization ability when compared to their counterparts that are directly trained with RGB images.

‘The combination of EfficientNetB4 and frequency representation even surpasses the state-of-the-art performance on certain GAN models and most [LDMs].’

Finally, the researchers tested the systems against the four previously-mentioned image perturbations, and found that while spectral training was able to improve on the baselines of most of the frameworks, more extreme perturbations limited the extent of this effect (the dashed lines in the images below depict the performance of the systems when retrained on the new benchmark data):

Results for testing the systems against four types of image perturbation, with dashed lines representing the performance of the frameworks when retrained on the new benchmark data.

Conclusion

While there is strong indication that retraining on spectral images can bring notable improvements to deepfake or AI-based face detection, and while this approach offers the possibility of the much sought-after high-level metric for detection, as ever, the imposition of degradations to images tends to confuse the system.

Image degradation is a powerful tool for imposing extra authenticity upon images – for instance, in the deepfaking or AI synthesis of ‘archival’-style images intended to belong to a former era, which are likely to demonstrate characteristically lower image quality than current standards offer, or which are likely to have suffered image quality loss since the images were supposedly taken, through poor storage and general wear and tear.

Additionally, the routine economies that the web ecosystem applies to popular images, such as recompression, also offer fakers a chance to add a patina of ‘authentic degradation’ to images that are brand new, and which never actually existed outside of the generative AI age.

Despite this, the new work does seem to offer a methodology that could be applicable to domain-consistent output, such as cases where fakes are intended to be new and novel, and to live up to prevailing high-quality standards.

In the end, we can only presume, the traces of artificiality found by these methods will become replicable by newer generative systems, and the only way that it will be feasible to make reasonable guesses around image authenticity will be to evaluate plausibility, and to apply other related criteria (such as the fact that an image which should have entered the public arena much earlier is only now being made available).

* We apologize for this kind of crib, but the need for it tends to crop up when dealing with reference papers whose source methodologies were not explicitly named by their inventors.

** My conversion of the authors’ inline citations to hyperlinks.
