Despite the vastly superior ability of Latent Diffusion Models (LDMs) such as Stable Diffusion to create high-resolution representations of real people, in comparison to 2017-era autoencoder methods (i.e., the methods used in deepfake videos over the past six years), the deepfake detection research sector has produced very few papers that address the superior deepfaking capabilities of LDMs.
This may be due to the ongoing difficulty that LDMs have in creating temporally consistent video. Since autoencoder methods such as DeepFaceLab and FaceSwap (despite onerous training regimes, and results that are inferior to LDMs in quality) can produce consistent video fakes, and since video is considered the number one threat in the deepfake scene, security-based research into the facial deepfake properties of systems such as Stable Diffusion is quite nascent at the moment.
To date, this avenue of investigation has tended towards fully-supervised approaches. A fully-supervised approach assumes an unusual level of access to the technologies involved; in the security research sector, the equivalent field of study is ‘white box’ attacks, where new lines of research assume a level of access to the target technologies that is possible, but quite unlikely.
In the case of deepfake detection, a supervised approach will include knowledge of labels and other aspects of the data that are usually only available to the generating system, but cannot usually be known (or inferred) based on the output.
A weakly-supervised approach, by contrast, is equivalent to a ‘black box’ scenario in security research – where the methodology has access only to the superficial results of the system, and yet, hopefully, is able to infer some useful functionality only from this.
Considering these factors, a new paper from Bitdefender and the Polytechnic University of Bucharest offers such an approach, revising former methods so that they can be considered ‘weakly-supervised’, and therefore more generically applicable to a wider range of deepfake technologies – and especially diffusion-based approaches.
The methods offered in the paper show an improvement over prior offerings, and offer the possibility of increasing the generalization capabilities of deepfake detection models, so that they do not need to be constantly updated with the data from the most recent faking techniques.
The new approach concentrates on determining whether individual sections of an image have been altered, in contrast to the majority of work in recent years, which has primarily sought to determine whether an entire image has been generated.
The paper throws into sharp relief the laggard momentum of the anti-deepfake research sector, which continued to concentrate on Generative Adversarial Networks (GANs) for some years after the almost-complete conquest of the deepfake scene by autoencoder methods, from late 2017 onward.
Likewise, the ‘sunk cost’ of so much subsequent research into autoencoder deepfakes (a technology that has had no significant increase in quality in at least four years, and which can be considered to be ‘stalled’ at this point) finds the sector continuing to concentrate on autoencoders.
The new work focuses instead on the inpainting capabilities of LDMs, a functionality which converts source material into latent embeddings and performs manipulations (such as face-swapping, for instance using LoRAs) directly in latent space rather than pixel space:
Inpainting a substituted likeness in the AUTOMATIC1111 Stable Diffusion distribution, using a free LoRA downloaded from civit.ai – the work of a minute.
Because the source image is manipulated into the target image very deep in the noise-decoding process of the latent diffusion model’s latent space, inpainting conforms the target content better to the source content than most autoencoder and GAN projection techniques are capable of, resulting in trivially-easy static image fakes that comfortably pass standard tests:
There is a growing expectation that LDMs will soon gain temporally consistent face manipulation, though the inherent challenges may mean a longer wait for this functionality than many are presuming. Given the current state of the art in LDM-based deepfake detection, such an event seems likely to take the research community by surprise, which makes forays such as the new project from Bitdefender and the Bucharest polytechnic a welcome addition to the literature.
Additionally, it’s worth noting that if LDMs do ever become capable of generative consistency across frames, they will be capable of deepfaking entire bodies, and not just the central section of faces, as autoencoder methods currently do.
The new paper is titled Weakly-supervised deepfake localization in diffusion-generated images, and comes from two researchers at Bitdefender, and one from the University Politehnica of Bucharest.
The new work revisits three former approaches to the task at hand: Gradient-weighted class activation mapping (Grad-CAM); the truncated image classification network Patch-Forensics (called ‘patches’ in results), which obtains a patch-level score from feature activations; and the Facial Forgery Detection (FFD, called ‘attention’ in results) initiative from Michigan State University’s Computer Vision Lab, which uses an attention mechanism to create a mask of interest within a studied image.
For the new project, the Grad-CAM method was augmented by the researchers with the addition of an Xception network, which brings localization capabilities lacking in the original version. The new paper, its authors state, represents the first version of this method to be tested quantitatively, rather than qualitatively (i.e., to be evaluated via metrics and functions, rather than just soliciting user opinion).
In turn, for the Patch-Forensics architecture, which originally experimented with both ResNet and Xception backbones, the authors of the new paper dispense with ResNet (since the entire thrust of the new paper centers on the superior performance and utility of Xception).
In addition, to support a fully-supervised localization procedure (for testing and comparison purposes), the authors added a fully convolutional layer to the Grad-CAM network (as had previously been done in the project Fully convolutional networks for semantic segmentation).
Since dataset generation and curation is deeply bound into the new initiative, we’ll move on to Data and Tests, and take a further look at the method there.
Data and Tests
The data generated for the system, and for testing the system, was produced via Stable Diffusion, generating both complete images and inpainted images (where the background component was unchanged across the generation). The new work uses the 2022 RePaint technique to perform inpainting.
The researchers devised a variation on this prior project called Repaint-LDM. They explain*:
‘Latent diffusion models (LDM) have been shown to offer a scalable approach to generating high-fidelity images. Their main idea consists of performing diffusion in the (low-dimensional) latent space of a variational autoencoder (VAE).
‘We translate this idea to inpainting by running the Repaint [scheduler] in the latent space, x ← enc(x), of the variational autoencoder and using an appropriately downsized mask, m ← resize(m). This procedure generates an (inpainted) latent code, ˆx, which is then inverted to the original pixel space using the decoder of the VAE, dec(ˆx).
‘Notably, this method allows us to inpaint an image using any existing pretrained LDM model. To the best of our knowledge, this approach to inpainting is novel.’
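The procedure quoted above (encode, downsize the mask, inpaint in latent space, decode) can be sketched roughly as follows. This is only a toy illustration, not the authors' implementation: enc, dec and denoise_step are hypothetical stand-ins (average pooling, nearest-neighbour upsampling and a dummy update) for a real VAE and a pretrained LDM, the noise schedule is invented, and the mask convention (1 = region to inpaint) is an assumption made for this example.

```python
import random


def enc(image, f=2):
    """Toy stand-in for the VAE encoder (assumption): f x f average pooling."""
    rows, cols = len(image) // f, len(image[0]) // f
    return [[sum(image[i * f + a][j * f + b] for a in range(f) for b in range(f)) / (f * f)
             for j in range(cols)] for i in range(rows)]


def dec(latent, f=2):
    """Toy stand-in for the VAE decoder (assumption): nearest-neighbour upsampling."""
    return [[latent[i // f][j // f] for j in range(len(latent[0]) * f)]
            for i in range(len(latent) * f)]


def resize_mask(mask, f=2):
    """Downsize a binary pixel mask; a latent cell is inpainted if any of its pixels is."""
    rows, cols = len(mask) // f, len(mask[0]) // f
    return [[max(mask[i * f + a][j * f + b] for a in range(f) for b in range(f))
             for j in range(cols)] for i in range(rows)]


def denoise_step(z):
    """Hypothetical placeholder for one reverse-diffusion step of a pretrained LDM."""
    return [[0.5 + 0.8 * (v - 0.5) for v in row] for row in z]


def repaint_ldm(image, mask, steps=10, f=2, seed=0):
    rng = random.Random(seed)
    x = enc(image, f)         # x <- enc(x)
    m = resize_mask(mask, f)  # m <- resize(m)
    z = [[rng.gauss(0.0, 1.0) for _ in row] for row in x]  # start from pure noise
    for t in range(steps, 0, -1):
        noise = (t - 1) / steps  # toy schedule: noise level reaches 0 on the last step
        z = denoise_step(z)
        # Known content, re-noised to the current level.
        known = [[v + noise * rng.gauss(0.0, 0.1) for v in row] for row in x]
        # Keep generated values where m == 1 and known values where m == 0.
        z = [[m[i][j] * z[i][j] + (1 - m[i][j]) * known[i][j]
              for j in range(len(z[0]))] for i in range(len(z))]
    return dec(z, f)          # back to pixel space: dec(x_hat)
```

Because the mask is applied at latent resolution, a single masked pixel marks its whole latent cell for regeneration, while the unmasked cells are carried through from the encoded source.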
Model training and evaluation were carried out using the popular CelebA-HQ and FFHQ datasets, largely because they were used in prior related projects, and allowed some continuity of testing criteria. From each of these, the researchers selected a subset of 9,000 training and 900 validation images, to match the number of fake images that were generated for the project.
For the fake images, the authors used the perception-prioritized methods outlined in a prior 2022 paper from Korea.
This particular project was chosen because it had leveraged CelebA-HQ and FFHQ.
The authors of the new work used this approach to generate a 90/10 training/validation corpus of fake images (called in results ‘P2/CelebA-HQ’ and ‘P2/FFHQ’, respectively).
The inpainted regions involved isolating the skin, hair, eyes, mouth, and nose of images, and addressing the addition or removal of glasses – all standard computer vision segmentation/synthesis tasks, many dating back to the earliest days of GANs.
Two RePaint-based datasets were generated – one for CelebA-HQ, and one for FFHQ. In the case of CelebA-HQ, existing annotations could be used for the labeling needs of the project. Since FFHQ lacks such masks and ground truth, this was obtained by running the sub-set through MaskGAN:
Mask facets, from those listed, were selected randomly, and the resulting sets are called ‘Repaint-P2/CelebA-HQ’ and ‘Repaint-P2/FFHQ’ in results. Only the first of these was used extensively in testing, while the second was primarily used for training.
LaMa’s Fourier convolutions are part of an autoencoder framework, while Pluralistic is a conditional variational autoencoder (VAE). Again, both of these projects were trained on CelebA-HQ and FFHQ, which matches them well to the new initiative (even though this arguably contributes to the difficulty of getting improved datasets embedded into the research sector – a ‘dataset entropy’ that we have discussed before). In particular, this parity permitted the researchers to obtain consistent and comparable results by using the same masks across examples, and helped to individuate the differences across generators.
In testing the system, the authors followed the procedures outlined in Patch-Forensics, which ensured that both real and fake images were subject to identical preprocessing steps before being passed through to the detection approaches. Accordingly, the images in both real-world datasets were resized to 256×256px.
For the fake detection stage, average precision (AP) was used, with each image obtaining a per-image ‘fakeness’ score.
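As a rough illustration of how average precision summarizes per-image ‘fakeness’ scores, the sketch below ranks images by score and averages the precision at each rank where a fake image appears. It is a minimal, generic formulation rather than the authors' evaluation code; the function name and the label convention (1 = fake, 0 = real) are assumptions made for this example, and tied scores are simply left in input order.

```python
def average_precision(scores, labels):
    """Mean of precision@k over every rank k that holds a fake image.

    scores: per-image 'fakeness' scores; labels: 1 = fake, 0 = real (assumed convention).
    """
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one fake image to compute AP")
    # Highest score first; Python's sort is stable, so ties keep input order.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)
    hits, total = 0, 0.0
    for k, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            total += hits / k  # precision among the top-k images
    return total / n_pos


# Two fakes ranked 1st and 3rd out of four images: AP = (1/1 + 2/3) / 2
print(average_precision([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

A detector that ranks every fake image above every real one scores 1.0, regardless of the absolute values of its scores.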
Three setups were arranged for the tests. The first of these is ‘Setup A (label & full)’, in which the researchers have access to fully-generated images with only image-level labels, consisting of 9,000 fake images fully synthesized by P2, and 9,000 related images from the dataset on which P2 was originally trained.
The second setup is called ‘Setup B (label & partial)’. This is a weakly-supervised configuration where image-level labels are available, but no localization information (i.e., which parts of the image have been changed). Thus an image labeled ‘fake’ by the detection process may not be entirely fake. This uses inpainted images from Repaint-P2 and 9,000 real images from the corresponding real-world training dataset.
The third and final setup is called ‘Setup C (mask & partial)’. This is a fully-supervised setup with access to ground truth localization masks, and consists of 30,000 inpainted images from Repaint-P2. No real images are used here.
Initially these setups were evaluated for localization.
(Note: the results section of this paper is characterized by complex codification, which makes the results unusually opaque and difficult to understand – not least because excessive concurrent and consecutive tests have been parsed into single table results; please bear with us as we attempt to decode the terminology and labyrinthine nature of the results)
In the localization results table below, Grad-CAM (‘GC’), Patches (‘PT’) and Attention (‘AT’) are all tested on the Repaint-P2/CelebA-HQ dataset under the three levels of supervision described in the three setups outlined above. Localization is evaluated using Intersection over Union (IoU) and Pixel-wise Binary Classification Accuracy (PBCA). ‘AP’, as mentioned earlier, means ‘average precision’.
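For readers unfamiliar with the two localization metrics, the sketch below computes Intersection over Union and pixel-wise binary classification accuracy for a predicted mask against a ground-truth mask. It is a generic formulation under stated assumptions, not the paper's evaluation code: the 0.5 binarization threshold, the function names, and the choice to score two empty masks as a perfect IoU are all illustrative.

```python
def binarize(score_map, threshold=0.5):
    """Turn a per-pixel score map into a binary mask (threshold is an assumption)."""
    return [[1 if v >= threshold else 0 for v in row] for row in score_map]


def iou(pred, truth):
    """Intersection over Union of two binary masks (1 = manipulated pixel)."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    inter = sum(1 for p, t in pairs if p == 1 and t == 1)
    union = sum(1 for p, t in pairs if p == 1 or t == 1)
    return inter / union if union else 1.0  # two empty masks: treated as a perfect match


def pixel_accuracy(pred, truth):
    """Fraction of pixels whose binary label matches the ground truth."""
    pairs = [(p, t) for pr, tr in zip(pred, truth) for p, t in zip(pr, tr)]
    return sum(1 for p, t in pairs if p == t) / len(pairs)


pred = binarize([[0.9, 0.6], [0.2, 0.1]])  # -> [[1, 1], [0, 0]]
truth = [[1, 0], [0, 0]]
print(iou(pred, truth), pixel_accuracy(pred, truth))  # 0.5 0.75
```

The toy case shows why the two metrics can diverge: one false-positive pixel halves the IoU, while pixel accuracy stays at 75% because the untouched background still counts as correct.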
Of these results, the authors comment:
‘[We] see that Patches generally outperforms the other two approaches across multiple setups and [metrics]. We see that localization performance is strong for all methods when training in the fully supervised scenario (setup C) and performance drops as we move to the two weakly supervised setups (setups A and B). Interestingly, GradCAM and Attention perform better in setup B than in setup A, while for Patches we observe the reverse trend.
‘We believe that Patches is worse in setup B because the loss is set at patch-level, and the patch labels are inherently noisy as we use partially-manipulated images at input.
‘In terms of detection (the ‘AP’ columns in Table 2), we observe strong performance of Patches in both weakly supervised setups, A and B. Interestingly, the detection performance is good for all models in setup B.
‘In retrospect, this is expected since for the detection task in setup B the train data matches the test data.’
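The label-noise explanation that the authors offer for Patches in setup B can be made concrete with a small sketch. If every patch of a partially inpainted image inherits the image-level label ‘fake’, then any patch containing no manipulated pixels is trained with the wrong label. The code below measures that fraction for a given ground-truth mask; it assumes non-overlapping square patches and counts a patch as fake if it contains any manipulated pixel, which simplifies the overlapping receptive fields of the real architecture.

```python
def patch_label_noise(mask, patch=2):
    """Fraction of patches that would be labelled 'fake' via the image-level label
    despite containing no manipulated pixel (mask value 1 = manipulated)."""
    rows, cols = len(mask) // patch, len(mask[0]) // patch
    clean = 0
    for i in range(rows):
        for j in range(cols):
            touched = any(mask[i * patch + a][j * patch + b]
                          for a in range(patch) for b in range(patch))
            if not touched:
                clean += 1  # untouched patch, but it inherits the 'fake' label
    return clean / (rows * cols)


# Only the top-left quarter of a 4x4 image is inpainted: 3 of 4 patches are mislabelled.
mask = [[1, 1, 0, 0],
        [1, 1, 0, 0],
        [0, 0, 0, 0],
        [0, 0, 0, 0]]
print(patch_label_noise(mask))  # 0.75
```

On a fully generated image, as in setup A, the same measure is zero, which is consistent with the reverse trend that the authors report for Patches.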
The paper provides some examples of the localization maps obtained by the detection methods in all three setups, which effectively equate to a deepfake detection result:
The authors suggest that while all methods are able to recover the manipulated region in the fully-supervised scenario, patch-level approaches may be superior either to Grad-CAM or attention.
Time and space don’t permit us to cover all of the exhaustive (and complexly-conveyed) results featured in this paper, some of which arguably constitute ablation studies (the paper has no ablation studies section) rather than core results; but we should mention one more test that is particularly salient: performance on unseen datasets, which is the apposite context for a potential in-the-wild deepfake detector.
For this, an entirely ‘alien’ dataset – COCO Glide – was introduced, consisting of 512 images inpainted using a diffusion-based model.
The authors cross-tested this data with five other prior methods, all trained on their own respective datasets: MantraNet; Noiseprint; PSCC-Net; TruFor; and HiFi-Net. These were compared to the authors’ own Repaint-P2/CelebA-HQ sets, as well as COCO Glide.
In order to compare with patches, the PSCC method was fine-tuned in setup C on the Repaint-P2/CelebA-HQ dataset.
Here the authors comment:
‘We observe that the generalization performance is modest on either of the two datasets: the best out-of-domain performance on Repaint-P2/CelebA-HQ is 23.1%, obtained by TruFor, while on COCO Glide is 33.3%, obtained by PSCC.
‘Even methods that have shown to generalize (TruFor) or that have been trained specifically on diffusion images (HiFi-Net) have difficulties on out-of-domain datasets. Patches shows competitive results (second best in terms of IoU on COCO Glide), even if it was trained solely on faces. Interestingly, this is not the case for PSCC. While PSCC obtains top performance in-domain, on Repaint-P2/CelebA-HQ, it struggles to [generalize] to COCO Glide.
‘This behaviour suggests that overfitting is [occurring], which is not surprising given that the model capacity of PSCC (3.6M parameters) is an order of magnitude larger than the one of Patches (200k parameters).’
In concluding, the authors reiterate that the patch-based method outperforms the other two approaches tested, and that detection in the ‘label & partial’ scenario (Setup B) performs well in a number of possible configurations.
This suggests, the authors contend, that inpainted images are a strong contender for the training of deepfake classifiers. However, they concede that localization of diffusion-inpainted images is ‘very challenging even in the most optimistic scenario’.
Unfortunately, the opacity and organizational compression of this paper make it one of the most inaccessible that we have ever covered – which is a shame, as it has a couple of interesting takeaways, in a sector which seems poised to explode in the next 12-18 months.
The success of patch-based approaches indicates that this may be a fruitful line of research; and the fact that almost the entirety of the new paper constitutes an ablation study, one that could arguably have preceded more focused follow-on research, means that this was a hard-won revelation.
The second encouraging facet of the paper is that it was able to indicate a road forward at all in the area of weakly-supervised fake detection. In a research line that’s currently setting out on a potentially futile watermark war, this represents a refreshing and even promising direction.
* My conversion of the researchers’ inline citations to hyperlinks.