Combating ‘Identity Bleed’ in Deepfakes

About the author


Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.


The problem of ‘identity bleed’ has plagued deepfake systems ever since the technology came to prominence at the end of 2017. If you’ve seen a range of viral deepfakes on YouTube, you’re likely to have come across it – occasions where the swap is quite good, and looks more or less like the person who is supposed to be inserted into the video…and yet it doesn’t quite convince.

Two face-swapping frameworks attempt a notable shift in identity, with varying and unconvincing results; in both cases, some measure of the host identity bleeds back through the swap. Source: https://arxiv.org/pdf/2306.05356.pdf

Somehow, the target identity, which was supposed to be overwritten by the new identity, bleeds back through and blends with the superimposed personality. In the worst cases, the deepfake can look more like the offspring of the source and target identity than a different person.

For this reason, practitioners of autoencoder deepfakes have traditionally sought target material where the host identity is fairly similar to the target identity – which means that even if the host identity does bleed through, the swap will probably still work, because there was pre-existing similarity.

A recent paper from China’s Sun Yat-sen University, and from Chinese technology conglomerate Tencent, attempts to address the issue, and to explain why the phenomenon occurs, describing the challenge as a ‘design flaw’ in older face-swapping architectures:

‘During training, given the target and source of different identities, there is no pixel-wise supervision to guide synthesis in the previous [methods]. To deal with this, they pick 20%–50% training input pairs and set the source and target to be the same person in each pair.

‘For these pairs, face swapping can leverage re-construction as the proxy task, and make the target face as the pixel-wise supervision. Nonetheless, the remaining pairs still lack pixel-level supervision.’
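The pairing strategy described in the quote can be sketched as follows; `faces_by_identity` and the 0.3 default ratio are illustrative assumptions, not names or values from the paper:

```python
import random

def sample_training_pair(faces_by_identity, same_identity_ratio=0.3):
    """Pick a (source, target) pair for one training step.

    With probability `same_identity_ratio` (prior methods use 20%-50%,
    per the paper), source and target share an identity, so the target
    image itself supplies pixel-wise supervision; otherwise the pair is
    cross-identity and no ground-truth swap result exists."""
    identities = list(faces_by_identity)
    if random.random() < same_identity_ratio:
        ident = random.choice(identities)
        source, target = random.sample(faces_by_identity[ident], 2)
        supervision = target       # reconstruction as the proxy task
    else:
        id_a, id_b = random.sample(identities, 2)
        source = random.choice(faces_by_identity[id_a])
        target = random.choice(faces_by_identity[id_b])
        supervision = None         # the gap cycle triplets aim to fill
    return source, target, supervision
```

The cross-identity branch is the problem case the paper targets: it returns no supervision signal at all.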

Put another way, let’s consider the conceptual schema for autoencoder deepfake systems such as DeepFaceLab and FaceSwap. Once all the face data is gathered and curated, and the model is actually training, the system begins to learn how to reconstruct (not swap) the two identities that have been plugged into it. In the case of the example below, the autoencoder system is learning, separately, how to reconstruct the actors Jack Nicholson and Jim Carrey.

An autoencoder system learns to reconstruct two distinct personalities. The information that it learns during training is saved into a shared encoder, so that the data on either personality is available from the same source.

What the system is not learning how to do is to transfer these actors’ faces; it has absolutely no idea how to accomplish such a task, but instead only understands how to re-synthesize each actor – essentially a neural equivalent of the Star Trek transporter, where the system is designed to recreate the input perfectly.

The swapping capability is achieved via what could be considered an ugly, post facto hack – a mere moment of code-based rewiring, where the decoders for the two identities are simply switched around.
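A toy sketch of that rewiring, with plain linear maps standing in for the convolutional encoder and decoders of systems such as DeepFaceLab (all dimensions and names here are illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
latent_dim, face_dim = 64, 256

# One shared encoder; one decoder per identity. Plain linear maps stand in
# for the convolutional networks of real autoencoder deepfake systems.
shared_encoder = rng.standard_normal((latent_dim, face_dim)) * 0.01
decoder_a = rng.standard_normal((face_dim, latent_dim)) * 0.01  # learns identity A
decoder_b = rng.standard_normal((face_dim, latent_dim)) * 0.01  # learns identity B

def reconstruct(face, decoder):
    # Shared encoding, identity-specific decoding.
    return decoder @ (shared_encoder @ face)

face_a = rng.standard_normal(face_dim)

# Training-time behaviour: A is rebuilt through its own decoder (A>A).
recon_a = reconstruct(face_a, decoder_a)

# The 'swap' is nothing more than routing A's latent code through B's
# decoder (A>B) -- a rewiring the model was never explicitly trained for.
swapped_ab = reconstruct(face_a, decoder_b)
```

No loss ever touches the `swapped_ab` path during training; only the two reconstruction paths are supervised.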

Face superimposition is achieved via a simple rewiring of the encoder/decoder routes.

In a way, the trained model is not ready for this, and has done almost none of the work that could make a ‘last minute’ swap like this effective. In practice, as indicated before, it works because the selected identities are usually near enough in facial characteristics that any shortfall is covered by the pre-existing similarity.

The new paper illustrates, literally, the extent to which the swapping process itself is naïve and unsupervised by any dedicated reconstruction loss to target the swap (A>B) rather than the reconstruction of the original identity (A>A).

From the new paper, an illustration of the informational shortfall that occurs in face-swapping systems that simply train reconstructive systems and then swap the output – the A>B identity swap process lacks any pixel-level supervision, unlike the A>A or B>B recreations.

The authors of the new work attempt to address this shortcoming in identity transfer systems through the development of a supervisory system that infers important cross-identity characteristics and allows them some training time of their own, so that the swap is ‘studied’ and informed, rather than randomly rewired from two systems (A>A and B>B) that know nothing about each other.

ReliableSwap vs three former approaches. Source: https://arxiv.org/pdf/2306.05356.pdf

However, the new approach, dubbed ReliableSwap, is not a mere new contender in the face-swapping space, but actually a conceptual adjunct methodology that can be added to existing systems, as a way to improve them.

ReliableSwap attempts an interracial and intergender transfer. See source video for many other examples, all at better clarity and resolution. Source: https://www.youtube.com/watch?v=uqe4pD-XpGE

In addition to introducing blend-specific training characteristics into the process, ReliableSwap also addresses a fundamental limitation of nearly every face-swapping technology currently available – the disproportionate generative emphasis on the upper face area as the main locus of identity. By neglecting the lower part of the face, many current approaches fail to plausibly superimpose identities except in cases where the host identity is extremely similar to the target identity.

To this end, the authors of the new work have created a secondary ‘fixer’ module that exclusively concentrates on this lower area.

The training stage of the FixerNet component of ReliableSwap.

The resulting system boosts identity preservation notably. Though the quality of results demonstrated in the work’s supplementary material varies, the real achievement of ReliableSwap is that it acknowledges, and at least attempts to redress, the architectural shortcomings of influential and commonly-used face-swapping frameworks. It finally devotes some neural resources at training time to the mechanics of the swap itself – which, perhaps to the surprise of some, is currently handled mostly by subject choice, careful curation, and a fair degree of luck.

The new paper is titled ReliableSwap: Boosting General Face Swapping Via Reliable Supervision, and comes from four researchers across the aforementioned institutions.

Approach

The system devises a schema of cycle triplets to support the development of swap-specific information.

Conceptual schema for cycle triplets.

Cycle triplets essentially create synthetic data, or faux or prefatory swaps to inform the model’s ability to swap faces. Regarding the illustration above, the authors state:

‘[Given] two real images (the target Ca and the source Cb), we blend the face of Cb into Ca through face [reenactment] and multi-band [blending], obtaining the synthesized swapped face Cab. These techniques ensure the high-level semantics (identity) are unchanged when pasting a blob of connected pixels (facial regions) from the source Cb to the target Ca.

‘Thus, Cab inherits identity from the source Cb and other identity-irrelevant attributes from the target Ca. Similarly, blending the face of Ca into Cb produces another synthesized swapped face Cba. As a result, Cab preserves the identity from Cb, and Cba maintains the attributes from Cb.

‘Then, when using the synthesized results Cba as the target input and Cab as the source one, an ideal face swapping model would output Cb as the result, which forms cycle relationship.’

All these processes provide synthesized approximations for inter-identity swaps prior to the actual actions on real data by a completed model, filling in the missing A>B/B>A knowledge-gaps present in most current neural facial identity-swapping architectures.
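The cycle relationship can be illustrated by modelling a face as a simple (identity, attributes) pair; the `blend` function below is a stand-in for the reenactment-plus-blending pipeline, not the authors' code:

```python
def blend(source, target):
    """Stand-in for LIA reenactment + multi-band blending: carries the
    source's identity into the target's pose/lighting/background.
    Faces are modelled as (identity, attributes) dicts for illustration."""
    return {"identity": source["identity"], "attributes": target["attributes"]}

c_a = {"identity": "A", "attributes": "a-pose"}   # real target image
c_b = {"identity": "B", "attributes": "b-pose"}   # real source image

# Offline construction of the synthetic halves of the cycle triplet:
c_ab = blend(source=c_b, target=c_a)   # B's identity in A's attributes
c_ba = blend(source=c_a, target=c_b)   # A's identity in B's attributes

# The cycle relationship: feeding the synthetic pair back through an ideal
# swap model should recover the REAL image c_b -- giving, for the first
# time, pixel-level ground truth for a cross-identity swap.
ideal_output = blend(source=c_ab, target=c_ba)
assert ideal_output == c_b
```

The final assertion is the whole point: the expected output of the synthetic-pair swap is a real photograph, so a pixel-wise loss becomes possible.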

Secondary systems are used to obtain the necessary information, including the 2022 Latent Image Animator (LIA) project, which uses an autoencoder framework to navigate and impose new directions in the latent space of a trained embedding, allowing the user to perform notable manipulations:

Examples from the paper for Latent Image Animator (LIA), a 2022 project incorporated into ReliableSwap. Source: https://arxiv.org/pdf/2203.09043.pdf

LIA provides the face reenactment data for the cycle triplets, while a 1983 mosaic composition framework is used to accomplish multi-band blending, and the actual swapping of the identities for the synthetic data.

From the early 1980s, a multiresolution spline process, capable of performing blends, is used as part of the ReliableSwap method. Source: http://ai.stanford.edu/~kosecka/burt-adelson-spline83.pdf
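A minimal sketch of the multiresolution-spline idea, assuming single-channel images with power-of-two dimensions, and using block-average downsampling in place of the original paper's Gaussian kernels:

```python
import numpy as np

def down(img):
    # Halve resolution by 2x2 block averaging (a crude Gaussian-pyramid step).
    return (img[0::2, 0::2] + img[1::2, 0::2]
            + img[0::2, 1::2] + img[1::2, 1::2]) / 4.0

def up(img, shape):
    # Nearest-neighbour expansion back to `shape`.
    return img.repeat(2, axis=0).repeat(2, axis=1)[:shape[0], :shape[1]]

def multiband_blend(a, b, mask, levels=4):
    """Blend `a` over `b` under `mask` (1.0 = take `a`), band by band:
    low frequencies are mixed with a heavily smoothed mask, high
    frequencies with a sharp one, hiding the seam between the images."""
    la, lb, gm = [], [], []
    ca, cb, cm = a, b, mask
    for _ in range(levels):
        da, db = down(ca), down(cb)
        la.append(ca - up(da, ca.shape))   # Laplacian band of a
        lb.append(cb - up(db, cb.shape))   # Laplacian band of b
        gm.append(cm)                      # mask at this resolution
        ca, cb, cm = da, db, down(cm)
    out = cm * ca + (1 - cm) * cb          # blend the low-pass residual
    for band_a, band_b, m in zip(reversed(la), reversed(lb), reversed(gm)):
        out = up(out, band_a.shape) + m * band_a + (1 - m) * band_b
    return out
```

Because the mask is itself downsampled at every level, its edge is soft for coarse bands and sharp for fine detail – the property that lets a pasted facial region sit in its new context without a visible seam.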

The supplementary FixerNet module, according to the authors, embeds the discriminative features of the lower face as a booster to the overall identity embedding. Since existing metrics (such as CosFace and ArcFace) tend to report reconstruction accuracy centered around the upper face area, the authors have devised additional metrics that account for the entire effect of reconstruction, including the novel addressing of the lower face area: lower-face identity retrieval (L Ret) and lower-face identity similarity (L Sim).
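A rough sketch of how such a lower-face similarity metric might be computed; `embed` stands in for a trained lower-face identity network, and the half-image crop is an illustrative assumption rather than the paper's exact region:

```python
import numpy as np

def cosine_similarity(u, v):
    return float(u @ v / (np.linalg.norm(u) * np.linalg.norm(v)))

def lower_face_similarity(swapped, source, embed):
    """L Sim-style metric: identity similarity computed only on the lower
    half of each aligned face, where mouth, jaw and chin reside."""
    half = swapped.shape[0] // 2
    return cosine_similarity(embed(swapped[half:]), embed(source[half:]))
```

Restricting the crop before embedding is what forces the metric to reward lower-face fidelity, which whole-face identity scores largely ignore.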

Conceptual architecture for ReliableSwap.

The LIA-reenacted face is parsed by the InsightFace framework (which recently came to public attention as the basis of the ‘one-click’ ROOP face-swapping project), with the identified and segmented faces then passed to the multi-band blending process.

Thus far, no account has been made of dissonance in face shape between the two identities – one of the central concerns that forces face-swapping enthusiasts to target ‘similar’ host identities. Therefore ReliableSwap introduces a reshaping stage, using a pre-trained face inpainting network, to address the issue.

The segmented faces are passed to a reshaping network, where inpainting is used to account for disparities in face shape, and also to fill in background or 'remaining' material.

The cycle triplet loss is calculated via the reconstruction loss developed for the 2018 paper Towards Open-Set Identity Preserving Face Synthesis, with the Learned Perceptual Image Patch Similarity (LPIPS) loss metric also used to refine the training.
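One plausible shape for such a combined supervision term (the 0.8 perceptual weight and the `perceptual` feature extractor are illustrative, not taken from the paper):

```python
import numpy as np

def cycle_supervision_loss(pred, target, perceptual, w_pixel=1.0, w_percep=0.8):
    """Pixel-wise reconstruction against the triplet's ground-truth face,
    plus a perceptual distance in feature space. LPIPS plays the role of
    `perceptual` in the actual work; the weights here are illustrative."""
    pixel_term = np.abs(pred - target).mean()
    fp, ft = perceptual(pred), perceptual(target)
    percep_term = ((fp - ft) ** 2).mean()
    return float(w_pixel * pixel_term + w_percep * percep_term)
```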


Cycle triplets increase the transfer of identity. See source video for many other examples, all at better clarity and resolution. Source: https://www.youtube.com/watch?v=uqe4pD-XpGE

Data and Tests

To undertake tests of the new system, the authors leveraged the VGGFace2 dataset for training, which contains 3.3 million face images. The top 1.5 million images were cropped (to 256px square) and aligned in accordance with the conventions of the FFHQ dataset.

600,000 cycle triplets were created offline prior to the training process, with lower-quality images filtered out via the SER-FIQ IQA (image quality assessment) method. The FixerNet lower-face module was trained on a ResNet-50 backbone over the MS-Celeb-1M dataset, using ArcFace loss. The main network backbone was IResNet-50 (ArcFace), trained on the CASIA-WebFace dataset with CosFace loss.

It must be remembered that ReliableSwap is an adjunct or ancillary framework, and not in direct competition with prior methods in the way that an entirely novel framework would be. The two baseline frameworks chosen for comparison were SimSwap and FaceShifter, with batch size, training steps and learning rate all normalized for fairness.

For qualitative comparison, the researchers used FaceShifter’s own evaluation methods.

Qualitative comparisons.

Of these results, the authors state ‘The results demonstrate that our ReliableSwap preserves more identity details.’

Additionally, the authors evaluated ReliableSwap against ad hoc celebrity faces collected from the internet, and found that the collective trained knowledge of the cycle triplets was able to maintain swap quality against unseen data.

Unseen data run through the ReliableSwap system.

Here the researchers comment:

‘Benefiting from the reliable supervision provided by cycle triplets and lower facial details kept through FixerNet, our results preserve high-fidelity source identity, including nose, mouth, and face shape.’

Finally, for qualitative tests, ReliableSwap was placed against the frameworks HiRes, MegaFS, Hififace, InfoSwap, SimSwap and FaceShifter.

Qualitative results against 'rival' frameworks. See the source paper for better resolution and detail.

These tests were performed against the CelebA-HQ dataset. Five challenging pairs were selected, with clear opposing traits in terms of gender, skin color, expression and facial pose. Of the results, the authors of the new paper comment ‘Our ReliableSwap outperforms others on source identity preservation, as well as global similarity and local details.’

For the quantitative rounds, the authors followed the FaceForensics++ evaluation standards to obtain values for identity retrieval, head pose errors and expression errors. For this, 10 frames were extracted from each dataset video and processed via MTCNN, resulting in 10,000 aligned faces, which were used as target inputs. Identity embeddings were extracted via CosFace, with HopeNet employed as a pose estimator, and expression estimated via the Deep3D framework.
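Identity retrieval of this kind reduces to a nearest-neighbour search over identity embeddings; a minimal sketch, assuming row-vector embeddings with non-zero norm:

```python
import numpy as np

def identity_retrieval_rate(swapped_emb, gallery_emb, true_ids):
    """ID Ret.-style metric: for each swapped face, retrieve the nearest
    gallery identity by cosine similarity, and report how often it matches
    the intended source identity. Rows are embedding vectors."""
    s = swapped_emb / np.linalg.norm(swapped_emb, axis=1, keepdims=True)
    g = gallery_emb / np.linalg.norm(gallery_emb, axis=1, keepdims=True)
    retrieved = (s @ g.T).argmax(axis=1)   # nearest gallery identity
    return float((retrieved == np.asarray(true_ids)).mean())
```

A higher rate means the swapped faces land closer to their intended source identities in embedding space than to anyone else in the gallery.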

Commenting on the results of this test, the authors state:

‘The [results] show that our ReliableSwap improves the identity consistency on SimSwap and FaceShifter. Besides, ours based on FaceShifter achieves the highest ID Ret. and L Ret., and ours based on SimSwap are with [the] best Pose and comparable Exp., which demonstrates the efficacy of the proposed method.’

As a further quantitative test, the authors used the evaluation methodology of RAFSwap. This time they randomly sampled 100,000 CelebA-HQ image pairs, reporting identity similarity, pose error, expression errors, and Fréchet Inception Distance (FID) scores.
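For reference, FID is the Fréchet distance between two Gaussians fitted to feature sets; a compact sketch (the eigenvalue route to Tr((C1·C2)^(1/2)) assumes both covariance matrices are positive semi-definite):

```python
import numpy as np

def frechet_distance(feats_a, feats_b):
    """FID computed from two sets of feature vectors (rows = samples):
    ||mu1 - mu2||^2 + Tr(C1 + C2 - 2 * (C1 @ C2)^(1/2))."""
    mu1, mu2 = feats_a.mean(axis=0), feats_b.mean(axis=0)
    c1 = np.cov(feats_a, rowvar=False)
    c2 = np.cov(feats_b, rowvar=False)
    # (C1 @ C2) has a real, non-negative spectrum when C1 and C2 are PSD,
    # so Tr((C1 @ C2)^(1/2)) is the sum of square roots of its eigenvalues.
    eig = np.linalg.eigvals(c1 @ c2).real
    trace_sqrt = np.sqrt(np.clip(eig, 0.0, None)).sum()
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(c1) + np.trace(c2) - 2.0 * trace_sqrt)
```

In practice the features come from a pre-trained Inception network; here any feature extractor's output can be plugged in.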

Here the authors assert ‘Our ReliableSwap achieves the best identity preservation, and comparable Pose, Exp., and FID’.

Comparisons with prior frameworks. See source video (embedded at end of article) for many other examples, all at better clarity and resolution. Source: https://www.youtube.com/watch?v=uqe4pD-XpGE

A human study was also conducted, where volunteers were asked to choose the processed image most resembling the source face, and the one that retained the highest amount of identity-pertinent attributes with the target face. For this, 30 pairs of images were randomly sampled from the qualitative round.

Here, as we can see in the image above, ReliableSwap led in all areas except the attribute retention, where FaceShifter had a slight lead. It is perhaps worth observing the very large lead that the new framework was able to obtain in the other results – rather more than an incremental improvement.

The authors conclude:

‘Our ReliableSwap achieves state-of-the-art performance on the FaceForensics++ and CelebA-HQ datasets and other wild faces, which demonstrates the superiority of our method.’

Conclusion

The new work performs at least one very useful function: it highlights the extent to which current SOTA frameworks tend to neglect the conforming of identity to identity at training time, relying instead on a priori human expectations, and on a ‘sense of wonder’ and credulous disposition in the viewer – a reaction that is already beginning to fade in the years since the advent of deepfakes, and which can surely not be relied upon in the future, as face replacement networks become more common and more sophisticated.

As things stand, face synthesis is divided between generative networks such as GANs and latent diffusion models, which can reproduce faces perfectly but struggle to maintain continuity or integration with existing source material; and autoencoder-style ‘facial imposition’ systems which, as we have seen, tend to bolt alternative identity features quite crudely onto host identities, without examining the exact relationship between two paired faces, or specifically training pathways between them, on an adequate number of channels, and with reliable and consistent metrics.

Whether or not ReliableSwap is likely to evolve into a fruitful adjunct system, the researchers have made a strong case in principle for the functionality that the system strives to provide – not least because it represents a core, platform-agnostic change in methodological approach.
