The problem of ‘identity bleed’ has plagued deepfake systems ever since the technology came to prominence at the end of 2017. If you’ve watched a range of viral deepfakes on YouTube, you’re likely to have come across it – occasions where the swap is quite good, and looks more or less like the person who is supposed to be inserted into the video…but it doesn’t quite work.
Somehow, the target identity, which was supposed to be overwritten by the new identity, bleeds back through and blends with the superimposed personality. In the worst cases, the deepfake can look more like the offspring of the source and target identity than a different person.
For this reason, practitioners of autoencoder deepfakes have traditionally sought target material where the host identity is fairly similar to the target identity – which means that even if the host identity does bleed through, the swap will probably still work, because there was pre-existing similarity.
A recent paper from China’s Sun Yat-sen University, and from Chinese technology conglomerate Tencent, attempts to address the issue, and to explain why the phenomenon occurs, describing the challenge as a ‘design flaw’ in older face-swapping architectures:
‘During training, given the target and source of different identities, there is no pixel-wise supervision to guide synthesis in the previous [methods]. To deal with this, they pick 20%∼50% training input pairs and set the source and target to be the same person in each pair.
‘For these pairs, face swapping can leverage re-construction as the proxy task, and make the target face as the pixel-wise supervision. Nonetheless, the remaining pairs still lack pixel-level supervision.’
Put another way, let’s consider the conceptual schema for autoencoder deepfake systems such as DeepFaceLab and FaceSwap. Once all the face data is gathered and curated, and the model is actually training, the system begins to learn how to reconstruct (not swap) the two identities that have been plugged into it. In the case of the example below, the autoencoder system is learning, separately, how to reconstruct the actors Jack Nicholson and Jim Carrey.
What the system is not learning is how to transfer these actors’ faces; it has absolutely no idea how to accomplish such a task, and instead only understands how to re-synthesize each actor – essentially a neural equivalent of the Star Trek transporter, where the system is designed to recreate the input perfectly.
The swapping capability is achieved via what could be considered an ugly, post facto hack – a mere moment of code-based rewiring, where the decoders for the two identities are simply switched around.
In a way, the trained model is not ready for this, and has done almost none of the work that could make a ‘last-minute’ swap of this kind effective. In practice, as indicated earlier, it works because the selected identities are usually near enough in facial characteristics that any shortfall is covered by their pre-existing similarity.
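The decoder-swap ‘rewiring’ described above can be sketched in a few lines. This is a minimal NumPy illustration, not code from DeepFaceLab or FaceSwap: a shared encoder compresses any face into a latent code, each identity gets its own decoder trained purely on reconstruction, and the ‘swap’ is nothing more than routing identity A’s code through identity B’s decoder – a pathway no loss term ever supervised.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-ins for trained networks: a shared encoder and one decoder
# per identity. Real systems use deep convolutional nets; linear maps
# are enough here to show the wiring.
W_enc = rng.standard_normal((64, 16))    # shared encoder: 64-dim face -> 16-dim code
W_dec_a = rng.standard_normal((16, 64))  # decoder trained to reconstruct identity A
W_dec_b = rng.standard_normal((16, 64))  # decoder trained to reconstruct identity B

def encode(face):
    return face @ W_enc

def decode_a(code):
    return code @ W_dec_a

def decode_b(code):
    return code @ W_dec_b

face_a = rng.standard_normal(64)

# Training objective: A>A and B>B reconstruction only.
recon_a = decode_a(encode(face_a))

# The "swap" is a one-line rewiring at inference time: A's code is fed
# to B's decoder. Nothing in training ever optimized this A>B pathway.
swap_ab = decode_b(encode(face_a))

print(recon_a.shape, swap_ab.shape)  # both (64,)
```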
The new paper illustrates, literally, the extent to which the swapping process itself is naïve and unsupervised by any dedicated reconstruction loss to target the swap (A>B) rather than the reconstruction of the original identity (A>A).
The authors of the new work are attempting to address this shortcoming in identity transfer systems, through the development of a supervisory system that infers important cross-identity characteristics and allows them some training time of their own, so that the swap is ‘studied’ and informed rather than randomly rewired from two systems (A>A and B>B) that know nothing about each other.
However, the new approach, dubbed ReliableSwap, is not a mere new contender in the face-swapping space, but actually a conceptual adjunct methodology that can be added to existing systems, as a way to improve them.
ReliableSwap attempts an interracial and intergender transfer. See source video for many other examples, all at better clarity and resolution. Source: https://www.youtube.com/watch?v=uqe4pD-XpGE
In addition to generating and imposing blend-specific training characteristics into the process, ReliableSwap also addresses a fundamental limitation of nearly every face-swapping technology currently available – the increased generative emphasis on the upper face area as the main locus of identity. By neglecting the lower part of the face, many current approaches fail to plausibly superimpose identities except in cases where the host identity is extremely similar to the target identity.
To this end, the authors of the new work have created a secondary ‘fixer’ module that exclusively concentrates on this lower area.
The resulting system boosts identity preservation notably. Though the quality of results demonstrated in the work’s supplementary material varies, the real achievement of ReliableSwap is that it acknowledges and at least attempts to redress the architectural shortcomings of influential and commonly-used face-swapping frameworks, and finally devotes some neural resources at training time to considering the mechanics of the swap itself – which, perhaps to the surprise of some, is currently handled mostly by subject choice, careful curation, and a fair degree of luck.
The new paper is titled ReliableSwap: Boosting General Face Swapping Via Reliable Supervision, and comes from four researchers across the aforementioned institutions.
The system devises a schema of cycle triplets to support the development of swap-specific information.
Cycle triplets essentially create synthetic data – faux, preparatory swaps – to inform the model’s ability to swap faces. Regarding the illustration above, the authors state:
‘[Given] two real images (the target Ca and the source Cb), we blend the face of Cb into Ca through face [reenactment] and multi-band [blending], obtaining the synthesized swapped face Cab. These techniques ensure the high-level semantics (identity) are unchanged when pasting a blob of connected pixels (facial regions) from the source Cb to the target Ca.
‘Thus, Cab inherits identity from the source Cb and other identity-irrelevant attributes from the target Ca. Similarly, blending the face of Ca into Cb produces another synthesized swapped face Cba. As a result, Cab preserves the identity from Cb, and Cba maintains the attributes from Cb.
‘Then, when using the synthesized results Cba as the target input and Cab as the source one, an ideal face swapping model would output Cb as the result, which forms cycle relationship.’
All these processes provide synthesized approximations for inter-identity swaps prior to the actual actions on real data by a completed model, filling in the missing A>B/B>A knowledge-gaps present in most current neural facial identity-swapping architectures.
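The cycle relationship can be sketched as a loss term. The following NumPy sketch is an illustration under stated assumptions – the blended faces here are random stand-ins, and `ideal_model` is a hypothetical perfect swapper used only to show that the supervision bottoms out at zero; the actual ReliableSwap losses are those given in the paper:

```python
import numpy as np

def cycle_loss(model, c_b, c_ab, c_ba):
    """Pixel-wise cycle supervision, per the triplet scheme.

    c_ab: Cb's face blended into Ca (keeps Cb's identity, Ca's attributes)
    c_ba: Ca's face blended into Cb
    An ideal model given target=c_ba and source=c_ab should recover Cb,
    so |model(c_ba, c_ab) - c_b| supplies direct pixel-level supervision.
    """
    out = model(target=c_ba, source=c_ab)
    return np.abs(out - c_b).mean()

# Hypothetical model that already solves the task perfectly, used only
# to demonstrate that the loss is zero for an ideal face swapper.
def ideal_model(target, source):
    return c_b

rng = np.random.default_rng(1)
c_a, c_b = rng.random((2, 8, 8, 3))    # real target and source faces
c_ab, c_ba = rng.random((2, 8, 8, 3))  # stand-ins for the blended triplet faces

print(cycle_loss(ideal_model, c_b, c_ab, c_ba))  # 0.0
```

The same relation holds in the other direction: feeding Cab as the target and Cba as the source should recover Ca.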
Secondary systems are used to obtain the necessary information, including the 2022 Latent Image Animator (LIA) project, which uses an autoencoder framework to navigate and impose new directions in the latent space of a trained embedding, allowing the user to perform notable manipulations:
The supplementary FixerNet module, according to the authors, embeds the discriminative features of the lower face as a booster to the overall identity embedding. Since existing metrics (such as CosFace and ArcFace) tend to report reconstruction accuracy centered around the upper face area, the authors have devised additional metrics that account for the entire effect of reconstruction, including the novel addressing of the lower face area: lower-face identity retrieval (L Ret) and lower-face identity similarity (L Sim).
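A lower-face similarity metric of this kind amounts to cropping the lower facial region and comparing identity embeddings by cosine similarity. The sketch below is generic and hypothetical – `embed` is a placeholder for an ArcFace-style network restricted to the lower face, not the authors’ actual FixerNet features:

```python
import numpy as np

def lower_half(face):
    """Crop the lower facial region (here, simply the bottom half of the rows)."""
    h = face.shape[0]
    return face[h // 2:]

def embed(region):
    # Hypothetical identity embedder; a real system would use learned
    # deep features rather than raw pixels.
    return region.reshape(-1)

def l_sim(face_a, face_b):
    """Lower-face identity similarity: cosine similarity of lower-face embeddings."""
    ea, eb = embed(lower_half(face_a)), embed(lower_half(face_b))
    return float(ea @ eb / (np.linalg.norm(ea) * np.linalg.norm(eb)))

rng = np.random.default_rng(2)
face = rng.random((16, 16))
print(round(l_sim(face, face), 6))  # 1.0 for identical faces
```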
The LIA-reenacted face is parsed by the InsightFace framework (which recently came to public attention as the basis of the ‘one-click’ ROOP face-swapping project), with the identified and segmented faces then passed to the multi-band blending process.
Thus far, no account has been made of dissonance in face shape between the two identities – one of the central concerns that forces face-swapping enthusiasts to target ‘similar’ host identities. Therefore ReliableSwap introduces a reshaping stage, using a pre-trained face inpainting network, to address the issue.
The cycle triplet loss is calculated via the reconstruction loss developed for the 2018 paper Towards Open-Set Identity Preserving Face Synthesis, with the Learned Perceptual Image Patch Similarity (LPIPS) loss metric also used to refine the training.
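Schematically, a training objective of this kind sums a pixel-wise reconstruction term with a weighted perceptual term. In the NumPy sketch below, the perceptual component is a crude block-averaging proxy standing in for LPIPS (which compares activations of a pretrained network), and the weighting is an assumption for illustration – the paper gives the actual coefficients:

```python
import numpy as np

def pixel_loss(pred, target):
    return np.abs(pred - target).mean()

def perceptual_proxy(img):
    # Crude stand-in for LPIPS deep features: 4x4 block-averaged
    # intensities. The real metric uses pretrained-network activations.
    h, w = img.shape
    return img.reshape(h // 4, 4, w // 4, 4).mean(axis=(1, 3))

def total_loss(pred, target, w_perceptual=0.1):
    # The perceptual weight here is illustrative, not the paper's value.
    return pixel_loss(pred, target) + w_perceptual * np.abs(
        perceptual_proxy(pred) - perceptual_proxy(target)).mean()

rng = np.random.default_rng(3)
target = rng.random((16, 16))
print(total_loss(target, target))  # 0.0 when the prediction matches the target exactly
```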
Data and Tests
To undertake tests of the new system, the authors leveraged the VGGFace2 dataset for training, which contains 3.3 million face images. The top 1.5 million images were cropped (to 256px square) and aligned in accordance with the conventions of the FFHQ dataset.
600,000 cycle triplets were created offline prior to the training process, with lower-quality images filtered out via the SER-FIQ IQA (image quality assessment) filter. The FixerNet lower-face module was run on a ResNet-50 backbone over the MS-Celeb-1M dataset using ArcFace loss. The main network backbone was IresNet-50 (ArcFace), trained on the CASIA-WebFace dataset with CosFace loss.
It must be remembered that ReliableSwap is an adjunct or ancillary framework, and not in direct competition with prior methods in the same way as an entirely novel framework. The two baseline frameworks chosen for comparison were SimSwap and FaceShifter, with batch size, training steps and learning rate all normalized, for fairness.
For qualitative comparison, the researchers used FaceShifter’s own evaluation methods.
Of these results, the authors state ‘The results demonstrate that our ReliableSwap preserves more identity details.’
Additionally, the authors evaluated ReliableSwap against ad hoc celebrity faces collected from the internet, and found that the collective trained knowledge of the cycle triplets was able to maintain swap quality against unseen data.
Here the researchers comment:
‘Benefiting from the reliable supervision provided by cycle triplets and lower facial details kept through FixerNet, our results preserve high-fidelity source identity, including nose, mouth, and face shape.’
These tests were performed against the CelebA-HQ dataset. Five challenging pairs were selected, with clear opposing traits in terms of gender, skin color, expression and facial pose. Of the results, the authors of the new paper comment ‘Our ReliableSwap outperforms others on source identity preservation, as well as global similarity and local details.’
For the quantitative rounds, the authors followed the FaceForensics++ evaluation standards to obtain values for identity retrieval, head pose errors and expression errors. For this, 10 frames were extracted from dataset videos and processed via MTCNN, resulting in 10,000 aligned faces, which were used as target inputs. Identity embeddings were extracted via CosFace, and HopeNet was employed as a pose estimator for the Deep3D framework.
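The identity-retrieval figure in this kind of evaluation is a nearest-neighbor lookup in embedding space: a swapped face counts as a hit if its closest gallery embedding belongs to the intended source identity. The sketch below is a generic illustration with toy two-dimensional vectors, not CosFace features:

```python
import numpy as np

def identity_retrieval(swapped_embs, gallery_embs, source_ids, gallery_ids):
    """Fraction of swapped faces whose nearest gallery embedding
    (by cosine similarity) matches the intended source identity."""
    def norm(x):
        return x / np.linalg.norm(x, axis=1, keepdims=True)
    sims = norm(swapped_embs) @ norm(gallery_embs).T
    nearest = np.asarray(gallery_ids)[sims.argmax(axis=1)]
    return float((nearest == np.asarray(source_ids)).mean())

# Toy embeddings: two source identities, with each swap landing near
# the correct one, giving a perfect retrieval score.
gallery = np.array([[1.0, 0.0], [0.0, 1.0]])
gallery_ids = ["id0", "id1"]
swapped = np.array([[0.9, 0.1], [0.2, 0.8]])
source_ids = ["id0", "id1"]

print(identity_retrieval(swapped, gallery, source_ids, gallery_ids))  # 1.0
```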
Commenting on the results of this test, the authors state:
‘The [results] show that our ReliableSwap improves the identity consistency on SimSwap and FaceShifter. Besides, ours based on FaceShifter achieves the highest ID Ret. and L Ret., and ours based on SimSwap are with best Pose and comparable Exp., which demonstrates the efficacy of the proposed method.’
Here the authors assert ‘Our ReliableSwap achieves the best identity preservation, and comparable Pose, Exp., and FID’.
Comparisons with prior frameworks. See source video (embedded at end of article) for many other examples, all at better clarity and resolution. Source: https://www.youtube.com/watch?v=uqe4pD-XpGE
A human study was also conducted, where volunteers were asked to choose the processed image most resembling the source face, and the one that retained the highest amount of identity-pertinent attributes with the target face. For this, 30 pairs of images were randomly sampled from the qualitative round.
Here, as we can see in the image above, ReliableSwap led in all areas except attribute retention, where FaceShifter had a slight lead. It is perhaps worth noting the very large lead that the new framework obtained in the other results – rather more than an incremental improvement.
The authors conclude:
‘Our ReliableSwap achieves state-of-the-art performance on the FaceForensics++ and CelebA-HQ datasets and other wild faces, which demonstrates the superiority of our method.’
The new work performs at least one very useful function: it highlights the extent to which current SOTA frameworks tend to neglect the conforming of identity to identity at training time, relying instead on a priori human expectations, and on a ‘sense of wonder’ and credulous disposition in the viewer – a reaction that is already beginning to fade in the years since the advent of deepfakes, and which can surely not be relied upon in future, as face-replacement networks become more common and more sophisticated.
At the current state of affairs, face synthesis is divided between two camps. Generative networks such as GANs and latent diffusion models can reproduce faces with striking fidelity, but struggle to maintain continuity, or integration into existing source material. Autoencoder-style ‘facial imposition’ systems, as we have seen, tend to bolt alternative identity features quite crudely onto host identities, without examining the exact relationship between two paired faces, or specifically training pathways between them on an adequate number of channels, with reliable and consistent metrics.
Whether or not ReliableSwap is likely to evolve into a fruitful adjunct system, the researchers have made a strong case in principle for the functionality that the system strives to provide – not least because it represents a core, platform-agnostic change in methodological approach.