The Challenge of Preventing ‘Identity Bleed’ in Face Swaps


A recent paper from KAIST AI claims to have improved on the current state of the art in standard autoencoder-based faceswapping – by forcing the architecture to truly separate the usually-entangled qualities of identity vs. pose (and other non-identity-based characteristics).

Furthermore, the new approach, titled SelfSwapper, uses CGI-derived parametric face models (3D Morphable Models) as a neural interface to combat one of the biggest challenges in the traditional face-swap scenario: creating effective swaps when the target head is proportionally larger or smaller than the source head:

Prior methods attempt to fit non-apposite facial shapes into the target; but the new approach can build out the face as necessary – see that the right-most version is the only one that reflects the lower-mouth>chin termination of the 'source' image. Source: https://arxiv.org/pdf/2402.07370.pdf

In tests, both qualitative and quantitative, the new system, which uses mesh-based 3D Morphable Models (3DMMs) to aid in the faceswapping process, was able to preserve identity better than a slew of analogous prior approaches, and to achieve a superior separation of the apparently inseparable qualities (identity vs. pose) that prior methods struggle to prise apart.

At the core, this separation is achieved by ‘running interference’ on the various stages during which, usually, photo-specific traits (head pose, lighting, etc.) tend to ‘slip through’ when only identity traits (face shape, eye color, etc.) are supposed to be evaluated.

The new work is titled SelfSwapper: Self-Supervised Face Swapping via Shape Agnostic Masked AutoEncoder, and comes from four researchers at KAIST AI.

Crossed Wires

The twin planks of faceswapping are identity loss and reconstruction loss – and each of these channels is deeply bound into the other.

Reconstruction loss, which may be evaluated by any number of traditional metrics, is concerned with the extent to which a faceswapping system can recreate the original source image; identity loss, instead, is concerned with high-level traits that distinguish a particular face.

Because it is challenging to define identity traits without bringing in general traits from reconstruction loss, it frequently occurs in faceswapping scenarios that the output tends to look like a composite of both the source and target identities, without really being faithful to either of them.

Above, we can see that a 1978-era Sigourney Weaver has a more pronounced jawline than Mary Elizabeth Winstead (outlined in red, top right), and, in a hobbyist's deepfake conversion of footage from 'Alien' (bottom row), that Weaver's original jawline creeps into the deepfaked version; so that the result seems to be a 'clone-spliced' version of the two actresses, rather than a direct superimposition of Winstead into the role. Source: https://www.youtube.com/watch?v=wh4jmRGJoE8

Ever since the original deepfakes code became publicly available in late 2017, the hobbyist and semi-professional faceswapping community that emerged thereafter considered itself constrained to find target faces that were similar to source faces, in order to diminish the possibility of this kind of ‘evident hybrid’ output.

If you wanted to fake Ryan Gosling, it was simply much easier to impose his face onto a similar-looking face (Caucasian, tall, fair, blue-eyed, similar hairline, etc.) than it was to hope that the autoencoder system could close up any really evident mismatches in facial type.

Multiple shared characteristics between Nick Cage and Ryan Gosling make for a plausible recasting of 'Ghost Rider' – though this nonetheless appears to be a 'Cage-ized' iteration of Gosling. Source: https://www.youtube.com/watch?v=mKh3-pesGBk

In the end, because of these inevitable shortfalls, most of the results obtained by ‘traditional’ 2017-era methods, in packages such as DeepFaceLab and DeepFaceLive, tend to be [celeb]-esque rather than a truly plausible and accurate source>target substitution.

As we have observed before, rather than choosing such ‘incompatible’ subjects, many deepfakers have over the last five or so years apparently latched on to clips featuring unusually compatible actors for a potential deepfake, so that one can effectively say that mother nature has already done 90% of the work:

In one well-executed deepfake a few years ago, original 'Pulp Fiction' actor Alexis Arquette proved an extraordinary match for US comedian Jerry Seinfeld – to the extent that neural synthesis was left with relatively little work to do. Source: https://www.youtube.com/watch?v=S1MBVXkQbWU

Therefore the central aim of the new project is to address the shared channels in which reconstruction and identity tend to commingle, to construct barriers against this ‘zone of confusion’, and to produce an architecture that can ‘build out’ or ‘cut away’ facial mass as necessary, in order to produce a more accurate recreation.

Left, source; middle, target; and right, some fairly unambiguous output from SelfSwapper.

Method

The new scheme develops a Shape Agnostic Masked Autoencoder (SAMAE), which, as mentioned, offers a variety of tricks to stop the reconstruction loss from becoming entangled and confused with the ID loss.

The authors note that distinguishing characteristics such as facial contours naturally carry facial ID information, and posit that the key to disentanglement is to make the reconstruction loss more ‘abstract’.

One of the techniques to stop this cross-quality transfer from occurring is titled Perforation Confusion, which makes random changes to the estimated facial contour mask, so that the model does not start to associate these characteristics directly with ID traits.

Perforation confusion distorts the estimated non-ID shape so that the model has a more abstract and generalized idea about face shape, for reconstruction purposes, since, at this stage, identity-specific material is not being processed.
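
Though the paper should be consulted for the exact procedure, a minimal sketch of the general idea – randomly punching holes in the estimated face-region mask so that the reconstruction branch cannot memorize the precise contour – might look like the following (the function name, hole shapes and counts are purely illustrative assumptions):

```python
import numpy as np

def perforate_mask(mask: np.ndarray, num_holes: int = 8, max_radius: int = 20,
                   rng=None) -> np.ndarray:
    """Randomly 'perforate' a binary face-region mask (H x W, values 0/1).

    Erasing random circular regions inside the mask means the model cannot
    rely on the exact facial contour during reconstruction -- an illustrative
    stand-in for the paper's 'perforation confusion' step.
    """
    if rng is None:
        rng = np.random.default_rng()
    out = mask.copy()
    h, w = mask.shape
    ys, xs = np.nonzero(mask)                  # pixels inside the face region
    if len(ys) == 0:
        return out
    for _ in range(num_holes):
        i = rng.integers(len(ys))              # pick a random in-mask pixel
        cy, cx = ys[i], xs[i]
        r = rng.integers(3, max_radius + 1)    # random hole radius
        yy, xx = np.ogrid[:h, :w]
        hole = (yy - cy) ** 2 + (xx - cx) ** 2 <= r ** 2
        out[hole] = 0                          # erase the circular region
    return out
```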

The process, as is becoming very common with neural facial recreation and synthesis, makes use of a parametric CGI mesh model, the 3DMM-based system employed by the 2009 project A 3D Face Model for Pose and Illumination Invariant Face Recognition (the paper for which is not currently available online).

Example 3DMMs from the original 1999 paper which introduced the technology, wherein the known coordinates of traditional CGI faces are used as an interface for rather more intractable neural methodologies. Source: https://openaccess.thecvf.com/content_CVPR_2019/papers/Gecer_GANFIT_Generative_Adversarial_Network_Fitting_for_High_Fidelity_3D_Face_CVPR_2019_paper.pdf

Since these are mesh-based models (imagine a head made out of crisscrossed wire, where each resulting square of intersection between the wires is a polygon), it’s also possible to ‘confuse’ the reconstruction loss process by randomly scaling the mesh during training. The authors explain why this is helpful:

‘Empirically, during the inference phase, we observed that when the facial volume of the source is significantly smaller (or larger) than that of the target, the swapped face appears awkward, either shrunken or dilated. This issue arises because the model, trained to self-reconstruct the input image with a consistently sized facial region, tends to over-rely on the pixel-aligned [information from the non-ID stream].

‘This reliance hinders generalization to cross-identity inferences. To handle this problem, we propose the random mesh scaling technique, which allows the model to generate realistic face images using randomly scaled [non-ID faces], thereby enhancing the model’s ability to generalize to facial priors of varying scales during inference.’

Random scaling of the obtained 3DMM mesh for the non-ID data prevents the system from fixating on these non-identity characteristics during training.
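
A minimal sketch of random mesh scaling, assuming the 3DMM vertices arrive as a batched tensor and that scaling is applied about each mesh's centroid (the scale range is an assumption for illustration, not a figure from the paper):

```python
import torch

def random_mesh_scale(vertices: torch.Tensor,
                      scale_range=(0.9, 1.1)) -> torch.Tensor:
    """Randomly scale a batch of 3DMM meshes about their centroids.

    vertices: (B, N, 3) tensor of mesh vertex positions.
    A per-sample scale factor is drawn uniformly from `scale_range`, so the
    model never sees a consistently sized facial region during training.
    """
    b = vertices.shape[0]
    lo, hi = scale_range
    scale = torch.empty(b, 1, 1, device=vertices.device).uniform_(lo, hi)
    centroid = vertices.mean(dim=1, keepdim=True)   # (B, 1, 3) per-mesh centre
    return (vertices - centroid) * scale + centroid
```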

A third tack for preventing ID/reconstruction association is to disentangle the albedo information from the artificial head which, by now, has been mapped to the desired source image.

For the purposes of traditional CGI, albedo information is a complex of possible surface qualities, including reflectance and skin color. In most cases of faceswapping, these are not traits that should be directly transposed, since these channels also contain shading information, and the source and target image may have different shading qualities – and they are almost certain to have differing albedo qualities in general.

The paper states:

‘Given the necessity of the albedo parameter in [rendering ID-specific faces], we have devised a workaround. We transform [albedo] into its neutralized [form], which incorporates a white-colored albedo map. This modification eliminates any albedo-related information.’
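
In code terms, the ‘white-colored albedo map’ amounts to replacing whatever albedo the 3DMM fit produced with a constant map before rendering the ID-specific face. A minimal sketch, assuming a per-vertex (or per-texel) RGB albedo representation:

```python
import torch

def neutralize_albedo(albedo: torch.Tensor) -> torch.Tensor:
    """Replace an RGB albedo map with pure white.

    albedo: (..., 3) tensor in [0, 1]. Returning an all-ones map removes skin
    color and any baked-in shading from the rendered ID-specific face, so this
    channel can no longer leak between source and target (illustrative sketch;
    the paper's exact albedo parameterization is not reproduced here).
    """
    return torch.ones_like(albedo)
```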

The training and inference pipelines for SelfSwapper, where we can see the influence of the three-pronged attack on entanglement in the 3DMM treatments devised for the method.

To offset any potential information loss from thus hobbling the albedo transfer between the two streams, the authors devised an additional encoder (denoted as Eskin), which captures skin color information separately, by masking the skin with the BiSeNet segmentation model and targeting only low-dimensional embeddings.
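
The architecture of that encoder is not reproduced here, but a toy stand-in illustrates the principle: mask out everything that is not skin, then squeeze what remains into a deliberately small embedding so that only coarse color statistics – not identity or illumination detail – can pass through (layer sizes and the embedding dimension are assumptions):

```python
import torch
import torch.nn as nn

class SkinColorEncoder(nn.Module):
    """Toy stand-in for the paper's skin encoder (Eskin): encode masked skin
    pixels into a small embedding. Architecture and sizes are illustrative."""

    def __init__(self, embed_dim: int = 8):
        super().__init__()
        self.net = nn.Sequential(
            nn.Conv2d(3, 16, 3, stride=2, padding=1), nn.ReLU(),
            nn.Conv2d(16, 32, 3, stride=2, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1), nn.Flatten(),
            nn.Linear(32, embed_dim),            # deliberately low-dimensional
        )

    def forward(self, image: torch.Tensor, skin_mask: torch.Tensor) -> torch.Tensor:
        # image: (B, 3, H, W); skin_mask: (B, 1, H, W), e.g. derived from a
        # face-parsing network such as BiSeNet. Zeroing out non-skin pixels
        # means only the skin region's color statistics reach the embedding.
        return self.net(image * skin_mask)
```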

Data and Tests

To test the system, SelfSwapper was trained on the FFHQ dataset, featuring images sampled at 256x256px resolution. The authors used one thousand randomly-sampled pairs from the CelebA-HQ dataset as source/target data.

The authors note that many competing methods have optimized their results by augmenting the FFHQ examples with bespoke collections, including conveniently identity-labeled image and/or video datasets, which helped to stabilize training in those cases. They observe that SelfSwapper achieves its own state-of-the-art results solely on the FFHQ reference dataset, indicating that the system’s capacity for generalization is far less fragile or framework-specific.

The metrics used during training for the reconstruction losses were simple L1 loss and perceptual loss. To enhance realism, an additional non-saturating adversarial loss was employed – a technique central to Generative Adversarial Networks (GANs).
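
A minimal sketch of how these terms might be combined into a single generator objective (the loss weights and the choice of perceptual backbone are assumptions, not values from the paper):

```python
import torch
import torch.nn.functional as F

def generator_loss(recon, target, d_logits_fake, perceptual_fn,
                   w_l1=1.0, w_perc=1.0, w_adv=0.1):
    """Combine L1, perceptual and non-saturating adversarial terms.

    recon, target: generated and ground-truth images.
    d_logits_fake: discriminator logits for the generated images.
    perceptual_fn: any perceptual distance (e.g. LPIPS or VGG features).
    The loss weights here are illustrative, not the paper's values.
    """
    l1 = F.l1_loss(recon, target)
    perc = perceptual_fn(recon, target)
    # Non-saturating adversarial loss: -log D(G(x)), i.e. softplus(-logits).
    adv = F.softplus(-d_logits_fake).mean()
    return w_l1 * l1 + w_perc * perc + w_adv * adv
```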

Rival frameworks/baselines adopted for both the qualitative and quantitative phases of testing were: Subject Agnostic Face Swapping and Reenactment (FSGAN); SimSwap; Information Bottleneck Disentanglement for Identity Swapping (InfoSwap); High-resolution Face Swapping via Latent Semantics Disentanglement (FSLSD); Fine-Grained Face Swapping via Regional GAN Inversion (E4S); and BlendFace.

(Other analogous frameworks, such as HifiFace and FaceShifter, were excluded from the tests due to the lack of official open source code.)

The model was trained on two NVIDIA 3090 GPUs (both the RTX 3090 and the more performant RTX 3090 Ti carry 24GB of VRAM – the exact model was not specified in the paper) at a batch size of eight, under the Adam optimizer, at a learning rate of 2×10⁻⁴ for both the discriminator and the generator. The ADM U-Net was used for the generator, and StyleGAN2 for the discriminator.
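
For reference, the reported optimizer settings translate into a few lines of setup (the networks below are trivial placeholders for the ADM U-Net generator and StyleGAN2 discriminator, and the Adam betas are left at library defaults, since neither is quoted here):

```python
import torch
import torch.nn as nn

# Trivial placeholder networks standing in for the ADM U-Net generator and
# the StyleGAN2 discriminator named in the paper (real architectures omitted).
generator = nn.Conv2d(3, 3, 3, padding=1)
discriminator = nn.Conv2d(3, 1, 3, padding=1)

# Reported configuration: Adam at a learning rate of 2e-4 for both networks,
# with a batch size of eight.
g_opt = torch.optim.Adam(generator.parameters(), lr=2e-4)
d_opt = torch.optim.Adam(discriminator.parameters(), lr=2e-4)
batch_size = 8
```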

The rival schemes were divided into two categories: the target-oriented frameworks SimSwap, InfoSwap, FSLSD and BlendFace; and those that center on the source images: FSGAN and E4S.

Target-oriented approaches use direct reconstruction loss between source and target images, which retains characteristics such as head pose and lighting. Illustrated below are qualitative tests conducted by the new paper’s researchers, exemplifying this approach in comparison to SelfSwapper:

A qualitative comparison of target-oriented methods to the SelfSwapper framework. Please refer to the source paper for better resolution.

In reference to the qualitative results shown above, the authors comment:

‘Other baselines struggle to replicate the source’s facial features such as inner facial traits (e.g., eye color, wrinkles, and cheekbone) and facial contours (e.g., jaw shape). In contrast, Ours conveys these with high-fidelity. Pay attention to the red and orange indicators for detailed comparison.’

They further observe:

‘[Straightforward] application of reconstruction loss often leads to the leakage of the target’s identity, resulting in a blend of source and target identities. These models frequently struggle to accurately represent the source’s facial contours and skin details.

‘In contrast, our model moves beyond this trade-off, adeptly reproducing the source’s facial contours, skin details, and inner facial features, while still maintaining the target’s non-facial attributes, facial posture, and expressions.’

Source-based models merge the reconstructed source image with the target image and, as mentioned, several unwanted passenger-traits can often come along for the ride:

Comparison of SelfSwapper against source-based methods.

Of the qualitative tests shown above, against source-based models, the authors comment:

‘These baselines exhibit issues with leakage of the source’s illumination. Observe red and orange indicators. Our model effectively avoids source illumination leakage, thanks to our method’s finely disentangled features…

‘… their effectiveness is limited by the reenactment models’ performance, often leading to inaccuracies in source identity preservation and difficulties in handling pose variations.

‘Furthermore, the blending process can result in unnatural skin tones and inadequate replication of the target’s lighting, sometimes introducing noticeable artifacts. For example, the shadow from the source image can be carried over into the swapped image, as depicted in the second row of [image above].’

The researchers contend that, by contrast, SelfSwapper is ‘adept’ at matching the target image’s skin color and lighting conditions, and at obtaining a superior blend and transfer between identities.

For quantitative comparison, a wider range of metrics was used, including Identity Similarity (ID. Sim, a measurement of cosine distance between embeddings for swapped and source images); Identity Consistency (ID. Cons); Expression and Head Pose Distance; and a Fréchet Inception Distance (FID) score.

Additionally, Shape Distance between the source and the faked results was evaluated via the 3DMM framework’s predictor – a feature that was not used during the training process.
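
Sketched in code, the two identity-oriented measurements reduce to simple distances between embeddings (the choice of face-recognition backbone and the reading of Shape Distance as an L2 distance over 3DMM shape coefficients are assumptions for illustration):

```python
import torch
import torch.nn.functional as F

def identity_similarity(swap_embed: torch.Tensor,
                        source_embed: torch.Tensor) -> torch.Tensor:
    """Cosine similarity between face-recognition embeddings of the swapped
    result and the source image (higher = identity better preserved).
    The embeddings would come from a pretrained recognition network,
    e.g. an ArcFace-style model -- an assumption here."""
    return F.cosine_similarity(swap_embed, source_embed, dim=-1).mean()

def shape_distance(swap_shape: torch.Tensor,
                   source_shape: torch.Tensor) -> torch.Tensor:
    """L2 distance between predicted 3DMM shape coefficients of the swapped
    and source images -- one plausible reading of the 'Shape Distance' metric."""
    return torch.linalg.vector_norm(swap_shape - source_shape, dim=-1).mean()
```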

Quantitative comparison with both source and target-oriented frameworks.

Here the researchers contend:

‘[Our] method achieves the state-of-the-art quality without sacrificing certain metrics, unlike other methods. InfoSwap ranks second-best in terms of ID. Sim. and ID. Cons. scores, [but] falls behind in other metrics. Regarding Expression distance, although BlendFace achieves the highest score, it exhibits a lower ID. Sim. score. Additionally, its low ID. Cons. score suggests a leakage of target attributes.

‘This indicates that BlendFace potentially compromises identity preservation in favor of maintaining the target’s pose and expression.’

Finally, the authors compiled the results into 2D graph form, assessing identity similarity and distance, respectively:

Graph comparisons for identity similarity and distance.

For the left-hand graph, denoting identity similarity, a position in the upper-right area is optimal, and for the right graph, denoting identity distance, a lower-right position is optimal:

‘The [graphs imply] that our method excels in reflecting the source identity and preventing target identity leakage. E4S is also positioned at the upper-right corner of the figure. However, [the right-hand graph] indicates that E4S yields the lowest FID score, suggesting it creates source-like images with reduced target identity leakage, but with poor realism.

‘In contrast, our model is situated in the bottom-right corner of the graph, demonstrating superior performance in image realism and robustness against target identity leakage, outperforming other baselines.’

Examples from SelfSwapper. In all cases, the leftmost woman is the source identity being inserted into the target data. The blue dots denote the source (there is only one here, leftmost), the pink dots 'untouched' targets, and the pink+blue dots the concatenation of identities via SelfSwapper.

Conclusion

The struggle to separate structure from identity in a face-swapping scenario is one of the most difficult challenges in the field, both to understand and to effect. In a sense, the facial structure being imposed must become the medium and not the message, and must provide as ‘blank’ as possible a physiological context for identity.

Yet both the source and target image are equally defined by contours and lineaments; to a certain extent, they are also defined by secondary considerations such as pose and illumination. The objective, therefore, is to provide an almost teleological projection of identity into a set of unbiased parameters defining ‘a face’; in many ways, this is far easier to do when creating a face from nothing, as with Latent Diffusion Models such as Stable Diffusion, than by being forced within the confines of existing target material.

It may be, in the future, that video can be transformed into abstract but complex priors so completely that face-swapping can be effected by pure recreation, where the host footage is so entirely abstract and low-level that it has no real identity information to leak.

But since we are quite some way from that capability, and since the needs of near-term VFX pipelines will include face-replacement, ageing, and other requirements of dealing with ‘hard-baked’ source material which must be transformed, the diverse training ‘distractions’ provided by SelfSwapper may indicate one possible step nearer the ultimate goal.
