Face-Swapping Directly in the Latent Space

LatentSwap

Martin Anderson


Typical face-swapping or deepfake applications require large training datasets. Very often, the methods require bespoke pretraining on vast amounts of facial data, using collections such as FFHQ or Celeb-A. The resources and time required for training are considerable, and the process is inevitably a turbulent marriage between the generalized features obtained at length from per-project training data and the latent codes derived from this repetitive training.

Here we see an example of a trained latent space, where the various image latent codes, together with their captions, have been deposited in the nearest available convenient sector of a matrix. By navigating through this space, we can exploit various features such as identity. Using methods such as GAN Inversion, we can also ‘project’ deliberate and specific images into the latent codes of a particular part of the latent space. Thus, by projecting an image of a woman into a ‘male’ section of the latent space, we can reinterpret her appearance as male while retaining many of her key characteristics. Source: http://projector.tensorflow.org/

There is increasing interest in systems that can directly intervene in the latent space of a facial (or full-body) synthesis system, without the need to ‘orient’ the system by deliberately training it on thousands (or millions) of representative examples of the facial domain (and related sub-domains, such as pose and expression).

What if one could bypass all these representative tokens and directly manipulate latent codes? Facial features, once extracted and converted into a latent representation, could be moved around like putty, providing an unusually artisanal and precise method of effecting changes such as identity swaps, subtle alterations of facial expression, and all the other sought-after manipulations which are, in the general run of the current state of the art, still being accomplished ‘the hard way’, through representative training.

For visual effects workflows, direct manipulation of latent codes would mean models that know nothing of the world, or of the wider domain they are part of, but that will actually obey commands and speed up pipeline workflows.

One such system, titled LatentSwap, has recently been proposed in a collaboration between academic institutions and private companies in Korea and the US:

Identity replacement with the LatentSwap system. Source: https://arxiv.org/pdf/2402.18351.pdf

While the new system was able to beat out many formidable rival systems (including the original deepfakes code), the main reason that the paper is interesting is that it concentrates on direct latent manipulation, instead of setting up latent ‘mappings’ which then have to take a tedious journey through pixel space, remaining dependent on multiple examples of the source and target identity.

Though LatentSwap does use pre-trained GAN inversion and StyleGAN2 models which have been trained on large datasets, it can use these models ‘off the shelf’ in order to provide transformations, instead of needing to fine-tune them elaborately.

Thus, in many ways, it is arguably nearer to warping tools (such as those found in After Effects or Photoshop) than to the current, far more elaborate practices in facial replacement networks.

Direct latent manipulation is therefore arguably beginning to emerge as a powerful new trend in AI-based human synthesis. Though identity substitution is the focus of the new work, the authors strongly indicate that their approach is adaptable to direct editing of source material, such as changing expressions and other original characteristics.

Attribute editing with LatentSwap, using latent directions from the 2020 InterFaceGAN project.

The new paper is titled LatentSwap: An Efficient Latent Code Mapping Framework for Face Swapping, and comes from six authors across Korea University, Massachusetts Institute of Technology, Supertone Inc., Nota Inc., and Optimizer AI.

Method

With an off-the-shelf Generative Adversarial Network (GAN) pre-trained on a high volume of face data, LatentSwap uses a ‘simple but effective’ module to merge the source and target latent codes.

(To be clear, latent codes are mathematical representations, in this case, of faces; they contain multiple resolution layers and operate at a 1024x1024px output resolution.)

The paper states*:

‘During training, the source/target latent codes are randomly sampled, and face swapping is performed using only these two latent codes as input. For inference, we apply a pre-trained GAN inversion model to map the real images onto the generator latent space.

‘Besides the pre-trained inversion model, the generator, and the [identity embedder] just to compute the training loss, no other pre-trained models have been used.

‘We use the [StyleGAN2] generator and work on its latent space, due to its versatility and good generation quality.’

Face-swapping results on 'wild' (arbitrary) images with the LatentSwap network.

The system uses ‘latent mixers’, each comprising five fully-connected layers with Swish activation functions. At inference, the source and target images are converted into latent codes through GAN inversion.

Each latent mixer concatenates the two codes, processes them sequentially through its fully-connected layers, and adds the result back to the target code.

In accordance with the conventions of StyleGAN2, the latent codes are 512-dimensional. Each code is duplicated 18 times, with a separate latent mixer operating on each of the 18 layers of latent codes in the W space (an intermediate latent space).
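As a rough illustration of the description above, here is a minimal PyTorch sketch of what one such latent mixer might look like. The concatenation of source and target codes, the five fully-connected layers, the Swish (SiLU) activations and the residual addition back to the target code follow the account given here; the hidden widths and other details are assumptions, not the paper's actual implementation.

```python
import torch
import torch.nn as nn

class LatentMixer(nn.Module):
    """One of the 18 per-layer mixers: fuses a 512-dim source code and a
    512-dim target code into a swapped 512-dim code. A sketch only; the
    hidden widths and activation placement are assumptions."""
    def __init__(self, dim: int = 512, n_layers: int = 5):
        super().__init__()
        layers = []
        in_dim = dim * 2                      # concatenated source + target code
        for i in range(n_layers):
            layers.append(nn.Linear(in_dim, dim))
            if i < n_layers - 1:
                layers.append(nn.SiLU())      # Swish activation
            in_dim = dim
        self.mlp = nn.Sequential(*layers)

    def forward(self, w_src: torch.Tensor, w_tgt: torch.Tensor) -> torch.Tensor:
        # (B, 512) + (B, 512) -> (B, 1024) -> (B, 512), added back to the target code
        delta = self.mlp(torch.cat([w_src, w_tgt], dim=-1))
        return w_tgt + delta
```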

Conceptual schema for the architecture of LatentSwap.

The swapped latent codes are then passed on to the generator section of the GAN. It should be noted that the W-space latent codes are generated directly by the system devised for the paper Pivotal Tuning for Latent-based Editing of Real Images, known as PTI.

Fully-processed latent codes are then harvested from all 18 latent mixers, and assembled into an 18×512 swapped face latent code.

Each of the 18 layers corresponds to a different resolution of the final obtained image, and the StyleGAN2 weights (trained on the FFHQ dataset) remain frozen throughout.
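Continuing the earlier sketch, the 18 per-layer mixers might be assembled and attached to a frozen, FFHQ-trained StyleGAN2 generator along the following lines. The generator interface shown is a stand-in, since the exact API depends on the StyleGAN2 port used; only the mixers would receive gradient updates.

```python
class LatentSwapper(nn.Module):
    """Sketch: one LatentMixer per W+ layer, output fed to a frozen generator.
    `generator` is assumed to accept an (B, 18, 512) layer-wise code directly;
    the real interface depends on the StyleGAN2 implementation used."""
    def __init__(self, generator: nn.Module, n_layers: int = 18, dim: int = 512):
        super().__init__()
        self.mixers = nn.ModuleList([LatentMixer(dim) for _ in range(n_layers)])
        self.generator = generator
        for p in self.generator.parameters():   # StyleGAN2 weights stay frozen
            p.requires_grad_(False)

    def swap(self, w_src: torch.Tensor, w_tgt: torch.Tensor) -> torch.Tensor:
        # w_src, w_tgt: (B, 18, 512) layer-wise codes
        return torch.stack(
            [m(w_src[:, i], w_tgt[:, i]) for i, m in enumerate(self.mixers)], dim=1
        )                                        # (B, 18, 512) swapped face code

    def forward(self, w_src, w_tgt):
        return self.generator(self.swap(w_src, w_tgt))  # 1024x1024 swapped image
```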

Three loss functions are used for LatentSwap model training. The first is ID Loss, which measures the cosine distance between the identity embeddings of the source image and the swapped result; this loss comes from the SmoothSwap project, which is heavily leveraged for LatentSwap. The authors note that since no code was released for SmoothSwap, they were forced to implement the identity embedder themselves, based on the published schema.
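A hedged sketch of how such an ID loss might be computed, assuming a pre-trained identity embedder that maps face images to embedding vectors (the embedder interface here is illustrative, since the authors re-implemented SmoothSwap's embedder themselves):

```python
import torch
import torch.nn.functional as F

def id_loss(embed_id, swapped_img: torch.Tensor, src_img: torch.Tensor) -> torch.Tensor:
    """Cosine-distance identity loss (illustrative).
    `embed_id` is assumed to be a pre-trained identity embedder returning (B, D) vectors."""
    e_swap = F.normalize(embed_id(swapped_img), dim=-1)
    e_src = F.normalize(embed_id(src_img), dim=-1)
    # 1 - cosine similarity: approaches 0 as the swapped identity matches the source
    return (1.0 - (e_swap * e_src).sum(dim=-1)).mean()
```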

Schematic overview of the ID loss function.

The second loss function is Latent Penalty Loss, which preserves facial attributes such as pose and expression, regardless of identity. Since this is essentially a vector comparison process, a Mean Squared Error (MSE) approach is sufficient.
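Assuming the layer-wise codes from the earlier sketches, this term reduces to a one-line comparison:

```python
import torch.nn.functional as F

def latent_penalty_loss(w_swap, w_tgt):
    """MSE between the swapped and target (18x512) codes, encouraging the swap
    to retain the target's pose, expression and other non-identity attributes."""
    return F.mse_loss(w_swap, w_tgt)
```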

The final metric is Shape Loss, which is not nearly as easy to evaluate. Therefore 3D Morphable Model (3DMM) coefficients are calculated, which involves interpreting the latent code into a realizable, conventional 3D space with known coordinates. Once interpreted, the 3D facial landmarks derived from the source and target faces are compared through a basic L1 metric.
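A sketch of how the shape term and the overall training objective might be assembled, assuming a hypothetical 3DMM fitting routine that returns 3D landmarks for an image; the choice of reference landmarks and the loss weights shown here are simplifications, not the paper's values:

```python
import torch
import torch.nn.functional as F

def shape_loss(fit_3dmm_landmarks, swapped_img, src_img) -> torch.Tensor:
    """L1 distance between 3D facial landmarks (illustrative).
    `fit_3dmm_landmarks` is a hypothetical routine that fits 3DMM coefficients to an
    image and returns its 3D landmarks; how the reference landmarks combine source
    and target attributes is simplified here."""
    lm_swap = fit_3dmm_landmarks(swapped_img)   # (B, n_landmarks, 3)
    lm_ref = fit_3dmm_landmarks(src_img)        # simplified reference
    return F.l1_loss(lm_swap, lm_ref)

def total_loss(l_id, l_latent, l_shape, w_id=1.0, w_latent=1.0, w_shape=1.0):
    # Placeholder weights; the paper's actual weighting is not reproduced here
    return w_id * l_id + w_latent * l_latent + w_shape * l_shape
```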

As mentioned, the PTI approach does the heavy lifting in terms of actually projecting source images into the latent space. The authors observe that the e4e encoder network can perform this function even more directly, but that using it breaks the architecture’s symmetry in this particular case.

The e4e network is an efficient and economical latent encoder, and one that seems likely to crop up again in future direct latent-editing projects, though its methods skew the results in LatentSwap. Source: https://arxiv.org/pdf/2102.02766.pdf

Data and Tests

The FFHQ-pretrained StyleGAN2 generator is used with the original project weights for LatentSwap. For training the LatentSwap system, each batch contained 64 source/target pairs, giving a 1/8 probability at any one time that a latent code might be paired against itself.

In accordance with the general methodology of SmoothSwap, the system was trained for 200,000 steps, using the AdamW optimizer at a learning rate of 0.0001.
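Putting the earlier sketches together, the described regime might be skeletonised as follows; the stand-in mapping network, the handling of self-swaps, and the omission of the shape term are simplifications rather than a reproduction of the actual training code:

```python
import torch
from torch.optim import AdamW

def train_latentswap(model, mapping_network, embed_id, steps=200_000, batch=64):
    """Training skeleton (illustrative): randomly sampled latent pairs, AdamW at
    lr=1e-4, frozen generator; only the latent mixers receive gradient updates."""
    opt = AdamW(model.mixers.parameters(), lr=1e-4)
    for step in range(steps):
        z_src = torch.randn(batch, 512)                 # random source latents
        z_tgt = torch.randn(batch, 512)                 # random target latents
        w_src = mapping_network(z_src).unsqueeze(1).repeat(1, 18, 1)   # to (B, 18, 512)
        w_tgt = mapping_network(z_tgt).unsqueeze(1).repeat(1, 18, 1)

        w_swap = model.swap(w_src, w_tgt)
        img_swap = model.generator(w_swap)
        img_src = model.generator(w_src)                # source face rendered for the ID term

        loss = total_loss(
            id_loss(embed_id, img_swap, img_src),
            latent_penalty_loss(w_swap, w_tgt),
            torch.zeros(()),                            # shape term omitted in this sketch
        )
        opt.zero_grad()
        loss.backward()
        opt.step()
```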

The training took place on two 80GB NVIDIA A100 GPUs, running on a DGX workstation, and took a total of 77 hours. The average inference time for the resulting system (which can handle arbitrary input) was 98 seconds.

Analysis of the coefficients revealed that swaps work best at mid-range values, tending to resemble the source at the lower end, and the target at the higher end:
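If the coefficient is taken, for illustration, as a scalar that scales the mixers' contribution relative to the unmodified target code, a sweep might look like the following (how the coefficient actually enters the LatentSwap formulation may differ):

```python
def swap_with_coefficient(model, w_src, w_tgt, lam: float):
    """Illustrative coefficient sweep: lam interpolates between the untouched
    target code (lam=0) and the full mixer output (lam=1)."""
    w_swap = model.swap(w_src, w_tgt)          # (B, 18, 512)
    return w_tgt + lam * (w_swap - w_tgt)

# e.g. inspect the identity/attribute trade-off across a range of coefficients
# for lam in (0.25, 0.5, 0.75, 1.0):
#     imgs = model.generator(swap_with_coefficient(model, w_src, w_tgt, lam))
```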

Metric performance of LatentSwap, featuring diverse codes from diverse latent spaces, under varying coefficients.

Prior frameworks tested against the LatentSwap system for the qualitative round were deepfakes (the 2017 code); FaceSwap; FaceShifter; MegaFS; HifiFace; and InfoSwap.

Qualitative results against rival frameworks. Please refer to source paper for (slightly) better resolution.

Of these results, the authors comment:

‘Our method generates realistic and high resolution face swaps. This shows that the pre-trained PTI [model] works well with our framework and has mapped the arbitrary image onto the W space.

‘In general, the face shape of the swapped image closely resembles the target and does not get blurry when the face shape differs between source and target. Hairstyle, on the other hand, tends to follow the source image.

‘Interestingly, this tendency has also been visible in Smooth-Swap, which may indicate that this phenomenon is linked to the smooth identity embedder that both of the models use.

‘These descriptions are more poignant when compared to conventional models such as [SimSwap] and [MegaFS]. SimSwap gives an acceptable face swapping result, especially regarding background and lighting. However, it rarely changes the skin tone and is constrained to a 224 × 224 image size.

‘MegaFS performs qualitatively worse. This is primarily due to the segmentation model changing only specific parts of the target image, leading to unnatural background and being vulnerable to occlusion, alongside showing inexact lighting and face shape.’

For the quantitative tests, ID retention, expression transfer, pose accuracy, and parameter count were evaluated, with LatentSwap generally ahead, while using significantly fewer parameters than prior frameworks:

Metric comparisons of LatentSwap to previous analogous systems.

The supplementary information referred to in the paper, where further details are reported to be given about the metrics used (and where further analysis is presumably recorded), is not currently available.

In StyleGAN, the Z space is the initial input latent space, sampled from a Gaussian distribution, while the W space is the intermediate latent space produced by the mapping network; W is comparatively limited when projecting source imagery for editing or transformation, while the extended W+ space, which assigns a separate code to each generator layer, is capable of the best reconstructions.
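To make the distinction concrete, the following shape-only sketch shows how the three spaces relate in StyleGAN2; the single linear layer stands in for the generator's real 8-layer mapping network:

```python
import torch
import torch.nn as nn

mapping_network = nn.Linear(512, 512)       # stand-in for StyleGAN2's 8-layer mapping MLP

z = torch.randn(1, 512)                     # Z space: Gaussian prior over R^512
w = mapping_network(z)                      # W space: a single 512-dim intermediate code
w_plus = w.unsqueeze(1).repeat(1, 18, 1)    # W+ space: one 512-dim code per generator layer -> (1, 18, 512)
# Inversion methods such as PTI and e4e typically target W or W+, since a
# per-layer code reconstructs real images more faithfully than a single w.
```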

As part of ablation studies, the researchers tested LatentSwap’s capabilities across the various sectors of the latent space:

The latent mixer performing diversely across various sectors in the latent space.

The authors report:

‘[The image above] shows that taking inputs from W+ and W space gives good and comparable results at their best-performing λ values. However, due to W space being smaller than W+ space, the images lack details, and with some target attributes such as hair and background distorted.

‘Taking inputs from Z fails to find a suitable latent code corresponding to the swapped face throughout all λ as seen in the high ID metric values for all Z space results in [the image above].

‘A possible explanation comes from that the sample space of Z distribution is R^512. Therefore, the latent mixer output, being in R^512, also belongs to the sample space of Z and the latent mixer output can never escape the sample space of Z.

‘[By] contrast, W and W+ are subspaces embedded in R^512. Therefore, it is possible for the latent mixer to output latent codes outside W or W+ space, where it is more likely for the latent code for the swapped face to exist.’

Conclusion

Pushing projected images through the latent space of a trained model is known technology, but LatentSwap’s capacity for direct manipulation of latent codes shows a promising new ambit for facial transfer and editing, where the latent codes are explicitly manipulated, without extensive source data training or fine-tuning.

It seems likely, in such cases, that many of the limitations that might ensue from this kind of approach could be bound up in the limitations of the distribution of data in an off-the-shelf system such as StyleGAN2, trained on FFHQ.

As more forward-thinking data curation methods embed deeper into the synthesis scene, it’s reasonable to expect that such improvements will be reflected in systems of this type, and that direct latent manipulation, with fewer proxies and less (or no) fine-tuning, could become the standard approach.

* My substitution of hyperlinks for the authors’ paper citations.
