Typical face-swapping and deepfake applications are data-hungry: very often, the methods require bespoke pretraining on vast amounts of facial data, using collections such as FFHQ or CelebA. The resources and time required for training are considerable, and the process is inevitably a turbulent marriage between the generalized features obtained, at length, from per-project training data and the related latent codes that emerge from this repetitive training.

*Here we see an example of a trained latent space, where the various image latent codes, together with their captions, have been deposited in the nearest convenient sector of a matrix. By navigating this space, we can exploit various features, such as identity. Using methods such as GAN Inversion, we can also ‘project’ deliberate and specific images into the latent codes of a particular part of the latent space. Thus, by projecting an image of a woman into a ‘male’ section of the latent space, we can reinterpret her appearance as male while retaining many of her key characteristics. Source: http://projector.tensorflow.org/*

There is increasing interest in systems that can *directly intervene* in the latent space of a facial (or full-body) synthesis system, without the need to ‘orient’ the system by deliberately training it on thousands (or millions) of representative examples of the facial domain (and related sub-domains, such as *expression* and *pose*).

What if one could bypass all these representative tokens and *directly manipulate latent codes*? Facial features, once extracted and converted into a latent representation, could be moved around like putty, providing an unusually artisanal and precise method of effecting changes, such as changes in identity and subtle alterations of facial expression, along with all the other sought-after manipulations which are, in the general run of the current state of the art, still being accomplished ‘the hard way’, through representative training.

For visual effects workflows, direct manipulation of latent codes would mean models that know nothing of the world, or of the wider domain they are a part of, but that will actually *obey commands*, speeding up pipeline workflows.

One such system, titled *LatentSwap*, has recently been proposed in a collaboration between academic institutions and private companies in Korea and the US:

While the new system was able to beat out many formidable rival systems (including the original deepfakes code), the main reason that the paper is interesting is that it concentrates on direct latent manipulation, instead of setting up latent ‘mappings’ which then have to take a tedious journey through pixel space, remaining dependent on multiple examples of the source and target identity.

Though LatentSwap does use pre-trained GAN inversion and StyleGAN2 models which have been trained on large datasets, it can use these models ‘off the shelf’ in order to provide transformations, instead of needing to fine-tune them elaborately.

Thus, in many ways, it is arguably nearer to warping tools (such as those found in After Effects or Photoshop) than to the current, far more elaborate practices in facial replacement networks.

Direct latent manipulation is therefore arguably beginning to emerge as a powerful new trend in AI-based human synthesis. Though identity substitution is the focus of the new work, the authors strongly indicate that their approach is adaptable to direct editing of source material, such as changing expressions and other original characteristics.

The new paper is titled *LatentSwap: An Efficient Latent Code Mapping Framework for Face Swapping*, and comes from six authors across Korea University, Massachusetts Institute of Technology, Supertone Inc., Nota Inc., and Optimizer AI.

## Method

With an off-the-shelf Generative Adversarial Network (GAN) pre-trained on a high volume of face data, LatentSwap uses a ‘simple but effective’ module to merge the source and target latent codes.

*(To be clear, latent codes are mathematical representations of, in this case, faces, containing multiple resolution layers and operating in the 1024x1024px pixel output space.)*
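As a quick sanity check on those numbers (an illustration, not code from the paper): a StyleGAN2 generator feeds one 512-dimensional style code to each of two style-modulated layers per resolution block, from 4×4 up to the output size, which is what yields 18 style inputs at 1024×1024:

```python
import math

def num_style_layers(resolution: int) -> int:
    # Two style-modulated layers per resolution block, from 4x4 upward;
    # at 1024px this gives 2 * (log2(1024) - 1) = 18 layers.
    return int(2 * (math.log2(resolution) - 1))

print(num_style_layers(1024))  # 18
print(num_style_layers(256))   # 14
```

This is why the latent codes discussed below take the shape 18×512 at the generator's full 1024px output resolution.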

The paper states*:

*‘During training, the source/target latent codes are randomly sampled, and face swapping is performed using only these two latent codes as input. For inference, we apply a pre-trained GAN inversion model to map the real images onto the generator latent space. *

*‘Besides the pre-trained inversion model, the generator, and the [**identity embedder**] just to compute the training loss, no other pre-trained models have been used.*

*‘We use the [**StyleGAN2**] generator and work on its latent space, due to its versatility and good generation quality.’*

The system uses ‘latent mixers’, each comprising five fully-connected layers. The source image is converted to latent codes through GAN inversion, and the target image’s latents are processed through SWISH activation functions.

The latent mixer then concatenates these two codes before processing them sequentially through five fully-connected layers, with the result added back to the target code.
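A minimal NumPy sketch of one such mixer follows; the layer widths, random initialization, and placement of the SWISH activation are assumptions for illustration, since the real module is a trained network:

```python
import numpy as np

rng = np.random.default_rng(0)
DIM = 512

def swish(x):
    # SWISH activation: x * sigmoid(x)
    return x / (1.0 + np.exp(-x))

# Five fully-connected layers; the first maps the concatenated
# (source, target) codes from 1024 dims back down to 512.
layers = [rng.normal(0, 0.02, (2 * DIM, DIM))] + \
         [rng.normal(0, 0.02, (DIM, DIM)) for _ in range(4)]

def latent_mixer(w_source, w_target):
    h = np.concatenate([w_source, w_target])   # (1024,)
    for W in layers:
        h = swish(h @ W)                       # fully-connected + SWISH
    return w_target + h                        # residual add back onto the target code

swapped = latent_mixer(rng.normal(size=DIM), rng.normal(size=DIM))
print(swapped.shape)  # (512,)
```

The residual connection means an untrained mixer stays close to the target code, which is a common design choice for stable training of editing modules.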

In accordance with the conventions of StyleGAN2, 512-dimensional codes are then extracted. This final latent code is then duplicated *18 times*, with each latent mixer subsequently operating on each of the 18 layers of latent codes in the *W* space (an intermediate latent space).

The swapped latent codes are then passed on to the generator section of the GAN. It should be noted that the *W*-space latent codes are generated directly by the system devised for the paper *Pivotal Tuning for Latent-based Editing of Real Images*, known as PTI.

Fully-processed latent codes are then harvested from all 18 latent mixers, and assembled into an 18×512 swapped face latent code.
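The duplicate-and-mix step can be sketched as follows, with a fixed random linear map standing in for each of the 18 trained mixers (the shapes come from the paper; everything else is illustrative):

```python
import numpy as np

rng = np.random.default_rng(0)
DIM, N_LAYERS = 512, 18

# A single 512-dim W code per image, duplicated across the 18 style layers
w_source = np.tile(rng.normal(size=DIM), (N_LAYERS, 1))   # (18, 512)
w_target = np.tile(rng.normal(size=DIM), (N_LAYERS, 1))

# One (toy) mixer per layer: a fixed random map over the concatenated
# source/target codes, standing in for the trained five-layer module
mixers = [rng.normal(0, 0.02, (2 * DIM, DIM)) for _ in range(N_LAYERS)]

swapped = np.stack([
    w_target[i] + np.concatenate([w_source[i], w_target[i]]) @ mixers[i]
    for i in range(N_LAYERS)
])
print(swapped.shape)  # (18, 512) -- the swapped face latent code
```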

Each of the 18 layers corresponds to a different resolution of the final obtained image, and the StyleGAN2 weights (trained on the FFHQ dataset) remain frozen throughout.

Three loss functions are used for LatentSwap model training. The first is *ID Loss*, which measures the cosine distance between the identity embeddings of the swapped output and the source; this loss comes from the SmoothSwap project, which is heavily leveraged for LatentSwap. The authors note that since no code was released for SmoothSwap, they were forced to implement the identity embedder themselves, based on the published schema.

The second loss function is *Latent Penalty Loss*, which preserves facial attributes such as pose and expression, regardless of identity. Since this is essentially a vector comparison process, a Mean Squared Error (MSE) approach is sufficient.

The final metric is *Shape Loss*, which is not nearly as easy to evaluate. Therefore 3D Morphable Model (3DMM) coefficients are calculated, which involves interpreting the latent code into a realizable, conventional 3D space with known coordinates. Once interpreted, the 3D facial landmarks of the source and target faces are compared through a basic L1 metric.
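The three objectives can be sketched as follows; the embedding and landmark dimensions are illustrative, and the relative weighting of the terms is not reproduced here:

```python
import numpy as np

def id_loss(emb_a, emb_b):
    # Cosine distance between two identity embeddings
    cos = emb_a @ emb_b / (np.linalg.norm(emb_a) * np.linalg.norm(emb_b))
    return 1.0 - cos

def latent_penalty_loss(w_swapped, w_target):
    # MSE between swapped and target latent codes, preserving pose/expression
    return np.mean((w_swapped - w_target) ** 2)

def shape_loss(lm_swapped, lm_target):
    # L1 distance between estimated 3D facial landmarks (from 3DMM coefficients);
    # 68 landmarks is an assumed count for illustration
    return np.mean(np.abs(lm_swapped - lm_target))

# Identical inputs incur (near-)zero loss in each case
e = np.ones(512); w = np.zeros((18, 512)); lm = np.zeros((68, 3))
print(id_loss(e, e), latent_penalty_loss(w, w), shape_loss(lm, lm))
```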

As mentioned, the PTI approach does the heavy lifting in terms of actually projecting source images into the latent space. The authors observe that the e4e encoder network can perform this function even more directly, but that using it breaks the architecture’s symmetry, in this particular case.

## Data and Tests

For LatentSwap, the FFHQ-pretrained StyleGAN2 generator is used with the original project weights. For training, each batch contained 64 source/target pairs, giving a 1/8 possibility at any one time that a latent code might be training against itself.

In accordance with the general methodology of SmoothSwap, the system was trained for 200,000 steps, using the AdamW optimizer at a learning rate of 0.0001 (a deliberately fine-grained rate).

The training took place on two 80GB NVIDIA A100 GPUs, running on a DGX workstation, and took a total of 77 hours. The average inference time for the resulting system (which can handle arbitrary input) was 98 seconds.

Analysis of the coefficients revealed that swaps work best at a medium dimensionality, tending to resemble the source at the lower regions, and the target at the higher regions:

Of these results, the authors comment:

*‘Our method generates realistic and high resolution face swaps. This shows that the pre-trained PTI [model] works well with our framework and has mapped the arbitrary image onto the W space.*

*‘In general, the face shape of the swapped image closely resembles the target and does not get blurry when the face shape differs between source and target. Hairstyle, on the other hand, tends to follow the source image. *

*‘Interestingly, this tendency has also been visible in Smooth-Swap, which may indicate that this phenomenon is linked to the smooth identity embedder that both of the models use.*

*‘These descriptions are more poignant when compared to conventional models such as [SimSwap] and [MegaFS]. SimSwap gives an acceptable face swapping result, especially regarding background and lighting. However, it rarely changes the skin tone and is constrained to a 224 × 224 image size. *

*‘MegaFS performs qualitatively worse. This is primarily due to the segmentation model changing only specific parts of the target image, leading to unnatural background and being vulnerable to occlusion, alongside showing inexact lighting and face shape.’*

For a quantitative test, the ID retention, expression transference, pose and number of parameters were evaluated, with LatentSwap generally ahead, while using significantly fewer parameters than former frameworks:

The supplementary information referred to in the paper, where further details are reported to be given about the metrics used (and where further analysis is presumably recorded), is not currently available.

The *Z* space in the latent space of StyleGAN is the dominant sector, while the *W* space is the intermediate latent space, with the latter poorly suited to projection of source imagery for editing or transformations; the remaining *W+* space is capable of the best reconstructions.

As part of ablation studies, the researchers tested LatentSwap’s capabilities across the various sectors of the latent space:

The authors report:

*‘[The image above] shows that taking inputs from W+ and W space gives good and comparable results at their best-performing λ values. However, due to W space being smaller than W+ space, the images lack details, and with some target attributes such as hair and background distorted. *

*‘Taking inputs from Z fails to find a suitable latent code corresponding to the swapped face throughout all λ as seen in the high ID metric values for all Z space results in [the image above].*

*‘A possible explanation comes from that the sample space of Z distribution is R ^{512}. Therefore, the latent mixer output, being in R^{512}, also belongs to the sample space of Z and the latent mixer output can never escape the sample space of Z.*

*‘[By] contrast, W and W+ are subspaces embedded in R ^{512}. Therefore, it is possible for the latent mixer to output latent codes outside W or W+ space, where it is more likely for the latent code for the swapped face to exist.’*

## Conclusion

Pushing projected images through the latent space of a trained model is known technology, but LatentSwap’s capacity for direct manipulation of latent codes marks a promising new avenue for facial transfer and editing, where the latent codes are explicitly manipulated, without extensive source-data training or fine-tuning.

It seems likely, in such cases, that many of the limitations that might ensue from this kind of approach could be bound up in the limitations of the distribution of data in an off-the-shelf system such as StyleGAN2, trained on FFHQ.

As more forward-thinking data curation methods embed deeper into the synthesis scene, it’s reasonable to expect that such improvements will be reflected in systems of this type, and that direct latent manipulation, with fewer proxies and less (or no) fine-tuning, could become the standard approach.

** My substitution of hyperlinks for the authors’ paper citations.*