In neural facial synthesis, data augmentation is becoming increasingly important. In order to develop either a traditional autoencoder deepfake model or a more modern Stable Diffusion facial model, VFX practitioners often need to impose the desired characteristics (increased age, for instance) directly onto the training images, so that the learned model will generalize and reproduce these changes in the final footage.
This is very different from the automated data augmentation that occurs when training computer vision models, which applies random and algorithmic rotations, inversions, and perhaps color adjustments in order to provide the training process with more diverse takes on what may be limited data.
Such simple augmentations are intended only to help the model become as flexible as possible, by forcing it to consider repeated data instances from novel standpoints – the core information about the person’s face does not change in itself.
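By way of contrast, this conventional, label-preserving kind of augmentation can be sketched in a few lines. The example below is a minimal toy version using NumPy, with a random array standing in for a real training image; production pipelines would of course operate on actual frames, via a library such as torchvision:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(image: np.ndarray) -> np.ndarray:
    """Simple label-preserving augmentations: mirror, 90-degree rotation,
    brightness jitter. The identity depicted in the image is unchanged."""
    out = image.copy()
    if rng.random() < 0.5:
        out = np.fliplr(out)                        # horizontal mirror
    out = np.rot90(out, k=int(rng.integers(0, 4)))  # random 90-degree rotation
    out = np.clip(out * rng.uniform(0.8, 1.2), 0, 255)  # brightness jitter
    return out

# A dummy 64x64 RGB 'image'; a real pipeline would load actual footage.
face = rng.uniform(0, 255, size=(64, 64, 3))
batch = [augment(face) for _ in range(8)]
```

Each augmented copy is a new 'viewpoint' on the same data; none of them changes what the model will learn about the face itself.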
Rather, augmentation of the kind we are considering here may involve altering hundreds, or even thousands of varied images of an actor in a current project, so that they appear older, or (for instance) so that the photos demonstrate better resolution than the archival source material offers.
The more the source data can be transformed, the greater opportunity there is for the desired changes to become well-generalized into the final model, and to produce consistent results. In this way, all the ‘hard work’ is accomplished once and for all in advance, rather than repetitively, at the point where the final trained model is put to use.
It is not realistic to set experienced VFX practitioners to the task of hand-crafting such augmentations at any scale; apart from the formidable logistical considerations, machine learning can do a better job than even an experienced retoucher – so long as you can provide it with at least a handful of examples of the kind of transformation that you’re after.
Consider, then, the hypothetical case in which you wish to develop a model that can produce images of an actor, and which can reproduce the actor at an older age than they really are. There is, of course, no real-world source data available for this; and using traditional make-up techniques for the data capture is pointless (if this older approach were convincing enough, you could simply use it live, on set).
At this stage, it is not important what technologies you use to achieve these ‘exemplary’ transformations (which in the case of a Stable Diffusion LoRA or DreamBooth adjunct, may require as few as 20-50 convincing ‘altered’ images). If you can obtain enough varied image examples that capture the desired characteristic (such as increased age), you can then use these images to train a model that can spit out a far higher number of varied training images, which can in turn be trained into a later production model.
Therefore, though they are not as much in fashion in the image synthesis research scene as they were a couple of years ago, Generative Adversarial Networks (GANs) are unusually well-suited to help produce the moderate amounts of augmented data required for the task at hand – a task where CGI frequently fails, and where the inconsistency of latent diffusion models (LDMs) such as Stable Diffusion makes them a non-ideal choice.
The central problems at hand are the related issues of bias and entanglement.
To illustrate the issue, consider that a trained network, such as a GAN or latent diffusion model (typically trained on real-world data – on real photos of real people), is asked to produce ‘a photo of a woman with a full mustache’. The system would not normally be expected to yield results that most people would consider authentic, because (one would assume) there are very few mustached women featured in a typical dataset, and therefore the trained system would not have many examples to draw from.
In fact, that’s not strictly true – in support of the Movember initiative to highlight men’s health, many datasets feature numerous social media images of women wearing the characteristic and stylized mustaches associated with the movement:
As we can see, the above Stable Diffusion results for ‘A woman with a full mustache’ (to distinguish the target concept from more typical and more minor female facial hair issues) are dominated by the official styles ‘the connoisseur’ and ‘the rock star’, since women apparently favor these stylings when making this charitable gesture.
Now let’s see if Stable Diffusion can do as well with the prompt ‘A woman with a beard’:
Here entanglement is far more evident, with barely-feminized (or explicitly male) faces featuring; and we can see that secondary male characteristics such as male pectorals (top right) have even been drawn into the mix†. Clearly, there is no beard-related equivalent of the Movember data to cover the concept requested in this prompt.
Likewise, other semantic associations can be difficult to prise apart, such as the association between eye-glasses and old age.
Couplings of this type, essentially a form of bias, abound in hyperscale image datasets, and are known as spurious correlations (i.e., the relationship is common, but still arbitrary, and cultural or statistical rather than essential in nature).
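Spurious correlations of this kind can be measured directly from attribute labels. The toy sketch below uses entirely synthetic labels (not drawn from any real dataset), with a bias deliberately planted between ‘old’ and ‘glasses’, and quantifies it with the phi coefficient:

```python
import numpy as np

rng = np.random.default_rng(1)
n = 10_000

# Synthetic binary labels with a built-in cultural bias:
# 'glasses' is made far more likely when 'old' is true.
old = rng.random(n) < 0.3
glasses = np.where(old, rng.random(n) < 0.6, rng.random(n) < 0.1)

def phi(a: np.ndarray, b: np.ndarray) -> float:
    """Phi coefficient: Pearson correlation applied to binary variables."""
    return float(np.corrcoef(a.astype(float), b.astype(float))[0, 1])

correlation = phi(old, glasses)  # clearly positive, though the link is arbitrary
```

A generative model trained on such data has every statistical incentive to bring the ‘unwanted friend’ along – the correlation is real, even though the relationship is cultural rather than essential.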
Though GANs and LDMs have very different architectures, they face exactly the same problem, for exactly the same reason – the characteristic that you want to generate always seems to bring an ‘unwanted friend’ along with it, because the source data tends to provide these associations, and such couplings therefore tend to become unreasonably statistically significant.
However, qualities such as ‘age’ are not discrete concepts like ‘mustache’ or ‘glasses’, but truly pervade the depiction of a person; therefore trying to distill their essence from their host identity (while retaining that identity) is a little like trying to remove a drop of ink from a well-stirred glass of water – in any neural synthesis architecture.
A Better GAN Approach
Towards further progress on this challenge, a new collaboration between the US and Canada, titled SC2GAN: Rethinking Entanglement by Self-correcting Correlated GAN Space, is offering an improved method of disentanglement, by intelligently and pointedly retraining the latent space of a Generative Adversarial Network in a novel way that tends to better isolate distinct facets.
In the example above, from the new work (which comes from researchers at ModiFace Inc., the University of Illinois and the University of Toronto), we see in the top row how a typical case of entanglement can also change other key characteristics, such as (in this case) gender, when trying to alter characteristics like age.
By correcting latent codes in the original trained latent space, the new method is able to carve a new and ‘cleaner’ path through the re-trained characteristics. In the middle row, powered by little more than 100 such corrections, we see an improved aging of the young woman, though spectacles are arbitrarily added. In the bottom row, with over 1000 corrections applied, the aging process appears far more plausible and does not produce spectacles.
In the image below, we see a practical visualization of this ‘rerouted’ path through the re-trained latent space, which has broadened the claustrophobic confines of the original latent space layout, and can now avoid embeddings and features associated with spectacles:
The architecture of the latent space of a GAN (and many other frameworks) is determined by the diversity of examples and classes that are trained into it – which is to say, that there are no pre-made ‘compartments’ for faces, lips, mustaches, or other semantic concepts; rather, the space grows and self-designs according to the data.
What you’re left with after training is a navigable matrix of related concepts, where the names of the concepts (‘man’, ‘woman’, ‘beard’, ‘old’, etc.) will have been drawn from labels in the data.
Pushing (‘projecting’) the output head (the mechanism that actually shows you a result) through various paths in the latent space will cause the output to travel through, and be affected by, whatever concepts lie along the chosen route. In this way, one can effect transformations regarding age, gender, and any other characteristic trained into the latent space.
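In its simplest form – the approach popularized by InterFaceGAN – such an edit is just a linear move along a learned direction in the latent space. The sketch below illustrates the mechanics only; the random vector is a stand-in for a genuinely learned ‘age’ direction, and no real generator is involved:

```python
import numpy as np

rng = np.random.default_rng(2)
DIM = 512  # StyleGAN's latent codes are 512-dimensional

w = rng.normal(size=DIM)            # latent code for one generated face
age_dir = rng.normal(size=DIM)      # stand-in for a learned 'age' direction
age_dir /= np.linalg.norm(age_dir)  # unit-normalize the direction

def edit(code: np.ndarray, direction: np.ndarray, alpha: float) -> np.ndarray:
    """Move a latent code along an attribute direction by strength alpha."""
    return code + alpha * direction

# Progressively 'older' versions of the same latent code
steps = [edit(w, age_dir, a) for a in (0.0, 1.5, 3.0)]
```

Entanglement, in these terms, is the observation that a direction found for one attribute is rarely orthogonal to the directions of correlated attributes.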
In the three-dimensional coordinate system used to visualize a StyleGAN latent space, many research projects have concentrated on effecting transformations in the W direction. The W space of a GAN is the intermediate latent space that follows the fully-connected mapping network applied to the initial Z space. It has been dubbed by some a ‘trivial’ space, and is arguably analogous to the ‘suburbs’ of a complex metropolis.
Nonetheless, W space has been of great interest to researchers into GAN-editing systems, due to its ductility in regard to facilitating style mixing and image inversion (i.e., projecting a new image into a fully-trained GAN in order to edit its qualities using the system’s trained latent codes).
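GAN inversion itself can be framed as an optimization problem: find the latent code whose generation best reconstructs the target image. The toy example below substitutes a fixed linear map for a real (deeply nonlinear) generator, so that plain gradient descent suffices; it is a sketch of the principle, not of any actual inversion framework:

```python
import numpy as np

rng = np.random.default_rng(3)

# Toy stand-in for a generator: a fixed linear map from a 64-dim latent
# to a 256-dim 'image'. A real GAN generator is deeply nonlinear.
G = rng.normal(size=(256, 64))

def generate(code: np.ndarray) -> np.ndarray:
    return G @ code

target = generate(rng.normal(size=64))  # the 'photo' we want to invert

w = np.zeros(64)                        # start from the mean latent
lr = 0.001
for _ in range(500):
    residual = generate(w) - target
    w -= lr * (G.T @ residual)          # gradient of 0.5 * ||G w - target||^2

loss = float(np.mean((generate(w) - target) ** 2))
```

Once the code `w` reconstructs the target well, it can be pushed along editing directions exactly as with natively generated samples.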
The key application of SC2GAN is to find marginal or under-represented images or couplings (see above), generate apposite images relating to the target concepts, and re-inject this novel content back into the low-density regions of the W space until this zone of the latent space is better balanced with the rest, and less of a ‘ghetto’ than it was when the model was first trained.
This process of editing involves re-interpolating data from W in the W+ space, a denser and more central area in the trained model, which is less entangled but also less flexible.
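Heavily caricatured – this is an illustration of the general idea, not the paper’s actual pipeline – the self-correction strategy amounts to: locate the under-represented attribute group, synthesize edited samples that fill that sparse region, and merge them back into the training distribution. A toy sketch, with a crude difference-of-means standing in for a proper editing direction:

```python
import numpy as np

rng = np.random.default_rng(4)
DIM = 16

# Latent codes with pseudo-labels; 'beard' = True is the minority group.
codes = rng.normal(size=(1000, DIM))
beard = rng.random(1000) < 0.05        # ~5% of samples: under-represented

# Edit some majority samples toward the minority region, using the crude
# difference-of-means as a stand-in for a learned editing direction.
direction = codes[beard].mean(axis=0) - codes[~beard].mean(axis=0)
corrected = codes[~beard][:200] + direction

# The merged set fills the formerly sparse zone of the latent space.
balanced = np.vstack([codes, corrected])
```

In the real method, the ‘corrected’ samples are full generated images, projected back into W via GAN inversion before the editing directions are re-learned.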
The authors state † †:
‘Following such direction, we obtain edited images with localized changes corresponding to minority attribute groups, which will then be projected to the W space via GAN inversion. To enable disentangled editing, we re-train [InterFaceGAN and GradCtrl] with this self-corrected latent distribution to learn the editing directions.’
Besides the quantitative and qualitative results obtained in tests (see ‘Data and tests’ below), the disentanglement effected by this process of retraining can be illustrated statistically. In the image below, in (a), we see the extent to which concepts are entangled in the source data of the FFHQ dataset, and the statistical reasons why (for instance) ‘smile’ is more associated with ‘woman’, and ‘beard’ with ‘man’, etc. In the right-hand (b) graph, we see that training FFHQ into StyleGAN2 largely (and quite logically) replicates this bias.
Performing Principal Component Analysis (PCA, a method for identifying the principal dimensions of variation in trained data) on the original FFHQ-trained StyleGAN2 latent space shows that numerous concepts are stubbornly baked in together in the matrix, as represented by the limited number of colors and corresponding categories in the left-hand graph of the image below. In the SC2GAN-corrected space (shown on the right), the latent space now features a higher number of discrete labels – labels which were previously bound up with other concepts, and which will no longer come with unwelcome biases and associations.
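The intuition behind this PCA-based diagnosis can be reproduced on synthetic data: when two attributes share one underlying factor, the variance collapses onto fewer principal components. A toy sketch, with invented ‘latent codes’ rather than real StyleGAN2 samples:

```python
import numpy as np

rng = np.random.default_rng(5)
n, dim = 2000, 8

# Synthetic 'latent codes' in which two coordinates share one underlying
# factor (entanglement); the remaining coordinates carry little variance.
shared = rng.normal(size=(n, 1))
codes = np.hstack([shared, 0.9 * shared, 0.1 * rng.normal(size=(n, dim - 2))])

X = codes - codes.mean(axis=0)             # center before PCA
_, s, _ = np.linalg.svd(X, full_matrices=False)
explained = s**2 / np.sum(s**2)
# One component dominates: the signature of attributes baked in together.
```

A better-disentangled space would spread the explained variance across more components, each carrying a more discrete concept.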
It would be great if the denser W+ space (rather than the more marginal W space) were more amenable to this kind of approach. However, as the authors of the new paper note, prior attempts to perform the same transformations in the W+ direction have led to approaches that cannot adequately change the image according to the desired effect, because the W+ space is simply too intractable:
Data and Tests
The researchers tested SC2GAN in a range of qualitative and quantitative experiments, using StyleGAN2 trained on FFHQ, and using editing directions drawn from the prior works InterFaceGAN and GradCtrl (links in the author quote above).
The authors sampled 200,000 images from the source data and created pseudo-labels for the attributes gender, smile, eyeglasses, age, lipstick and beard, using the pretrained attribute classifiers in StyleGAN.
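The pseudo-labeling step can be sketched as thresholding classifier logits into binary attribute labels. In the toy version below, a random function stands in for the actual pretrained attribute classifiers, which operate on real pixels:

```python
import numpy as np

rng = np.random.default_rng(6)

ATTRS = ["gender", "smile", "eyeglasses", "age", "lipstick", "beard"]

def classify(n_images: int) -> np.ndarray:
    """Stand-in for the pretrained attribute classifiers, returning one
    logit per attribute per image; real classifiers see actual pixels."""
    return rng.normal(size=(n_images, len(ATTRS)))

logits = classify(200_000)                 # one row per sampled generation
pseudo_labels = (logits > 0).astype(int)   # sign of logit -> binary label
```

These cheap, automatically generated labels are what make it possible to identify minority attribute groups at the scale of hundreds of thousands of samples.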
They sampled latent codes in the W space, in accordance with methods from the two aforementioned prior works, and then used methods developed by GANSpace and GradCtrl to obtain self-corrected samples of the images.
The model was then re-trained from scratch with this new merged dataset, which now contained both the original W codes and their corresponding self-corrected images.
In an initial qualitative round, the authors tested for attribute manipulation in regard to gender, eyeglasses, age, lipstick and beard, offering results that show both the native generations from InterFaceGAN and GradCtrl, and generations when these methods were retrained under the new approach. They additionally tested against GANSpace, and against the StyleSpace framework (which uses semantic segmentation and channel-targeting – see image earlier in this article).
Regarding these results, the authors comment:
‘For both global attributes (age and gender) and local attributes (lipstick, eyeglasses and beard), our framework boosts disentanglement for both [InterFaceGAN] and [Grad-Control]. For instance, we achieve disentangled aging effects without eyeglasses added and decouple female direction from smile.
‘GANSpace and StyleSpace suffer little from the entanglement issue, but the amount of change they make for global attributes is extremely limited, e.g., StyleSpace fails to synthesize more female effects, and GANSpace lacks the ability to generate aging effects.
‘In the meantime, for local attributes, with our framework applied, InterFaceGAN and Grad-Control achieve performance similar to GANSpace and StyleSpace, which operate in spaces of much higher dimensions.’
For a quantitative round, the researchers made use of the Attribute Dependency (AD) metric proposed by StyleSpace. To evaluate the level of entanglement for an edited attribute, they sampled latent codes corresponding to images at the decision boundaries of the trained network, where these related to a corresponding attribute classifier in StyleGAN2.
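In outline, Attribute Dependency compares how much the other attributes’ logits move, relative to the intended attribute’s change, when an edit is applied. The sketch below uses synthetic logits throughout; the 0.4 ‘leak’ into eyeglasses is an invented illustration, not a measured result from the paper:

```python
import numpy as np

ATTRS = ["gender", "smile", "eyeglasses", "age", "lipstick", "beard"]
TARGET = ATTRS.index("age")

rng = np.random.default_rng(7)

# Classifier logits before and after an 'age' edit on 500 sampled codes.
before = rng.normal(size=(500, len(ATTRS)))
after = before.copy()
after[:, TARGET] += 2.0                       # the intended change
after[:, ATTRS.index("eyeglasses")] += 0.4    # an invented entangled 'leak'

delta = np.abs(after - before)
others = [i for i in range(len(ATTRS)) if i != TARGET]

# Mean change in all other attributes, normalized by the target's change;
# lower values indicate better disentanglement.
ad = float(delta[:, others].mean() / delta[:, TARGET].mean())
```

A perfectly disentangled edit would leave every non-target logit untouched, driving this ratio toward zero.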
When the absolute change in logits was evaluated across the frameworks, the study found that disentanglement in the W direction improved significantly across all the target attributes:
Finally, a brief experiment in real image manipulation was conducted, demonstrating entanglement levels when attempting to influence the attributes ‘age’ and ‘lipstick’:
Regarding this, the authors comment ‘Our proposed approach achieves disentanglement while preserving the identity better’.
It is difficult to overstate the extent to which entanglement is a roadblock to a potential revolution in AI-based VFX procedures, or the extent to which the problem originates as much in cultural practice as at the architectural level.
Nonetheless, picking apart the qualities in entangled assets is a complex and burdensome line of research. If attribute classifiers could produce more granular distinctions between qualities in training images, this kind of retroactive, remedial approach would not be necessary.
As it stands, the architecture must contend not only with the distributional biases that proceed from the cultural habits that define popular dataset-gathering methodologies, but also with the immense parallel problem of obtaining better and more granular labels.
Effectively, it seems that the problem could be solved at either end of the process: either you could feed a system better-labeled data, and it would organically produce better-delineated and more disentangled characteristics; or you could develop a system that can disentangle the characteristics at the data preparation stage – or even as an automated data augmentation routine, at training time itself.
However, either proposition requires that someone (or more likely, thousands of people) sits down and labels all these source images in a more informative way. All roads lead to this expensive and stubborn prospect.
Even then, this may only solve classification issues for objects that can easily be discretized, such as hair, mustaches, glasses, etc. To meaningfully and helpfully quantify more abstract properties such as age and race may require a notable breakthrough in visual/semantic theory.
† Despite the iconic nature of arguably the world’s most famous mustache, the prompt ‘A woman with a Hitler mustache’ still produces only Movember-style mustaches, because the model has nothing but Movember social media data to draw on, and cannot discretize the two concepts (i.e., ‘woman’ + ‘mustache’ seems always to equate to ‘woman with Movember mustache’). Several other facets relating to the infamous dictator do appear in some of the ‘mustached women’ images generated as a test for this article; but there is no sign of the toothbrush mustache that characterized Hitler’s appearance. In fact, the curlicues of the Movember mustache are so dominant in the V1.5 Stable Diffusion model for the semantic term ‘mustache’, that they are usually included even on generations of Hitler himself, who is apparently denied the opportunity to wear his own characteristically brief whiskers.
† † My substitution of explicitly named and hyperlinked references to other works, in place of the authors’ numeric citations.