Anyone who has ever watched a movie that follows a character throughout the course of their life will have had to ‘get out and push’ a little when the filmmakers try to represent a person as being much younger or much older than the actor playing them. The sight of a mid-to-late-thirties man pretending to be 18 years old, or else caked in what appears to be pancake batter and trying to impersonate someone in their eighties, is where our imaginations tend to have to take over.
Some of the major changes associated with aging are deep enough that it is effectively impossible for make-up to simulate them. For instance, reflecting the increased concavity of older faces would involve ‘cutting into’ the actor’s face, where more youthful flesh currently occupies space – not a realistic on-set prospect. Likewise, the essential shape of an 18-year-old head and face is notably different even from that of the relatively young adult they will soon become, with smaller ears and nose, and different jawline definition, among many other essential traits.
Therefore the VFX industry currently has notable interest in adding or taking away years from actors, as we are currently doing with Tom Hanks, Robin Wright and Paul Bettany, for Robert Zemeckis’ Here, and as esteemed VFX house Industrial Light and Magic recently undertook to produce a younger Harrison Ford for the opening segment of the sequel Indiana Jones and the Dial of Destiny.
In truth, there is a lot of ‘by eye’ guessing when attempting to create these kinds of age-defying transformations, whether the direction is younger or older – and VFX workflows could benefit notably from neural procedures that could take some of the guesswork and artisanal sweat out of the process.
A new collaboration between China and France is proposing this kind of automation for neural aging, using Stable Diffusion and a host of other modules and libraries to power Face Aging via Diffusion-based editing, or FADING.
In tests, the new work, according to the authors, improves on the state-of-the-art in a number of ways, not least by not making blind and generalized assumptions about the way people age, such as the tendency of a prior framework to add glasses to any face which has been ‘marked’ for aging:
The system works by re-tooling a standard Stable Diffusion interface to allow for novel functionality, including face recognition and age recognition, and by the use of multiple simultaneous text prompts that help to ensure core characteristics such as gender are preserved, while allowing identity-locked aging/de-aging to take place.
The authors state:
‘FADING is the first method to extend large-scale diffusion models for face aging; [we] successfully leverage the attention mechanism for accurate age manipulation and disentanglement; [and] qualitatively and quantitatively demonstrate the superiority of FADING over state-of-the-art methods through extensive experiments.’
The new paper is titled Face Aging via Diffusion-based Editing, and comes from two researchers, respectively from Shanghai Jiao Tong University and the LTCI at Télécom Paris, Institut Polytechnique de Paris.
FADING is divided into two stages: specialization, where the Stable Diffusion model is re-tooled and retrained for this specific task; and age editing, where the standard latent diffusion model’s diffusion process is inverted to accommodate the variables related to the goal.
In the specialization module, a source image is inverted into (i.e., projected into) the latent space of the Stable Diffusion autoencoder, and random Gaussian noise is added to the latent embedding generated at inversion time, which outputs a series of noisy samples that contain essential features from the source image.
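The forward-noising step described above can be sketched in a few lines. This is a generic DDPM-style noising schedule with illustrative hyperparameters, not the paper's exact configuration:

```python
import numpy as np

def add_noise(latent, t, num_steps=1000, beta_start=1e-4, beta_end=0.02, seed=0):
    """Forward-diffuse a latent: z_t = sqrt(a_bar_t)*z_0 + sqrt(1-a_bar_t)*eps."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    alpha_bar = np.cumprod(1.0 - betas)          # cumulative signal-retention factor
    rng = np.random.default_rng(seed)
    eps = rng.standard_normal(latent.shape)      # the Gaussian noise being mixed in
    return np.sqrt(alpha_bar[t]) * latent + np.sqrt(1.0 - alpha_bar[t]) * eps

# A toy 4x8x8 "latent" standing in for the VAE encoding of a face image.
z0 = np.ones((4, 8, 8))
z_early = add_noise(z0, t=10)    # early timestep: mostly signal
z_late = add_noise(z0, t=999)    # late timestep: almost pure noise
```

Running the same latent through increasing timesteps yields the ‘series of noisy samples’ mentioned above, each retaining progressively less of the source image.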
To facilitate image generation conditioned on crafted text prompts, a generated image is obtained by passing an estimated latent tensor to the system’s decoder (a tensor essentially being a kind of neural spreadsheet of known characteristics of the embedding).
Keys and values are then extracted from the embedding using cross-attention layers. In cases of unconditional generation (where instrumentalities such as Classifier-Free Guidance [CFG] or even simple text prompts are not present), the token embeddings that have been extracted are replaced with null-text embeddings, or ‘placeholders’.
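A minimal, single-head sketch of this cross-attention mechanism, with a zero vector standing in for the null-text placeholder (the real system uses learned embeddings inside Stable Diffusion's U-Net; all dimensions here are toy values):

```python
import numpy as np

def cross_attention(image_feats, text_emb, Wq, Wk, Wv):
    """Single-head cross-attention: queries come from image features,
    keys/values from the (text) conditioning embedding."""
    Q = image_feats @ Wq                 # (pixels, d)
    K = text_emb @ Wk                    # (tokens, d)
    V = text_emb @ Wv
    scores = Q @ K.T / np.sqrt(Q.shape[-1])
    attn = np.exp(scores - scores.max(axis=-1, keepdims=True))
    attn /= attn.sum(axis=-1, keepdims=True)   # attention maps: pixel -> token
    return attn @ V, attn

rng = np.random.default_rng(0)
d_img, d_txt, d = 16, 32, 8
Wq = rng.standard_normal((d_img, d))
Wk = rng.standard_normal((d_txt, d))
Wv = rng.standard_normal((d_txt, d))
pixels = rng.standard_normal((64, d_img))
prompt_emb = rng.standard_normal((77, d_txt))   # CLIP-style 77-token embedding
null_emb = np.zeros((77, d_txt))                # 'null-text' placeholder, unconditional pass

cond, attn_maps = cross_attention(pixels, prompt_emb, Wq, Wk, Wv)
uncond, _ = cross_attention(pixels, null_emb, Wq, Wk, Wv)
```

The per-pixel, per-token attention maps returned here are the objects that later get targeted and reconditioned during age editing.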
The authors note that age-editing with a pretrained latent diffusion model (LDM) can be obtained without a training stage, and that this has been done before, with SDEdit and Null-text Inversion; however, they observe that this approach is generic rather than specialized for neural facial synthesis, and that, consequently, generic prompts such as ‘a man in his thirties’ can capture age detail but tend to jettison identity detail.
Therefore the system is custom-trained on prompt pairs, one of which is generic, in the style mentioned above, and the other being ‘photo of a [x] year old person’ (where ‘x’ is the target age). The authors state:
‘We refer to this fine-tuning scheme as the double-prompt scheme. One assumption to justify this observation is that it can allow better disentangling of age information from other age-irrelevant features (i.e. identity and context features).’
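Such a prompt pair might be built along the following lines. The specific prompt matches the template quoted above; the generic wording is an illustrative guess, since the paper's exact phrasing for the generic side is not given here:

```python
def double_prompts(age: int) -> tuple[str, str]:
    """Build the generic/specific prompt pair used in the double-prompt
    fine-tuning scheme. The generic wording is an assumption for illustration."""
    decade = (age // 10) * 10
    if age >= 20:
        generic = f"photo of a person in their {decade}s"
    else:
        generic = "photo of a young person"
    specific = f"photo of a {age} year old person"
    return generic, specific

g, s = double_prompts(35)
# s -> "photo of a 35 year old person"
```

Training against both prompts encourages the model to associate the numeric age token with age-specific features only, leaving identity and context to the rest of the prompt.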
In order not to dilute the latent priors that have already been obtained at this stage, the double-prompt training is limited to a fairly frugal 150 steps.
For the age-editing stage, the system can now produce conditional (i.e., responding to image or text input) and unconditional output, with a specified target age.
To enable actual image editing, the diffusion process has to be reversed, using the approach employed in the aforementioned Null-Text Inversion.
The input image inversion process is borrowed from the 2020 paper Denoising Diffusion Implicit Models. Additionally, editing the age of the throughput image requires a pre-trained age estimator.
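A single deterministic DDIM inversion step (eta = 0) can be sketched as follows, with a stand-in noise prediction rather than a real U-Net:

```python
import numpy as np

def ddim_invert_step(z_t, eps_pred, abar_t, abar_next):
    """One deterministic DDIM inversion step: run the sampler 'backwards'
    from z_t toward higher noise, given the model's noise estimate."""
    # Recover the model's current estimate of the clean latent...
    x0_pred = (z_t - np.sqrt(1 - abar_t) * eps_pred) / np.sqrt(abar_t)
    # ...then re-noise it to the next (noisier) timestep.
    return np.sqrt(abar_next) * x0_pred + np.sqrt(1 - abar_next) * eps_pred

# Toy check: with a zero-noise predictor the step simply rescales the latent.
z = np.ones((4, 8, 8))
z_next = ddim_invert_step(z, np.zeros_like(z), abar_t=0.9, abar_next=0.8)
```

Chaining such steps across the full timestep schedule yields a noise latent that, when denoised again, reconstructs the input image, which is what makes editing-by-resampling possible.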
Using techniques developed for the 2022 offering Prompt-to-Prompt Image Editing with Cross Attention Control, the native cross-attention maps that are used for text-conditioning in Stable Diffusion are targeted and reconditioned.
By these methods, the actual pixels which relate to age, irrespective of identity, can be disentangled from identity characteristics, enabling an identity-agnostic aging workflow. This effectively comprises the age-estimation layer in the system.
With these disentangled features thus characterized, the estimated age can now be replaced with the target age. During the guided generation, the revised cross-attention maps (see above) are injected into the sampling process, while the identity-related latent values remain untouched. The authors state:
‘In this way, the generated image is conditioned on the target age information provided by the target [prompt] through the cross-attention values, while preserving the original spatial structure.
‘Specifically, as only age-related words are modified in the new prompt, only pixels that attend to age-related tokens receive the greatest attention.’
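This attention-injection idea can be illustrated schematically. The map shapes and the token positions marked as age-related below are hypothetical:

```python
import numpy as np

def inject_attention(source_maps, target_maps, age_token_ids):
    """Prompt-to-Prompt style injection: keep the source image's attention
    maps for all tokens, except the age-related ones, which are taken from
    the target-age prompt's pass."""
    merged = source_maps.copy()
    merged[:, age_token_ids] = target_maps[:, age_token_ids]
    return merged

rng = np.random.default_rng(1)
src = rng.random((64, 77))   # pixel-by-token attention from the source prompt
tgt = rng.random((64, 77))   # attention from the target-age prompt
out = inject_attention(src, tgt, age_token_ids=[4, 5])  # hypothetical age-token positions
```

Because only the age-token columns change, only the pixels attending to those tokens are re-steered, which is how the spatial structure of the face survives the edit.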
Optimization of the process can be further enhanced by the use of gender classifiers, to replace the broader token ‘person’ with ‘man’ or ‘woman’. The authors observe that when attempting to radically lower the age of the target, adult tokens such as ‘man’ or ‘woman’ perform poorly.
Data and Tests
To test the system, the researchers used a standard LAION-trained Stable Diffusion installation. A subset of 150 images was drawn from the FFHQ-aging dataset, and the model was fine-tuned for 150 steps at a batch size of 2. The median age of the dataset samples for the ground truth group was used as a fine-tuning prompt (presumably something like ’37 years old’, though the paper does not specify this).
The datasets used in testing were NVIDIA’s FFHQ, which contains 70,000 1024x1024px images of faces, manually labeled into ten age groups from 0–2 years old to over seventy years old; and the much-frequented CelebA-HQ, consisting of 30,000 images – with the latter used only for evaluation, and not for training (i.e., as a control group).
In line with previous studies along these lines, age labels were obtained with the use of the DEX classifier.
Images were downsampled to 512x512px for the study.
For metrics, the researchers required evaluation of age accuracy, age-agnostic attribute preservation, and aging quality. Metrics chosen were Mean Absolute Error (MAE) between predicted and target ages; Kernel Inception Distance (KID), which measured the distributional gap between generated and real images of similar ages; and Face++, which was used to assess aging accuracy, attribute preservation, and blurriness.
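The MAE metric itself is straightforward; a minimal sketch, with made-up example numbers:

```python
import numpy as np

def age_mae(predicted_ages, target_ages):
    """Mean Absolute Error between ages estimated on the generated faces
    and the ages the model was asked to produce."""
    predicted = np.asarray(predicted_ages, dtype=float)
    target = np.asarray(target_ages, dtype=float)
    return float(np.mean(np.abs(predicted - target)))

# e.g. three images aged to 60, re-estimated by a classifier at 58, 63 and 61
err = age_mae([58, 63, 61], [60, 60, 60])   # -> 2.0
```

Lower is better: a small MAE means the age estimator reads the edited faces as close to the requested target age.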
Rival frameworks tested were High Resolution Face Age Editing (HRFAE); Lifespan Age Transformation Synthesis (LATS); and Custom Structure Preservation in Face Aging (CUSP). Though these were the main systems against which the researchers wished to test FADING, additional frameworks used were FaderNet, PAGGAN, and IPCGAN.
Using the evaluation protocol from HRFAE, the tests began on the CelebA-HQ dataset, with the task of transforming 1000 test images labeled ‘young’ into a target age of 60 years.
The researchers emphasize that the images used are extracted from CUSP, and are not ‘cherry-picked’ (i.e., filtered for only the most favorable results).
The authors state:
‘We observe that [FaderNet] introduces little modifications, [PAG-GAN and IPC-GAN] produce pronounced artifacts or degradation. [HRFAE] generates plausible aged faces with minor artifacts but is mostly limited to skin texture changes, such as adding wrinkles.
‘[LATS], [CUSP], and our approach introduce high-level semantic changes, such as significant receding of the hairline (see third row). But LATS operates only in the foreground; it does not deal with backgrounds or clothing and requires a previous masking procedure.
‘On the other hand, CUSP always introduces glasses with aging. This is likely due to the high correlation between age and glasses in their training set. Our method does not introduce these undesired additional accessories, produces fewer artifacts on backgrounds, and possesses more visual fidelity to the input image.’
This phase of the experiments was expanded into a tournament-style face-off against the best-performing framework from this round, CUSP.
For this, input images from all age groups were used, and reported on a per-age basis, to demonstrate continuous transformations through the target ages:
Here, the researchers comment:
‘[Our] approach introduces fewer artifacts, generates realistic textural and semantic modification, and achieves better visual fidelity across all age groups. [We] achieve significant improvement for extreme target ages (infant and elderly, see columns for (4-6) and (70+)). [Our] model handles better rare cases, such as accessories or occlusions. CUSP fails when the source person wears facial accessories.
‘Typically, for the person on the right who wears sunglasses, CUSP falsely translates sunglasses to distorted facial components. In contrast, our method preserves accessories accurately while correctly addressing structural changes elsewhere.
‘These results confirm our initial hypothesis that utilizing a specialized DM pre-trained on a large-scale dataset increases robustness compared to methods exclusively trained on facial datasets, which are susceptible to data bias.’
The researchers note a curious phenomenon with this kind of study – that skin tone tends to shift as age changes – and they suggest that this is inherent in the original Stable Diffusion model.
Regarding the quantitative round, results are mixed:
Here FADING is at parity with CUSP in terms of aging accuracy, and also achieves the highest gender preservation, which the researchers contend demonstrates the system’s ability to disentangle the attributes passing through the framework. Where prior works score better, the authors argue, it is because those methods produce texture-level modifications rather than high-level semantic changes, which earns a friendlier score from automated metrics – suggesting, by implication, that a more adept metric might be needed for this kind of evaluation.
Testing against FFHQ-Aging, FADING achieves a (desirable) lower MAE score, indicating that the system has higher accuracy.
This additional test also re-emphasizes that FADING offers better gender preservation – and the authors note that in the 30-50 age range, FADING achieves almost perfect preservation. FADING’s performance in the quantitative KID analysis is, the authors state, an order of magnitude lower than CUSP’s for nearly all age groups.
Hollywood’s growing interest in AI-generated aging and de-aging puts this kind of research into a critical context. It is very difficult to separate the characteristics of age from the characteristics of identity, and thus this particular challenge is related to the wider problem of entanglement (where traits, such as identity or gender, ‘come along for the ride’, even when the neural system is attempting to extract more discrete characteristics, such as ‘age’).
It’s notable that works of this kind draw so heavily on prior papers (even by the fairly self-referential standard of the image synthesis scene), and that the metrics and supporting systems have to be slightly tormented into fulfilling the necessary role, for the lack of dedicated libraries that might be more apposite.
It could be that far broader studies in aging need to be gathered into this research strand, so that pixel-evaluation and trawling for features among the usual array of common datasets could give way to methodologies (and even dedicated metric algorithms) that are supported by anthropological and medical data.
The importance of efforts such as FADING lies, at the moment, in the need to generate plausible synthetic data for training more conventional deepfake-style systems that have a greater native capacity for video generation than Stable Diffusion and other LDM architectures. However, the key requirement at the moment is to develop guiding principles for these processes.