A Dedicated Loss Function for Neural Face Training

Though they may not be the most glamorous aspect of generative synthesis, loss functions are at the very heart of the ever-expanding possibilities for neural facial recreation.

Loss functions represent the perpetual and ongoing process of guidance (or adjudication) during the training of a model, directing the system towards a more faithful recreation of whatever target metric is the objective of the model being trained (for our purposes, that facial representations should become more realistic and more faithful to a target identity).

On the left we see the target identity, acting as the ground truth for a deepfake training process. Left to right, we see an imagined representation of the way that the loss function helps the system to focus first on essential features of the data that it is iterating over, and then, with the broad strokes mapped out, to focus on the details. Source: https://blog.metaphysic.ai/loss-functions-in-machine-learning/

Many of the most popular loss functions, such as L1 and L2 loss, are purely algorithmic, in that they make mathematical assumptions about a teleological or ‘ideal’ final loss, and drive the training process towards it based on these fixed criteria.

Increasingly, loss functions are making use of user studies or direct human judgements to calibrate themselves. One example of this is the Learned Perceptual Image Patch Similarity (LPIPS) loss function, for which the standards were set by human feedback across 484,000 evaluations of results.

LPIPS relies on human scoring to provide ongoing assessment of quality during training, or as a metric for output at inference time. Source: https://arxiv.org/pdf/1801.03924.pdf

One thing that characterizes even the latest and trendiest loss functions, however, is that they are all general in nature, and not specifically optimized for the task of evaluating human faces. Yet the study of the human face is proving to be as dominant a concern in neural rendering as it is in art and human culture in general.

For this reason, interest has grown in recent years in developing a broadly applicable loss function that could be dropped into a wide range of neural workflows purely for facial evaluation, without needing to adhere to particular datasets or forcing projects to conform to specific methodologies.

To this end, a new joint paper from Disney Research (currently one of the leaders in human synthesis research) and ETH Zürich proposes to fill this gap, with a methodology based on a 1990s paper that demonstrated how humans use shading to evaluate 3D qualities and realism.

Examples of the new system, compared to analogous prior approaches. Source: https://arxiv.org/pdf/2310.19580.pdf

The new system uses depth-map estimation to convert color images – both fake and real, as necessary – into grayscale images that serve as an indicator of topology. It is agnostic to any particular mesh-based approach, which allows it to be considered a generic loss algorithm that could be used in multiple types of neural training systems.
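To make the underlying idea concrete, the following is a minimal, illustrative sketch – not the paper's actual renderer – of how a depth map could be converted into a single-light grayscale shading image by estimating surface normals from depth gradients; the function name and the simple Lambertian shading model are assumptions for demonstration only.

```python
import numpy as np

def shade_from_depth(depth: np.ndarray, light=(0.0, 0.0, 1.0)) -> np.ndarray:
    """Illustrative only: convert a depth map into a single-light grayscale
    shading image via per-pixel surface normals (Lambertian shading)."""
    depth = depth.astype(np.float64)
    # Depth gradients give the surface slope along y (rows) and x (columns).
    dz_dy, dz_dx = np.gradient(depth)
    # Per-pixel surface normal: (-dz/dx, -dz/dy, 1), normalized to unit length.
    normals = np.dstack([-dz_dx, -dz_dy, np.ones_like(depth)])
    normals /= np.linalg.norm(normals, axis=2, keepdims=True)
    # Lambertian term: dot product with a unit light direction, clamped to [0, 1].
    light = np.asarray(light, dtype=np.float64)
    light /= np.linalg.norm(light)
    return np.clip(normals @ light, 0.0, 1.0)  # H x W grayscale image in [0, 1]
```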

Perceptual Shape Loss uses backgrounds of pure noise, to concentrate the evaluative system on the actual content. Source: https://www.youtube.com/watch?v=RYdyoIZEuUI

Additionally, the paper offers a dedicated new loss function titled Perceptual Shape Loss (PSL), which is based on the discriminator system of Generative Adversarial Networks (GANs), the improved efficacy of which was demonstrated in multiple tests against apposite benchmarks.

The new paper is titled A Perceptual Shape Loss for Monocular 3D Face Reconstruction, and comes from six researchers across the two institutions.

Method

The new Disney/ETH approach is inspired by the 1992 paper On the perception of shape from shading, which posits that our ability to use shading to determine three-dimensionality is among the most primitive and fundamental faculties of human vision: it dispenses with the luxury of color and other ancillary cues, and continues to operate in poor light and under the most severe circumstances in which we are able to see at all (including when only monocular vision is available).

From the 1992 paper on which the PSL approach is based. Source: http://wexler.free.fr/library/files/kleffner%20(1992)%20on%20the%20perception%20of%20shape%20from%20shading.pdf

Shading, it is argued, is thus a robust lower-level cue that can be relied upon, and is therefore a suitable basis for a loss function that must distill essential information.

The new Disney/ETH paper states:

‘Our key idea is to create a loss function for monocular 3D face reconstruction, implemented as a neural network that takes a face image and a gray-shaded render of face geometry as input, and outputs a scalar value to indicate how well the render matches the image in terms of shape.

‘The network should intuitively critique the inputs, and provide continuous feedback about the ‘goodness’ of match between the image and the render, for any image-render pair. Once such a critic network is trained, its output can be interpreted as a perceptual shape loss that can be used for 3D face reconstruction tasks.’

The critic network developed for PSL, and its underlying methodology, builds on the base discriminator of the Deep Convolutional GAN (DCGAN). The authors extended the original specification of that project with two extra convolutional layers, bringing the functional resolution up from 64×64px to 256×256px.

The input to the new network consists of an original RGB image and a derived grayscale render inset against a background of noise (to allow the training system to concentrate on the central content), with the grayscale image illuminated by a single light source.
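As a rough PyTorch sketch of what such a critic could look like, the example below assumes that the RGB image and the grayscale render are simply stacked into a four-channel 256×256 input; the class name, channel widths and normalization choices are illustrative guesses, not the paper's exact specification.

```python
import torch
import torch.nn as nn

class ShapeCritic(nn.Module):
    """DCGAN-style critic extended with two extra strided convolutions, so the
    input resolution is 256x256 rather than DCGAN's original 64x64 (sketch)."""
    def __init__(self, in_channels: int = 4, base: int = 64):
        super().__init__()
        widths = [base, base * 2, base * 4, base * 8, base * 8, base * 8]
        blocks, ch = [], in_channels
        for w in widths:  # six stride-2 convolutions: 256 -> 128 -> ... -> 4
            blocks += [nn.Conv2d(ch, w, 4, stride=2, padding=1),
                       nn.BatchNorm2d(w),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch = w
        blocks.append(nn.Conv2d(ch, 1, 4, stride=1, padding=0))  # 4x4 -> 1x1 score
        self.net = nn.Sequential(*blocks)

    def forward(self, rgb: torch.Tensor, gray_render: torch.Tensor) -> torch.Tensor:
        # rgb: (B, 3, 256, 256); gray_render: (B, 1, 256, 256) over a noise background.
        x = torch.cat([rgb, gray_render], dim=1)
        return self.net(x).flatten(1)  # (B, 1) scalar score of image/render agreement
```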

Simple lighting and backgrounds in the PSL approach.

The network is trained to reward good matches between an image/render pair, and to punish poor matches. To help the system generalize later to novel, unseen data, the training data contains both good and bad matches, with bad matches exemplified by mismatched expressions, poses, identities, and other core characteristics.
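A hedged sketch of how such matched and mismatched pairs might be assembled is given below; the dictionary keys and the one-bad-per-good ratio are hypothetical, chosen only to illustrate the idea.

```python
import random

def make_training_pairs(samples, n_bad_per_good: int = 1):
    """Pair each image with the gray render of its own geometry (label 1.0),
    and with renders drawn from other samples - differing in identity and/or
    expression - as deliberate mismatches (label 0.0)."""
    pairs = []
    for sample in samples:
        pairs.append((sample["image"], sample["render"], 1.0))      # good match
        for _ in range(n_bad_per_good):
            other = random.choice(samples)
            if other is sample:        # skip the rare self-draw; not a mismatch
                continue
            pairs.append((sample["image"], other["render"], 0.0))   # bad match
    return pairs
```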

Examples of deliberate mis-matches designed to help the system learn to discriminate correctly.

Data and Tests

The base training dataset was the Disney/ETH Zurich dataset Semantic Deep Face Models, which contains 358 distinct identities across 12 camera viewpoints, with 24 facial expressions. Face geometry is created with the passive stereo approach presented in the earlier (2010) Disney/ETH paper High-Quality Single-Shot Capture of Facial Geometry.

From the earlier 2010 Disney/ETH paper 'High-Quality Single-Shot Capture of Facial Geometry', examples of geometry obtained from original images. Source: https://studios.disneyresearch.com/wp-content/uploads/2019/03/High-Quality-Single-Shot-Capture-of-Facial-Geometry.pdf

It was important to generate samples with only a single light source, so the researchers used the prior Rendering with Style method (another Disney/ETH collaboration, from late 2021), which facilitates this. The inpainting used in this older technique allowed the authors to control lighting and viewpoint; and they comment:

‘The resulting dataset contains a large variety of quasi in-the-wild images for which we have the corresponding high-quality geometry serving as ground truth. Given this synthetic data, we can create real and fake samples using the same approach as for studio data.’

The network was ultimately trained on 92,736 examples of real people captured in studio conditions, and 6,276 synthetic images. The distribution of ‘incorrect’ pairings was equally divided between incorrect identities, incorrect expressions, and both.

A further 99,012 examples were generated with ad hoc, non-conforming head rotations and other skewing of the geometry, leading to a total of 495,060 unique training images. Only one-fifth of these were real-world images, and these were repeated four times to equalize the proportions of real and fake data shown to the system.
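Purely as a sanity check, the stated figures are internally consistent if the 'one-fifth' is read as the 99,012 correctly matched examples (this grouping is our reading of the numbers, not a breakdown given by the paper):

\[
92{,}736 + 6{,}276 = 99{,}012, \qquad 5 \times 99{,}012 = 495{,}060,
\]
\[
4 \times 99{,}012 = 396{,}048 = 495{,}060 - 99{,}012,
\]

so repeating that fifth four times yields 396,048 presentations, balancing the 396,048 remaining perturbed and mismatched examples.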

Additionally, a further 20,736 examples (50% real and 50% fake) were used as a validation set.

Before training, the data was normalized by cropping faces and aligning eyes, mouth and nose to a common pose, in line with the 2020 Disney/ETH paper High-Resolution Neural Face Swapping for Visual Effects.
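A minimal sketch of this kind of landmark-based normalization follows, assuming five detected landmarks (eye centers, nose tip, mouth corners); the canonical template coordinates below are hypothetical, not those used by the paper.

```python
import cv2
import numpy as np

# Hypothetical canonical landmark positions in a 256x256 crop:
# left eye, right eye, nose tip, left mouth corner, right mouth corner.
CANONICAL = np.float32([[88, 108], [168, 108], [128, 152], [100, 196], [156, 196]])

def align_face(image: np.ndarray, landmarks: np.ndarray, size: int = 256) -> np.ndarray:
    """Estimate a similarity transform mapping detected landmarks onto the
    canonical template, then warp the image to a common pose and scale."""
    matrix, _ = cv2.estimateAffinePartial2D(np.float32(landmarks), CANONICAL)
    return cv2.warpAffine(image, matrix, (size, size), flags=cv2.INTER_LINEAR)
```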

This process involved rejecting geometry poses that fell outside the bounds of a relatively portrait-style pose (and the researchers concede that this method in general is not the best at recreating side-views, citing the common knowledge that in-the-wild profile data is scarce across the research sector, which has a severe downstream influence on the capability of facial neural synthesis systems to generate accurate side-views).

The critic network was trained for four epochs (i.e., four complete views of the data by the training system) on a Titan X GPU with 12GB of VRAM. Training took eight hours, and used a PyTorch implementation with a learning rate of 0.001, a batch size of 64, and images at a resolution of 256×256px.
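With those reported hyperparameters, a training loop for the critic might look roughly like the following; the choice of the Adam optimizer and of a binary cross-entropy objective are assumptions, since the article does not state them.

```python
import torch
import torch.nn as nn
from torch.utils.data import DataLoader

LEARNING_RATE, BATCH_SIZE, EPOCHS = 1e-3, 64, 4  # figures reported in the article

def train_critic(critic: nn.Module, dataset, device: str = "cuda") -> nn.Module:
    loader = DataLoader(dataset, batch_size=BATCH_SIZE, shuffle=True)
    optimizer = torch.optim.Adam(critic.parameters(), lr=LEARNING_RATE)  # assumed optimizer
    objective = nn.BCEWithLogitsLoss()                                   # assumed objective
    critic.to(device).train()
    for _ in range(EPOCHS):                        # four full passes over the data
        for rgb, render, label in loader:          # label: 1.0 good match, 0.0 mismatch
            rgb, render, label = rgb.to(device), render.to(device), label.to(device)
            score = critic(rgb, render).squeeze(1)
            loss = objective(score, label)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return critic
```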

To test whether the trained system could distinguish real from fake images, the researchers plotted the distribution of output scores from the critic system, finding that the new approach offers a ‘reasonable separation’.

A histogram of score separation across images known to be real or fake.

Next, it was necessary to assess whether the new system was suitable as a novel loss metric in its own right. The authors explain:

‘A loss function should be differentiable, continuous, smooth, and ideally have a minimum at the optimal parameter inputs. As our perceptual loss is implemented as a CNN in PyTorch, it is trivially differentiable.’

This was assessed by plotting the normalized output score while performing perturbations on the geometry, as would happen in a real-world training system.

This was repeated three times on diverse example images, varying factors such as identity and head rotation, among others.
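The probing procedure could be approximated along the following lines, assuming a hypothetical render_fn that re-renders the gray-shaded geometry for a given offset of a single parameter (for example, a head rotation or expression coefficient).

```python
import numpy as np
import matplotlib.pyplot as plt
import torch

@torch.no_grad()
def plot_score_sweep(critic, rgb, render_fn, offsets=np.linspace(-0.5, 0.5, 21)):
    """Perturb one geometry parameter, re-render, and plot the critic's
    normalized score against the perturbation (illustrative probe only)."""
    scores = []
    for delta in offsets:
        render = render_fn(delta)                 # (1, 1, 256, 256) gray render
        scores.append(critic(rgb, render).item())
    scores = np.asarray(scores)
    scores = (scores - scores.min()) / (scores.max() - scores.min() + 1e-8)
    plt.plot(offsets, scores)
    plt.xlabel("parameter offset from optimum")
    plt.ylabel("normalized critic score")
    plt.show()
```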

Testing for loss function candidacy.

The paper states:

‘As the plots indicate, our function is smooth and continuous, and generally decreases as we deviate from the optimal parameters.

‘While not performing an exhaustive evaluation of the local neighborhood around the optimal parameters to show a strict maximum, our experiments show evidence of a local maximum for the optimal geometry renders.’

The authors next tested the system for accuracy of perceptual shape loss during facial reconstruction, using the Faces Learned with an Articulated Model and Expressions (FLAME) system. The three scenarios evaluated in this part of the tests were single-image optimization; regression network training (using the loss to train an inference network that regresses face parameters directly); and topology independence (how agnostic the method is to specific mesh-based approaches). Each of these used three versions of the loss function: a baseline model, Lbase; a ‘full’ model, Lfull; and an optimized ‘compact’ version, Lcompact.
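The article does not give the exact composition of each variant, but the descriptions that follow (Lfull adds the new perceptual term to the baseline; Lcompact keeps the shape and landmark terms alone) suggest a schematic of the following form, in which the weights λ and μ are hypothetical and L_PSL denotes the term derived from the critic's score:

\[
\mathcal{L}_{\mathrm{base}} = \mathcal{L}_{\mathrm{standard}}, \qquad
\mathcal{L}_{\mathrm{full}} = \mathcal{L}_{\mathrm{base}} + \lambda\,\mathcal{L}_{\mathrm{PSL}}, \qquad
\mathcal{L}_{\mathrm{compact}} = \mathcal{L}_{\mathrm{PSL}} + \mu\,\mathcal{L}_{\mathrm{landmark}}
\]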

Evaluation for single-image optimization used the pre-trained DECA inference network as initialization, and was scored on the NoW benchmark.

The optimization results on the validation set are visualized in the upper section of the image below:

Results on the NoW validation set. Please refer to source paper for better resolution.

The paper states:

‘[We] can see that using our new loss function (Lfull) increases the accuracy of reconstruction over the baseline.

‘The greatest improvement over the baseline can be seen on the NoW selfie challenge subset, which contains mostly frontal faces. Note that all optimizations also naturally improve over the initialization (DECA), which is an inference-based method.’

The authors note additionally that the compact loss method is also better than the baseline, and ‘nearly as good’ as the full method.

NoW test optimization, the researchers observe, shows similar improvements:

Results on the non-metrical NoW benchmark for initial DECA reconstruction, pitted against the authors' novel loss function.

The authors note:

‘The optimization leads to best results when including our perceptual shape loss (Lfull), and even using our shape loss and landmark loss alone achieves good results (Lcompact).’

The researchers surmise that the results indicate that PSL is a sound candidate for an architecture-agnostic perceptual loss metric and function.

Select examples from the paper's published qualitative results for face reconstruction on in-the-wild images.

Next, the authors trained a new inference-based parameter regression network with PSL, across all three sizes of the algorithm. For this, they fine-tuned DECA on 2,000 frontal images (since frontal faces performed best in other parts of the testing cycle) from the CelebAMask-HQ dataset, using a learning rate of 0.00001 under the Adam optimizer, for 1,700 iterations.
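With the reported settings, that fine-tuning stage might be sketched as follows; the psl_objective callable stands in for DECA's training losses augmented with the perceptual shape term, and is hypothetical.

```python
import torch

FINETUNE_LR, ITERATIONS = 1e-5, 1_700   # settings reported in the article

def finetune_regressor(deca_model, frontal_loader, psl_objective):
    """Fine-tune a pre-trained parameter-regression network on frontal crops,
    using a loss that includes the perceptual shape term (sketch only)."""
    optimizer = torch.optim.Adam(deca_model.parameters(), lr=FINETUNE_LR)
    step = 0
    while step < ITERATIONS:
        for batch in frontal_loader:              # 2,000 frontal CelebAMask-HQ images
            loss = psl_objective(deca_model, batch)
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
            step += 1
            if step >= ITERATIONS:
                break
    return deca_model
```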

Inference-based facial reconstruction – results across the three sizes of algorithm, using DECA.

The new method was additionally tested against the prior inference-based frameworks EMOCA and SPECTRE (both of which also use DECA and bespoke perceptual losses), this time with the baseline method coming out ahead:

The Lbase method comes out ahead against rival proposals EMOCA and SPECTRE.

The paper states:

‘[The results] show that using our perceptual shape loss improves upon the DECA initialization in terms of identity. We observe, similar to the optimization case, that the Lfull and Lcompact settings show the greatest improvement over Lbase on the NoW Challenge selfie category.

‘On the NoW test [set], Lbase achieves the best score. However, both Lfull and Lcompact still improve upon the DECA initialization. In contrast, EMOCA and SPECTRE keep the DECA identity parameters fixed and do not improve upon the DECA results on the NoW benchmark.’

Finally, the researchers tested for topology independence, to ensure that the candidate loss function was indifferent to specific reconstruction methodologies, such as the differing number of vertices in diverse mesh-based methods (i.e., methods that use CGI and parametric representation in some way).

Four different mesh topologies across four different systems were tested, respectively using facial meshes containing 5,072, 19,577, 35,709 and 38,799 vertices. These meshes were tested along with the Basel Face Model (BFM), and the PCA face models used (at the four different resolutions) were created along the lines suggested by the aforementioned Semantic Deep Face Models.

The new loss function performs well irrespective of the configuration of the meshes used in the pipeline.

Of these results, the authors state:

‘These experiments indicate that our loss function is topology-agnostic, and our critic network does not need to be retrained when the topology (or model) changes.’

For a video overview of the project, see the embedded video at the end of this article.

Conclusion

It is good to see the state-of-the-art in face-specific loss functions advanced at all. Subject-neutral approaches such as LPIPS, while having been trained extensively on faces, represent a ‘jack of all trades’ approach that inevitably sacrifices some attention during training to non-face (and non-human) data.

The ongoing commitment of Disney and ETH to mesh-based approaches does, however, continue to lock facial neural workflows into CGI-based instrumentalities and methodologies. For the time being, this is a pragmatic compromise; but one can hope to eventually see similar face-based loss functions that make heavier use of priors, and seek to obtain the same, or better results, without needing to resort to CGI as an intermediary technology.
