Though they may not be the most glamorous aspect of generative synthesis, loss functions are at the very heart of the ever-expanding possibilities for neural facial recreation.
Loss functions represent the perpetual and ongoing process of guidance (or adjudication) during the training of a model, directing the system towards a more faithful recreation of whatever target metric is the objective of the model being trained (for our purposes, that facial representations should become more realistic and more faithful to a target identity).
Many of the most popular loss functions, such as L1 and L2 loss, are purely algorithmic, in that they make mathematical assumptions about a teleological or ‘ideal’ final loss, and drive the training process towards it based on these fixed criteria.
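As a concrete reference point, L1 and L2 loss can each be written in a few lines. The sketch below is a minimal pure-Python illustration on flat lists of pixel values; the function names `l1_loss` and `l2_loss`, and the choice of mean reduction, are assumptions made here for illustration rather than the API of any particular framework.

```python
def l1_loss(prediction, target):
    """Mean absolute error between two equal-length sequences."""
    if len(prediction) != len(target):
        raise ValueError("inputs must have the same length")
    return sum(abs(p - t) for p, t in zip(prediction, target)) / len(prediction)


def l2_loss(prediction, target):
    """Mean squared error between two equal-length sequences."""
    if len(prediction) != len(target):
        raise ValueError("inputs must have the same length")
    return sum((p - t) ** 2 for p, t in zip(prediction, target)) / len(prediction)


if __name__ == "__main__":
    generated = [0.0, 0.5, 1.0, 0.25]
    reference = [0.0, 0.5, 0.5, 0.75]
    print(l1_loss(generated, reference))  # 0.25
    print(l2_loss(generated, reference))  # 0.125
```

Both functions apply the same fixed rule to every pixel, whatever the image depicts, which is what ‘purely algorithmic’ amounts to in practice.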
Increasingly, loss functions are using user studies or direct human judgements to calibrate themselves. One example of this is the Perceptual Similarity Metric and Dataset (LPIPS) loss function, for which the standards were set by human feedback across 484,000 in-person evaluations of results.
One thing that characterizes even the trendiest and latest loss functions, however, is that they are all general in nature, and not specifically optimized for the task of evaluating human faces. Yet the study of the human face is proving as predominant a concern in neural rendering as it is in art and human culture in general.
For this reason, some interest has arisen in recent years for the development of a broadly applicable loss function capable of being dropped into a wide range of neural workflows, purely for facial evaluation, without needing to adhere to particular datasets or to force projects to conform to specific methodologies.
To this end, a new joint paper from Disney Research (currently one of the leaders in human synthesis research) and ETH Zürich is proposing to fill this gap, with a methodology based on a 1990s paper that demonstrated how humans use shading to evaluate 3D qualities and realism.
The new system uses depth-map estimation to convert color images – both fake and real, as necessary – into grayscale images that serve as a mean indicator of topology, and is agnostic to any particular mesh-based approach, which allows it to be considered as a generic loss algorithm that could be used in multiple types of neural training systems.
Additionally, the paper offers a dedicated new loss function titled Perceptual Shape Loss (PSL), which is based on the discriminator system of Generative Adversarial Networks (GANs), the improved efficacy of which was demonstrated in multiple tests against apposite benchmarks.
The new paper is titled A Perceptual Shape Loss for Monocular 3D Face Reconstruction, and comes from six researchers across the two institutions.
The new Disney/ETH approach is inspired by the 1992 paper On the perception of shape from shading, which posits that our ability to use shading to determine three-dimensionality is among the most primitive and fundamental functionalities of human vision, since it dispenses with the luxury of color and of other ancillary factors, and continues to operate in poor light and under the most severe circumstances in which we are able to see at all (such as actually having only monocular vision).
Shading is thus argued to be a low-level visual function that can be relied upon, which makes it a suitable basis for a loss function that must distill essential information.
The new Disney/ETH paper states:
‘Our key idea is to create a loss function for monocular 3D face reconstruction, implemented as a neural network that takes a face image and a gray-shaded render of face geometry as input, and outputs a scalar value to indicate how well the render matches the image in terms of shape.
‘The network should intuitively critique the inputs, and provide continuous feedback about the ‘goodness’ of match between the image and the render, for any image-render pair. Once such a critic network is trained, its output can be interpreted as a perceptual shape loss that can be used for 3D face reconstruction tasks.’
The critic network developed for PSL, together with its underlying methodology, is an advance on the base discriminator function of Deep Convolutional GAN (DCGAN). The authors extended the original spec of that project with the addition of two extra convolutional layers, to bring the functional resolution up from 64×64px to 256×256px.
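The arithmetic behind the two extra layers can be checked with the standard convolution output-size formula. The sketch below assumes the layer settings of the usual DCGAN discriminator (4×4 kernels with stride 2 and padding 1 for each downsampling layer, followed by a final 4×4 convolution that collapses a 4×4 feature map to a single score). These settings are an assumption about the baseline and are not stated in the text above, and the helper names are illustrative.

```python
def conv_output_size(size, kernel=4, stride=2, padding=1):
    """Spatial output size of a square convolution (no dilation)."""
    return (size + 2 * padding - kernel) // stride + 1


def downsampling_layers_needed(input_size, final_map=4):
    """Count stride-2 layers needed to shrink input_size down to final_map."""
    layers = 0
    size = input_size
    while size > final_map:
        size = conv_output_size(size)
        layers += 1
    if size != final_map:
        raise ValueError("input size does not reduce cleanly to the final map")
    return layers


if __name__ == "__main__":
    for resolution in (64, 256):
        print(resolution, downsampling_layers_needed(resolution))
    # 64 needs 4 downsampling layers and 256 needs 6: two extra layers
```

Under these assumptions, each additional stride-2 layer doubles the input resolution the network can accept, so two extra layers account for the fourfold increase from 64 to 256 pixels per side.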
The input to the new network consists of an original RGB image and a derived grayscale render inset against a background of noise (to allow the training system to concentrate on the central content), with the grayscale image illuminated by a single light source.
The network is trained to reward good matches between an image/render pair, and to punish poor matches. To help the system to operate later on novel, unseen data, the training data used contains both good and bad matches, with bad matches exemplified in mismatched expressions, poses, identity, and other core characteristics.
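One plausible way to assemble such pairs is sketched below. It is an illustration only: the `Sample` record, the `make_pairs` helper and the label values are hypothetical, and pose mismatches are left out for brevity. A pair is labeled as a good match only when the image and the render share both identity and expression.

```python
import random
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    identity: int
    expression: int


def make_pairs(samples, rng):
    """Return (image, render, label) triples; label 1 = match, 0 = mismatch."""
    pairs = []
    for image in samples:
        # Good match: the render comes from the same capture as the image.
        pairs.append((image, image, 1))
        # Candidate renders for each kind of bad match.
        wrong_identity = [s for s in samples
                          if s.identity != image.identity
                          and s.expression == image.expression]
        wrong_expression = [s for s in samples
                            if s.identity == image.identity
                            and s.expression != image.expression]
        wrong_both = [s for s in samples
                      if s.identity != image.identity
                      and s.expression != image.expression]
        for candidates in (wrong_identity, wrong_expression, wrong_both):
            if candidates:
                pairs.append((image, rng.choice(candidates), 0))
    return pairs


if __name__ == "__main__":
    captures = [Sample(i, e) for i in range(3) for e in range(3)]
    pairs = make_pairs(captures, random.Random(0))
    print(len(pairs), sum(label for _, _, label in pairs))
```

In this toy setup every image contributes one good pair and up to three bad ones, so the critic sees the same image both rewarded and punished depending on the render it is paired with.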
Data and Tests
The base training dataset was the Disney/ETH Zürich dataset Semantic Deep Face Models, which contains 358 distinct identities across 12 camera viewpoints, with 24 facial expressions. Face geometry is created with the passive stereo evaluation approach presented in the earlier (2010) Disney/ETH paper High-Quality Single-Shot Capture of Facial Geometry.
It was important to generate samples with only a single light source, so the researchers used the prior Rendering with Style method (another Disney/ETH collaboration, from late 2021), which facilitates this. The inpainting used in this older technique allowed the authors to control lighting and viewpoint; and they comment:
‘The resulting dataset contains a large variety of quasi in-the-wild images for which we have the corresponding high-quality geometry serving as ground truth. Given this synthetic data, we can create real and fake samples using the same approach as for studio data.’
The network was ultimately trained on 92,736 examples of real people taken in studio conditions, and 6,276 synthetic images. The distribution of ‘incorrect’ pairings was equally divided between incorrect corresponding identities; incorrect expressions; and both.
A further 99,012 examples were generated with ad hoc and non-conforming head rotations and other skewing of the geometry, leading to a total of 495,060 unique training images, containing only one-fifth real-world images, which were repeated four times to equalize the proportions of real and fake data shown to the system.
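These figures are internally consistent, as a quick tally shows. The breakdown below is a reading of the numbers quoted above rather than something spelled out in the paper: it assumes one correct pair per captured example, one incorrect pair for each of the three mismatch types, and one perturbed-geometry pair. The variable names are illustrative.

```python
studio = 92_736
synthetic = 6_276

real_pairs = studio + synthetic               # correct image/render pairs
mismatch_types = 3                            # identity, expression, both
mismatched_pairs = mismatch_types * real_pairs
perturbed_pairs = real_pairs                  # rotated or skewed geometry
fake_pairs = mismatched_pairs + perturbed_pairs
total = real_pairs + fake_pairs

print(real_pairs)                # 99012
print(total)                     # 495060
print(fake_pairs // real_pairs)  # 4
```

On this reading, real pairs make up exactly one-fifth of the total, and repeating them four times yields as many real samples as fake ones.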
Additionally, a further 20,736 examples (50% real and 50% fake) were used as a validation set.
Before training, the data was normalized by cropping faces and aligning eyes, mouth and nose to a common pose, in line with the 2020 Disney/ETH paper High-Resolution Neural Face Swapping for Visual Effects.
This process involved rejecting geometry poses that did not conform to the bounds of a relatively portrait-style pose (and the researchers concede that this method in general is not the best at recreating side-views, citing the common knowledge that in-the-wild profile data is lacking in the research sector in general, and that this has a severe downstream influence on the capabilities of facial neural synthesis systems to generate accurate side-views).
The critic network was trained for four epochs (i.e., four complete views of the data by the training system) on a Titan X GPU, with 12GB of VRAM. Training took eight hours, and used a PyTorch framework running at a learning rate of 0.001, on a batch size of 64, using images at a resolution of 256×256px.
To test whether the trained system could distinguish real from fake images, the researchers plotted the distribution of output scores from the critic system, finding that the new approach offers a ‘reasonable separation’.
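Separation of this kind can also be summarized numerically. The sketch below computes the probability that a randomly chosen real pair receives a higher critic score than a randomly chosen fake pair, which is equivalent to the area under the ROC curve. This statistic is an illustration chosen here, not a measure reported in the paper, and the scores are made-up placeholder values.

```python
def separation_score(real_scores, fake_scores):
    """Fraction of (real, fake) score pairs ranked correctly; ties count half."""
    if not real_scores or not fake_scores:
        raise ValueError("both score lists must be non-empty")
    correct = 0.0
    for real in real_scores:
        for fake in fake_scores:
            if real > fake:
                correct += 1.0
            elif real == fake:
                correct += 0.5
    return correct / (len(real_scores) * len(fake_scores))


if __name__ == "__main__":
    # Placeholder critic outputs, not values from the paper.
    real = [0.9, 0.8, 0.4]
    fake = [0.1, 0.5, 0.3]
    print(separation_score(real, fake))
```

A value of 1.0 would mean the two distributions do not overlap at all, while 0.5 would mean the critic cannot tell real pairs from fake ones.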
Next, it was necessary to assess whether the new system was suitable as a novel loss metric in its own right. The authors explain:
‘A loss function should be differentiable, continuous, smooth, and ideally have a minimum at the optimal parameter inputs. As our perceptual loss is implemented as a CNN in PyTorch, it is trivially differentiable.’
The remaining properties were examined empirically, by plotting the normalized output score while perturbing the geometry, as would happen in a real-world training system.
This was repeated three times on diverse example images, varying factors such as identity and head rotation, among others.
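A sweep of this kind can be sketched as follows. The critic here is a stand-in (a toy function that peaks at the optimal parameters), since the trained network is not available in this context; `toy_critic`, `sweep_parameter`, the parameter values and the offset range are all hypothetical.

```python
def toy_critic(params, optimum):
    """Stand-in for the critic: higher is better, peaking at the optimum."""
    return -sum((p - o) ** 2 for p, o in zip(params, optimum))


def sweep_parameter(critic, optimum, index, offsets):
    """Perturb one parameter by each offset; return min-max normalized scores."""
    scores = []
    for offset in offsets:
        perturbed = list(optimum)
        perturbed[index] += offset
        scores.append(critic(perturbed, optimum))
    low, high = min(scores), max(scores)
    span = high - low
    if span == 0:
        return [0.0 for _ in scores]
    return [(score - low) / span for score in scores]


if __name__ == "__main__":
    optimum = [0.2, -0.1, 0.5]
    offsets = [i / 10 for i in range(-5, 6)]
    curve = sweep_parameter(toy_critic, optimum, index=0, offsets=offsets)
    best_offset = offsets[curve.index(max(curve))]
    print(best_offset)  # 0.0
```

With a real critic, a curve that rises smoothly to a single peak at zero offset is the behavior one would hope to see; a jagged or multi-peaked curve would suggest the score is a poor optimization target.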
The paper states:
‘As the plots indicate, our function is smooth and continuous, and generally decreases as we deviate from the optimal parameters.
‘While not performing an exhaustive evaluation of the local neighborhood around the optimal parameters to show a strict maximum, our experiments show evidence of a local maximum for the optimal geometry renders.’
The authors next tested the system for accuracy of perceptual shape loss during facial reconstruction, using the Faces Learned with an Articulated Model and Expressions (FLAME) system. The three scenarios evaluated in this part of the tests were single-image optimization; regression network training (the creation of an applicable loss function algorithm); and topology independence (how agnostic the method is to specific mesh-based approaches). Each of these used three versions of the loss function: a baseline model, Lbase; a ‘full’ model, Lfull; and an optimized ‘compact’ version, Lcompact.
The optimization results on the validation set are visualized in the upper section of the image below:
The paper states:
‘[We] can see that using our new loss function (Lfull) increases the accuracy of reconstruction over the baseline.
‘The greatest improvement over the baseline can be seen on the NoW selfie challenge subset, which contains mostly frontal faces. Note that all optimizations also naturally improve over the initialization (DECA), which is an inference-based method.’
The authors note additionally that the compact loss method is also better than the baseline, and ‘nearly as good’ as the full method.
NoW test optimization, the researchers observe, shows similar improvements:
The authors note:
‘The optimization leads to best results when including our perceptual shape loss (Lfull), and even using our shape loss and landmark loss alone achieves good results (Lcompact).’
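Taken together, these descriptions suggest how the three variants relate. The sketch below is an interpretation, not the paper's formulation: it assumes the baseline is a weighted sum of conventional terms (named `landmark` and `photometric` here purely for illustration), that the full variant adds the shape term to the baseline, and that the compact variant keeps only the shape and landmark terms, as the quotation above states. Because the critic scores good matches higher, the shape term is assumed to be the negated critic score. All weights and input values are placeholders.

```python
def shape_loss(critic_score):
    """Turn a critic score (higher = better match) into a loss to minimize."""
    return -critic_score


def combine(terms, weights):
    """Weighted sum over the loss terms named in `weights`."""
    return sum(weights[name] * terms[name] for name in weights)


# Placeholder weights; the actual values are not given in the text above.
VARIANTS = {
    "base": {"landmark": 1.0, "photometric": 1.0},
    "full": {"landmark": 1.0, "photometric": 1.0, "shape": 1.0},
    "compact": {"landmark": 1.0, "shape": 1.0},
}


if __name__ == "__main__":
    # Illustrative term values for a single image/render pair.
    terms = {"landmark": 0.25, "photometric": 0.5, "shape": shape_loss(0.75)}
    for name, weights in VARIANTS.items():
        print(name, combine(terms, weights))
```

Keeping the variants as weight tables makes it easy to switch terms on and off in this way without touching the optimization loop.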
The researchers surmise that the results indicate that PSL is a sound candidate for an architecture-agnostic perceptual loss metric and function.
Next, the authors trained a new inference-based parameter regression method, PSL, on all three sizes of the algorithm. For this, they fine-tuned DECA on 2,000 frontal images (since these performed best in other parts of the testing cycle) from the CelebAMask-HQ dataset, using a learning rate of 0.00001 under the Adam optimizer, for 1,700 iterations.
The paper states:
‘[The results] show that using our perceptual shape loss improves upon the DECA initialization in terms of identity. We observe, similar to the optimization case, that the Lfull and Lcompact settings show the greatest improvement over Lbase on the NoW Challenge selfie category.
‘On the NoW test [set], Lbase achieves the best score. However, both Lfull and Lcompact still improve upon the DECA initialization. In contrast, EMOCA and SPECTRE keep the DECA identity parameters fixed and do not improve upon the DECA results on the NoW benchmark.’
Finally, the researchers tested for topology independence, to ensure that the candidate loss function was indifferent to specific reconstruction methodologies, such as the differing number of vertices in diverse mesh-based methods (i.e., methods that use CGI and parametric representation in some way).
Four different mesh topologies across four different systems were tested, respectively using facial meshes containing 5,072, 19,577, 35,709 and 38,799 vertices. These meshes were tested along with the Basel Face Model (BFM), and the PCA face models used (at the four different resolutions) were created along the lines suggested by the aforementioned Semantic Deep Face Models.
Of these results, the authors state:
‘These experiments indicate that our loss function is topology-agnostic, and our critic network does not need to be retrained when the topology (or model) changes.’
For a video overview of the project, see the embedded video at the end of this article.
It is good to see the state of the art in face-specific loss functions advanced at all. Subject-neutral approaches such as LPIPS, while having been trained extensively on faces, represent a ‘jack of all trades’ approach that inevitably sacrifices some attention during training to non-face (and non-human) data.
The ongoing commitment of Disney and ETH to mesh-based approaches does, however, continue to lock facial neural workflows into CGI-based instrumentalities and methodologies. For the time being, this is a pragmatic compromise; but one can hope to eventually see similar face-based loss functions that make heavier use of priors, and seek to obtain the same, or better results, without needing to resort to CGI as an intermediary technology.