Dealing with Unconventional Facial Expressions in Neural Synthesis


Considering the tremendous surge of interest in human synthesis over the last 18 months, relatively little attention is being paid, either in the research scene or in the public’s understanding of the implications of generative AI, to the challenge of recognizing, altering and synthesizing human facial expressions.

Anyone who has ever worked with a demanding director, on set or in post-production, will be aware of the assiduous attention to detail that can lead to 127 takes of a scene, or to production delays so severe that the movie must effectively be shot a second time, having been overtaken by emerging technology in the interim.

So as it becomes increasingly possible to alter facial expressions in post-production, via neural techniques such as expression editing, directors will expect a full toolkit of techniques in this regard – and that currently does not exist.

Part of the problem is that the underlying principles of facial affect recognition are themselves dominated by a mere six or seven ‘core’ universal expressions associated with the Facial Action Coding System (FACS).

Facial Action Units, divided between different parts of the face, codified by the 1970s-born Facial Action Coding System (FACS), at work in AffectNet. Source: https://arxiv.org/pdf/2103.15792.pdf

The nascent study and classification of facial expressions is not evolving in service of human synthesis, but is motivated by the psychological and security research sectors, which have their own agendas and their own timetables.

Arguably, the current conceptual state-of-the-art in the study of human facial expressions is nearer to the stage that psychiatry and psychology had reached sometime in the 1930s – the future is bright, but distant.

Face Value

As it stands, whether a director wishes to create or amend an expression in a purely generative video (and here the advent of Sora has convinced some industry luminaries that generative video technologies are encroaching on traditional film and TV production at lightning speed), or to use neural methods to fine-tune an actor’s performance months after the wrap, six available expressions are clearly not going to cut it.

What we’re currently witnessing, instead, is the use of AI techniques to change recorded faces into a kind of ‘neural clay’, so that by pushing around sections of a model’s latent space, it’s possible to raise an eyebrow, furrow a brow, dial down a smile, or otherwise tweak the expressions that were recorded on the day.

Some examples of expression editing from the recent ChatFace offering. Source: https://arxiv.org/pdf/2305.14742.pdf

This kind of plasticity is useful but artisanal in nature – no more automated than any of the other CGI-based techniques that have been developed over the last 30-40 years, in that a professional or end-user is required to play around with possibilities until the face ‘looks right’, rather than choosing and applying an apposite expression from a general toolkit.

So this kind of neural plasticity is a useful tool, but it does nothing to advance the limited semantic understanding that currently prevents us from imposing a wider range of more subtle emotions – i.e., from defining any required emotion in terms of facial expression, and imposing it neurally.

In this sense, both FACS and the current methodologies of Facial Expression Recognition (FER) datasets are arguably the choke-points; if a system settles on just six core expressions, the multitude of other, more subtle possible facial expressions will end up shunted into some ‘ghetto’ at the periphery of one of these severely limited categories.

Worse, from the point of view of developing better systems, there are no upstream libraries that can bring a wider gamut of expressiveness into new systems – because the dominant systems (FACS, etc.) tend to define the available gamut, finer-grained facial expressions are hard to either capture or synthesize.

A further obstacle to improving on the current state-of-the-art is the difficulty in capturing expressions which are meaningful and interpretable to people, yet which are outside of the standard intent and accepted semantics of facial expressions.

Partly this is because identity and facial emotion are very entwined; the lineaments of a person’s neutral face may accord with a known emotion, which can confuse an FER system, for instance.

Secondly, where does a smile (as a neutral concept) end and a particular identity begin? This is an almost impossible separation: the medium is also the message, and facial topology and expressed emotion are difficult to prise apart, which makes it hard to obtain ‘neutral’ domain models for a broad range of expressions.

SMIRK

One recent collaboration between Greece and Germany, however, has at least apparently advanced the state-of-the-art in being able to distinguish ‘outlier’ expressions. Titled Spatial Modeling for Image-based Reconstruction of Kinesics (aka the tortuously shoe-horned acronym SMIRK), the system uses a neural rendering module in concert with extracted 3D facial meshes (via the 3DMM-style FLAME framework) to separate topology from secondary aspects, such as albedo, which are less explanatory of emotion than facial identity.

Examples of mesh extrapolation via the improved processes of SMIRK. Source: https://arxiv.org/pdf/2404.04104.pdf

By focusing solely on the inferred geometry, the authors claim, it is easier to isolate facial manipulations which may indicate non-mainstream expressions (such as a ‘smirk’, which has no place in the FACS canon, for instance).

Additionally, this method allows the researchers to generate versions of the input source identity that have different expressions than the one depicted in the original image, and these amended expressions can be used as further input to generalize the reconstruction model.

SMIRK's expression perturbation pipeline aids overall reconstruction, facilitating a better ability to extract unusual expressions.

The authors state:

‘Our extensive experimental results show that SMIRK outperforms previous methods and can faithfully reconstruct expressive 3D faces, including challenging complex expressions such as asymmetries, and subtle expressions such as smirking.’

The paper is titled 3D Facial Expressions through Analysis-by-Neural-Synthesis, and comes from seven researchers across Greece’s Institute of Robotics at the Athena Research Center, the National Technical University of Athens, and the Institute of Computer Science at the Foundation for Research and Technology – Hellas (FORTH), as well as the Max Planck Institute for Intelligent Systems in Tübingen, Germany.

Method

The new system is inspired by a strand of recent projects that use facial reconstruction methodologies, including the pivotal EMOCA project, also from the Max Planck Institute; SPECTRE, a prior work featuring many of the same authors as for the new paper; and, among others, the Chinese offering 3D Face Reconstruction Using A Spectral-Based Graph Convolution Encoder.

The through-line of most of the prior works informing SMIRK is the use of CGI-based parametric models – default heads and/or bodies in base canonical positions and configurations, which are then conformed to features extracted from real images of people.

One of the oldest technologies still actively used in current neural synthesis development, 3DMM (and later off-shoots) projects real characteristics onto a CGI head. Source: https://arxiv.org/pdf/1909.01815.pdf

Once a connection has been made between the keypoints on the parametric head and the landmarks and other tokens obtained from the original images, the user has an optimal degree of instrumentality in controlling several key neural processes.

The difference with SMIRK, the authors claim, is that the new image-to-image method employed in the system bridges the domain gap between the real source input images and the synthesized output.

A domain gap is common when synthetic imagery is used, since the quality of synthesized imagery in training datasets is not usually of the same standard as the novel data that the system will be asked to create at inference time. By creating an interstitial altered image based directly on the individual source image in each case, ground truth can be obtained on a per-instance basis, improving accuracy.

In the initial reconstruction pass, the source image is passed to an encoder that regresses camera parameters and other data for the FLAME CGI model. A 3D model is thus conformed to the source image and rendered with a differentiable rasterizer, before being reconstructed with the image-to-image translation network. After this, self-supervised photometric, landmark and perceptual losses are estimated.

The researchers state*:

‘SMIRK contributes with a novel neural rendering module that bridges the domain gap between the input and the synthesized output. By minimizing this discrepancy, SMIRK enables a stronger supervision signal within an analysis-by-synthesis framework.

‘Notably, this means that neural-network based losses such as perceptual, identity, or emotion can be used to compare the reconstructed and input images without the typical domain-gap problem that is present in most works.’
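As a rough illustration of the pass described above, the sketch below strings the stages together in PyTorch-style code. The module names, the masking helper and the loss shown are placeholders standing in for the paper's components, not the authors' actual API.

```python
# A minimal sketch of an analysis-by-synthesis pass of this kind, assuming
# hypothetical encoder/decoder/rasterizer/translator modules supplied by the caller.
import torch

def reconstruction_pass(image, landmarks, encoder, flame_decoder,
                        rasterize, translator, mask_face_region):
    # 1) Regress FLAME shape/expression/pose and camera from the source image
    params = encoder(image)                            # dict of parameter tensors

    # 2) Decode a 3D mesh from the FLAME parameters
    vertices = flame_decoder(params)                   # e.g. (B, 5023, 3)

    # 3) Render the mesh with a differentiable rasterizer
    rendered_geometry = rasterize(vertices, params)    # (B, 3, H, W)

    # 4) Reconstruct the face with the image-to-image translator, conditioned on
    #    the rendered geometry plus sparse pixels from the masked source image
    masked_input = mask_face_region(image, landmarks)
    reconstruction = translator(torch.cat([rendered_geometry, masked_input], dim=1))

    # 5) Losses compare reconstruction and input directly, so both sit in the
    #    same photoreal domain (no synthetic-vs-real gap)
    photometric = (reconstruction - image).abs().mean()
    return reconstruction, photometric
```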

The FLAME model used by SMIRK appears in a number of synthesis projects that we have covered, including Gaussian avatar deepfakes, an improvement on the Controlnet Stable Diffusion ancillary framework, and the ManVatar NeRF-based avatar framework, among others.

The generated mesh has 5023 vertices, which is considered modest in facial CGI modeling, and is capable of eye closure and jaw rotation.

The encoder used for processing through to the expression variation is a deep neural network that regresses the FLAME parameters via three distinct branches, each of which is built on a MobileNet V3 backbone.
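A minimal sketch of how such a three-branch encoder might be assembled with torchvision's MobileNetV3-Small follows; the parameter dimensions are illustrative placeholders rather than the paper's exact values.

```python
# A hedged sketch of a three-branch FLAME-parameter encoder, not the authors' code.
import torch.nn as nn
from torchvision.models import mobilenet_v3_small

class SmirkStyleEncoder(nn.Module):
    """Regresses FLAME parameters via three independent MobileNetV3 branches."""
    def __init__(self, shape_dim=300, expr_dim=50, pose_cam_dim=9):
        super().__init__()
        self.shape_branch = mobilenet_v3_small(weights=None, num_classes=shape_dim)
        self.expression_branch = mobilenet_v3_small(weights=None, num_classes=expr_dim)
        self.pose_branch = mobilenet_v3_small(weights=None, num_classes=pose_cam_dim)

    def forward(self, image):                                # image: (B, 3, H, W)
        return {
            "shape": self.shape_branch(image),               # identity geometry
            "expression": self.expression_branch(image),     # expression
            "pose_cam": self.pose_branch(image),             # jaw/head pose + camera
        }
```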

The neural renderer component is intended to replace the conventional graphics-based renderers of similar systems with an image-to-image model. During the translation process, random pixels are blocked out from the generated face images as the system iterates, so that the ultimate representation does not overfit to the source image, but relies instead on the mesh.

Random dropout of parts of the processing image is a common technique during training, designed to help the model generalize and not obsess about being faithful to the exact configuration of the original photo.

The masking of the image itself is accomplished, effectively, by joining together the facial landmark dots that are inferred from the source image. The system used for this is the Face Alignment Network (FAN) library that also powers pose and topology recognition for the 2017-era deepfake packages DeepFaceLab, DeepFaceLive, and FaceSwap.
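The sketch below illustrates the general idea, under the assumption that the landmark points are joined via a convex hull; the exact masking procedure and the retention ratio are assumptions for illustration, not the authors' specification.

```python
# Build a face-region mask from landmarks, then keep only a sparse random subset
# of pixels inside it so the translator cannot simply copy the input image.
import numpy as np
import cv2

def masked_face_input(image, landmarks, keep_ratio=0.05, rng=None):
    """image: (H, W, 3) uint8; landmarks: (N, 2) pixel coordinates."""
    rng = rng or np.random.default_rng()
    mask = np.zeros(image.shape[:2], dtype=np.uint8)
    hull = cv2.convexHull(landmarks.astype(np.int32))
    cv2.fillConvexPoly(mask, hull, 1)                  # 1 inside the face region

    # Randomly keep roughly keep_ratio of the face pixels; zero out the rest
    keep = rng.random(mask.shape) < keep_ratio
    sparse_mask = mask.astype(bool) & keep
    out = np.zeros_like(image)
    out[sparse_mask] = image[sparse_mask]
    return out
```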

As for the reconstruction loss functions that ensure the system improves with training, those used include photometric loss, a simple L1 loss that compares the output to the source image; and VGG loss, a perceptual loss (computed with a VGG feature encoder) that helps the model converge faster in the earlier phases of training.

Additional loss functions are landmark loss, an L2 comparison between estimated landmarks in source and generated images; expression regularization, an L2 loss that penalizes extreme or unlikely expressions; and emotion loss, which derives a loss based on the aforementioned EMOCA FER network.

In this regard, the authors state:

‘To prevent the image translator from adversarially optimizing the emotion loss by perturbing a few pixels, for this loss we keep the image translator T “frozen”, optimizing only the expression encoder Eψ . Note that unlike EMOCA, our framework ensures that the emotion loss does not suffer from domain gap problems, as the compared images reside in the same space.’
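Assembled naively, the combined objective might look something like the sketch below; the loss weights and the emotion network are placeholders rather than the paper's values, and the detail of keeping the translator frozen when back-propagating the emotion term (per the quote above) is omitted for brevity.

```python
# An illustrative combination of the reconstruction losses listed above, assuming
# PyTorch and a pretrained VGG feature extractor from torchvision.
import torch.nn.functional as F
from torchvision.models import vgg16

vgg_features = vgg16(weights="IMAGENET1K_V1").features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

def total_loss(recon, target, lmk_pred, lmk_gt, expr_params, emotion_net,
               w_photo=1.0, w_vgg=0.1, w_lmk=1.0, w_reg=1e-3, w_emo=0.5):
    photometric = F.l1_loss(recon, target)                               # L1 photometric loss
    perceptual = F.l1_loss(vgg_features(recon), vgg_features(target))    # VGG perceptual loss
    landmark = F.mse_loss(lmk_pred, lmk_gt)                              # L2 landmark loss
    expr_reg = expr_params.pow(2).mean()                                 # penalize extreme expressions
    emotion = F.l1_loss(emotion_net(recon), emotion_net(target))         # emotion-feature loss
    return (w_photo * photometric + w_vgg * perceptual + w_lmk * landmark
            + w_reg * expr_reg + w_emo * emotion)
```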

However, the improved supervision signal obtained by these methods remains at the mercy of the poor diversity in the underlying training datasets, which, the authors note, is also a choke-point for all previous analogous systems.

The authors illustrate the issue:

‘This means for example that if a more complex lip structure, scarcely seen in the training data, cannot be reproduced fast enough by the encoder, the translator T could learn to correlate miss-aligned lip 3D structures and images and thus multiple similar, but distinct, facial expressions will be collapsed to a single reconstructed representation.

‘Further, this may lead to the translator compensating for the encoder’s failures during the joint optimization.’

This issue is addressed in the new system by what the authors term an augmented expression cycle consistency path.

Permutations, including an attempt at evaluating a 'neutral' expression, using the augmented expression cycle consistency path part of the SMIRK workflow.

Here the original expressions are replaced with new variations, and photorealistic images generated, as seen in the image above. Thus novel training pairs are generated dynamically.

This cycle consistency loss is therefore implemented directly in the expression parameter space of the FLAME model, which allows the predicted expression to resemble the captured one far more closely than in similar systems.

This approach stops over-compensation errors (‘bad guesses’ on the part of the internal mechanisms iterating through the data), and encourages a broader range of facial expressions.

The various methods used in this module, seen in the image above, are permutation, perturbation, template injection, and zero expression (neutral expressions).

Though the others are reasonably self-explanatory, template injection involves feeding the FLAME facial CGI model with parameters obtained from the FaMoS dataset, which features multiple subjects performing asymmetric and extreme expressions.

Examples from the FaMoS dataset, which features atypical expressions. Source: https://tempeh.is.tue.mpg.de/#dataset
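The sketch below illustrates how the four augmentation modes and the expression-space cycle check might be implemented; the function names, noise scale, and the assumption that FaMoS expressions are available as a pre-extracted tensor of FLAME parameters are all illustrative rather than drawn from the authors' code.

```python
# Hedged sketch of expression augmentation plus cycle consistency in FLAME space.
import torch

def augment_expression(expr, famos_expressions, mode, noise_std=0.3):
    """expr: (B, D) batch of FLAME expression parameters."""
    if mode == "permutation":                 # shuffle expressions across the batch
        return expr[torch.randperm(expr.shape[0])]
    if mode == "perturbation":                # add random noise to the expression
        return expr + noise_std * torch.randn_like(expr)
    if mode == "template_injection":          # inject extreme/asymmetric FaMoS expressions
        idx = torch.randint(0, famos_expressions.shape[0], (expr.shape[0],))
        return famos_expressions[idx]
    if mode == "zero":                        # neutral (zero) expression
        return torch.zeros_like(expr)
    raise ValueError(f"unknown mode: {mode}")

def expression_cycle_loss(expr_aug, synthesized_image, encoder):
    # Re-encode the synthesized image and require the predicted expression
    # to match the expression parameters that generated it
    expr_recovered = encoder(synthesized_image)["expression"]
    return (expr_recovered - expr_aug).abs().mean()
```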

Two losses are used during this phase – the self-explanatory identity consistency and expression consistency. The system alternates between these various passes, alternately freezing the encoder and the translator, so that the resulting randomization aids flexibility.
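A minimal sketch of such an alternating schedule is shown below; it illustrates the scheduling idea only, not the authors' exact training loop.

```python
# Alternate which module learns: freeze the translator while the encoder updates,
# then reverse the roles on the next step.
def set_trainable(module, flag):
    for p in module.parameters():
        p.requires_grad_(flag)

def training_step(step, encoder, translator, enc_opt, trans_opt, compute_loss):
    if step % 2 == 0:
        set_trainable(encoder, True); set_trainable(translator, False)
        opt = enc_opt
    else:
        set_trainable(encoder, False); set_trainable(translator, True)
        opt = trans_opt
    loss = compute_loss()          # runs the reconstruction / cycle passes
    opt.zero_grad()
    loss.backward()
    opt.step()
    return loss.item()
```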

Data and Tests

The researchers conducted quantitative and qualitative tests for SMIRK. Where video examples are indicated, some of these can be seen in the official accompanying video for the system, embedded at the end of this article.

Datasets used in the tests were FFHQ, CelebA, LRS3, and MEAD. Rival systems were DECA and EMOCA V2, both also reliant on the FLAME model, together with Deep3DFace and FOCUS, which instead use the BFM model.

The three encoders used for the tests were all pre-trained and supervised by two losses: a landmark loss, for pose and expression, and a shape loss based on predictions from the MICA framework.

(Though it is not referred to otherwise in the paper, the results – see further below – feature MGCNet as well.)

Examples from the MICA framework, which contributes to losses used in tests. Source: https://arxiv.org/pdf/2204.06607.pdf

The researchers point to frequent assertions in the literature that evaluating facial expression reconstruction is in itself an ill-posed problem*:

‘The geometric errors tend to be dominated by the identity face shape and do not correlate well with human perception of facial expressions. Accordingly, we compare our method in a quantitative manner with three experiments: 1) emotion recognition accuracy, 2) ability of a model to guide a UNet to faithfully reconstruct an input image, and 3) a perceptual user study.’

In accordance with the EMOCA protocol, the authors trained a Multi-Layer Perceptron (MLP) to classify eight basic expressions, and to run regression on valence and arousal values with the use of AffectNet.
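Such a probe might look like the sketch below, under the assumption that it operates on the regressed expression parameters; the layer sizes and input dimension are illustrative, not taken from the paper.

```python
# A hedged sketch of an evaluation MLP with an 8-way expression head and a
# valence/arousal regression head.
import torch.nn as nn

class EmotionProbe(nn.Module):
    def __init__(self, in_dim=50, hidden=256, num_classes=8):
        super().__init__()
        self.trunk = nn.Sequential(nn.Linear(in_dim, hidden), nn.ReLU(),
                                   nn.Linear(hidden, hidden), nn.ReLU())
        self.class_head = nn.Linear(hidden, num_classes)   # 8 basic expressions
        self.va_head = nn.Linear(hidden, 2)                # valence, arousal

    def forward(self, expr_params):
        h = self.trunk(expr_params)
        return self.class_head(h), self.va_head(h)
```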

Metrics reported for the emotion recognition test were Concordance Correlation Coefficient (CCC), Root Mean Squared Error (RMSE) for arousal and valence, and expression classification accuracy (E-ACC).
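For reference, these metrics can be computed as in the sketch below, using the standard definitions and assuming NumPy arrays of predictions and ground-truth values.

```python
# Concordance Correlation Coefficient (Lin), RMSE, and classification accuracy.
import numpy as np

def ccc(x, y):
    mx, my = x.mean(), y.mean()
    vx, vy = x.var(), y.var()
    cov = ((x - mx) * (y - my)).mean()
    return 2 * cov / (vx + vy + (mx - my) ** 2)

def rmse(x, y):
    return np.sqrt(np.mean((x - y) ** 2))

def expression_accuracy(pred_labels, true_labels):
    return np.mean(pred_labels == true_labels)
```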

Emotion recognition performance on AffectNet.

The authors report:

‘[SMIRK] achieves a higher emotion recognition score compared to most other methods, although falling behind EMOCAv1/2 and Deep3DFace. It is worth noting that, although EMOCA v1 achieves the highest emotion accuracy, it often overexaggerates expressions which helps with emotion recognition.

‘EMOCA v2, arguably a more accurate reconstruction model, performs slightly worse. Our main model is comparable with Deep3DFace and outperforms DECA and FOCUS. We can also train a model that scores better on emotion recognition, by increasing the emotion loss weight.’

Though it is possible to train a model that would score better for emotion recognition, the authors observe that previous work for EMOCA has indicated this leads to increased artifacts.

Next, the researchers tested for reconstruction loss. For this, a Unet image-to-image translator was trained with the encoder frozen, thus training only the translation functionality. The authors explain:

‘[If] the 3D mesh is accurate enough, the reconstruction will be more faithful, due to a one-to-one appearance correspondence. For each method (including ours for fairness), we train a UNet for 5 epochs, using the masked image and the rendered 3D geometry as input.’

The L1 reconstruction loss and the VGG loss were reported in these tests, which were powered by the AffectNet database.  
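The sketch below outlines that protocol, with 'unet' standing in for any standard UNet implementation; it is an illustration of the evaluation idea under assumed module interfaces, rather than the authors' code.

```python
# Freeze the encoder, train a UNet briefly to reproduce the input from the masked
# image plus the rendered geometry, and report the reconstruction error.
import torch
import torch.nn.functional as F

def evaluate_reconstruction(encoder, unet, flame_decoder, rasterize,
                            dataloader, optimizer, epochs=5):
    encoder.eval()                               # the expression encoder stays frozen
    for _ in range(epochs):
        for image, masked_image in dataloader:
            with torch.no_grad():
                params = encoder(image)
                geometry = rasterize(flame_decoder(params), params)
            recon = unet(torch.cat([masked_image, geometry], dim=1))
            loss = F.l1_loss(recon, image)       # L1 error; a VGG term can be reported too
            optimizer.zero_grad()
            loss.backward()
            optimizer.step()
    return loss.item()
```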

Results for image reconstruction performance.

Here the authors comment:

‘We observe here that using the information for the rendered shape geometry of SMIRK, the trained UNet achieves a more faithful reconstruction of the input image when compared to DECA and EMOCAv2.

‘Particularly for EMOCAv2, we observe that although it can capture expressions, the results in many cases do not faithfully represent the input image, leading to an overall worse image reconstruction error.

‘In terms of L1 loss, SMIRK is on par with Deep3DFace and FOCUS and has a small improvement in terms of VGG loss.’

For the user study, 80 images each from AffectNet and MEAD were used to perform 3D face reconstruction with SMIRK and the prior frameworks. The 85 participants were shown a face image together with two 3D reconstructions from competing methods, and were asked to choose which better represented the facial expression.

Results from the user study.

Of the user study results, the researchers comment:

‘[Our] method was significantly preferred over all competitors, confirming the performance of SMIRK in terms of faithful expressive 3D reconstruction.

‘The results were statistically significant (for all pairs, p < 0.01 with binomial test, adjusted using the Bonferroni method). EMOCAv2, which also uses an emotion loss for expressive 3D reconstruction, was the closest competitor to our method, followed by FOCUS and Deep3D, while DECA was the least selected.’
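As an illustration of the kind of significance test reported, the sketch below runs a two-sided binomial test per pairwise comparison with a Bonferroni correction; the vote counts are invented purely for demonstration.

```python
# Pairwise preference significance: binomial test against chance (p=0.5),
# Bonferroni-adjusted across the number of comparisons. Counts are hypothetical.
from scipy.stats import binomtest

comparisons = {                      # hypothetical (wins_for_smirk, total_votes)
    "vs_DECA": (900, 1000),
    "vs_EMOCAv2": (700, 1000),
    "vs_FOCUS": (800, 1000),
    "vs_Deep3DFace": (820, 1000),
}

n_tests = len(comparisons)
for name, (wins, total) in comparisons.items():
    p = binomtest(wins, total, p=0.5, alternative="two-sided").pvalue
    p_adjusted = min(1.0, p * n_tests)             # Bonferroni correction
    print(f"{name}: raw p={p:.2e}, Bonferroni-adjusted p={p_adjusted:.2e}")
```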

Lastly (as we do not cover ablation studies except in extraordinary circumstances – please refer to the source paper for these), the authors produced general qualitative examples, both in the form of static images for the paper, and in video form (refer to the embedded video at the end of the article).

Visual comparisons. For some reason the paper has not included text legends for the compared frameworks – from left to right, they are Deep3DFaceRecon, FOCUS, DECA, EMOCA V2 and, finally, SMIRK. Please refer to the source paper for better resolution.

Here the authors comment:

‘[Our] method can more accurately capture the facial expressions across multiple diverse subjects and conditions.

‘Furthermore, the presented methodology can also capture expressions that other methods fail to capture, such as non-symmetric mouth movements, eye closures, and exaggerated expressions.’

Conclusion

Any new innovation that advances the state-of-the-art in fine-grained expression recognition and synthesis is welcome, and SMIRK certainly qualifies here. However, it is arguably only supplying better recognition methods (in some cases) for later and better semantic architectures around FER than the ones we must currently contend with.

For the moment, a great deal of the challenge that SMIRK and previous systems are addressing is organizational, even philosophical, rather than technical in nature.

* My substitution of hyperlinks for the authors’ inline citations.
