Improving Facial Expression Synthesis Through GAN-Based Frontalization

Though we have written quite extensively about the difficulties of generating extreme angles of people in facial synthesis systems when that type of data is lacking in the dataset (it usually is), there is a strand of research in computer vision which scopes itself in the opposite direction: frontalization is a technique that attempts to produce passport-style, ‘straight-on’ images from source material where the subject is not directly facing the camera.

Frontalization attempts to depict subjects from a straightforward POV, based on elliptical views. Source: https://core.ac.uk/download/pdf/199220278.pdf

Known also as Frontal View Synthesis (FVS), this is a relatively well-funded pursuit, since it largely benefits security operations – for instance, where the authorities may have an in-the-wild oblique image of someone taken from a security camera, which is not ideal for witness verification, and would be better served by the kind of simple mug-shot normally only ever available from ATM cameras (where the customer is forced into a frontal pose by the design of the machine).

The trouble with trying to make guesses in either direction (oblique>passport, passport>oblique) is that whatever facial expression is depicted in the source image is likely to be coming along for the ride when the neural conversion takes place.

In details from the above-cited project, we see that the system in question cannot 'neutralize' or alter the expression in the source image.

The field of Facial Expression Recognition (FER), stuck as it currently is with the very limited number of facial expressions available in the dominant paradigm, the Facial Action Coding System (FACS), is already challenged to recognize expressions in a passport-style canonical view, i.e., a mugshot.

Every degree of facial pose-angle away from that mugshot-pose effectively becomes a domain of its own, which means that for recognition and synthesis systems dealing with arbitrary and extreme facial poses, each expression represents, effectively, a type of sub-identity that must be contended with:

Relating extreme facial poses to default or canonical reference emotions. Source: https://www.sciencedirect.com/science/article/pii/S1110016823000327

Any work that advances the state of the art in facial expression inference at extreme angles, and especially work that provides improved ways of either frontalizing or de-frontalizing source images, is going to be of interest to the growing neural VFX community.

eMotion-GAN

One such paper has just emerged from France, which uses a motion-based Generative Adversarial Network (GAN) to map expressions from extreme angles back down to a canonical mugshot pose.

Large variations for frontal view synthesis are handled effectively in the new system. Source: https://arxiv.org/pdf/2404.09940.pdf

In terms of subsequent categorization, where effective, this technique can help to associate recognized emotions in a mugshot viewpoint with the same emotion from an oblique angle, building up a cohesive domain.

Projecting this line of research into the future, this kind of approach could help film and TV directors to tweak facial expressions in post-production in a semantic and systematic way, instead of an artisanal way where latent codes are pushed around until the face ‘looks right’.

Additionally, it would facilitate workflows where more agile and flexible expression recognition was possible, making datasets more searchable for apposite material for any particular shot or model training.

The paper states:

‘Considering the motion induced by head variation as noise and the motion induced by facial expression as the relevant information, our model is trained to filter out the noisy motion in order to retain only the motion related to facial expression.

‘The filtered motion is then mapped onto a neutral frontal face to generate the corresponding expressive frontal face. We conducted extensive evaluations using several widely recognized dynamic FER datasets, which encompass sequences exhibiting various degrees of head pose variations in both intensity and orientation.

‘Our results demonstrate the effectiveness of our approach in significantly reducing the FER performance gap between frontal and non-frontal faces. Specifically, we achieved a FER improvement of up to +5% for small pose variations and up to +20% improvement for larger pose variations.’

Notable in the new system is that it does not rely on facial landmarks as an anchor reference for processing (which can smuggle in unwanted ID data), which is uncommon in similar literature; and that it can also transfer expressions between subjects (and, the authors claim, to a variety of other face categories including animals and drawings, though this is not extensively dealt with in the new work).

The new paper is titled eMotion-GAN: A Motion-based GAN for Photorealistic and Facial Expression Preserving Frontal View Synthesis, and comes (unusually) with a full code release. The four IEEE members contributing to the work come from IMT Nord Europe, and the Centre de Recherche en Informatique Signal et Automatique de Lille, both in Lille, France.

Method

The method essentially treats head pose variation as part of a motion domain, and centers on extracting facial motion while disregarding head pose motion. The authors contend that this approach allows for the extraction of generic (rather than ID-locked) expression data, which can be used across a variety of individuals.

(However, it should be noted that the 6-8 popular facial expressions in FACS remain in themselves a limitation of any system which adopts FACS – but we have to start somewhere!)

From the new paper, examples of how eMotion-GAN estimates motion from variation in head poses, and retains only motion from the facial musculature, transposing the result to a synthesized image in frontal plane view

The process consists of two primary phases: motion frontalization, and motion warping:

Schema for eMotion-GAN.

In the first phase, visualized in the schema above, motion frontalization takes optical flow data and filters out the motion related to the head pose, transposing the motion produced by the facial musculature into a frontal view. The expression discriminator works in concert with various reconstruction loss functions.

In the second phase, motion warping, the frontalized motion is applied to a neutral frontal face to generate the desired expression, with a pre-trained classifier handling expression recognition.

In the first phase, given two consecutive face images, the face is detected and cropped before optical flow is calculated. High-resolution images are used to preserve identity information, though they are eventually resized to the far lower resolutions that the hardware training the model can accommodate.
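As a rough illustration of this preprocessing step, the sketch below detects and crops the face in a pair of consecutive frames, and only then downsizes the crops to 128x128px. OpenCV's bundled Haar cascade detector is used purely as a stand-in; the paper's exact detection pipeline is not reproduced here.

```python
import cv2

# Stand-in face detector (the paper's own detection method may differ).
detector = cv2.CascadeClassifier(
    cv2.data.haarcascades + "haarcascade_frontalface_default.xml")

def crop_face_pair(frame_a, frame_b, out_size=(128, 128)):
    """Detect the face in the first frame, crop both frames with the same
    box (so the motion between them is preserved), then downscale last."""
    gray = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    boxes = detector.detectMultiScale(gray, scaleFactor=1.1, minNeighbors=5)
    if len(boxes) == 0:
        raise ValueError("no face found")
    x, y, w, h = max(boxes, key=lambda b: b[2] * b[3])  # keep largest detection
    crop_a = frame_a[y:y + h, x:x + w]
    crop_b = frame_b[y:y + h, x:x + w]
    # High-resolution crops are only reduced at the final step, as described above.
    return cv2.resize(crop_a, out_size), cv2.resize(crop_b, out_size)
```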

The architecture at work in the first module begins with a flow generator comprising an encoder/decoder network, with the input and output set to 128x128px.
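A minimal sketch of such an encoder/decoder follows, assuming a two-channel (horizontal/vertical) optical flow field as both input and output at 128x128px; the layer counts and channel widths are illustrative rather than the paper's own.

```python
import torch
import torch.nn as nn

def down(c_in, c_out):
    # Halve spatial resolution at each step.
    return nn.Sequential(nn.Conv2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.LeakyReLU(0.2))

def up(c_in, c_out):
    # Double spatial resolution at each step.
    return nn.Sequential(nn.ConvTranspose2d(c_in, c_out, 4, stride=2, padding=1),
                         nn.BatchNorm2d(c_out), nn.ReLU())

class FlowGenerator(nn.Module):
    def __init__(self):
        super().__init__()
        self.encoder = nn.Sequential(down(2, 64), down(64, 128),
                                     down(128, 256), down(256, 512))      # 128 -> 8
        self.decoder = nn.Sequential(up(512, 256), up(256, 128), up(128, 64),
                                     nn.ConvTranspose2d(64, 2, 4, 2, 1))  # 8 -> 128
    def forward(self, flow):                      # flow: (B, 2, 128, 128)
        return self.decoder(self.encoder(flow))   # frontalized flow, same shape

frontal_flow = FlowGenerator()(torch.randn(1, 2, 128, 128))
```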

The subsequent PatchGAN discriminator is inspired by the Pix2Pix GAN discriminator, a Convolutional Neural Network (CNN) which judges whether each patch of the generated output is real or fake, building up a comprehensive prediction across the whole image.
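The sketch below shows the general shape of a Pix2Pix-style PatchGAN discriminator, here configured to take a two-channel flow field; the channel widths follow the common 64-128-256-512 convention, not necessarily the paper's configuration.

```python
import torch
import torch.nn as nn

class PatchDiscriminator(nn.Module):
    """Emits a grid of real/fake logits, one per receptive-field patch,
    rather than a single global score."""
    def __init__(self, in_channels=2):
        super().__init__()
        def block(c_in, c_out, stride):
            return nn.Sequential(
                nn.Conv2d(c_in, c_out, 4, stride=stride, padding=1),
                nn.LeakyReLU(0.2))
        self.net = nn.Sequential(
            block(in_channels, 64, 2),
            block(64, 128, 2),
            block(128, 256, 2),
            block(256, 512, 1),
            nn.Conv2d(512, 1, 4, stride=1, padding=1))  # one logit per patch
    def forward(self, x):
        return self.net(x)

patch_logits = PatchDiscriminator()(torch.randn(1, 2, 128, 128))
print(patch_logits.shape)  # torch.Size([1, 1, 14, 14]) -- a grid of patch scores
```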

Next, the expression discriminator serves as an additional CNN-based classifier for FER purposes, using optical flow data as input, and then encoding the facial motion of the identity under study.
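A compact, hypothetical version of such a flow-based expression classifier might look like the following, with the seven-class output reflecting the expression set used in SNaP-2DFe; the layer layout is purely illustrative.

```python
import torch
import torch.nn as nn

class FlowExpressionClassifier(nn.Module):
    """A small CNN that classifies facial expression from a 2-channel
    optical flow field rather than from pixels."""
    def __init__(self, num_classes=7):
        super().__init__()
        self.features = nn.Sequential(
            nn.Conv2d(2, 32, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(32, 64, 3, padding=1), nn.ReLU(), nn.MaxPool2d(2),
            nn.Conv2d(64, 128, 3, padding=1), nn.ReLU(),
            nn.AdaptiveAvgPool2d(1))
        self.head = nn.Linear(128, num_classes)
    def forward(self, flow):                         # (B, 2, 128, 128)
        return self.head(self.features(flow).flatten(1))

logits = FlowExpressionClassifier()(torch.randn(4, 2, 128, 128))  # -> (4, 7)
```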

The motion warping module applies the frontalized facial motion produced by the previous module to a neutral frontal face, using the motion fields that have been extracted up to this point. The expressions subsequently produced are evaluated by an FER detector (denoted as eFER), which ensures fidelity of facial expression preservation.
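The warping model itself is a learned image generator, but the core operation it performs, displacing a neutral frontal face by a motion field, can be sketched with PyTorch's grid_sample. The function below assumes the flow is expressed as pixel displacements, with channel 0 holding horizontal and channel 1 vertical motion.

```python
import torch
import torch.nn.functional as F

def warp_with_flow(image, flow):
    """image: (B, C, H, W); flow: (B, 2, H, W) pixel displacements (dx, dy)."""
    b, _, h, w = image.shape
    # Base sampling grid in normalized [-1, 1] coordinates.
    ys, xs = torch.meshgrid(torch.linspace(-1, 1, h),
                            torch.linspace(-1, 1, w), indexing="ij")
    base = torch.stack((xs, ys), dim=-1).unsqueeze(0).expand(b, -1, -1, -1)
    # Convert pixel displacements into the same normalized coordinate system.
    dx = flow[:, 0] * 2.0 / max(w - 1, 1)
    dy = flow[:, 1] * 2.0 / max(h - 1, 1)
    grid = base + torch.stack((dx, dy), dim=-1)
    return F.grid_sample(image, grid, align_corners=True)

expressive = warp_with_flow(torch.rand(1, 3, 128, 128), torch.zeros(1, 2, 128, 128))
```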

The aforementioned image generator includes a motion encoder, an identity encoder, and a joint decoder for rendering the facial image.

The eFER functionality makes use of the pretrained Chinese DMUE model, which is intended to combat annotation ambiguity. In accordance with the adversarial nature of the generator/discriminator relationship in GAN systems, it also provides blind feedback as to whether the previously submitted face was warped correctly or not (‘blind’ in the sense that it doesn’t tell the generator in what way it failed, only that it did, to ensure that the generator does not overfit to a feedback trend).

Image-based predictions from three batches under DMUE, used in the new system. Source: https://arxiv.org/pdf/2104.00232.pdf

In terms of loss functions within eMotion-GAN, the authors note that GANs often suffer from artifacts, and they have therefore added further penalizing losses to mitigate this. These consist of Mean Absolute Error (MAE), VGG-based perceptual loss, and an image-based facial expression loss.

For the flow generator, End Point Error (EPE) and Charbonnier losses are employed. EPE is an optical flow-specific metric that measures the Euclidean distance between an estimated optical flow and the underlying ground truth; Charbonnier loss is a smooth, robust variant of L1 loss commonly used in optical flow estimation and image restoration.
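Both flow-side penalties are simple enough to sketch directly; the epsilon term in the Charbonnier loss is the usual small constant that keeps the function differentiable at zero.

```python
import torch

def epe_loss(pred_flow, gt_flow):
    # Per-pixel Euclidean distance between predicted and ground-truth flow,
    # computed over the two flow channels, then averaged; tensors: (B, 2, H, W).
    return torch.norm(pred_flow - gt_flow, p=2, dim=1).mean()

def charbonnier_loss(pred, target, eps=1e-3):
    # A smooth, robust relative of L1 loss.
    return torch.sqrt((pred - target) ** 2 + eps ** 2).mean()
```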

Besides these additional losses, the standard GAN losses apply. Sigmoid cross-entropy is used for the initial discriminator, and categorical cross-entropy for the emotion discriminator.
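Below is a hedged sketch of how the image-side penalties mentioned above (MAE and a VGG-based perceptual term) might be combined with the adversarial terms: sigmoid cross-entropy on the patch discriminator's logits, and categorical cross-entropy on the expression logits. The weighting factors are placeholders rather than the paper's values, and the input images are assumed to be three-channel tensors suitable for the VGG feature extractor.

```python
import torch
import torch.nn as nn
import torchvision.models as models

# Frozen VGG19 feature extractor for the perceptual term.
vgg_features = models.vgg19(weights=models.VGG19_Weights.DEFAULT).features[:16].eval()
for p in vgg_features.parameters():
    p.requires_grad_(False)

l1 = nn.L1Loss()                  # MAE
adv = nn.BCEWithLogitsLoss()      # sigmoid cross-entropy on real/fake patch logits
emo = nn.CrossEntropyLoss()       # categorical cross-entropy on expression logits

def generator_loss(fake_img, real_img, patch_logits, expr_logits, expr_labels,
                   w_l1=10.0, w_vgg=1.0, w_adv=1.0, w_emo=1.0):
    perceptual = l1(vgg_features(fake_img), vgg_features(real_img))
    adversarial = adv(patch_logits, torch.ones_like(patch_logits))  # try to fool the discriminator
    return (w_l1 * l1(fake_img, real_img) + w_vgg * perceptual
            + w_adv * adversarial + w_emo * emo(expr_logits, expr_labels))
```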

Data and Tests

To test the system, experiments were conducted across various datasets containing a wide variety of facial expressions, both from the same individuals and from others. Paired images are necessary for these purposes, whether generated expressly for the task or found ready-made in a dataset.

The researchers based the majority of experiments on the SNaP-2DFe dataset, the sole open source dataset featuring paired images of facial expressions acquired simultaneously from mugshot and oblique viewpoints.

The paper states:

‘Simultaneous Natural and Posed 2D Facial expressions dataset [(SNaP-2DFe)] is an image dataset that meets the requirements requested for the motion frontalization phase. It contains paired images of frontal and non-frontal faces for different subjects and facial expressions.

‘More specifically, it provides images of subjects with expressive faces acquired simultaneously with head pose variation (unconstrained recording) and without head pose variation (constrained recording) from 15 different subjects with 6 categories of head poses (yaw, pitch, roll, diagonal, nothing and Tx) and 7 different facial expressions most commonly used in the literature (disgust, happiness, anger, surprise, neutral, fear and sadness).’

Samples from the SNaP-2DFe dataset. Source*: https://archive.ph/P39WF

For training, common FER datasets were employed: CK+; ADFES; MMI; and Oulu-CASIA.

Examples from the MMI dataset, used in tests for eMotion-GAN. Source: https://ibug.doc.ic.ac.uk/research/mmi-database/

The researchers state that these latter datasets were included primarily to bolster the system’s ability to derive facial motion patterns. For those datasets that did not include paired mugshot/oblique couplets, the authors generated apposite pair-matches.

To perform meaningful tests, the authors needed to compare results from an FER classifier in both the motion and image domains, with no frontalization applied, and with frontalization.

Facial motion was extracted with the Farneback method, which estimates dense optical flow from grayscale imagery, and which the authors report is best-suited for FER.
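Farneback dense optical flow is exposed directly by OpenCV; the parameter values below are the commonly used defaults from the OpenCV documentation, not necessarily those used by the authors.

```python
import cv2

def farneback_flow(frame_a, frame_b):
    """Return a (H, W, 2) per-pixel motion field plus its polar decomposition."""
    gray_a = cv2.cvtColor(frame_a, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(frame_b, cv2.COLOR_BGR2GRAY)
    # Args: prev, next, flow, pyr_scale, levels, winsize, iterations,
    #       poly_n, poly_sigma, flags
    flow = cv2.calcOpticalFlowFarneback(gray_a, gray_b, None,
                                        0.5, 3, 15, 3, 5, 1.2, 0)
    magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
    return flow, magnitude, angle
```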

Optical flow directional indicators on the right, using the Farneback method. Source: https://www.diva-portal.org/smash/get/diva2:273847/FULLTEXT01.pdf

For the motion frontalization module, a learning rate of 1×10⁻⁵ was used under the Adam optimizer for the flow generator, with 1×10⁻⁴ used for the patch discriminator, and likewise for the expression discriminator under Stochastic Gradient Descent (SGD). A learning rate schedule for the patch discriminator was adopted from the DCGAN project.

For the motion warping module, Adam was again used, with a learning rate of 1×10⁻⁵ for the image generator.

The model was trained for 15 epochs (an epoch being a complete presentation of the data to the training system) at a batch size of 4, and PyTorch was used for this end-to-end model.
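The optimizer setup described above could be sketched as follows; the nn.Module placeholders merely stand in for the flow generator, patch discriminator and expression discriminator, and the exact pairing of optimizers to sub-networks is condensed from the description given in the paper.

```python
import torch
import torch.nn as nn

# Placeholders for the actual sub-networks (architectures are sketched earlier).
flow_gen   = nn.Conv2d(2, 2, 3, padding=1)
patch_disc = nn.Conv2d(2, 1, 4, stride=2)
expr_disc  = nn.Linear(128, 7)

opt_flow_gen = torch.optim.Adam(flow_gen.parameters(), lr=1e-5)
opt_patch    = torch.optim.Adam(patch_disc.parameters(), lr=1e-4)
opt_expr     = torch.optim.SGD(expr_disc.parameters(), lr=1e-4)

EPOCHS, BATCH_SIZE = 15, 4
# for epoch in range(EPOCHS):
#     for batch in loader:   # paired non-frontal / frontal flow batches
#         ...alternate discriminator and generator updates...
```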

Initial tests were for Frontal View Synthesis, with SIFT flow representing approaches that pre-date deep learning, and Complete Face Recovery GAN (CFR-GAN) and pixel2style2pixel (pSp) representing image-based methods. For optical flow-based methods, Global-Local Universal Network (GLU-Net) was employed.

All experiments were performed under identical settings, with a baseline consisting of face detection and facial cropping, without frontalization. Results were compared in both motion and image domains.

Though the authors describe the initial test as a ‘qualitative comparison’, it does actually seem to qualify more as a quantitative comparison, since it is not the object of a user-study or of casual observation, but rather makes use of three common metrics: Structural Similarity Index (SSIM); Root Mean Squared Error (RMSE); and the aforementioned EPE.
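For reference, the three metrics are straightforward to compute; SSIM is available in scikit-image, while RMSE and EPE can be written directly. The snippet assumes uint8 images and (H, W, 2) flow arrays.

```python
import numpy as np
from skimage.metrics import structural_similarity

def rmse(a, b):
    # Root Mean Squared Error between two images of identical shape.
    return float(np.sqrt(np.mean((a.astype(np.float64) - b.astype(np.float64)) ** 2)))

def epe(flow_pred, flow_gt):
    # Mean per-pixel Euclidean distance between two (H, W, 2) flow fields.
    return float(np.mean(np.linalg.norm(flow_pred - flow_gt, axis=-1)))

def ssim(img_a, img_b):
    # channel_axis=-1 handles colour images (scikit-image >= 0.19); uint8 assumed.
    return structural_similarity(img_a, img_b, channel_axis=-1)
```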

Mean and standard deviations of EPE, RMSE and SSIM across the validation datasets.

Of these results, the authors comment:

‘[The results table] shows the superiority of our method in both motion and image over the other FVS methods.

‘Indeed, a lower EPE indicates that our frontalization model succeeds in frontalizing motion while preserving facial expression patterns, which is mainly due to the fact that our method frontalizes motion rather than image, allowing it to take advantage of the similarity of expression patterns between different subjects, and a higher SSIM indicates that the warping model successfully warps the frontalized motion in the image, which is mainly driven by the perceptual loss included in the warping model that allows it to generate consistent expressive faces while preserving the facial identity of the subject.’

Next came qualitative experiments for image warping, the second module in the core schema, where the diverse FVS methods mentioned earlier were trialed against eMotion-GAN:

Qualitative comparisons against diverse frameworks for image warping.

Here the authors state:

‘In terms of image reconstruction, our method manages to warp the frontalized flow into the neutral face to recover the original facial expression. Unlike other methods, our method does not suffer from artifacts and distortion of expression patterns. Thus, allowing the facial expression to be recovered without loss of identity of the subject.’

FER comparisons were next on the slate. Here the datasets used were ADFES, MMI, CASIA, SNAP-F and SNAP-NF (frontalized and non-frontalized variants from the SNaP-2DFe project), and CK+. Rival systems were Ad-Corre, DAN, HSE and DMUE, set against the authors’ own flow-based classifier, Flow-CNN.

Quantitative comparison against rival methods. Please refer to the source paper for better resolution.

Of these results, the paper states:

‘In [the] image domain, the baseline seems to achieve the best results given that recent FER approaches are trained to analyze frontal and less constrained faces, thanks to augmentation methods.

‘Image-based frontalization tends to introduce excessive deformation, altering both facial structure and expression. Our motion-based approach applies a loss that preserves face structure and ensures consistent textural rendering, albeit with a slight attenuation of [expression].

‘Nevertheless, our method is still competitive on most datasets as it nearly achieves the results given by recent FER methods in the literature.

‘However, in motion field, the results clearly demonstrate that our proposed method outperforms other FVS methods across datasets.’

Finally, besides the ablation studies, which we are not covering here, the researchers tested cross-subject facial motion transfer, which seeks to swap expressions across subjects, and even across sub-domains:

Cross-subject motion transfer tests the ability of eMotion-GAN to swap expressions across apparently incompatible types of image. Please refer to source paper for improved resolution.

The authors comment:

‘Our motion warping model can be used for different interesting applications, including: 1) data  augmentation by generating a variety of expressive faces given a neutral face. Indeed, along with our warping model, we can build a conditioned GAN to generate facial expressions motion conditioning on the facial expression and warp the generated facial motion to neutral faces to generate additional data. 2) class balancing: to balance number of samples per expression in a non-balanced FER dataset. 3) intensification or attenuation of facial expressions: to transform micro-expressions into macro-expressions and vice versa. 4) category balancing: as in many datasets, categories such as gender, age and ethnicity are not balanced, our warping model can be used to overcome such constraint.’

Conclusion

FER is a thorny challenge in facial synthesis systems that cannot always depend on input that comes pre-frontalized. The ability to create a canonical or default pose for a variety of expressions could lead in time to the same kind of divergence-based flexibility that CGI-based systems such as 3D Morphable Models (3DMMs) and FLAME employ, i.e., by defining features based on their divergence from a ‘resting’ canonical template.

The ability to process such a default position without the aid of mesh-based ancillary systems such as 3DMM could be a useful step away from the growing interdependence of CGI and neural techniques.

* Link is too long to include, available in snapshot

Please refer to the source paper for very extensive detail of the minutiae of the testing methodology, which far exceeds the scope of this coverage.
