Context Matters in Facial Expression Recognition (and Synthesis)

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

If it were possible to accurately ‘read’ a person’s emotional state based solely on an analysis of their facial muscle disposition, the Facial Action Coding System (FACS), developed in the 1970s and now the dominant methodology in Facial Expression Recognition (FER) and synthesis, would be an unassailable scientific approach.

However, our practical experience of emotional state evaluation (a basic human survival skill) tends to additionally consider the context in which facial expressions occur: the police officer questioning a suspect who, if innocent, would logically appear at least a little stressed – and yet has a tranquil and complacent expression; the woman using pleasant facial expressions to defuse a dangerous or threatening situation; or the psychopath whose affect disconnect allows them to accurately mimic apposite facial expressions that do not actually reflect their true feelings.

In stills from 'The Shining' (1980) and 'Memento' (2000), the disposition of facial muscles doesn't tell us everything we necessarily need to know about the subjects depicted. Sources: https://www.noldus.com/facereader/measure-your-emotions / https://filmschoolrejects.com/35-things-we-learned-from-the-memento-commentary-168d8e6d3a8d/ / https://www.imdb.com/title/tt0081505/reference/

The need to incorporate environmental or ancillary aspects into emotion recognition systems has gained traction in the research sector in recent years, leading to initiatives such as the REACT 2023 Multimodal Challenge, which is investigating new systems that take other factors into account, such as music (in films and TV FER evaluation), physical environment, and text content components of a target evaluation scenario – as well as considering other motion and audio aspects of the interaction.

Sample frames from the COGNIMUSE project, a recent work that seeks to evaluate emotional state by considering not just video content, but the contributing influence of audio and text-based content in movies. Source: https://arxiv.org/pdf/2306.10397.pdf

The current research interest in expression editing is largely dependent on the standards that the broader FER field sets – and most particularly on the methodologies and possible preconceptions of the dominant datasets. Therefore, any false position that FER promulgates will tend to feed down into the realm of neural visual effects research. If expression synthesis is ever to become an effective plank in AI-based VFX, the wider groundwork laid down by less ‘niche’ sectors – such as psychology, security and affect UI systems research – may need to look beyond the conventional and easy academic provenance of FACS.

SAFER Facial Expression Recognition

A recent addition to this growing body of interest in post-facial expression recognition is Situation Aware Facial Emotion Recognition (SAFER), offered by two researchers at Purdue University in the USA.

From the new paper: a subject in isolation may appear angry and hostile – but taking the context and 'background' of the photo into account radically changes our perceived estimation of his emotional state. Source: https://arxiv.org/pdf/2306.09372.pdf

The authors of the new paper describe SAFER as a multi-stream emotion recognition system, and have obtained competitive or superior results across a broad section of the most popular datasets and frameworks in the literature, as well as contributing a new database, called DeFi, designed to address some of the shortcomings of prior collections.

Conceptual workflow for the SAFER architecture.

SAFER takes a single image as input, though this can be a ‘static’ image or an extracted video frame. Face detection is achieved through the 2019 initiative BlazeFace, a lightweight architecture designed for fast inference on mobile devices, now adopted by diverse companies and projects, including the Mediapipe repository currently incorporated into ControlNet on Stable Diffusion.

BlazeFace: the six initial keypoints identified in a face (red box) are iterated into a more complex and granular array of facial landmark indicators. Source: https://arxiv.org/pdf/1907.05047.pdf

The feature extraction module comprises an Action Unit feature generator, a ‘visible’ or apparent feature generator, and a deep feature extractor.

Action Units are the base building blocks of the FACS system. Even frameworks which do not necessarily subscribe to the FACS methodology tend to use these units as discrete representations of sub-groups of facial movement, since the taxonomy itself is useful, whatever the ultimate application or governing theory may be.

Action Units – the base building blocks of the FACS methodology, and many subsequent, contrasting systems. Source: https://www.semanticscholar.org/paper/A-method-to-infer-emotions-from-facial-Action-Units-Velusamy-Kannan/f85ccab7173e543f2bfd4c7a81fb14e147695740

After the face mesh is generated in SAFER (see image below), the system identifies 12 AU centers, based on a limited and essential subset of the available action units.

Left, the generation of a landmark mesh via BlazeFace; right, the subset of AU rules that govern the detection and concatenation of features into a landmark set.

The essential features obtained through this process are calculated into various aspects of the facial disposition and structure, such as width, height, distance, and angle (pose) of the various facial parts.
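As a rough illustration of the kind of geometric measurements described above, the following sketch derives width, height, distance, and angle values from a handful of 2D landmark points. The landmark names, coordinates, and feature names are hypothetical placeholders, and are not the AU centers or feature definitions used in SAFER.

```python
import math

# Hypothetical 2D landmarks (x, y) in pixel coordinates. The names and values
# are illustrative assumptions and are not taken from the SAFER paper.
landmarks = {
    "left_eye": (30.0, 40.0),
    "right_eye": (70.0, 40.0),
    "mouth_left": (35.0, 80.0),
    "mouth_right": (65.0, 80.0),
    "upper_lip": (50.0, 75.0),
    "lower_lip": (50.0, 87.0),
}


def distance(p, q):
    """Euclidean distance between two points."""
    return math.hypot(q[0] - p[0], q[1] - p[1])


def angle_degrees(p, q):
    """Angle of the line from p to q, relative to the horizontal axis."""
    return math.degrees(math.atan2(q[1] - p[1], q[0] - p[0]))


def geometric_features(lm):
    """Collect a small set of width, height, distance, and angle features."""
    eye_distance = distance(lm["left_eye"], lm["right_eye"])
    mouth_width = distance(lm["mouth_left"], lm["mouth_right"])
    return {
        "mouth_width": mouth_width,
        "mouth_height": distance(lm["upper_lip"], lm["lower_lip"]),
        "eye_distance": eye_distance,
        # Dividing by the inter-eye distance removes dependence on image scale.
        "mouth_width_ratio": mouth_width / eye_distance,
        # A non-zero angle here would suggest in-plane head rotation (roll).
        "eye_line_angle": angle_degrees(lm["left_eye"], lm["right_eye"]),
    }


features = geometric_features(landmarks)
```

Normalizing by a reference length such as the inter-eye distance is a common choice for keeping measurements comparable across faces of different sizes in the frame; whether SAFER normalizes in this way is not stated in the text above.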

The three modules that power the facial feature extraction component of SAFER.

The deep features are obtained via one of two possible types of Convolutional Neural Network (CNN) – a regular CNN with three convolutional layers, where input images are converted to 226×229 pixels; and a transfer learning module using ResNet-50 trained on the ImageNet dataset (which is used in the studies conducted for the paper).

After the background is extracted from the frame, the resulting context is passed to an AlexNet network that has been pretrained on the Places dataset, which is used to recognize environmental aspects and objects.

The Places dataset is a collection of ten million semantically labeled scene photos. Source: https://dspace.mit.edu/bitstream/handle/1721.1/122983/PAMI_places.pdf

The final feature set obtained is a concatenation of these various extracted streams, where each data flow is accessible as a characteristic or variable.
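The concatenation step can be sketched as follows. This is a minimal illustration, assuming each stream has already been reduced to a flat list of numbers; the stream names, values, and dimensions are invented for the example and do not reflect the sizes used in SAFER.

```python
def concatenate_streams(streams):
    """Flatten named feature streams into one vector.

    Returns the combined vector plus a layout that maps each stream name to
    its (start, end) slice, so every stream stays addressable by name.
    """
    vector = []
    layout = {}
    for name, values in streams.items():
        start = len(vector)
        vector.extend(float(v) for v in values)
        layout[name] = (start, len(vector))
    return vector, layout


# Hypothetical outputs of the individual modules (illustrative values only).
streams = {
    "action_units": [0.2, 0.7, 0.1],
    "geometry": [30.0, 12.0],
    "deep": [0.05, 0.9, 0.4, 0.3],
    "context": [0.8, 0.1],
}

vector, layout = concatenate_streams(streams)

# Recover a single stream from the combined vector by name.
start, end = layout["geometry"]
geometry_slice = vector[start:end]
```

Keeping the slice layout alongside the vector is one simple way to let later stages, or an explanation step, refer back to an individual stream.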

Tests and Data

To test the system, the researchers trained SAFER on a number of FER datasets common in the literature.

NOTE: The results section of the paper is quite chaotic, and may be best read in a linear way from the original source. Nonetheless, we’ll attempt to arrange the testing assets into a rational list below.

The rival frameworks tested that are mentioned in the paper include:

Ensemble of deep neural networks with probability-based fusion for facial expression recognition (link, listed as ‘[40]’)
ResNet-50 and VGG-16 for recognizing Facial Emotions (link, listed as ‘[18]’)
Comparing Ensemble Strategies for Deep Learning: An Application to Facial Expression Recognition (link, listed as ‘[41]’)
Facial expression recognition boosted by soft label with a diverse ensemble (link, listed as ‘[17]’)
Ad-Corre: Adaptive Correlation-Based Loss for Facial Expression Recognition in the Wild (link, listed as ‘[19]’)
Dive into Ambiguity: Latent Distribution Mining and Pairwise Uncertainty Estimation for Facial Expression Recognition (link, listed as ‘[23]’)
Suppressing Uncertainties for Large-Scale Facial Expression Recognition (link, listed as ‘[42]’)
Deep Residual Learning for Image Recognition (link, listed as ‘[36]’)
Face2Exp: Combating Data Biases for Facial Expression Recognition (link, listed as ‘[28]’)
Facial Expression Recognition in the Wild via Deep Attentive Center Loss (link, listed as ‘[20]’)
Occlusion aware facial expression recognition using CNN with attention mechanism (link, listed as ‘[21]’)

The datasets used in the work are:

The Extended Cohn-Kanade Dataset (CK+), which contains 593 videos covering 123 subjects, with clips lasting between 10 and 60 frames (link, listed as ‘[38]’).
FER-2013, which contains 35,887 posed face images covering the standard 7 FACS facial expressions (‘angry’, ‘disgust’, ‘fear’, ‘happy’, ‘sad’, ‘surprise’ and ‘neutral’, though in unbalanced amounts). The images are drawn from movies and general fictitious media.
The EMOTIon recognition in Context (Emotic) dataset, which features 23,571 images drawn from real-life and media sources.
AffectNet, which contains over a million facial images scraped from the internet, with 450,000 used in the study.
RAF-DB, the first database to feature in-the-wild compound expressions, offering 29,672 in-the-wild images.
Context-Aware Emotion Recognition Networks (CAER-S), which features 70,000 expressions taken from TV shows.
FABO, which features 206 posed video clips.

The process of annotating the AffectNet database, one of the collections used in the tests for SAFER. Source: https://arxiv.org/pdf/1708.03985.pdf

These older datasets are pitted against the new dataset compiled for SAFER – DeFi, which the authors have made publicly available.  

The tests were conducted on a 20-core server PC running a 2.6 GHz Intel Xeon CPU with 96GB of RAM, in addition to three NVIDIA Tesla GPUs, each with 24GB of VRAM. Python multiprocessing was used to speed up the processes, and mixed precision libraries used to further reduce the processing burden.

Datasets were split into training, validation and test sets at an 80:10:10 ratio, with images resized to 224x224px. Dataset augmentation was performed, using cropping, brightness, contrast and rotation adjustments, to reduce the risk of overfitting. An adaptive learning rate was used, at a batch size of 32. The performance criterion was ‘accuracy’.
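The split and the evaluation metric can be sketched as below. This is a generic illustration of an 80:10:10 shuffle-and-split and of plain classification accuracy, not the authors' code; the function names, the fixed seed, and the sample data are assumptions made for the example.

```python
import random


def split_dataset(items, percentages=(80, 10, 10), seed=0):
    """Shuffle items and split them into train, validation, and test sets."""
    if sum(percentages) != 100:
        raise ValueError("percentages must sum to 100")
    shuffled = list(items)
    # A seeded generator keeps the split reproducible between runs.
    random.Random(seed).shuffle(shuffled)
    n_train = len(shuffled) * percentages[0] // 100
    n_val = len(shuffled) * percentages[1] // 100
    train = shuffled[:n_train]
    val = shuffled[n_train:n_train + n_val]
    # The test set takes whatever remains, so no sample is dropped.
    test = shuffled[n_train + n_val:]
    return train, val, test


def accuracy(predictions, labels):
    """Fraction of predictions that match the ground-truth labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    correct = sum(p == y for p, y in zip(predictions, labels))
    return correct / len(labels)


# Stand-in sample identifiers; a real pipeline would use image paths.
train, val, test = split_dataset(range(1000))

score = accuracy(
    ["happy", "sad", "happy", "fear"],
    ["happy", "sad", "neutral", "fear"],
)
```

Note that a plain random split like this one does not preserve class proportions; on heavily imbalanced data, a stratified split is often preferred, though the text above does not say which approach the authors used.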

The first set of tests, on benchmark datasets (image below), according to the authors, established SAFER as ‘comparable with the state-of-the-art works in all datasets’:

Initial comparison of SAFER to former datasets.

The authors state*:

‘Our results are comparable with the state-of-the-art works in all datasets. For the FABO dataset, we outperform the accuracy reported by various recent [works]. For the CK+ dataset, we find the best-reported result to be 98.57% accuracy as in [25], our accuracy of 98.5% is comparable to it.’

They further note that SAFER outperforms several other recent works on the CAER-S dataset, with a 7.8% advantage over the nearest-performing prior work.

Tests on the FER-2013 dataset indicate that SAFER outperforms Ad-Corre, and the authors report that their system also improves on [21].

Results from a comparison across various emotion recognition systems on the FER-2013 dataset.

The authors note that class imbalance is a notable obstacle to tests of this type, across many datasets. In AffectNet, the ‘happiness’ class contains 146,198 samples, while the ‘disgust’ class offers only 5,264 samples. If the number of standard FACS categories were at least a little higher than the 6-7 that are almost uniformly used across FER research projects, this imbalance might be a little less damaging – as it stands, the authors ascribe SAFER’s 63.7% accuracy on AffectNet to such an imbalance, and note that it is comparable to other frameworks.

In FER-2013, the imbalance is also severe, since the ‘disgust’ class contains only 436 examples, in comparison to ‘happiness’, which contains 7,215 samples. It could be argued that the predominance of ‘catwalk’ and ‘premiere’-style images across the entirety of the computer vision research sector, the lowest-hanging and most abundant fruit that can be scraped, together with the extra work involved in recognizing lesser-seen emotions such as disgust, represent a systemic problem that the community may need to actively address in future projects.
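To put the quoted sample counts in perspective, the sketch below computes a simple imbalance ratio, together with inverse-frequency class weights, which are one common way of compensating for skewed classes during training. Only the two classes quoted above are included for each dataset, so the resulting weights are purely illustrative; the weighting scheme is a generic technique, and nothing in the text above indicates that SAFER uses it.

```python
# Sample counts as quoted in the article, limited to the two classes mentioned.
affectnet_counts = {"happiness": 146_198, "disgust": 5_264}
fer2013_counts = {"happiness": 7_215, "disgust": 436}


def imbalance_ratio(counts):
    """Ratio between the largest and the smallest class."""
    return max(counts.values()) / min(counts.values())


def inverse_frequency_weights(counts):
    """Class weights of the form total / (num_classes * class_count).

    Rare classes receive weights above 1 and common classes weights below 1,
    so each class contributes equally to a weighted loss in aggregate.
    """
    total = sum(counts.values())
    num_classes = len(counts)
    return {name: total / (num_classes * n) for name, n in counts.items()}


affectnet_ratio = imbalance_ratio(affectnet_counts)
fer2013_ratio = imbalance_ratio(fer2013_counts)
affectnet_weights = inverse_frequency_weights(affectnet_counts)
fer2013_weights = inverse_frequency_weights(fer2013_counts)
```

With weights of this form, the weighted sample counts of all classes are equal, which is why the scheme is often used as a first response to skewed training data.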

Additionally, the authors comment:

‘Many of the emotion classes share some of the facial expression with each other. For example, we see lots of ‘Happy’ samples are wrongly classified as ‘Neutral’, ‘Disgust’ as ‘Anger’ etc. This happens due to the close correlation between these emotion classes and increases the complexity of the classification task.’

They further observe that SAFER obtains better results on the less imbalanced CAER-S dataset, which, besides being more evenly distributed across the limited emotion classes, also contains additional contextual information.

Finally, the researchers emphasize the value of context in contributing to the evaluation and classification of facial expressions. In one of the examples with which we opened this overview, the actor Guy Pearce, in the 2000 film Memento, is depicted as ‘happy’, yet in a grisly and criminal context. Though relatively naïve algorithmic image recognition filters can understand that red daubs on a body may indicate bloodshed (an active filter in the DALL-E 2 synthesis framework), a wider view of context can not only ‘explain’ a facial expression, but categorize a situation in a more rational light:

'Red-stained hands', but it couldn't be more innocent. The 'playgroup' context can help FER systems to evaluate expressions with an environmental component.

The authors state:

‘We can provide human explainable reasoning by creating an idea of the situation around the subject. Individual modules tell us what information is available from the face and background. For instance, the subject in [the red bounding box in above image] has a smiling face and colorful vibrant background. By extracting age, gender, location type and location attributes, we can create our situational knowledge which further enhances this reasoning.

‘In the case of [above image], place category output is a day care play room. By combining all these a human understandable explanation of happiness class for the subject can be constructed as ”the subject is a child in a playroom and smiling, has a happy facial expression”.’
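The way such an explanation might be assembled from per-module outputs can be sketched with a simple template. The `SituationCues` structure, its field names, and the template wording are hypothetical; the paper excerpt above does not specify how the explanation text is actually generated.

```python
from dataclasses import dataclass


@dataclass
class SituationCues:
    """Hypothetical container for the outputs of the individual modules."""
    age_group: str   # e.g. from an age estimator
    place: str       # e.g. from a scene classifier
    face_cue: str    # e.g. from the facial feature stream
    emotion: str     # the predicted emotion class


def build_explanation(cues):
    """Fill a fixed sentence template with the extracted situational cues."""
    return (
        f"the subject is a {cues.age_group} in a {cues.place} "
        f"and {cues.face_cue}, has a {cues.emotion} facial expression"
    )


explanation = build_explanation(
    SituationCues(age_group="child", place="playroom",
                  face_cue="smiling", emotion="happy")
)
```

A fixed template of this kind is the simplest possible design; it only shows how the separate cues could be combined into a single human-readable sentence.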

Conclusion

That facial expressions need to be considered in context is generally acknowledged in casual cultural mores: expressions such as ‘read my lips’ (referring to textual content) and ‘read the room’ (referring to environmental and broader social cues) are more common than ‘read my face’.

Nonetheless, the FACS system, perhaps because it offers so many useful tools and definitions for this line of research, has evolved into a strand of academic methodology that tends to treat facial expressions as discrete and complete psychological indicators of mental state, divorced from context.

It’s therefore fortunate, perhaps, that current sector interest in multimodal systems is set to use action units as component elements in much wider emotional state evaluation systems, rather than as mere facets within the very limited set of six or seven expressions that have come to dominate the locus of research interest in recent years.

* My conversion of the authors’ inline citations to hyperlinks.
