Some emotional states are difficult to guess just by looking at the expression on someone’s face. Yet this is the default way that the neural facial synthesis research sector is currently addressing the challenge of Facial Expression Recognition (FER), mainly with the use of the oft-criticized Facial Action Coding System (FACS).

Embarrassment, for instance, is incredibly contextual, since this particular emotion requires that the subject be situated in a social scenario (else the emotion is impossible to feel)*.
Besides the fact that embarrassment is not listed even in the extended version of FACS emotions (the core list comprises happiness, sadness, neutral, anger, surprise, disgust and fear), it is not possible to study the disposition of facial muscles in an embarrassed person and correctly guess that emotional state, because the source data comprises only a single face image. Even corny placeholders such as Biting lip are amenable to a different interpretation, such as Anxiety.
The only way one can improve the recognition of contextual emotions like these, and, arguably, improve the quality of FER in general, is to annotate facial poses in a wider context, so that the next time an automated system sees an ‘embarrassed’ expression (based solely on a face), it will have a pre-existing connection between the expression and the contributing external and environmental conditions that are inspiring it.
The Big Picture
This is the central motivation behind new research from Canada, which investigates whether the semantic knowledge contained in Large Language Models (LLMs) such as GPT-3 is capable of inferring emotional state from pictures that depict more than just a set of face muscles.

The researchers used a custom set of natural language descriptors encompassing faces, bodies, interactions and environments to annotate an emotion-centric image database, and then tested the ability of GPT-3.5 to infer emotion from the new image captions. They were surprised to find that the popular LLM performed at much the same efficiency and accuracy as human estimators, indicating that language models may have a notable role to play in FER in the future.
Research of this kind, along with multi-modal emotion estimation, is going to prove essential if neural facial synthesis is ever to really come into its own. Currently, emotion estimation in neural pipelines is largely restricted to the plethora of frameworks and modules that utilize the severely restricted number of facial expressions quantified in FACS.
In practical terms, this leaves VFX users having to treat the individual action units (AUs) within an expression (i.e., the individual muscle movements) as ‘lumps of clay’ that need to be manually manipulated until the user is satisfied with the overall final expression – which is as artisanal an approach as any expensive and bespoke CGI pipeline.
Even for proponents of pure face-based systems such as FACS, multimodal annotation and labeling systems are the only obvious way that truly contextual emotions such as embarrassment can ever be quantified.
Additionally, it can be very difficult, in a FACS-style data-gathering session, to authentically evince a desired emotion of any kind, whereas multimodal systems can operate on real-world images and videos, and bring spontaneous examples of elusive emotions into a dataset at scale – so long as there is an adequate method of recognizing in-the-wild expressions.
The new paper is titled Contextual Emotion Estimation from Image Captions, and comes from five researchers at the School of Computing Science at Simon Fraser University in Burnaby.
Approach
In developing the new system, the researchers set themselves a harder-than-average challenge by concentrating only on the recognition of negative emotions, a more difficult prospect, with subtler visual cues, than the recognition of positive emotion.
The emotions chosen for the test were therefore drawn from the EMOTIC dataset used in the study (available only by prior approval): Anger, Annoyance, Aversion, Confusion, Disapproval, Disconnection, Disquietment, Embarrassment, Fatigue, Fear, Pain, Sadness, and Suffering.
(Since the EMOTIC labels Pain and Suffering were likely to be sources of ambiguity, these were transformed into the discrete labels Pain/Suffering – Emotional and Pain/Suffering – Physical.)
The first task was to create descriptions of physical signals corresponding to these target labels, using an emotion thesaurus. Descriptions for emotions not covered in that book were generated by the (now deprecated) text-davinci-003 sub-model of GPT-3.5.
The authors state:
‘The prompts used to generate the physical descriptions were of the form, “List physical cues/physical expressions that would indicate the emotion of ‘disapproval’ in an image.” and “Give a list of facial expressions/physical descriptions/physical movements that might indicate that a person is feeling ‘fatigued’.”’
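As a rough illustration of what this cue-generation step might have looked like in practice, the sketch below sends prompts of the form quoted above to the now-deprecated text-davinci-003 model through the legacy openai Python library. The emotion subset, variable names and sampling parameters are illustrative assumptions, not details taken from the paper.

```python
# Hypothetical sketch of the cue-generation step, via the legacy openai
# Python library (pre-1.0) and the now-deprecated text-davinci-003 model.
# The prompt wording follows the paper's quoted form; the emotion subset,
# variable names and parameter values are illustrative assumptions.
import openai

openai.api_key = "YOUR_API_KEY"  # assumed placeholder

EXAMPLE_EMOTIONS = ["disapproval", "fatigue", "embarrassment"]  # illustrative subset

def list_physical_cues(emotion: str) -> str:
    prompt = (
        "List physical cues/physical expressions that would indicate "
        f"the emotion of '{emotion}' in an image."
    )
    response = openai.Completion.create(
        model="text-davinci-003",
        prompt=prompt,
        max_tokens=256,
        temperature=0.7,  # assumed; the paper's sampling settings are not given here
    )
    return response["choices"][0]["text"].strip()

for emotion in EXAMPLE_EMOTIONS:
    print(emotion, "->", list_physical_cues(emotion))
```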
The descriptions that the LLM provided were then filtered down into 222 discrete physical signals, and finally into a selection of 153 descriptors:

The EMOTIC images were annotated in a custom, web-based interface.

During annotation, both physical signals and contextual aspects were taken into consideration, so that a person being annotated could be viewed, for instance, within a social context, and not as a disembodied set of potential emotional responses.

From the 222 signals initially filtered, 153 were ultimately used to describe the EMOTIC images. Then GPT-3.5 was used to predict emotion labels, aided by a prompt.
By way of example, for the image shown below, the prompt was:
‘“Sean is a male adult. Sean is a(n) passenger. Sean is or has raising eyebrows, side-eyeing. Mia is a child and she is sitting behind Sean and kicking Sean’s chair. Sean’s physical environment is on an airplane. Sean is likely feeling a high level of {placeholder}? Choose one emotion from the list: Anger, Annoyance, Aversion, Confusion, Disapproval, Disconnection, Disquietment, Embarrassment, Fatigue, Fear, Pain/Suffering (emotional), Pain/Suffering (physical), and Sadness.”’
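The quoted prompt suggests a simple fill-in-the-blanks template assembled from the annotation fields (identity, role, physical signals, interaction, environment). The sketch below reconstructs such a template in Python; the field names and the build_prompt helper are hypothetical, and the ‘{placeholder}’ token is reproduced verbatim from the example above.

```python
# Hypothetical reconstruction of the prompt template implied by the quoted
# example. Field names and the build_prompt helper are assumptions; the
# '{placeholder}' token is reproduced verbatim from the quoted prompt.
EMOTION_LIST = (
    "Anger, Annoyance, Aversion, Confusion, Disapproval, Disconnection, "
    "Disquietment, Embarrassment, Fatigue, Fear, Pain/Suffering (emotional), "
    "Pain/Suffering (physical), and Sadness"
)

def build_prompt(name, demographic, role, physical_signals,
                 interaction=None, environment=None):
    parts = [
        f"{name} is a {demographic}.",
        f"{name} is a(n) {role}.",
        f"{name} is or has {physical_signals}.",
    ]
    if interaction:   # social-interaction description, if annotated
        parts.append(interaction)
    if environment:   # environment description, if annotated
        parts.append(f"{name}'s physical environment is {environment}.")
    parts.append(
        f"{name} is likely feeling a high level of {{placeholder}}? "
        f"Choose one emotion from the list: {EMOTION_LIST}."
    )
    return " ".join(parts)

print(build_prompt(
    "Sean", "male adult", "passenger", "raising eyebrows, side-eyeing",
    interaction="Mia is a child and she is sitting behind Sean and kicking Sean's chair.",
    environment="on an airplane",
))
```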

Data and Tests
The researchers conducted three rounds of experiments to test their approach, each with a diminishing level of description: the first round provided the entire unedited caption, the second removed interaction details, and the third removed environmental information.
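A minimal sketch of how these three conditions might be produced from a segmented caption is given below, reusing the Sean example. The segmentation itself, and whether the third round drops the environment alone or the interaction details as well, are assumptions for illustration only.

```python
# Minimal sketch of the three caption conditions, assuming each caption is
# stored as named segments. Segment contents come from the Sean example; the
# segmentation, and the exact composition of the third round, are assumptions.
caption_segments = {
    "person": ("Sean is a male adult. Sean is a(n) passenger. "
               "Sean is or has raising eyebrows, side-eyeing."),
    "interaction": "Mia is a child and she is sitting behind Sean and kicking Sean's chair.",
    "environment": "Sean's physical environment is on an airplane.",
}

def assemble(segments: dict, drop: tuple = ()) -> str:
    """Join the caption segments, omitting any named in `drop`."""
    return " ".join(text for key, text in segments.items() if key not in drop)

round_one   = assemble(caption_segments)                         # full caption
round_two   = assemble(caption_segments, drop=("interaction",))  # interaction removed
round_three = assemble(caption_segments, drop=("environment",))  # environment removed
```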

The EMOTIC subset used consisted of 331 unique images, yielding 360 samples and 360 manually-created captions, across two image types: One person and Multiple people. The distribution of emotions in the samples is shown below:

The ground truth for the final set of images was provided by two human annotators, through a process of negotiation and consensus.
The tests were conducted through OpenAI’s Completions API, with the image captions supplied as prompts to the GPT-3.5 model. The model’s temperature was set to zero, making its responses almost deterministic.
Each caption was run through GPT-3.5 ten times to determine a median result, with the choice of available emotions limited to the aforementioned 13 negative emotions.
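The querying step might be sketched as follows, assuming the legacy Completions endpoint with text-davinci-003 as a stand-in for ‘the GPT-3.5 model’ (the article names that sub-model only for the cue-generation step). Since a median is awkward to define over categorical labels, the sketch tallies the ten responses and keeps the most frequent valid label instead.

```python
# Hedged sketch of the querying step: each caption-prompt is sent ten times at
# temperature 0 through the legacy Completions API. text-davinci-003 is an
# assumed stand-in model; the aggregation by most-frequent label is a
# substitute for the 'median' described in the article.
from collections import Counter
import openai

openai.api_key = "YOUR_API_KEY"  # assumed placeholder

NEGATIVE_EMOTIONS = [
    "Anger", "Annoyance", "Aversion", "Confusion", "Disapproval",
    "Disconnection", "Disquietment", "Embarrassment", "Fatigue", "Fear",
    "Pain/Suffering (emotional)", "Pain/Suffering (physical)", "Sadness",
]

def predict_emotion(prompt: str, runs: int = 10) -> str:
    answers = []
    for _ in range(runs):
        response = openai.Completion.create(
            model="text-davinci-003",  # assumed GPT-3.5 sub-model
            prompt=prompt,
            max_tokens=16,
            temperature=0,             # near-deterministic responses
        )
        text = response["choices"][0]["text"].lower()
        # Keep only answers that match one of the 13 permitted labels.
        answers.extend(label for label in NEGATIVE_EMOTIONS if label.lower() in text)
    return Counter(answers).most_common(1)[0][0] if answers else "no valid label"
```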
Metrics used were precision, recall, and F1 score.
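These are standard classification metrics; a minimal sketch of how they could be computed with scikit-learn, over hypothetical ground-truth and predicted label lists rather than the paper’s data, follows.

```python
# Minimal sketch of the evaluation metrics, computed with scikit-learn over
# hypothetical ground-truth and predicted label lists (not the paper's data).
from sklearn.metrics import precision_recall_fscore_support

y_true = ["Fear", "Sadness", "Anger", "Fatigue", "Fear"]         # illustrative only
y_pred = ["Fear", "Sadness", "Disapproval", "Fatigue", "Anger"]  # illustrative only

precision, recall, f1, _ = precision_recall_fscore_support(
    y_true, y_pred, average="macro", zero_division=0
)
print(f"precision={precision:.2f}  recall={recall:.2f}  F1={f1:.2f}")
```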

The researchers observe that, perhaps unsurprisingly, the full-caption round yielded the highest score. The label Physical Pain/Suffering was the most accurately estimated, but the labels Anger and Sadness (which are quite low-hanging fruit in the FACS system) were the most frequently selected.
The paper states:
‘Emotional Pain/Suffering was frequently recognized as Sadness, which may be reasonable. Annoyance and Confusion were often recognized as Disapproval. Fear appeared to need environmental cues to be well predicted. Disapproval and Fatigue seem not to be impacted by social and environmental contexts. Embarrassment was fairly well predicted with social interactions.’
The authors note that when environmental descriptions were removed, this negatively impacted the label Physical Pain/Suffering the most. In the image below, the predicted emotion changed from Physical Pain/Suffering to Disapproval when the environment description was omitted, because this also took out the description of pain medication.

‘The full caption for this image with the removed part in italics is: Jack is a male adult. Jack is or has frowning, rubbing the back. Jack’s physical environment is on a bed with medication on the side.’
Similar drops can occur with under-described instances of Fear, the researchers observe.

In the above example, the omission of the menacing element changed the prediction from Fear to (what most people would consider to be the fairly unlikely) Disapproval.
Emotions such as Anger were easier for GPT-3.5 to recognize across the ablated versions, since the facial cues are less dependent on what is immediately happening, or on context in general (i.e., one may be Angry all day, for various reasons, but one is not likely to be abjectly terrified in a ‘fight-or-flight’ fashion for a prolonged period, except in the immediate presence of environmental threats).
Interestingly, four positive emotions – Excitement, Happiness, Joy and Love – were predicted by GPT-3.5, in cases where relevant information was ablated from the captions.
The paper concludes:
‘Overall, our approach may be used to enhance transparency and facilitate an effective breakdown of scene representation for contextual emotion estimation. It is hoped that our study can also serve as a catalyst for future research in interpretability of LLMs, as well as understanding human perception of emotions, especially if reproduced with other languages and cultures.’
Conclusion
Besides the growing importance of multimodal systems (i.e., frameworks that evaluate music, speech, or other factors to make emotion predictions), bringing FER out of the FACS sandbox is arguably of the utmost importance for the facial synthesis sector.
Even FACS-endorsed fiction such as Lie to Me, and analogous projects such as Poker Face, do not depict genius characters that are capable of determining an inner emotional state based solely on facial disposition – rather, in those cases, the characters are aided by a wealth of contextual information about the subjects that they are observing.
Therefore, contextual FER systems such as the one proposed in the new work should perhaps not be considered merely as feeder systems for FACS-style facial pose analysis; as the paper demonstrates, environment and context are often a defining characteristic of, and prerequisite for, an accurate FER prediction.
Wild as it may now seem, expression controls in facial synthesis systems of the future may likewise need to know what is supposed to be happening around the face in question, in order to accurately suggest an apposite facial expression.
* Even if that scenario is in the context of, for instance, discovering that a damning disclosure about oneself has been published on the internet.