New research from Australia offers a deeper and more nuanced approach to Facial Expression Recognition (FER) – by training a deep learning model to continuously adapt to ‘compound’ facial expressions in the way that humans do, and by using Class Activation Maps to evaluate how well the system understands the way these ambiguous, hybrid expressions are characterized across different regions of the face.

In the field of image synthesis, the ability to recognize compound emotions and emotional states that do not fit into the standard six of the dominant Facial Action Coding System (FACS) seems likely to become essential for expression editing systems.
As we have observed before, once the novelty of photorealistic neural facial recreation has palled a little for the viewer, as more projects of this type enter mainstream media, the threshold for credibility is set to rise significantly. If recreated neural characters cannot approach our own level of facial expressiveness, there is the risk of a new type of ‘uncanny valley’ – where the facial surfaces and topology are perfect, but the characters themselves seem emotionally ‘limited’.
In order to prepare for this ‘post-shock’ era of facial synthesis, the authors of the new work contend, it’s necessary for trained models to be able to learn a growing gamut of compound facial emotions over time, and to understand that apparently conflicting emotions can often exist in the affect state of a human face.

The new method developed by the researchers is inspired by studies on human cognition, and employs a number of novel tricks to ensure that the framework has some prior knowledge about learned facial affect states, without needing to develop a ‘baked’ and non-fluid dataset.
The new work, the authors state, achieves state-of-the-art results, though in an admittedly nascent field, and is the first, they believe, to apply few-shot learning to the recognition of complex facial expressions.
The new paper is titled Complex Facial Expression Recognition Using Deep Knowledge Distillation of Basic Features, and comes from two researchers at the School of Information Technology at Deakin University in Victoria.
Approach
As the paper notes, the FACS system has become dominant in the literature. The six expressions that it concentrates on feature frequently in computer vision research, and particularly in research into expression editing – the ability to intervene in neural representations and alter their apparent affect state.
However, these six emotions (anger, disgust, fear, happiness, sadness and surprise – with contempt a late addition) could be argued to constitute the ABC of affect state, in a sector that will rapidly be required to progress to a full lexicon. The authors state*:
‘[Human] beings express a wide range of emotions through facial expressions that do not fit into predefined categories, and there is evidence that no such basic, prototypical emotion categories exist. Instead, FER develops naturally over time in humans, who are able to identify new, complex emotions on the fly as they appear.
‘To approach human-like FER performance, a machine should be able to recognise complex expressions of emotion such as happily disgusted, and distinguish them from similar emotions like happy, disgusted, or happily surprised.
‘These compound expressions are more than the sum of their parts; they are distinct concepts which express a unique emotion. To humans, such synthesising of known concepts to form new ones comes relatively naturally, and we are able to learn, process and recognise new compound expressions using very little data.
‘However, the state-of-the-art in FER still has difficulty with such cognitive complexity, due to the similarity of features across both basic and compound expressions.’

The new system is divided into three sections: a basic FER phase, where a simple FER model learns to recognize the six basic expressions from a static labeled dataset; a continual learning module, where the model trained in the first phase must now learn new compound expression classes, adding novel classes until all the available configurations have been learned; and a few-shot learning phase, where the phase 1-trained model learns additional compound expression classes, using a very small number of samples of each new class.
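In outline, the class-incremental schedule looks something like the following toy sketch (purely illustrative; the class names match the CFEE labels discussed below, but the code is not the authors'):

```python
# Toy illustration of the class-incremental schedule: the label space starts
# with the six basic expressions and grows as compound classes arrive one at
# a time. Only a subset of the compound classes is listed here, for brevity.

basic = ["anger", "disgust", "fear", "happiness", "sadness", "surprise"]
compound_stream = ["happily surprised", "happily disgusted", "sadly angry",
                   "angrily disgusted", "fearfully surprised"]

known_classes = list(basic)                      # phase 1: static labeled set
for step, new_class in enumerate(compound_stream, start=1):
    known_classes.append(new_class)              # phase 2: one new class per step
    print(f"step {step}: the model must now separate {len(known_classes)} classes")

# Phase 3 repeats a single such step, but with only a handful of labeled
# images for the incoming class, and without the replay memory described below.
```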

The first phase uses a feature extractor based on the ResNet50V2 architecture. The network is pretrained on ImageNet, which helps the model to establish low-level features such as edges and contours, and is then fine-tuned on the FER dataset.
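A minimal Keras sketch of such a backbone might look like the following (the head layers, dropout rate and optimizer are assumptions, not the paper's exact configuration):

```python
import tensorflow as tf

# ResNet50V2 pretrained on ImageNet as a frozen feature extractor, with a small
# trainable classification head for the six basic expressions (phase 1).
base = tf.keras.applications.ResNet50V2(
    include_top=False, weights="imagenet",
    input_shape=(224, 224, 3), pooling="avg")
base.trainable = False                           # train only the new head at first

model = tf.keras.Sequential([
    base,
    tf.keras.layers.Dense(256, activation="relu"),   # head size is an assumption
    tf.keras.layers.Dropout(0.3),
    tf.keras.layers.Dense(6, activation="softmax"),  # six basic expressions
])
model.compile(optimizer="adam",
              loss="sparse_categorical_crossentropy",
              metrics=["accuracy"])
```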

The feature mappings obtained in the first phase are passed through to the second and third phases using knowledge distillation, which effectively passes ‘essential notes’ from one model to another. The input workflow for the first phase passes in batches of 224x224px images that have been preprocessed with the RetinaFace face detection algorithm.
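A generic distillation loss of this kind can be sketched as follows; the temperature, the weighting, and the decision to distill logits rather than intermediate feature maps are all assumptions here, and the paper's own formulation may differ:

```python
import tensorflow as tf

# Generic knowledge-distillation loss: a hard-label term on the ground truth,
# plus a soft-label term that pulls the student towards the teacher's softened
# output distribution. Hyperparameters are placeholders.
def distillation_loss(y_true, student_logits, teacher_logits,
                      temperature=2.0, alpha=0.5):
    hard = tf.keras.losses.sparse_categorical_crossentropy(
        y_true, student_logits, from_logits=True)
    teacher_soft = tf.nn.softmax(teacher_logits / temperature)
    student_log_soft = tf.nn.log_softmax(student_logits / temperature)
    soft = -tf.reduce_sum(teacher_soft * student_log_soft, axis=-1)
    # The temperature-squared factor keeps the soft-term gradients on a
    # comparable scale to the hard-label term.
    return alpha * hard + (1.0 - alpha) * (temperature ** 2) * soft
```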

In the second phase, labeled images from each new compound expression class are added to the training data in sequence, with a representative memory used to store a selection of the training samples from phase 1. The way that this information is selected is dictated by a Predictive Sorting Memory Replay policy (a technique related to experience replay in reinforcement learning, in which the system builds up long-term knowledge by periodically revisiting stored past experiences).
The ‘inherited’ classes are trained alongside the newly added classes, building up insight in an evolutionary or cascading manner. The images selected for replay in phase 2 are those most representative of their class (i.e., ‘angry’, ‘sad’, etc.). The authors note that this type of selection accords with the way that humans broadly define classes in general, and facial emotional states in particular:
‘[Humans] retain memories of only the most pronounced moments of an experience, which efficiently enables recognition and classification of the entire experience. Once initialised, the representative memory does not acquire any new samples from the Basic FER Phase data, to emulate human learning whereby the raw data from prior experiences is no longer available, and only memories are retained.
‘This also ensures the method is aligned with practical applications that may have limiting memory and computation requirements, such as mobile computing, IoT and robotics.’
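The Predictive Sorting Memory Replay policy is the authors' own contribution and is not reproduced here; purely to make the idea of a representative memory concrete, the sketch below uses a common exemplar-selection baseline (in the spirit of iCaRL) that keeps the samples whose embeddings lie closest to each class mean:

```python
import numpy as np

# Keep, for each class, the `per_class` samples whose feature embeddings are
# closest to the class centroid, i.e. the most 'typical' examples. This is a
# generic exemplar-selection heuristic, not the paper's policy.
def select_exemplars(features, labels, per_class=20):
    memory = {}
    for c in np.unique(labels):
        class_feats = features[labels == c]              # (n_c, d) embeddings
        centroid = class_feats.mean(axis=0)
        dists = np.linalg.norm(class_feats - centroid, axis=1)
        keep = np.argsort(dists)[:per_class]
        memory[c] = class_feats[keep]
    return memory
```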
For the few-shot learning phase, the third phase in the architecture, a limited number of training examples are used, and the Representative Memory Replay component from phase 2 is excluded, since this approach is particular to continual learning rather than few-shot learning.
The few-shot experiments in phase 3, the authors state, are each equivalent to a single iteration of the continual learning cycle in phase 2. By keeping these experiments isolated, the authors were able to test the resilience of the learned concepts regarding compound FER.
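Both of the latter phases also require the classifier to accommodate classes it has never seen. A generic Keras pattern for growing a softmax head while preserving the weights already learned might look like this (it assumes a Sequential model like the earlier sketch, and is illustrative rather than the authors' code):

```python
import numpy as np
import tensorflow as tf

# Extend the final softmax layer by one class, copying the learned weights for
# the existing classes and initialising the new column with small random values.
def add_class(model):
    old_head = model.layers[-1]                 # existing softmax Dense layer
    w, b = old_head.get_weights()               # kernel (d, n_old), bias (n_old,)
    n_old = w.shape[1]
    new_head = tf.keras.layers.Dense(n_old + 1, activation="softmax")
    new_head.build((None, w.shape[0]))
    new_w = np.concatenate([w, 0.01 * np.random.randn(w.shape[0], 1)], axis=1)
    new_b = np.concatenate([b, [0.0]])
    new_head.set_weights([new_w, new_b])
    body = tf.keras.Sequential(model.layers[:-1])   # everything below the head
    return tf.keras.Sequential([body, new_head])
```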
At this stage, the Grad-CAM framework was used to identify which concepts were manifesting at inference time, and where in the images these concepts had the greatest influence. Grad-CAM uses Class Activation Maps to ‘trace’ the influence of concepts baked into the latent space of a model, producing heat-maps that visualize these concentrations.
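A minimal Grad-CAM routine for a Keras model can be sketched as follows; it assumes a flat functional model and an explicitly named final convolutional layer, both of which are simplifications:

```python
import numpy as np
import tensorflow as tf

# Grad-CAM: weight each channel of a convolutional feature map by the average
# gradient of the target class score, then sum and normalise to obtain a
# heat-map over the input image.
def grad_cam(model, image, class_index, conv_layer_name):
    conv_layer = model.get_layer(conv_layer_name)
    grad_model = tf.keras.Model(model.inputs, [conv_layer.output, model.output])
    with tf.GradientTape() as tape:
        conv_out, preds = grad_model(image[np.newaxis, ...])
        class_score = preds[:, class_index]
    grads = tape.gradient(class_score, conv_out)         # d(score)/d(feature map)
    weights = tf.reduce_mean(grads, axis=(1, 2))          # global-average-pooled grads
    cam = tf.reduce_sum(conv_out * weights[:, None, None, :], axis=-1)
    cam = tf.nn.relu(cam)[0]
    return (cam / (tf.reduce_max(cam) + 1e-8)).numpy()    # normalised heat-map
```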

Data and Tests
For the testing round, the researchers used the Compound Facial Expressions of Emotion (CFEE) database, which holds 5044 images across 230 different face subjects, each acting out facial expressions in a controlled environment, for a total of 21 labeled facial expressions across multiple examples of each.
For validation (where a part of the data is actively used, and another part ‘held back’ as representative data to test against), the dataset was divided into 10 groups each comprising 23 subjects, with nine used for training and one for validation purposes.
For phase 1 (basic FER), a model was trained and its accuracy evaluated against the validation set at each epoch of training (i.e., each time the system had seen the entirety of the data). The process was repeated over the various groups, with the subject-wise divisions ensuring that the model was always evaluated on faces it had not seen during training.
A baseline FER accuracy was then established by extracting a maximum, mean and standard deviation from the aggregated results of all the passes.
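In scikit-learn terms, the subject-wise protocol and the aggregation of the fold results might be sketched as follows, where train_and_evaluate is a placeholder for one full phase-1 training run:

```python
import numpy as np
from sklearn.model_selection import GroupKFold

# Split the 230 subjects into 10 groups of 23, so that no subject appears in
# both the training and validation folds, then aggregate the fold accuracies.
def cross_validate(images, labels, subject_ids, train_and_evaluate):
    accuracies = []
    for train_idx, val_idx in GroupKFold(n_splits=10).split(
            images, labels, groups=subject_ids):
        acc = train_and_evaluate(images[train_idx], labels[train_idx],
                                 images[val_idx], labels[val_idx])
        accuracies.append(acc)
    accuracies = np.asarray(accuracies)
    # Baseline FER figures reported as maximum, mean and standard deviation.
    return accuracies.max(), accuracies.mean(), accuracies.std()
```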
The six aforementioned basic expression classes were (naturally) carried over from the basic FER phase, with the remaining fifteen compound expressions added: happily surprised, happily disgusted, sadly angry, angrily disgusted, appalled, hatred, angrily surprised, sadly surprised, disgustedly surprised, fearfully surprised, awed, sadly fearful, fearfully disgusted, fearfully angry, and sadly disgusted.
The above-mentioned ResNet50V2 model was used as the core of the network, with the topmost dense layers trainable and the rest of the model frozen. When validation accuracy converged (i.e., got as good as it was ever going to get), training was stopped (a process known as ‘early stopping’, as opposed to letting the model run on through the full scheduled number of iterations).
Next, for phase 1, the frozen portion of the model was unfrozen, and the model trained once more to fine-tune the weights towards FER tasks.
For the second and third phases, the layers of the first two convolutional blocks in the model were frozen, since these contained prior knowledge fundamental to the methodologies of these latter two phases. If the layers had not been frozen, the value of the original weights would have been lost, since fine-tuning (or continuing or resuming training on an already effective model) inevitably re-calibrates the original weights, causing loss of information that has already been proven effective, in favor of subtle but unknown variations.
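Assuming the Keras ResNet50V2 backbone from the earlier sketch, the freezing schedule might be expressed as follows (the block-name prefixes follow Keras's standard ResNet50V2 layer naming, and the whole function is illustrative rather than the authors' code):

```python
# Freeze or unfreeze parts of the backbone according to the training phase.
def freeze_for_phase(base_model, phase):
    if phase == 1:
        # Fine-tuning step of phase 1: the whole backbone becomes trainable.
        base_model.trainable = True
    else:
        # Phases 2 and 3: keep the first two convolutional blocks frozen so
        # that the low-level features learned in phase 1 are not overwritten.
        base_model.trainable = True
        for layer in base_model.layers:
            if layer.name.startswith(("conv1_", "conv2_")):
                layer.trainable = False
```

Early stopping of the kind described above is available in Keras as the tf.keras.callbacks.EarlyStopping callback, monitoring validation accuracy with a patience of a few epochs.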

The phase 1 tests were validated by k-fold cross validation, and the authors observe that these results ‘are comparable’ with state-of-the-art approaches.

For the continual learning phase, the best-performing models from the phase 1 validation runs were used, and evaluated with metrics from the 2021 Deep Continual Learning for Emerging Emotion Recognition initiative. Under this methodology, the sequence of classes was randomized, to test the model’s resilience to the order in which new classes arrive.
Though the paper does not cite the rival frameworks in full (their abbreviations are listed in the results table, see below), the contenders included LUCIR-CNN, PODNet, iCaRL, and Deep SLDA.

In line with previous work, the system was also set against a number of non-continual learning approaches and datasets, which once again evaluated the 21 facial expressions. Here too, the new approach led the board.

The few-shot approach was also evaluated, using the same hyperparameters as for the continual learning tests, this time comparing accuracy across different numbers of steps. However, the results proved largely invariant across these settings.

The authors comment:
‘We demonstrate improvements in continual learning for complex FER through our novel knowledge distillation and Predictive Sorting Memory Replay techniques, achieving the state-of-the-art with 74.28% Overall Accuracy on new classes only (an improvement of 0.67%).
‘The Overall Accuracy on all classes is 73.27% which is comparable to baseline results and indicates a reduction in the effects of catastrophic forgetting as the accuracy on known classes is not greatly impacted when learning new classes […]
‘Our method also demonstrates an improvement in accuracy over other state-of-the-art non-continual learning methods for facial expression recognition by 13.95%. This demonstrates the benefits of our approach to learning facial expressions through continual learning, by first learning to recognise basic facial expressions and then synthesising that knowledge to learn new complex facial expressions in a similar way to humans.’
Conclusion
This is important work. However, its significance may not be recognized for some time, since improving the discernment of compound expressions is currently equivalent to designing a better carburetor while work is still continuing on the basic design of the combustion engine.
Nonetheless, the advent of emotionally stunted facial expression ranges is just round the corner, and quite soon, directors and media producers will be wanting a little more from expression editing and facial synthesis systems than the six basic expressions offered by FACS.
As it stands, a number of systems already allow practitioners to manipulate individual codes in the latent space, so that compound expressions could arguably be achieved piecemeal, by gradually opening or narrowing eyes, turning up or down corners of the mouth, and so on. But this is a laborious and interpretive approach that, though fine-grained, does not get the VFX professional much further than the instrumentality of older CGI-based facial rigs.
By contrast, developing a more sophisticated lexicon of facial expressions opens up the possibility of per-character automated labeling of facial datasets, so that one could eventually have a ‘happily disgusted’ slider, which would be a challenging facial expression to formulate manually.
* My conversion of the authors’ inline citations to hyperlinks.