Using ChatGPT and CLIP to Augment Facial Emotion Recognition (FER)

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

The idea of attempting to understand the emotional significance of facial expressions from a broader context is gaining ground every year in the emotion recognition research sector. Multimodal datasets and associated architectures seek to infer emotion from audio, text, and other ancillary clues, rather than just looking at a static snapshot of a face and attempting to guess the underlying emotion it represents, entirely free of context.

Text can add context to facial expression recognition. Here we see text descriptions contributing to a wider understanding of source video clips, in the MAFW dataset (studied in the new paper). Source:

Another factor that can help to identify a facial emotion is temporality – if one can see the emotion emerging in context, this can provide an additional clue as to what the emotion might be. This practice is known as Dynamic Facial Expression Recognition (DFER).

Expressions tend to evolve into other expressions as situations develop. Being able to keep track of changing expressions increases the chance of correct expression identification in general. Source:

In theory, the new generation of Vision Transformer (ViT)-based multimodal systems, such as CLIP, BLIP and even the more marginal outings such as WD14, could be of immense use in this respect, since such systems have been trained on hyperscale dataset pairings of images and text.

Vision/text-based systems such as BLIP are able to infer the content of a photo based on having been trained on enormous amounts of text/image pairs. Source:

In reality, as with the Stable Diffusion generative framework that brought CLIP to prominence, the entire architecture is designed to evaluate or generate single images; like the latent diffusion models (LDMs) that they support, approaches such as CLIP have no native mechanisms that could help us to evaluate temporal changes in expression over the course of a video clip or feed.

Now new research from the UK is proposing a system, titled DFER-CLIP, that allows these powerful but ‘static’ analytical systems to operate on video, and even to use generative descriptive systems such as ChatGPT to help formulate the descriptions that will enrich the project’s knowledge of facial emotions in video.

The conceptual workflow for DFER-CLIP. Source:

The visual aspect of the new work leverages CLIP, but embeds its functionality within an array of several Transformer encoders, with the final extracted feature embedding taken from a learnable class token.

For the textual component, large language models (LLMs, such as ChatGPT) are used to provide descriptions of the emotion in question. Since DFER-CLIP largely conforms to the sector's restriction to the six or seven 'essential' facial expressions of the Facial Action Coding System (FACS), this querying can be done once per class, and the LLM does not need to be hooked up to a vision interpreter.

In tests, the researchers found that DFER-CLIP achieved state-of-the-art results in comparison to analogous techniques, and they have released code for the system.

The new paper is titled Prompting Visual-Language Models for Dynamic Facial Expression Recognition, and comes from two researchers at the School of Electronic Engineering and Computer Science at Queen Mary University of London.

Approach

In a paper that is richer in formulae than in exact details about the origin and configuration of the contributing modules, the authors state that each of the encoders providing CLIP with a temporal context is made up of a multi-head self-attention module and a feed-forward network, and that each of these modules is trained from zero (rather than pretrained).

The frame-level features are first extracted by CLIP's visual encoder, before being bundled into a sequence that includes a learnable class token, which is then passed to the later part of the network together with the temporal position of each frame (i.e., the frame number from the sequence of extracted frames).
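A minimal sketch of this temporal stage might look as follows. The class and module names, dimensions, and initialization here are assumptions for illustration, not the authors' released code; the structure simply mirrors the description above: per-frame CLIP features, a prepended learnable class token, positional embeddings, and a Transformer encoder trained from scratch.

```python
import torch
import torch.nn as nn

class TemporalHead(nn.Module):
    """Hypothetical sketch of DFER-CLIP's temporal stage (not the authors' code).

    Per-frame CLIP features are prepended with a learnable class token,
    given learnable positional embeddings, and passed through a Transformer
    encoder trained from scratch. The class-token output is the video-level
    feature."""
    def __init__(self, dim=512, frames=16, heads=8, layers=1):
        super().__init__()
        self.cls = nn.Parameter(torch.zeros(1, 1, dim))
        self.pos = nn.Parameter(torch.zeros(1, frames + 1, dim))
        enc = nn.TransformerEncoderLayer(d_model=dim, nhead=heads, batch_first=True)
        self.encoder = nn.TransformerEncoder(enc, num_layers=layers)

    def forward(self, frame_feats):  # (batch, frames, dim) from CLIP's image encoder
        b = frame_feats.shape[0]
        x = torch.cat([self.cls.expand(b, -1, -1), frame_feats], dim=1)
        x = self.encoder(x + self.pos)  # positional info encodes frame order
        return x[:, 0]                  # class-token embedding = video-level feature

feats = torch.randn(2, 16, 512)        # dummy CLIP features: 2 clips of 16 frames
video_emb = TemporalHead()(feats)
print(video_emb.shape)
```

Note that the single-layer default reflects the ablation finding reported later in the article, where a one-layer temporal model performed best.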

Prior and base approaches, compared to DFER-CLIP.

The main prior approach to DFER involved the use of a classifier to process individual frames, resulting in per-frame analyses of the expressions identified therein (‘a’, left, in the image above). The default CLIP methodology (‘b’, middle, in the above image) is to process the frame through ViT and arrive at a class on a per-image basis. In effect, the previous approach (‘a’) is only the CLIP approach applied per frame and concatenated, and does not account in a meaningful way for temporality.

DFER-CLIP, instead (‘c’, right, in the image above), passes a learnable class token through into this process, and additionally takes account of where in the frame sequence the current image resides, which leads to genuinely temporal modelling of the facial expression as it develops.

The learnable prompt, the authors explain, acts in lieu of mere class names, and forms a context for descriptors in each class. They state:

‘During the training phase, the CLIP text encoder is fixed and we fine-tune the CLIP image encoder. The temporal model, learnable class token, and learnable context are all learned from scratch.

‘DFER-CLIP is trained end to end and the cross-entropy loss is adopted for measuring the distance between the prediction and the ground-truth labels.’
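The training objective quoted above can be sketched as follows. This is a hedged reconstruction, not the authors' implementation: it assumes (as is standard for CLIP-style models, and consistent with the temperature value of 0.01 mentioned later in the article) that the video embedding is matched against each class's text embedding by temperature-scaled cosine similarity, with cross-entropy against the ground-truth label.

```python
import torch
import torch.nn.functional as F

# Hypothetical sketch of the DFER-CLIP training objective (assumed form):
# cosine similarity between the video-level visual embedding and each class's
# text embedding, scaled by a temperature, trained with cross-entropy.
def dfer_clip_loss(video_emb, class_text_embs, labels, temperature=0.01):
    v = F.normalize(video_emb, dim=-1)        # (batch, dim)
    t = F.normalize(class_text_embs, dim=-1)  # (num_classes, dim)
    logits = v @ t.T / temperature            # (batch, num_classes)
    return F.cross_entropy(logits, labels)

loss = dfer_clip_loss(torch.randn(4, 512),    # 4 dummy video embeddings
                      torch.randn(7, 512),    # 7 FACS class text embeddings
                      torch.tensor([0, 3, 6, 1]))
print(loss.item())
```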

The second component is the text description. Since CLIP learns semantic information from natural language, the authors decided that an LLM such as ChatGPT could provide the textual descriptions, rather than relying on bare class labels.

The correlation between text and imagery in CLIP, illustrated in the original paper. Source:

Since, as with many FER projects, DFER-CLIP is dealing with a very restricted range of emotions, and with no compound emotions, it’s not logistically difficult to accomplish this manually. The authors state:

‘Instead of manually designing the facial expression description, we prompt a large language model such as ChatGPT to automatically generate descriptions based on contextual information. We prompt the language model with the input:

‘Q: What are useful visual features for the facial expression of {class name}?’
‘A: Some useful visual features for facial expressions of {class name} include: …’

‘All the generated descriptors of each facial expression class will be combined to form a comprehensive description.’

Examples of LLM-generated description text for DFER-CLIP. The additional emotions acknowledged in MAFW (see below) augment the usual seven FACS facial expressions.

This is about as much explanation of the core approach as the ‘Approach’ section of the new paper offers, though some additional details come to light in the testing phase.

Data and Tests

All experiments were conducted on three DFER benchmark datasets: DFEW, which covers the seven basic FACS expressions (happiness, sadness, neutral, anger, surprise, disgust and fear) across 11,697 in-the-wild clips, each labeled by ten human annotators, and each sourced from movies; FERV39k, which contains 38,935 in-the-wild clips covering four scenarios (crime, daily life, speech, war), and which again covers the seven core FACS expressions; and MAFW (seen in an earlier image, above), which contains 10,045 in-the-wild clips, and which adds contempt, anxiety, helplessness, and disappointment to the core FACS emotions, for a total of 11 facial expressions.

Examples from a typical FACS-based grouping of seven emotions, from the DFEW dataset, included in the testing roster. Source:

In accordance with the original design of CLIP, the maximum number of text tokens allowed in the experiments is 77, and the temperature hyperparameter is set to 0.01. The 16 frames used in the sequences are processed in accordance with the sampling strategies of many prior works (six are cited), resized to 224x224px, with random data augmentation applied.
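Such frame-sampling strategies typically pick a fixed number of indices spread evenly across a clip. The sketch below is a generic uniform-sampling illustration under that assumption; the six cited works differ in their exact schemes, and this is not claimed to reproduce any of them.

```python
# A minimal, generic sketch of uniform clip sampling: pick 16 frame indices
# spread evenly across a clip (one per equal-length segment, at its midpoint).
# The exact strategies in the cited prior works vary.
def sample_frame_indices(num_frames_in_clip, num_samples=16):
    step = num_frames_in_clip / num_samples
    return [int(i * step + step / 2) for i in range(num_samples)]

idx = sample_frame_indices(120)   # e.g. a 120-frame clip
print(idx)
```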

Due to the resource-heavy nature of processing video, DFER-CLIP uses the ViT-B/32-based version of CLIP. All models were trained on a Tesla A100 GPU (VRAM allocation unspecified, as it varies in this model). A stochastic gradient descent (SGD) optimizer was used with a mini-batch size of 48 (see source paper for further training details). Models were trained three times with varying random seeds, and the average result used in the final tests.

Evaluation metrics used were weighted average recall (WAR, which measures accuracy), and unweighted average recall (UAR, where accuracy is evaluated on a per-class basis, regardless of the number of data instances).
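The difference between the two metrics can be made concrete with a short sketch: WAR is plain accuracy over all samples, while UAR averages per-class recall, so that small classes count as much as large ones (the variable names here are illustrative).

```python
# WAR (weighted average recall): overall accuracy across all samples.
def war(y_true, y_pred):
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

# UAR (unweighted average recall): recall computed per class, then averaged,
# regardless of how many samples each class contains.
def uar(y_true, y_pred):
    recalls = []
    for c in set(y_true):
        idx = [i for i, t in enumerate(y_true) if t == c]
        recalls.append(sum(y_pred[i] == c for i in idx) / len(idx))
    return sum(recalls) / len(recalls)

y_true = [0, 0, 0, 1]   # class 1 is under-represented
y_pred = [0, 0, 1, 1]
print(war(y_true, y_pred))  # 0.75 — 3 of 4 samples correct
print(uar(y_true, y_pred))  # ~0.833 — mean of per-class recalls (2/3 and 1)
```

On an imbalanced dataset such as MAFW, a classifier that ignores rare classes can score a high WAR but a poor UAR, which is why both are reported.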

Unusually, ablation studies form the entirety of the results section, as the researchers studied the effect of removing various functionalities from the system.

Initially, tests were conducted to evaluate the temporal model and context prompts.

Results from the evaluation of learnable context prompts and temporal model depths.

The authors observe that the adoption of a temporal (rather than per-frame) model improves UAR performance by 2.22%, 0.63% and 1.38% across the range of prompts, and across the three datasets. They observe:

‘In general, deeper models can get better performance, but in our DFER-CLIP, the best performance is obtained under the one-layer temporal model. This is because the temporal model is trained from scratch and may overfit if it is complex.

‘This is also a consideration in the learnable context, in which more learnable vectors do not improve the results. We believe that increasing the learnable context number or temporal model depth will cause overfitting on the training data, resulting in a worse generalization performance on test data.’

Next, the researchers tested the effect of different training strategies, comparing classifier-based approaches (which learn a classification head over the visual embeddings) against text-based, classifier-free ones. For the classifier-based tests, they used linear probing, and full fine-tuning both with and without a temporal model (‘TM’ in the results below).

To test the text-based classifier-free method, the authors used zero-shot CLIP, Zero-shot FaRL, CoOp, and Co-CoOp.

Different training strategies cross-tested.

Of these results, the researchers state*:

‘[Our] method outperforms Fully Fine-tuning in UAR by 3.91%, 1.63%, and 2.36%, and in WAR by 2.84%, 0.88%, and 2.07% on DFER, FERV39k, and MAFW datasets, respectively. Even without the temporal model, our method is better than all the classifier-based methods. We also add a temporal model for the Fully Fine-tuning strategy, the results demonstrate our method is still superior to it.

‘We also compare our method with [zero-shot CLIP] and [zero-shot FaRL], in which FaRL is pre-trained on the large-scale visual-language face data. The [results] show that fine-tuning the image encoder can improve the performance remarkably. [CoOp] and [Co-CoOp] are all using the learnable context, in which the Co-CoOp also add projected image features into the context prompt.’

DFER-CLIP’s classification performance was also tested in a prompt-engineering context, against two kinds of manual prompt: ‘A photo of [class]’ and ‘An expression of [class]’.

Results from tests against manual prompts.

The authors observe that in these tests, their own more elaborate, ChatGPT-style approach improves on the manual alternatives – except that in the MAFW set, the presence of four extra expressions (mentioned earlier) tends to skew the results against DFER-CLIP.

The researchers further observe that the video samples in the dataset are not equally representative, with contempt, helplessness and disappointment occupying only 2.57%, 2.86% and 1.98% of the data, respectively.

They state:

‘[The results] demonstrate that the learning-based context consistently achieves the best WAR results. Furthermore, our method outperforms using the prompt of the class name with the learnable context approach, which indicates the effectiveness of using descriptions.’

Finally, the authors conducted a comparison against analogous state-of-the-art approaches. Frameworks compared here were C3D; P3D; I3D-RGB; 3D ResNet18; R(2+1)D18; ResNet18LSTM; ResNet18ViT; EC-STFL; Former-DFER; NR-DFERNet; DPCNet; T-ESFL; EST; LOGO-Former; IAL; CLIPER; M3DFEL; and AEN.

These experiments were conducted under five-fold cross-validation where applicable, with the standard training and test split used for FERV39k.

Results against the nearest available state-of-the-art methods.

Regarding this outcome, the researchers comment:

‘The comparative [performance] demonstrates that the proposed DFER-CLIP outperforms the compared methods both in UAR and WAR. Specifically, compared with the previous best results, our method shows a UAR improvement of 2.05%, 0.04%, and 4.09% and a WAR improvement of 0.41%, 0.31%, and 4.37% on DFEW, FERV39k, and MAFW, respectively.

‘It should be pointed out that FERV39k is the current largest DFER benchmark with 38,935 videos. Given this substantial scale, making significant enhancements becomes a formidable task.’

Conclusion

It was perhaps inevitable that the superior generative powers of LLMs would encroach stealthily upon more traditional methods of annotation and labeling – and DFER-CLIP is unlikely to be the last word that we hear about this, either in temporal or static labeling systems.

What perhaps stands against this approach, in a broader context, is the extent to which the labeling cultures of generative systems tend to become entrenched and arguably even ‘over-fitted’ to the architectures that they are most popular on.

One example of this is the ‘1girl’ tag in WD14 – an apparently random tag which has come to carry the weight of the concept ‘sole female in picture’. Though such peculiar and very specific tags were presumably invented ad hoc at some point, perpetuating them is a great temptation as they ‘go viral’ and propagate through ancillary products and frameworks.

This is hardly ‘natural language’. At that stage of entrenchment, the labeling system has instead become an eccentric or particular language or codex that must be learned on its own terms, rather than a more desirable and generic schema of ductile and developing labeling habits.

Since all generative systems, including ChatGPT (which we presume to have been used in DFER-CLIP, though this is never explicitly stated), have their own similar ‘quirks’, this is a structural challenge for the future of truly open and evolving systems.

* Citations omitted, as hyperlinks for the projects are already supplied in this article.
