Better Open Source Facial Emotion Recognition With LibreFace

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

The art (it is difficult to refer to it yet as a science) of Facial Emotion Recognition (FER) is, arguably, in a nascent state. The prevailing methodologies, such as the Facial Action Coding System (FACS), are subject to frequent criticism; and the available tools to implement these principles – assuming you support those principles – are often either lightweight but out-of-date, or else reliant on heavy and unwieldy computing systems – and more likely to be proprietary or closed-source code.

Because consensus about FER is lacking, and the field of AI-aided emotion recognition is only just now being defined, available libraries and frameworks fall short of the current needs of the academic research sector.

One such need is for an easily deployable open source system capable of evaluating images of faces in terms of emotion recognition. Many of the most interesting tools in this regard (usually centered around GPU-based AI training) remain proprietary. The alternatives, notably the OpenFace and OpenFace 2 behavioral analysis tool-kits, rely on older statistical analysis approaches such as Support Vector Machine (SVM) and Histograms of Oriented Gradients (HOG) – stalwart, venerable technologies that are now more often found as minor components in larger and more complex frameworks, rather than as the central spine of an evaluative architecture.

To address this shortfall, a new initiative from the University of Southern California proposes to release a novel open source framework titled LibreFace (named, apparently, as a tribute to the open source Microsoft Office alternative LibreOffice).

The GUI for LibreFace in action. Source:

LibreFace bridges the gap between the (in the authors’ collective opinion) outdated approaches of the OpenFace project and the rigors of developing full-fledged data-gathering and training pipelines for more recent and burdensome frameworks.

The system gathers together some very modern but portable FOSS components, such as MediaPipe (which will be familiar to many users of Stable Diffusion), as well as leveraging several state-of-the-art FER-relevant datasets, to compose a rational system that can run either on CPU or on a GPU – and which, in tests, was found to run twice as fast on CPU as OpenFace.

In addition to this, LibreFace achieves superior performance to OpenFace in general, and is able to perform comparably with other, much heavier and resource-intensive systems.

LibreFace was developed as an array of .NET libraries, and is intended to operate across a number of platforms as an executable; though the current build is a Windows version, the researchers plan to bring LibreFace to macOS and Linux. They also plan to release the code at this URL (though at the time of writing, the repository is empty).

In terms of applicability to image synthesis, and the creation of neural characters, better FER tools are always needed, and a system such as LibreFace could be used to evaluate the ‘emotional temperature’ of neural faces, both as a filtering tool, and as an aid to development in new systems intended to allow creative practitioners to alter facial expressions.

Currently, most such systems involve simply pushing concepts into the latent space in order to change individual parts of the face, until an instinctively (i.e., by human interpretation) ‘correct’ expression emerges. Therefore any research, and any associated tools, that can help to develop a flexible lexicon of cohesive facial affect values, such as ‘happy’ or ‘sad’, and to at least semi-automate this process, is going to be quite useful.

The new paper announcing the work is titled LibreFace: An Open-Source Toolkit for Deep Facial Expression Analysis, and comes from five researchers at USC’s Institute for Creative Technologies.


The system comprises four stages: first, the source facial images are pre-processed, using MediaPipe to create an interpreted mesh, from which facial landmarks can be derived; the results from this stage are then fed into a masked autoencoder (MAE) originally developed by Facebook Research, and then to a linear regression or classification layer that rates the image for Action Unit intensity (i.e., how much the constituent parts of the face image may be said to be producing a specific emotion or compound emotion).

With the MAE fine-tuned thus, feature-wise distillation transfers what the MAE learned into a lightweight student model based on ResNet-18; and lastly, the ResNet-18 output is used to infer the FER characteristics.

Conceptual architecture for LibreFace.

The MAE teacher model is pretrained on EmotioNet, with model weights furnished by the Chinese 2022 project Multi-Task Learning Framework.

Landmark recognition in EmotioNet, a contributing framework for LibreFace. Source:

With the weights of the ViT-based MAE as the backbone, a linear classifier is added to the process, and the data further trained on the AffectNet and FFHQ datasets.

Action units (AUs) in AffectNet (left), and a graph of the valence-arousal space. Source:

The model is then fine-tuned on the spontaneous facial action intensity database DISFA for AU intensity estimation (i.e., rating how ‘extreme’ a recognized expression is). A Mean Squared Error (MSE) loss function is used for this purpose.
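As a rough illustration of the loss named above, the following minimal sketch computes MSE between predicted and labeled AU intensities for a single frame. The function name and the sample values are hypothetical, and the 0-5 intensity range in the comment follows DISFA's labeling convention rather than anything specific to LibreFace's code.

```python
def mse_loss(predicted, target):
    """Mean squared error between predicted and labeled AU intensities."""
    if len(predicted) != len(target):
        raise ValueError("predicted and target must have the same length")
    return sum((p - t) ** 2 for p, t in zip(predicted, target)) / len(target)


# Hypothetical intensities (0 = absent, 5 = maximum) for three action units.
predicted = [0.5, 2.0, 4.0]
target = [0.0, 2.0, 5.0]

loss = mse_loss(predicted, target)
print(f"MSE loss: {loss:.4f}")
```

Because the error is squared, a prediction that is off by a whole intensity level is penalized four times as heavily as one that is off by half a level.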

Beginning with a neutral expression, subsequent action units are illustrated in the DISFA database. Source:

In a secondary strategy, ResNet-18 is also used as an encoder, and trained on AffectNet and FFHQ, together with a linear classifier, before fine-tuning on DISFA.

The two pre-training strategies for LibreFace.

Since the ViT backbone in these strategies is quite resource-intensive, the researchers sought to slim down the pipeline by passing the data through the aforementioned student-teacher model, where select results from a heavier framework are passed to a lighter and more adroit module, based on the methodology of prior research from Samsung and the University of Nottingham in the UK.

Schematic of the feature-wise distillation proposed for LibreFace. The student model's encoder transfers the knowledge from the teacher model's encoder.

In the LibreFace implementation, the pre-trained teacher classifier is frozen (i.e., it cannot be affected by the training process, and therefore furnishes reliable and consistent values based on its prior training), and this frozen model is used to inform both the teacher and student models, in contrast to the original approach used for this technique.
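The general idea can be sketched in a few lines. The version below is an assumption-laden illustration rather than the authors' implementation: it supposes that the student and teacher features share a dimension, that the frozen head is a plain linear layer, and that the two loss terms are simply added with hypothetical weights (`feat_weight`, `out_weight`). None of these details are confirmed by the paper as described here.

```python
def linear_head(features, weights, bias):
    """Apply a frozen linear head: one output per row of `weights`."""
    return [sum(w * f for w, f in zip(row, features)) + b
            for row, b in zip(weights, bias)]


def mse(a, b):
    """Mean squared error between two equal-length vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)


def distillation_loss(student_feat, teacher_feat, head_w, head_b,
                      feat_weight=1.0, out_weight=1.0):
    """Feature-matching term plus an output-matching term via the frozen head."""
    # Term 1: pull the student's features toward the teacher's features.
    feature_term = mse(student_feat, teacher_feat)
    # Term 2: pass BOTH feature vectors through the same frozen head,
    # and penalize any disagreement between the resulting predictions.
    student_out = linear_head(student_feat, head_w, head_b)
    teacher_out = linear_head(teacher_feat, head_w, head_b)
    output_term = mse(student_out, teacher_out)
    return feat_weight * feature_term + out_weight * output_term


# Hypothetical two-dimensional features and a one-output frozen head.
teacher_feat = [1.0, 2.0]
student_feat = [1.0, 1.0]
head_w = [[1.0, 1.0]]
head_b = [0.0]

total = distillation_loss(student_feat, teacher_feat, head_w, head_b)
print(total)
```

Only the student's encoder would receive gradient updates in such a scheme; the head's weights stay fixed, so it acts as a stable yardstick for both sets of features.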

Data and Tests

The system was tested using PyTorch, and the authors state that the code and model weights will be made available for reproducibility later. All experiments and training were conducted on a single NVIDIA RTX 8000 GPU, which features a formidable 48GB of VRAM. However, as stated, the system can likewise run on a CPU.

Input images were resized to 256x256px. To increase the variety of data, training routines can optionally perform data augmentation, where the source data is fed to the system a number of times with diverse transformations, such as flipping (reversing) an image randomly, changing its angle, and even turning it upside down.

Typical types of data augmentation, designed to create resilient and flexible models from what may be limited data. Source:

For systems that are seeking to reproduce a particular individual, such augmentations are not advised, since most people do not have perfectly symmetrical faces; however, in terms of generic emotion recognition, parity of symmetry is irrelevant, and augmenting the data in this way can aid generalization.

The source data was therefore augmented in these ways for the LibreFace tests.
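The transformations described above are simple to express in code. The sketch below treats an image as a nested list of pixel values and applies each transformation at random; the function names, the 50% probability, and the use of a 90-degree turn as a stand-in for 'changing the angle' are all assumptions made for illustration, not details taken from the paper.

```python
import random


def horizontal_flip(image):
    """Mirror the image left-to-right by reversing each row."""
    return [row[::-1] for row in image]


def rotate_90(image):
    """Rotate the image 90 degrees clockwise."""
    return [list(row) for row in zip(*image[::-1])]


def rotate_180(image):
    """Turn the image upside down (a 180-degree rotation)."""
    return [row[::-1] for row in image[::-1]]


def augment(image, rng):
    """Apply each transformation independently with probability 0.5."""
    for transform in (horizontal_flip, rotate_90, rotate_180):
        if rng.random() < 0.5:
            image = transform(image)
    return image


# A tiny 2x2 'image' stands in for a 256x256 face crop.
image = [[1, 2],
         [3, 4]]
rng = random.Random(0)  # seeded so that runs are repeatable
print(augment(image, rng))
```

Real pipelines typically rotate by small arbitrary angles, which requires interpolation between pixels; the 90-degree case is used here only because it can be done exactly without an imaging library.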

The model was trained with the AdamW optimizer at a rather high batch size of 28. Due to this high batch size, the learning rate was set to a moderate 3e-5, with a weight decay of 1e-4. If the learning rate had been set very low, as is increasingly becoming common in generative frameworks, the system would have learned a lot about each individual face, but would not have gained such a useful general understanding of the gamut of expressions studied.
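To show where these two numbers enter the picture, here is a minimal sketch of a single AdamW update for one scalar parameter. The learning rate and weight decay are the values quoted above; the `beta1`, `beta2` and `eps` values are the common defaults (as used in PyTorch), and are an assumption, since they are not stated here. The parameter and gradient values are hypothetical.

```python
import math


def adamw_step(param, grad, state, lr=3e-5, weight_decay=1e-4,
               beta1=0.9, beta2=0.999, eps=1e-8):
    """Perform one AdamW update on a scalar parameter and return the new value."""
    state["t"] += 1
    t = state["t"]
    # Decoupled weight decay: shrink the parameter directly,
    # instead of adding a penalty term to the gradient.
    param = param * (1.0 - lr * weight_decay)
    # Running averages of the gradient and of its square.
    state["m"] = beta1 * state["m"] + (1.0 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1.0 - beta2) * grad * grad
    # Bias correction compensates for the zero-initialized averages.
    m_hat = state["m"] / (1.0 - beta1 ** t)
    v_hat = state["v"] / (1.0 - beta2 ** t)
    return param - lr * m_hat / (math.sqrt(v_hat) + eps)


state = {"t": 0, "m": 0.0, "v": 0.0}
new_param = adamw_step(param=1.0, grad=0.5, state=state)
print(new_param)
```

On the very first step the update is close to the learning rate itself, regardless of the gradient's magnitude, which is one reason the learning rate has such a direct effect on how quickly the model moves.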

The model was trained for a maximum of 20 epochs (an epoch being a complete, end-to-end examination of the data by the training process), with early stopping (i.e., halting training once the model appears to have reached its optimal convergence, typically judged by its performance on the validation set, rather than running all the scheduled epochs).
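One widely used automated form of early stopping watches the validation loss and gives up after a set number of epochs without improvement (the 'patience'). The sketch below illustrates that pattern; the 20-epoch cap comes from the text above, while the patience of 3 and the list of validation losses are hypothetical, and the paper's own stopping criterion is not specified here.

```python
def train_with_early_stopping(val_losses, max_epochs=20, patience=3):
    """Return (epoch at which training stopped, epoch with the best loss)."""
    best_loss = float("inf")
    best_epoch = 0
    epochs_without_improvement = 0
    for epoch, val_loss in enumerate(val_losses[:max_epochs], start=1):
        if val_loss < best_loss:
            # New best result: remember it and reset the counter.
            best_loss = val_loss
            best_epoch = epoch
            epochs_without_improvement = 0
        else:
            epochs_without_improvement += 1
            if epochs_without_improvement >= patience:
                return epoch, best_epoch
    return min(len(val_losses), max_epochs), best_epoch


# Hypothetical validation losses: improvement stalls after epoch 3.
simulated_losses = [0.9, 0.7, 0.6, 0.61, 0.62, 0.63, 0.5]
stopped_at, best_epoch = train_with_early_stopping(simulated_losses)
print(stopped_at, best_epoch)  # prints: 6 3
```

In this run, training halts at epoch 6 and never sees the lower loss at epoch 7, which illustrates the trade-off involved in choosing a patience value.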

It was trained on MSE loss for AU intensity, and also cross-entropy loss for FER, with evaluation against a validation set after each epoch (i.e., part of the source data was held back from training and used instead to test the progress, flexibility and accuracy of the model).
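For the FER side, cross-entropy loss penalizes the model according to how little probability it assigned to the correct emotion class. A minimal sketch follows; the three class names and the logit values are hypothetical, chosen only to keep the example small.

```python
import math


def softmax(logits):
    """Convert raw scores into probabilities that sum to 1."""
    peak = max(logits)  # subtracting the maximum avoids overflow in exp()
    exps = [math.exp(x - peak) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def cross_entropy(logits, true_class):
    """Negative log-probability assigned to the correct class."""
    return -math.log(softmax(logits)[true_class])


# Hypothetical classes and raw model outputs for one face image.
classes = ["neutral", "happy", "sad"]
logits = [0.2, 2.5, -1.0]
true_class = classes.index("happy")

print(f"loss when 'happy' is correct: {cross_entropy(logits, true_class):.4f}")
print(f"loss when 'sad' is correct:   {cross_entropy(logits, 2):.4f}")
```

The second loss is much larger than the first, since the model gave 'sad' the lowest score of the three.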

The system was coded in C#, and comprises three essential components: the MediaPipe mesh extraction pipeline; an image aligner; and Open Neural Network Exchange (ONNX) models (a compressed and portable, rationalized format for trained models).

A graphical user interface (see the first image above) was developed using components from the Microsoft Platform for Situated Intelligence.

In addition to the aforementioned datasets used in training the LibreFace model, the researchers also used the RAF-DB dataset at the end of each approach.

Rival approaches tested were Deep learning-based FACS Action Unit occurrence and intensity estimation, a deep convolutional network designed to evaluate FER (‘CNN’ in results); Deep Region and Multi-label Learning for Facial Action Unit Detection (‘D-CNN’ in results); and OpenFace 2.0.

Results for the performance of LibreFace on the DISFA dataset, carried out with five-fold cross validation, and Pearson Correlation Coefficient (

Of these results, the authors state*:

‘In addition to achieving the best average [Pearson Correlation Coefficient], our method also performs better than other methods on 9 out of total 12 AUs included in DISFA.’
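For readers unfamiliar with the metric, the Pearson Correlation Coefficient measures how closely predicted intensities rise and fall with the labeled ones, on a scale from -1 to 1. The following minimal sketch computes it for a single action unit across a handful of frames; the function name and the sample values are hypothetical.

```python
import math


def pearson_correlation(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    n = len(xs)
    if n != len(ys) or n < 2:
        raise ValueError("need two sequences of equal length, at least 2 items")
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    spread_x = sum((x - mean_x) ** 2 for x in xs)
    spread_y = sum((y - mean_y) ** 2 for y in ys)
    if spread_x == 0 or spread_y == 0:
        raise ValueError("correlation is undefined for a constant sequence")
    return covariance / math.sqrt(spread_x * spread_y)


# Hypothetical labeled and predicted intensities for one AU over five frames.
labeled = [0, 1, 3, 4, 2]
predicted = [0.2, 1.1, 2.7, 3.5, 2.4]
print(f"PCC: {pearson_correlation(labeled, predicted):.3f}")
```

Note that the coefficient rewards predictions that track the trend of the labels, even if they are consistently offset or scaled, which is why it is usually reported alongside an error measure.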

The researchers also fine-tuned LibreFace on BP4D-Spontaneous, to cover AU detection (or lack thereof) for the action units that were not labeled in DISFA.

Here the competing frameworks were Deep Adaptive Attention for Joint Facial Action Unit Detection and Face Alignment; and Learning Multi-dimensional Edge Feature-based AU Relation Graph for Facial Action Unit Recognition (as well as base ResNet-18 and the baseline for the MAE ViT).

Results against BP4D.

As we can see, the results across these frameworks are less definitive for LibreFace.

Further tests were conducted with FER accuracy as the core metric. Rival frameworks here were Very Deep Convolutional Networks for Large-Scale Image Recognition (‘VGG-16’ in results below); Deeply-Learned Part-Aligned Representations for Person Re-Identification (‘DLP-CNN’ in results below); GAN-Inpainting; and Occlusion aware facial expression recognition using CNN with attention mechanism (‘gaCNN’ in results below).

FER tests.

Here the authors comment:

‘We achieve comparable results to the state-of-the-art methods, which require far more computation and training/inference time than LibreFace.’

Though we do not usually cover a paper’s ablation studies (i.e., the practice of removing facets of the new system to see the extent to which results are affected), it’s worth noting, as the authors observe, that LibreFace was found in these tests to operate twice as fast as OpenFace 2.0.

In conclusion, the authors state:

‘Extensive experiments show that LibreFace provides a more accurate, comprehensive, and efficient alternative to [OpenFace 2.0], the most commonly used facial expression analysis toolkit, and other open facial behavior analysis software.

‘For AU intensity estimation, LibreFace achieves superior performance than [OpenFace 2.0] while running two times faster on a CPU-only environment. For FER, LibreFace can achieve competitive performance to the heavy state-of-the-art methods.’


There’s a lot of money in this sector, mainly due to the potential of FER to augment security systems and other ‘down to earth’ applications. Consequently, the general run of FER literature and studies produces an above-average number of ‘closed’ papers, despite the fact that the more community-minded neural facial image synthesis sector is going to be in imminent need of accessible, resource-efficient pipelines for FER – and that, to date, it has been stuck with some rather dated statistical methods of achieving this.

Though LibreFace cannot clear up the current controversies about the value and applicability of systems such as FACS, it does at least potentially offer a rational and capable modern framework that would sit well in many a neural synthesis pipeline, and makes use of far more recent approaches and methodologies than the OpenFace series.

As is so often the case, we will have to see if the weights and code that the authors promise will populate the currently-empty repository actually manifest, so that the sector can try LibreFace out first-hand.

* My conversion of the authors’ inline citations to hyperlinks.
