Restoring Facial Expressions with CycleGAN

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

Restoring hidden parts of facial images is a strong strand of research in the computer vision sector. To make new models robust to potentially damaged or partial data, many are trained with masking methodologies, where, at any one time in the training process, parts of the source face are blacked out or in some way perturbed, so that the model is forced to develop an agile understanding of faces instead of merely memorizing the faces that it is seeing.

From a 2021 paper from Facebook research, the process of obscuring parts of the image so that a model will learn to infer what ought to be in the hidden areas, because it will have seen those areas unobscured at other times. Source: https://arxiv.org/pdf/2111.06377.pdf

Additionally, the idea of hiding parts of the data so that the system is more resilient is not limited to computer vision models, but is also practiced in the development of language-based models.

In terms of Facial Expression Recognition (FER), this kind of semi-blind training can be a great aid to the development of more robust emotion recognition functionality. As humans, we are so sensitized to reading emotions as a survival or evolutionary skill, that most of us are able to infer emotional state from very limited information. Machine learning models, however, tend towards binary classification and thresholds rather than spectra, and often need larger volumes of known signifiers than a person would.

Therefore, the prevailing practice in model training for these purposes has been to fine-tune existing models, so that they are endowed with these extra perceptual capabilities. But as we have noted before, fine-tuning not only affects all the weights in the model (rendering it less useful, or even useless, for the wider tasks it was originally trained for), but is also a resource-intensive and time-consuming process.

Now, a new paper from Germany is offering a method of reconstructing obscured parts of a face without the need for fine-tuning, instead using Generative Adversarial Networks (GANs) to reformulate training data.

Restoring hidden parts of the face with CycleGAN. Source: https://arxiv.org/pdf/2311.05221.pdf

In tests, the new method is able to notably improve on facial capture reconstructions obtained from real-world facial sensors, which is a significant achievement.

The core intended application of the approach is to facilitate FER through analysis of the disposition of facial muscles, largely according to the Facial Action Coding System (FACS), a methodology developed in the 1970s that has become dominant in the field, despite notable criticism. The authors believe that the new system will have wider applications, however.

The facial muscle dispositions that connote emotional states, according to FACS. Source: https://www.eiagroup.com/the-facial-action-coding-system/

The paper states:

‘This proposed approach retains the visual appearances of the test subjects. In fact, we show that completely covered facial features can be restored correctly. With respect to quality, our clean videos resemble the normal videos more than the sensor videos.

‘More importantly, downstream facial analysis algorithms can be applied directly without the need of fine-tuning them first for images with sEMG sensors. We eliminate the problem of obstructed facial features that otherwise would render an in-depth analysis of expressions and muscle activity impossible.’

The new paper is titled Let’s Get the FACS Straight – Reconstructing Obstructed Facial Features, and comes from six researchers across Friedrich Schiller University Jena and University Hospital Jena.

Method

For the work, a new dataset was originated to measure the correlation between mimicry of emotions (i.e., acting out facial expressions) and the way that the muscles of the face change. The researchers recorded the facial movement and muscle disposition of 36 test subjects, with a 19/17 female/male split.

Three of the selected study subjects with sEMG sensors attached, which form a severe block to the visual integrity of the face.

The recordings were taken over three sessions (once with bare faces, twice with attached sensors) at a resolution of 1280x720px at 30 frames per second, with muscle activity recorded through surface electromyography (sEMG) sensors attached to the face. No particular parity between the sessions was observed: the color of the wires was allowed to change, and the subjects did not necessarily wear the same clothes or hairstyle between sessions.

Though all the participants were required to run through a script, and to act out emotional states according to that script, it was not possible (or desired) that the subjects all reach exactly the same emotional intensity at exactly the same moment across all the diverse script readings. While this could arguably have been achieved through intensive work, it is not a realistic scenario for downstream applications, and the new methodology is designed to work within these limitations.

The subjects were required to mimic 11 distinct facial expressions three times, known as the Schaede task (in reference to the original work for this methodology). Next, they had to repeat five spoken sentences (known as the Sentence task). Finally, they were required to evince 24 emotional facial expressions (called the Emotion task).

The researchers obtained 174 videos of the participants in this way, with a ratio of 1:2 for bare-faced and sensor-laden content, respectively.

Since the sessions were monitored by medical experts, who would occasionally pause the experiments, there was no final parity in the timing of the executed simulations of emotion. Within reasonable constraints, the subjects were allowed to vary their head angle, gaze and distance to the camera, though ultimately the extraction process would crop all faces to a similar representative size in the extracted frames.

The difference between a bare-faced and sensor-laden image was treated as a style transfer challenge by the researchers, who used the CycleGAN framework to perform the transformations.

Examples from the CycleGAN project, which uses an unusual GAN configuration to train on unpaired data. Source: https://junyanz.github.io/CycleGAN/

The unmatched frames available in the source material made CycleGAN the optimal choice, since it is designed to train image-to-image systems without the use of paired data, using a double-generator structure.

From the new paper, an illustration of the, well, cyclical machinations of CycleGAN during the training process.
From the new paper, an illustration of the, well, cyclical machinations of CycleGAN during the training process.

The researchers essentially developed a translation model between the bare and sensor-pasted frames, capable of accurately restoring the subject’s facial features independent of the expression being produced, with the double generator structure of CycleGAN used for the removal of sEMG sensors (see image above).

In CycleGAN, one of the generators learns a traditional transformational GAN direction, while the second learns the reverse of that direction. Since this is an uncommon task for a GAN, extra loss functions are used, including cycle consistency loss and identity loss.
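As a rough illustration of how these terms typically fit together, the minimal PyTorch-style sketch below assembles an adversarial term, a cycle consistency term and an identity term into a single generator objective. The generator and discriminator names, and the loss weights, are illustrative assumptions following common CycleGAN defaults, not the authors' exact configuration.

```python
import torch
import torch.nn as nn

# Illustrative sketch of the CycleGAN generator objective (not the authors' code).
# G_s2b translates sensor-laden faces to bare faces; G_b2s is the reverse.
# D_bare and D_sensor are the discriminators for the two domains.
adv_loss = nn.MSELoss()      # least-squares GAN loss
l1_loss = nn.L1Loss()
LAMBDA_CYCLE, LAMBDA_ID = 10.0, 5.0   # assumed weights, per common defaults

def generator_loss(G_s2b, G_b2s, D_bare, D_sensor, real_bare, real_sensor):
    # Adversarial terms: each generator tries to fool its discriminator.
    fake_bare = G_s2b(real_sensor)
    fake_sensor = G_b2s(real_bare)
    pred_bare, pred_sensor = D_bare(fake_bare), D_sensor(fake_sensor)
    loss_adv = adv_loss(pred_bare, torch.ones_like(pred_bare)) + \
               adv_loss(pred_sensor, torch.ones_like(pred_sensor))

    # Cycle consistency: translating there and back should recover the input.
    loss_cycle = l1_loss(G_b2s(fake_bare), real_sensor) + \
                 l1_loss(G_s2b(fake_sensor), real_bare)

    # Identity: a generator fed an image already in its target domain
    # should leave it (approximately) unchanged.
    loss_id = l1_loss(G_s2b(real_bare), real_bare) + \
              l1_loss(G_b2s(real_sensor), real_sensor)

    return loss_adv + LAMBDA_CYCLE * loss_cycle + LAMBDA_ID * loss_id
```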

Data and Tests

Effectively, the completed system is designed to produce bare faces from sensor-laden faces, and this task therefore forms the criterion for testing.

For testing purposes, two standards were adopted: the visual quality of the generated images, assessed by comparing their perceptual quality to frames extracted from the source video; and facial analysis for the fitting of Facial Action Units (FAUs) from the FACS system – characteristics that were also being recorded by the facial sensors.

For the latter, the authors used the Python-based PyFeat library, which conveniently incorporates two relevant methodologies: random decision forest, and the attention-based JAA-Net model.

The architecture of JAA-Net, incorporated into the PyFeat library used in the new project. Source: https://arxiv.org/pdf/2003.08834.pdf
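For context, a minimal sketch of pulling FAU estimates from PyFeat might look like the following. The model-selection values (such as au_model='jaanet' or 'rf') follow the library's documented options at the time of the paper, and both these and the accessor names may differ between versions, so treat them as assumptions.

```python
from feat import Detector

# Hedged sketch: 'jaanet' (attention-based JAA-Net) and 'rf' (random decision
# forest) were documented au_model options in earlier py-feat releases.
detector = Detector(au_model="jaanet")
# detector = Detector(au_model="rf")   # random-forest alternative

# Run detection on a single restored frame (hypothetical file name).
result = detector.detect_image("restored_frame.png")

# The returned Fex dataframe exposes the predicted Action Unit activations.
print(result.aus)
```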

Emotion detection was performed with the ResidualMaskingNetwork (ResNetMask) project, which has previously been applied to detecting emotions under occlusion:

Examples from the ResidualMaskingNetwork project. Source: https://github.com/phamquiluan/ResidualMaskingNetwork
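A minimal sketch of running ResidualMaskingNetwork over a restored frame is shown below; the package name (rmn) and the method name follow the project's README, and should be treated as assumptions if the codebase has since changed.

```python
import cv2
from rmn import RMN   # pip install rmn, per the project README

# Hedged sketch: detect_emotion_for_single_frame is the interface documented
# in the README and may differ in newer releases.
m = RMN()
frame = cv2.imread("restored_frame.png")   # hypothetical restored frame
results = m.detect_emotion_for_single_frame(frame)
print(results)   # detected faces with per-emotion probabilities
```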

The authors also evaluated results between the real-world videos themselves – the actual source captures from the sessions, with no synthesized information involved.

The experimental setup for the new work. The check marks and crosses indicate the possibility of solving a specific task based on the available data for the experiments.

Metrics used included Learned Perceptual Image Patch Similarity (LPIPS) and Fréchet Inception Distance (FID). The latter is particularly useful, since the averaging of content output is pertinent in a scenario, such as the one proposed in the new work, where exactly analogous frames cannot be generated. The project additionally uses the Inception V3 architecture, and the FastFID implementation.

All variants are run at a batch size of 128.
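As a rough guide to how these two measures are computed in practice, the sketch below uses the off-the-shelf torchmetrics implementations (which rely on Inception V3 features for FID) rather than the FastFID code used by the authors; the random tensors simply stand in for batches of 128 real and generated frames.

```python
import torch
from torchmetrics.image.fid import FrechetInceptionDistance
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Illustrative stand-in, not the paper's implementation; shapes are assumptions.
fid = FrechetInceptionDistance(feature=2048)                       # InceptionV3 pool features
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

real_frames = torch.randint(0, 256, (128, 3, 256, 256), dtype=torch.uint8)
fake_frames = torch.randint(0, 256, (128, 3, 256, 256), dtype=torch.uint8)

fid.update(real_frames, real=True)     # accumulate statistics for real frames
fid.update(fake_frames, real=False)    # accumulate statistics for generated frames
print("FID:", fid.compute().item())

# LPIPS compares image pairs; normalize=True accepts float images in [0, 1].
print("LPIPS:", lpips(real_frames.float() / 255, fake_frames.float() / 255).item())
```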

For each test subject, six of their videos were used, four with and two without the attached facial sensors. Frames were randomly chosen from these videos, and the amount of training data was limited to 2% of available frames.
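A simple way to implement the random '2% of available frames' sampling described above is sketched here using OpenCV; the function name and seeding are illustrative assumptions rather than the authors' exact selection scheme.

```python
import random
import cv2

# Hedged sketch of randomly sampling a small fraction of a video's frames.
def sample_frames(video_path, fraction=0.02, seed=0):
    cap = cv2.VideoCapture(video_path)
    total = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    wanted = sorted(random.Random(seed).sample(range(total), max(1, int(total * fraction))))
    frames = []
    for target in wanted:
        cap.set(cv2.CAP_PROP_POS_FRAMES, target)   # seek to the chosen frame index
        ok, frame = cap.read()
        if ok:
            frames.append(frame)
    cap.release()
    return frames
```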

All extracted faces were resized to 286×286px, matching the CycleGAN backbone. The frames were split 90/10 for training and validation, and the usual data augmentation was performed, including flips, cropping and normalization.
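A typical torchvision preprocessing pipeline of this kind might look as follows; the 256px crop size and the normalization constants are CycleGAN's usual defaults, assumed here rather than stated in the paper.

```python
from torchvision import transforms

# Standard CycleGAN-style preprocessing, as a hedged sketch.
train_transforms = transforms.Compose([
    transforms.Resize(286),                      # resize face crops to 286x286
    transforms.RandomCrop(256),                  # random crop for augmentation
    transforms.RandomHorizontalFlip(),           # random horizontal flips
    transforms.ToTensor(),
    transforms.Normalize((0.5, 0.5, 0.5), (0.5, 0.5, 0.5)),  # map to [-1, 1]
])
```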

For training, a ResNet model consisting of nine blocks, trained from scratch, was used for the generator network, with two additional downsampling blocks prepended to this pipeline. Both generators were paired with PatchGAN-style discriminators.
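The generator described above broadly follows the standard CycleGAN ResNet layout; a condensed PyTorch sketch of that layout (initial convolution, two downsampling stages, nine residual blocks, then mirrored upsampling) is given below. Channel widths and normalization choices follow common defaults and are assumptions here.

```python
import torch.nn as nn

class ResnetBlock(nn.Module):
    # One residual block: two 3x3 convolutions with a skip connection.
    def __init__(self, dim):
        super().__init__()
        self.block = nn.Sequential(
            nn.ReflectionPad2d(1), nn.Conv2d(dim, dim, 3), nn.InstanceNorm2d(dim), nn.ReLU(True),
            nn.ReflectionPad2d(1), nn.Conv2d(dim, dim, 3), nn.InstanceNorm2d(dim),
        )

    def forward(self, x):
        return x + self.block(x)

def build_generator(in_ch=3, base=64, n_blocks=9):
    layers = [nn.ReflectionPad2d(3), nn.Conv2d(in_ch, base, 7), nn.InstanceNorm2d(base), nn.ReLU(True)]
    ch = base
    for _ in range(2):                                    # two downsampling blocks
        layers += [nn.Conv2d(ch, ch * 2, 3, stride=2, padding=1), nn.InstanceNorm2d(ch * 2), nn.ReLU(True)]
        ch *= 2
    layers += [ResnetBlock(ch) for _ in range(n_blocks)]  # nine residual blocks
    for _ in range(2):                                    # matching upsampling blocks
        layers += [nn.ConvTranspose2d(ch, ch // 2, 3, stride=2, padding=1, output_padding=1),
                   nn.InstanceNorm2d(ch // 2), nn.ReLU(True)]
        ch //= 2
    layers += [nn.ReflectionPad2d(3), nn.Conv2d(ch, in_ch, 7), nn.Tanh()]
    return nn.Sequential(*layers)
```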

All models were trained for 30 epochs at a learning rate of 3e-4, with a continuous learning rate decay update after 15 epochs.
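In code, a schedule of this shape might be expressed as below; the choice of Adam and the linear form of the decay are assumptions, since only the learning rate, the epoch count and the decay onset are stated, and the single convolution stands in for the CycleGAN generators.

```python
import torch
import torch.nn as nn

# Hedged sketch of the stated schedule: 30 epochs, lr 3e-4, decay after epoch 15.
generator = nn.Conv2d(3, 3, 3)   # placeholder module standing in for the generators
EPOCHS, DECAY_START, LR = 30, 15, 3e-4
optimizer = torch.optim.Adam(generator.parameters(), lr=LR, betas=(0.5, 0.999))

def lr_lambda(epoch):
    # Hold the full rate for 15 epochs, then decay linearly towards zero at epoch 30.
    return 1.0 - max(0, epoch - DECAY_START) / float(EPOCHS - DECAY_START)

scheduler = torch.optim.lr_scheduler.LambdaLR(optimizer, lr_lambda=lr_lambda)
```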

The authors note, in reference to the image below (which depicts training progress), that the model immediately learned the general removal procedure for the sEMG sensors, after which it proceeded to restore fine-grained detail on the faces.

During training, the model addresses sEMG removal within a mere five epochs, because these are very manifest differences, before proceeding to add detail.

For tests, videos from each participant were transformed (sensor>no sensor) using a network trained on only 2% of the frames, with eight basic emotions and diverse FAUs interpreted. Only videos of the same recording session were compared to each other since the lack of truly paired data meant that this setup represented adequately novel data for such a sparsely-seen dataset.

For a baseline, all evaluations between the two normal videos were estimated (i.e., real>real).

Though extensive results from this are not provided in the paper, some sample selections are released (see the second image after the start of this article, above).

The authors comment:

‘We assume that the model learned a generalized version of each test subject’s face as in some of the shown examples the view was zoomed out. Thus, missing information must have been encoded inside the model.

‘The examples show that the model retains head posture, orientation, and most significantly the correct facial expressions.’

A perceptual score was then obtained using LPIPS and FID, with mean scores averaged over all test subjects, together with their mean standard deviation from the real>real baseline, across the three aforementioned standard tasks:

Results from the initial run of perceptual tests.

According to the authors, the fact that the generated clean (no sensor) videos have a higher resemblance to the baseline indicates superior performance, while the relative parity of the LPIPS scores is another excellent sign that the generated images tend to fall within the distribution of ‘real’ videos.

Further, the authors attribute the fact that their synthetic results perform better than the real results to the possibility of changes in the recording setup lowering the score for the real results, whereas these aspects are normalized in the synthetic results.

In reconstructing FAUs using the two aforementioned methods, the results were averaged for test subjects and FAUs, with the results table showing the respective mean scores and deviation from the baseline:

Results for the FAU inference round of tests.

Here, for most of the clean videos, the new system is able to obtain a similar score to the baseline.

Qualitative comparison for the Schaede task, with five intervals of activation highlighted.

The authors observe that their method is capable of restoring the missing intervals visualized above, and of correcting the amplitude of existing ones.

For the emotion detection comparison test, all three tasks were considered, processed via the aforementioned ResNetMask. The video pairs for all 36 participants were averaged for this test, and were compared using Dynamic Time Warping (DTW) and Mean Absolute Percentage Error (MAPE). The results below are for the neutral emotional state:

Results for emotion detection inference.

Here the authors note that their restoration again achieves similar results to the baseline scores.
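For reference, minimal NumPy implementations of the two comparison measures used above (DTW and MAPE) are sketched below, assuming 1-D per-frame emotion-probability series for the real and reconstructed videos; this is an illustration rather than the authors' evaluation code.

```python
import numpy as np

# Mean Absolute Percentage Error between two aligned series (epsilon avoids
# division by zero when the reference value is 0).
def mape(real, pred, eps=1e-8):
    real, pred = np.asarray(real, float), np.asarray(pred, float)
    return np.mean(np.abs((real - pred) / (real + eps))) * 100

# Classic O(n*m) dynamic-programming DTW distance between two 1-D sequences.
def dtw_distance(a, b):
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j], cost[i, j - 1], cost[i - 1, j - 1])
    return cost[n, m]
```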

In concluding, the authors concede that they are not always able to achieve a fault-free reproduction, since the equidistant selection of frames may alight on a moment when the subject closed their eyes, though the authors observe that random frame selection could mitigate this issue.

A reconstruction failure that can occur when an unfortunate frame is selected equidistantly.

Conclusion

The test data scenario for this paper is quite eccentric, since the use of sensors is not necessarily germane to the potential downstream applications of the method. Nonetheless, the work contributes a novel workflow for one of the most intensely-studied tasks in computer vision – the faithful reconstruction of faces where the data may not be perfect, and the removal of occlusions.

It would have been interesting to see the new technique compared to the more resource-intensive fine-tuning methods that it seeks to replace – though it has to be admitted that any approach that improves on scores from real-world data may have something genuine to offer this particular strand of research.
