Restoring hidden parts of facial images is a strong strand of research in the computer vision sector. To make new models robust to damaged or partial data, many are trained with masking methodologies, in which parts of the source face are blacked out or otherwise perturbed at each stage of training, so that the model is forced to develop a flexible understanding of faces instead of simply memorizing the faces it sees.
Additionally, the idea of hiding parts of the data so that the system is more resilient is not limited to computer vision models, but is also practiced in the development of language-based models.
In terms of Facial Expression Recognition (FER), this kind of semi-blind training can be a great aid to the development of more robust emotion recognition functionality. As humans, we are so sensitized to reading emotions, as a survival and evolutionary skill, that most of us are able to infer emotional state from very limited information. Machine learning models, however, tend towards binary classifications and thresholds rather than spectra, and often need a larger volume of known signifiers than a person would.
Therefore the prevailing practice in model training for these purposes has been to fine-tune existing models, so that they are endowed with these extra perceptual capabilities. But as we have noted before, fine-tuning not only affects all the weights in the model (rendering it less useful, or even useless, for the wider tasks it was originally trained for), but is also a resource-intensive and time-consuming process.
Now, a new paper from Germany is offering a method of reconstructing obscured parts of a face without the need for fine-tuning, instead using Generative Adversarial Networks (GANs) to reformulate training data.
In tests, the new method is able to markedly improve on facial capture reconstructions obtained by real-world facial sensors – a notable achievement.
The core intended application of the approach is to facilitate FER through analysis of the disposition of facial muscles, largely according to the Facial Action Coding System (FACS), a methodology developed in the 1970s that has become dominant in the field, despite notable criticism. The authors believe that the new system will have wider applications, however.
The paper states:
‘This proposed approach retains the visual appearances of the test subjects. In fact, we show that completely covered facial features can be restored correctly. With respect to quality, our clean videos resemble the normal videos more than the sensor videos.
‘More importantly, downstream facial analysis algorithms can be applied directly without the need of fine-tuning them first for images with sEMG sensors. We eliminate the problem of obstructed facial features that otherwise would render an in-depth analysis of expressions and muscle activity impossible.’
For the work, a new dataset was originated to measure the correlation between mimicry of emotions (i.e., acting out facial expressions) and the way that the muscles of the face change. The researchers recorded the facial movement and muscle disposition of 36 test subjects, with a 19/17 female/male split.
The recordings were taken over three sessions (once with bare faces, twice with attached sensors) at a resolution of 1280x720px at 30 frames per second, with muscle activity recorded through surface electromyography (sEMG) sensors attached to the face. No particular parity between the sessions was observed: the color of the wires was allowed to change, and the subjects did not necessarily wear the same clothes or hairstyle between sessions.
Though all the participants were required to run through a script, and to act out emotional states according to that script, it was not possible (or desired) that the subjects all reach exactly the same emotional intensity at exactly the same moment across all the diverse script readings. While this could arguably have been achieved through intensive work, it is not a realistic scenario for downstream applications, and the new methodology is designed to work within these limitations.
The subjects were required to perform 11 distinct facial expressions three times (known as the Schaede task, in reference to the original work for this methodology). Next, they had to repeat five spoken sentences (the Sentence task). Finally, they were required to evince 24 emotional facial expressions (the Emotion task).
The researchers obtained 174 videos of the participants in this way, with a ratio of 1:2 for bare-faced and sensor-laden content, respectively.
Since the sessions were monitored by medical experts, who would occasionally pause the experiments, there was no final parity in the timing of the executed simulations of emotion. Within reasonable constraints, the subjects were allowed to vary their head angle, gaze and distance to the camera, though ultimately the extraction process would crop all faces to a similar representative size in the extracted frames.
The unmatched frames available in the source material made CycleGAN the optimal choice, since it is designed to train image-to-image systems without the use of paired data, using a double-generative structure.
The researchers essentially developed a translation model between the bare and sensor-pasted frames, capable of accurately restoring the subject’s facial features independent of the expression being produced, with the double generator structure of CycleGAN used for the removal of sEMG sensors (see image above).
In CycleGAN, one of the generators learns a traditional transformational GAN direction, while the second learns the reverse of that direction. Since this is an unusual configuration for a GAN, extra loss functions are used, including cycle consistency loss and identity loss.
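These two auxiliary terms are simple to state. The sketch below illustrates them in NumPy; the weighting factors `lam` and `lam_id` are illustrative assumptions (CycleGAN's usual defaults), not values taken from the paper:

```python
import numpy as np

def l1(a, b):
    # Mean absolute error, the distance used by both auxiliary losses.
    return float(np.mean(np.abs(a - b)))

def cycle_consistency_loss(G, F, x, y, lam=10.0):
    # G maps domain X (sensor) to Y (bare); F maps Y back to X.
    # A round trip through both generators should reproduce the input.
    return lam * (l1(F(G(x)), x) + l1(G(F(y)), y))

def identity_loss(G, F, x, y, lam_id=5.0):
    # A generator fed an image already in its target domain should
    # leave it (almost) unchanged, which helps preserve identity.
    return lam_id * (l1(G(y), y) + l1(F(x), x))
```

With perfect round trips both losses vanish; any drift away from the original image is penalized, which is what lets the two generators learn from unpaired frames.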
Data and Tests
Effectively, the completed system is designed to produce bare faces from sensor-laden faces, and this task is therefore the criterion for testing.
For testing purposes, two standards were adopted: the visual quality of the generated images, assessed by comparing their perceptual quality to frames extracted from the source video; and facial analysis for the fitting of Facial Action Units (FAUs) from the FACS system – characteristics that were also being recorded by the facial sensors.
The authors also ran the evaluations between the real-world videos themselves – the actual source captures from the sessions, with no synthesized information involved.
Metrics used included Learned Perceptual Image Patch Similarity (LPIPS) and Fréchet Inception Distance (FID). The latter is particularly useful, since it compares distributions of images rather than individual pairs – pertinent in a scenario, such as the one proposed in the new work, where exactly analogous frames cannot be generated. The project additionally uses the Inception V3 architecture and the FastFID implementation.
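FID itself reduces to the Fréchet distance between two Gaussians fitted to the feature embeddings (here, Inception V3 features) of the two image sets. A minimal NumPy sketch of that final distance computation, assuming the feature means and covariances have already been extracted:

```python
import numpy as np

def _sqrtm_psd(mat):
    # Matrix square root of a symmetric positive semi-definite matrix
    # via eigendecomposition.
    vals, vecs = np.linalg.eigh(mat)
    vals = np.clip(vals, 0.0, None)
    return (vecs * np.sqrt(vals)) @ vecs.T

def frechet_distance(mu1, sigma1, mu2, sigma2):
    # FID between N(mu1, sigma1) and N(mu2, sigma2):
    #   ||mu1 - mu2||^2 + Tr(s1 + s2 - 2 (s1 s2)^{1/2}).
    # Tr((s1 s2)^{1/2}) is computed via the equivalent symmetric form
    # (s1^{1/2} s2 s1^{1/2})^{1/2}, avoiding a non-symmetric sqrtm.
    diff = mu1 - mu2
    s1_half = _sqrtm_psd(sigma1)
    covmean = _sqrtm_psd(s1_half @ sigma2 @ s1_half)
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical distributions score zero, and the score grows as the two sets of embeddings drift apart – which is why it tolerates the absence of exactly matched frame pairs.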
All variants are run at a batch size of 128.
For each test subject, six of their videos were used, four with and two without the attached facial sensors. Frames were randomly chosen from these videos, and the amount of training data was limited to 2% of available frames.
All extracted faces were resized to 286×286px, matching the CycleGAN backbone. The frames were split 90/10 between training and validation, and the usual data augmentation was performed, including flips, cropping and normalization.
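That augmentation pipeline can be sketched in NumPy as follows; the 256×256 crop size is an assumption based on the standard CycleGAN training resolution, rather than a figure stated in the paper:

```python
import numpy as np

rng = np.random.default_rng(0)

def augment(img):
    # img: HxWx3 uint8 array, already resized to 286x286.
    # Random horizontal flip.
    if rng.random() < 0.5:
        img = img[:, ::-1]
    # Random 256x256 crop (assumed CycleGAN training resolution).
    top = rng.integers(0, img.shape[0] - 256 + 1)
    left = rng.integers(0, img.shape[1] - 256 + 1)
    img = img[top:top + 256, left:left + 256]
    # Normalize to [-1, 1], matching a tanh generator output.
    return img.astype(np.float32) / 127.5 - 1.0
```

Each call yields a slightly different 256×256 view of the same face, cheaply multiplying the effective size of the very small (2%) training set.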
For training, a ResNet model consisting of nine blocks, trained from scratch, was used for the generator network, with two additional downsampling blocks prepended to this pipeline. Both generators were paired with PatchGAN discriminators.
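A quick way to see how these pieces fit together is to trace feature-map sizes through the generator: two stride-2 downsampling stages, nine shape-preserving residual blocks, then two matching upsampling stages. The 256px input size below is assumed from the standard CycleGAN configuration:

```python
def generator_shapes(h=256, w=256):
    # Trace spatial sizes through the CycleGAN-style generator:
    # two stride-2 downsampling blocks, nine residual blocks
    # (shape-preserving), then two stride-2 upsampling blocks.
    shapes = [(h, w)]
    for _ in range(2):              # downsampling
        h, w = h // 2, w // 2
        shapes.append((h, w))
    shapes += [(h, w)] * 9          # residual blocks keep the size
    for _ in range(2):              # upsampling
        h, w = h * 2, w * 2
        shapes.append((h, w))
    return shapes
```

The residual work thus happens at a quarter of the input resolution, which keeps the nine-block core affordable while the skip connections preserve fine facial detail.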
The authors note, in reference to the image below (which depicts training progress), that the model immediately learned the general removal procedure for the sEMG sensors, after which it proceeded to restore fine-grained detail on the faces.
For tests, videos from each participant were transformed (sensor>no sensor) using a network trained on only 2% of the frames, with eight basic emotions and diverse FAUs interpreted. Only videos from the same recording session were compared to each other, since the lack of truly paired data meant that this setup already represented adequately novel material for a model trained on so few frames.
For a baseline, all evaluations between the two normal videos were estimated (i.e., real>real).
Though extensive results from this are not provided in the paper, some sample selections are released (see the second image after the start of this article, above).
The authors comment:
‘We assume that the model learned a generalized version of each test subject’s face as in some of the shown examples the view was zoomed out. Thus, missing information must have been encoded inside the model.
‘The examples show that the model retains head posture, orientation, and most significantly the correct facial expressions.’
A perceptual score was then obtained using LPIPS and FID, with mean scores averaged over all test subjects, together with their mean standard deviation from the real>real baseline, across the three aforementioned tasks:
According to the authors, the fact that the generated clean (no sensor) videos bear a higher resemblance to the baseline indicates superior performance, while the relative parity of the LPIPS scores is another excellent sign that the generated images tend to fall within the distribution of ‘real’ videos.
Further, the authors attribute the fact that their synthetic results perform better than the real results to the possibility of changes in the recording setup lowering the score for the real results, whereas these aspects are normalized in the synthetic results.
In reconstructing FAUs using the two aforementioned methods, the results were averaged for test subjects and FAUs, with the results table showing the respective mean scores and deviation from the baseline:
Here, for most of the clean videos, the new system is able to obtain a similar score to the baseline.
The authors observe that their method is capable of restoring the missing intervals visualized above, and of correcting the amplitude of existing ones.
For the emotion detection comparison test, all three tasks were considered, processed via ResNetMask. The video pairs for all 36 participants were averaged for this test, and were compared using Dynamic Time Warping (DTW) and mean absolute percentage error (MAPE). The results below are for the neutral emotional state:
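Both metrics are standard and easy to sketch: DTW finds the minimum-cost monotonic alignment between two time series, so emotion curves with similar shape but different timing still score well, while MAPE measures relative error between aligned values. A minimal NumPy version:

```python
import numpy as np

def dtw_distance(a, b):
    # Classic dynamic-time-warping cost: the minimum summed distance
    # over all monotonic alignments of the two sequences.
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],
                                 cost[i, j - 1],
                                 cost[i - 1, j - 1])
    return float(cost[n, m])

def mape(pred, target, eps=1e-8):
    # Mean absolute percentage error between two aligned sequences;
    # eps guards against division by zero.
    pred, target = np.asarray(pred, float), np.asarray(target, float)
    return float(np.mean(np.abs((pred - target) / (target + eps))) * 100.0)
```

DTW's tolerance of timing differences is exactly what the unpaired recording sessions require, since the same scripted emotion never peaks at the same frame twice.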
Here the authors note that their restoration again achieves similar results to the baseline scores.
In concluding, the authors concede that they are not always able to achieve a fault-free reproduction, since the equidistant selection of frames may alight on a moment when the subject closed their eyes, though the authors observe that random frame selection could mitigate this issue.
The test data scenario for this paper was quite eccentric, since the use of sensors is not necessarily germane to the potential downstream applications of the method. Nonetheless, the work contributes a novel workflow for one of the most intensely-studied tasks in computer vision – the faithful reconstruction of faces from imperfect data, and the removal of occlusions.
It would have been interesting to see the new technique compared to the more resource-intensive fine-tuning methods that it seeks to replace – though it has to be admitted that any approach that improves on scores from real-world data may have something genuine to offer this particular strand of research.