Improving Deepfaked Eye Contact via Sound Analysis

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

The ability to control the eye direction of an AI-generated face is one of the most important challenges currently facing the VFX research sector in its pursuit of hyper-realistic facial reconstruction and synthesis. Though we, as people, are finely tuned to changes in expression as a survival mechanism, we are even more sensitive to changes in the eye direction of those we are observing, since these signal attention, and the possibility (for instance, in a group setting) that we have been singled out from an audience of listeners and may need to respond or otherwise take action.

From a study on gaze adaptation, the various possible head poses in which eye contact is increasingly unlikely, and perhaps increasingly significant. Source: https://www.frontiersin.org/articles/10.3389/fpsyg.2018.02165/full

Gaze fixation and eye contact are such a central facet of human expression (though not currently addressed in traditional Facial Expression Recognition systems such as the Facial Action Coding System), and such a strong potential signifier of territorialism or threat, that staring has been selectively banned in the United Kingdom.

Nevertheless, giving eye contact adequate attention in machine learning is something of a challenge in itself, at least when considered in the context of the face as a whole. The eye apertures occupy only around 5-7% of the entire face, and when face images are run through a machine learning training network, where a typically generous resolution remains around 512px², the position of the cornea and inner pupil can be affected by the tiniest factor, such as codec-based smoothing, which can alter the fidelity of the eye pose.

Even the trend towards higher resolutions, such as 768px² and 1024px², does relatively little to improve this state of affairs.
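As a rough back-of-the-envelope illustration (using the 5-7% coverage estimate above, which is itself approximate), the snippet below shows how small the per-eye pixel budget remains even at these higher resolutions:

```python
# Rough pixel-budget estimate for the eye regions at common training resolutions.
# The 5-7% face-coverage figure is taken from the article; the rest is arithmetic.
for side in (512, 768, 1024):
    total = side * side
    low, high = int(total * 0.05), int(total * 0.07)
    # Split across two eyes, each eye aperture gets roughly half of that budget.
    print(f"{side}x{side}px: eye regions ~{low:,}-{high:,}px "
          f"(~{low // 2:,}-{high // 2:,}px per eye)")
```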

Nonetheless, a number of commercial products have lately been attempting to achieve neural eye-direction control within such constraints, including a new functionality in NVIDIA's Broadcast app, which uses machine learning to 'fix' a divergent gaze straight back to camera:

The NVIDIA broadcast app can now keep you looking to camera, even if you’re not. Source: https://www.youtube.com/watch?v=nR-vP_7XFHE

When training on people who are smiling broadly, who present very little sclera or corneal space, whose visible eye surface has diminished with age, who are standing obliquely to camera, or whose eye apertures are naturally narrower, the human capacity for fine-grained gaze evaluation becomes even harder for training systems to emulate, since such systems must consider the entirety of the face, with the eyes treated as just one equally-weighted facet of the whole – a level of attention that does not reflect the importance of eye pose in human culture.

Eye size is no indicator of success or attractiveness, but a lack of inner eye detail can confound machine learning systems that are attempting to evaluate or recreate these minimal areas of pixels.

If you have ever observed a celebrity red-carpet event, you'll have noticed that the crowd of photographers tends to call out to the stars in the hope of obtaining eye contact, which is likely to increase the value of a photo, for cultural reasons: the viewer of the image will respond more favorably to the apparent attention of the subject.

These days, paparazzi could arguably use the Eye Direction slider in the Smart Portrait functionality of Photoshop's Neural Filters to alter the outcome – at least until the growing impetus to reveal and constrain AI-based manipulation begins to curtail this kind of trickery.

In any case, this particular offering from Adobe is a very hit-and-miss affair, often changing the identity of the eye, altering the outer lineaments of the eye excessively, and almost never recapturing the full and authentic detail of the original photo:

The use of Photoshop's eye direction changer, which transmits data to and from the cloud, since most people do not have adequate GPU power to perform this task, is rarely able to effect a totally convincing gaze-change without significant subsequent manual retouching. In both examples, we can see that the eye color has not been accurately reproduced either, and in the case of Ryan Gosling, that the cornea is significantly smaller. These were the only two examples run, and are not cherry-picked.

The problem, as is often the case with computer vision systems, lies both in the data and in the limited ability of training systems to interpret it in a useful way. A great deal of historical research, and a great many of the datasets currently available, treat head direction as a proxy for eye-pose direction, assuming that the eyes are always facing forward, and therefore that if the head is facing away, the gaze is facing away too.

Eye direction is frequently mis-associated with head direction.

This may be useful to set a ‘baseline’ canonical template in CGI-based neural interfaces such as 3D Morphable Models (3DMMs), but it tends to teach machine systems to make this same false assumption – to the point where it may be necessary to laboriously seek out and include a large number of ‘lateral gazes’ in order to redress the balance.

The best solution would arguably be to improve the quality of gaze estimation itself. Now, new research from the United Kingdom and Korea seeks to do this by studying the soundtracks of video clips, in order to assess where any particular speaker in a video is likely to be looking.

Sound can help to identify where a person is likely to be looking when there may not be enough eye surface visible to otherwise detect this. Source: https://arxiv.org/pdf/2311.05669.pdf

This novel multimodal approach has produced, the authors claim, the first gaze-following dataset to include a sound component; and in tests, the paper reports, the new work significantly outperformed prior approaches.

The new paper is titled Multi-Modal Gaze Following in Conversational Scenarios, and comes from six researchers across the University of Birmingham in the UK, and the Korea Electronics Technology Institute (KETI).

Method

The new approach is called MMGaze, and undertakes gaze evaluation for each frame of a video, using the soundtrack and detected lip movement to determine which participant is speaking.

After the obtained features are processed, the results are pooled into a gaze candidate estimation network, which is further informed by the IDs the system has already assigned to each of the potential speakers. Finally, a Multilayer Perceptron (MLP) selects the highest-probability candidate as the gaze target of the other people featured in the video, from which a pose for the eye is obtained.
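The paper does not give implementation-level detail for this final selection step, but the general pattern (an MLP scoring pooled per-candidate features, with a softmax/argmax over the scores) can be sketched as below. The layer widths, feature dimension and candidate count are illustrative assumptions, not the authors' configuration:

```python
import torch
import torch.nn as nn

class GazeCandidateSelector(nn.Module):
    """Illustrative MLP head: scores each gaze candidate and picks the most probable one.
    Layer widths and feature dimension are assumptions, not the paper's values."""
    def __init__(self, feat_dim: int = 256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128),
            nn.ReLU(),
            nn.Linear(128, 1),        # one logit per candidate
        )

    def forward(self, candidate_features: torch.Tensor):
        # candidate_features: (num_candidates, feat_dim) pooled features per candidate
        logits = self.mlp(candidate_features).squeeze(-1)   # (num_candidates,)
        probs = torch.softmax(logits, dim=0)
        return probs, torch.argmax(probs).item()            # distribution + chosen index

# Example: five hypothetical candidates, 256-dim pooled features each
probs, best = GazeCandidateSelector()(torch.randn(5, 256))
```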

Conceptual architecture for MMGaze.

For the initial task of active speaker detection, MMGaze draws on the prior work Out of Time: Automated Lip Sync in the Wild (ALSW). In this way, features are extracted from both the visual and audio components.

Video and audio examples for the process of discerning genuine and false audio-video pairs, for the paper 'Out of Time: Automated Lip Sync in the Wild'. Source: https://www.robots.ox.ac.uk/~vgg/publications/2016/Chung16a/chung16a.pdf

The concatenated features are subsequently used to identify speakers. For this process, the S3FD facial detection library (long used in the deepfaking software FaceSwap) produces a grayscale image, which is then cropped to the mouth region based on landmarks. Extractions were sampled every five frames, in videos with a frame rate of 25fps.
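A minimal sketch of this extraction stage, under stated assumptions, might look as follows. S3FD is not invoked directly here; `detect_landmarks` is a hypothetical stand-in for a 68-point landmark detector built on top of the face detections (indices 48-67 being the mouth points in the standard 68-point scheme), while the five-frame sampling interval follows the paper:

```python
import cv2
import numpy as np

def detect_landmarks(gray_frame: np.ndarray) -> np.ndarray:
    """Placeholder for an S3FD-based 68-point landmark detector; returns (68, 2) coords."""
    raise NotImplementedError  # swap in a real detector here

def extract_mouth_crops(video_path: str, interval: int = 5, size: int = 112):
    """Sample every `interval` frames (5 at 25fps, per the paper), convert to grayscale
    and crop the mouth region from the 68-point landmarks (indices 48-67)."""
    cap = cv2.VideoCapture(video_path)
    crops, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % interval == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            pts = detect_landmarks(gray)
            mouth = pts[48:68]                       # outer + inner lip landmarks
            x0, y0 = mouth.min(axis=0).astype(int)
            x1, y1 = mouth.max(axis=0).astype(int)
            crops.append(cv2.resize(gray[y0:y1, x0:x1], (size, size)))
        idx += 1
    cap.release()
    return crops
```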

The Single Shot Scale-invariant Face Detector (S3FD) is a popular resource in facial individuation and landmarking. Source: https://arxiv.org/pdf/1708.05237.pdf

For the audio, the SyncNet network in ALSW extracts motion features from the captured mouth areas. This module is trained with a contrastive loss, which requires correlation between the lip-motion features and the complementary audio features, while out-of-frame voices are filtered out via an appropriate threshold.
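The contrastive objective used by SyncNet pulls genuine lip-motion/audio pairs together and pushes false pairs apart by a margin; a minimal PyTorch rendering of that loss is shown below, with the margin value an assumption rather than the published setting:

```python
import torch
import torch.nn.functional as F

def sync_contrastive_loss(video_emb, audio_emb, is_genuine, margin: float = 1.0):
    """SyncNet-style contrastive loss: y * d^2 + (1 - y) * max(margin - d, 0)^2,
    where d is the Euclidean distance between lip-motion and audio embeddings.
    `is_genuine` is 1 for true audio-video pairs and 0 for false ones."""
    d = F.pairwise_distance(video_emb, audio_emb)            # (batch,)
    loss = is_genuine * d.pow(2) + (1 - is_genuine) * F.relu(margin - d).pow(2)
    return loss.mean()

# Example: batch of 8 pairs, 1024-dim embeddings, half genuine
v, a = torch.randn(8, 1024), torch.randn(8, 1024)
y = torch.tensor([1, 1, 1, 1, 0, 0, 0, 0], dtype=torch.float32)
print(sync_contrastive_loss(v, a, y))
```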

The SyncNet network in ALSW uses contrastive loss to simultaneously train lip movement and related extracted audio features. Source: https://www.robots.ox.ac.uk/~vgg/publications/2016/Chung16a/chung16a.pdf

The bounding boxes obtained from the extracted images are then converted into identity maps representing the speaker and listener(s). These are stacked with the full-scene image in the channel dimension, and the resulting five-channel image is used for the subsequent candidate estimation task.
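Assuming the five channels consist of the RGB scene image plus one binary map for the speaker and one for the listener(s) (the exact channel layout here is an assumption), the stacking step reduces to something like the following sketch:

```python
import numpy as np

def build_scene_input(rgb: np.ndarray, speaker_boxes, listener_boxes):
    """Stack the RGB frame with binary identity maps for speaker and listener(s)
    along the channel axis, giving a five-channel input as described in the article.
    rgb: (H, W, 3) image; boxes: lists of (x0, y0, x1, y1) in pixels."""
    h, w = rgb.shape[:2]
    speaker_map = np.zeros((h, w), dtype=np.float32)
    listener_map = np.zeros((h, w), dtype=np.float32)
    for x0, y0, x1, y1 in speaker_boxes:
        speaker_map[y0:y1, x0:x1] = 1.0
    for x0, y0, x1, y1 in listener_boxes:
        listener_map[y0:y1, x0:x1] = 1.0
    # Result has shape (H, W, 5): three colour channels plus two identity maps
    return np.dstack([rgb.astype(np.float32) / 255.0, speaker_map, listener_map])
```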

Speaker relationships are established by studying who is speaking and who appears to be listening, with multiple possible eye directions.

The detection of gaze targets in the MMGaze architecture is inspired, the authors say, by the Mask R-CNN object detection initiative from Facebook Research. Enhanced images are passed to a ResNeXt-101 model to obtain feature maps. Predetermined Regions of Interest (ROIs) are set for each point in the feature map, before being sent to a Region Proposal Network (RPN) for binary classification and bounding-box regression.

With some of the candidates thus filtered out, the remaining ones are subject to the ROIAlign operation from Mask R-CNN, which maps each candidate region of the original source image onto the corresponding pixels of the feature map. A new fixed-size feature map is thus generated, with the candidate frames subject to further regression through a fully convolutional network (FCN).
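The ROIAlign operation itself is available off the shelf in torchvision; the sketch below shows how candidate boxes might be re-sampled into fixed-size feature maps. The feature-map stride, channel count and box coordinates are placeholder assumptions, not values from the paper:

```python
import torch
from torchvision.ops import roi_align

# Placeholder backbone output: batch of 1, 256-channel feature map at 1/16 resolution
features = torch.randn(1, 256, 45, 80)          # e.g. for a 720x1280 input frame

# Candidate boxes surviving the RPN stage, in source-image pixel coordinates:
# each row is (batch_index, x0, y0, x1, y1)
boxes = torch.tensor([[0, 100., 80., 260., 300.],
                      [0, 600., 90., 760., 320.]])

# Re-sample each candidate into a fixed 7x7 feature map, accounting for the 1/16 stride
pooled = roi_align(features, boxes, output_size=(7, 7), spatial_scale=1.0 / 16)
print(pooled.shape)   # torch.Size([2, 256, 7, 7])
```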

Lastly, an MLP is trained to map each subject to their most probable gaze target.

Data and Tests

In the absence of any appropriate extant dataset, the authors compiled what they contend to be the first gaze dataset that incorporates audio, titled VideoGazeSpeech.

Examples from the VideoGazeSpeech dataset.

The new dataset comprises 35,231 video frames across 29 videos, each around 20 seconds in duration, running at 25fps and at a resolution of 1280x720px, for a total size of 7.2GB.

VideoGazeSpeech is an augmented subset of the dataset associated with the 2020 paper Find Who to Look at: Turning From Action to Saliency.

Examples from the source dataset that fuels the VideoGazeSpeech curation. Source: https://www.semanticscholar.org/paper/Find-Who-to-Look-at%3A-Turning-From-Action-to-Xu-Liu/9f1a854d574d0bd14786c41247db272be6062581

The data needed to be annotated from scratch for the purposes of the new project, and the DarkLabel video/image labeling and annotation tool was used for this.

The open source DarkLabel annotation and labeling tool in action. Source: https://github.com/darkpgmr/DarkLabel

Additionally, the results were reviewed by three people to check accuracy. The final dataset features an average of 2-4 people in each scene, with each clip averaging around 400-500 frames.

In a rare spirit of consideration for later re-use of their project, the researchers converted their database into VOC, COCO and VideoAttentionTarget formats, and also spared downstream projects extensive curation by defining a training/test split, with the training data comprising 31,701 frames and the test data 3,524 frames.
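Exporting annotations to COCO format, as the researchers did, is essentially a serialization exercise; the stripped-down sketch below illustrates the general shape of such a conversion. The `frames` structure and field names are illustrative assumptions, and bear no necessary relation to the authors' own tooling:

```python
import json

def export_coco(frames, out_path="vgs_train.json"):
    """Write gaze-target boxes in minimal COCO detection format.
    `frames` is assumed to be a list of dicts: {"file_name", "width", "height", "boxes"},
    with each box as (x, y, w, h) in pixels -- an illustrative structure, not the
    authors' actual annotation schema."""
    coco = {"images": [], "annotations": [],
            "categories": [{"id": 1, "name": "gaze_target"}]}
    ann_id = 1
    for img_id, frame in enumerate(frames, start=1):
        coco["images"].append({"id": img_id, "file_name": frame["file_name"],
                               "width": frame["width"], "height": frame["height"]})
        for x, y, w, h in frame["boxes"]:
            coco["annotations"].append({"id": ann_id, "image_id": img_id,
                                        "category_id": 1, "bbox": [x, y, w, h],
                                        "area": w * h, "iscrowd": 0})
            ann_id += 1
    with open(out_path, "w") as f:
        json.dump(coco, f)
```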

Since the approach being proposed by the researchers is new, there were no directly comparable prior projects to test it against. Therefore they evaluated their model on its own terms, and against earlier frameworks that do not use audio as a supporting facet.

The rival frameworks tested were DETR, from the 2020 paper Deformable Transformers for End-to-End Object Detection, and VAT, from another paper of that year, Detecting Attended Visual Targets in Video.

Since the correct gaze targets are known from the annotations, the evaluation does not need to classify outcomes from scratch, but only to compare predictions against this known ground truth; thus Average Precision (AP) was used.
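For reference, Average Precision over a set of scored predictions can be computed with standard tooling; in the sketch below, scikit-learn is used purely for illustration, and the notion of a prediction 'matching' the annotated target is an assumed criterion rather than the paper's stated protocol:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Hypothetical per-candidate outputs: confidence scores and whether each prediction
# matched the annotated gaze target (e.g. by an overlap threshold -- the exact
# criterion used in the paper is not specified here).
scores  = np.array([0.92, 0.81, 0.77, 0.40, 0.35, 0.10])
matched = np.array([1,    1,    0,    1,    0,    0])

print(f"AP = {average_precision_score(matched, scores):.3f}")
```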

For training, the learning rate was 0.0025, over 12 epochs, run on two NVIDIA 3090 GPUs, each with 24GB of VRAM.

Quantitative results for audio/no audio on the VGS dataset.

Of the results from the initial quantitative round, the authors comment:

‘The [results] demonstrate that our multimodal network structure gaze candidate estimation network (0.433) outperforms DETR (0.418) and VAT (0.324) in terms of AP performance. Moreover, as the modality increases, the AP of our method and Transformer method performs better than that of a single modality. Interestingly, we found that VAT performs worse when audio cues are added to the feature map, indicating that its network is too simple to handle multimodal information.

‘These results suggest that incorporating audio information into gaze following models, as we did in our gaze candidate estimation model, can lead to significant improvements in accuracy, particularly in real-world scenarios where audio cues play a crucial role.

‘The superiority of our multimodal network structure over traditional CNN methods and Transformer methods also highlights the importance of fusing multimodal information for gaze following detection.’

The system was next tested qualitatively across different backbones (ResNet-101, ResNet-50 and ResNeXt-101), each used with a Feature Pyramid Network (FPN). Gaussian heat-maps were generated for this test.
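Gaussian heat-maps of this kind are conventionally produced by centring a 2D Gaussian on the annotated gaze point; the sketch below shows the general idea, with the sigma value chosen arbitrarily rather than taken from the paper:

```python
import numpy as np

def gaussian_heatmap(h: int, w: int, cx: float, cy: float, sigma: float = 8.0):
    """Return an (h, w) heatmap with a 2D Gaussian centred on the gaze point (cx, cy).
    Sigma is illustrative; the paper does not state the value used."""
    ys, xs = np.mgrid[0:h, 0:w]
    heat = np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2 * sigma ** 2))
    return heat / heat.max()

# Example: a heatmap for a gaze target at the centre of a 720p frame
hm = gaussian_heatmap(720, 1280, cx=640, cy=360)
```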

Qualitative test results using AP, with higher scores better.

Here the authors comment:

‘The [results] clearly demonstrate the effectiveness of our multimodal model in enhancing the prediction accuracy of gaze-following models. In particular, the gaze candidate estimation model shows superior performance compared to other models. This is because the gaze candidate estimation model takes into account the speaker’s mode, which improves its accuracy in social situations.’

Visual results for the qualitative round.

In regard to the visual results provided for the qualitative test at hand (image above), the paper asserts:

‘Our model outperforms the VAT method in accurately detecting the gaze target. In the first frame, our model accurately detects the gaze target where the VAT method failed to do so.

‘This demonstrates the superior performance of our model in terms of gaze target detection. In the second frame, our model accurately detected the speaker as the gaze target in a conversational scenario, while another model failed. Incorporating audio cues is crucial for gaze following, and audio-visual fusion can significantly improve accuracy, especially in real-world scenarios.’

Conclusion

The research covered here seems to have notable potential for downstream tasks, such as providing better semantically-based 'guesses' as to where a person is looking – or, at the very least, disabusing newly-trained systems of the notion that head pose is a major signifier of eye pose.

Convincing gaze control is going to be absolutely essential if completely convincing neural people, or deepfake-style techniques that enhance and augment actors' performances, are to gain a true and resilient foothold in visual effects pipelines. The standard, epitomized by the common belief that the 'eyes follow you round the room' in the work of the true classic masters, is perhaps higher here than in any other respect, while the area of applicable operation is as limited as it could possibly be.
