The ability to control the eye direction of an AI-generated face is one of the most important challenges that the VFX research sector is currently facing in the search for hyper-realistic facial reconstruction and synthesis. Though we, as people, are very finely-tuned to changes in expression as a survival mechanism, we are even more aware of eye direction changes in people that we are observing, since these signal attention – and the possibility (for instance in a group setting), that we have been singled out among an audience of listeners and may need to respond or otherwise take action.

Gaze fixation and eye contact are such central facets of human expression (though neither is currently addressed in traditional facial expression coding schemes such as the Facial Action Coding System), and such strong potential signifiers of territorialism or threat, that the practice has been selectively banned in the United Kingdom.
Nevertheless, giving eye contact adequate attention in machine learning is something of a challenge in itself, at least when considered in the context of the face as a whole. The eye apertures occupy only around 5-7% of the entire face; and when face images are run through a machine learning training pipeline, where even a generous input size remains around 512×512px, the position of the cornea and inner pupil can be affected by the tiniest factor, such as codec-based smoothing, which can alter the fidelity of the eye pose.
Even a trend towards higher resolutions, such as 756×756px and 1024×1024px, does relatively little to improve this state of affairs.
Nonetheless, a number of commercial products have lately been attempting to achieve neural eye-direction control within such constraints, including a new feature in NVIDIA's Broadcast app, which uses machine learning to 'fix' a divergent gaze straight back to camera:
The NVIDIA Broadcast app can now keep you looking to camera, even if you're not. Source: https://www.youtube.com/watch?v=nR-vP_7XFHE
When training on people who are smiling broadly, or who present very little sclera or corneal space, or whose visible eye surface has diminished with age, or who are standing obliquely to camera, or who are of the various races that naturally present less eye aperture, the human capacity for fine-grained gaze evaluation becomes even more difficult for training systems to emulate, since they must consider the entirety of the face, and the eyes only as an equally-considered facet of the whole – a level of attention that does not reflect the importance of eye pose in human culture.

If you have ever observed a celebrity red-carpet event, you'll have noticed that the crowd of photographers tends to call out to the stars in the hope of obtaining eye contact, which is likely to increase the value of the photo, for cultural reasons: the viewer of the photo will respond more favorably to the attention of the subject.
These days paparazzi could arguably use the Eye Direction slider in the Smart Portrait functionality of Photoshop's Neural Filters to alter the outcome – at least until the growing impetus to reveal and constrain AI-based manipulation begins to prejudice this kind of trickery.
In any case, this particular offering from Adobe is a very hit-and-miss affair, often changing the identity of the eye, excessively altering its outer lineaments, and almost never recapturing the full and authentic detail of the original photo:

The problem, as is often the case with computer vision systems, lies both in the data and in the limited ability of training systems to interpret it in a useful way. A great deal of historical research, and a great many of the datasets currently available, treat head direction as an indicator of eye-pose direction, in that it is assumed that the eye is always facing forward, and that therefore, if the head is facing away, the eye is also facing away.

This may be useful to set a ‘baseline’ canonical template in CGI-based neural interfaces such as 3D Morphable Models (3DMMs), but it tends to teach machine systems to make this same false assumption – to the point where it may be necessary to laboriously seek out and include a large number of ‘lateral gazes’ in order to redress the balance.
The best solution would arguably be to improve the quality of gaze estimation itself. Now, new research from the United Kingdom and Korea is seeking to do this by studying the soundtracks of video clips, in order to assess where any particular speaker in a video is likely to be looking.

This novel take on a multimodal approach has produced, the authors claim, the first dataset for gaze following that includes a sound component; in tests, the new work significantly outperformed prior approaches.
The new paper is titled Multi-Modal Gaze Following in Conversational Scenarios, and comes from six researchers across the University of Birmingham in the UK, and the Korea Electronics Technology Institute (KETI).
Method
The new approach is called MMGaze, and undertakes gaze evaluation for each frame of a video, using the soundtrack and detected lip movement to determine which participant is speaking.
Once the extracted features have been processed, the results are pooled into a gaze candidate estimation network, which is further informed by the system having already assigned IDs to each of the potential speakers. Finally, a Multilayer Perceptron (MLP) selects the candidate with the highest probability as the gaze target of the other people featured in the video, which yields an effective eye pose.
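Interpreted loosely, the per-frame logic might be sketched as follows in PyTorch. The module names and interfaces below are invented stand-ins for the paper's components, not the authors' actual code:

```python
import torch
import torch.nn as nn

class CandidateScorer(nn.Module):
    """Hypothetical MLP that scores each gaze candidate for a given subject."""
    def __init__(self, feat_dim=256):
        super().__init__()
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim, 128), nn.ReLU(),
            nn.Linear(128, 1),
        )

    def forward(self, candidate_feats):               # (num_candidates, feat_dim)
        return self.mlp(candidate_feats).squeeze(-1)  # one score per candidate

def gaze_for_frame(frame, audio_clip, speaker_detector, candidate_net, scorer):
    """Per-frame inference, paraphrasing the pipeline described above."""
    # 1. Audio and lip movement decide who is speaking; IDs are assigned.
    person_ids, speaker_id = speaker_detector(frame, audio_clip)

    # 2. The gaze candidate estimation network proposes candidate targets,
    #    informed by the identity and speaker information.
    candidate_feats, candidate_boxes = candidate_net(frame, person_ids, speaker_id)

    # 3. The MLP picks the highest-probability candidate as the gaze target.
    probs = torch.softmax(scorer(candidate_feats), dim=0)
    best = int(torch.argmax(probs))
    return candidate_boxes[best], probs[best]
```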

For the initial task of active speaker detection, MMGaze draws on the prior work Automated lip sync in the wild (ALSW), extracting features from both the visual and the audio components.

The concatenated features are subsequently used to identify speakers. For this process, the S3FD face detection library (long used in the deepfaking software FaceSwap) produces a grayscale image, which is then cropped to the mouth region, based on landmarks. Features were extracted every five frames, in videos with a frame rate of 25fps.
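As a rough illustration of that preprocessing stage, the sketch below uses hypothetical `detect_faces` and `mouth_landmarks` callables in place of the real S3FD and landmark-prediction interfaces, which vary between implementations:

```python
import cv2
import numpy as np

SAMPLE_EVERY = 5   # features are extracted every five frames of the 25fps videos

def mouth_crops(video_path, detect_faces, mouth_landmarks, size=96):
    """Grayscale mouth-region crops, sampled every five frames.

    `detect_faces` and `mouth_landmarks` are hypothetical stand-ins for S3FD
    detection and a facial-landmark predictor."""
    cap = cv2.VideoCapture(video_path)
    crops, idx = [], 0
    while True:
        ok, frame = cap.read()
        if not ok:
            break
        if idx % SAMPLE_EVERY == 0:
            gray = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
            for box in detect_faces(gray):                # (x1, y1, x2, y2)
                pts = mouth_landmarks(gray, box)          # mouth landmark points
                x, y, w, h = cv2.boundingRect(np.int32(pts))
                crop = cv2.resize(gray[y:y + h, x:x + w], (size, size))
                crops.append(crop)
        idx += 1
    cap.release()
    return np.stack(crops) if crops else np.empty((0, size, size))
```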

On the audio side, the SyncNet network from ALSW extracts features that are matched against the motion features obtained from the captured mouth areas. The module is trained with a contrastive loss, requiring correlation between the lip motion features and the complementary audio features – and out-of-frame voices are excluded via an appropriate threshold.
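The contrastive objective itself is standard; a minimal sketch of a SyncNet-style loss, plus a hypothetical distance threshold for rejecting voices that have no matching on-screen mouth, might look like this:

```python
import torch
import torch.nn.functional as F

def sync_contrastive_loss(lip_emb, audio_emb, label, margin=1.0):
    """Contrastive loss in the spirit of SyncNet: in-sync lip/audio pairs
    (label=1) are pulled together, out-of-sync pairs (label=0) pushed apart."""
    d = F.pairwise_distance(lip_emb, audio_emb)                  # Euclidean distance
    loss = label * d.pow(2) + (1 - label) * F.relu(margin - d).pow(2)
    return loss.mean()

def is_active_speaker(lip_emb, audio_emb, threshold=0.6):
    """Hypothetical inference-time check: a mouth whose lip features sit too far
    from the audio features is not treated as the source of the voice. The
    threshold value is an assumption, not one taken from the paper."""
    return F.pairwise_distance(lip_emb, audio_emb) < threshold
```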

The bounding boxes obtained from the detected faces are then converted into identity maps representing the speaker and the listener(s); these maps are stacked with the full-scene image along the channel dimension, and the resulting five-channel image is used for the subsequent candidate estimation task.
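A simple illustration of how such a five-channel input could be assembled; the box format, binary maps and normalization are assumptions for clarity rather than details taken from the paper:

```python
import numpy as np

def build_five_channel_input(rgb, speaker_box, listener_boxes):
    """Stack the full-scene RGB frame with binary identity maps for the speaker
    and the listener(s), producing the five-channel image described above.
    Boxes are assumed to be (x1, y1, x2, y2) in pixel coordinates."""
    h, w, _ = rgb.shape
    speaker_map = np.zeros((h, w), dtype=np.float32)
    listener_map = np.zeros((h, w), dtype=np.float32)

    x1, y1, x2, y2 = map(int, speaker_box)
    speaker_map[y1:y2, x1:x2] = 1.0
    for box in listener_boxes:
        x1, y1, x2, y2 = map(int, box)
        listener_map[y1:y2, x1:x2] = 1.0

    # Channels: R, G, B, speaker identity map, listener identity map
    return np.dstack([rgb.astype(np.float32) / 255.0, speaker_map, listener_map])
```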

The detection of gaze targets in the MMGaze architecture is inspired, the authors say, by the Mask R-CNN object detection initiative from Facebook Research. The augmented images are passed to a ResNeXt-101 model in order to obtain feature maps. Predetermined Regions of Interest (ROIs) are set for each point in the feature map, before being sent to a Region Proposal Network (RPN) for binary classification and bounding-box regression.
With some of the candidates thus filtered out, the remaining ones are passed through the ROIAlign operation from Mask R-CNN, which maps each proposal from the original source image onto the corresponding pixels of the feature map; a new fixed-size feature map is thereby generated, with the candidate frames subject to further regression through the use of a fully convolutional network (FCN).
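The ROIAlign step itself is a standard operation; a minimal demonstration using torchvision's `roi_align`, with invented feature-map and proposal values, shows how image-space boxes are resampled into fixed-size feature maps:

```python
import torch
from torchvision.ops import roi_align

# feature_map: backbone output (batch, channels, H/stride, W/stride); values invented
feature_map = torch.randn(1, 256, 50, 50)

# proposals: one (K, 4) tensor of boxes per image, in input-image (x1, y1, x2, y2)
# coordinates, as an RPN might produce them; the values here are illustrative only
proposals = [torch.tensor([[40.0, 60.0, 120.0, 180.0],
                           [300.0, 90.0, 420.0, 260.0]])]

# ROIAlign resamples each proposal into a fixed-size (7x7) feature map, using
# spatial_scale to map image-space coordinates onto the downscaled features.
pooled = roi_align(feature_map, proposals, output_size=(7, 7),
                   spatial_scale=1.0 / 16, sampling_ratio=2, aligned=True)
print(pooled.shape)   # torch.Size([2, 256, 7, 7]) -- one fixed-size map per proposal
```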
Lastly, an MLP is trained to map each subject to the candidate most likely to be their gaze target.
Data and Tests
In the absence of any appropriate extant dataset, the authors compiled what they contend to be the first gaze dataset that incorporates audio, titled VideoGazeSpeech.

The new dataset comprises 35,231 video frames across 29 videos, each with a duration of around 20 seconds, and each running at 25fps, at a resolution of 1280x720px, for a total occupancy of 7.2GB.
VideoGazeSpeech is an augmented subset of the dataset associated with the 2020 paper Find Who to Look at: Turning From Action to Saliency.

The data needed annotating from scratch for the intended purpose of the new project, and the DarkLabel video/image labeling and annotation tool was used in this process.

Additionally, the results were reviewed by three people to check accuracy. The final outcome featured an average of 2-4 people in each scene, with each clip averaging around 400-500 frames.
In a rare spirit of consideration for later re-use of their project, the researchers converted their database into VOC, COCO and VideoAttentionTarget formats, and also spared downstream projects extensive curation by defining a training/test split, with the training data comprising 31,701 frames and the test data 3,524 frames.
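For readers planning to reuse the data, a COCO-style record for one of these annotations might look something like the following; the standard fields follow COCO conventions, while the gaze-specific fields here are illustrative assumptions rather than the dataset's actual schema:

```python
# Purely illustrative COCO-style annotation record. "bbox" is [x, y, width, height]
# per COCO convention; "gaze_x", "gaze_y" and "person_id" are assumed field names.
annotation = {
    "id": 58321,
    "image_id": 10231,
    "category_id": 1,                    # person
    "bbox": [412.0, 96.0, 88.0, 88.0],   # head box
    "area": 7744.0,
    "iscrowd": 0,
    "gaze_x": 640.5,                     # hypothetical: annotated gaze target (pixels)
    "gaze_y": 358.0,
    "person_id": 2,                      # hypothetical: identity within the clip
}
```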
Since the approach being proposed by the researchers is new, there were no directly comparable prior projects to test it against. Therefore they evaluated their model on its own terms, and against earlier frameworks that do not use audio as a supporting facet.
The rival frameworks tested were DETR, from the 2020 paper Deformable Transformers for End-to-End Object Detection, and the VAT system from another paper of that year, Detecting Attended Visual Targets in Video.
Since the ground-truth gaze targets are known from the annotations, no elaborate suite of metrics was needed; predictions only had to be compared against those known targets, and so Average Precision (AP) was used.
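As a reminder of what the reported figures measure, Average Precision summarizes the precision/recall trade-off across confidence thresholds; a toy computation with scikit-learn, using invented match labels and confidence scores, looks like this:

```python
import numpy as np
from sklearn.metrics import average_precision_score

# Invented example: 1 means the predicted gaze target matched the annotation
# (e.g. under some distance or overlap criterion); scores are model confidences.
y_true  = np.array([1, 0, 1, 1, 0, 0, 1])
y_score = np.array([0.92, 0.85, 0.77, 0.60, 0.55, 0.30, 0.25])

print(f"AP = {average_precision_score(y_true, y_score):.3f}")
```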
For training, a learning rate of 0.0025 was used over 12 epochs, run on two NVIDIA RTX 3090 GPUs, each with 24GB of VRAM.
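A minimal training-loop sketch under those settings might look as follows; the learning rate, epoch count and two-GPU setup are as reported, while the optimizer choice, momentum, batch size, toy model and random data are assumptions purely for illustration:

```python
import torch
import torch.nn as nn

device = "cuda" if torch.cuda.is_available() else "cpu"
model = nn.Sequential(nn.Flatten(), nn.Linear(5 * 64 * 64, 2)).to(device)
if torch.cuda.device_count() > 1:
    model = nn.DataParallel(model, device_ids=[0, 1])   # two RTX 3090s, 24GB each

optimizer = torch.optim.SGD(model.parameters(), lr=0.0025, momentum=0.9)
criterion = nn.CrossEntropyLoss()

for epoch in range(12):
    inputs = torch.randn(8, 5, 64, 64, device=device)   # stand-in five-channel inputs
    labels = torch.randint(0, 2, (8,), device=device)   # stand-in targets
    optimizer.zero_grad()
    loss = criterion(model(inputs), labels)
    loss.backward()
    optimizer.step()
```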

Of the results from the initial quantitative round, the authors comment:
‘The [results] demonstrate that our multimodal network structure gaze candidate estimation network (0.433) outperforms DETR (0.418) and VAT (0.324) in terms of AP performance. Moreover, as the modality increases, the AP of our method and Transformer method performs better than that of a single modality. Interestingly, we found that VAT performs worse when audio cues are added to the feature map, indicating that its network is too simple to handle multimodal information.
‘These results suggest that incorporating audio information into gaze following models, as we did in our gaze candidate estimation model, can lead to significant improvements in accuracy, particularly in real-world scenarios where audio cues play a crucial role.
‘The superiority of our multimodal network structure over traditional CNN methods and Transformer methods also highlights the importance of fusing multimodal information for gaze following detection.’
The system was next tested qualitatively across different backbones (ResNet-50, ResNet-101 and ResNeXt-101), each used with a Feature Pyramid Network (FPN). Gaussian heat-maps were generated for this test.
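Such heat-maps are typically rendered as a 2D Gaussian centred on the predicted or annotated gaze point; a small sketch of that visualisation step (sigma is an arbitrary choice, not a value taken from the paper):

```python
import numpy as np

def gaussian_heatmap(height, width, cx, cy, sigma=20.0):
    """2D Gaussian centred on a gaze point, for overlaying on a video frame."""
    ys, xs = np.mgrid[0:height, 0:width]
    return np.exp(-((xs - cx) ** 2 + (ys - cy) ** 2) / (2.0 * sigma ** 2))

# Example: a heat-map for a gaze target at the centre of a 1280x720 frame
heat = gaussian_heatmap(720, 1280, cx=640, cy=360)
```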

Here the authors comment:
‘The [results] clearly demonstrate the effectiveness of our multimodal model in enhancing the prediction accuracy of gaze-following models. In particular, the gaze candidate estimation model shows superior performance compared to other models. This is because the gaze candidate estimation model takes into account the speaker’s mode, which improves its accuracy in social situations.’

In regard to the visual results provided for the qualitative test at hand (image above), the paper asserts:
‘Our model outperforms the VAT method in accurately detecting the gaze target. In the first frame, our model accurately detects the gaze target where the VAT method failed to do so.
‘This demonstrates the superior performance of our model in terms of gaze target detection. In the second frame, our model accurately detected the speaker as the gaze target in a conversational scenario, while another model failed. Incorporating audio cues is crucial for gaze following, and audio-visual fusion can significantly improve accuracy, especially in real-world scenarios.’
Conclusion
The research covered here seems to have notable potential for downstream tasks, such as better semantically-based ‘guesses’ for where a person is looking – or, at the very least, to disabuse newly-trained systems of the notion that head pose is a major signifier for eye pose.
Convincing, controllable eye direction is going to be absolutely essential if completely convincing neural people, or deepfake-style techniques that enhance and augment actors' performances, are to gain a true and resilient foothold in visual effects pipelines. The standard, epitomized by the common belief that 'the eyes follow you round the room' in the work of the true classic masters, is perhaps higher than any other in this respect, while the area of applicable operation is as limited as it could possibly be.