Protecting Neural Videoconferencing From Deepfake Puppeteering Attacks

About the author


Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.


It may eventually amuse us that video-conferencing tools such as Zoom and Skype once had to capture and compress huge amounts of video data and send them around the world in order to let people conduct virtual meetings; or that we had to max out our upstream broadband bandwidth in order to even attempt a semblance of real-time video streaming; or that we would struggle at peak times to recover the packets lost from the other speaker's stream, and need to ask them to repeat what they just said.

An approximation of the 1999-style ‘postage stamp’ and jerky playback that occurs when Zoom is forced to run at 256Kbps at 20% packet loss. Source:

In the future, using resource-intensive video for conferencing may be equivalent to the difference between taking a play on a world tour – with all the associated logistics, access limitations and bottlenecks – or just ‘watching the movie’.

This is the promise of low-bandwidth neural videoconferencing systems such as NVIDIA’s Maxine, which adopts a new dynamic in video-conferencing architecture: the use of ‘deepfake’-style person rendering, where only the essential facial movement is transmitted from the person at the other end, weighing little more than a series of text messages, and which is used to guide a locally-hosted avatar in the receiver’s system.

Besides a rich locus of interest from the research sector in live neural facial puppeteering, NVIDIA is not the only player in this particular space: Facebook AI Research also weighed into low-bandwidth neural videoconferencing with the 2021 paper Low Bandwidth Video-Chat Compression using Deep Generative Models – a representational neural system that requires only 10kbits/s to transmit ‘control’ information from the actual correspondent into locally-rendered actions of a photorealistic avatar.

In the Facebook system, landmarks are evaluated live and sent instantly to the receiver, where they are used to control a locally-situated high-quality photorealistic avatar that has been assembled on-the-fly from initial captures in the transmission. Source:

The central idea is to compose the speaker’s avatar from the initial ‘full data’ packets sent, and then to use this information to recreate the speaker’s current appearance and situation, thereafter using the on-the-fly avatar as a receptacle to recreate the true movements and facial expressions of the speaker, at practically no cost in terms of bandwidth. This involves evaluating the ever-changing facial landmarks of the distant correspondent, using their own system and resources, and sending these to the recipient, where they are used as guidance for rendering:
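To see why this approach is so economical, it helps to compare payload sizes. The sketch below contrasts an uncompressed 720p frame with a per-frame facial landmark packet; the 68-point landmark count and 16-bit coordinate precision are illustrative assumptions (real systems such as Maxine use their own learned representations, and further compression would shrink both figures):

```python
# Back-of-the-envelope comparison of per-frame payload sizes:
# a raw 720p RGB frame versus a 68-point facial landmark set.
# Landmark count and precision are illustrative assumptions only.

import struct

WIDTH, HEIGHT, CHANNELS = 1280, 720, 3
LANDMARKS = 68          # classic dlib-style landmark count (assumption)
FPS = 30

raw_frame_bytes = WIDTH * HEIGHT * CHANNELS            # uncompressed pixels
# Each landmark as two 16-bit fixed-point coordinates:
landmark_payload = struct.calcsize("<" + "hh" * LANDMARKS)

raw_bitrate_mbps = raw_frame_bytes * 8 * FPS / 1e6
landmark_bitrate_kbps = landmark_payload * 8 * FPS / 1e3

print(f"raw frame:       {raw_frame_bytes:,} bytes")
print(f"landmark frame:  {landmark_payload} bytes")
print(f"raw stream:      ~{raw_bitrate_mbps:.0f} Mbit/s (uncompressed)")
print(f"landmark stream: ~{landmark_bitrate_kbps:.1f} kbit/s")
```

Even this naive encoding puts the landmark stream four orders of magnitude below raw video; with entropy coding and a lower update rate, figures like Facebook's quoted 10 kbit/s become plausible.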

Please allow time for the animated GIF below to load

In the NVIDIA Maxine system, landmarks power the motive appearance of the on-the-fly avatar. Upscaling further enhances the quality of the video. Source:

Neural upscaling can be used to improve image quality at minimal cost in local resources, so that the photorealistic avatar can be generated at a lower resolution, but visualized at 720p or higher. At the same time, the fact that the speaker has been ‘neuralized’ in the process means that one can optionally realign them and freely reinterpret the viewpoint of the output avatar:

Please allow time for the animated GIF below to load

'Reframing' a real-world correspondent in Maxine. Source:

Systems of this nature are tempting to develop, because local neural processing capacity, even on mobile devices, has grown notably in recent years, with dedicated machine learning-capable hardware appearing and growing in capability in the new generation of smartphones and tablets.

Additionally, the promise of reducing the bandwidth burden of videoconferencing is of universal interest, both to the end users that must currently contend with the inevitable shortcomings of the system, and the infrastructure providers that would like to reduce network resource usage.

However, the security implications are quite obvious: such approaches essentially resemble deepfake puppeteering systems where the identity being recreated is supposedly the same as the source. The risk, naturally, is that an attacker could intervene in a call, using image and voice recreation to initially misrepresent themselves. The typical weaknesses of live deepfaking would be notably reduced in a situation where the attacker only has to ‘pull it off’ for the initial few seconds in which the puppeteering system is capturing the initial data.

Thereafter, the quality of the attacker’s own live deepfake would not matter so much, since the other participants in the call would be seeing an ‘official’ and approved deepfake personality. In terms of fraud, it’s equivalent to a tense few seconds with the passport inspector; once you’re in, you’re in.

Defending Against Attacks

It’s not known when systems such as Maxine may become widely-diffused, but the security research sector is keen to get ahead of the game. To this end, a recent paper from Drexel University in the USA offers a system of biometric safeguarding that aims to stop people exploiting this very new model in order to impersonate others in ‘neural’ video calls.

The authors envision the advent of video-centered equivalents to the audio-based deepfake attacks that have garnered headlines in recent years:

From the new paper, the core threat scenario, where an individual appropriates the identity of a target speaker. Source:

The authors observe that the threat scenario involves the creation of only a single ‘fake’ frame, from which the subject’s appearance is initially calculated. This is the ‘passport’ checkpoint.

They explain:

‘In a puppeteered video, the biometric identity of the driving speaker is different from that of the reconstructed speaker. Our proposed system leverages this fact to detect puppeteered videos. While the identity of the driving speaker is not directly observable to the receiver, the receiver does have access to the series of facial expression and pose vectors ft sent by the driving speaker. These vectors inherently capture biometric information about the driving speaker.

‘By analyzing the reconstructed video and comparing it to the corresponding ft’s, our system is able to identify biometric differences between the driving and reconstructed speaker present in puppeteering attacks.’

Overview of the proposed defensive system.

The essential idea is to use characteristics in the ‘official’ reconstructed video to discern differences between the reconstruction and the source data. This keeps the attacker permanently ‘in front of the passport officer’, and under constant scrutiny, instead of only needing to fool the system briefly, once.
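The core check can be sketched in a few lines. Here `biometric_distance` compares the transmitted expression/pose vectors against those re-extracted from the reconstructed video; the Euclidean comparison and the toy three-element vectors are illustrative stand-ins for the paper's learned biometric distance, and all names are hypothetical:

```python
# Minimal sketch of the continuous biometric check: compare the vectors
# the driving speaker transmits with those re-extracted from the
# reconstructed video. Euclidean distance is an illustrative stand-in
# for the paper's learned biometric comparison.

import math

def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))

def biometric_distance(driving_vectors, reconstructed_vectors):
    """Mean per-frame distance between transmitted expression/pose
    vectors and those re-extracted from the reconstruction."""
    pairs = zip(driving_vectors, reconstructed_vectors)
    distances = [euclidean(d, r) for d, r in pairs]
    return sum(distances) / len(distances)

# Toy data: in a legitimate call the re-extracted vectors closely track
# the transmitted ones; under puppeteering they diverge systematically.
sent = [[0.10, 0.20, 0.30], [0.12, 0.19, 0.31]]
legit = [[0.11, 0.20, 0.29], [0.12, 0.20, 0.30]]
attack = [[0.40, 0.05, 0.55], [0.38, 0.07, 0.52]]

print(f"legitimate distance:  {biometric_distance(sent, legit):.3f}")
print(f"puppeteered distance: {biometric_distance(sent, attack):.3f}")
```

The salient property is that this comparison runs on every frame for the duration of the call, rather than once at session start.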

One obstacle to this is the need to evaluate biometric accuracy at different depth variations, i.e., to account for the distance of the speaker from the camera. When a speaker moves back in the frame, the biometric traits become more aggregate and general; arguably, this may eventually become a factor in authentication, in systems which require the speaker to maintain a reasonably close distance to the camera, instead of leaning back and letting algorithmic approximation aid their impersonation.

Besides live deepfaking, one additional possible attack scenario is that the attacker will use initial alteration of video to quickly pass through the ‘passport’ stage, and thereafter switch back to their native and true persona, which will never be seen by the other people in the video call. In such a case, the biometric assessment is easier to conduct, since there are manifest differences between the true speaker’s facial landmarks and those which are being interpreted on the other end.

Above, interpreted landmarks from real>real; below, the tell-tale signs of puppeteering in a deceptive reconstruction.

A detection system of this nature also needs to account for the native limitations of the legitimate system, which may generate reconstruction errors when the input is challenging, for instance when the user turns to a very oblique angle, or moves quite suddenly. Such errors should not be interpreted as evidence of a malefactor on the other end. Therefore the proposed system calculates a time-averaged value for biometric distance – a generalized margin for error which will not throw false positives in a legitimate videoconferencing situation.
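A time-averaged tolerance of this kind might be sketched as a sliding-window mean over per-frame biometric distances, so that a brief reconstruction error does not raise an alarm while a sustained mismatch does. The window length and threshold below are arbitrary choices for illustration, not values from the paper:

```python
# Illustrative sliding-window smoothing of per-frame biometric distances,
# so that transient reconstruction errors (a sudden head turn, an oblique
# pose) do not trigger a false alarm. Window length and threshold are
# arbitrary sketch values, not taken from the paper.

from collections import deque

def flag_puppeteering(frame_distances, window=30, threshold=0.25):
    """Yield (frame_index, alarm) after averaging over up to `window` frames."""
    recent = deque(maxlen=window)
    for i, d in enumerate(frame_distances):
        recent.append(d)
        avg = sum(recent) / len(recent)
        yield i, avg > threshold

# A brief spike (two frames of a fast head movement) is absorbed by the
# average; a sustained elevation, as in an attack, is not.
spiky = [0.05] * 10 + [0.6] * 2 + [0.05] * 18
sustained = [0.5] * 30

print("transient spike flagged: ", any(a for _, a in flag_puppeteering(spiky)))
print("sustained offset flagged:", any(a for _, a in flag_puppeteering(sustained)))
```

A production system would presumably also require a minimum number of observed frames before alarming, and tune the threshold per reconstruction framework.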

Data and Tests

To populate a dataset for tests, the researchers curated a novel set of videos of multiple speakers from celebrity interviews obtained from YouTube.  Each video was 20-30 seconds long, and the collection took in 24 celebrities across the racial groups Hispanic, White, Asian and Black, for a total of 72 videos. The celebrity speakers were evenly split across gender, with 12 male and 12 female subjects.

These videos were then used to generate both real and puppeteered talking head videos, in imitation of the target environment. Frameworks used were ReenactGAN; SAFA; X2Face; and DA-GAN.

Examples from the researchers' curated dataset.

Each speaker produced 18 puppeteered videos, using diverse driving videos (i.e., people who would power the recreation, as in a deepfake attack scenario). This yielded 432 videos for each network. The final volume reached 2016 talking head videos covering around 14 hours of video footage. The dataset has been made available on GitLab.

The first round of tests evaluated puppeteering detection, using three deepfake detection frameworks: CNN Ensemble; Efficient ViT; and Cross-Efficient ViT.

Initial results.

The authors comment:

‘We can see that our system [achieves] strong puppeteering detection performance across all four talking head video systems, with an average detection accuracy of 98.03%.’

The researchers also provide a comparative evaluation of Receiver Operating Characteristic (ROC) curves for their defensive system:

Comparative ROC curves for the new system.

The authors observe:

‘These ROC curves demonstrate that we can achieve strong puppeteering detection performance at low false alarm rates. We note that we are still able to achieve strong performance for SAFA even though facial expression and pose vector ft used by SAFA do not correspond to explicit facial landmark positions. Instead, these correspond to learned abstract landmark representations.

‘Despite this, we are still able to use SAFA’s ft’s to measure the biometric distance between the driving and reconstructed speaker.’

The researchers claim that their system notably outperforms conventional deepfake detectors, with around a 20-percentage-point increase in accuracy. However, they also state that this result is not surprising, since such systems are not designed for the study’s target environment and scenario.

Additionally, the team conducted research into the extent to which window size (i.e., the general resolution of the capture area) affects the accuracy of the detection. As can be seen from the graph below, accuracy increases with window size up until an evident plateau, where the size no longer seems to have a direct effect.

The effect of window size.

The project is intended not as a direct rival to any existing systems, but as a bolt-on measure for the upcoming generation of neural videoconferencing frameworks, and, presumably, is not intended as the only safeguard against unwanted interference.

The authors note that current deepfake detectors are naturally prone to falsely flag ‘genuine’ reconstructed video (i.e., occasions where no interference has occurred, and the system is operating as intended), since they are trained to key on non-natural or in some way processed video streams and artifacts.

Though there are no indications that the proposed system introduces any bias relating to race or gender, the authors concede that it will inevitably inherit any bias that may affect the native way that systems interpret landmarks into neural avatars.


In a case where the user is employing full-fledged live deepfaking, with a system such as DeepFaceLive, the proposed detection system would need to rely on errors in transliteration between the real and faked identity coming in over the live-stream. The alternative scenario – where the attacker is able to authenticate with only a single (perhaps static) image – may be harder to pull off, but is less resource-intensive for the attacker.

This approach is most susceptible to the biometric methods employed by the researchers of the new paper, but is also likely to be the most tempting for criminal elements, as it’s opportunistic, and does not involve the intensive curation of target-specific datasets, and the often lengthy training of models.

It is surprisingly difficult to effect a persistent and consistent real-time deepfake over a sustained time-period, particularly if the system is aware of the possibility, and puts the attacker through CAPTCHA-style challenges.

One currently reliable way of detecting a deepfaked video caller. Source:

Therefore duping a vulnerable interpretive system into doing all the synthesizing is obviously preferable for the attacker. It remains to be seen the extent to which neural videoconferencing systems will be prepared for the kind of zero-day exploit envisaged by the researchers of the new paper.
