A new research collaboration between Poland and the UK may offer the first effective method to obtain a much-cherished ‘holy grail’ of deepfake image synthesis research – the ability to generate temporally coherent human video from latent diffusion systems such as Stable Diffusion.
The new work is capable of generating human movement that’s genuinely ‘hallucinated’, instead of being driven by a real-world video, and also uses audio speech as its driver.
Though the project’s ability to generate movement from speech is a popular current research trend – not least because it’s marketable in the emerging AI avatar space – what may be more interesting is the novel attention mechanism through which the architecture generates ‘motion frames’, since this approach could potentially enable realistic-looking human movement in a much wider range of applications, and has scant competition at the moment.
The system uses a latent diffusion model trained to learn the distribution of frames extracted from videos. This abstract training is entirely applicable to out-of-distribution (OOD) data, i.e., to novel images introduced by the end user, which the model has never seen.
During inference, based on the trained model, consecutive frames are sampled in an autoregressive workflow, which maintains the source identity (i.e., the image), while creating plausible head and lip movements. The authentic head motion is derived from generalized priors obtained during the training phase of the model.
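The autoregressive workflow described above can be sketched roughly as follows. This is a toy illustration, not the authors' code: `denoise_frame_stub` is a hypothetical stand-in for the trained diffusion model, and the choice to condition on the two most recent frames (`num_context=2`) is an assumption made for the example.

```python
import numpy as np

def denoise_frame_stub(noise, identity, previous):
    """Hypothetical stand-in for the trained model's full reverse-diffusion pass.

    A real model would iteratively denoise `noise`, conditioned on the identity
    frame, recent frames and audio. Here the inputs are simply blended so that
    the surrounding loop can run.
    """
    context = np.mean(previous, axis=0)
    return 0.5 * identity + 0.45 * context + 0.05 * noise

def generate_video(identity, num_frames, num_context=2, seed=0):
    """Sample frames one at a time, feeding earlier outputs back in as context."""
    rng = np.random.default_rng(seed)
    # No frames exist yet, so seed the context with copies of the identity frame.
    frames = [identity] * num_context
    for _ in range(num_frames):
        noise = rng.standard_normal(identity.shape)  # each frame starts as noise
        previous = np.stack(frames[-num_context:])   # most recent generated frames
        frames.append(denoise_frame_stub(noise, identity, previous))
    return np.stack(frames[num_context:])            # drop the seeded copies

identity = np.zeros((8, 8, 3))   # toy 8x8 RGB "source image"
video = generate_video(identity, num_frames=5)
print(video.shape)
```

The point of the loop is that the source image is passed to every step, which is what keeps the identity fixed, while each new frame also sees its immediate predecessors.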
The authors note that, unlike other comparable methods, the new approach requires no additional guidance, such as extra keyframes, or (the most traditional approach) the use of a real video as a guideline for the generated video.
The approach, called Diffused Heads, was evaluated on two datasets at varying complexity levels, achieving, the authors assert, state-of-the-art results. The system was further subjected to a Turing test with 140 participants, who the authors state found the synthesized results ‘indistinguishable from ground-truth videos’.
The new paper is titled Diffused Heads: Diffusion Models Beat GANs on Talking-Face Generation, and comes from researchers at the University of Wrocław, the Wrocław University of Science and Technology, University College London, and the Polish custom software development company Tooploox. A range of sample videos (including those featured here, but at better quality) is available on the project’s homepage. The authors state that they will eventually release the code for public use.
The Search for Stability
Since Stable Diffusion was released to open source last summer, its adherents have been waiting for some method to come along by which it could produce human motion that, ironically, is actually stable instead of ‘sizzling’.
While there has been some progress in actually generating core authentic human body motion through diffusion, this doesn’t solve the rendering problem for human faces, or the fact that Stable Diffusion has no native temporal mechanism at all, and has very limited ability to create a consistent sequence of frames of ‘imaginary’ people (or anything else, for that matter).
This has left the diffusion video scene in a ‘psychedelic’ state, with platforms such as DeForum allowing users to make ‘trip’-like discursive video essays and visualizations, but not fully photorealistic and convincing human movement.
There are a number of ‘cheat’ methods that can turn latent diffusion-generated images into more temporally coherent video, but they’re not specific to diffusion output, and work generically with any image source (such as a real photo).
These include Thin-Plate-Spline-Motion-Model (TPSM), the MyHeritage DeepStory/LiveStory feature (powered by D-ID’s proprietary methods), and EbSynth, a frame interpolation model that’s finally getting a Linux release this year, after five hectic months of use by Stable Diffusion video enthusiasts.
The EbSynth project in particular seems to have been revivified by the new attention from Stable Diffusion fans. Though it’s currently capable of creating very authentic video from Stable Diffusion character output, it is not an AI-based method, but a static algorithm for tweening key-frames (that the user has to provide).
Given this, despite the impressive results, EbSynth is a fairly 20th-century solution to the generative challenge, while its need for multiple consistent character frames is an unsolved problem for Stable Diffusion.
All of these methods are also essentially ‘deepfake puppetry’: one or more source images are used as keyframes in a traditional, Disney-style animation pipeline (albeit an automated one), and the software fills in the ‘missing’ interstitial frames (as with EbSynth and TPSM), powered by a driving video.
Therefore the user still has to do a lot of the heavy lifting; none of these approaches are easy to scale up, and none of them are truly ‘generative’ in the way that the new proposed method is (since it is actually generating diffusion-based interstitial frames, and not just performing 1980s-style morph transitions between manual user input frames).
Diffused Heads, instead, infers adjacent frames by denoising new frames until they have a direct relationship with the source frame (i.e., the frame that the user provides in order to establish the identity for the video). This is accomplished via a 2D U-Net, an architecture originally developed for biomedical image segmentation.
Though the use of voice clips gives Diffused Heads a temporal factor on which to hang its synthesis routine, this alone, as the authors observe, is not adequate to provide smooth and realistic video. Therefore the system uses a novel mechanism called motion frames, a sampling method that acts as a purely neural analog to optical flow (which in itself is not an intrinsically AI-based technology).
In the image above, we see a comparison between the magnitudes of traditional optical flow (above) and the equivalent predicted frames obtained by Motion Frames sampling (below).
In this sense, the system is providing its own keyframes and its own interstitial frames, though, as the authors concede, it’s inevitably bound by the maximum radius of face direction away from the source image that priors can be expected to interpret. It’s not likely to be able to generate a profile view from a passport-style source image – however, that’s a more general problem in human synthesis.
Additionally, Diffused Heads uses motion audio embeddings, which can represent both past and future audio sections, comprising a kind of optical flow for audio content. As with Motion Frames, time-points with inadequate data are padded by duplicating adjacent points, as necessary.
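The padding behavior described above, where a window of past and future time-points is filled out by duplicating the nearest available point, can be illustrated with a short sketch. The function name `context_window` and the window sizes are hypothetical, chosen for the example rather than taken from the paper.

```python
import numpy as np

def context_window(embeddings, t, past=2, future=2):
    """Return embeddings for steps t-past .. t+future.

    Indices that fall before the first or after the last time-point are
    clamped, which duplicates the adjacent edge embedding.
    """
    last = len(embeddings) - 1
    idx = np.clip(np.arange(t - past, t + future + 1), 0, last)
    return embeddings[idx]

# Five toy one-dimensional "audio embeddings" with values 0..4.
audio = np.arange(5, dtype=float).reshape(5, 1)

print(context_window(audio, 0).ravel())  # [0. 0. 0. 1. 2.]
print(context_window(audio, 4).ravel())  # [2. 3. 4. 4. 4.]
```

At the start of the clip there is no past audio, so the first embedding is repeated; at the end, the last one is. The same clamping idea would apply to a window of previously generated frames.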
Though the paper deals extensively with adjunct technologies and approaches relating to the use of audio speech as a driver, we refer the reader to the paper for further detail on this aspect, as it is arguably peripheral to the abstract generative capacity and temporal coherence for video that’s enabled by the new work.
The authors evaluated Diffused Heads on datasets popular with the ‘talking heads’ research sector: the Crowd-sourced Emotional Multimodal Actors Dataset (CREMA) and Oxford University’s Lip Reading in the Wild (LRW) dataset.
Rival prior frameworks in the tests, primarily concerned with lip and mouth generation, were Realistic Speech-Driven Facial Animation with GANs (SDA), Wav2Lip, MakeItTalk, PC-AVS and One-Shot Emotional Talking Face via Audio-Based Emotion-Aware Motion Model (EAMM).
The metrics used for the tests were the structural similarity index measure (SSIM) and peak signal-to-noise ratio (PSNR).
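For readers unfamiliar with these metrics, the following is a minimal sketch of both, assuming images are float arrays scaled to the range 0 to 1. PSNR follows its standard definition; `global_ssim` is a simplification that computes SSIM once over the whole image, whereas library implementations such as scikit-image's `structural_similarity` average it over local windows. It is not the evaluation code used in the paper.

```python
import numpy as np

def psnr(reference, generated, max_value=1.0):
    """Peak signal-to-noise ratio in decibels: 10 * log10(MAX^2 / MSE)."""
    mse = np.mean((reference - generated) ** 2)
    if mse == 0:
        return float("inf")  # identical images
    return 10.0 * np.log10(max_value ** 2 / mse)

def global_ssim(x, y, max_value=1.0):
    """Single-window SSIM over the whole image (a simplification)."""
    c1 = (0.01 * max_value) ** 2   # standard stabilizing constants
    c2 = (0.03 * max_value) ** 2
    mu_x, mu_y = x.mean(), y.mean()
    var_x, var_y = x.var(), y.var()
    cov = np.mean((x - mu_x) * (y - mu_y))
    numerator = (2 * mu_x * mu_y + c1) * (2 * cov + c2)
    denominator = (mu_x ** 2 + mu_y ** 2 + c1) * (var_x + var_y + c2)
    return numerator / denominator

rng = np.random.default_rng(0)
reference = rng.random((16, 16))
noisy = np.clip(reference + 0.1 * rng.standard_normal((16, 16)), 0.0, 1.0)
print(psnr(reference, noisy), global_ssim(reference, noisy))
```

Both metrics compare a generated frame against a specific reference frame, pixel for pixel, so a plausible video that moves differently from the ground truth will be penalized.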
Though Diffused Heads does not entirely lead the board, the authors contend that this is a collateral effect of comparing the system to non-equivalent technologies within the same application space*:
‘The majority of the methods used for comparison utilize additional inputs to guide the generation process. Similarly to [3, 40], we do not provide anything but a single frame and audio, allowing the model to generate anything it wants. For that reason, our synthesized videos are not consistent with the reference ones and get worse measures in the standard metrics. Moreover, as explained in , PSNR favors blurry images and is not a perfect metric in our task, although used commonly.’
To support this standpoint, the researchers conducted a Turing test on the output from all systems, using ten videos from the LRW dataset generated by the current SOTA method, PC-AVS, a further ten from Diffused Heads, and ten actual, real-life videos. Participants were male and female, from varying backgrounds, and were asked to evaluate whether or not the videos they had watched were real.
The authors conclude:
‘Diffused Heads generates videos that are hard to distinguish from real ones. The faces have natural expressions, eye blinks, and grimaces. The model is able to preserve smooth motion between frames and identity from a given input frame. There are hardly any artifacts, and difficult objects such as hair or glasses are generated accurately. Additionally, Diffused Head works well on challenging videos with people shown from a side view.’
* My conversion of the authors’ inline citations to hyperlinks.