A new collaboration between researchers in China, Australia and France offers a potential way to bring temporal stability to animations produced with text-to-image systems such as Stable Diffusion. The new method can impose movement extracted from other clips onto faces and bodies; provide potentially infinite human animations from a single starting point; effectively edit existing images; and disentangle motion (the way something moves) from content (what something is).

Titled LEO, the new ‘Generative Latent Image Animator for Human Video Synthesis’ achieves spatio-temporal coherence through the use of flow maps – a temporal prediction and mapping technique often used in cartography and predictive analytics, but which is also applicable to neural synthesis.

Flow maps use linear symbols (like dots, lines or arrows) to plot the movement of points across a single image or entity, such as a migratory path in a historical map, or trends in consumption across a representative temporal graph.

A form of flow mapping is also used in optical flow, which effectively ‘unwraps’ a video into a single entity (rather than hundreds or thousands of individual frames), so that changes across the video content can be seen at a glance, or else illustrated between frames:
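To make the idea concrete, the short sketch below computes a dense optical flow field between two consecutive frames with OpenCV’s off-the-shelf Farnebäck estimator – a generic illustration of per-pixel motion vectors, not the flow module used by LEO; the clip filename is hypothetical.

```python
import cv2

# Read two consecutive frames from a (hypothetical) source clip
cap = cv2.VideoCapture("input_clip.mp4")
_, frame1 = cap.read()
_, frame2 = cap.read()
cap.release()

prev_gray = cv2.cvtColor(frame1, cv2.COLOR_BGR2GRAY)
next_gray = cv2.cvtColor(frame2, cv2.COLOR_BGR2GRAY)

# Dense flow field: flow[y, x] = (dx, dy), the apparent motion of each pixel
flow = cv2.calcOpticalFlowFarneback(prev_gray, next_gray, None,
                                    0.5, 3, 15, 3, 5, 1.2, 0)

magnitude, angle = cv2.cartToPolar(flow[..., 0], flow[..., 1])
print("mean per-pixel displacement:", magnitude.mean())
```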

The researchers of the new work have developed a method to extract movement data from source videos without pulling the content in for the ride, achieving a new standard in disentanglement, and allowing for very ‘clean’ imposition of motion into static ‘starting’ frames, which are then brought to life by the derived motion vectors.

In contrast to previous works, LEO consists of two distinct phases of training and synthesis: the Latent Image Animator (LIA) framework extracts motion priors via flow maps, creating a subject-agnostic ‘template’ into which content can be inserted; and a Latent Motion Diffusion Model (LMDM) uses diffusion-based synthesis to impose novel content within the extracted motion parameters.
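At a high level, the two phases slot together as in the sketch below. The class and method names are hypothetical stand-ins (the official code had not been released at the time of writing), and the dummy bodies exist purely to show how the pieces connect – a frozen LIA supplying motion codes and rendering, and an LMDM synthesizing new motion sequences in that code space.

```python
import torch

class LatentImageAnimator:
    """Stand-in for the pretrained LIA (phase one): appearance encoding and flow-based rendering."""
    def encode_motion(self, frame):
        # The real LIA encodes a frame into a compact motion code; here we return a dummy one.
        return torch.zeros(frame.shape[0], 40)              # hypothetical code size

    def animate(self, start_frame, motion_codes):
        # The real LIA warps the start frame with a flow field per motion code.
        T = motion_codes.shape[1]
        return start_frame.unsqueeze(1).repeat(1, T, 1, 1, 1)

class LatentMotionDiffusionModel:
    """Stand-in for the LMDM (phase two): a diffusion model over motion-code sequences."""
    def sample(self, start_code, length):
        return start_code.unsqueeze(1).repeat(1, length, 1)

lia, lmdm = LatentImageAnimator(), LatentMotionDiffusionModel()

start_frame = torch.rand(1, 3, 256, 256)                    # a real or Stable Diffusion-generated image
start_code = lia.encode_motion(start_frame)                  # condition for the LMDM
motion_codes = lmdm.sample(start_code, length=64)            # a 64-step motion sequence
video = lia.animate(start_frame, motion_codes)               # (1, 64, 3, 256, 256)
print(video.shape)
```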

LEO uses a novel motion extrapolation technique to overcome the traditional limitations of video datasets, which tend to consist of very short movements. For this reason, many of the recent text-to-video frameworks that have received attention and acclaim tend to output ‘meme’-style short and repetitive videos – a shortcoming that LEO does not share.
In qualitative and quantitative tests, as well as in a user study, LEO was able to improve upon the current state of the art, and the authors comment:
‘We quantitatively and qualitatively [evaluate the] proposed method on both, human body and talking head datasets and [demonstrate] that our approach is able to successfully produce photo-realistic, long-term human videos.
‘In addition, we [showcase] that the effective disentanglement of appearance and motion in LEO allows for two additional tasks, namely infinite-length human video synthesis by autoregressively applying LMDM, as well as content-preserving video editing (employing an off-the-shelf image editor (e.g., ControlNet)).
‘We postulate that LEO opens a new door in design of generative models for video synthesis and plan to extend our method onto more general videos and applications.’
The new paper is titled LEO: Generative Latent Image Animator for Human Video Synthesis, with an associated project site containing many videos, and comes from six researchers from Shanghai Artificial Intelligence Laboratory (one of the most active research centers for human image synthesis), Monash University in Melbourne, and the Inria center at Université Côte d’Azur.
Approach
The core achievement of LEO is the disentanglement of motion data from content data – a recurrent problem in the field of video synthesis, since it’s akin to removing the sugar from a cup of coffee and putting it back in the packet.
Previous attempts have used joint training to address motion and appearance features in parallel, rather than sequentially. Such works include the Snap/NVIDIA MoCoGAN, the NVIDIA/Berkeley outing TATS, the French project InMoDeGAN, ImaGINator, and the Korea-led DIGAN.
Other approaches – such as MoCoGAN-HD, Time-agnostic VQGAN, and VideoGPT – have instead opted for a LEO-style two-phase approach that trained an image generator and then a temporal network; but these retained the classic obstacles of entanglement, as traces of content were dragged in along with the motion priors, and they were also unable to generate longer free-form videos free of spatial artefacts and other unwelcome distractions.
LEO’s first task is to obtain motion codes from a source video. This is accomplished through the use of the above-mentioned Latent Image Animator (LIA), a 2022 Inria project created by several of LEO’s originators.

Consisting of an encoder and a generator, LIA is trained in a self-supervised manner, encoding a source image into a latent space, with the results passed to an optical flow module. To obtain discrete motion codes, the results are then fed to a one-dimensional U-Net derived from Berkeley’s early (2020) work on diffusion models.
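The ‘one-dimensional’ aspect simply means that a sequence of motion codes is treated as a signal over time rather than as a stack of images. The toy denoiser below – not the paper’s actual network, and with purely illustrative layer sizes – makes the tensor shapes concrete.

```python
import torch
import torch.nn as nn

class TinyMotionDenoiser(nn.Module):
    """A toy 1D convolutional denoiser over motion-code sequences (illustrative only)."""
    def __init__(self, code_dim=40, hidden=128):
        super().__init__()
        # A motion sequence is treated as a 1D signal of shape (batch, code_dim, frames)
        self.net = nn.Sequential(
            nn.Conv1d(code_dim, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv1d(hidden, hidden, kernel_size=3, padding=1),
            nn.SiLU(),
            nn.Conv1d(hidden, code_dim, kernel_size=3, padding=1),
        )

    def forward(self, noisy_codes, t):
        # A real U-Net would also embed the diffusion timestep t; omitted here for brevity
        return self.net(noisy_codes)

x = torch.randn(2, 40, 64)                                   # 2 clips, 40-dim motion code, 64 frames
print(TinyMotionDenoiser()(x, t=torch.tensor([10, 500])).shape)   # torch.Size([2, 40, 64])
```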
However, this only brings LEO up to the current state of the art in motion synthesis from priors, which is beset by entanglement, artefacts, and unwanted detritus from the motion code extraction process.
Therefore the researchers have developed a second module – a conditional Latent Motion Diffusion Model (cLMDM), which incorporates a Linear Motion Condition (LMC) mechanism.

At each processing step, the cLMDM adds noise to the motion codes alone, rather than to actual data from the source images (i.e., frames), producing a more stable and discrete series of motion priors.
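In other words, the diffusion process lives entirely in motion-code space. The sketch below shows standard DDPM forward noising applied to a motion sequence, with the starting code retained as a condition; the conditioning shown here is illustrative rather than the paper’s exact LMC formulation.

```python
import torch

T_steps = 1000
betas = torch.linspace(1e-4, 0.02, T_steps)              # a common DDPM noise schedule
alpha_bars = torch.cumprod(1.0 - betas, dim=0)

motion_seq = torch.randn(2, 40, 64)                       # clean motion codes: (batch, dim, frames)
start_code = motion_seq[:, :, :1]                         # the starting code, kept as the condition

t = torch.randint(0, T_steps, (2,))                       # a random timestep per sample
noise = torch.randn_like(motion_seq)
a_bar = alpha_bars[t].view(-1, 1, 1)

# q(x_t | x_0): noise is injected into the motion codes themselves;
# no pixel data from the source frames enters the diffusion process
noisy_seq = a_bar.sqrt() * motion_seq + (1.0 - a_bar).sqrt() * noise

# A denoiser would be trained to recover `noise` given (noisy_seq, t, start_code)
```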
Previous approaches have been constrained by the short length of the standard clips from which motion priors are typically obtained. In many popular datasets, such clips are usually around five seconds in length, limiting the potential length of a video that uses them as a motion cue for synthesis.
The LMDM module, by contrast, can take the final motion code obtained as the cue for the next sequence, effectively bridging two potentially disparate (but presumably compatible) groups of motion priors, and in this way generate a sequence of arbitrary or unlimited length.
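The chaining itself is simple to picture: each synthesized chunk ends with a motion code that seeds the next one. The sketch below uses a dummy sampler in place of the real LMDM, with hypothetical dimensions.

```python
import torch

def sample_motion_chunk(start_code, length=64):
    # Stand-in for the LMDM sampler: returns a (batch, length, dim) motion sequence
    drift = 0.01 * torch.randn(start_code.shape[0], length, start_code.shape[1])
    return start_code.unsqueeze(1) + drift

code = torch.zeros(1, 40)                    # hypothetical starting motion code
chunks = []
for _ in range(10):                          # ten chunks of 64 codes each
    seq = sample_motion_chunk(code)
    chunks.append(seq)
    code = seq[:, -1]                        # the last code seeds the next chunk

long_motion = torch.cat(chunks, dim=1)       # (1, 640, 40): arbitrary total length
print(long_motion.shape)
```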

The authors note that the starting frames for these generated sequences can either be real, or obtained from generative frameworks such as Stable Diffusion.
Data and Tests
To put LEO through its paces, the researchers set it against DIGAN, TATS, and the KAUST/Snap project StyleGAN-V.
The datasets used were Tai-Chi-HD (popularized by the First Order Motion Model, FOMM), FaceForensics, and CelebV-HQ. For the Tai-Chi-HD data, both 128×128 and 256×256 resolutions were used, while only 256×256 was used for FaceForensics and CelebV-HQ.
For metrics, the authors used Kernel Video Distance (KVD) and Fréchet Video Distance (FVD). Additionally, a user study was conducted in which twenty participants were asked to evaluate the general video quality of the output; each participant was presented with paired videos – one generated by LEO, and the other either generated by one of the rival systems or taken from real footage. The feature extractor used for the metrics was I3D, trained on Kinetics-400.
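FVD (and its kernel-based cousin KVD) compares the distribution of I3D features extracted from real videos against those extracted from generated ones. The core of FVD is the Fréchet distance between two Gaussians fitted to those features, sketched below with random stand-in features in place of actual I3D embeddings.

```python
import numpy as np
from scipy.linalg import sqrtm

def frechet_distance(feats_real, feats_fake):
    # Fit a Gaussian (mean, covariance) to each feature set and compare them
    mu_r, mu_f = feats_real.mean(0), feats_fake.mean(0)
    cov_r = np.cov(feats_real, rowvar=False)
    cov_f = np.cov(feats_fake, rowvar=False)
    covmean = sqrtm(cov_r @ cov_f)
    if np.iscomplexobj(covmean):              # numerical noise can produce tiny imaginary parts
        covmean = covmean.real
    return float(np.sum((mu_r - mu_f) ** 2) + np.trace(cov_r + cov_f - 2 * covmean))

# e.g. 500 videos per set, 400-dim features (dimensions are illustrative)
print(frechet_distance(np.random.randn(500, 400), np.random.randn(500, 400)))
```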
LEO was implemented in PyTorch and trained on four A100 GPUs, each with 40GB of VRAM. 1,000 diffusion steps were used during training, with a learning rate of 1e-4.
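For readers curious about the mechanics of such a setup, the snippet below is a minimal sketch of multi-GPU training with PyTorch’s DistributedDataParallel at the reported learning rate; the model, objective and choice of Adam are placeholders and assumptions rather than details from the paper, and the script assumes launch via torchrun with four processes.

```python
import os
import torch
import torch.distributed as dist
import torch.nn as nn
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    # torchrun --nproc_per_node=4 sets LOCAL_RANK for each process
    dist.init_process_group("nccl")
    rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(rank)

    model = DDP(nn.Conv1d(40, 40, 3, padding=1).cuda(rank), device_ids=[rank])   # placeholder network
    opt = torch.optim.Adam(model.parameters(), lr=1e-4)                          # reported learning rate; Adam is an assumption

    for _ in range(10):                                       # dummy iterations
        x = torch.randn(8, 40, 64, device=rank)               # a batch of motion-code sequences
        loss = model(x).pow(2).mean()                         # placeholder objective
        opt.zero_grad(); loss.backward(); opt.step()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```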
The first test round was qualitative, and concerned with short video generation – the current standard in the emerging breed of video synthesis frameworks.

Regarding these results, the authors state:
‘[The] visual quality of our generated results outperforms other approaches w.r.t. both, appearance and motion. For both resolutions on TaichiHD datasets, our method is able to generate complete human structures, whereas both, DIGAN and TATS fail, especially for arms and legs.
‘When compared with StyleGAN-V on FaceForensics dataset, we identify that while LEO preserves well facial structures, StyleGAN-V modifies such attributes when synthesizing large motion.’
TATS was the sole eligible rival in the task of long video generation, with 512 frames produced from the Tai-Chi-HD dataset. The authors note, as can be seen in the image below, that the TATS generations begin to break down around frame 50, after which the entire video sequence begins to fade out, while the LEO version maintains coherency and continuity throughout.

The authors further observe that the motion priors powering this generation are based only on a 64-frame sequence.
For quantitative evaluation of unconditional (arbitrary) short video generation, LEO obtained SOTA results:

Here the authors observe:
‘LEO systematically outperforms other methods w.r.t. video quality, obtaining lower or competitive FVD and KVD on all datasets. On high-resolution generation, our results remain better than DIGAN.’
However, the researchers assert that FVD, a respected metric in this regard, does not reflect the results obtained in the user study (see below), and they believe, in fact, that FVD is a flawed metric for which a superior replacement should eventually be sought.
For unconditional long video generation, LEO was compared to StyleGAN-V on FaceForensics, and to DIGAN and TATS on Tai-Chi-HD. 128 frames were generated, and – aside from the researchers’ concern for the accuracy of FVD in regard to the StyleGAN-V results – LEO was able to improve on the state of the art here as well:

The authors ascribe these results to the quality and stability of the motion codes obtained by their LMDM module.
The authors conducted a further ad hoc test regarding disentanglement of motion and appearance, in which the same subject from a particular source video was animated with different motion sequences (upper and lower frames in the image below). The fact that the results remain coherent and high-quality when the real motion is swapped out is, the authors contend, proof that the motion codes obtained are truly disentangled, and that the trained model has not overfitted to the source data.

Finally, the researchers tested the ability of LEO to facilitate video editing on source clips, by adding ControlNet – the popular guidance adjunct for Stable Diffusion – to LEO:

Here the authors comment:
‘Given that the motion space is fully disentangled from the appearance space, our videos maintain the original temporal consistency, uniquely altering the appearance.’
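As a rough illustration of the workflow, the snippet below restyles a starting frame with an off-the-shelf ControlNet pipeline from Hugging Face’s diffusers library – not the authors’ own tooling – after which the edited frame would be re-animated with the previously extracted motion codes. The prompt and file names are hypothetical.

```python
import torch
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel
from diffusers.utils import load_image

# Off-the-shelf Canny-conditioned ControlNet on top of Stable Diffusion v1.5
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

edge_map = load_image("start_frame_canny.png")        # Canny edges of the starting frame (hypothetical file)
edited = pipe("a person in a red silk outfit, photorealistic",
              image=edge_map, num_inference_steps=30).images[0]

# `edited` would then replace the original starting frame before animation,
# with the previously extracted motion codes left untouched
```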
To address the repetitive nature of the Tai-Chi-HD dataset, which can lead to ‘staccato’ repeats of action due to the limited nature of the depicted movements, the authors have built an additional transition diffusion model. They believe that their system is well-adapted for continuous, even infinite video generation, which has to date been the preserve of less challenging domains, such as the endless landscape fly-overs of Google Research’s Infinite Nature.
Conclusion
In cases like these, the potential influence of the new work is always bound up with access and replicability; though the authors have supplied a URL for LEO’s code, the repository is currently empty. Very often, based on the history and habits of research submissions, such code never actually makes it live.
Let’s hope that this is an exception, and that casual researchers get a chance to replicate the clean motion priors that the authors vaunt in the new paper – it could be a game-changer for the current set of objectives in generative video frameworks.