Bringing Temporal Coherence to Stable Diffusion with Flow Maps

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

A new collaboration between China, Australia and France is offering a potential way to bring temporal stability to animations in text-to-image systems such as Stable Diffusion. The new method can impose movement extracted from other clips onto faces and bodies; provide potentially infinite human animations from a single starting point; effectively edit existing images; and disentangle motion (the way something moves) from content (what something is).

Leftmost, the driving Tai Chi video source. The motion vectors obtained from this source clip are then used to animate other characters from a single source frame, with minimal or zero entanglement. See the original source videos for better resolution and smoother representation. Source: https://wyhsirius.github.io/LEO-project/

Titled LEO, the new ‘Generative Latent Image Animator for Human Video Synthesis’ achieves spatio-temporal coherence through the use of flow maps – a temporal prediction and mapping technique often used in cartography and predictive analytics, but which is also applicable to neural synthesis.

Faces talking, using flow map-generated motion vectors, under LEO; in all cases, only the first frame of the content was provided, while the disentangled motion vectors 'suck' the identity into a clean motion path. See the original source videos for better resolution and smoother representation.

Flow maps use symbols such as dots, lines, or arrows to plot the movement of points across a single image or entity, such as a migratory path in a historical map, or trends in consumption across a representative temporal graph.

A flow map of world air routes. Source: https://en.wikipedia.org/wiki/File:World_Air_Routes.png

A form of flow mapping is also used in optical flow, which effectively ‘unwraps’ a video into a single entity (rather than hundreds or thousands of individual frames), so that changes across the video content can be seen at a glance, or illustrated between frames:

Optical flow can make visually explicit the trends in movement in a scene. Once mapped, these movements can be used as anchors for visual effects pipelines and other procedural techniques. Source: https://www.researchgate.net/figure/Optical-flow-field-vectors-shown-as-green-vectors-with-red-end-points-before-and-after_fig6_290181771
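The central role of such a flow field can be sketched in a few lines of NumPy: given a dense field of per-pixel displacements, each output pixel is pulled from the source location it flowed from. This is a minimal illustration of flow-based warping in general, not LEO's implementation, and all names here are illustrative.

```python
import numpy as np

def warp_frame(frame, flow):
    """Warp a frame by a dense flow field (backward warping).

    frame: (H, W) array; flow: (H, W, 2) array of per-pixel
    (dy, dx) displacements. Each output pixel is sampled from the
    source location it flowed from, rounded to the nearest pixel.
    """
    h, w = frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = np.clip((ys - flow[..., 0]).round().astype(int), 0, h - 1)
    src_x = np.clip((xs - flow[..., 1]).round().astype(int), 0, w - 1)
    return frame[src_y, src_x]

# A 16x16 frame with a bright 4x4 square at the top-left.
frame = np.zeros((16, 16))
frame[2:6, 2:6] = 1.0

# A constant flow field: everything moves 3 pixels down and 2 right.
flow = np.zeros((16, 16, 2))
flow[..., 0] = 3
flow[..., 1] = 2

moved = warp_frame(frame, flow)
print(moved[5:9, 4:8].sum())  # → 16.0 (the square now sits at rows 5-8, cols 4-7)
```

Real optical-flow fields are estimated per frame pair and vary across the image; the constant field here simply makes the warping step easy to see.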

The researchers of the new work have developed a method to extract movement data from source videos without pulling the content in for the ride, achieving a new standard in disentanglement, and allowing for very ‘clean’ imposition of motion into static ‘starting’ frames, which are then brought to life by the derived motion vectors.

See the original source videos for better resolution and smoother representation.

In contrast to previous works, LEO consists of two separate and distinct phases of training and synthesis. The Latent Image Animator (LIA) framework extracts motion priors via flow maps, creating a subject-agnostic ‘template’ into which content can be inserted; and a Latent Motion Diffusion Model (LMDM) uses diffusion-based synthesis to impose novel content within the extracted motion parameters.

See the original source videos for better resolution and smoother representation.

LEO uses a novel motion extrapolation technique to overcome the traditional limitations of video datasets, which tend to consist of very short movements. For this reason, many of the recent text-to-video frameworks that have received attention and acclaim tend to output ‘meme’-style short and repetitive videos – a shortcoming that LEO does not share.

In qualitative and quantitative tests, as well as in a user study, LEO was able to improve upon the current state of the art, and the authors comment:

‘We quantitatively and qualitatively [evaluate the] proposed method on both, human body and talking head datasets and [demonstrate] that our approach is able to successfully produce photo-realistic, long-term human videos.

‘In addition, we [showcase] that the effective disentanglement of appearance and motion in LEO allows for two additional tasks, namely infinite-length human video synthesis by autoregressively applying LMDM, as well as content-preserving video editing (employing an off-the-shelf image editor (e.g., ControlNet)).

‘We postulate that LEO opens a new door in design of generative models for video synthesis and plan to extend our method onto more general videos and applications.’

The new paper is titled LEO: Generative Latent Image Animator for Human Video Synthesis, with an associated project site containing many videos, and comes from six researchers from Shanghai Artificial Intelligence Laboratory (one of the most active research centers for human image synthesis), Monash University in Melbourne, and the Inria center at Université Côte d’Azur.

Approach

The core achievement of LEO is the disentanglement of motion data from content data – a persistent problem in the field of video synthesis, since it’s akin to removing the sugar from a cup of coffee and putting it back in the packet.

Previous attempts have used joint training to address motion and appearance features in parallel, rather than sequentially. Such works include the Snap/NVIDIA MoCoGAN, the NVIDIA/Berkeley outing TATS, the French project InMoDeGAN, ImaGINator, and the Korea-led DIGAN.

Other approaches – such as MoCoGAN-HD, Time-agnostic VQGAN, and VideoGPT – have instead opted for a LEO-style two-phase approach, training an image generator first and a temporal network second. However, these retained the classic obstacle of entanglement, since traces of the content were dragged along with the motion priors, and they were also unable to generate longer free-form videos free of spatial artefacts and other unwelcome distractions.

LEO’s first task is to obtain motion codes from a source video. This is accomplished through the use of (the above-mentioned) Inria 2022 Latent Image Animator project, created by several of LEO’s originators.

The Latent Image Animator project, from 2022, which can impose motion priors into novel imagery. Source: https://arxiv.org/pdf/2203.09043.pdf

Consisting of an encoder and a generator, LIA is trained in a self-supervised manner: a source image is encoded into a latent space, and the results are passed to an optical flow module. To obtain discrete motion codes, the output is then fed to a one-dimensional U-Net derived from Berkeley’s early (2020) work on diffusion models.

However, this only brings LEO up to the current state of the art in motion synthesis from priors, which is beset by entanglement, artefacts, and unwanted detritus from the motion code extraction process.

Therefore, the researchers have developed a second module – a conditional Latent Motion Diffusion Model (cLMDM), which incorporates a Linear Motion Condition (LMC) mechanism.

Conceptual architecture for LEO. Source: https://arxiv.org/pdf/2305.03989.pdf

The cLMDM module adds only noise at each processing step, rather than actual data from the source frames, producing a more stable and discrete series of motion priors.

Previous approaches have been constrained by the minimal length of standard clips from which motion priors are typically obtained. In many popular datasets, such clips are usually around five seconds in length, limiting the potential length of a video that uses them as a motion cue for synthesis.

The LMDM module, instead, can take the final motion code obtained as a cue for the next sequence, effectively bridging two potentially disparate (but presumably compatible) groups of motion priors, and in this way generate a sequence of an arbitrary or unlimited length.
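The chaining idea described above can be sketched in a few lines: each new chunk of motion codes is conditioned on the final code of the previous chunk, so the sequence has no seam and can be extended indefinitely. The function `sample_motion_chunk` below is a hypothetical stand-in for an LMDM sampling pass (a smooth random walk here, diffusion sampling in LEO); all names are illustrative.

```python
import numpy as np

rng = np.random.default_rng(0)

def sample_motion_chunk(start_code, length=64, dim=8):
    """Hypothetical stand-in for one LMDM sampling pass: produce a
    chunk of `length` motion codes that starts from `start_code`
    and drifts smoothly from it."""
    steps = rng.normal(scale=0.05, size=(length, dim))
    return start_code + np.cumsum(steps, axis=0)

def generate_motion(n_chunks, length=64, dim=8):
    """Chain chunks autoregressively: the last motion code of each
    chunk seeds the next one, so the output can be arbitrarily long."""
    code = np.zeros(dim)
    chunks = []
    for _ in range(n_chunks):
        chunk = sample_motion_chunk(code, length, dim)
        chunks.append(chunk)
        code = chunk[-1]          # final code conditions the next chunk
    return np.concatenate(chunks)

motion = generate_motion(n_chunks=4)   # 4 chunks of 64 codes each
print(motion.shape)                    # → (256, 8)
```

Because every chunk begins where the previous one ended, the boundary between chunks is no larger than a single sampling step – which is the property that lets LEO extend a 64-frame motion prior into an arbitrarily long video.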

LEO bridges two motion code sequences into a potentially infinite sequence by dovetailing the end of the previous sequence with a new sequence.

The authors note that the starting frames for these generated sequences can either be real, or obtained from generative frameworks such as Stable Diffusion.

Data and Tests

To put LEO through its paces, the researchers set it against DIGAN, TATS, and the KAUST/Snap project StyleGAN-V.

The datasets used were the popular Tai-Chi-HD (from the First Order Motion Model, FOMM), FaceForensics, and CelebV-HQ. For the Tai-Chi-HD data, both 128×128 and 256×256 resolutions were used, while only 256×256 was used for FaceForensics and CelebV-HQ.

For metrics, the authors used Kernel Video Distance (KVD) and Fréchet Video Distance (FVD). Additionally, a user study was conducted in which twenty humans were asked to evaluate the general video quality of the output; for this, each user was presented with paired videos – one generated by LEO, and the other either generated by one of the rival systems, or an actual real video. The feature extractor used for the metrics was I3D, trained on Kinetics-400.
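FVD and KVD both compare the distribution of extracted video features for real and generated clips; FVD does so with the Fréchet distance between two Gaussians fitted to those features. A minimal NumPy sketch of the idea is below – it assumes diagonal covariances to avoid a matrix square root, so it is an illustration of the metric's form, not the exact FVD implementation (which uses full covariances of I3D features).

```python
import numpy as np

def frechet_distance_diag(feats_a, feats_b):
    """Frechet distance between two feature sets, each modelled as a
    Gaussian with a *diagonal* covariance (a simplification of FVD).

    FD = |mu_a - mu_b|^2 + sum(var_a + var_b - 2*sqrt(var_a*var_b))
    """
    mu_a, var_a = feats_a.mean(0), feats_a.var(0)
    mu_b, var_b = feats_b.mean(0), feats_b.var(0)
    mean_term = np.sum((mu_a - mu_b) ** 2)
    cov_term = np.sum(var_a + var_b - 2.0 * np.sqrt(var_a * var_b))
    return mean_term + cov_term

rng = np.random.default_rng(0)
real = rng.normal(0.0, 1.0, size=(1000, 16))   # stand-in 'real' features
same = rng.normal(0.0, 1.0, size=(1000, 16))   # same distribution
far  = rng.normal(2.0, 1.0, size=(1000, 16))   # shifted distribution

# Matching distributions score near zero; mismatched ones score higher.
print(frechet_distance_diag(real, same) < frechet_distance_diag(real, far))  # → True
```

A lower score therefore means the generated feature distribution sits closer to the real one – which is also why the metric can be blind to per-video temporal defects, a weakness the authors raise below.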

LEO was run on PyTorch and trained on four A100 GPUs, each with 40GB of VRAM. 1,000 diffusion steps were used during training, with a learning rate of 1e-4.
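The 1,000 diffusion steps define a noise schedule for training: the closed-form forward process progressively destroys the signal, and the model learns to reverse it. The sketch below uses the common DDPM linear beta schedule as an assumed default – only the 1,000 steps and the 1e-4 learning rate come from the paper.

```python
import numpy as np

T = 1000                                    # diffusion steps, as in the paper
betas = np.linspace(1e-4, 0.02, T)          # linear DDPM schedule (assumed values)
alphas_bar = np.cumprod(1.0 - betas)        # cumulative fraction of signal kept

def q_sample(x0, t, rng):
    """Closed-form forward process:
    x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps.
    Training minimises the error in predicting eps from x_t."""
    noise = rng.normal(size=x0.shape)
    return np.sqrt(alphas_bar[t]) * x0 + np.sqrt(1.0 - alphas_bar[t]) * noise

rng = np.random.default_rng(0)
x0 = np.ones(8)                             # a toy 'motion code'
noisy_early = q_sample(x0, 10, rng)         # nearly all signal survives
noisy_late = q_sample(x0, T - 1, rng)       # nearly pure noise by the last step
```

In LEO this forward/reverse process runs over compact motion codes rather than pixels, which is part of why the motion priors stay clean of appearance information.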

The first test round was qualitative, and concerned with short video generation, which is the common current standard in the emerging breed of video synthesis frameworks.

Results from the qualitative round.

Regarding these results, the authors state:

‘[The] visual quality of our generated results outperforms other approaches w.r.t. both, appearance and motion. For both resolutions on TaichiHD datasets, our method is able to generate complete human structures, whereas both, DIGAN and TATS fail, especially for arms and legs.

‘When compared with StyleGAN-V on FaceForensics dataset, we identify that while LEO preserves well facial structures, StyleGAN-V modifies such attributes when synthesizing large motion.’

TATS was the sole eligible rival in the task of long video generation, with 512 frames produced from the Tai-Chi-HD dataset. The authors note, as can be seen in the image below, that the TATS generations begin to crash around frame 50, after which the entire video sequence begins to fade out, while the LEO version maintains coherency and continuity throughout.

LEO was pitted against TATS in the long-term video generation qualitative tests.

The authors further observe that the motion priors powering this generation are based only on a 64-frame sequence.

For quantitative evaluation of unconditional (arbitrary) short video generation, LEO obtained SOTA results:

Quantitative results from the tests for unconditional short video generation.

Here the authors observe:

‘LEO systematically outperforms other methods w.r.t. video quality, obtaining lower or competitive FVD and KVD on all datasets. On high-resolution generation, our results remain better than DIGAN.’

However, the researchers assert that FVD, a respected metric in this regard, does not reflect the success obtained in the user study (see below); in fact, they believe that FVD is a flawed metric for which a superior replacement should be sought in due time.

For unconditional long video generation, LEO was compared to StyleGAN-V on FaceForensics, with DIGAN and TATS using Tai-Chi-HD. 128 frames were generated, and – aside from the researchers’ concern for the accuracy of FVD in regards to the StyleGAN-V results – LEO was able to improve on the state of the art here as well:

Quantitative results for unconditional long-term video generation.

The authors ascribe these results to the quality and stability of the motion codes obtained by their LMDM module.

The authors conducted a further ad hoc test of the disentanglement of motion and appearance, in which the same subject in a particular source video was re-animated with different motion (upper and lower frames in the image below). The fact that the results remain coherent and high-quality when the real motion is swapped out is, the authors contend, proof that the motion codes obtained are truly disentangled, and that the trained model has not overfitted to the source data.

Disentanglement of motion and appearance are put to the test.

Finally, the researchers tested the ability of LEO to facilitate video editing on source clips, by adding ControlNet – the popular Stable Diffusion adjunct system – to LEO:

Source clips are used, effectively, for 'deepfake puppeteering' in a LEO video-editing experiment that leverages the popular Stable Diffusion framework ControlNet.

Here the authors comment:

‘Given that the motion space is fully disentangled from the appearance space, our videos maintain the original temporal consistency, uniquely altering the appearance.’

To address the repetitive nature of the Tai-Chi-HD dataset, which can lead to ‘staccato’ repeats of action due to the limited nature of the depicted movements, the authors have built an additional transition diffusion model – and they believe that their system is well-adapted for continuous, even infinite video generation, which has to date been the preserve of less challenging domains, such as endless landscape fly-overs in Google Research’s Infinite Nature.

Conclusion

In cases like these, the potential influence of the new work is always bound up with access and replicability; though the authors have supplied a URL for LEO’s code, the repository is currently empty. Very often, based on the history and habits of research submissions, such code never actually makes it live.

Let’s hope that this is an exception, and that casual researchers get a chance to replicate the clean motion priors that the authors vaunt in the new paper – it could be a game-changer for the current set of objectives in generative video frameworks.
