Native Temporal Consistency in Stable Diffusion Videos, With TokenFlow

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

The search for temporal consistency in Stable Diffusion continues. Is there some innate property in the world-storming text-to-image generative model that could help it produce truly realistic and smooth video without too many third-party frameworks, or the need to nail its eccentricities down with old-school CGI?

Maybe. New work out of the Weizmann Institute of Science in Israel offers such an insight, claiming that Stable Diffusion has certain native characteristics that can be exploited when generating successive and preceding frames in a potential text-to-video workflow.

Techniques outlined in the new paper – titled TokenFlow: Consistent Diffusion Features for Consistent Video Editing – have been extensively demonstrated on a project page and a supplementary video page*.

Text-to-image transformations take on animated life with TokenFlow. Source: https://diffusion-tokenflow.github.io/sm/supp.html

TokenFlow, the authors explain, explicitly enforces inter-frame video correspondences during editing (an ‘edit’, in this sense, means to transform the entire appearance of a video, rather than move its content into different sequences).

‘Intuitively, natural videos contain redundant information across frames, e.g., depict similar appearance and shared visual elements. Our key observation is that the internal representation of the video in the diffusion model exhibits similar properties. That is, the level of redundancy and temporal consistency of the frames in the RGB space and in the diffusion feature space are tightly correlated.

‘Based on this observation, the pillar of our approach is to achieve consistent edit by ensuring that the features of the edited video are consistent across frames. Specifically, we enforce that the edited features convey the same inter-frame correspondences and redundancy as the original video features.’

On the left of the middle row, we see the more consistent features that are extracted from a video frame source with TokenFlow, facilitating a more temporally coherent video. Source: https://diffusion-tokenflow.github.io/TokenFlow_Arxiv.pdf

The authors note that what they have discovered is applicable also to existing off-the-shelf diffusion methods, and, most unusually, that the technique requires no additional training or fine-tuning.

The researchers claim state-of-the-art editing (i.e., transformational) results across a diverse range of videos and video types, including videos with complex motions – one of the biggest challenges in this particular sub-sector of image synthesis research.

Approach

As we have seen in many examples by now, per-frame text-to-image video synthesis is plagued by the fact that latent diffusion models (LDMs) such as Stable Diffusion are designed to create one-off, individual ‘masterpiece’ images. The characteristics summoned up by a user’s text-prompt are orchestrated into a unique and essentially non-replicable interpretation of the prompt, with that particular blend of features dictated by a random seed.

In theory, you can retain that seed in subsequent or prior frames, in the hope that the adjacent frames will look like the target frame – but as the source content changes (i.e., people move and new objects appear or exit frame), the seed’s consistency in fact becomes redundant and destructive.

Further examples from the qualitative testing round. Source: https://diffusion-tokenflow.github.io/sm/supp.html

The fact that features break between frames in this way led the researchers to recognize that the consistency of a video’s frames is implicitly mirrored in the model’s internal features. Even if that relationship produces unwanted results by default, the fact that it exists at all gives scope not only for new applications and remedial measures, but for a broadly applicable approach that is fairly agnostic to the host generative architecture.

TokenFlow therefore manipulates the diffusion features of the prompt-altered video so that they keep the same inter-frame correspondences as the features of the source video, which in turn leads to temporally consistent results.

Click to play. Though the figure is fixed in the frame, they are moving quite vigorously, in contrast to the recent Runway text-to-video demo**, which also features a person fixed in frame, but who instead remains largely unmoving – a lesser challenge.

As seen in the outline of the conceptual architecture below, TokenFlow alternates at each generation step between two components: the joint editing of a group of keyframes, which will then share a global appearance; and the propagation of the features in those keyframes across the entirety of the rendered frames.

Conceptual architecture for TokenFlow.

Each of these stages is facilitated by the Plug-and-Play Diffusion features project, another Weizmann Institute initiative, from 2022 (co-authored by one of the contributors to the new paper).

To each frame, DDIM inversion (with a classifier-free guidance scale of 1, and 1000 forward steps) is applied in order to extract features. The latent codes extracted in this way are used later in the process to establish correspondences for diffusion features between frames.
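The inversion step can be illustrated with a minimal NumPy sketch of a single deterministic DDIM update, run first in the noising direction and then back again. This is not the authors' code: the function name, the toy alpha values and the `eps` array standing in for the network's noise prediction are all assumptions made for illustration.

```python
import numpy as np

def ddim_step(x, eps, alpha_from, alpha_to):
    """Deterministic DDIM update between two cumulative-alpha levels.

    Moving to a smaller alpha (more noise) is inversion; moving to a
    larger alpha is ordinary sampling. `eps` is a placeholder for the
    noise predicted by the diffusion U-Net (hypothetical here).
    """
    # Estimate of the clean latent implied by the current noise prediction.
    x0_pred = (x - np.sqrt(1.0 - alpha_from) * eps) / np.sqrt(alpha_from)
    # Re-noise that estimate to the target level, reusing the same eps.
    return np.sqrt(alpha_to) * x0_pred + np.sqrt(1.0 - alpha_to) * eps

rng = np.random.default_rng(0)
latent = rng.normal(size=(4, 8, 8))   # toy latent for one frame
eps = rng.normal(size=latent.shape)   # placeholder noise prediction

noisier = ddim_step(latent, eps, alpha_from=0.9, alpha_to=0.5)     # inversion
recovered = ddim_step(noisier, eps, alpha_from=0.5, alpha_to=0.9)  # sampling
```

With an identical `eps` in both directions the two steps are exact inverses. In a real model the noise prediction differs from step to step, so the round trip is only approximate.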

At each generation step, a set of keyframes is chosen at random for propagation – a kind of ‘in house’ generalization that ensures features are uniformly distributed across the clip, instead of blindly looking back or forward by only a few frames.
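One plausible way to draw keyframes that stay evenly spread across the clip is to pick a fresh random offset at each step and then take every n-th frame. The sketch below assumes that scheme; the function name, the fixed `interval` and the frame count are hypothetical, and the paper may sample differently.

```python
import random

def sample_keyframes(num_frames, interval, rng=random):
    """Pick one keyframe every `interval` frames, starting at a random offset."""
    offset = rng.randrange(interval)
    return list(range(offset, num_frames, interval))

rng = random.Random(0)
for step in range(3):
    # A different, evenly spaced set of keyframes for each generation step.
    print(step, sample_keyframes(num_frames=40, interval=8, rng=rng))
```

Because the offset changes at every step, no single frame is privileged as a keyframe over the course of the denoising process.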

The keyframes are jointly edited with extended attention, in which the model’s self-attention layers attend across all of the selected keyframes at once rather than within a single frame, in accordance with the prior methodology of Tune-A-Video.
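The idea of extending attention across several frames can be sketched in NumPy as below. It is a single-head toy with no learned projections, and the function names and tensor sizes are assumptions rather than anything taken from the TokenFlow or Tune-A-Video code.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def extended_attention(q, k, v):
    """q, k, v: arrays of shape (frames, tokens, dim).

    Each frame's queries attend to the keys and values of all frames at once,
    instead of only to the tokens of their own frame.
    """
    frames, tokens, dim = q.shape
    k_all = k.reshape(frames * tokens, dim)   # pool keys from every frame
    v_all = v.reshape(frames * tokens, dim)   # pool values from every frame
    scores = q @ k_all.T / np.sqrt(dim)       # (frames, tokens, frames * tokens)
    weights = softmax(scores, axis=-1)
    return weights @ v_all                    # (frames, tokens, dim)

rng = np.random.default_rng(0)
q = rng.normal(size=(3, 16, 8))
k = rng.normal(size=(3, 16, 8))
v = rng.normal(size=(3, 16, 8))
out = extended_attention(q, k, v)
```

Since every frame draws on the same pooled keys and values, the frames are pushed toward a shared appearance, at a memory cost that grows with the number of frames pooled.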

Click to play. Source: https://diffusion-tokenflow.github.io/sm/supp.html

Regarding the need for two apparently competing approaches inside the same architecture, the authors comment:

‘Intuitively, the benefit of alternating between keyframe editing and propagation is twofold: first, sampling random keyframes at each generation step increases the robustness to a particular selection.

‘Second, since each generation step results in more consistent features, the sampled keyframes in the next step will be edited more consistently.’

However, after all this faux generalization, the process does indeed begin to resemble more traditional codec-based workflows, as it pays attention with greater detail to the preceding and following two frames around the one under study.

At each generation step, the system computes the nearest neighbor (NN) of the derived latent tokens from each original frame, and those of the frames adjacent to it. The result is then transposed onto the edited (i.e., altered by Stable Diffusion) equivalent frame/s.
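The nearest-neighbor transfer can be sketched as follows. This is a simplified illustration that uses Euclidean distance and a single keyframe, both of which are assumptions; the token counts and function names are hypothetical, and the real method works with adjacent keyframes rather than one.

```python
import numpy as np

def nearest_neighbor_field(frame_src, key_src):
    """For each source token of a frame, the index of the closest keyframe source token."""
    # Squared Euclidean distances, shape (frame_tokens, key_tokens).
    d2 = ((frame_src[:, None, :] - key_src[None, :, :]) ** 2).sum(axis=-1)
    return d2.argmin(axis=1)

def propagate(frame_src, key_src, key_edited):
    """Copy edited keyframe tokens onto a frame, using correspondences found in the source."""
    nn = nearest_neighbor_field(frame_src, key_src)
    return key_edited[nn]

rng = np.random.default_rng(0)
key_src = rng.normal(size=(16, 32))      # source tokens of a keyframe
key_edited = rng.normal(size=(16, 32))   # the same keyframe after editing
perm = rng.permutation(16)
frame_src = key_src[perm]                # toy frame: same content, moved around
frame_edited = propagate(frame_src, key_src, key_edited)
```

The correspondences are computed only on the original video's tokens and then reused on the edited tokens, which is what ties the edited frames to the motion of the source.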

Click to play. Source: https://diffusion-tokenflow.github.io/sm/supp.html

The entire process therefore breaks down in this way: the source video is subjected to DDIM inversion, which extracts from it a series of noisy latent codes representing all the frames; the video is then denoised with TokenFlow propagation (as described above), with random selection of keyframes to ensure a generalized and consistent transformative effect; and finally the entire video is denoised, by combining the aforementioned Plug-and-Play Diffusion approach with the Stanford/Carnegie Mellon SDEdit framework (see video directly below).

Data and Tests

TokenFlow was evaluated against a number of more-or-less competing methods, using the DAVIS videos dataset, as well as diverse internet videos featuring food, animals, humans, and assorted moving objects.

Video resolutions used were 384x672px and the more conventional 512x512px, with each video consisting of 200 frames, and subject to a range of text prompts. The evaluation dataset prepared for the project comprised 61 text/video pairs.

The frame-editing method was the above-mentioned PnP-Diffusion, with uniform hyperparameters. The authors note that PnP-Diffusion can corrupt the structure of frames because of inaccurate DDIM inversion, and that TokenFlow improves the method’s performance in this respect, since multiple frames contribute to the generation of each video frame.

The authors observe that TokenFlow can be incorporated into ‘any diffusion-based image editing technique that accurately preserves the structure of the image’. Referring to the results seen in the videos directly below, the authors note:

‘Our edits are temporally consistent and adhere to the edit prompt. The man’s head is changed to Van Gogh or [marble]; importantly, the man’s identity and the scene’s background are consistent throughout the video.

‘The patterns of the polygonal wolf (bottom left) are the same across time: the body is consistently orange while the chest is blue.’

Click to play. Source: https://diffusion-tokenflow.github.io/sm/supp.html

Click to play. Source: https://diffusion-tokenflow.github.io/sm/supp.html

Click to play. Source: https://diffusion-tokenflow.github.io/sm/supp.html

The system was pitted against Text2Video-Zero; the aforementioned Tune-A-Video; RunwayML’s Gen-1 (which you can see an example of here); and Text2LIVE. Since Text2LIVE uses a layered video representation (i.e., Layered Neural Atlases, or NLA) and performs training that uses CLIP losses, the researchers selected DAVIS videos that were compatible with these restrictions.

Additionally, the authors tested against a PnP-Diffusion baseline, and against an approach that applies PnP-Diffusion to a single keyframe and propagates the edit to the entirety of the video via Stylizing Video by Example. We have concatenated the diverse resulting video comparisons from the supplementary materials page into the assemblies embedded below.

Click to play. Source: https://diffusion-tokenflow.github.io/sm/supp.html

Click to play. Source: https://diffusion-tokenflow.github.io/sm/supp.html

Click to play. Source: https://diffusion-tokenflow.github.io/sm/supp.html

Of these results, the authors note that TokenFlow adheres more closely to the edit prompt provided, and offers better temporal consistency. They observe also that Tune-A-Video ‘inflates’ the 2D image model into a video model, and fine-tunes it, which makes it suitable for very short clips only. They comment:

‘Applying PnP for each frame [independently] results in exquisite edits adhering to the edit prompt but, as expected, lack any temporal consistency.’

The researchers contend that the results from RunwayML’s Gen-1 demonstrate ‘significantly worse’ frame quality than a text-to-image diffusion model, while the modifications performed by Text2Video-Zero exhibit ‘heavy jittering’, due to the more limited way that the architecture deploys attention, in comparison to TokenFlow.

They also conducted an additional test against Text2LIVE, which propagates a single keyframe to a whole video. Since this is the purview of EbSynth, an example of EbSynth approximation of the challenge is presented in the supplementary material (see video below) as well.

Click to play. Comparison between TokenFlow, Text2LIVE, and EbSynth. Source: https://diffusion-tokenflow.github.io/sm/supp.html

Here, the researchers observe:

‘Text2LIVE lacks a strong generative prior, thus [has] limited visual quality. Additionally, this method relies on a layered representation of the video, which takes around 10 hours to train and is limited to videos with simple motion. Using [EbSynth] to propagate the edit produces propagation artifacts on frames that are not near the edited keyframe…’

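For a quantitative evaluation, the researchers tested for edit fidelity, using a CLIP score to gauge how well the edited video fits the edit prompt, and for temporal consistency, which was measured by calculating the optical flow of the source video with Princeton University’s 2020 RAFT framework, warping the edited frames according to that flow, and recording the resulting warp error.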
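The temporal-consistency measure can be illustrated with a toy warp-error calculation. The sketch below assumes grayscale frames, a hand-supplied integer flow field and a mean-squared error; in the evaluation described here the flow comes from RAFT, and the exact error norm used by the authors is not stated in this article, so treat all of those choices and the function name as assumptions.

```python
import numpy as np

def warp_error(prev_frame, next_frame, flow):
    """Mean squared error between next_frame and prev_frame warped by `flow`.

    flow[y, x] = (dy, dx) gives, for each pixel of next_frame, the integer
    offset to the matching pixel in prev_frame. Pixels whose match falls
    outside the image are ignored.
    """
    h, w = next_frame.shape
    ys, xs = np.mgrid[0:h, 0:w]
    src_y = ys + flow[..., 0]
    src_x = xs + flow[..., 1]
    valid = (src_y >= 0) & (src_y < h) & (src_x >= 0) & (src_x < w)
    warped = prev_frame[src_y[valid], src_x[valid]]
    return float(np.mean((next_frame[valid] - warped) ** 2))

rng = np.random.default_rng(0)
prev_frame = rng.random((6, 6))
next_frame = np.roll(prev_frame, 1, axis=1)   # content moves one pixel right

true_flow = np.zeros((6, 6, 2), dtype=int)
true_flow[..., 1] = -1                        # each pixel came from one pixel to the left
zero_flow = np.zeros((6, 6, 2), dtype=int)

consistent = warp_error(prev_frame, next_frame, true_flow)
inconsistent = warp_error(prev_frame, next_frame, zero_flow)
```

When the edited frames follow the source motion, warping one onto the next leaves little residual, so the error stays low; flicker or drifting textures raise it.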

Results from the quantitative tests.

The authors comment:

‘Our method achieves the highest CLIP score, showing a good fit between the edited video and the input guidance prompt. Furthermore, our method has the lowest warp error, indicating temporally consistent results.’

In terms of limitations, the researchers concede that since TokenFlow is designed to reinterpret the motion in an original video, it cannot intervene in any major structural way (i.e., it cannot strap a rucksack on the wolf’s back, or put a roof-rack on the car). However, this limitation is shared by all other recent similar forays into text-to-video reinterpretation via Stable Diffusion.

They further admit that a certain amount of flickering can remain in the final result, though they say that this can be ‘easily eliminated’ with post-process deflickering (and, indeed, there are recent dedicated works in this regard).

The code is coming soon, apparently – let’s hope so.

EDIT – Wednesday, September 6, 2023 10:03:30

The code has now been released.

Conclusion

It is notable in this new work that the authors have devised a stabilization method that keys on structure and characteristics which are native to the generative denoising model. The general run of temporal coherency outings produces methods which quite tortuously append and graft on secondary systems, effectively turning Stable Diffusion into little more than a texture generator for optical flow-style approaches.

What must be said, here as with the other recent candidates in this space, is that systems of this nature remain suited solely to ‘Instagram-style’ viral clips – one-off amusements that may dazzle, but cannot be developed into a series of editable clips that feature the same content. In this respect, TokenFlow’s entire world is the clip that it is working on; for a system that can reference external models, lighting conditions, poses, etc., in a consistent way comparable to traditional CGI workflows, we will likely have to wait for new developments.

Click to play. Source: https://diffusion-tokenflow.github.io/sm/supp.html

* The supplementary video page is at https://diffusion-tokenflow.github.io/sm/supp.html; but please be aware that it features more than 80 autoplaying videos, and may be difficult to load. The videos have been converted and concatenated into a more discrete form for this article.

** https://blog.metaphysic.ai/using-ebsynth-to-create-better-nerf-facial-avatars/#runway

My conversion of the authors’ inline citations to hyperlinks.
