Temporally Coherent Stable Diffusion Videos via a Video Codec Approach

VideoControlNet

Martin Anderson

New research from China has used the well-established precepts of video frame encoding as a central approach to a new method for creating Stable Diffusion videos that are temporally consistent (i.e., that do not show jarring changes throughout the video).

By using the principles behind video compression as a central tenet of a new architecture, VideoControlNet is able to greatly improve on the state of the art in Stable Diffusion video generation. Click on full-screen button for better detail. Source: https://vcg-aigc.github.io/

Featuring a complex and comprehensive range of demonstration videos at the project page, the new system, dubbed VideoControlNet, imposes a diffusion pipeline on the traditional creation of I-frames (sometimes known as key-frames), P-frames and B-frames, so that the framework has a series of ‘tent-pole’ or pivotal frames (the I-frames and, to a lesser extent, the P-frames) between which to interpolate.

If you have ever tried to scrub forward in a video, and can't quite land on the place that you want to, it's because you're being forced to an I-frame, which is a complete and minimally compressed frame with the maximum amount of information preserved faithfully. P-frames and B-frames, by contrast, are interpretational adjuncts to these tent-pole frames. The only way to 'seek' through a video with complete accuracy is to render it out without any compression, which may turn the file size of the video from megabytes to gigabytes, but which will make every single frame an I-frame. Source: https://en.wikipedia.org/wiki/Inter_frame#/media/File:Block_partition.jpg

As with compression codecs, the use of optical flow (which effectively ‘unwraps’ a video, making motion across multiple frames directly accessible, and not just the current frame or run of frames) helps to maintain long-term coherence. In this way the wild imaginings of Stable Diffusion, which has no native mechanism for this kind of continuity, can be preserved through notable movement as the video proceeds, unlike with EbSynth, which cannot adequately reproduce major motion changes in a source video when interpreting Stable Diffusion content.

Source videos can be freely interpreted by Stable Diffusion prompts, as with traditional text-to-image or image-to-image. Click on full-screen button for better detail. Source: https://vcg-aigc.github.io/

The system makes use of the ControlNet add-on for Stable Diffusion, a pivotal and influential framework offering a variety of methods that can preserve source information when transforming or originating new images. The two ControlNet methods used in VideoControlNet are canny (edge) and depth conditioning. The latter provides improved overall 3D reproduction of subjects, while the former is better at obtaining detail; both are played to their respective strengths in VideoControlNet.

VideoControlNet is able to effect greater changes to video through Stable Diffusion, with much-improved temporal consistency, in comparison to currently available approaches. Source: https://arxiv.org/pdf/2307.14073.pdf

The authors state:

‘[By] using the video coding paradigm that uses motion information for reducing redundancy, our method prevents the regeneration of the redundant areas based on the motion information and thus we can keep better content consistency. Specifically, we set the first frame as the I-frame and divide the following frames into different groups of pictures (GoP), in which the last frame of different GoPs is set as the key frame (i.e., P-frame) and other frames are set as B-frames.’
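This grouping scheme is simple to express in code. The snippet below is a minimal illustration of how frames might be labelled under the description quoted above; it is not taken from the paper's codebase, and the GoP size of ten is an arbitrary assumption.

```python
def label_frames(num_frames: int, gop_size: int = 10) -> list[str]:
    """Label frames as I, P or B following the paper's description: frame 0 is
    the single I-frame, the last frame of each subsequent group of pictures
    (GoP) is a P-frame, and every other frame is a B-frame."""
    labels = []
    for i in range(num_frames):
        if i == 0:
            labels.append("I")          # the sole key I-frame
        elif i % gop_size == 0:
            labels.append("P")          # final frame of each GoP
        else:
            labels.append("B")          # interpolated between key frames
    return labels


print(label_frames(21))
# ['I', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'P',
#  'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'B', 'P']
```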

Click to play, and click on full-screen button for better detail. Source: https://vcg-aigc.github.io/

It should be noted that the Stable Diffusion community has been clamoring for temporal consistency for nearly a year, and that many of the intervening solutions proposed to date (such as NVIDIA’s Align Your Latents, and the LEO project) fall short in one respect or another.

Regarding the samples from VideoControlNet, we can observe that there are only a limited number of human depictions, and no examples of abrupt or sudden movement in those that are attempted, since these are the very situations that not only challenge EbSynth, but have always been an encoding challenge for traditional video codecs.

One of a small number of VideoControlNet videos to include notable human-depicting content. Click to play, and click on full-screen button for better detail. Source: https://vcg-aigc.github.io/

In the ‘landscape’ video example that begins this article, almost every pixel of the image is changing from moment to moment, but the gradation is smooth and slow; in the ‘bus’ example above, the camera is near-static, with many parts of the scene unvarying, which is an aid to compression, as well as to the new system that follows its tenets; in the ‘bear’ (above) and ‘camel’ (below) video examples, we have more movement than either EbSynth or the state-of-the-art in Stable Diffusion video rendering could obtain, but without violent motion.

Click to play, and click on full-screen button for better detail. Source: https://vcg-aigc.github.io/

Tested qualitatively and quantitatively against the very limited number of remotely analogous systems available, VideoControlNet is able to achieve a new state-of-the-art, the authors claim.

Click to play, and click on full-screen button for better detail. Source: https://vcg-aigc.github.io/

The new paper is titled VideoControlNet: A Motion-Guided Video-to-Video Translation Framework by Using Diffusion Model with ControlNet, and is a collaboration between two authors, from Beihang University and the University of Hong Kong.

Approach

The first frame of a video sequence is perforce an I-frame, and VideoControlNet generates this by passing the original frame from the source video to Stable Diffusion, at which point the video will be altered both by the content of the text-prompt and the way that the ControlNet method interprets it (this is visualized in the ‘a’ section on the left of the image below).

Central workflow for the generation of I-frames, P-frames and B-frames.
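As a rough sketch of the I-frame step described above, the following shows how a ControlNet-conditioned Stable Diffusion pass might be run with the Hugging Face diffusers library. The model identifiers, prompt and Canny thresholds are illustrative assumptions rather than the authors’ own settings, though the DPM-Solver sampler and 20-step count match the test configuration reported later in this article.

```python
import cv2
import numpy as np
import torch
from PIL import Image
from diffusers import (ControlNetModel, DPMSolverMultistepScheduler,
                       StableDiffusionControlNetPipeline)

# Canny ControlNet plus Stable Diffusion 1.5 -- commonly-used public weights,
# standing in for whatever checkpoints the authors actually employed.
controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-canny", torch_dtype=torch.float16)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet,
    torch_dtype=torch.float16).to("cuda")

# DPM-Solver sampling, as reported in the paper's test configuration.
pipe.scheduler = DPMSolverMultistepScheduler.from_config(pipe.scheduler.config)

# Condition on the canny edges of the first source frame to produce the I-frame.
frame0 = cv2.cvtColor(cv2.imread("source_frame_000.png"), cv2.COLOR_BGR2RGB)
edges = cv2.Canny(cv2.cvtColor(frame0, cv2.COLOR_RGB2GRAY), 100, 200)
control = Image.fromarray(np.stack([edges] * 3, axis=-1))

i_frame = pipe("a snowy mountain road at dusk",   # the prompt drives the restyling
               image=control,
               num_inference_steps=20).images[0]
i_frame.save("i_frame_000.png")
```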

The B-frame is the most dependent and accommodating of the frames generated, and will where possible repeat information from the I or P frames, to save resources.

In VideoControlNet, the creation of B-frames is handled by the dedicated motion-guided B-frame interpolation (‘MgBI’) module (‘b’, in the middle of the image above).

The motion-guided P-frame generation module (‘MgPG’ – ‘c’ on the right of the image above) created for VideoControlNet is central to the system’s capacity to create temporally consistent video. MgPG, as the authors note, uses the motion information of the source video to prevent the regeneration of redundant (or inaccurately repeated) material from frame to frame.

As with conventional video compression routines, it can happen that new material appears between these pivotal frames. VideoControlNet does not create additional I-frames ad hoc, out of regular sequence, to address this (some video encoders do insert such frames, but it remains a relatively uncommon approach).

For these cases, a dedicated inpainting module has been included in VideoControlNet, which can introduce novel material as necessary. Such frames are generated with the help of backward warping, in which optical flow is used to carry information from a previous reference frame into the frame currently being built, just as optical flow in general allows ‘looking forward’ into the content of subsequent frames.

Backward warping refers back to a previous I-frame in order to create an apposite intermediary frame. Source: https://www.youtube.com/watch?v=9iN-dAKqcwM
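In code, backward warping amounts to a per-pixel lookup into a reference frame, directed by the flow field. The sketch below is a generic NumPy/OpenCV illustration of the idea rather than the paper’s implementation; the returned mask flags pixels with no valid source, the kind of occluded region that VideoControlNet hands to its diffusion-based inpainting module.

```python
import cv2
import numpy as np

def backward_warp(reference: np.ndarray, flow: np.ndarray):
    """Warp a reference frame (H, W, 3) towards the current frame using a dense
    backward optical-flow field (H, W, 2): output pixel (x, y) is sampled from
    (x + flow_x, y + flow_y) in the reference frame."""
    h, w = flow.shape[:2]
    grid_x, grid_y = np.meshgrid(np.arange(w), np.arange(h))
    map_x = (grid_x + flow[..., 0]).astype(np.float32)
    map_y = (grid_y + flow[..., 1]).astype(np.float32)
    warped = cv2.remap(reference, map_x, map_y,
                       interpolation=cv2.INTER_LINEAR,
                       borderMode=cv2.BORDER_REPLICATE)
    # Pixels whose flow points outside the reference frame have no valid source;
    # these are candidates for the inpainting module described above.
    occluded = (map_x < 0) | (map_x >= w) | (map_y < 0) | (map_y >= h)
    return warped, occluded
```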

The speed of the VideoControlNet approach is greatly aided by the fact that the generation of B-frames (which represent the majority of the frames) does not require dedicated attention from Stable Diffusion, but is instead derived from reference information in the I-frames and P-frames.

Schematic of the motion-guided B-frame interpolation process in VideoControlNet. Motion information from the I-frames and P-frames is estimated, and backward warping generates the warped frames. Though this will be a very familiar concept to anyone versed in video compression architectures, it also resembles the key-frame-driven approach of EbSynth.
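Tying the stages together, the overall per-frame dispatch can be sketched as the hypothetical loop below. The helper names (diffuse_with_controlnet, estimate_flow, backward_warp, inpaint_occlusions, interpolate_b_frame) are placeholders for the modules discussed above, and the structure is a reading of the paper’s figures rather than released code; it also assumes that the clip length fits the GoP grid exactly.

```python
def render_video(frames, prompt, gop_size=10):
    """Hypothetical top-level loop: diffuse the I-frame, propagate P-frames by
    motion-guided warping plus inpainting of occlusions (MgPG), and fill in
    B-frames by interpolation from their enclosing key frames (MgBI)."""
    outputs = [None] * len(frames)
    outputs[0] = diffuse_with_controlnet(frames[0], prompt)      # I-frame

    key_indices = [0] + list(range(gop_size, len(frames), gop_size))
    for prev, curr in zip(key_indices, key_indices[1:]):
        # P-frame: warp the previous key frame, then inpaint what the warp misses.
        flow = estimate_flow(frames[curr], frames[prev])
        warped, occluded = backward_warp(outputs[prev], flow)
        outputs[curr] = inpaint_occlusions(warped, occluded, prompt)

        # B-frames: no diffusion pass at all -- interpolated from both key frames.
        for b in range(prev + 1, curr):
            outputs[b] = interpolate_b_frame(outputs[prev], outputs[curr],
                                             frames[prev], frames[b], frames[curr])
    return outputs
```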

Data and Tests

For testing purposes, VideoControlNet was evaluated on four datasets: High Efficiency Video Coding (HEVC); Ultra Video Group (UVG); MCL-JCV; and the DAVIS dataset. The first three of these are video encoding datasets, while the use of the DAVIS dataset is inspired by its use in the Text2Live system – one of the challengers in the VideoControlNet testing rounds.

Key-frames in the MCL-JCV dataset, one of the datasets used for testing VideoControlNet. Source: http://mcl.usc.edu/wp-content/uploads/2016/09/07532610.pdf

The tests were conducted using the Stable Diffusion V1.5 checkpoint, with the depth and canny map conditioning models. DPM-Solver was employed for sampling, with a fairly economical 20 steps used for the generation of both I-frames and P-frames.

For optical flow estimation, the researchers leveraged the transformer-based FlowFormer architecture, and the experiments were all carried out on a single NVIDIA V100 GPU with 16GB of VRAM.
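For readers who want to experiment without setting up FlowFormer, torchvision’s bundled RAFT model is a convenient stand-in for a dense optical flow estimator; the snippet below uses RAFT purely for illustration and is not the authors’ pipeline.

```python
import torch
from torchvision.io import read_image
from torchvision.models.optical_flow import raft_large, Raft_Large_Weights

weights = Raft_Large_Weights.DEFAULT
model = raft_large(weights=weights).eval()

# RAFT expects batched image pairs of identical size, with height and width
# divisible by 8; the weights' transforms handle scaling and normalisation.
transforms = weights.transforms()
frame_a = read_image("source_frame_000.png").unsqueeze(0)
frame_b = read_image("source_frame_001.png").unsqueeze(0)
frame_a, frame_b = transforms(frame_a, frame_b)

with torch.no_grad():
    # The model returns a list of iteratively refined flow fields; take the last.
    flow = model(frame_a, frame_b)[-1]    # shape (1, 2, H, W)
```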

For quantitative testing, the authors conducted a user study, where each participant was shown 30 video/prompt combinations, with 720 votes ultimately collected across 24 users.

Rival frameworks tested were PicsArt’s Text2Video-Zero and CCPL (Contrastive Coherence Preserving Loss for Versatile Style Transfer) – the latter currently considered the state-of-the-art solution for the target task, according to the authors.

Results from the user study.

Of these results, the authors state:

‘Our VideoControlNet outperforms previous methods [Text2Video-Zero] and [CCPL] by a large margin and achieves 74.7% user preference. It is observed that the generated videos of Text2Video-Zero lack temporal consistency. Although CCPL sometimes achieves comparable results on specific prompts, most generated videos are much worse than the results of our VideoControlNet.’

Further quantitative testing involved the generation of videos from topic-prompts such as paragliding, bus, dogs-jump, etc. Metrics used were Fréchet Video Distance (FVD), Inception Score (IS), CLIPSIM, Learned Perceptual Image Patch Similarity (LPIPS), and Euclidean distance (L2 norm).
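Of these, LPIPS is the easiest to reproduce at home: the widely-used lpips package returns the perceptual distance between two frames in a few lines. The snippet below is a generic illustration of the metric, not the authors’ evaluation script, and the random tensors merely stand in for real video frames.

```python
import lpips
import torch

# AlexNet-backed LPIPS, the most commonly reported variant ('vgg' also exists).
loss_fn = lpips.LPIPS(net='alex')

# Inputs are (N, 3, H, W) tensors scaled to the range [-1, 1].
frame_a = torch.rand(1, 3, 512, 512) * 2 - 1    # placeholder frames
frame_b = torch.rand(1, 3, 512, 512) * 2 - 1

distance = loss_fn(frame_a, frame_b)            # lower = more perceptually similar
print(distance.item())
```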

Because CCPL is a style transfer approach that operates on the video directly, rather than being driven by a text prompt, only Text2Video-Zero was tested in this round.

Quantitative results for VideoControlNet on the DAVIS dataset.

Here the authors comment:

‘[Our] method outperforms the SOTA diffusion-based method [Text2Video-Zero] in terms of all metrics including FVD, IS, FID, CLIPSIM, LPIPS and Optical Flow Error on the DAVIS dataset with faster inference speed, which demonstrate the effectiveness of our method’

For a qualitative round, the researchers fed various prompts to both Text2Live and VideoControlNet. Text2Live relies on Layered Neural Atlases, in which the entirety of a video clip is effectively ‘unfolded’ or concatenated into a single image that can be addressed in one pass, in a manner similar to optical flow, but aimed more towards direct editing of textures.

The resulting videos do not currently appear to be included in the extensive range of clips featured at the project site, but the researchers provide still images and commentary.

Static image results from the qualitative comparison round with Text2Live.

‘[The] generated video of our method has better visual quality than the generation results from Text2LIVE due to the strong generation quality of [Stable Diffusion]. For example, in the snow scene, the road of our generated video is more realistic than the output of Text2LIVE.

‘We also observe that the Text2LIVE method achieves good temporal smoothness due to its reliance on Layered Neural Atlases [15]. However, the generation quality of Text2LIVE outputs varies a lot on different types of videos. For example, in the nighttime scene, the road is illuminated without street lights.

‘Moreover, the inference speed of Text2Live is extremely slow and it even requires more than 10 hours for editing a single video, while our method generates the video at about 3.4 seconds per frame.’

The system’s inpainting method, designed to fill in occlusions in B-frames (see above), also provides very stable inpainting masks, which effectively furnish a type of semantic segmentation. This functionality in itself may prove interesting in other projects:

VideoControlNet’s targeting capabilities allow the user to address either the foreground (in the video above, the moving human or bear figure) or the background of a source clip. Click to play, and click on full-screen button for better detail. Source: https://vcg-aigc.github.io/

Conclusion

The results from VideoControlNet are arguably among the best – perhaps the best – seen in the long and ongoing search to infuse the versatility of latent diffusion generative models with the capability for temporally consistent video generation. However, we should understand the apparent limitations of this approach, in terms of creating anything longer than a cool meme or a gorgeous Instagram clip.

The achieved congruence across successive frames – which gives the impression of remarkable temporal consistency in the supplied videos that support the paper – is not likely to preserve the diffusion-based alterations across multiple clips, or even across very long clips (i.e., videos which exceed the average five-second length of those made available at the site).

Click to play, and click on full-screen button for better detail. Source: https://vcg-aigc.github.io/

Stable Diffusion and VideoControlNet are systems ‘that deal with the moment’ and each lacks the means to create and recall canonical references for objects in a scene.

Conversely, traditional CGI approaches use models that have distinct and consistent meshes. These meshes will not change unpredictably, and can be summoned up for other work with 100% accuracy; and the same applies for their textures, which are either linked bitmap files and/or procedural textures that are equally controllable.

Click to play, and click on full-screen button for better detail. Source: https://vcg-aigc.github.io/

In theory, in a system such as VideoControlNet, diffusion-imposed changes could be preserved over greater lengths of time through the use of prompts within fine-tuned models such as DreamBooth, or through the use of LoRA or similar adjunct files that are dedicated to preserving the appearance of a style, a person, a scene or an object.

In practice, this is likely to be quite ‘hit-and-miss’ in terms of reproducible and consistent scene entities, because even the most restricted DreamBooth model is subject to other vagaries of the diffusion process.

VideoControlNet allows the rendering process to ‘remember’ for much longer, but it only really extends the retained memory of details in the clip from nanoseconds to about 5-10 seconds (based on the supplied material from the paper).

Though the repurposing of video-encoding methods for this much sought-after purpose is an ingenious innovation, there is nothing in the approach that suggests it could ever be extended to provide consistency across more than sporadic, ad hoc video clips.
