Generating Temporally Consistent Stable Diffusion Video Directly in the Latent Space

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

We regularly cover the latest attempts from the image and video synthesis research community to address the difficult challenge of achieving temporal coherence using Latent Diffusion Models (LDMs) such as Stable Diffusion. Systems of this kind are designed to produce single images and then discard all the contributing facets, which is unhelpful, since what video generation needs is at least 24 highly-related successive frames for each second of footage.

For this reason, generating video that doesn’t flicker or jump around, or look like an off-cut from A Scanner Darkly (2006), remains a tantalizing goal.

In the absence of a robust solution, cheap tricks abound: a recent demonstration of video generation from RunwayML may have wowed audiences, but was carefully constructed to avoid any of the real pitfalls of LDM-based video generation; a very recent attempt at grafting video codec principles into LDM-based video generation doesn’t address long-term generative continuity; and a similar project that hijacks the principles of optical flow for the same purposes shows some distinct practical limitations.

To boot, the very best of the current crop of contenders can only hope to create improved ‘one off’, ad hoc viral video clips, because the methodologies of all these approaches cannot straddle multiple clips.

Even technologies such as optical flow can only achieve coherence within the bounds of the clip in question. If you want a second clip that’s stylistically and semantically consistent with the previous one, there is absolutely nothing in the neural synthesis research scene that can oblige you – at least for the moment.

Therefore the sector is currently addressing that which it can address: the improved coherence of single clips – a goal which, though less ambitious, is far from definitively achieved at this time.

The DiffSynth Approach

With these limitations in mind (because it’s easy to think that the advent of true generative-video temporal coherence is imminent), a new collaboration from the Chinese academic sector, in concert with Alibaba, is offering yet another novel approach to what we can only begin to think of as ‘the better AI Instagram clip’.

Titled DiffSynth, the new system applies deflickering methods not in pixel space (as with optical flow and similar techniques), but in the trained latent space of the neural network, where the content still exists as abstract, vector-based features, and is still very malleable.

The new system can therefore not only reduce ‘flickering’ (which we can read in this instance as ‘improve temporal coherence’), but can offer a diverse set of templates for varying methods of image/video generation, such as text-to-image and image-to-image, among others. In the multimodal example below, a CGI template is used to generate neural renderings informed by text prompts.

The authors state:

‘Leveraging the research achievements in image synthesis, we have designed pipelines for multiple downstream tasks, including text-guided video stylization, fashion video synthesis, image-guided video stylization, video restoring, and 3D rendering.

‘[…] Without any cherry-picking, we are able to generate coherent and realistic videos. Unsurprisingly, DiffSynth comprehensively outperforms existing baseline methods in quantitative metrics and user studies.’

Fashion synthesis with DiffSynth.

As we can see in the example video above, which demonstrates the ‘fashion synthesis’ application, DiffSynth is more than usually capable of pulling off photorealistic full-body deepfakes, compared to the current run of temporal coherence systems on offer, which tend to avoid that particular challenge.

Though the video restoring capabilities of the new system are arguably less convincing, the extensive video examples (embedded below) for text-guided video stylization demonstrate impressive consistency and lack of flickering – even if it has to be admitted that the challenges thrown at the system are not always among the toughest.

Text-guided video stylization.

The new paper is titled DiffSynth: Latent In-Iteration Deflickering for Realistic Video Synthesis, and comes from eight researchers across East China Normal University, Xiamen University, and the Alibaba Group.


DiffSynth, the authors state, was inspired by prior projects in this space, including Blind Video Temporal Consistency via Deep Video Prior, from 2020*, where the goal is to perform all necessary operations directly in the neural space.

Though the paper provides no architectural schema or visual overview, its authors observe that the challenge for the latent in-iteration deflickering part of the architecture is to ensure that the tensors involved in the generative flow are not treated in isolation (which is their default position in a latent generative workflow).

If this occurs, flickering is inevitable, since each generated frame will not have referred to other frames, or been conditioned on the remainder of the content in the video.

Image-guided video stylization from DiffSynth.

The process described (though not illustrated) by the authors of the new paper passes the image output of ‘isolated’ tensors through the decoder component of a variational autoencoder (VAE). The deflickering method is applied in the latent space to successive frames, and the final sequence of deflickered frames is pushed back into the latent space, from which it will subsequently be rendered out as inference content.
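The round trip described above (decode, deflicker across frames, re-encode) can be sketched in a few lines. This is only a toy illustration under loud assumptions: the ‘VAE’ here is a pair of hypothetical linear maps rather than a deep network, and the deflickering step is a simple temporal average standing in for the PatchMatch-based method discussed below. None of the function names come from the paper.

```python
import numpy as np

rng = np.random.default_rng(0)
LATENT_DIM, PIXEL_DIM = 4, 16

# Hypothetical linear stand-ins for the VAE; a real VAE is a deep network.
W_DEC = rng.standard_normal((LATENT_DIM, PIXEL_DIM))
W_ENC = np.linalg.pinv(W_DEC)  # pseudo-inverse, so encode(decode(z)) recovers z


def decode(latents: np.ndarray) -> np.ndarray:
    """Map (frames, LATENT_DIM) latents to (frames, PIXEL_DIM) toy 'images'."""
    return latents @ W_DEC


def encode(frames: np.ndarray) -> np.ndarray:
    """Map (frames, PIXEL_DIM) toy 'images' back to (frames, LATENT_DIM) latents."""
    return frames @ W_ENC


def deflicker(frames: np.ndarray, radius: int = 1) -> np.ndarray:
    """Placeholder deflickering: average each frame with its temporal neighbours."""
    out = np.empty_like(frames)
    for t in range(len(frames)):
        lo, hi = max(0, t - radius), min(len(frames), t + radius + 1)
        out[t] = frames[lo:hi].mean(axis=0)
    return out


def deflicker_in_iteration(latents: np.ndarray) -> np.ndarray:
    """One decode -> deflicker -> encode round trip over all frames jointly."""
    return encode(deflicker(decode(latents)))


def flicker(x: np.ndarray) -> float:
    """Mean absolute change between consecutive frames."""
    return float(np.abs(np.diff(x, axis=0)).mean())


latents = rng.standard_normal((8, LATENT_DIM))  # 8 independently sampled frames
smoothed = deflicker_in_iteration(latents)
print("flicker before:", flicker(latents), "after:", flicker(smoothed))
```

The point of the sketch is structural: because every frame passes through the same joint deflickering step before returning to the latent space, no frame's tensor is left to evolve in isolation.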

The deflickering component itself is supported by the 2009 PatchMatch project, a collaboration between Princeton, the University of Washington, and Adobe.

PatchMatch, the 2009 collaboration that underpins the deflickering component in DiffSynth.

The new module needs to inpaint the difference between successive frames in a consistent manner – a challenge that the recent VideoControlNet project (see our coverage from last week) addresses with a workflow largely based on the principles of optical flow. However, the authors of the new work eschew this approach, commenting ‘We have also considered [optical flow], but we find that optical flow is typically not accurate when the interval of two frames is large.’

The PatchMatch approach instead divides two adjacent frames into overlapping patches and finds, for each patch in one frame, the best-matching patch in the other. DiffSynth calculates a nearest neighbor field (NNF) that represents these matches, and reconstructs the frame content based on it, before blending the reconstructed frames with the rest of the output.
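To make the idea of a nearest neighbor field concrete, here is a minimal sketch in NumPy. It is an assumption-laden illustration, not the authors' code: PatchMatch proper finds approximate matches quickly through random search and propagation, whereas this version simply compares every patch against every other patch, which is only practical for tiny images. The function names and the 3x3 patch size are our own choices.

```python
import numpy as np


def extract_patches(img: np.ndarray, size: int) -> np.ndarray:
    """Return every overlapping size x size patch of a 2D image, flattened."""
    h, w = img.shape
    rows, cols = h - size + 1, w - size + 1
    patches = np.empty((rows, cols, size * size))
    for y in range(rows):
        for x in range(cols):
            patches[y, x] = img[y:y + size, x:x + size].ravel()
    return patches


def nearest_neighbor_field(target: np.ndarray, source: np.ndarray, size: int = 3) -> np.ndarray:
    """For each target patch, store the (y, x) of the closest source patch.

    Exhaustive search, used here in place of PatchMatch's randomized search.
    """
    tp = extract_patches(target, size)
    sp = extract_patches(source, size)
    rows, cols, dim = tp.shape
    flat_sp = sp.reshape(-1, dim)
    nnf = np.empty((rows, cols, 2), dtype=int)
    for y in range(rows):
        for x in range(cols):
            dist = ((flat_sp - tp[y, x]) ** 2).sum(axis=1)
            nnf[y, x] = divmod(int(dist.argmin()), sp.shape[1])
    return nnf


def reconstruct(source: np.ndarray, nnf: np.ndarray, size: int = 3) -> np.ndarray:
    """Rebuild a frame by pasting the matched source patches and averaging overlaps."""
    rows, cols, _ = nnf.shape
    acc = np.zeros((rows + size - 1, cols + size - 1))
    weight = np.zeros_like(acc)
    for y in range(rows):
        for x in range(cols):
            sy, sx = nnf[y, x]
            acc[y:y + size, x:x + size] += source[sy:sy + size, sx:sx + size]
            weight[y:y + size, x:x + size] += 1
    return acc / weight


rng = np.random.default_rng(0)
source = rng.random((8, 8))   # stands in for one frame
target = source[2:7, 1:6]     # a neighbouring 'frame' whose content is shifted
nnf = nearest_neighbor_field(target, source)
rebuilt = reconstruct(source, nnf)
```

In this contrived case the target is a crop of the source, so every patch has an exact match and the rebuilt frame equals the target. With real adjacent frames the matches are only approximate, which is why the reconstruction is blended with the rest of the output rather than used directly.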

The system also offers some novel approaches to a video synthesis pipeline, including the use of fixed Gaussian noise:

‘When we synthesize images, sampling from the same Gaussian noise leads to the same image if we leave other settings fixed. In video synthesis, the frames in a video are expected to be similar; thus we synthesize each frame from the same Gaussian noise.

‘In some downstream tasks, some information from the input video is supposed to be retained in the edited video, thus we add the same Gaussian noise to each frame.’
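The fixed-noise idea in the quote above is easy to illustrate. The sketch below is a hedged approximation: `shared_noise` and `noise_frames` are hypothetical helper names, and the blending formula is a simplified variance-preserving mix, not the noise schedule of any particular Stable Diffusion sampler.

```python
import numpy as np


def shared_noise(num_frames: int, latent_shape: tuple, seed: int = 0) -> np.ndarray:
    """Sample one Gaussian noise tensor and repeat it for every frame."""
    rng = np.random.default_rng(seed)
    noise = rng.standard_normal(latent_shape)
    return np.broadcast_to(noise, (num_frames, *latent_shape)).copy()


def noise_frames(latents: np.ndarray, strength: float, seed: int = 0) -> np.ndarray:
    """Editing-style start point: mix each frame's latent with the SAME noise.

    Assumption: a simple sqrt(1 - s) / sqrt(s) mix stands in for a real scheduler.
    """
    noise = shared_noise(len(latents), latents.shape[1:], seed)
    return np.sqrt(1.0 - strength) * latents + np.sqrt(strength) * noise


# Text-to-video style: every frame starts from identical noise.
start = shared_noise(num_frames=4, latent_shape=(4, 8, 8), seed=42)

# Video-editing style: frames keep their own content but share the noise term.
video_latents = np.zeros((4, 4, 8, 8))
video_latents[2] += 1.0  # pretend frame 2 differs from the others
noised = noise_frames(video_latents, strength=0.5, seed=42)
```

Because the noise term is identical across frames, any remaining difference between two noised frames comes only from the frames' own content, which is the property the authors are relying on.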

DiffSynth offers a solution to one of the most recurrent bottlenecks in similar systems: the practice of concatenating the generated image with the source image for inpainting of changed regions, which DiffSynth replaces with cross-frame attention. The authors comment**:

‘This is a widely used trick to control the generated content. However, the model will draw unexpected components near the seam line, because it tends to combine the two images into one complete image. Essentially, the information from the reference images is passed to our image mainly by self-attention [i.e. Transformers].

‘Thus, we change self-attention layers to cross-frame attention layers.’

The authors note that the self-attention layers in ControlNet are also changed to cross-frame attention layers, where necessary.
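The difference between ordinary self-attention and cross-frame attention can be shown in a small NumPy sketch. Several things here are assumptions made for brevity: the learned query, key and value projections are omitted, there is a single attention head, and the first frame is used as the reference. The article does not state which frames DiffSynth attends to.

```python
import numpy as np


def softmax(x: np.ndarray, axis: int = -1) -> np.ndarray:
    """Numerically stable softmax."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def self_attention(tokens: np.ndarray) -> np.ndarray:
    """tokens: (frames, n_tokens, dim). Each frame attends only to itself."""
    scores = tokens @ tokens.transpose(0, 2, 1) / np.sqrt(tokens.shape[-1])
    return softmax(scores) @ tokens


def cross_frame_attention(tokens: np.ndarray, ref_index: int = 0) -> np.ndarray:
    """Queries come from every frame; keys and values come from one reference frame."""
    ref = tokens[ref_index]                              # (n_tokens, dim)
    scores = tokens @ ref.T / np.sqrt(tokens.shape[-1])  # (frames, n_tokens, n_tokens)
    return softmax(scores) @ ref                         # (frames, n_tokens, dim)


rng = np.random.default_rng(0)
tokens = rng.standard_normal((3, 5, 8))  # 3 frames, 5 tokens each, 8 channels
isolated = self_attention(tokens)
shared = cross_frame_attention(tokens, ref_index=0)
```

In the self-attention case, no information crosses from one frame to another. In the cross-frame case, every frame's output is a weighted mix of the reference frame's tokens, so appearance is carried across frames without stitching two images into one canvas and risking artifacts at the seam.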

Another interesting wrinkle in the new work is adaptive resolution: the use of non-square, non-conventional training resolutions. Typical training data image resolutions are square, such as 256x256px and 512x512px. However, the authors state that given the flexible architecture of a U-Net, it is trivial to change the format and resolution in downstream tasks, which allows the implicit patches generated in the process to carry greater detail.

The authors observed during their study that certain annotator modules were more inclined to cause flickering than others – notably the OpenPose annotator, which, the authors state, can cause ‘tic of the limbs’ if the frames are not adequately well-defined.

Poses can be imposed and expressed in the Stable Diffusion-based generative pipeline of DiffSynth.

The new method overcomes this with the use of Savitzky-Golay smoothing filters, which provide a cleaner transition between the coordinates of the detected keypoints.
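A Savitzky-Golay filter fits a low-order polynomial to a sliding window of samples and replaces the central sample with the fitted value. The sketch below applies one along the time axis of a keypoint track. The window length of 7 and polynomial order of 2 are our own illustrative choices, since the article does not report the settings used in the paper; in practice, SciPy's `scipy.signal.savgol_filter` offers a ready-made implementation.

```python
import numpy as np


def savgol_coeffs(window: int, polyorder: int) -> np.ndarray:
    """Weights that evaluate a least-squares polynomial fit at the window centre."""
    half = window // 2
    offsets = np.arange(-half, half + 1)
    vander = np.vander(offsets, polyorder + 1, increasing=True)  # columns 1, t, t^2, ...
    return np.linalg.pinv(vander)[0]  # row 0 yields the fitted value at t = 0


def smooth_keypoints(keypoints: np.ndarray, window: int = 7, polyorder: int = 2) -> np.ndarray:
    """keypoints: (frames, joints, 2). Smooth each coordinate along the time axis."""
    half = window // 2
    coeffs = savgol_coeffs(window, polyorder)
    # Repeat the first and last frames so the output keeps the original length.
    padded = np.pad(keypoints, ((half, half), (0, 0), (0, 0)), mode="edge")
    out = np.zeros_like(keypoints, dtype=float)
    for k, c in enumerate(coeffs):
        out += c * padded[k:k + len(keypoints)]
    return out


# A hypothetical single joint moving along a smooth path, with detector jitter.
t = np.arange(20.0)
clean = np.stack([t ** 2, 3 * t + 1], axis=-1)[:, None, :]  # (20, 1, 2)
jitter = np.random.default_rng(0).normal(scale=2.0, size=clean.shape)
smoothed = smooth_keypoints(clean + jitter)
```

Unlike a plain moving average, the polynomial fit preserves genuine motion such as acceleration while suppressing frame-to-frame jitter, which makes it a reasonable fit for limb keypoints.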

Data and Tests

In order to test DiffSynth, the researchers selected 100 high resolution videos from Pixabay. The videos were cut down to 3-5 seconds in length, with none exceeding 150 frames.

The DreamShaper checkpoint, trained on the Stable Diffusion V1.5 model, was used for the tests, along with two ControlNet models – Depth and SoftEdge.

Text-guided video stylization, with two different inputs (middle row): the ControlNet annotator Depth (left) and the SoftEdge annotator (right).

The information fed into the video pipeline is entirely handled by ControlNet, which operates (at best) in grayscale; color information is therefore ignored. As a final sequence of steps, an additional deflickering pass is run, and the contrast ratio and frame sharpness are tweaked. The video for this round of tests is embedded above (beginning with flowers).

For the fashion synthesis tests (video also embedded above), the authors used the 2019 DwNet dataset.

Ablation studies on the dataset created for the 2019 DwNet project, used in the DiffSynth initiative.

The dataset contains hundreds of videos, each featuring a fashion model. Ten source videos were randomly selected from the dataset, and each used to fine-tune Stable Diffusion V1.5. Each fine-tuned model is now essentially a DreamBooth-style template that can be imposed into the generation pipeline.

With another ten random target videos selected, ControlNet’s OpenPose is used to control the pose of the models (please refer to earlier video above).

To quantitatively evaluate the quality of the synthesized videos in these tests, the authors compared DiffSynth to various baseline approaches: Fusing Attentions for Zero-shot Text-based Video Editing (FateZero); Pix2Video; and PicsArt’s Text2Video-Zero. For fashion video synthesis, DiffSynth is compared to DreamPose, a recent joint offering from the University of Washington, UC Berkeley, Google Research, and NVIDIA.

The 2023 DreamPose fashion video synthesis project.

In line with the prior work from Pix2Video, the researchers used Pixel-MSE to estimate the consistency of the resulting videos. However, since FateZero can only process video at 512×512 resolution, it was excluded from the Pixel-MSE estimations.
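As a rough intuition for what a pixel-level consistency score measures, the sketch below computes the mean squared difference between each frame and the next. This is a deliberate simplification and should not be read as the exact metric used in the paper: a published Pixel-MSE may first align consecutive frames (for instance, by warping one onto the other with optical flow) so that legitimate motion is not penalized, and that step is omitted here. The function name is our own.

```python
import numpy as np


def pixel_mse(frames: np.ndarray) -> float:
    """Mean squared difference between each frame and the next one.

    frames: (num_frames, height, width, channels), values in [0, 1].
    Simplification: no motion compensation is applied before comparing.
    """
    frames = frames.astype(np.float64)
    diffs = frames[1:] - frames[:-1]
    return float((diffs ** 2).mean())


static = np.full((5, 4, 4, 3), 0.5)                  # nothing changes
flashing = np.zeros((5, 4, 4, 3))
flashing[1::2] = 1.0                                 # alternates black and white
print(pixel_mse(static), pixel_mse(flashing))
```

Lower values indicate smoother video. A completely static clip scores zero, which is one reason a metric of this kind is reported alongside measures of prompt fidelity and aesthetic quality rather than alone.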

Additional metrics used were CLIP-score, which evaluated the pertinence of the output content to the text prompt for the video; and the Aesthetic Score Predictor to evaluate the aesthetic quality of the output frames. Further metrics used were Pose-MSE and Fréchet Inception Distance (FID).

Results from the quantitative round.

Of these results the authors state:

‘[DiffSynth] clearly outperforms other approaches. It shows that our method can generate smoother videos than others.

‘[…] FateZero can only make minor changes to the frames, failing to generate videos that match the textual description. Pix2Video is excessively focused on textual information, resulting in incoherent frames. Text2Video-Zero performed slightly better than the aforementioned methods but still lags behind our approach.

‘[…] Our method exceeds DreamPose in all metrics. These experimental results demonstrate the effectiveness of DiffSynth.’

Finally, as is customary in this type of project, the researchers conducted a user study, where twenty participants were invited to evaluate videos based on consistency, text/video similarity, and general aesthetics. Here, the majority of the participants favored the results from DiffSynth:

Results from the user study.


This is a difficult paper to evaluate, partly because of the chaotic avalanche of example videos, and partly due to a certain lack of rigor in reporting of methods. Nonetheless, within the bounds of the objective, which is to improve temporal consistency of a single Stable Diffusion-generated (or affected) video clip, DiffSynth appears to have something to offer.

In spite of the gains documented, the search for true and genuinely applicable temporal consistency (after a year of furious effort since the release of Stable Diffusion in August of 2022) is beginning to resemble the desperate years of scrabbling for instrumentality in Generative Adversarial Networks – which was ultimately succeeded by the reluctant realization that while GANs are dazzling, their only potential for the outputting of consistent video is as ‘texture machines’ aided by ancillary technologies such as 3DMMs.

Likewise, the temporal consistency gold rush may be heading for a similar decline, with the LDMs reduced to the status of a tool in wider frameworks, rather than what the community still fervently hopes for – a host network that’s just waiting for a video module that actually works, and can produce consistent and repeatable video styles, objects and general content. To date, there is still no sign of this, and still no evidence that it lies around the corner. Rather, the ‘cheap tricks’ continue to pile up.

* The paper does not clearly cite prior works in a standard fashion, making it difficult to accurately refer to all the contributing projects. We’ve done our best to fill in the gaps, and offer apologies for any errors.

** My conversion of the authors’ inline citations to hyperlinks.
