The Continuing Struggle to Create Significant Motion in Text-To-Video Output

Animate Your Motion

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

While there is currently great excitement about the potential of diffusion-based text-to-video systems, all the major contenders, including recent outings such as Sora, suffer from the same critical shortcoming – they cannot convincingly create large movements in videos.

Click to play. Tracking the extent to which the ‘Sora lady’ actually moves in the video that wowed the internet. Source:

If we cast a green bounding box around the ‘Sora lady’ in the new system’s most viral clip, we can see the extent to which her movement is constrained – effectively, if we discount the background, she is walking in place and making very little actual movement; rather, it’s the inventive ‘cinematography’ that conveys a sense of reality and scope.

Likewise, last year, when Runway demonstrated their Gen1 text-to-video system, the web was similarly impressed at the extent to which plausible movement seemed to have been achieved, even though the video in question was using exactly the same cheap trick – a person largely glued into position, and the background alone demonstrating real movement (and at a speed too intense for anyone to be able to spot the inevitable ‘continuity errors’ that tend to occur in such renderings, several of which can be gleaned by the keen observer, in the video below):

Click to play. There’s rapid motion in the 2023 Runway Gen1 demo, but the focus of attention (the series of changing characters) is barely moving. Source:

Arguably, sophisticated sleight-of-hand such as this is intended to grow confidence in the source company, and to provide other social or financial leverage, by making it appear that such systems can do anything, and that the choices made for the videos were arbitrary instead of tightly confined by the limits of the technologies involved.

In-between Times

The truth is, most generative video systems can only create a minor iteration on the previous frame in the video, and concatenate these ‘nudged’ frames into a long video; a video which, inevitably, will feature no abrupt movement, generated as it is by a system that cannot retain identity and other central characteristics if the previous frame is too far, in content and style, from the next desired frame.

In traditional cel-based animation, the master artists tend to only draw ‘keyframes’, while the sub-animators are tasked with ‘in-betweening’ these frames; i.e., supplying the interstitial frames that cohere the two keyframes.

Key-frames drawn by a master animator – but many interstitial frames will need to be created in order to generate smooth motion, and this was traditionally the work of lower-ranked animators. Source:

The example above is from a respected Disney lead animator, with an abundance of keyframes, and not too much work for the junior ‘in-betweeners’. In less vaunted production companies, with lesser budgets, the in-betweeners might be required to fill in far more content than in the above example:

There is a big disparity between these two frames, and filling in the gaps requires a leap of imagination on the part of an inbetweener animator, as well as a cohesive inner conception and familiarity with the character.

This is the kind of thing that generative AI does not excel at.

In the case of old-school animation, the in-betweeners will have reference concept sketches and general familiarity with the character, and thus will have built up a reasonable domain knowledge of it, and be able to maintain identity and consistency when filling in content between the master frames.

For generative systems, the left-most frame in the image above is all they have to go on, and they may not be able to make such a large bridge and still end up at the ‘B’ keyframe above (the right-most image). Non-AI systems such as EbSynth can perform an automated version of this task, but the results effectively still constitute ‘traditional’ animation, and come with many caveats.

Likewise, optical flow-based interpretive systems can, to an extent, make similar use of keyframes – but once again, this methodology does not fit easily into current generative AI approaches.

The older ‘optical flow’ technology is used for generative AI purposes in a recent project, but abrupt or brusque movements are off the table. Source:

Adjunct systems for Stable Diffusion, such as LoRA, can retain generalized domain knowledge of a character in the same way that the ‘classic’ in-betweener once did, by retaining a consistent trained internal reference of the character’s appearance – though no current text-to-video system has yet succeeded in implementing this approach in a versatile and temporally consistent manner.

In Search of Sudden Moves

T2V systems that can really get the action going are rare in the current run of literature. One recent project used a box-based method to prescribe the areas in which target action is desired to move within the frame, and succeeded in escaping the merely ‘fidgety’ movement of the general run of such architectures:

The Boximator method allows for significant movement, a rarity in T2V systems. Source:

This box-based system allows the pipeline to move away from the strictures of the previous frame, and force in-betweening that is more radical.

Target boxes in Boximator may be in a radically different part of the frame. If the runtime of the video is only a few seconds, this will force brusque and extreme movement, which is not currently a native behavior in T2V frameworks. Source:
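
To make the idea of box-driven motion concrete, here is a minimal sketch (not code from Boximator or AYM) that linearly interpolates a bounding box from a start position to a distant target across a short clip. The function names `interpolate_boxes` and `center_shift_per_frame`, and the normalized `(x0, y0, x1, y1)` box format, are assumptions made purely for illustration.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1), normalized to [0, 1]


def interpolate_boxes(start: Box, end: Box, num_frames: int) -> List[Box]:
    """Linearly interpolate a box from `start` to `end` over `num_frames` frames."""
    if num_frames < 2:
        raise ValueError("need at least two frames (start and end)")
    boxes = []
    for i in range(num_frames):
        t = i / (num_frames - 1)  # 0.0 at the first frame, 1.0 at the last
        boxes.append(tuple((1 - t) * s + t * e for s, e in zip(start, end)))
    return boxes


def center_shift_per_frame(boxes: List[Box]) -> float:
    """Horizontal distance the box centre travels between the first two frames."""
    cx = [(b[0] + b[2]) / 2 for b in boxes]
    return abs(cx[1] - cx[0])


# A subject that must cross most of the frame in only eight frames.
trajectory = interpolate_boxes((0.0, 0.4, 0.2, 0.8), (0.8, 0.4, 1.0, 0.8), num_frames=8)
print(trajectory[0], trajectory[-1])
print(round(center_shift_per_frame(trajectory), 4))
```

The fewer frames available to cover the same distance, the larger the per-frame shift, which is why a distant target box combined with a short runtime forces large movement.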

More recently, a newer system has been proposed, called HDRFlow, which uses High Dynamic Range (HDR) loss together with a novel flow system in order to create more rapid movement.

Click to play. The sole video, to date, from the HDRFlow project, which demonstrates the kind of brusque movement that generative systems will need to portray authentic action. Source:

Though the project page has a space reserved for a ‘showcase’ video for the new system, and though the lead authors have said that the video will be put online soon*, it remains empty – a tantalizing prospect, if it should materialize.

Animate Your Motion

This brings us on to the latest attempt to get T2V moving – a project titled Animate Your Motion (AYM), from KU Leuven in Belgium.

Like the prior Boximator project, AYM uses a box-based method to illustrate target areas for movement, which, when short runtimes are dictated, necessitate rapid and notable movement.

Further examples of wider movement from AYM.

However, AYM also integrates two approaches which hitherto have only been attempted separately: semantic control (where the words in the text prompt are used to alter the outcome of the generation) and motion cues (the boxes illustrated in the video directly above).

Thus AYM is a multimodal system that appears to be the first of its kind. Since there are no prior analogous systems, the authors tested the new framework in an ablative and qualitative manner, and appear to have obtained some promising results for a potential new direction in text-to-video methodologies.

The new paper (which has an accompanying project page) is titled Animate Your Motion: Turning Still Images into Dynamic Videos, and comes from four researchers across the Department of Computer Science and the Department of Electrical Engineering at KU Leuven.


The new method introduces a Scene and Motion Conditional Model (SMCD) to manage the multimodal input, from text and box placement – an additional module that operates over a trained text-to-video diffusion model.

Conceptual architecture for the SMCD. Source:

The approach borrows a Motion Integration Module (MIM) from the GLIGEN generative system, as well as a Dual Image Integration Module (DIIM) consisting of a zero-convolutional layer, similar to the schema in the ControlNet adjunct module for Stable Diffusion.

The latter module is incorporated into the image features, and progressively refines the state of the image condition across each block in the UNet of the diffusion system.
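
The ‘zero-convolutional’ idea can be illustrated with a small NumPy sketch. This is not the authors' implementation: the class name `ZeroConv1x1`, the tensor shapes, and the random stand-in features are all assumptions. It shows only the general ControlNet-style principle that a convolution initialized to zero adds nothing at the start of training, so the pretrained features are undisturbed until the weights are learned.

```python
import numpy as np


class ZeroConv1x1:
    """A 1x1 convolution whose weights and bias start at zero (ControlNet-style)."""

    def __init__(self, in_channels: int, out_channels: int):
        self.weight = np.zeros((out_channels, in_channels))
        self.bias = np.zeros(out_channels)

    def __call__(self, x: np.ndarray) -> np.ndarray:
        # x has shape (channels, height, width); a 1x1 conv mixes channels per pixel.
        return np.einsum("oc,chw->ohw", self.weight, x) + self.bias[:, None, None]


rng = np.random.default_rng(0)
unet_features = rng.normal(size=(4, 8, 8))    # stand-in for one UNet block's features
image_condition = rng.normal(size=(4, 8, 8))  # stand-in for the encoded seed image

zero_conv = ZeroConv1x1(4, 4)
fused = unet_features + zero_conv(image_condition)

# At initialization the condition contributes nothing, so the pretrained
# features pass through unchanged; training then moves the weights off zero.
print(np.allclose(fused, unet_features))
```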

The paper states:

‘This dual approach allows SMCD to impose image conditions across the entire generation process, enhancing the coherence and consistency of the generated frames. Interestingly, we find that simultaneously training these two signal integration modules can lead to competitive interference, resulting in outputs of inferior quality.

‘To mitigate this, we propose a two-stage training strategy. We first train the motion integration module and then, with this module frozen, proceed to train the dual image integration module. This sequential approach prevents the competition between signals, leading to the generation of cleaner and more focused videos.’

The central objective of the system is to animate video from a still image using text conditioning and motion placement indicators in the form of boxes, similar to the Boximator scheme.

The base Latent Diffusion Model used in AYM is Stable Diffusion, with each frame projected from pixel space into latent space via a Variational Autoencoder (VAE). At the same time, the image caption and object labels are preprocessed with the CLIP textual encoder.

The SMCD framework is a development from Alibaba Group’s ModelScope project, adding object trajectories and an initial seed image frame as central pivotal conditioning.

The combination of motion indicators and text that powers the system.

In this configuration, the MIM is intended to precisely capture and orchestrate the trajectories of the objects in the generated video, while the DIIM preserves the semantic detail from the seed frame.

The GLIGEN-derived Motion Integration Module in the system freezes ModelScope’s pretrained UNet, and generates box locations indicating the motion to be generated in the video, through a multilayer perceptron (MLP) that encodes the bounding box coordinates together with their associated (text) labels.
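
As a rough illustration of how a box and its label might be fused into a single token, the sketch below follows the general GLIGEN recipe of Fourier-encoding the coordinates and passing them, together with a label embedding, through an MLP. Everything here is an assumption for demonstration: the dimensions, the random weights, the helper names `fourier_embed` and `mlp`, and the random vector standing in for a CLIP label embedding.

```python
import numpy as np


def fourier_embed(box: np.ndarray, num_freqs: int = 4) -> np.ndarray:
    """Encode (x0, y0, x1, y1) with sines and cosines at several frequencies."""
    freqs = 2.0 ** np.arange(num_freqs)             # 1, 2, 4, 8
    angles = box[:, None] * freqs[None, :] * np.pi  # shape (4, num_freqs)
    return np.concatenate([np.sin(angles), np.cos(angles)], axis=1).ravel()


def mlp(x: np.ndarray, w1, b1, w2, b2) -> np.ndarray:
    """Two-layer perceptron with a ReLU in between."""
    hidden = np.maximum(0.0, x @ w1 + b1)
    return hidden @ w2 + b2


rng = np.random.default_rng(0)
label_dim, hidden_dim, token_dim = 16, 32, 24

# Hypothetical stand-in for a text-encoder embedding of the label "hippopotamus".
label_embedding = rng.normal(size=label_dim)
box = np.array([0.1, 0.3, 0.5, 0.9])

box_features = fourier_embed(box)  # 4 coords * 4 freqs * (sin, cos) = 32 values
mlp_input = np.concatenate([label_embedding, box_features])

# Randomly initialized weights; in a real system these would be learned.
w1 = rng.normal(size=(mlp_input.size, hidden_dim)) * 0.1
b1 = np.zeros(hidden_dim)
w2 = rng.normal(size=(hidden_dim, token_dim)) * 0.1
b2 = np.zeros(token_dim)

grounding_token = mlp(mlp_input, w1, b1, w2, b2)
print(grounding_token.shape)
```

One such token would be produced per box per frame, giving the downstream attention layers a compact description of what should be where.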

After this, a gated self-attention layer is inserted into each block of the UNet, equidistant between the spatial attention layer and text-based cross-attention layer, enabling orchestrated self-attention across the concatenated output of the image and text location tokens.
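
A gated self-attention layer of this general kind can be sketched in a few lines of NumPy. This is a simplified single-head illustration in the spirit of GLIGEN, not the AYM code: the token counts, weight matrices, and the scalar `gate` are assumptions. The point it demonstrates is that with the gate at zero the layer is an identity mapping, so it can be inserted into a frozen network without initially changing its behavior.

```python
import numpy as np


def softmax(x: np.ndarray) -> np.ndarray:
    x = x - x.max(axis=-1, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=-1, keepdims=True)


def self_attention(tokens, wq, wk, wv):
    q, k, v = tokens @ wq, tokens @ wk, tokens @ wv
    scores = q @ k.T / np.sqrt(q.shape[-1])
    return softmax(scores) @ v


def gated_self_attention(visual, grounding, wq, wk, wv, gate):
    """visual: (N, d) image tokens; grounding: (M, d) box/label tokens."""
    joint = np.concatenate([visual, grounding], axis=0)
    attended = self_attention(joint, wq, wk, wv)
    # Keep only the outputs at the visual positions, then add them back
    # through a tanh gate so the layer starts as an identity mapping.
    return visual + np.tanh(gate) * attended[: visual.shape[0]]


rng = np.random.default_rng(0)
d = 8
visual = rng.normal(size=(16, d))
grounding = rng.normal(size=(2, d))
wq, wk, wv = (rng.normal(size=(d, d)) * 0.1 for _ in range(3))

untrained = gated_self_attention(visual, grounding, wq, wk, wv, gate=0.0)
trained = gated_self_attention(visual, grounding, wq, wk, wv, gate=1.0)
print(np.allclose(untrained, visual), np.allclose(trained, visual))
```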

Borrowing an innovation from the TrackDiffusion project, a convolutional neural network (CNN) is used to add parameters to the latent features obtained up to this point.

In both sequential training stages, classifier-free guidance (CFG) is a factor: dropout is used to randomly omit both the image and the box sequence, in a manner similar to Boximator. This allows the system to train on the boxes without burning them into the final output, so that it knows where the boxes are located, but is not obliged to reproduce them at inference time.

The paper states:

‘This strategy introduces an element of unpredictability that encourages the model to learn robust feature representations.’
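
The condition-dropout idea can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' training code: the function name `drop_conditions`, the dictionary layout of a sample, and the 10% drop probabilities are all placeholders (the article does not give the actual values).

```python
import random
from typing import Any, Dict, Optional


def drop_conditions(sample: Dict[str, Any], p_image: float, p_boxes: float,
                    rng: random.Random) -> Dict[str, Optional[Any]]:
    """Independently replace the image and/or box sequence with None (a 'null' condition)."""
    out = dict(sample)  # shallow copy, so the original sample is left untouched
    if rng.random() < p_image:
        out["image"] = None
    if rng.random() < p_boxes:
        out["boxes"] = None
    return out


rng = random.Random(0)
sample = {
    "text": "A hippopotamus that is walking",
    "image": "seed_frame",             # stand-in for the encoded first frame
    "boxes": [(0.1, 0.3, 0.5, 0.9)],   # stand-in for a per-frame box sequence
}

P_DROP_IMAGE, P_DROP_BOXES = 0.1, 0.1  # placeholder values, not from the paper
batch = [drop_conditions(sample, P_DROP_IMAGE, P_DROP_BOXES, rng) for _ in range(1000)]
dropped_images = sum(item["image"] is None for item in batch)
print(dropped_images, "of", len(batch), "samples had the image removed")
```

Because the model sometimes sees a sample with a condition missing, it learns both a conditional and an unconditional prediction, which is what classifier-free guidance relies on at inference time.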

During inference, a Gaussian noise element is sampled and run through a typical LDM denoising process to generate a simple video latent representation, before being transformed into RGB pixel output by the VAE decoder, with CFG applied during this stage.
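
For readers unfamiliar with CFG, the standard combination step at each denoising iteration can be written in one line. The sketch below uses plain Python lists as stand-ins for noise predictions; the function name and the guidance scale of 2.0 are arbitrary assumptions, and the article does not state the scale used in AYM.

```python
from typing import List


def classifier_free_guidance(eps_uncond: List[float], eps_cond: List[float],
                             scale: float) -> List[float]:
    """Push the prediction away from the unconditional output, towards the conditional one."""
    return [u + scale * (c - u) for u, c in zip(eps_uncond, eps_cond)]


# Toy noise predictions for a two-element 'latent'.
eps_uncond = [0.0, 1.0]  # prediction with the conditions dropped
eps_cond = [0.5, 2.0]    # prediction with text, image and boxes supplied

guided = classifier_free_guidance(eps_uncond, eps_cond, scale=2.0)
print(guided)
```

A scale of 1.0 reproduces the conditional prediction, while larger values exaggerate the difference that the conditions make.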

Data and Tests

Datasets used for the testing stages included the GOT10K collection and YTVIS 2021. Both these datasets feature sequences with bounding box annotations. The former contains 10,000 video segments designed to track real-world objects, and the latter consists of 2,985 training videos across 40 semantic classes.

In line with TrackDiffusion, the LLaVA model (also used by Boximator) is used to generate captions for each dataset.  

In the initial training phase, frames are extracted from both datasets, and data samples are created that include images and object boxes. The model is trained on these for 50,000 steps at a batch size of 32.

In the subsequent phase, the model is fine-tuned on the videos for a further 80,000 steps at a lower batch size of 8 (since a lower batch size is likely to obtain better detail, while a higher batch size is likely to help generalization).

The probabilities of excluding bounding boxes and/or images are set, and the AdamW optimizer is employed at a learning rate of 5e-5 (which is at the upper bound of standard learning rates in projects of this type).

All tests were conducted on a single NVIDIA A100 GPU (though the paper does not specify whether the A100 has 40GB or 80GB of VRAM). Training for each experiment took three days.

Videos were generated at a resolution of 256x256px, with eight frames created simultaneously during the process.

Metrics used to evaluate performance were Fréchet Video Distance (FVD), and CLIP similarity (CLIP-SIM), the latter applied to individual rendered video frames. To evaluate how close subsequent frames are to the seed frame, the DINO visual encoder was used.
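
The seed-frame comparison can be pictured as a cosine similarity between feature vectors. The sketch below is only an approximation of the idea: the short vectors stand in for real DINO embeddings, and averaging over all later frames is an assumption, since the article does not describe how the paper aggregates the per-frame scores.

```python
import math
from typing import List, Sequence


def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm


def first_frame_similarity(features: List[Sequence[float]]) -> float:
    """Mean cosine similarity between frame 0 and each later frame."""
    first, rest = features[0], features[1:]
    return sum(cosine_similarity(first, f) for f in rest) / len(rest)


# Toy features: the second frame matches the seed frame, the third has drifted completely.
frame_features = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
score = first_frame_similarity(frame_features)
print(score)
```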

Additional metrics used were Average Overlap (AO) and Success Rate (SR), implemented through Autoregressive Visual Tracking (ARTrack).
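
Both tracking metrics are built on intersection-over-union (IoU) between a target box and the box actually found in the generated frame. The following sketch uses hand-written boxes in place of tracker output; the function names are assumptions, and the 0.5 threshold corresponds to the SR50 figure quoted from the paper.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x0, y0, x1, y1)


def iou(a: Box, b: Box) -> float:
    """Intersection-over-union of two axis-aligned boxes."""
    ix0, iy0 = max(a[0], b[0]), max(a[1], b[1])
    ix1, iy1 = min(a[2], b[2]), min(a[3], b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter
    return inter / union if union > 0 else 0.0


def average_overlap(pred: List[Box], target: List[Box]) -> float:
    """Mean IoU across all frames."""
    return sum(iou(p, t) for p, t in zip(pred, target)) / len(target)


def success_rate(pred: List[Box], target: List[Box], threshold: float = 0.5) -> float:
    """Fraction of frames whose IoU exceeds the threshold."""
    hits = sum(iou(p, t) > threshold for p, t in zip(pred, target))
    return hits / len(target)


target_boxes = [(0.0, 0.0, 2.0, 2.0), (0.0, 0.0, 2.0, 2.0)]
tracked_boxes = [(0.0, 0.0, 2.0, 2.0), (1.0, 0.0, 3.0, 2.0)]  # second frame is half off-target

print(average_overlap(tracked_boxes, target_boxes))
print(success_rate(tracked_boxes, target_boxes))
```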

Since AYM is, to the best of the authors’ understanding, a truly novel approach, there were no analogous systems against which to test it directly, which forced a more ablative approach to quantitative and qualitative testing. The ModelScope (MS) and TrackDiffusion (TrackDiff) systems were therefore trialed as points of comparison, with evaluations conducted against GOT10K.

Comparison with existing methods.

Of these results, the authors state:

‘[SMCD] markedly surpasses both referenced methods in FVD, highlighting the benefits of integrating image conditions. Furthermore, SMCD not only significantly improves upon MS in terms of FFF_DINO scores but also achieves results on SR50 that are on par with those of TrackDiff. This underscores our model’s robust ability to effectively incorporate multiple input modalities.’

Next, the researchers tested the capabilities of different image integration methodologies, including zero convolutions (ZC), ControlNet, and gated cross-attention (GCA) – and note that SMCD itself adopts a combination of ZC and GCA. In the table below we see the results from evaluation metrics for video quality across the two tested datasets:

Tests for image integration methodologies.

Here the paper comments:

‘Notably, the scores for CLIP-SIM and FFF_DINO are consistently high across the various methods evaluated, nearing or even surpassing those of the ground-truth videos (GT-video).

‘However, the FVD scores exhibit significant discrepancies, underlining notable differences in video quality among the various approaches.’

A further test was made for grounding accuracy, i.e., evaluating the compatibility of the methods with the Motion Integrating Module, and how effective they are at generating objects that obey the target bounding boxes:

Comparison for grounding accuracy.

Here the paper observes:

‘Interestingly, employing a single CtrlNet module yields better FVD compared to both ZC and GCA individually. However, when ZC is combined with CtrlNet, the performance on FVD deteriorates for both datasets. This decline could be attributed to the methodological overlap where ZC and CtrlNet both enhance image conditions through zero-convolutional layers and feature addition.

‘Specifically, ZC integrates the conditional image with a noised image for the encoder, while CtrlNet introduces the encoded clean image to the decoder. This dual approach to processing the conditional image might lead to conflicts, adversely affecting the outcome.’

However, the researchers also observe that integrating ZC and GCA with their SMCD framework brings notable enhancements, including the best adherence to grounding accuracy.

For a qualitative round, the authors used the text prompt A hippopotamus that is walking:

Qualitative results across different models. Please refer to the source paper for better resolution and superior detail.

Here the paper notes that GCA focuses on high-level semantic consistency, but does not succeed in generating fine detail, while ControlNet produces excessively rapid movements, as the hippopotamus’s head shifts too quickly from left to right; at the same time, ZC produces semantic consistency, but at the cost of subject consistency, as the subject identity evolves, including color changes.

‘[Our] SMCD method (ZC+GCA) generates a video that not only meets the specified conditions but also maintains consistency across frames.’


Quite a lot of the thunder of the AYM approach was stolen recently by the prior work for Boximator. However, both approaches are using similar methodologies to break through the ‘rigid’ barrier of stilted movement that is currently plaguing T2V systems, including the use of LLaVA captioning, the dropping out of boxes during training (so that the target is preserved without rendering the boxes at inference time), and the employment of generally similar resources, such as ModelScope.

One factor retarding real progress in the area of ‘abrupt-movement’ T2V systems is the extent to which the ‘cheap tricks’ employed by more ‘static’ generation systems such as Sora and Runway (where richness of detail and select cinematography mask the systems’ inability to create brusque and significant movement) are capable of capturing the imagination of the public, along with the concomitant funding and investment. This makes the necessary extra effort of projects such as Boximator and AYM seem less significant and urgent.

* In an email between myself and the corresponding author, on March 7th 2024.
