Powering Generative Video With Arbitrary Video Sources

Illustration developed from 'AMP: Adversarial Motion Priors for Stylized Physics-Based Character Control' (https://xbpeng.github.io/projects/AMP/index.html)


How long has it been since you picked up a physical, printed encyclopedia to look up a topic, instead of searching for it on the internet?

The idea of date-bound collections of common knowledge, curated into single reference volumes, with all the biases and omissions that come with this approach, was already in a death spiral by the end of the twentieth century. These days, we search for information using our own parameters, in the form of search terms, to drill down to the apposite knowledge.

A new paper from China suggests that this paradigm could also be applied to the current practice of training thousands, or even millions, of video clips into motion priors in the latent space of text-to-video (T2V) models, at great expense of both money and time.

Videos created with the new ‘prior search’ system. Source: https://hrcheng98.github.io/Search_T2V/

Why not simply consider what you are trying to generate, find the video clip most likely to yield the best motion priors, and use that, instead of hoping that a pre-trained model already contains what you are looking for?

In effect, the prospect is one of automating what the professional and amateur VFX communities currently do to obtain accurate motion priors: find a clip with the desired motion (or shoot one personally), and use it in a T2V generative workflow, such as the AnimateDiff or ControlNet adjunct systems for Stable Diffusion.

It is common practice to employ user-selected videos as source motion priors for generative T2V transformations. The approach featured here uses the AnimateDiff framework in the ComfyUI environment. Source: https://www.youtube.com/watch?v=kJp8JzA2aVU

The new method uses diverse Natural Language Processing (NLP) techniques to pick apart the supplied user prompt, and to retrieve the most semantically likely candidate from an arbitrary video database. This process must also drill down beyond the action (e.g., ‘swimming’) to consider the actor (‘a dog swimming’ and ‘a man swimming’ will necessarily have anatomy-specific motion priors).

The system, quite impressively, can operate on an NVIDIA RTX 4090 GPU with 24GB of VRAM, which is high-end in the consumer space, but fairly undemanding compared to the resources usually cited in literature along these lines.

Tested against prior approaches, the new system scored best in a user study and performed competitively in qualitative comparisons against the older methods, with samples available at the project website.

CLICK TO PLAY. Examples of the new system (left) compared to prior approaches, the latter being limited to the motion priors trained into their models.

The paper states:

‘By integrating dynamic information from real videos as motion priors into a pre-trained T2V model, our method achieves high-quality video synthesis with precise motion characteristics at low costs. This approach alleviates the issues of current video generation models and offers a cost-effective solution by leveraging the abundance of real-world video resources available on the internet.

‘Our method offers the potential to popularize text-to-video generation by bridging the realism gap between synthetic and real-world motion dynamics, paving the way for applications in various domains such as entertainment, education, and virtual environments.’

The new paper is titled Searching Priors Makes Text-to-Video Synthesis Better, and comes from eight researchers across the State Key Lab of CAD&CG and the College of Software at Zhejiang University, FABU Inc., and Tencent Data Platform.

Method

Though the WebVid-10M dataset is used for the purposes of the paper, the principles of the system are data-agnostic. The core idea is essentially to create a search engine that can retrieve an appropriate clip from an ad hoc video collection that has reasonable semantic information available (the same criterion that applies to any search index, such as Google Search, which considers content alongside multiple other associated parameters).

Conceptual schema for the approach.

The user’s text prompt must therefore be semantically interpreted in two stages. The first of these is text vectorization, which uses pre-trained models from spaCy and Sentence Transformers to analyze the word vectors of the prompt and of the video labels (captions, annotations, or other metadata for the videos in the dataset).
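As a rough sketch of this stage, the prompt and every candidate caption can be embedded once with an off-the-shelf Sentence Transformers checkpoint; the ‘all-MiniLM-L6-v2’ model and the toy captions below are illustrative assumptions, not details taken from the paper.

```python
# A minimal sketch of text vectorization, assuming the widely-used
# 'all-MiniLM-L6-v2' Sentence Transformers checkpoint; the paper does not
# commit to this particular model.
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer('all-MiniLM-L6-v2')

prompt = 'a dog swimming in a lake'
captions = [
    'a man swimming laps in a pool',
    'a golden retriever paddling across a pond',
    'an excavator digging a hole',
]

# Encode the prompt and every caption into dense, unit-length vectors up front.
prompt_vec = encoder.encode(prompt, normalize_embeddings=True)
caption_vecs = encoder.encode(captions, normalize_embeddings=True)
```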

With the text vectors established, atomized semantic vectors are then derived, which decompose the prompt and the video annotations into components, seeking to isolate actors, actions, context, and any other factor that would make one particular video clip suitable for the task implied by the parsed user prompt.

The original text is parsed into a dependency parse tree, in order to put the words into rational context, with verbs used as cues to infer the relevant subject and object words.
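A minimal sketch of this kind of verb-centred decomposition, using spaCy’s dependency parse, might look as follows; the rule set is illustrative rather than the authors’ exact implementation.

```python
# Sketch of verb-centred decomposition with spaCy's dependency parse; the
# dependency labels used here are illustrative, not the paper's exact rules.
# Requires: python -m spacy download en_core_web_sm
import spacy

nlp = spacy.load('en_core_web_sm')

def atomize(text: str) -> list[dict]:
    """Return (action, actors, objects) units found in the text."""
    doc = nlp(text)
    units = []
    for token in doc:
        if token.pos_ == 'VERB':
            actors = [c.text for c in token.children if c.dep_ in ('nsubj', 'nsubjpass')]
            objects = [c.text for c in token.children if c.dep_ in ('dobj', 'obj')]
            units.append({'action': token.lemma_, 'actors': actors, 'objects': objects})
    return units

print(atomize('a man digs a hole in the garden'))
# e.g. [{'action': 'dig', 'actors': ['man'], 'objects': ['hole']}]
```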

Next comes semantic matching, where the obtained semantic vectors are compared to all texts in the dataset. ‘Our goal,’ the authors state, ‘is to find a text-video [pair] as a reference, which maintains the most similar semantic and dynamic information with the input.’

This process also consists of two steps: coarse filtering and ranking.

For coarse filtering, the cosine similarity is estimated between the semantic vector of the input and each of the semantic vectors of the target dataset, and most of the possible matches are discarded against a minimum similarity threshold.
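Continuing the earlier embedding sketch, coarse filtering can be pictured as a thresholded cosine similarity over the pre-computed caption vectors; the 0.5 cut-off is an arbitrary illustrative value, not a figure from the paper.

```python
# Coarse filtering sketch: keep only captions whose cosine similarity with the
# prompt clears a minimum threshold. The 0.5 cut-off is illustrative only.
# Uses prompt_vec, caption_vecs and captions from the earlier embedding sketch.
import numpy as np

def coarse_filter(prompt_vec, caption_vecs, captions, threshold=0.5):
    # The embeddings were normalized, so a dot product is the cosine similarity.
    sims = caption_vecs @ prompt_vec
    return [(captions[i], float(sims[i])) for i in np.flatnonzero(sims >= threshold)]

shortlist = coarse_filter(prompt_vec, caption_vecs, captions)
```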

In the ranking phase, the optimal text/video pairing is established via a ranking score, which evaluates the motion semantic similarity between the text (of the dataset) and the prompt.

The authors note:

‘It is worth mentioning that in many instances of motion information, the action is strongly coupled with the object of the action (for example, “excavator digging a hole” and “mole digging a hole” correspond to two completely different sets of dynamic information). Therefore, when calculating the [similarity], we take the actor similarity into consideration.’
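A hedged sketch of how action and actor similarity might be folded into a single ranking score follows; the equal weighting and the structure of the candidate records are assumptions made for illustration, not the paper’s published formulation.

```python
# Ranking sketch: combine the similarity of the action phrases with the
# similarity of the actors, per the quoted observation above. The 0.5/0.5
# weighting and the candidate record format are illustrative assumptions.
def rank_candidates(prompt_unit, candidates, encoder, w_action=0.5, w_actor=0.5):
    """prompt_unit / candidates: dicts with 'action', 'actor' and 'caption' keys."""
    p_action = encoder.encode(prompt_unit['action'], normalize_embeddings=True)
    p_actor = encoder.encode(prompt_unit['actor'], normalize_embeddings=True)
    scored = []
    for cand in candidates:
        c_action = encoder.encode(cand['action'], normalize_embeddings=True)
        c_actor = encoder.encode(cand['actor'], normalize_embeddings=True)
        score = (w_action * float(p_action @ c_action)
                 + w_actor * float(p_actor @ c_actor))
        scored.append((cand['caption'], score))
    return sorted(scored, key=lambda pair: pair[1], reverse=True)

# e.g. rank_candidates({'action': 'digging a hole', 'actor': 'excavator'},
#                      candidate_units, encoder)   # candidate_units is hypothetical
```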

Next, motion extraction is performed, in which pivotal and apposite keyframes must be selected from the video retrieved by the NLP processing stages.

It has to be considered that the best-suited section of a video may occur at any point in the source data, and that part of the job of this process is to select only the actions that suit, discarding the surrounding video context. This involves obtaining a close match between the semantic units extracted from the prompt and from the existing annotations.

Temporal Attention Adaptation, adapted from the 2023 VMC project, is used to extract the relevant frames, which are then used to fine-tune the generative model – in this case the 2023 Show-1 system from the National University of Singapore, a Denoising Diffusion Probabilistic Model (DDPM) essentially similar in concept and functionality to Stable Diffusion.
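In the spirit of VMC’s approach, the fine-tuning stage can be restricted to the temporal attention layers of the video diffusion backbone while everything else stays frozen; the name-matching heuristic below (‘temp_attn’/‘temporal’) is a placeholder, since layer naming varies across Show-1 and similar implementations.

```python
# Sketch of temporal-attention-only fine-tuning, in the spirit of VMC: freeze
# the denoising network, then unfreeze only the parameters whose names mark
# them as temporal attention. The 'temp_attn'/'temporal' substrings are
# placeholders; real layer names depend on the implementation in use.
import torch

def prepare_motion_finetune(unet: torch.nn.Module, lr: float = 1e-5):
    for param in unet.parameters():
        param.requires_grad_(False)

    trainable = [p for name, p in unet.named_parameters()
                 if 'temp_attn' in name or 'temporal' in name.lower()]
    for p in trainable:
        p.requires_grad_(True)

    # Only the temporal attention weights receive gradient updates.
    return torch.optim.AdamW(trainable, lr=lr)
```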

Click to play. Examples from the 2023 Show-1 project, used as a rendering engine for the new paper. Source: https://showlab.github.io/Show-1/

Data and Tests

To test the system, the WebVid-10M dataset was used, featuring ten million web-sourced and annotated clips. Besides Show-1 as the rendering engine, the Video Motion Customization (VMC) framework was used for distillation, turning the residual vectors between consecutive frames into motion vectors for guidance purposes.
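The residual vectors that VMC turns into motion guidance can be pictured as simple frame-to-frame differences in latent space; the tensor layout and the normalization step below are assumptions for illustration, not the project’s exact loss.

```python
# Sketch of residual motion vectors between consecutive frames, in the spirit
# of VMC's motion distillation: differences of adjacent latent frames serve as
# a coarse motion descriptor for guidance. The (B, F, C, H, W) layout and the
# normalization are assumptions made for illustration.
import torch

def motion_vectors(latents: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """latents: (batch, frames, channels, height, width) -> (batch, frames-1, C, H, W)."""
    residuals = latents[:, 1:] - latents[:, :-1]
    # Normalize each residual so that guidance compares the direction of
    # motion rather than its raw magnitude.
    norms = residuals.flatten(2).norm(dim=2, keepdim=True).unsqueeze(-1).unsqueeze(-1)
    return residuals / (norms + eps)
```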

Prior approaches trialed were Meta AI’s 2022 offering Make-a-Video; the Preserve Your Own Correlation (PYoCo) framework, a 2024 collaboration between NVIDIA and the University of Maryland; and the 2023 NVIDIA-contributed collaboration Video LDM. Comparison results were also obtained against CogVideo, Show-1, ZeroScope_v2, and AnimateDiff-Lightning – all using officially released model distributions.

Below are results from the first group of comparisons:

A qualitative comparison with state of the art prior frameworks. Please refer to the source paper for better resolution.

Though we are not able to show all the examples related to this section, we refer the viewer to the project site for further video comparisons, and include some selections below:

Click to play. A couple of the qualitative examples from the accompanying project site.

Here the authors comment*:

‘Compared to Video LDM and Make-a-Video, our method generates more temporally coherent motion. Compared to PyoCo, ours generates more detailed and realistic frames.’

From the latter category, a selection from the paper’s static examples:

Qualitative comparison with existing SOTA models CogVideo (CV), AnimateDiff-Lightning (ADL), ZeroScope_v2 (ZS), and Show-1 (S1).

The project site features two video comparisons for the second category, one of which is featured below:

Click to play. From the project site, comparisons against four prior frameworks.

Here the paper states*:

‘Compared to CV (abbr. for CogVideo), ZS (abbr. for ZeroScope_v2), and S1 (abbr. for Show-1), our method generates more realistic appearances and more vivid motions.

‘For ADL (abbr. for AnimateDiff-Lightning), although it produces more detailed and realistic images, it fails to accurately represent the motion information required by the text (e.g., the dog is barely running, and the little girl’s hand is almost static). In contrast, our method generates videos with greater motion range and more dynamic realism.’

A user study was conducted with the aid of thirty participants, assessing three aspects of the generated samples: visual quality (VQ); motion quality (MQ); and video-text alignment (VTA) – the extent to which the generated video accords with the perceived intent of its text prompt.

As we can see from the table below, the authors’ method was favored in this study:

Results from the user study.

Though we do not usually cover ablation studies, those conducted for this paper are unusually pertinent, since the researchers reduced the number of videos available in the dataset, confirming that the full method obtains the best results:

Results from the ablation study show that reducing available dataset size affects ranking scores.

The authors comment:

‘It can be observed that as the dataset size increases, the results obtained by the search become more ideal. However, the dataset size exhibits marginal effects. When the size of the original dataset reaches 50% (i.e., 5 million text-video pairs), the search results approach the peak.’

Conclusion

The authors concede that the method has some limitations, predictably regarding potential semantic ambiguities, as well as cases where the keyframe extraction process misses ‘broader dynamics’ in the evaluated motion.

They comment:

‘[Our] text-based retrieval method inherently suffers from the limitations of textual meaning. At times, two vastly different texts can describe the same dynamic scene. For instance, “a car speeding over a viaduct” and “bustling traffic in a thriving city” might both refer to the same visual portrayal. This poses difficulties in text-based matching.

‘Moreover, the decoupling of motion and appearance from a semantic perspective is not always feasible. Take the action “dig” as an example; the motion of a person digging differs significantly from that of a groundhog. Consequently, the coupling effect of motion and appearance during the search process can compromise the quality of priors.’

It should additionally be considered that the WebVid-10M dataset is unusually well-annotated, and that many of the challenges inherent in expanding the pool of potential videos (particularly to a live, web-based framework) are concerned with the quality of available captions.

This means that users of the system may have to restrict themselves to WebVid-10M, engage in bringing other collections up to this standard, or develop interstitial annotation augmentation approaches that could bring better captions to ad hoc videos in the wild.

Nonetheless, in framing the sourcing of motion priors as an arbitrary, search-driven process, instead of developing endless and expensive bespoke models devoted to subsets of motion priors (models designed specifically for people, animals, or cars, for instance), the central proposition of the work may present a worthwhile challenge to the research community.

* Inline citations omitted.
