A Text-To-Video Method That Actually Generates Some Action

Boximator examples


Although the last five or so months have brought forth a plethora of new academic papers offering novel ways to produce purely generative AI video, we have limited our coverage of these – because, frankly, the research scene didn’t seem to be getting anywhere substantial with the challenge.

This rash of academic interest has been characterized by associated project pages stuffed with multiple examples of fairly repetitive videos, so short (generally in the range of 2-3 seconds’ length) that they offer little immediate potential beyond meme-creation.

More importantly, most of the associated example videos demonstrate very little substantial movement, with the first frame usually quite similar to the final frame, and not a hell of a lot happening in between.

Examples from the January 2024 Latte system, typical of the run of T2V frameworks lately, with very short and repetitive clips demonstrating little change between the A and B state. Source: https://maxin-cn.github.io/latte_project/

However, sample videos from a new project called Boximator have caught our attention rather more; not least because the authors of the new work offer video examples that directly demonstrate the shortcomings of other recent approaches, and show that their own method offers a little more actual…well, action.

In a series of videos that compare text-prompted output from Pika and Runway’s Gen-2 product (with the Boximator samples built on the PixelDance model), Boximator seems able to produce footage that fully reflects the prompt, instead of giving up at the halfway mark, or setting its sights low, as many recent efforts have done:

In this example comparison against two prior text-to-video systems, including the prestigious Runway Gen-2, the text-prompt is ‘A cute 3D boy is standing and then walking’ – but only Boximator seems able to get the little fellow actually moving. Please refer to the project page for better resolution. Source: https://boximator.github.io/

In another example, Boximator is able to accurately render the generative text-prompt ‘A handsome man is taking out a rose from his pocket with his right hand and looking at the rose’, while the man depicted in the other two versions merely looks a little ‘twitchy’:

Though Pika 1.0 is able to get our dapper friend to at least blink a bit to show that he’s ‘alive’, only the Boximator approach can get him to produce the rose as requested. Please refer to the project page for better resolution.

Sometimes the failure to obey the prompt is not quite as clear. In the example below, the authors note, the prompt is ‘Adding wine to a glass’ – but only Boximator actually adds wine to the glass, while the more stylish Gen-2 merely swills it around a little, with the level never rising. The simulated physics of Gen-2 are impressive, but the central command is not obeyed:

Here Pika cannot even get the wine moving; Gen-2 provides elegant slow-motion, but the wine never actually accrues; whereas Boximator is able to actually fill up the glass.

These improved results come about because the new system, which can operate as a plugin for existing generation systems, uses a box-object correlation scheme to identify the objects in a frame that should change over time. Its inner training schema is not dependent on text, as most recent efforts are; instead, it learns to associate subjects with box-based regions rather than with purely semantic context.

In this way, users can specify not only target boxes, but also motion paths, and Boximator is able to generate substantial movements from these instructions:

Both of these examples are by Boximator, and demonstrate the soft/hard box system in action. By distinguishing ‘target’ or associated areas for change, Boximator can generate real movement – also because it is trained on only the most dynamic video-clips from a reference video dataset.

Since it is designed as drop-in functionality for existing text-to-video systems, there is an evident text component within the example videos – however, as mentioned, the internal training of the new system is not text-dependent, and appears the more robust for it. In fact, the project is inspired by the Gligen text-to-image system, but without Gligen’s internal dependency on semantic association.

Gligen's box-based system inspired Boximator, but Gligen's own training remains dependent on semantic instructions.

As many similar projects have done, Boximator makes use of the popular WebVid-10M video-clip dataset. However, one of the reasons for the shortcomings of other frameworks that have used this dataset, according to the new paper, is that most of the brief clips in the WebVid-10M collection don’t really feature much movement either.

The paper states:

‘Through empirical analysis, we find that a vast majority of WebVid videos do not exhibit substantial object or camera movements. Consequently, sampling from this collection would be inefficient for training our motion control module.

‘To address this issue, we curated a more dynamic subset from WebVid. This involved evaluating every clip in the dataset, comparing their start and end frames, and retaining only those clips where the two frames are sufficiently different. This filtration yielded a total of 1.1M video clips.’
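
The paper does not disclose the precise difference measure used to compare start and end frames; the sketch below is a minimal, hypothetical version of such a filter, using a mean absolute grayscale pixel difference with a purely illustrative threshold.

```python
import cv2
import numpy as np

def is_dynamic_clip(video_path: str, diff_threshold: float = 30.0) -> bool:
    """Keep a clip only if its first and last frames differ by more than a
    mean absolute grayscale pixel difference (threshold is illustrative)."""
    cap = cv2.VideoCapture(video_path)
    n_frames = int(cap.get(cv2.CAP_PROP_FRAME_COUNT))
    ok_first, first = cap.read()                           # first frame
    cap.set(cv2.CAP_PROP_POS_FRAMES, max(n_frames - 1, 0))
    ok_last, last = cap.read()                             # last frame
    cap.release()
    if not (ok_first and ok_last):
        return False                                       # unreadable clip: discard
    first = cv2.cvtColor(first, cv2.COLOR_BGR2GRAY).astype(np.float32)
    last = cv2.cvtColor(last, cv2.COLOR_BGR2GRAY).astype(np.float32)
    return float(np.abs(first - last).mean()) > diff_threshold
```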

Though we cannot reproduce all 47 videos produced to showcase the new project here, they can be seen at the project page itself, and in the accompanying YouTube video (also embedded at the end of this article). While the clips produced are just as short as those of other recently proposed text-to-video solutions, at least Boximator generates significant movement within its output.

The new paper is called Boximator: Generating Rich and Controllable Motions for Video Synthesis, and comes from seven equal contributors at ByteDance research.

Method

As the accompanying video (embedded at end of article) explains, position, trajectory, size and shape are not easy to express in text:

Because of this, a purely text-based approach is unlikely to resolve the problem (though there are many current projects seeking to use Large Language Models [LLMs] to find ‘magic solutions’ for better prompts, in this regard).

To begin to address the issue, Boximator first freezes the original model, training only the new adjunct motion control structure. This leaves the original model’s latent space and embeddings completely untouched, as opposed to the havoc that can be wrought on the original quality of a trained model through fine-tuning it directly.
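
In PyTorch terms (and only as a generic sketch, not the authors’ actual code), the freeze-and-train pattern looks something like this:

```python
import torch.nn as nn

def prepare_for_control_training(base_model: nn.Module, control_module: nn.Module):
    """Freeze every weight of the pretrained video model, leaving only the
    adjunct motion-control module trainable."""
    for p in base_model.parameters():
        p.requires_grad = False       # base model's latent space stays untouched
    base_model.eval()                 # no dropout / norm-statistic updates
    # Only these parameters are handed to the optimizer:
    return [p for p in control_module.parameters() if p.requires_grad]
```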

Conceptual architecture for Boximator.

A novel self-attention layer operating over the visual tokens within a frame is added to the standard architecture of a text-to-video model (middle-left in the image above). The constraints of the aforementioned boxes and the embeddings of the text prompt are handled in this new layer.
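
The paper’s exact layer design is not reproduced here; the following is a loose, hypothetical sketch of a GLIGEN-style gated self-attention block, in which visual tokens attend jointly over themselves and over box/text control tokens, with a near-zero initial gate so that the frozen base model’s behaviour is preserved early in training. All names and dimensions are illustrative.

```python
import torch
import torch.nn as nn

class ControlSelfAttention(nn.Module):
    """Hypothetical control layer: visual tokens attend over a concatenation of
    visual and control (box + text) tokens; the result re-enters the network
    through a learnable gate initialized at zero."""
    def __init__(self, dim: int, n_heads: int = 8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.gate = nn.Parameter(torch.zeros(1))   # gated residual, GLIGEN-style

    def forward(self, visual_tokens, control_tokens):
        # visual_tokens: (B, N_vis, dim); control_tokens: (B, N_ctrl, dim)
        tokens = torch.cat([visual_tokens, control_tokens], dim=1)
        out, _ = self.attn(tokens, tokens, tokens)
        # only the visual positions receive the gated update
        return visual_tokens + torch.tanh(self.gate) * out[:, : visual_tokens.shape[1]]
```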

For each remaining clip in the authors’ slimmed-down subset of WebVid-10M, the first frame was used to generate an image description, with the use of the LLaVA language model.

An example of the boundary-based captioning made possible by the LLaVA language model. Source: https://arxiv.org/pdf/2304.08485.pdf

Broadly representative ‘noun chunks’ were then obtained from these verbose descriptions, such as ‘white shirt’ and ‘young man’. These were subsequently fed to the Grounding DINO grounding model and the decoupled video segmentation approach (DEVA) object tracker, which allowed bounding boxes to be added to the entire length of the video in question. In this way, bounding boxes for 2.4 million objects were estimated.
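
The paper is not quoted here on the specific noun-chunking tool; one common way to pull such phrases out of a verbose caption is spaCy’s dependency parse, sketched below (the model name and chunk limit are assumptions).

```python
import spacy

nlp = spacy.load("en_core_web_sm")   # small English pipeline, assumed installed

def noun_chunks(caption: str, max_chunks: int = 8) -> list[str]:
    """Extract short noun phrases ('white shirt', 'young man') from a verbose
    LLaVA-style caption, ready to be handed to a grounding model."""
    doc = nlp(caption)
    return [chunk.text.strip() for chunk in doc.noun_chunks][:max_chunks]

print(noun_chunks("A young man in a white shirt is pouring wine into a glass."))
# e.g. ['A young man', 'a white shirt', 'wine', 'a glass']
```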

During training, a random crop of each video, conforming to a specified aspect ratio, was made.

The paper states:

‘If a bounding box [falls] entirely outside the cropped area, then we project it as line segments along the border of the crop.

‘This allows users to control object movements both into and out of the frame by drawing line segments on the frame’s border.’
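
Beyond the quoted description, the exact projection rule is not spelled out; a minimal clamping implementation consistent with it might look like this (the coordinate convention is an assumption).

```python
def project_box_into_crop(box, crop):
    """Project an absolute-coordinate box (x0, y0, x1, y1) into a crop region.
    A box lying entirely outside the crop collapses onto the crop border as a
    (possibly degenerate) line segment, representing an off-screen object."""
    bx0, by0, bx1, by1 = box
    cx0, cy0, cx1, cy1 = crop
    # clamp each corner to the crop, then shift into crop-local coordinates
    x0 = min(max(bx0, cx0), cx1) - cx0
    y0 = min(max(by0, cy0), cy1) - cy0
    x1 = min(max(bx1, cx0), cx1) - cx0
    y1 = min(max(by1, cy0), cy1) - cy0
    return (x0, y0, x1, y1)   # zero width/height if the box was fully outside

# A box entirely to the left of a (100, 100)-(356, 356) crop becomes a
# vertical segment on the crop's left border:
print(project_box_into_crop((10, 120, 60, 200), (100, 100, 356, 356)))
# (0, 20, 0, 100)
```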

In the training data, all the bounding boxes are projected into the cropped target region, representing an aspect ratio. The smaller (yellow) box is 'hard', and a likely target of user-initiated change; the larger encompassing red box is 'soft', and can be any size, up to and including the entirety of the frame. The encompassing white box is the target aspect ratio itself.

The two types of boxes considered are hard boxes (which represent target areas, or objectives for generation – the yellow box in the image above), which will likely be of a discrete size within the frame; and soft boxes, which may represent a discrete area or even the entire frame (the red box in the image above).

When both the soft and hard box are very large (even to the complete extent of the frame), manipulations such as the entire ‘panning’ of a scene can be accomplished:

In another comparison against rival tested systems, Boximator (leftmost) is able to completely obey the prompt ‘Camera rotate in a bedroom, showing a big landscape painting on the wall’, while the two prior frameworks are unable to comply.

Training is undertaken using frames which actually contain the rendered bounding boxes, so that the colored delineations of the boxes are seen and assimilated by the training process:

Generated images in which both a horse and its associated bounding box are rendered directly into an image. They move together as the frames progress, and the system learns this 'box association' in this way.

Later on, the system is further trained on generations that do not contain visual representations of these boxes – but by that stage, the model has already assimilated this association, and can offer it to end-users as a built-in control mechanism.

The authors assert:

‘Upon completing the self-tracking training phase, we proceed to further train the model using the same dataset, but excluding bounding boxes from the target frames.

‘Remarkably, the model quickly [learns] to cease generating visible bounding boxes, but its box alignment ability persists.

‘This indicates that the self-tracking phase assists the model to develop an appropriate internal representation.’
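
As a rough illustration of the self-tracking idea (not the authors’ pipeline), drawing the per-object boxes directly onto the training frames could be as simple as the following, with the colour palette purely illustrative.

```python
import cv2

# One illustrative colour per tracked object ID (BGR order for OpenCV).
PALETTE = [(0, 0, 255), (0, 255, 255), (255, 0, 0), (0, 255, 0)]

def render_boxes(frame, boxes: dict):
    """Draw each object's bounding box onto the frame, so that during the
    self-tracking phase the model learns to generate the coloured box along
    with the object it encloses."""
    for obj_id, (x0, y0, x1, y1) in boxes.items():
        colour = PALETTE[obj_id % len(PALETTE)]
        cv2.rectangle(frame, (int(x0), int(y0)), (int(x1), int(y1)), colour, 2)
    return frame
```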

The training runs in three phases: in the first, all boxes are defined as hard boxes, since these are easier for the model to learn. A second phase redefines 80% of these hard boxes as soft boxes, which are obtained by randomly expanding the prior hard boxes in four directions.

In the final phase, the second stage is continued without the use of the self-attention layer mentioned earlier, so that the host model can learn the imposed box distinctions on its own terms, and will not require a fundamental revision in architecture.
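
The random expansion that turns a hard box into a soft box can be pictured with the simple sketch below; the expansion limit is an assumption rather than a value reported in the paper.

```python
import random

def soften_box(hard_box, frame_w: int, frame_h: int, max_expand: float = 0.3):
    """Expand each of the four sides of a hard box outward by an independent
    random amount (up to a fraction of the frame size), clamped to the frame,
    yielding a soft box that may grow as large as the whole frame."""
    x0, y0, x1, y1 = hard_box
    x0 = max(0.0, x0 - random.uniform(0, max_expand) * frame_w)
    y0 = max(0.0, y0 - random.uniform(0, max_expand) * frame_h)
    x1 = min(float(frame_w), x1 + random.uniform(0, max_expand) * frame_w)
    y1 = min(float(frame_h), y1 + random.uniform(0, max_expand) * frame_h)
    return (x0, y0, x1, y1)
```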

At inference time, when the trained model is actually being used, only a small selection of frames (i.e., the first and last frames) will contain user-defined boxes, or ‘areas of interest’.

The user can also impose a motion direction (see the earlier embedded video containing an image of the man drinking coffee, on the left).

Soft boxes at inference time. The native soft boxes are interpolated and 'relaxed' in accordance to the nearest coordinates dictated by the boxes that the user drew (upper row), or by a user-specified box in combination with a user-defined motion path (lower row).

The authors explain:

‘In cases where a user draws a hard box in a frame and defines a motion path for it, we let the box to slide along the path to construct interpolated boxes for each subsequent frame, then relax them to form soft box constraints.’
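
As a hypothetical sketch of that inference-time behaviour, sliding a hard box along a user path and relaxing the result into soft boxes might look as follows; the straight-line interpolation and the `slack` padding factor are assumptions, not the paper’s exact formulation.

```python
def boxes_along_path(hard_box, path, n_frames: int, slack: float = 0.1):
    """Slide a user-drawn hard box along a motion path (simplified here to a
    straight line between the path's endpoints), producing one box per frame,
    then 'relax' each into a soft box by padding its sides."""
    x0, y0, x1, y1 = hard_box
    w, h = x1 - x0, y1 - y0
    (sx, sy), (ex, ey) = path[0], path[-1]
    soft_boxes = []
    for i in range(n_frames):
        t = i / max(n_frames - 1, 1)                     # 0 -> 1 across the clip
        cx, cy = sx + t * (ex - sx), sy + t * (ey - sy)  # interpolated box centre
        box = (cx - w / 2, cy - h / 2, cx + w / 2, cy + h / 2)
        soft_boxes.append((box[0] - slack * w, box[1] - slack * h,
                           box[2] + slack * w, box[3] + slack * h))
    return soft_boxes
```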

Data and Tests

To test the system, the researchers used the aforementioned PixelDance model, and ModelScope. The tests used box constraints, text prompts, and, sometimes, the first frame of a video as an input condition.

Though ModelScope does not support the latter approach, equivalent circumstances were approximated in tests by substituting the ground-truth frame’s latent embeddings at each denoising step.

The models were trained on sequences containing 16 frames, at a resolution of 256x256px, for output that runs at 4FPS. The maximum number of objects generated was eight, and the training used the Adam optimizer.

A batch size of 16 was used over 16 NVIDIA Tesla A100 GPUs*, across the aforementioned three stages; the first two with 50,000 iterations, and the third stage at 10,000 iterations. A learning rate of 2×10−4 was used for the first two stages, and 3×10−5 for the third stage.

All stages used a linear learning rate scheduler (a schedule that steadily changes the learning rate over the course of training, rather than holding it fixed).
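
One possible PyTorch restatement of that schedule, with the control module stood in for by a placeholder layer and the scheduler’s end factor chosen arbitrarily, is:

```python
import torch

control_module = torch.nn.Linear(4, 4)   # placeholder for the motion-control module
optimizer = torch.optim.Adam(control_module.parameters(), lr=2e-4)   # stages 1-2 (3e-5 for stage 3)
scheduler = torch.optim.lr_scheduler.LinearLR(
    optimizer, start_factor=1.0, end_factor=0.1, total_iters=50_000)  # end_factor is illustrative
# In the training loop, after each optimizer.step(), call scheduler.step()
# so that the learning rate changes linearly across the stage's iterations.
```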

Box coordinates were encoded with Fourier embeddings, in accordance with the standard method for Neural Radiance Fields (NeRF).
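
The NeRF-style encoding can be sketched as below, with the number of frequency bands an assumption rather than the paper’s value; each normalized box coordinate is expanded into sines and cosines at geometrically spaced frequencies.

```python
import numpy as np

def fourier_embed(coords, n_freqs: int = 8):
    """Map normalized box coordinates in [0, 1] to a NeRF-style Fourier
    (positional) encoding: sin and cos at frequencies 2^k * pi."""
    coords = np.asarray(coords, dtype=np.float32)       # e.g. (x0, y0, x1, y1)
    freqs = (2.0 ** np.arange(n_freqs)) * np.pi         # (n_freqs,)
    angles = coords[..., None] * freqs                  # (..., 4, n_freqs)
    emb = np.concatenate([np.sin(angles), np.cos(angles)], axis=-1)
    return emb.reshape(*coords.shape[:-1], -1)

print(fourier_embed([0.25, 0.10, 0.60, 0.45]).shape)    # (64,) = 4 coords x 2 x 8 freqs
```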

For the generation of images, the DDIM algorithm was used (one of the most popular sampling methods in Stable Diffusion).

Rival Systems

Boximator was tested against rival systems on Microsoft’s MSR-VTT dataset, the ActivityNet benchmark, and UCF101. Various equivalence compensations were necessary to impose uniform conditions across these frameworks – not all of which natively supported all the testing criteria.

For zero-shot tests, in addition to PixelDance and ModelScope, further models tested were MagicVideo; LVDM; Show-1; Phenaki; and FACTOR-traj.

Evaluation Metrics

Evaluation metrics were Fréchet Video Distance (FVD) and the measurement of text-conformity via a CLIP similarity score, derived from the GODIVA project. Average Precision (AP) was used for the evaluation of motion control, in accordance with the protocol for MS COCO.

The first and last frames of the generated videos were used as ground truth, and the tracking of boxes was evaluated with Grounding DINO and DEVA. Mean Average Precision (mAP) is also reported, as AP averaged over Intersection-over-Union (IoU) thresholds.
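
A full COCO-style AP computation integrates a precision-recall curve; the deliberately simplified stand-in below only shows the underlying idea of scoring box alignment by IoU against a range of thresholds (0.50 to 0.95).

```python
import numpy as np

def iou(box_a, box_b):
    """Intersection-over-Union of two (x0, y0, x1, y1) boxes."""
    ix0, iy0 = max(box_a[0], box_b[0]), max(box_a[1], box_b[1])
    ix1, iy1 = min(box_a[2], box_b[2]), min(box_a[3], box_b[3])
    inter = max(0.0, ix1 - ix0) * max(0.0, iy1 - iy0)
    area_a = (box_a[2] - box_a[0]) * (box_a[3] - box_a[1])
    area_b = (box_b[2] - box_b[0]) * (box_b[3] - box_b[1])
    return inter / (area_a + area_b - inter + 1e-8)

def box_alignment_score(pred_boxes, gt_boxes, thresholds=np.arange(0.5, 1.0, 0.05)):
    """Fraction of predicted boxes whose IoU with the matched ground-truth box
    clears each threshold, averaged over thresholds (a crude proxy for mAP)."""
    ious = np.array([iou(p, g) for p, g in zip(pred_boxes, gt_boxes)])
    return float(np.mean([(ious >= t).mean() for t in thresholds]))
```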

Quantitative Tests

Zero-shot results on MSR-VTT, where FO signifies the first frame as a given condition, and 'Box' signifies box constraints.

Regarding the initial quantitative round, against the MSR-VTT dataset, the authors state:

‘The results show that Boximator retains or improves the video quality (FVD) of the base models. In all cases, adding box constraints (Box) significantly improves the average precision (AP) score of bounding box alignment…

‘… In text-to-video synthesis, our Boximator model outperforms the base models, achieving competitive FVD scores of 237 and 239 with PixelDance and ModelScope, respectively. This improvement, despite using frozen base model weights, is probably due to the control module’s training on motion data, enhancing dynamic scene handling.’

The authors note, however, that CLIPSIM scores are only on a par with the current state-of-the-art, and that these scores fall when additional conditions such as first-frame initialization and box constraints are used. They hypothesize that this is because Boximator handles multiple types of alignment simultaneously, in contrast to the base model, which is oriented solely toward a text-to-video approach, and does not take account of the additional methodology that Boximator introduces.

A further quantitative test was made against ActivityNet, and in all cases, the addition of box constraints improved the AP scores obtained in the test:

On ActivityNet, box alignment also produces a notable improvement in AP scores.

Since Boximator introduces something quite novel, like-for-like comparisons are difficult to obtain. Consequently, the researchers provide a note of caution, advising that reported AP scores are not directly equivalent to a success rate in motion control, because the DINO/DEVA evaluations are only approximate.

‘Therefore,’ they observe, ‘it’s more insightful to focus on the difference in AP scores between methods, rather than absolute values’.

User Study

The authors conducted a user study with four participants, each of whom evaluated 100 samples from the methods outlined. For each sample, the participant was shown two videos in random order: one from the base PixelDance model, generated from a text prompt and initialized with the first frame (currently the most common method in text-to-video systems); and one from Boximator.

Participants were required to rate the videos for quality and for motion control. In terms of video quality, the raters evaluated visual distortions, blurs and other quality defects, and, crucially, temporal consistency, which remains the prime bugbear of diffusion-based video.

In terms of motion control, the participants evaluated whether the output videos conformed to the bounding boxes in the first and final frames.

Sample videos from the user study. Please refer to the source paper for many more examples, and for better resolution.

The authors note that Boximator’s output was preferred by ‘a considerable margin’.

Results from the user study of 100 samples.

The paper states:

‘[The] Boximator model was preferred by a significant margin. It excelled in motion controls in 76% of the cases, outperformed by the base model in only 2.2% of the cases.

‘The Boximator model’s video quality was also favored (+18.4%), likely due to the dynamic and vivid content resulting from box constraints.’

Case Study

Finally, a case study, more qualitative in nature, was carried out to demonstrate Boximator’s capability for handling complex scenarios. In the first of these, the provided text prompt was ‘Four pigs are running in the snow’. The results are depicted with (above) and without (below) box constraints:

The authors comment:

‘Boximator successfully populates each box with the target object (a pig) as specified in the text prompt. This contrasts with the second row, where the model without box constraint only produces two pigs.’

Please refer to the paper for further similar supporting examples, as well as to the earlier embedded videos in this article.

The paper concludes:

‘We believe that our design choices and training techniques can be adapted to enable other forms of control, such as conditioning with human poses and key points.’

Conclusion

Text-to-video synthesis is currently plagued by an inability to produce drastic and radical motion. Most of the recent crop of such systems produce video content more akin to ‘living paintings’ than to the kind of video output that may have a future in VFX production pipelines.

Because so much money and effort has been invested in the more commercial branches of these systems, an increasing tendency is to ‘craft’ videos that pander to the weaknesses of the frameworks, producing output that looks dynamic and impressive, but does not actually feature much real movement, if one takes the time to examine it.

Perhaps one of the most impressive innovations of Boximator is not the actual box system, but the will to strip WebVid-10M of the vast quantity of videos that in themselves feature limited movement. Future frameworks, whether or not they adopt a box model of the type described here, might do well to likewise disencumber themselves of training data that will not advance the state of the art in this particularly thorny sub-challenge of T2V architectures.

* The A100 can have either 40GB or 80GB of VRAM, and the new paper does not specify which version was used in the experiments.
