Multitask Video Synthesis Without Fine-Tuning


About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

In a period where new AI video alteration and text-to-video (T2V) frameworks seem to emerge daily in the scientific literature, our attention has been caught by a new system that leverages existing text-to-image (T2I) and image-to-video (I2V) frameworks, and seems able to obtain superior results across a broad range of common synthesis tasks – including identity replacement – without expensive and damaging fine-tuning of the base models involved.

Examples from the ‘Prompt based Editing’ section of the project site for AnyV2V. Source:

The new framework, titled AnyV2V, is a ‘plug-and-play’ system that can accommodate a variety of potential T2I and I2V systems, and obviates the need for fine-tuning by extracting features from the workflows of guest systems and re-applying them at an apposite moment downstream.

AnyV2V deals with four tasks: prompt-based editing (featured in the example above), where text is used to condition changes in the video; reference-based style transfer (popularized in recent years by various uses of classic artist styles such as Van Gogh's), which can be accomplished with even a single source image; subject-driven editing, where an object in a video is replaced by a target object supplied by the user in the form of a photo; and identity manipulation, where a target identity in the source video is replaced by an identity chosen by the user.

Style Transfer from static images can be applied to moving video in AnyV2V.

Identity manipulation at work in AnyV2V, with a single source image.

Subject-driven editing can substitute creatures and/or objects in the new system, using existing frameworks.

The new approach breaks down these tasks into two stages: the use of an off-the-shelf image editing framework (such as InstructPix2Pix, InstantID, or others); and the use of an existing I2V system (such as I2VGen-XL) for DDIM inversion (the projection of user-supplied content and variables into the model’s latent space) and feature injection.

The paper states:

‘In the first stage, AnyV2V can plug in any existing image editing tools to support an extensive array of video editing tasks. Beyond the traditional prompt-based editing [methods], AnyV2V also can support novel video editing tasks, including reference-based style transfer, subject-driven editing, and identity manipulation, which were unattainable by previous methods.

‘In the second stage, AnyV2V can plug in any existing image-to-video models to perform DDIM inversion and intermediate feature injection to maintain the appearance and motion consistency with the source video.’

The authors claim that AnyV2V outperforms prior approaches by 35%, even without fine-tuning, and that results obtained by it are preferred in human-based tests by 25%. They also believe that the system’s methodology will make it easily adaptable to later innovations in T2I and T2V frameworks.

A promotional overview of the concept of AnyV2V.

The new paper is titled AnyV2V: A Plug-and-Play Framework For Any Video-to-Video Editing Tasks, and comes from five authors across the University of Waterloo, the Vector Institute at Toronto, and


In AnyV2V, the initial seed frame is extracted from the beginning of the source video, and edited into the user’s desired configuration, as necessary. The edited initial frame is then fed with a target prompt into an I2V generative model, while the latent codes from the source video (i.e., the video which contains the guiding motion) are re-injected back into the flow.
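Reduced to a sketch, the flow just described might look like the following. Everything here is a hypothetical stand-in operating on NumPy arrays, not AnyV2V's actual code: in the real pipeline, `edit_first_frame` would wrap an off-the-shelf image editor, `ddim_invert` an inversion pass through the I2V model, and `i2v_sample` the guided I2V sampler.

```python
import numpy as np

def edit_first_frame(frame, instruction):
    # Stand-in for an off-the-shelf image editor (e.g. InstructPix2Pix).
    return frame * 0.9

def ddim_invert(frames):
    # Stand-in: project each source frame into a latent noise code.
    rng = np.random.default_rng(0)
    return [f + rng.normal(0, 0.01, f.shape) for f in frames]

def i2v_sample(edited_frame, latents, prompt):
    # Stand-in for an I2V model: animates the edited first frame while
    # the inverted source latents supply the motion guidance.
    return [edited_frame + (lat - lat.mean()) for lat in latents]

def any_v2v(source_video, instruction, prompt):
    edited = edit_first_frame(source_video[0], instruction)  # stage 1: edit seed frame
    latents = ddim_invert(source_video)                      # stage 2a: DDIM inversion
    return i2v_sample(edited, latents, prompt)               # stage 2b: guided sampling

video = [np.ones((4, 4)) * i for i in range(3)]
result = any_v2v(video, "turn the object blue", "a blue object")
print(len(result))  # → 3 (one output frame per source frame)
```

The point of the structure is that only the first frame is ever edited directly; every subsequent frame inherits the edit through sampling, with motion carried by the inverted source latents.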

Conceptual schema for AnyV2V.

The authors note that the general run of generative video architectures can only influence content through text prompts (e.g., TokenFlow) or text instructions (e.g., InsV2V). By contrast, AnyV2V allows ad hoc modification of the first frame, facilitating fine-grained editing of the output video content, and enabling workflows that offer inpainting, subject-driven image editing, style transfer, and mask-based image editing, among others.

They note also that the first frame can be edited manually, by more traditional methods.

The DDIM inversion that projects the modified frame into the system is conducted without the aid of text prompts, relying entirely on the semantic content of the first edited frame, and the features derived from it.

However, the researchers observe*:

‘In practice, we find that due to the limited capability of certain I2V models, the edited videos denoised from the last time step are sometimes distorted. Following [Dreamedit], we observe that starting the sampling from a previous time step T ′ < T can be used as a simple workaround to fix this issue.’
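The control flow of that workaround can be illustrated with a toy denoiser (all names and functions below are hypothetical stand-ins, not the paper's implementation): the sampler simply begins from the latent recorded at an earlier inversion step T′ rather than the final step T, skipping the noisiest, most distortion-prone steps.

```python
import numpy as np

def denoise_step(latent, t, target):
    # Stand-in for a single reverse-diffusion step of an I2V model,
    # nudging the latent toward a target.
    return latent + (target - latent) / (t + 1)

def sample(inverted_latents, target, start_step):
    # inverted_latents[t] holds the latent recorded at inversion step t;
    # sampling begins at start_step, which may be T' < T.
    latent = inverted_latents[start_step]
    for t in range(start_step, 0, -1):
        latent = denoise_step(latent, t, target)
    return latent

T = 50
inverted = {t: np.full(2, float(t)) for t in range(T + 1)}
target = np.zeros(2)

from_T = sample(inverted, target, T)                   # denoise from the final step T
from_T_prime = sample(inverted, target, int(0.9 * T))  # workaround: start at T' < T
```

Only the starting index changes; the per-step denoising logic is untouched, which is why the fix is cheap enough to apply per guest model.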

Though the authors’ research indicates that existing I2V models have some nascent capability for editing based on the first frame, they note also that using this approach tends to distort parts of the video that are not meant to change, such as backgrounds and environments.

Therefore, when re-injecting features obtained from the source video, the system inserts the features into both the convolutional layers and the spatial attention layers.

For purposes of motion guidance, a novel approach was needed, since the researchers deduced that the I2V models likely to be used in the process tend to be fine-tuned at low learning rates, and concentrate on temporal layers (i.e., layers that affect movement), tending to freeze less relevant parts of the model.

Though this is the least damaging way to fine-tune for these purposes, and minimally affects the original weights of the base model, it also tends to ‘lock’ the motion features obtained from the source video, inhibiting extensive editing.

For this reason, the researchers inject the temporal attention features after other essential processing has taken place, just as is done for the spatial attention layers in the previous stage.
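The record-then-inject pattern underlying this can be sketched in plain Python (layer names and the toy arithmetic here are hypothetical, not AnyV2V's implementation): a pass over the source video caches the outputs of the convolution, spatial-attention and temporal-attention layers, and the editing pass then substitutes those cached features for its own at the same layers.

```python
class Layer:
    def __init__(self, name, scale):
        self.name, self.scale = name, scale

    def forward(self, x):
        return x * self.scale  # stand-in computation

def run(layers, x, inject=None):
    feats = {}
    for layer in layers:
        x = layer.forward(x)
        if inject and layer.name in inject:
            x = inject[layer.name]  # re-inject the cached source feature
        feats[layer.name] = x
    return x, feats

layers = [Layer("conv", 2.0), Layer("spatial_attn", 3.0), Layer("temporal_attn", 5.0)]

# Pass 1: run over the source video, recording features at each target layer.
_, source_feats = run(layers, 1.0)

# Pass 2: the editing pass, with the recorded features injected downstream.
edited_out, _ = run(layers, 10.0, inject=source_feats)
```

Injecting at every layer, as in this toy, reproduces the source pass exactly; in practice the balance between injected and freshly computed features is what allows the edit to survive while motion and layout are preserved.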

The ‘method’ section of the paper is relatively short, compared to most T2V release publications, since the central concept is an aggregating framework rather than a labor-intensive reworking of the inner processes. Therefore, as the paper itself does, we’ll proceed more quickly than usual to the testing and data phase.

Data and Tests

For the testing phase, the four core functionalities of AnyV2V were tested with a variety of guest systems, and against prior approaches, as well as in ablative studies (the latter not covered here).

To recap, the four modalities are: prompt-based editing; reference-based style transfer; subject-driven editing; and identity manipulation.

To implement the tests, three I2V generation models were used: the aforementioned I2VGen-XL; ConsistI2V; and the SEINE framework. DDIM sampling was set to the default values of the respective systems, and Classifier-Free Guidance (CFG) was used for all models with the same negative prompt, ‘Distorted, discontinuous, Ugly, blurry, low resolution, motionless, static, disfigured, disconnected limbs, Ugly faces, incomplete arms’.

The seed-frame candidates used were InstructPix2Pix, Neural Style Transfer (NST), AnyDoor (a subject-driven image editing framework), and the InstantID architecture. Only successfully-edited frames were used in the method, which is not intended for totally automatic operation.

For an initial quantitative evaluation round for prompt-based editing, as well as determination of overall preference of videos, a human study was conducted, with AnyV2V trialed against the baseline models Tune-A-Video, TokenFlow, and FLATTEN.

For a truly quantitative aspect (see note on nomenclature at the end), and in line with the prior works UniEdit and Pix2Video, CLIP was used to assess the average cosine similarity between CLIP image embeddings of consecutive frames (thereby measuring frame-to-frame consistency), and to compare the frame embeddings to the CLIP embedding of the editing prompt (these two metrics were dubbed CLIP-Image and CLIP-Text, respectively).
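As described, the two metrics reduce to averaged cosine similarities over CLIP embeddings. A minimal sketch, with random stand-in vectors in place of real CLIP image/text encoder outputs:

```python
import numpy as np

def cosine(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_image_score(frame_embeds):
    # Mean cosine similarity between the embeddings of consecutive frames.
    return float(np.mean([cosine(frame_embeds[i], frame_embeds[i + 1])
                          for i in range(len(frame_embeds) - 1)]))

def clip_text_score(frame_embeds, prompt_embed):
    # Mean cosine similarity of each frame embedding to the prompt embedding.
    return float(np.mean([cosine(f, prompt_embed) for f in frame_embeds]))

# Stand-in embeddings; a real pipeline would obtain these from a CLIP encoder.
rng = np.random.default_rng(0)
frame_embeds = [rng.normal(size=512) for _ in range(8)]
prompt_embed = rng.normal(size=512)

img_score = clip_image_score(frame_embeds)   # 'CLIP-Image'
txt_score = clip_text_score(frame_embeds, prompt_embed)  # 'CLIP-Text'
```

Both scores lie in [-1, 1], with identical consecutive frames scoring a CLIP-Image value of exactly 1.0.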

Quantitative results for the prompt-based editing round.

Of these results, the authors state:

‘Human evaluation [results] demonstrate that our model achieves the best overall preference and prompt alignment among all methods, and AnyV2V (I2VGen-XL) is the most preferred method. We conjecture that the gain is coming from our compatibility with state-of-the-art image editing models.’

To test reference-based style transfer, identity manipulation and subject-driven editing, the results were likewise subjected to human trials, concentrating on the evaluation of the quality and alignment of the reference (seed) images, as well as overall image quality.

Results of the (semi) quantitative round for the tasks of style transfer, identity manipulation, and subject-driven editing. Image editing success is also reported here, due to the inability of current models to produce consistent and precise results.

The authors comment here:

‘[We] observe that AnyV2V (I2VGen-XL) is the best model across all tasks, underscoring its robustness and versatility in handling diverse video editing tasks. AnyV2V (SEINE) and AnyV2V (ConsistI2V) show varied performance across tasks.

‘AnyV2V (SEINE) performs good reference alignment in reference-based style transfer and identity manipulation, but falls short in subject-driven editing with lower scores.

‘On the other hand, AnyV2V (ConsistI2V) shines in subject-driven editing, achieving second-best results in both reference alignment and overall preference.’

For a round of purely qualitative testing, the authors trialed AnyV2V against Tune-A-Video, TokenFlow, and FLATTEN. InstructPix2Pix was used to edit the seed frame.

Qualitative results for prompt-based editing against a variety of contenders.

Regarding these tests, the paper states:

‘[Our] method correctly places a party hat on an old man’s head and successfully turns the color of an airplane to blue, while preserving the background and keeping the fidelity to the source video. Comparing our work with the three baseline models [TokenFlow], [FLATTEN], and [Tune-A-Video], the baseline methods display either excessive or insufficient changes in the edited video to align with the editing text prompt.

‘The color tone and object shapes are also tilted. It is also worth mentioning that our approach is far more consistent on some motion tasks such as adding snowing weather, due to the I2V model’s inherent support for animating still scenes.

‘The baseline methods, on the other hand, can add snow to individual frames but cannot generate the effect of snow falling, as the per-frame or one-shot editing methods lack the ability of temporal modelling.’

Testing reference-based style transfer, the authors used the NST framework to create an edited frame. The authors claim that the results obtained offer artists an ‘unprecedented opportunity’ to express themselves creatively. The results, seen below, leverage Kandinsky and Van Gogh:

Static results for style-based transfer.

Qualitative tests were made also for subject-driven editing, using the AnyDoor architecture for initial frame editing:

Comparison between frameworks for subject-driven editing.

Of the subject-driven tests, the paper states:

‘AnyV2V produces highly motion-consistent videos when performing subject-driven object swapping. In the first example, AnyV2V successfully replaces the cat with a dog according to the reference image and maintains highly aligned motion and background as reflected in the source video. In the second example, the car is replaced by our desired car while maintaining the rotation angle in the edited video.’

Finally, qualitative tests were produced for identity manipulation. Here a combination of InstantID and ControlNet was used for initial frame generation, though the authors observe that this approach will inevitably change the background content as well. However, this can presumably be remedied in the future by segmentation-based approaches, and this functionality, the paper states, is amenable to a variety of diverse identity manipulation systems.

Identity substitution also changes the background.

Conclusion

An increasing number of synthesis projects are becoming concerned with the provision of ‘catch-all’ frameworks – and only last week we reported on a similar project designed to unify facial analysis.

In the case of AnyV2V, the primary innovation appears to be the extraction, withholding and judicious re-injection of features at inference time, which has obtained some quite impressive results. Though it would have been interesting to see a wider range of guest frameworks trialed for the project, and while the scene’s disposition to standardize on anything but historical datasets is somewhat lacking in this period, any attempt to instrumentalize some of the more arcane processes of recent I2V systems has got to be a welcome development.

* My substitution of the authors’ inline citation/s for hyperlinks.

Though it is unusual to describe a qualitative method such as a human user study in quantitative terms, which normally denote algorithmic or otherwise automated metric-evaluation procedures, we present the results here as the paper puts them forward. Such metric methods, as the article makes clear, were also employed under the same umbrella.
