Instruct-Video2Avatar utilizes the InstructPix2Pix adjunct framework for Stable Diffusion, and the EbSynth (non-AI) tweening software to extrapolate temporally consistent frames from the altered images. Source: https://github.com/lsx0101/Instruct-Video2Avatar
The new approach is called Instruct-Video2Avatar (IV2A). Like some of the prior systems from which it takes inspiration, IV2A interferes with the photogrammetry process native to Neural Radiance Fields (NeRF); instead of allowing real-world face images to be composed directly into an explorable neural matrix, IV2A first runs two additional procedures on the captured images.
First, it converts the source face image into an artistic, alternate or otherwise stylized version using Stable Diffusion and InstructPix2Pix (an improved image synthesis add-on framework which we covered when it came out late last year)…
These consistent altered face/head images are then passed to a version of the INSTA (Instant Volumetric Head Avatars) system, which rationalizes the rendered images into an explorable neural NeRF space.
The process allows for photorealistic or more fantastical transformations, though its obvious potential for deepfakes-style usage (i.e., identity transfer rather than stylization) is not closely examined in the new work.
However, the smooth interpretations of EbSynth also allow for an unusual level of temporal coherence in more simplistic animated styles – one of the most sought-after results in the Stable Diffusion community.
EbSynth’s interpretive abilities mean that the images provided to the NeRF system are consistent, leading to smooth video interpretations.
The resulting NeRF is a deformable radiance field, which means that the canonical ‘default’ disposition is used as a baseline from which morphs and deviations are generated, such as turning the head, tilting it, and changing facial expressions.
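The canonical-plus-deformation idea can be sketched very loosely in a few lines. The functions below are illustrative stand-ins, not INSTA's actual API: a learned deformation maps observed sample points back into the canonical 'default' space, where a single shared radiance field is queried.

```python
import numpy as np

def deformation_field(points, expression_code):
    # Hypothetical stand-in for a learned MLP: maps observed-space
    # sample points, conditioned on an expression/pose code, to
    # positions in the canonical ('default') space.
    return points - 0.01 * expression_code * points  # toy offset

def canonical_field(points):
    # Hypothetical canonical radiance field: returns (rgb, density)
    # for points queried in the canonical space.
    rgb = np.clip(np.abs(points), 0.0, 1.0)
    density = np.linalg.norm(points, axis=-1, keepdims=True)
    return rgb, density

# Rendering a deformed frame: warp each sample point back to the
# canonical space, then query the one shared canonical field.
pts = np.random.rand(8, 3)
canonical_pts = deformation_field(pts, expression_code=0.5)
rgb, density = canonical_field(canonical_pts)
print(rgb.shape, density.shape)  # (8, 3) (8, 1)
```

Because every expression or head pose is resolved through the same canonical field, appearance stays consistent across deformations, which is the property the article describes.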
Though several research initiatives of the last six months have leveraged Stable Diffusion innovations such as ControlNet and DreamBooth, IV2A is the first to come to our attention that has used EbSynth, which is a popular method in the Stable Diffusion community of providing fluid and coherent temporal motion to a system that has no native mechanism to provide such functionality.
EbSynth has a number of shortcomings in respect to this objective, not least that it can require many keyframes in order to provide the smoothest motion, and at the same time limits the number of keyframes usable for any one clip – which means that long productions require the stitching together of multiple EbSynth projects.
In terms of overall consistency, these requirements also oblige the Stable Diffusion user to create a consistent series of keyframes, so that the characteristics of the material do not subtly change as the video progresses (which is in itself difficult to achieve).
By using EbSynth in the more limited way outlined in the new project, the user need only obtain consistency for a small number of keyframes, with EbSynth generating interstitial frames, and NeRF thereafter handling temporal consistency in a predictable and relatively convincing manner.
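The stitching workaround for EbSynth's per-project frame limit amounts to chunking the clip into overlapping segments, one EbSynth project each. EbSynth has no public API, so the helper below (a hypothetical sketch) only prepares the frame ranges a user would feed to separate projects, with a small overlap so adjacent segments can be blended when rejoined:

```python
def split_into_ebsynth_segments(num_frames, max_frames_per_project, overlap=1):
    """Split a clip into frame ranges small enough for one EbSynth
    project each, overlapping by `overlap` frames so that adjacent
    segments can be blended when stitched back together."""
    segments = []
    start = 0
    while start < num_frames:
        end = min(start + max_frames_per_project, num_frames)
        segments.append((start, end))
        if end == num_frames:
            break
        start = end - overlap
    return segments

print(split_into_ebsynth_segments(250, 100))
# [(0, 100), (99, 199), (198, 250)]
```

Each tuple is a half-open frame range; the single-frame overlap gives the stitching step a shared frame on which to match the two segments' styles.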
The use of EbSynth has been a hobbyist or artisanal pursuit for some years, but the recent beta trial of the Studio 1.0 version offers a CLI-driven ‘version optimized for studio pipelines’, perhaps signifying that tweening will remain the primary method of overcoming Stable Diffusion’s temporal shortcomings in the near future. IV2A is the first notable academic initiative to incorporate EbSynth into a rational neural synthesis pipeline.
The new paper is titled Instruct-Video2Avatar: Video-to-Avatar Generation with Instructions, and comes from Shaoxu Li of the John Hopcroft Center for Computer Science at Shanghai Jiao Tong University.
As with typical NeRF workflows in avatar creation, IV2A takes an input video as source material from which to build up a neural reconstruction, which can then be subject to deformations that represent natural movement. The text-to-image component comes in the form of Stable Diffusion image-to-image manipulations of each frame, facilitated by InstructPix2Pix.
One ‘exemplar’ image is initially fed into the system, an image altered by text instructions such as ‘Make him older’, ‘make him an elf’, etc. If one were to run InstructPix2Pix sequentially and unaided over the extracted video frames, the results would exhibit inconsistencies typical of Stable Diffusion’s inability to reproduce any solution perfectly twice; but the EbSynth tweening instead takes the previous frame as the starting point for the next frame, supporting the necessary continuity of appearance.
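The propagation scheme just described can be sketched as a simple chain: the diffusion edit is applied once, to the exemplar only, and every subsequent frame is derived from the previous edited frame rather than re-sampled. The functions here are illustrative stubs, not the actual InstructPix2Pix or EbSynth interfaces:

```python
def edit_with_instruct_pix2pix(image, instruction):
    # Stand-in for an InstructPix2Pix call; in the real pipeline this
    # invokes Stable Diffusion once, on the exemplar frame only.
    return f"edited({image})"

def tween(prev_edited, next_original):
    # Stand-in for EbSynth: transfers the style of `prev_edited`
    # onto the next original frame via patch-based correspondence.
    return f"tween({prev_edited}->{next_original})"

frames = ["f0", "f1", "f2", "f3"]
edited = [edit_with_instruct_pix2pix(frames[0], "make him an elf")]
for i in range(1, len(frames)):
    # Each new frame starts from the previous *edited* frame, never
    # from a fresh diffusion sample, preserving continuity of appearance.
    edited.append(tween(edited[-1], frames[i]))
print(edited[0], edited[1])
```

Because no frame after the first ever passes through the diffusion model, the seed-to-seed inconsistency the article describes never gets a chance to accumulate.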
Though the paper diligently documents the process through formal academic method, there really isn’t a lot more to the system than has been outlined here.
One additional requirement was to ensure that the output from EbSynth maintains adequate consistency. Though EbSynth takes the previous frame as input for the next, various factors can contrive to warp or deviate from the original design as the tweening continues. Therefore the author performs some additional processing on the EbSynth output. He states:
‘For high-quality synthesis, we propose an iterative dataset update. We only edit the sampler image once and execute iterations on other images. In the first training, the editing is carried on the head images from the original video. In the later training cycle, the editing is carried out on the rendered images from the optimized head avatar.’
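The iterative dataset update the author describes can be outlined as a training loop, again with hypothetical stubs in place of the real INSTA and EbSynth components: the exemplar is edited once, the first cycle propagates that edit onto the original head images, and later cycles propagate it onto renders from the current avatar instead.

```python
def train_avatar(dataset):
    # Stand-in for INSTA training on the current edited dataset.
    return {"trained_on": list(dataset)}

def render_heads(avatar, num_frames):
    # Stand-in: re-render each frame from the current optimized avatar.
    return [f"render_{i}" for i in range(num_frames)]

def propagate_edit(exemplar, images):
    # Stand-in for EbSynth-style propagation of the single edited
    # exemplar onto a batch of images.
    return [f"edited({img})" for img in images]

original_heads = [f"head_{i}" for i in range(4)]
exemplar = "edited_exemplar"  # the one image actually edited by InstructPix2Pix

# First training cycle: propagate the edit onto the original video frames.
dataset = propagate_edit(exemplar, original_heads)
avatar = train_avatar(dataset)

# Later cycles: propagate the edit onto renders of the *current* avatar,
# so the edited dataset and the avatar converge together.
for _ in range(2):
    renders = render_heads(avatar, len(original_heads))
    dataset = propagate_edit(exemplar, renders)
    avatar = train_avatar(dataset)
print(len(dataset))
```

Each cycle thus cleans up the training data using the avatar's own output, which is consistent with the paper's later finding that the refinement passes notably improve output quality.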
In line with more traditional VFX CGI pipelines, the ‘core’ image chosen should ideally be one with the mouth open, since this facial movement cannot be easily inferred from a closed-mouth image, but the mouth can be ‘re-sealed’ as necessary, once the system has some knowledge of the mouth interior (tongue, teeth, facial mouth disposition, etc.).
The rest of the IV2A system relies on the avatar reconstruction abilities of Max Planck’s prior (2023) work INSTA (see video above, and the source paper), to effect a NeRF synthesis of the input images.
Data and Experiments
The researcher conducted various qualitative and quantitative experiments on IV2A, comparing it against alternative methodologies, including the use of Dual Attention GAN (DaGAN), with images extracted from diverse videos serving as datasets.
The Windows-only standard EbSynth executable was used for the tweening process, with other experiments carried out on Ubuntu Linux on an NVIDIA RTX 3090 GPU with 24GB of VRAM.
The methods used in the tests were: ‘InstructPix2Pix+One Seed’, wherein a single fixed seed and guidance weights value informed the facial image alteration; ‘InstructPix2Pix+EbSynth’, wherein the extra rounds of sampling were not carried out; and ‘InstructPix2Pix+DaGAN’, using the prior framework.
We can observe in the left section of the second row from the top that the single-seed method does not accomplish the text instruction to convert the image to an anime style, rather producing a more deepfake-style effect, while the iterative method (bottom row) appears to most faithfully adhere to the prompt.
In terms of temporal stability, the paper refers to supplementary videos which may not yet be available, or may be the animated GIFs supplied at the project site (which have been concatenated in their entirety for this article). We have reached out to the author for access to any additional information, and for clarification, but have not heard back at this time.
Regarding one particular section of these tests, the author states:
‘With DaGAN, the edited image consistency increases a lot. But the image quality is inferior and there are significant inconsistencies before and after editing. For example, “The Hulk” can hardly open his mouth and the eyes of the “17 years old man” open unexpectedly.
‘With EbSynth, the edited images are sharpest with good quality and are consistent with the original images. Some noises exist in the edited results. For example, there are noises in the mouth of the “anime man”. Our method produces images with good quality. Some shadow noises exist around the avatar head, which are caused by the radiance field. The mouth expressions vary with DaGAN, EbSynth, and our method.’
The paper emphasizes the importance of the aforementioned iterative updates when processing the facial images, and provides comparisons to illustrate the effect of this:
The author additionally carried out a user study, wherein 20 participants were asked to score 10 edited videos (not as yet published) demonstrating all the aforementioned methods. Regarding these results, the author states:
‘For “High Definition”, InstructPix2Pix+EbSynth gets the highest score and ours is the second-highest. For “Temporal Consistency”, ours gets the highest score and One time Dataset Update with EbSynth is the second-highest.’
In ablation tests, the study found that EbSynth ‘significantly enhances’ consistency in per-frame editing results, and that the 3x refinement process notably improves the quality of output.
The author suggests that the method proposed can be extended eventually into an effective pipeline for arbitrary video editing, in which text-to-image instructions could be used to directly manipulate rasterized video content with one (admittedly resource-intensive) pass through the neural pipeline indicated in the new work.
The main takeaway from this paper is the leveraging of EbSynth’s algorithmic (rather than AI-based) tweening method as a way of tackling Stable Diffusion’s shortcomings in terms of temporal stability. The additional use of NeRF, which is naturally stable (since it is effectively a neural analog to older CGI approaches), indicates the possible extent of the growing desperation of the research community.
Ageing a subject using the new system.
It seems hard to believe that a generative system as powerful as Latent Diffusion Models (LDMs) cannot be coaxed, in some more intrinsic and fundamental way, into producing temporally coherent output. Yet all indications are that Stable Diffusion and similar LDM-based systems will need to rely on secondary technologies to perform this functionality, which will in effect turn LDMs into mere skinning or texture-based content systems for entirely discrete temporal systems.
This unsatisfactory state of affairs is essentially a repeat of the slow and disappointing process by which the community eventually came to realize that the similarly astounding reconstructive potential of GANs could not be made to produce coherent movement and temporal consistency without adjunct and ‘bolt on’ technologies such as 3DMM, SMPL, and various other CGI-based methods.
It could be that the emergent Studio 1.0 version of EbSynth will enable projects similar to this one to construct entirely Linux-based VFX pipelines that use algorithmic tweening to produce consistent results, leading to less clunky, multi-platform methodologies.
As it stands, the output of new text-to-video systems continues to leverage the same kind of ‘cheap tricks’ that proponents of autoencoder-based deepfakes used for years, to make the systems seem more capable and versatile than they really were. For instance, in the past week, RunwayML has dazzled less sophisticated users with a new video demonstrating the power of its LDM T2V generative system, with a video showing extraordinary transformations of a Scorsese-style POV shot:
However, many users have observed that this impressive shot pre-constrains the generative requirements by keeping the person in the picture practically unmoving in the shot – a shortcoming that’s common to EbSynth itself, which does not handle ‘wild’ or fast motion very well, since it is difficult to guess a successive transformed frame from a prior frame that is radically different, in terms of positioning.
Thus, we haven’t necessarily got further than cheap tricks yet, either in terms of the new paper or the latest T2V demos. The fundamental capabilities are missing from the target system; therefore they are inevitably also missing in the output.