Creating Lip-Synced Video From Audio Speech and Music

AniPortrait

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

Deriving photorealistic video from audio is a surprisingly strong trend in human synthesis research at the moment, with every week bringing at least a handful of new approaches or innovations. The only implementation of this paradigm that has come to any kind of commercial fruition in recent years is the MyHeritage LiveStory feature, which allows the user to write text for (supposedly) a departed loved one, upload an image, and create a short video of the person speaking the written text.

In the hobbyist AI scene, there has been some interest in using such audio/text inputs to animate one-off clips of individuals rendered by systems such as Stable Diffusion; yet the potential marketability of this approach does not seem proportionate to the amount of effort currently being expended on it, since the functionality can only really produce Instagram-style, 'one-off' clips, hinting at deeper capabilities that do not actually exist (yet).

From the perspective of the visual effects community, many of the inner workings of these schemes are actually of deeper interest, such as the relationship between audio speech and visualized phonemes (lip shapes). One possible use for such systems, with a far more commercial scope, is the potential redubbing and visual re-shaping of actors' mouths, so that multiple audio versions of a movie or TV show could be distributed to diverse language markets without the curse of badly-synced overdubbing.

In any case, a relatively interesting new project in this line was put forward last week, titled AniPortrait:

Contains audio. Click to Play. An example of the transliteration of an audio file and a source image into a relatively coherent video representation, courtesy of AniPortrait. Source: https://github.com/Zejun-Yang/AniPortrait

The system is capable of animating either stylized or photorealistic characters. Drawing on a number of prior projects, AniPortrait is a two-stage framework that initially derives 2D facial landmarks from audio, and then passes the output to a framework based on Stable Diffusion.

As with 90% of current AI projects that deal extensively with audio, or in any way with vocal reproduction through machine learning, the output has a deep Asian influence (since Asia represents by far the greatest academic commitment to all kinds of AI voice synthesis and related projects, for reasons that are not entirely clear):

Contains audio. Click to Play. The lip-syncing facility of systems such as AniPortrait is primarily interesting for its potential to re-sync dubbed audio, so that neither subtitles nor sub-standard dubbing sync will be necessary any more.

The new paper (which is very short, and rather light on details, as will necessarily be reflected in this review) is titled AniPortrait: Audio-Driven Synthesis of Photorealistic Portrait Animation, and comes from three researchers at Tencent.

Method

The AniPortrait framework consists of two modules: Audio2Lmk (i.e., audio-to-landmark), and Lmk2Video (landmark-to-video). The first of these is able to extrapolate visual content in the form of facial landmarks that correspond to the interpreted audio, while the latter employs these derived landmarks as anchor guidelines for a Stable Diffusion-based pipeline.

Conceptual schema for AniPortrait. Source: https://arxiv.org/pdf/2403.17694.pdf

Audio2Lmk uses the pre-trained Wav2Vec library, which contains the necessary audio waveform shape/phoneme associations. Two fully-connected layers then convert the extracted audio features into canonical 3D facial landmarks, after which head pose is obtained from the same audio backbone (though, as the quote below explains, without shared weights).
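Though the paper gives no architectural specifics here, the general shape of such an audio-to-landmark head is easy to sketch. The snippet below is illustrative only: it assumes the Hugging Face Wav2Vec2 base model as a frozen audio encoder and a MediaPipe-style 468-point mesh as the landmark target; the hidden size and the authors' actual layer configuration are not disclosed in the paper.

```python
import torch
import torch.nn as nn
from transformers import Wav2Vec2Model

# Rough sketch of an Audio2Lmk-style head: frozen wav2vec features passed through
# two fully-connected layers to regress per-frame 3D landmarks. The landmark count
# (468, as in MediaPipe's face mesh) and hidden width are assumptions for illustration.

class AudioToMesh(nn.Module):
    def __init__(self, num_landmarks: int = 468, wav2vec_name: str = "facebook/wav2vec2-base-960h"):
        super().__init__()
        self.encoder = Wav2Vec2Model.from_pretrained(wav2vec_name)
        self.encoder.requires_grad_(False)          # keep the pre-trained audio encoder frozen
        hidden = self.encoder.config.hidden_size    # 768 for the base model
        self.head = nn.Sequential(
            nn.Linear(hidden, 512),
            nn.ReLU(),
            nn.Linear(512, num_landmarks * 3),      # x, y, z per landmark
        )
        self.num_landmarks = num_landmarks

    def forward(self, waveform: torch.Tensor) -> torch.Tensor:
        # waveform: (batch, samples) at 16 kHz
        features = self.encoder(waveform).last_hidden_state    # (batch, frames, hidden)
        coords = self.head(features)                            # (batch, frames, num_landmarks * 3)
        return coords.view(coords.shape[0], coords.shape[1], self.num_landmarks, 3)

# Example: one second of silence at 16 kHz yields roughly 49 landmark frames.
mesh = AudioToMesh()(torch.zeros(1, 16000))
print(mesh.shape)   # e.g. torch.Size([1, 49, 468, 3])
```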

The paper states*:

‘However, we do not share the weights with the audio-to-mesh module. This is due to the fact that pose is more closely associated with the rhythm and tone present in the audio, which is a different emphasis compared to the audio-to-mesh task. To account for the impact of previous states, we employ a transformer decoder to decode the pose sequence.

‘During this process, the audio features are integrated into the decoder using cross-attention mechanisms. For both of the above modules, we train them using simple L1 loss.’

After the mesh is obtained (and here, it has to be said, this very short paper is very sparsely illustrated), perspective projection is used to transform the 3D information into 2D landmarks, which become signals for the Lmk2Video phase.
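For readers unfamiliar with the term, perspective projection in this context is simply a pinhole-camera mapping of the posed 3D mesh points onto the image plane. A minimal NumPy sketch, assuming an arbitrary focal length and image size (the paper does not specify the camera model used), might look like this:

```python
import numpy as np

def project_landmarks(points_3d: np.ndarray,
                      rotation: np.ndarray,
                      translation: np.ndarray,
                      focal_length: float = 1000.0,
                      image_size: tuple = (512, 512)) -> np.ndarray:
    """Project 3D face-mesh points to 2D pixel coordinates with a pinhole camera.

    points_3d:   (N, 3) mesh vertices in canonical (head-centred) space
    rotation:    (3, 3) head rotation matrix from the predicted pose
    translation: (3,)   head translation from the predicted pose
    """
    # Apply the 6D pose (rotation + translation) to move the canonical mesh into camera space.
    cam_points = points_3d @ rotation.T + translation

    # Pinhole perspective divide, then shift to the image centre.
    cx, cy = image_size[0] / 2, image_size[1] / 2
    x = focal_length * cam_points[:, 0] / cam_points[:, 2] + cx
    y = focal_length * cam_points[:, 1] / cam_points[:, 2] + cy
    return np.stack([x, y], axis=-1)   # (N, 2) pixel coordinates

# Example: an identity rotation with the head pushed two units down the z-axis.
pts = np.random.rand(468, 3) - 0.5
lmk_2d = project_landmarks(pts, np.eye(3), np.array([0.0, 0.0, 2.0]))
print(lmk_2d.shape)   # (468, 2)
```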

In this phase, the Lmk2Video module utilizes the V1.5 iteration of the Stable Diffusion weights as a backbone, converting multi-frame noise inputs into a consecutive sequence of video frames (again, there is very little detail on this process, with no effective illustrations).

At the same time, a ReferenceNet module, designed to mirror the architecture of SD1.5, is used to derive appearance information from the reference image and integrate this back into the processing flow of the backbone.

Though the second module derives inspiration from the described workings of the AnimateAnyone project (which, at the time of writing, has not yet received its promised code release), the authors of the new work observe*:

‘This strategic design ensures the face ID remains consistent throughout the output video. [Different] from AnimateAnyone, we enhance the complexity of the PoseGuider’s design. The original version merely incorporates a few convolution layers, after which the landmark features merge with the latents at the backbone’s input layer.

‘We discover that this rudimentary design falls short in capturing the intricate movements of the lips. Consequently, we adopt [ControlNet’s] multi-scale strategy, incorporating landmark features of corresponding scales into different blocks of the backbone. Despite these enhancements, we successfully keep the parameter count relatively low.’
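The paper does not detail the enhanced PoseGuider's layers, but the ControlNet-style idea it borrows (encoding the rendered landmark image into a pyramid of feature maps whose resolutions match successive blocks of the denoising backbone) can be sketched as follows; the channel widths and depth here are illustrative guesses, not AniPortrait's actual values.

```python
import torch
import torch.nn as nn

# Illustrative sketch of a ControlNet-style multi-scale pose guider: the rendered
# landmark image is encoded into a pyramid of feature maps whose resolutions match
# successive UNet blocks, so lip detail is injected at more than one depth.

class MultiScalePoseGuider(nn.Module):
    def __init__(self, in_channels: int = 3, block_channels=(320, 640, 1280)):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, block_channels[0], kernel_size=3, padding=1)
        downs, projs = [], []
        prev = block_channels[0]
        for ch in block_channels:
            downs.append(nn.Conv2d(prev, ch, kernel_size=3, stride=2, padding=1))
            projs.append(nn.Conv2d(ch, ch, kernel_size=1))   # per-scale projection into the UNet
            prev = ch
        self.downs = nn.ModuleList(downs)
        self.projs = nn.ModuleList(projs)

    def forward(self, landmark_image: torch.Tensor):
        # landmark_image: (batch, 3, H, W) rendered landmark frame
        feats = []
        x = torch.relu(self.stem(landmark_image))
        for down, proj in zip(self.downs, self.projs):
            x = torch.relu(down(x))
            feats.append(proj(x))      # one feature map per backbone resolution
        return feats                   # added to the corresponding backbone blocks

guider = MultiScalePoseGuider()
scales = guider(torch.randn(1, 3, 512, 512))
print([f.shape for f in scales])
# e.g. [torch.Size([1, 320, 256, 256]), torch.Size([1, 640, 128, 128]), torch.Size([1, 1280, 64, 64])]
```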

The authors note that the new system enjoys an additional benefit over the prior approach, in that the reference image's landmark information is used as an extra input during inference. This, they say, gives the network additional clues about the correlation between appearance and facial landmarks, facilitating more precise motion than previous approaches.

This really is about as much detail as the authors provide on AniPortrait’s inner mechanisms, so let’s take a look at the (almost equally sparse) experiments conducted for the project.

Data and Tests

In the Audio2Lmk section of the process, the researchers made use of Google’s popular MediaPipe framework to obtain 3D meshes and 6D poses for annotations. The training data used for the Audio2Mesh system employed inside AniPortrait comes from ‘an internal dataset’, which is not outlined in any detail (and it should be re-emphasized here that the project’s GitHub page has many placeholders for this content, but none of it is online at this time).
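For reference, the MediaPipe face mesh extraction that underpins these annotations looks roughly like the following; this is a minimal sketch of the standard MediaPipe Face Mesh API, and AniPortrait's additional 6D head-pose annotation step is not shown.

```python
import cv2
import numpy as np
import mediapipe as mp

# Minimal sketch of the MediaPipe annotation step: extracting a 468-point 3D face
# mesh from a single frame.

face_mesh = mp.solutions.face_mesh.FaceMesh(
    static_image_mode=True,      # treat each input as an independent image
    max_num_faces=1,
    refine_landmarks=False,
    min_detection_confidence=0.5,
)

frame = cv2.imread("frame_0001.png")                       # any video frame or portrait image
results = face_mesh.process(cv2.cvtColor(frame, cv2.COLOR_BGR2RGB))

if results.multi_face_landmarks:
    lm = results.multi_face_landmarks[0].landmark
    # Normalised (x, y) in [0, 1] plus a relative depth value per landmark.
    mesh = np.array([[p.x, p.y, p.z] for p in lm])
    print(mesh.shape)   # (468, 3)
```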

(It should be noted also that modules are sometimes referenced without context in this paper, and that it is necessary to research them oneself in order to understand their place in the overall framework.)

The Audio2Pose section (which is mentioned only once in the work, and appears in the schema illustration featured above) comes from the SadTalker project, and in this context was trained for tests using the HDTF dataset.

The training for this section was performed on a single NVIDIA A100 (which comes with either 40 or 80GB of VRAM, not specified here), using the Adam optimizer, at a learning rate of 1e-5.

For the training of the Lmk2Video workflow, a two-step process was created: first, the 2D components of the pipeline were trained (i.e., the backbone, ReferenceNet, and the PoseGuider module from AnimateAnyone), while the motion module was excluded. Next, training occurred exclusively for the motion module, with all other modules frozen.
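In PyTorch terms, this kind of two-step schedule is usually implemented with the standard freeze/unfreeze pattern sketched below; the module names are placeholders (stand-in layers), not taken from the authors' implementation.

```python
import itertools
import torch
import torch.nn as nn

# Sketch of the two-step Lmk2Video training schedule described above, using the
# standard PyTorch freeze/unfreeze pattern. The four modules are stand-ins
# (plain Linear layers) rather than the real networks.
backbone, reference_net, pose_guider, motion_module = (nn.Linear(8, 8) for _ in range(4))

def set_trainable(module: nn.Module, trainable: bool) -> None:
    for p in module.parameters():
        p.requires_grad = trainable

# Step 1: train the 2D components; the temporal motion module plays no part.
for m in (backbone, reference_net, pose_guider):
    set_trainable(m, True)
set_trainable(motion_module, False)
optimizer_step1 = torch.optim.AdamW(
    itertools.chain(backbone.parameters(), reference_net.parameters(), pose_guider.parameters()),
    lr=1e-5,
)

# Step 2: freeze the 2D components and train only the motion module.
for m in (backbone, reference_net, pose_guider):
    set_trainable(m, False)
set_trainable(motion_module, True)
optimizer_step2 = torch.optim.AdamW(motion_module.parameters(), lr=1e-5)
```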

The two datasets used to train the motion module are VFHQ and CelebV-HQ. MediaPipe is used to extract landmarks (we are not given details of splits or other filtering criteria), and the upper and lower lips are given different colors when rendering the pose image, to increase the developing network's sensitivity to the distinct movements of the upper and lower lips during speech (sadly, this interesting innovation is not illustrated in the paper).
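As a rough illustration of the two-colour lip rendering described above, the pose image for each frame might be drawn along the following lines; the lip index sets depend on the landmark convention used and are placeholders here, as is the colour choice.

```python
import cv2
import numpy as np

# Sketch of the two-colour lip rendering idea: draw the projected 2D landmarks as a
# pose image, with the upper and lower lip contours in distinct colours so the
# network can more easily separate their motion.

UPPER_LIP_IDX = [0, 1, 2, 3, 4, 5]     # hypothetical indices of upper-lip landmarks
LOWER_LIP_IDX = [6, 7, 8, 9, 10, 11]   # hypothetical indices of lower-lip landmarks

def render_pose_image(landmarks_2d: np.ndarray, size: int = 512) -> np.ndarray:
    """landmarks_2d: (N, 2) pixel coordinates for one frame."""
    canvas = np.zeros((size, size, 3), dtype=np.uint8)

    # All landmarks as plain white dots.
    for x, y in landmarks_2d.astype(int):
        cv2.circle(canvas, (int(x), int(y)), 1, (255, 255, 255), -1)

    # Upper lip in green, lower lip in red (BGR order).
    upper = landmarks_2d[UPPER_LIP_IDX].astype(np.int32).reshape(-1, 1, 2)
    lower = landmarks_2d[LOWER_LIP_IDX].astype(np.int32).reshape(-1, 1, 2)
    cv2.polylines(canvas, [upper], isClosed=False, color=(0, 255, 0), thickness=2)
    cv2.polylines(canvas, [lower], isClosed=False, color=(0, 0, 255), thickness=2)
    return canvas

pose_img = render_pose_image(np.random.rand(468, 2) * 512)
print(pose_img.shape)   # (512, 512, 3)
```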

Examples from the CelebV-HQ dataset. Source: https://arxiv.org/pdf/2207.12393.pdf

For this part of the training, four A100 GPUs (again, VRAM unspecified) were used, with two days dedicated to each step, under the AdamW optimizer, and at a learning rate of 1e-5.

To represent the results of what is apparently a self-referential qualitative test (surprisingly, no prior frameworks are pitted against the system, though many analogous ones are available), the authors compiled a static image, visualized below, which provides examples of conversions that are self-driven, audio-driven, or facial reenactments (i.e., using video as a driver to substitute an identity):

The only results provided in the new paper.

There are no quantitative tests (i.e., featuring standard metric evaluations), and no ablative tests featured in the paper.

Of the tests that were done, the authors observe:

‘[Our] method generates a series of animations that are striking in both quality and realism. We utilize an intermediate 3D representation, which can be edited to manipulate the final output.

‘For instance, we can extract landmarks from a source and alter its ID, thereby enabling us to create a face reenactment effect.’

Below is a selection of examples from the project page, which perhaps better illustrate the actual effect of the system:

Contains audio. Click to Play.

Contains audio. Click to Play.

Contains audio. Click to Play.

Conclusion

As alluded to earlier, the most interesting thing about AniPortrait is the extra attention that has been given to selective lip training, since several of the core components seem likely to feature in later projects that aim to re-sync lip phonemes based on audio.

It’s a shame that the paper is so light on detail, and that the researchers either chose not to trial the system against similar alternatives, or chose not to publish any such comparisons that may have been undertaken.

The PR-style tone and lack of academic rigor in this work are perhaps symptomatic of the scene’s current impatience to monetize and gather investment impetus around functionalities that may have some spurious and limited potential in apps and social media.

We haven’t actually had a major breakthrough in any of the core human synthesis problems in quite a while, despite the advent of carefully-crafted video samples from the likes of Sora and Runway. In the past, initially exciting systems such as Generative Adversarial Networks (GANs) and Neural Radiance Fields (NeRF) have led to expensive investment in refinement, only to find that the systems were relatively intractable, entangled, and difficult to edit or adapt to commercial usage.

A similar sense of frustration and panic may be providing the impetus for a new round of ‘viral AI tools’ that can at least make something profitable out of what there is. The upside is that the high level of motivation to make this happen may inadvertently accelerate development in some of the core areas in which we really do need a breakthrough.

* My conversion of the authors’ inline citations to hyperlinks.
