Better Neural Avatars From Just Five Face Images


As the image synthesis sector’s ability to produce amazing results with neural humans improves, an increasing focus on optimization has become apparent in the recent literature.

Systems that can perform extraordinary face-swaps or reproduce human appearance through AI-based methods, no matter how dazzling the results, are of limited use if the resources required are exorbitant; or the methods by which the results are achieved are labor-intensive or time-consuming; or, in short, if the system itself doesn’t allow for a relatively lightweight, flexible and sustainable deployment.

Results for driven animation, from the testing phase of the recent 'Gaussian Avatars' project, which achieved impressive results, but with extensive resources. Source: https://www.youtube.com/watch?v=lVEY78RwU_I

In the field of generative avatars (neural representations of human heads and/or bodies), great results are generally synonymous with great effort, and such demanding systems risk being limited to specialist fields such as visual effects pipelines, or to high-latency online implementations that depend on expensive cloud compute. Therefore a system that can achieve effective human avatars with fewer requisites is likely to attract greater interest, both in ‘heavy’ applications and, potentially, at a consumer level.

One such approach has just been put forward in a project from Google, in collaboration with the University of Minnesota. Titled One2Avatar, the new method – in line with image personalization systems such as DreamBooth and LoRA (both of which require very little training data) – offers a flexible and dynamic human avatar representation using just five source images:

From the One2Avatar project page, an example of the derivation of a drivable avatar based on only five source images. Source: https://zhixuany.github.io/one2avatar_webpage/

This improved approach is possible because One2Avatar relies on a prior trained on notably fewer identities than previous methods require, and incorporates multiple synthesis systems, including 3D Morphable Models (3DMMs), Neural Radiance Fields (NeRF), and Generative Adversarial Networks (GANs).

In tests, the new work was able to achieve substantial improvements on current similar state-of-the-art architectures, and may indicate a leaner way forward for effective facial synthesis.

It should be noted that this capability extends to the substitution of identity (i.e., deepfaking) – though the paper steers away from this topic in favor of emphasizing the system’s potential for telepresence, where a subject is recreated neurally, instead of driving the appearance of a different identity.

Click to play. The trained model can be driven by alternative personalities, effecting deepfake video functionality. However, this aspect is de-emphasized in the new work.

The new paper is titled One2Avatar: Generative Implicit Head Avatar For Few-shot User Adaptation, and comes from nine researchers across Google and UoM – and draws heavily on the April 2023 paper Learning Personalized High Quality Volumetric Head Avatars from Monocular RGB Videos, which features some of the same authors.

Prior Restraint

It must be obvious from the video above that no system could possibly know how to render such transitions between pose and expression, based only on five source images. In common with many other such frameworks, One2Avatar uses prior knowledge in the same domain (i.e., ‘faces’, rather than ‘churches’, or ‘cats’) in order to fill in the gaps.

This is nothing new. Even in traditional, 2017-era deepfake systems such as DeepFaceLab, it’s possible to ‘prime’ the deepfake model, prior to actual training, by pretraining on thousands, or even millions of faces, using open source face datasets such as NVIDIA’s Flickr-Faces-HQ (FFHQ) or the widely-used CelebA dataset.

In effect, pretraining in this way is equivalent to the system ‘cramming’ for a tough exam by studying topics similar to the one that the exam will be about.

Though it’s unlikely, in the case of One2Avatar, that the exact five faces to be interpreted will be featured in the training data, by learning about such a large number of other faces, the system develops priors – teleological templates (and generalized knowledge about human faces) that are likely to be applicable, in their broadest principles, to any face that may be thrown at it later.

(The same principle applies for priors in text-to-video and other kinds of systems that generate video, except that in such cases motion priors are obtained, which contain temporal information, rather than just data about poses, texture, and geometry.)

Since FFHQ is one of the most popular face datasets in vision synthesis, it has become the go-to source for pretraining – not least because this allows researchers to compare their new projects directly with previous similar approaches that also used FFHQ (and though this is useful for comparative purposes over time, it causes a certain amount of entropy and friction in the community, since novel or alternative datasets can be penalized for not adhering to an established ‘standard’).

In any case, one of the major innovations of One2Avatar is that its prior requires far fewer identities than those of similar projects – only 2,407, in comparison to the far higher numbers used in other works.

Such is the power of this prior knowledge that it allows One2Avatar, albeit slightly less effectively, to create a neural synthesis based on only a single source image:

Click to play. Though not quite as effective as when five source images are provided, priors allow even a single source training image to be turned into a flexible avatar that can be driven by source movement.

Method

The aim of the project is to use the generalized prior knowledge to power the rendering of a driven avatar, capable not only of diverse poses, but also of variegated and even extreme facial expressions.

To form the modest dataset (which would ultimately yield 208,000 images – spartan by the standards of the field), the authors captured high-resolution facial images from 2,407 people, asking each participant to replicate thirteen predefined facial expressions, and recording the results from thirteen camera viewpoints.

The presence of 13 extreme facial expressions in the curated source training dataset means that even if an 'extreme' expression such as this is not present in the 1-5 source images, it can be plausibly rendered at inference time.

The authors note that the FFHQ dataset features a more limited number of expressions, and comment:

‘Our dataset includes a limited number of distinctive subjects compared to the existing data, e.g., 70K in [FFHQ]. Nonetheless, it includes a wide range of facial expressions, which plays a key role in learning a generative prior model.’

Examples from the FFHQ dataset, which is characterized by frontal, smiling face shots, and in many ways unsuited as training material for more flexible systems, which require extreme angles and a good distribution of expressions, both of which are unlikely to be found in the kind of ad hoc, web-scraped publicity sources that power FFHQ.

Each of the resulting images was fitted to a 3DMM parametric facial model (find out more about 3DMMs in our comprehensive article on them). The coordinates of the various points on the obtained 3DMM representation are easy to access, and easy to use as an intermediary – a precise, controllable layer in a process that is otherwise rather harder to wrangle.

These 3DMM coordinates are then transposed into a NeRF representation, in accordance with (some of) the authors’ prior work on MonoAvatar.
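
By way of illustration (this sketch is mine, not drawn from the paper’s code), the ‘coordinates’ in question come from the standard 3DMM formulation, in which a face mesh is a mean shape deformed by linear identity and expression bases; all of the array sizes and names below are placeholder assumptions:

```python
import numpy as np

# A minimal, illustrative 3DMM: vertices are a mean face shape deformed by
# linear identity (shape) and expression bases. All arrays here are dummy
# placeholders; a real model (e.g. FLAME or BFM) supplies them.
num_vertices = 5023                                           # roughly FLAME-sized
mean_shape   = np.zeros((num_vertices, 3))
shape_basis  = np.random.randn(num_vertices, 3, 100) * 1e-3   # 100 identity components (assumed)
expr_basis   = np.random.randn(num_vertices, 3, 50)  * 1e-3   # 50 expression components (assumed)

def morph(shape_coeffs: np.ndarray, expr_coeffs: np.ndarray) -> np.ndarray:
    """Return per-vertex 3D coordinates for given identity/expression codes."""
    return mean_shape \
         + np.einsum('vck,k->vc', shape_basis, shape_coeffs) \
         + np.einsum('vck,k->vc', expr_basis, expr_coeffs)    # (num_vertices, 3)

# Fitting an image to the 3DMM amounts to recovering shape_coeffs, expr_coeffs
# (plus pose/camera), after which these vertex positions become the intermediary
# that anchors the NeRF described below.
vertices = morph(np.zeros(100), np.zeros(50))
```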

The MonoAvatar project, from several of the same authors as the new work, devised a method for obtaining NeRF representations from 3DMM coordinates. Source: https://augmentedperception.github.io/monoavatar/

The authors state:

‘[Instead] of encoding all rendering information into a high-capacity neural network, local features are attached on the vertices of the 3DMM mesh scaffold reconstructed for the target identity and expression. During rendering, each query point aggregates the features from k-Nearest-Neighbors (kNN) in the 3DMM vertices and sends it into a MLP network to predict color and density.

‘To simplify learning using existing 2D CNNs, the 3DMM vertex-attached features can be learned in the unified UV space and sampled using texture coordinates.’

The conceptual schema for One2Avatar. A NeRF representation, conditioned by 3DMM coordinates, is used as the prior avatar representation.
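
To make the quoted mechanism concrete, the following is a minimal, hypothetical sketch of a vertex-anchored field of the kind described: features attached to 3DMM vertices are gathered by k-Nearest-Neighbors for each query point and passed to a small MLP that predicts color and density. The layer sizes, inverse-distance weighting, and tensor shapes are my own assumptions, not the authors’ implementation:

```python
import torch
import torch.nn as nn

class VertexAnchoredField(nn.Module):
    """Toy 3DMM-anchored radiance field: features live on mesh vertices,
    and each query point aggregates its k nearest vertices' features."""
    def __init__(self, feat_dim: int = 32, k: int = 8):
        super().__init__()
        self.k = k
        self.mlp = nn.Sequential(
            nn.Linear(feat_dim + 3, 128), nn.ReLU(),
            nn.Linear(128, 4),                          # RGB (3) + density (1)
        )

    def forward(self, query_pts, vertices, vertex_feats):
        # query_pts: (Q, 3), vertices: (V, 3), vertex_feats: (V, F)
        dists = torch.cdist(query_pts, vertices)              # (Q, V)
        knn_d, knn_idx = dists.topk(self.k, largest=False)    # (Q, k) nearest vertices
        feats = vertex_feats[knn_idx]                          # (Q, k, F)
        # Inverse-distance weighting of the k nearest vertex features (an assumption)
        w = 1.0 / (knn_d + 1e-6)
        w = w / w.sum(dim=-1, keepdim=True)
        agg = (w.unsqueeze(-1) * feats).sum(dim=1)             # (Q, F)
        out = self.mlp(torch.cat([agg, query_pts], dim=-1))
        rgb, density = out[:, :3].sigmoid(), out[:, 3:].relu()
        return rgb, density

# Usage with dummy data: 1,024 query points, 5,023 vertices, 32-dim features
field = VertexAnchoredField()
rgb, sigma = field(torch.rand(1024, 3), torch.rand(5023, 3), torch.rand(5023, 32))
```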

In the identity branch of the workflow (lower left in the illustration above), NVIDIA’s StyleGAN2 framework is used to synthesize a facial feature map for the training identity. Similar to (some of) the authors’ prior MonoAvatar system, a U-Net is trained to produce a facial expression feature map.

The sum of these data flows – identity and expression – is then used to create the 3DMM-anchored neural radiance field.
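
A hedged sketch of how that combination might look in practice, following the paper’s statement that vertex features live in a unified UV space and are sampled via texture coordinates: the identity and expression feature maps are summed, and each vertex looks up its feature at its UV coordinate. The shapes and the use of grid_sample here are illustrative assumptions:

```python
import torch
import torch.nn.functional as F

def vertex_features_from_uv(identity_map, expression_map, uv_coords):
    """Sample per-vertex features from summed identity/expression UV feature maps.

    identity_map, expression_map: (1, C, H, W) feature maps in UV space
    uv_coords: (V, 2) texture coordinates in [0, 1]
    returns: (V, C) vertex-attached features
    """
    fused = identity_map + expression_map                      # (1, C, H, W)
    # grid_sample expects coordinates in [-1, 1] and a grid of shape (N, H_out, W_out, 2)
    grid = (uv_coords * 2.0 - 1.0).view(1, 1, -1, 2)
    sampled = F.grid_sample(fused, grid, align_corners=True)   # (1, C, 1, V)
    return sampled.squeeze(0).squeeze(1).T                     # (V, C)

# Dummy usage: 32-channel 256x256 UV maps, 5,023 vertices
feats = vertex_features_from_uv(torch.rand(1, 32, 256, 256),
                                torch.rand(1, 32, 256, 256),
                                torch.rand(5023, 2))
```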

The paper states*:

‘With plenty of variation from various expression and camera viewpoints in our data, we directly apply the 1 loss between the rendered images with the ground truth images without applying any latent space regularization loss.

‘We observe that with a subject count exceeding approximately 2,000, the learned latent space exhibits smoothness and enables natural interpolation.’

Though the final system can make effective use of sparse source data, the training of the prior model itself is burdensome, requiring six days across 16 (unspecified) GPUs, for one million steps. To avoid memorization and aid generalization, at each step the system randomly samples 16 subjects, four expressions per subject and eight views per expression, drawing 256 pixels from each of the resulting 512 images. This leads to an extraordinarily high effective batch size of 131,072 rays, which explains the need for splitting the workload across GPUs in this way.
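
A quick sanity check of that figure, under the assumption (mine, not spelled out in the paper) that 256 rays are drawn from each image in the batch:

```python
# Hedged arithmetic: assuming 256 rays are sampled per image, the per-step
# ray budget works out exactly to the batch size quoted above.
subjects_per_batch   = 16
expressions_per_subj = 4
views_per_expression = 8
rays_per_image       = 256   # assumption; not explicitly broken out in the text

images_per_batch = subjects_per_batch * expressions_per_subj * views_per_expression
total_rays       = images_per_batch * rays_per_image
print(images_per_batch, total_rays)   # 512 131072
```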

The final prior model is optimized as an auto-decoder (a decoder-only arrangement in which each training sample’s latent code is optimized directly, as a free parameter, rather than being predicted by an encoder), with each constituent identity represented by a latent code of 512 dimensions.

The prior system was therefore trained on a total of 2,407 subjects, each with 13 views of 13 different facial expressions. The model was optimized under Adam, at a learning rate of 0.0005.
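
The skeleton of such an auto-decoder is simple to sketch, though the decoder shown below is a stand-in rather than the paper’s architecture: 2,407 learnable 512-dimensional codes are optimized jointly with the network weights under Adam at the stated learning rate.

```python
import torch
import torch.nn as nn

NUM_SUBJECTS, LATENT_DIM = 2407, 512

# In an auto-decoder there is no encoder: each training identity owns a latent
# code that is treated as a free parameter and optimized alongside the network.
identity_codes = nn.Embedding(NUM_SUBJECTS, LATENT_DIM)
decoder = nn.Sequential(                 # stand-in for the real avatar decoder
    nn.Linear(LATENT_DIM, 256), nn.ReLU(),
    nn.Linear(256, 3),
)

optimizer = torch.optim.Adam(
    list(identity_codes.parameters()) + list(decoder.parameters()),
    lr=5e-4,                             # learning rate reported in the paper
)

# One toy step: look up the codes for a batch of subject indices, decode, and
# apply an L1 (photometric) loss against (dummy) ground-truth values.
subject_ids = torch.randint(0, NUM_SUBJECTS, (16,))
pred = decoder(identity_codes(subject_ids))
loss = nn.functional.l1_loss(pred, torch.rand(16, 3))
loss.backward()
optimizer.step()
```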

Data and Tests

For a test dataset, the researchers used a curated video collection from the MonoAvatar project, where the participants had rotated their heads while performing a variety of facial expressions, providing a gamut of pose/expression data that FFHQ and similar datasets cannot supply.

The background was then filtered out, and each obtained frame was fitted to a 3DMM, with estimated camera parameters (i.e., the system has some idea of the potential orientation and disposition of the virtual camera).

For metrics, the researchers used Learned Perceptual Image Patch Similarity (LPIPS), Peak Signal-to-Noise Ratio (PSNR), and Structural Similarity Index (SSIM).
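
All three metrics are available in common Python packages; the snippet below is a generic illustration of how such an evaluation is typically wired up, and the backbone choice and normalization are my assumptions rather than details reported in the paper:

```python
import numpy as np
import torch
import lpips                                   # pip install lpips
from skimage.metrics import peak_signal_noise_ratio, structural_similarity

def evaluate(pred: np.ndarray, gt: np.ndarray) -> dict:
    """pred, gt: (H, W, 3) float arrays in [0, 1]."""
    loss_fn = lpips.LPIPS(net='alex')          # AlexNet backbone is a common default
    # LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1]
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None].float() * 2 - 1
    return {
        'LPIPS': loss_fn(to_tensor(pred), to_tensor(gt)).item(),
        'PSNR':  peak_signal_noise_ratio(gt, pred, data_range=1.0),
        'SSIM':  structural_similarity(gt, pred, channel_axis=-1, data_range=1.0),
    }
```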

Five baselines were used for the experiments, three of which were ablation-style variants of the new system. The frameworks used were MonoAvatar, which shares a backbone with One2Avatar; Next3D, a conditional 3D-aware tri-plane-based model that also uses 3DMMs; a One2Avatar variant that uses a tri-plane representation, in the style of Next3D (denoted in the paper as ‘Ours-TP’); a One2Avatar variant trained on FFHQ instead of the bespoke dataset, using an off-the-shelf 3DMM-fitting framework (‘Ours-FFHQ’); and a variant of One2Avatar using only single-view data, with the source image picked randomly from the thirteen possible views (‘Ours-SV’).

The MonoAvatar models were trained from scratch, while Pivotal Tuning for Latent-based Editing of Real Images (PTI) was used for the Next3D model.

For each method, avatars were created for each subject using a varying number of images, ranging from 1% to 100% of the training data.

Quantitative comparison across the five methods (with One2Avatar implemented both natively and in – arguably – 'hobbled' versions, in an ablative approach).

Regarding these results, the researchers state:

‘Our proposed method consistently outperforms the state-of-the-art approaches, particularly at the low data regime (e.g., one image). Even as the amount of training data increases, our method maintains its superior performance […]

‘…Our method significantly outperforms the other methods especially at few-shot settings (e.g. number of views is smaller than 10). Both our method and MonoAvatar use more training images effectively with consistently better performance, but our method keeps a consistent performance gain over MonoAvatar even when using 100% of the data.

‘This indicates that our generative prior also provides a good initialization of the network weights for learning the avatar.’

Qualitative generations were also produced for the tests. Though we are unable to include the complete range of sample videos provided at the project site, some examples are shown below:

Click to play. Some of the qualitative comparisons available at the project site. For better resolution and a greater number of examples, please refer to https://zhixuany.github.io/one2avatar_webpage/

The paper states:

‘Our method produces avatar results with more accurate expression, more consistent identity, and less artifacts compared to Next3D and MonoAvatar.’

Conclusion

Though not its primary goal, the One2Avatar project demonstrates above all the extent to which casually web-scraped hyperscale face datasets may not be the ideal foundation for the needs of modern generative models.

Collections such as FFHQ and the CelebA variants are full of red-carpet event photos, since these were the easiest to obtain. This means, effectively, that the photographer making his or her selections in Lightroom, and the PR professional or picture editor further filtering those selections, are in charge of the available and dominant datasets that continue to define image synthesis.

However, as we have observed before, the cultural entropy under which the research scene labors, where arguably outdated reference datasets become ‘gold standards’, regardless of whether or not they are apposite for current needs, remains perhaps one of the greatest obstacles to significant progress in generative facial synthesis – together with outdated loss functions that likewise depend on these gargantuan and indiscriminately curated collections.

Many new research projects are hindered by the cost of curating custom datasets in the way that the One2Avatar researchers have done, and one can presume that Google’s involvement has made this feasible for the new project.

But if efforts such as these can demonstrate that the state-of-the-art can be advanced by capturing a modest number of highly targeted source images, with minimal duplication, and with each image earning its keep, there may be hope that a move towards such ‘in-house’ approaches can break the hold of current practices.

* My reference hyperlinks, not included in the original paper.
