Better Human Facial Synthesis With Gaussian Splatting and Parametric Heads

NPGA: Neural Parametric Gaussian Avatars - https://arxiv.org/pdf/2405.19331


New research from Germany and the United Kingdom, backed by the increasingly preeminent British synthetic media generation company Synthesia, has produced a new state of the art in the use of Gaussian Splatting to create moving simulations of humans – and even to use the motion of one person to drive another identity, with impressive accuracy and versatility:

Examples of the new NPGA system creating hyper-realistic head avatars based on motion obtained from a single camera. Please refer to the source project site for better resolution. Source: https://simongiebenhain.github.io/NPGA/

The system, titled Neural Parametric Gaussian Avatars (NPGA), makes use of a prior neural parametric head model (NPHM) created by four of the new paper’s authors, to exceed the accuracy of other approaches that use 3D Morphable Models (3DMMs) and FLAME-based CGI heads, obtaining superior detail.

Click to play. The neural head model is adapted to real-world captured video footage. Please refer to the source project site for better resolution.

In tests, this new method, which adds trained ancillary systems to the usually non-neural 3D Gaussian Splat (3DGS) rasterization process, was able to improve on the state of the art, when tested against former approaches.

Click to play. A comparison with rival prior methods, with the scope of recreating the ground truth original video (right-most). Please refer to the source project site for better resolution.

The system also augments the individual Gaussians during a later stage, leading to some revealingly rustic illustrations, indicating where and how these ‘Gaussian entities’ exist and operate within a 3DGS representation:

The underlying topology of the canonical Gaussian point cloud is made explicit in this illustration from the new paper. Source: https://arxiv.org/pdf/2405.19331

This work appears to improve on the quality of DiffusionAvatars, a previous project undertaken by several of the new paper’s authors, and consolidates the vanguard status of the Technical University of Munich, among other contributors to this strand of research, as well as the driving role of Synthesia, which is producing some of the most interesting and advanced output in facial synthesis, neural or otherwise.

Results published in the new work and its project site indicate that the approach is better able to render fine detail such as hair and skin texture, though at the cost of some extra architectural complexity: the use of the NeRSemble dataset (originally aimed at deployment in Neural Radiance Field-based projects), among other sources; the requirement to perform machine learning training; and the use of Multi-Layer Perceptrons (MLPs) to increase detail.

Nonetheless, the results seem comparable to, if not better than, the ASH system published last December, which we covered in some detail on release.

The new paper is titled NPGA: Neural Parametric Gaussian Avatars, and comes from five researchers across the Technical University of Munich, the UK’s University College London, and Synthesia.

Method

An NPGA head is composed of a canonical Gaussian point cloud and a dynamics module that deforms the individual Gaussians when an expression code is passed to it.
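
In very rough terms, this composition can be pictured as a set of learnable canonical Gaussian attributes plus a small network that predicts per-Gaussian displacements from an expression code. The sketch below is purely illustrative; the class, layer sizes and feature dimensions are assumptions, not details taken from the paper or its released code:

```python
import torch
import torch.nn as nn

class ToyGaussianAvatar(nn.Module):
    """Illustrative sketch only: canonical Gaussian attributes plus a dynamics
    module that displaces them per expression code. Names, layer sizes and
    feature dimensions are assumptions, not taken from the NPGA codebase."""

    def __init__(self, num_gaussians=30_000, expr_dim=100, feat_dim=32):
        super().__init__()
        # Canonical (neutral-pose) Gaussian attributes, learned during training.
        self.centers = nn.Parameter(torch.randn(num_gaussians, 3) * 0.01)
        self.log_scales = nn.Parameter(torch.zeros(num_gaussians, 3))
        self.rotations = nn.Parameter(torch.zeros(num_gaussians, 4))
        self.colors = nn.Parameter(torch.zeros(num_gaussians, 3))
        self.opacities = nn.Parameter(torch.zeros(num_gaussians, 1))
        # Per-Gaussian latent features that also condition the dynamics module.
        self.features = nn.Parameter(torch.zeros(num_gaussians, feat_dim))
        # Dynamics module: (canonical position, feature, expression) -> offset.
        self.dynamics = nn.Sequential(
            nn.Linear(3 + feat_dim + expr_dim, 128), nn.ReLU(),
            nn.Linear(128, 128), nn.ReLU(),
            nn.Linear(128, 3),
        )

    def deform(self, expression):
        """Return Gaussian centers displaced by a single expression code."""
        expr = expression.expand(self.centers.shape[0], -1)
        inputs = torch.cat([self.centers, self.features, expr], dim=-1)
        return self.centers + self.dynamics(inputs)

avatar = ToyGaussianAvatar()
deformed = avatar.deform(torch.randn(1, 100))   # one expression code
print(deformed.shape)                           # torch.Size([30000, 3])
```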

Why Do You Need a Canonical Head?

A canonical representation of any kind, in this line of research, as we have noted before, indicates that the base representation of the model is ‘neutrally posed’. In the case of full-body canonical models, this more or less puts the model into the base default pose of Da Vinci’s Vitruvian Man.

Da Vinci’s 'Vitruvian Man', resolved into an equally 'default' canonicalization in an avatar-based system called Vid2Avatar. Partial source: https://arxiv.org/abs/2302.11566

In the case of a hand-based model, every digit’s movement away from a ‘resting’ hand likewise represents a degree of warping, or deviation from the canonical ‘norm’, where the entire pose becomes the sum of the extent to which the components in the neutral hand have been deformed.

Per-bone coordinate mapping in LISA: this hand pose is significantly warped away from a neutrally-posed hand model, with each deviation representing mutable variables. When all the variables are zero, the hand is back at rest. Source: https://arxiv.org/pdf/2204.01695.pdf

In the same way, the utility of using a reference canonical head is that it provides the base geometry of a human head, with all the components roughly where they should be; and onto this ‘canvas’ can be pasted the deviations from canon that can be captured, for instance, by observing facial movement. Without a canonical head, every deformation of the face (i.e., every facial expression) would effectively be a new entity, without established baselines.
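
As a toy illustration of the 'canonical plus deviations' idea (all numbers and arrays below are hypothetical), a posed geometry can be expressed as a neutral base plus a weighted sum of deformation directions; with every weight at zero, the head sits back in its canonical state:

```python
import numpy as np

# 'Canonical plus deviations' in miniature (all values hypothetical): a neutral
# geometry, a bank of deformation directions, and a coefficient per direction.
# With every coefficient at zero, the geometry returns to its canonical rest state.
num_points, num_modes = 5_000, 20
canonical = np.random.rand(num_points, 3)                 # neutral-pose geometry
deform_modes = np.random.randn(num_modes, num_points, 3) * 0.01
coefficients = np.zeros(num_modes)                        # all zeros -> rest pose

posed = canonical + np.tensordot(coefficients, deform_modes, axes=1)
assert np.allclose(posed, canonical)                      # zero deviation = canonical head
```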

The RigNeRF system imposes captured facial motility and uses these 'deviations' as variants on a canonical 'neutral' CGI head (figured towards middle right in the above image). Source: https://ar5iv.labs.arxiv.org/html/2206.06481

Parametric Approach

MonoNPHM, the NPHM parametric head model variant used for the new system, generates a canonical representation of a head using Signed Distance Functions (SDFs) rather than pure mesh polygons, as with 3DMM and similar schemes. The authors state:

‘NPGA utilizes NPHM as it offers several beneficial characteristics: It models the face densely including eyes, hair, and teeth, it captures local details well and it disentangles shape from expressions.’

As with 3DMM and FLAME, NPHM tracking translates captured real-world video into recorded deviations from the canonical state of the NPHM head:

Tracking a human face, complete with conventionally challenging areas such as the inner mouth, with MonoNPHM. Source: https://simongiebenhain.github.io/MonoNPHM/
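
For readers unfamiliar with the representation, a signed distance function simply maps any 3D point to its distance from the surface (negative inside, positive outside), with the surface itself recovered as the zero level-set. MonoNPHM learns such a function as a neural network; in the minimal sketch below, a sphere stands in for the learned head SDF, purely for illustration:

```python
import torch

# A signed distance function maps a 3D point to its distance from the surface,
# negative inside and positive outside; the surface itself is the zero level-set.
# MonoNPHM learns such a function as a neural network; here a sphere stands in
# for the learned head SDF, purely for illustration.
def toy_head_sdf(points: torch.Tensor, radius: float = 0.1) -> torch.Tensor:
    return points.norm(dim=-1) - radius

query = torch.tensor([[0.0, 0.0, 0.05], [0.0, 0.0, 0.2]])
print(toy_head_sdf(query))   # tensor([-0.0500,  0.1000]): inside, then outside
```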

This phase is represented in the schema of the new system, in the lower left corner of the image below:

Workflow for NPGA.

At this stage the Gaussians are rendered using the original 3DGS architecture, before the application of a screen-space refinement pass that uses the CNN-based framework from the 2023 LatentAvatar project.

After this, the backward deformation field used in MonoNPHM has to be carefully reversed, since Gaussian Splatting is a forward-deforming method, and cannot retrospectively consider new variants on the canonical position. This transformation of incompatible variables is accomplished by cycle-consistency distillation (upper left in image above).
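
A hedged sketch of how such a cycle-consistency distillation might look in practice is given below: a frozen backward field (standing in for MonoNPHM's posed-to-canonical deformation) supervises a forward field so that mapping a canonical point forward and then backward returns the original point. The network shapes and training loop are illustrative assumptions, not the paper's implementation:

```python
import torch
import torch.nn as nn

# Frozen backward field (a stand-in for MonoNPHM's posed-to-canonical deformation)
# supervising a forward field so that backward(forward(x)) returns the input point.
def make_field(expr_dim=100):
    return nn.Sequential(nn.Linear(3 + expr_dim, 128), nn.ReLU(), nn.Linear(128, 3))

backward_field = make_field()                    # pretend this is the tracked model
forward_field = make_field()                     # the field being distilled
for p in backward_field.parameters():
    p.requires_grad_(False)

optimizer = torch.optim.Adam(forward_field.parameters(), lr=1e-4)
for step in range(100):
    x = torch.rand(1024, 3) * 2 - 1                                 # canonical-space samples
    z = torch.randn(1024, 100)                                      # expression codes
    posed = x + forward_field(torch.cat([x, z], dim=-1))            # canonical -> posed
    cycled = posed + backward_field(torch.cat([posed, z], dim=-1))  # posed -> canonical
    loss = (cycled - x).pow(2).mean()                               # cycle-consistency objective
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```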

A dynamics module, composed of two MLPs (a ‘coarse’ and a ‘fine’ pass), is used to transpose facial expressions from the source footage.
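
The coarse-plus-fine arrangement might be sketched along the following lines, with the caveat that the layer sizes, feature dimensions and residual weighting are invented for illustration:

```python
import torch
import torch.nn as nn

class TwoStageDynamics(nn.Module):
    """Coarse MLP predicts a broad displacement from position and expression;
    a fine MLP adds a smaller residual conditioned on per-Gaussian features."""

    def __init__(self, expr_dim=100, feat_dim=32, hidden=128):
        super().__init__()
        self.coarse = nn.Sequential(
            nn.Linear(3 + expr_dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))
        self.fine = nn.Sequential(
            nn.Linear(3 + expr_dim + feat_dim, hidden), nn.ReLU(), nn.Linear(hidden, 3))

    def forward(self, centers, features, expression):
        expr = expression.expand(centers.shape[0], -1)
        coarse_offset = self.coarse(torch.cat([centers, expr], dim=-1))
        fine_input = torch.cat([centers + coarse_offset, expr, features], dim=-1)
        return centers + coarse_offset + 0.1 * self.fine(fine_input)  # small fine residual

dynamics = TwoStageDynamics()
out = dynamics(torch.randn(1_000, 3), torch.zeros(1_000, 32), torch.randn(1, 100))
print(out.shape)   # torch.Size([1000, 3])
```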

The system is then trained using the tracked sequences of 20 subjects from the NeRSemble dataset, with each subject receiving a dedicated triplane.

With the forward deformation field obtained through the cycle-consistency loss, the output is jointly optimized together with the canonical parameters of the base head. The paper states:

‘To initialize the canonical Gaussians [centers] we sample 30.000 points uniformly on the iso-surface of the tracked MonoNPHM model. The per Gaussian features are initialized by querying [TriPlane] at the sampled Gaussian centers. All remaining attributes are initialized using the default 3DGS procedure.

‘In practice, we observed that keeping F frozen results in sub-optimal performance, which is likely caused through topological issues during distillation in the mouth region.’

The canonical representation is regularized with Laplacian smoothing.
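
One common way to implement such a Laplacian smoothing term, sketched below under the assumption of a simple k-nearest-neighbour graph (the neighbour count and weighting are not taken from the paper), is to pull each Gaussian centre toward the mean of its neighbours:

```python
import torch

# Pull each Gaussian centre toward the mean of its k nearest neighbours,
# discouraging ragged canonical geometry.
def laplacian_smoothing_loss(centers: torch.Tensor, k: int = 8) -> torch.Tensor:
    dists = torch.cdist(centers, centers)                        # pairwise distances
    knn_idx = dists.topk(k + 1, largest=False).indices[:, 1:]    # drop the self-match
    neighbour_mean = centers[knn_idx].mean(dim=1)
    return (centers - neighbour_mean).pow(2).sum(dim=-1).mean()

centers = torch.randn(2_048, 3, requires_grad=True)
loss = laplacian_smoothing_loss(centers)
loss.backward()   # gradients nudge the point cloud toward local smoothness
```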

Gaussian Splatting features a facility called Adaptive Density Control (ADC), which assigns and divides splats according to the areas of greatest need, defined by a heuristic analytical process. The authors state:

‘The rules of ADC have been designed with static scenes in mind and we find the default settings to be suboptimal for our avatar creation. In the dynamic scenario, there can be areas that remain hidden for large parts of the training sequence, such as the mouth interior.

‘Therefore, we adjust the ADC by employing a generalized mean to aggregate the view-space [gradients] of the [iterative] primitive of all [frames] between invocations of the ADC mechanism.’
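
A generalized (or power) mean interpolates between an ordinary average and a maximum as its exponent grows, which is presumably what lets briefly-visible regions such as the mouth interior still trigger densification. The snippet below illustrates the aggregation in isolation; the exponent and tensor shapes are arbitrary examples, not the paper's settings:

```python
import torch

# With p=1 this is the ordinary mean; larger p weights frames that produced
# large view-space gradients (e.g. brief glimpses of the mouth interior) more
# heavily when deciding which Gaussians to densify.
def generalized_mean(values: torch.Tensor, p: float) -> torch.Tensor:
    return values.clamp(min=1e-12).pow(p).mean(dim=0).pow(1.0 / p)

per_frame_grads = torch.rand(300, 30_000)    # frames x Gaussians, hypothetical sizes
print(generalized_mean(per_frame_grads, p=1.0).mean())   # plain average
print(generalized_mean(per_frame_grads, p=4.0).mean())   # biased toward peak values
```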

Data and Tests

The tests conducted for the project make use of the NeRSemble dataset. Six subjects were selected, each performing a variety of facial expressions.

The entire dataset contains 16 synchronized videos. The authors chose 15 cameras for training, and one frontal camera for evaluation.

Rival baselines evaluated were GaussianAvatars, a more lightweight project, but one limited to the expressiveness available in a base FLAME model; GaussianHeadAvatar (GHA), a Chinese paper from March of this year; and Mixture of Volumetric Primitives (MVP), a Facebook framework from 2021.

The authors note that GaussianHeadAvatar learns deformation from multi-view video through custom tracking, and comment:

‘While GHA also uses per Gaussian features, we allow these features to influence the predicted movement and all other attributes, while GHA restricts them to influence only dynamic appearance changes. Another difference to our work is that we use [Adaptive Density Control], while GHA assumes a fixed set of Gaussians, and GHA performs super-resolution.

‘For computational reasons we remove one upsampling layer, resulting in a training resolution of 1024×1024 for GHA.’

Metrics used were Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM), and Learned Perceptual Image Patch Similarity (LPIPS).

The tests focused on the facial region, with the authors using segmentation masks to isolate faces, not least since the underlying canonical head model makes no provision for the satellite areas around the head, such as the torso. Segmentation was provided by the Facer network.

Metrics were computed at a resolution of 550x802px, with the native GHA 1024x1024px output consequently down-scaled to this resolution.
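
As a minimal illustration of how masked evaluation of this kind typically proceeds (SSIM and LPIPS are usually computed with existing packages such as scikit-image and lpips, so only PSNR is shown, and the mask here is a stand-in for the Facer segmentation):

```python
import torch

# Masked PSNR over the face region; the mask is a stand-in for the Facer
# segmentation, and SSIM / LPIPS would normally come from existing packages.
def masked_psnr(pred: torch.Tensor, target: torch.Tensor, mask: torch.Tensor) -> torch.Tensor:
    diff = (pred - target) * mask                               # ignore non-face pixels
    mse = diff.pow(2).sum() / (mask.sum() * pred.shape[0] + 1e-8)
    return -10.0 * torch.log10(mse + 1e-12)

pred = torch.rand(3, 802, 550)                  # rendered frame, values in [0, 1]
target = torch.rand(3, 802, 550)                # ground-truth frame
face_mask = (torch.rand(1, 802, 550) > 0.5).float()
print(masked_psnr(pred, target, face_mask))
```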

Hyper-parameters used for the two deformation networks were learning rates of 4e-5 and 2e-3, respectively. The network parameters for the first of these were frozen for the first 5,000 optimization steps.

Both learning rates were decayed by a factor of 2 across 800,000 optimization steps, with an additional weight decay of 0.1 applied to each for regularization.
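
One plausible reading of that schedule is sketched below; the optimizer type, the exact decay curve, and the stand-in networks are assumptions rather than details confirmed by the paper:

```python
import torch
import torch.nn as nn

# Stand-in networks for the two deformation MLPs (shapes are arbitrary here).
coarse_net, fine_net = nn.Linear(103, 3), nn.Linear(135, 3)

# Two parameter groups at 4e-5 and 2e-3, with a weight decay of 0.1 on each.
optimizer = torch.optim.AdamW(
    [{"params": coarse_net.parameters(), "lr": 4e-5},
     {"params": fine_net.parameters(), "lr": 2e-3}],
    weight_decay=0.1,
)

# Per-step multiplier such that both rates halve across 800,000 steps.
gamma = 0.5 ** (1.0 / 800_000)
scheduler = torch.optim.lr_scheduler.ExponentialLR(optimizer, gamma=gamma)

for step in range(10):                 # placeholder loop; real training runs 800k steps
    optimizer.zero_grad()
    # ... compute losses and call backward() here; the first network's parameters
    # would additionally be kept frozen for the first 5,000 steps ...
    optimizer.step()
    scheduler.step()
```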

Though the authors avow that inference optimization was not a priority during development, they also state that the unoptimized implementation of the system can render at 31fps at 550x802px resolution on an NVIDIA RTX 3080 GPU, and at 18fps at 1100x1604px resolution, including all operations, such as deformation, rendering, and the running of the CNN.

With the CNN excluded (which affects detail resolution to some extent), the system achieves, respectively, 43fps and 38fps. The authors note that native GHA runs at 22fps at 1024x1024px on the same hardware.

All avatars and baselines were trained to convergence, and the paper notes that this takes 7 hours on an RTX 2080 (8GB of VRAM), 30 hours on an RTX 3090 (24GB of VRAM), and 60 hours on an RTX 2080 for MVP.

It’s worth noting that all of these cards fall well within the ‘hobbyist’ category of ‘home AI’, in contrast to the general run of recent papers, which seem to default to an array of 8x A100 GPUs (often with the higher 80GB VRAM allocation).

For the data, the authors obtained MonoNPHM tracking on the NeRSemble dataset, using a geometric constraint between the NPHM predicted surface and a point cloud reconstructed using COLMAP.
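
A hedged sketch of what such a geometric constraint can look like is given below: points reconstructed by COLMAP are pushed toward the zero level-set of the predicted SDF. The toy sphere SDF is a placeholder for the MonoNPHM network, and the loss form is an illustrative assumption:

```python
import torch

# Points reconstructed by COLMAP should lie on the predicted head surface,
# i.e. the SDF evaluated at those points should be close to zero. A sphere
# SDF stands in for the MonoNPHM network here.
def toy_sdf(points: torch.Tensor) -> torch.Tensor:
    return points.norm(dim=-1) - 0.1

def surface_constraint_loss(colmap_points: torch.Tensor) -> torch.Tensor:
    return toy_sdf(colmap_points).abs().mean()   # penalize distance from the surface

print(surface_constraint_loss(torch.rand(10_000, 3)))
```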

For qualitative results, the avatars were fine-tuned on 1100x1604px resolution for an additional five hours of training time, with the torso areas masked out.

Initially, tests for self-reenactment were conducted, with the avatars animated by the single held-out sequence from the test split, so that the trained system was operating on unseen data.

Quantitative metric evaluation against former frameworks.

The authors state:

‘Our predicted self-reenactments portray the unseen expression more accurately and contain sharper details in relatively static areas like the hair region. Interestingly, GHA_NPHM performs slightly worse than GHA, indicating that MonoNPHM expression codes alone do not immediately boost performance.

‘Instead, we hypothesize that without NPHM’s motion prior as initialization, NPHM’s latent expression distribution might provide a more complicated training signal compared to the linear blendshapes of BFM.’

(To really appreciate the quality of results obtained for these tests, we recommend viewing the original videos at full resolution at the project site and in the YouTube Video, which contains many more high-resolution examples from the testing phase than we can represent here. The accompanying YouTube video is embedded at the end of this article)

Click to play. Partial results from the self-reenactment trials. Please refer to YouTube video (embedded at end of article) for better resolution. Source: https://www.youtube.com/watch?v=NGRxAYbIkus

The next tests were for cross-subject reenactment, where movement extracted from a driving video was used to power the trained avatars. For this, there is no possible ground truth, and the results are necessarily qualitative. The reenactments were taken from monocular RGB video from a ‘commodity camera’, using MonoNPHM. Though the paper provides static results, video is more informative here:

Click to play. Partial results from the cross-reenactment trials. Please refer to YouTube video (embedded at end of article) for better resolution. Source: https://www.youtube.com/watch?v=NGRxAYbIkus

Here the authors comment:

‘We observe that all methods successfully disentangle identity and expression information, allowing for effective cross-reenactment. Our avatars, however, preserve the most details from the driving expressions.’

Conclusion

Though the results are very impressive, and do indeed overshadow prior efforts, the amount of ancillary support needed to achieve this could, arguably, be added to almost any moderately effective synthesis system in order to achieve state-of-the-art results.

At the same time, the authors have at least challenged themselves to perform training on consumer-level GPUs, rather than the more unobtainable A100 arrays that are becoming common to these strands of research.

The chief value of the work is in advancing the possibilities of 3D Gaussian Splatting as a human synthesis technology, and bringing it to the attention of the tech scene (as has already occurred, since the accompanying videos are an easy sell); but we must hope for far more svelte architectures and less arduous methodologies in future expansions on the aims of this project.

One of the main attractions of 3DGS, for the VFX scene, is that it is a non-neural, rasterization-based architecture to which practitioners might add their own neural components, such as neural pushing, to affect areas of the 3DGS render (such as eye-direction); but if the base framework itself is to be as complex as this, augmenting it further with additional neural facets could make such a prospect unfeasibly burdensome.

The authors have explicitly stated that lightweight systems and optimized inference were not their aims; but if 3DGS is not to get caught in the quagmire of bolt-on systems that dragged down NeRF and GAN research in the years after the debuts of those technologies, a little more time needs to be expended on what more native projects around Gaussian Splatting can accomplish in order to improve output.
