Controllable Deepfakes With Gaussian Avatars

A new collaboration between the Technical University of Munich and Toyota offers perhaps the most convincing realization to date of a long-cherished goal in deepfake generation and general facial synthesis: complete control over neural facial representations, including expressions and identity.

With GaussianAvatars, photorealistic Gaussian splats are rigged to the triangles of a CGI mesh. The texture and content are drawn directly from real life, but the movement is completely under the control of the operator. Source: https://www.youtube.com/watch?v=lVEY78RwU_I

This facility alone could prove of inestimable value to visual effects pipelines, since the underlying CGI model offers a huge range of potential parameters for the target subject’s facial expression and disposition, in contrast to similar interfaces of recent years, which have offered far less explicit control over such characteristics trained into the latent space of AI models.

Additionally, since CGI polygon mesh parameters are consistent, like-for-like identity retargeting is also well within the scope of the new approach, allowing source actors to puppet ‘guest’ identities:

Deepfake puppetry enabled by GaussianAvatars, with gender and wildly differing facial topography accounted for in the translation.

This level of precision is made possible by the nature of Gaussian Splatting, a German/French technique that emerged in August of 2023 and, following a code release, quickly captured the imagination of the image synthesis research sector, which now outputs a continuous stream of innovations and developments of the approach.

Compared to Neural Radiance Fields (NeRF), Gaussian Splatting has a number of innate optimizations. Though it draws material from static photos, like NeRF (or any other type of photogrammetry), it throws away points at non-useful levels of transparency, and can, where necessary, split its particulate components into two or more sub-components, instead of laboriously tracing and calculating rays from the source imagery.

Being in itself a rasterization method, which contains an innate rather than estimated knowledge of 3D space, Gaussian Splatting is unusually compatible with the kind of traditional methodologies and tools that VFX artists have been using for the last 20-30 years to describe volume.

CGI meshes are networks of mathematically-defined geometry. They are usually composed of triangular polygons, similar to a fisherman’s net, and can be ‘warped’ from their canonical or default position in order to simulate movement (such as walking or breathing in and out). Alternatively, parametric skeletons are in themselves formed of primitive mathematical operators (such as a Boolean operator, which may cut a hole out of a CGI doughnut, for instance, by placing ‘negative’ space within a rounded disk), and these can also be used as the basis for a model.
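
As a loose, self-contained illustration of this idea (not code from any of the projects discussed here, and using invented example values), a mesh can be held as plain arrays of vertex positions and triangular faces, with movement expressed as weighted displacements away from the canonical vertex positions:

```python
import numpy as np

# A minimal triangle mesh: canonical (rest-pose) vertex positions and triangular faces.
canonical_vertices = np.array([
    [0.0, 0.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 1.0, 0.0],
    [1.0, 1.0, 0.1],
])                                        # (V, 3) vertex positions
faces = np.array([[0, 1, 2], [1, 3, 2]])  # (F, 3) vertex indices per triangle

# A hypothetical 'expression' blendshape: per-vertex offsets from the canonical pose.
smile_offsets = np.array([
    [0.0, 0.00, 0.00],
    [0.0, 0.05, 0.02],
    [0.0, 0.00, 0.00],
    [0.0, 0.05, 0.02],
])

def warp(vertices: np.ndarray, offsets: np.ndarray, weight: float) -> np.ndarray:
    """Warp the mesh away from its canonical position by a weighted offset."""
    return vertices + weight * offsets

posed_vertices = warp(canonical_vertices, smile_offsets, weight=0.7)
print(posed_vertices.shape)  # (4, 3): same topology, new vertex positions
```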

By now the poster-beast for CGI, a bitmapped texture is conformed to the CGI mesh of a dinosaur. The wrapped-around texture is only permitted to move a limited distance from the vertices to which it has been mapped, in accordance with the limits of skin-musculature binding in nature. Adapted from material at https://3dexport.com/3dmodel-jurassic-park-t-rex-206789.htm

As in the example image above, ordinary pictures (and sometimes moving pictures, since videos can be used as textures) are mapped to the base skeleton so that it has a substantial appearance of reality.

In the same way that pixels create images in 2D space, and voxels can create volumetric presence in 3D space, Gaussian Splats are essentially neural equivalents of these kinds of small units, capable, in great numbers, of comprising complete images.

Since this is the case, and since they have a freedom of movement lacking in NeRF-style representations, as well as a native ability for facile placement in X/Y/Z space, Gaussian Splats can be attached directly to the triangles of a CGI mesh, in much the same way that a part of a bitmapped texture can be attached to a parametric dinosaur skeleton; and this is how the new method, titled GaussianAvatars, becomes capable of projecting real-life content, such as skin and facial sections, onto specific and equivalent areas of a CGI face.
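
A rough sketch of how such a binding might work in practice (an illustrative approximation under assumed conventions, not the authors’ implementation): each triangle of the mesh defines a local coordinate frame from its center, one edge direction and its surface normal, and a splat’s position is stored relative to that frame, so that the splat travels with the triangle whenever the mesh deforms:

```python
import numpy as np

def triangle_frame(v0, v1, v2):
    """Build a local frame for a triangle: origin at its center, axes from one
    edge direction, the face normal, and their cross product."""
    center = (v0 + v1 + v2) / 3.0
    x_axis = (v1 - v0) / np.linalg.norm(v1 - v0)
    normal = np.cross(v1 - v0, v2 - v0)
    z_axis = normal / np.linalg.norm(normal)
    y_axis = np.cross(z_axis, x_axis)
    rotation = np.stack([x_axis, y_axis, z_axis], axis=1)  # (3, 3), local -> world
    return center, rotation

def bind_splat(splat_world_pos, v0, v1, v2):
    """Express a splat's world-space position in its parent triangle's local frame."""
    center, rotation = triangle_frame(v0, v1, v2)
    return rotation.T @ (splat_world_pos - center)

def splat_world_position(local_pos, v0, v1, v2):
    """Recover the splat's world-space position for the current (possibly deformed) pose."""
    center, rotation = triangle_frame(v0, v1, v2)
    return rotation @ local_pos + center
```

When the face mesh is posed into a new expression, re-deriving the frame from the triangle’s updated vertices and re-applying the stored local position carries the splat along with the moving surface.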

Adapted from the new paper: precise placement of Gaussian content onto a related mesh triangle creates a correlation between a 'dead' CGI mesh and living, real image content. Source: https://arxiv.org/pdf/2312.02069.pdf

The crucial advantage of using CGI in this way is that a truly neural representation can be obtained directly from human movement, obviating the need for VFX artists to attempt to simulate such movements, while at the same time giving them parameter-driven controls that allow for tweaks and interventions.

One central problem with Gaussian Splatting until now, besides its rather onerous computing and storage needs, has been that it does not natively support temporal movement (a hindrance that initially also beset NeRF, though this was later addressed in numerous initiatives).

The new system, however (as we can see in the video samples above), has effectively overcome these limitations, and may offer a conceptual approach for a system that’s controllable and reliable enough to be considered for mission-critical VFX workflows – either in its current form, or in work that may evolve from it over time.

Method

The CGI framework used in GaussianAvatars is the FLAME model, which originated in 2017 as an industry/academic collaboration, inevitably with significant participation from the Max Planck Institute for Intelligent Systems, which has held a vanguard position in CGI/neural facial and body reproduction methodologies for over twenty years.

The FLAME model interprets real-life video into approximated equivalent poses in a parametric head. Many other types of CGI/neural models are used in human image synthesis, some of which (such as SMPL) can simulate entire bodies. Source: https://ps.is.mpg.de/uploads_file/attachment/attachment/400/paper.pdf

Input for the GaussianAvatars system originates with multi-view video recordings of real people performing diverse facial and head poses. A photometric head tracker based on the 2020 Face2Face project is used to fit the recorded movement to the FLAME model.

The Face2Face project converts a monocular stream of images to 3D. Source: https://arxiv.org/pdf/2007.14808.pdf

Since the tracker outputs meshes that share the FLAME model’s topology, it’s possible to create a consistent mesh to which the Gaussian Splats can conform. The Splats are rendered into images by the differentiable tile rasterizer from the original Gaussian Splatting project:

The differentiable tile rasterizer from the original paper for Gaussian Splatting. A Structure-from-Motion (SfM) point cloud is initialized into a set of 3D Gaussians, which are then optimized, with adaptive control of density, before being rendered by the differentiable tile rasterizer. Source: https://arxiv.org/pdf/2308.04079.pdf

As with traditional textures, it’s important that the Splats stay within a pre-defined distance from their assigned triangles, and to this end the authors have developed a binding inheritance strategy to tether the Gaussians to the FLAME topology. They explain:

‘Only having the same number of Gaussian splats as the triangles is insufficient to capture details. For instance, representing a curved hair strand requires multiple splats, while a triangle on the scalp may intersect with several strands.

‘Therefore, we also need the adaptive density control [strategy], which adds and removes splats based on the view-space positional gradient and the opacity of each Gaussian.

‘For each 3D Gaussian with a large view-space positional gradient, we split it into two smaller ones if it is large or clone it if it is small. We conduct this in the local space and ensure a newly created Gaussian is close to the old one that triggers this densification operation. Then, it is reasonable to bind a new 3D Gaussian to the same triangle as the old one because it was created to enhance the fidelity of the local region.’
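
A highly simplified sketch of that densification-with-inheritance logic follows; the thresholds, tensor layout and function name are illustrative assumptions rather than the authors’ code, and the shrinking of the original splat after a split is omitted for brevity:

```python
import torch

def densify_with_binding(positions, scales, parent_triangles, grads,
                         grad_threshold=2e-4, size_threshold=0.01):
    """Split large high-gradient splats and clone small ones, with every new
    splat inheriting the triangle binding of the splat that spawned it."""
    high_grad = grads.norm(dim=-1) > grad_threshold
    large = scales.max(dim=-1).values > size_threshold

    split_mask = high_grad & large    # big splats: spawn a smaller copy nearby
    clone_mask = high_grad & ~large   # small splats: duplicate as-is

    split_pos = positions[split_mask] + 0.5 * scales[split_mask]  # new splat close to the old one
    split_scale = scales[split_mask] / 1.6                        # reduced scale for the child

    clone_pos = positions[clone_mask]
    clone_scale = scales[clone_mask]

    new_positions = torch.cat([positions, split_pos, clone_pos])
    new_scales = torch.cat([scales, split_scale, clone_scale])
    # Binding inheritance: children keep the parent's triangle index.
    new_parents = torch.cat([parent_triangles,
                             parent_triangles[split_mask],
                             parent_triangles[clone_mask]])
    return new_positions, new_scales, new_parents
```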

Conceptual overview for the workflow for GaussianAvatars.

As seen in the workflow visualized above, Gaussian Splats are transformed from local to global space (a familiar concept to traditional CGI practitioners and enthusiasts) before cohesive rendering with Gaussian Splatting. All the splats maintain their rigged status throughout the procedure, and their position and scaling are regularized to minimize potential distracting artifacts during rendering.
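
The paper’s exact regularization terms are not reproduced here, but a hedged sketch of this kind of constraint – penalizing splats that drift too far from, or grow too large relative to, their parent triangles – might look like the following, with the thresholds and the use of triangle-local units being assumptions for illustration:

```python
import torch

def rigging_regularizers(local_positions, local_scales,
                         pos_threshold=1.0, scale_threshold=0.6):
    """Penalize splats that drift too far from, or grow too large relative to,
    their parent triangles (positions and scales in triangle-local units)."""
    # Position term: only penalize the portion beyond the allowed distance.
    pos_excess = torch.clamp(local_positions.norm(dim=-1) - pos_threshold, min=0.0)
    position_loss = (pos_excess ** 2).mean()

    # Scale term: only penalize splats larger than the allowed local scale.
    scale_excess = torch.clamp(local_scales - scale_threshold, min=0.0)
    scale_loss = (scale_excess ** 2).mean()

    return position_loss, scale_loss

# Example: 1,000 splats with random local positions and scales.
pos = torch.randn(1000, 3)
scales = torch.rand(1000, 3)
position_loss, scale_loss = rigging_regularizers(pos, scales)
```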

The density control strategy derived from the original work ensures that an optimal number of splats is created for the purpose, while also ensuring that each triangle in the FLAME mesh has at least one splat assigned.

For training, a single optimization pass is used across all subjects, with the Adam optimizer. The learning rate is set to a broad 5e-3 value (typically the highest and least precise value in a training round, and one which will generally garner the widest and highest-level features from the source data).

In addition to training the Gaussian Splats themselves, the parameters of the FLAME model are fine-tuned, so that the translation, expression and joint (articulation) parameters are optimized for each time-step. For these processes, a fine learning rate of 1e-6 is used for translation, 1e-5 for rotation, and 1e-3 for expression.

Training continues for a formidable 600,000 iterations, with the learning rate for the positioning of the splats decaying exponentially down to 0.01 of its initial value.
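
A hedged sketch of how such a training setup might be expressed in PyTorch, with per-group learning rates and an exponential decay schedule for splat positions, follows; the tensor shapes and variable names are placeholders rather than the authors’ own:

```python
import torch

# Placeholder tensors standing in for the trainable quantities described above.
splat_positions = torch.zeros(100_000, 3, requires_grad=True)
flame_translation = torch.zeros(200, 3, requires_grad=True)    # one per time-step
flame_rotation = torch.zeros(200, 3, requires_grad=True)
flame_expression = torch.zeros(200, 100, requires_grad=True)

optimizer = torch.optim.Adam([
    {"params": [splat_positions],   "lr": 5e-3},   # broad rate for splat positions
    {"params": [flame_translation], "lr": 1e-6},   # fine rate for translation
    {"params": [flame_rotation],    "lr": 1e-5},   # fine rate for rotation
    {"params": [flame_expression],  "lr": 1e-3},   # expression parameters
])

TOTAL_ITERS = 600_000

def position_lr(step: int, base_lr: float = 5e-3, final_factor: float = 0.01) -> float:
    """Exponentially decay the splat-position learning rate to 1% of its starting value."""
    return base_lr * (final_factor ** (step / TOTAL_ITERS))

# Inside the training loop, the first parameter group would be updated each step:
#   optimizer.param_groups[0]["lr"] = position_lr(step)
```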

Data and Tests

To test the system, the researchers used the recordings from the Dynamic Neural Radiance Fields using Hash Ensembles (NeRSemble) dataset.

The recordings each contain 16 frontal and profile views of a subject, from which 11 video sequences were chosen for the tests, each with around 200 time-steps. The images were downsampled to 802 x 550px. In the videos, participants were asked to recreate target expressions in ten of the sequences, and to perform at will in the final sequence. A tenth of the material was held back as a validation split.

The three tasks evaluated for the creation of head avatars were novel view synthesis, where the output is rendered from camera viewpoints absent from the source material; self-reenactment, where an avatar is driven by novel poses and expressions that were not trained into the model; and cross-identity reenactment – effectively deepfake puppeteering, in which unseen poses and expressions from one identity are transposed to another.

For baseline comparison, GaussianAvatars was compared to three prior analogous systems: Instant Volumetric Head Avatars (INSTA), which warps points nearest to any given triangle in a FLAME mesh, and makes use of NVIDIA’s InstantNGP (I-NGP) NeRF-based framework; PointAvatar, which does not use FLAME, but employs a points-based representation quite similar in concept to Gaussian Splats, and which uses deformation fields to create movement from a canonical default; and AvatarMAV, which makes use of voxel grids, and again uses deformations from a canonical stance, this time based on a 3D morphable model (3DMM).

Initially, qualitative comparisons were run for novel view synthesis in a self-reenactment scenario.

Results from qualitative tests for novel-view synthesis in a self-reenactment scenario. Please refer to source for further examples and better resolution.

Selected examples from static images in the paper. Please refer to source for further examples and better resolution.

The researchers assert that PointAvatar’s results demonstrate dotted artifacts. They also state:

‘Our method outperforms state-of-the-art methods by producing significantly sharper rendering outputs. We obtain precise reconstruction of details such as reflective light on eyes, hair strands, teeth, etc. Our results for self-reenactment show more accurate expressions compared to baselines.’

The authors indicate that some of the misalignments evident in the INSTA results are due to incorrect FLAME tracking, which their own system can compensate for by splitting Gaussian Splats as necessary during optimization.

A quantitative measurement of this test, using the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS) metrics, also verifies the general superiority of the new method:

Quantitative measurement of the initial test round.
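
For reference, all three metrics can be computed with standard libraries; a minimal sketch using torchmetrics (assuming rendered and ground-truth frames as tensors in the [0, 1] range, and noting that the LPIPS backbone weights are downloaded on first use) might look like this:

```python
import torch
from torchmetrics.image import PeakSignalNoiseRatio, StructuralSimilarityIndexMeasure
from torchmetrics.image.lpip import LearnedPerceptualImagePatchSimilarity

# Rendered outputs and ground-truth frames as (N, 3, H, W) tensors in [0, 1];
# random data here simply stands in for real renders and captures.
rendered = torch.rand(4, 3, 256, 256)
ground_truth = torch.rand(4, 3, 256, 256)

psnr = PeakSignalNoiseRatio(data_range=1.0)
ssim = StructuralSimilarityIndexMeasure(data_range=1.0)
lpips = LearnedPerceptualImagePatchSimilarity(net_type="alex", normalize=True)

print("PSNR:", psnr(rendered, ground_truth).item())
print("SSIM:", ssim(rendered, ground_truth).item())
print("LPIPS:", lpips(rendered, ground_truth).item())
```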

The paper states:

‘Our approach outperforms others by a large margin regarding metrics for novel-view synthesis. Our method also stands out in self-reenactment, with significantly lower perceptual differences in terms of [LPIPS].

‘Note that self-reenactment is based on tracked FLAME meshes that may not perfectly align with the target images, thus bringing disadvantages to our results with more visual details regarding pixel-wise metrics such as PSNR.’

The next test was for cross-identity reenactment (though here the examples featured in the supplementary video differ from those that appear in the paper itself):

Cross-identity comparison between GaussianAvatars and prior methods.

Selected examples for cross-identity reenactment from static images in the paper. Please refer to source for further examples and better resolution.

Of these results, the authors comment:

‘Our avatars accurately reproduce eye blinks and mouth movements from source actors showing lively, complex dynamics such as wrinkles. [INSTA] suffers from aliasing artifacts when the avatars move beyond the occupancy grid of [I-NGP] optimized for training sequences.

‘The movement of results from [PointAvatar] is not precise because its deformation space is not guaranteed to be consistent with FLAME. [AvatarMAV] exhibits large degradations in reenactment due to a lack of deformation priors.’

The researchers concede that GaussianAvatars is currently limited by an inability to change lighting from that which appears in the source video (though a number of unrelated initiatives suggest that this is a solvable challenge), and that the method, like many others, lacks a way to recreate and integrate head-hair (and, again, other work, such as neural strands, offers potential solutions in this regard).

Conclusion

The results demonstrated by GaussianAvatars represent, arguably, not only the most impressive innovation upon the original Gaussian Splatting research, but one of the most promising methodologies for a truly instrumentalized and reproducible neural face control workflow.

The VFX industry is currently tantalized by the immense possibilities of neural technologies, goaded on by growing demand and keen investors, and eager to secure leads in key segments of visual effects pipelines that have been dominated by the constraints of the CGI approach for decades.

But, at times, it seems a risky, almost alchemic pursuit, due to the intractability of latent codes, and the great effort needed to corral the power of trained models into highly-targeted and prescribed workflows, with little scope for spurious experimentation and ‘teachable moments’, in the context of tight production deadlines.

If just one method of the kind showcased here can prove itself a durable and resilient tool in neural VFX, to the point where it becomes as ‘routine’ a recourse as an established Nuke plugin, it would likely have a buoyant effect on the professional AI VFX research scene in general.

Reducing the intractability of large latent codes down to precise and obedient units such as Gaussian Splats could be the key to such a potentially profitable increase in confidence in AI’s potential for the effects industry.
