A new collaboration between the Technical University of Munich and Toyota offers perhaps the most convincing example to date of a long-cherished goal in deepfake generation and general facial synthesis: complete control over neural facial representations, including expressions and identity.
With GaussianAvatars, photorealistic Gaussian splat nodes are attached to each vertex of a CGI mesh. The texture and content are drawn directly from real life, but the movement is completely under the control of the operator. Source: https://www.youtube.com/watch?v=lVEY78RwU_I
This facility alone could prove of inestimable value to visual effects pipelines, since the underlying CGI model offers a huge range of potential parameters for the target subject’s facial expression and disposition, in contrast to similar interfaces of recent years, which have offered far less explicit control over such characteristics trained into the latent space of AI models.
Additionally, since CGI polygon mesh parameters are consistent, like-for-like identity retargeting is also well within the scope of the new approach, allowing source actors to puppet ‘guest’ identities:
Deepfake puppetry enabled by GaussianAvatars, with gender and wildly differing facial topography accounted for in the translation.
This level of precision is made possible by the nature of Gaussian Splatting, a German/French technique which emerged in August of 2023, and which (following a code release) has quickly captured the imagination of the image synthesis research sector, which now outputs a continuous stream of innovations and developments of the approach.
Compared to Neural Radiance Fields (NeRF), Gaussian Splatting has a number of innate optimizations. Though it draws material from static photos, like NeRF (or any other type of photogrammetry), it discards pixels at non-useful levels of transparency, and can, where necessary, split its particulate components into two or more sub-components, instead of laboriously tracing and calculating rays from the source image.
Being in itself a rasterization method, which contains an innate rather than estimated knowledge of 3D space, Gaussian Splatting is unusually compatible with the kind of traditional methodologies and tools that VFX artists have been using for the last 20-30 years to describe volume.
CGI meshes are networks of mathematically-defined geometry. They are usually composed of triangular faces, similar to a fisherman’s net, and can be ‘warped’ from their canonical or default position in order to simulate movement (such as walking, or breathing in and out). Alternatively, parametric skeletons are themselves formed from primitive mathematical operators (such as a Boolean operator, which may cut a hole out of a CGI doughnut, for instance, by placing ‘negative’ space within a rounded disk), and these can also be used as the basis for a model.
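The idea of warping a mesh away from its canonical position can be sketched with a toy example. Here a mesh is just a list of vertices plus faces that index into it; the names and structure below are illustrative, and not FLAME's actual parameterization:

```python
# A toy triangular mesh: vertices in a canonical (rest) position, plus
# faces that index into the vertex list. One triangle here stands in for
# one cell of the 'fisherman's net'.
canonical_vertices = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
faces = [(0, 1, 2)]

def warp(vertices, offsets):
    """Deform the mesh by adding per-vertex offsets to the canonical pose."""
    return [(x + dx, y + dy, z + dz)
            for (x, y, z), (dx, dy, dz) in zip(vertices, offsets)]

# 'Breathing': push every vertex slightly outward along z from the rest pose.
posed = warp(canonical_vertices, [(0.0, 0.0, 0.05)] * 3)
```

A real system drives such offsets from learned pose and expression parameters rather than hand-written values, but the principle — canonical geometry plus a deformation — is the same.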
As in the example image above, ordinary pictures (and sometimes moving pictures, since videos can be used as textures) are mapped to the base skeleton so that it has a substantial appearance of reality.
In the same way that pixels create images in 2D space, and voxels can create volumetric presence in 3D space, Gaussian Splats are essentially neural equivalents of these kinds of small units, capable, in great numbers, of comprising complete images.
Since this is the case, and since they have a freedom of movement lacking in NeRF-style representations, as well as a native ability for facile placement in X/Y/Z space, Gaussian Splats can be attached directly to the vertices of a CGI mesh, in much the same way a part of a bitmapped texture can be attached to a parametric dinosaur skeleton; and this is how the new method, titled GaussianAvatars, becomes capable of projecting real-life content, such as skin and facial sections, into specific and equivalent areas of a CGI face.
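The attachment idea can be illustrated in a few lines, under assumptions: each splat records the index of the mesh vertex it is bound to, plus a fixed offset, so that when the mesh deforms, the splat follows. The field names here are hypothetical, not GaussianAvatars' own:

```python
# A splat 'rides' its parent vertex: wherever the vertex goes when the
# mesh is posed, the splat's world position goes with it.
def splat_position(vertices, splat):
    vx, vy, vz = vertices[splat["vertex"]]
    ox, oy, oz = splat["offset"]
    return (vx + ox, vy + oy, vz + oz)

splat = {"vertex": 1, "offset": (0.0, 0.0, 0.1)}

rest_pose = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)]
smile_pose = [(0.0, 0.0, 0.0), (1.2, 0.1, 0.0), (0.0, 1.0, 0.0)]

p_rest = splat_position(rest_pose, splat)    # splat follows vertex 1 at rest
p_smile = splat_position(smile_pose, splat)  # ...and again when posed
```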
The crucial advantage of using CGI in this way is that a truly neural representation can be obtained directly from human movement, obviating the need for VFX artists to attempt to simulate such movements, while at the same time giving them parameter-driven controls that allow for tweaks and interventions.
One central problem with Gaussian Splatting until now, besides its rather onerous computing and storage needs, is that it does not natively support temporal movement (a hindrance that initially also beset NeRF, though this was later addressed in numerous initiatives).
The new system, however (as we can see in the video samples above) has more effectively overcome these limitations, and may offer a conceptual approach for a system that’s controllable and reliable enough to be considered for mission-critical VFX workflows – either in its current form, or in work that may evolve from it over time.
The CGI framework used in GaussianAvatars is the FLAME model, which originated in 2017 as an industry/academic collaboration, inevitably with significant participation from the Max Planck Institute for Intelligent Systems, which has held a vanguard position in CGI/neural facial and body reproduction methodologies for over twenty years.
Input for the GaussianAvatars system originates with multi-view video recordings of real people performing diverse facial and head poses. A photometric head tracker based on the 2020 Face2Face project is used to fit the recorded movement to the FLAME model.
Since the FLAME model shares the same topology as this system, it’s possible to create a consistent mesh to which the Gaussian Splats can conform. The Splats are rendered into images by the differentiable tile rasterizer from the source project:
As with traditional textures, it’s important that the Splats stay within a pre-defined distance from their assigned vertices, and to this end the authors have developed a binding inheritance strategy to tether the Gaussians to the FLAME topology. They explain:
‘Only having the same numbers of Gaussian splats as the triangles is insufficient to capture details. For instance, representing a curved hair strand requires multiple splats, while a triangle on the scalp may intersect with several strands.
‘Therefore, we also need the adaptive density control [strategy], which adds and removes splats based on the view-space positional gradient and the opacity of each Gaussian.
‘For each 3D Gaussian with a large view-space positional gradient, we split it into two smaller ones if it is large or clone it if it is small. We conduct this in the local space and ensure a newly created Gaussian is close to the old one that triggers this densification operation. Then, it is reasonable to bind a new 3D Gaussian to the same triangle as the old one because it was created to enhance the fidelity of the local region.’
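The split/clone logic the authors describe can be sketched as a toy routine. The thresholds and dictionary fields below are illustrative placeholders, not the paper's actual values:

```python
# Toy adaptive density control: splats with a large view-space positional
# gradient are split (if large) or cloned (if small). New splats inherit
# the parent's triangle binding ('tri') -- the 'binding inheritance' idea.
GRAD_THRESHOLD = 0.0002   # illustrative, not the paper's setting
SIZE_THRESHOLD = 0.01     # illustrative, not the paper's setting

def densify(splats):
    out = []
    for s in splats:
        if s["grad"] <= GRAD_THRESHOLD:
            out.append(s)                   # well-fitted splat: leave alone
        elif s["scale"] > SIZE_THRESHOLD:
            half = dict(s, scale=s["scale"] / 2)
            out.extend([half, dict(half)])  # large: split into two smaller
        else:
            out.extend([s, dict(s)])        # small: clone in place
    return out

splats = [{"tri": 0, "grad": 0.001, "scale": 0.02},   # split
          {"tri": 1, "grad": 0.001, "scale": 0.005},  # clone
          {"tri": 2, "grad": 0.0001, "scale": 0.02}]  # keep
result = densify(splats)  # 2 + 2 + 1 = 5 splats
```

In the real method this runs during optimization, with pruning of low-opacity splats as well; the sketch only shows the branching logic quoted above.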
As seen in the workflow visualized above, Gaussian Splats are transformed from local to global space (a familiar concept to traditional CGI practitioners and enthusiasts) before cohesive rendering with Gaussian Splatting. All the splats maintain their rigged status throughout the procedure, and their position and scaling is regularized to minimize potential distracting artifacts during rendering.
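The local-to-global step can be made concrete with a small sketch, under assumptions: a splat's position is stored in the local frame of its parent triangle (origin at the triangle's center, axes built from an edge direction and the normal), and is mapped into world space before rendering. This mirrors the concept, not the paper's exact code:

```python
# Build a triangle's local frame and map a local point into world space.
def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def cross(a, b):
    return (a[1]*b[2] - a[2]*b[1],
            a[2]*b[0] - a[0]*b[2],
            a[0]*b[1] - a[1]*b[0])

def norm(v):
    n = sum(x * x for x in v) ** 0.5
    return tuple(x / n for x in v)

def triangle_frame(v0, v1, v2):
    """Return (origin, x_axis, y_axis, z_axis) of a triangle's local frame."""
    origin = tuple((a + b + c) / 3 for a, b, c in zip(v0, v1, v2))
    x_axis = norm(sub(v1, v0))                       # along one edge
    z_axis = norm(cross(sub(v1, v0), sub(v2, v0)))   # triangle normal
    y_axis = cross(z_axis, x_axis)                   # completes the basis
    return origin, x_axis, y_axis, z_axis

def local_to_global(local, v0, v1, v2):
    o, x, y, z = triangle_frame(v0, v1, v2)
    return tuple(o[i] + local[0]*x[i] + local[1]*y[i] + local[2]*z[i]
                 for i in range(3))

# A splat hovering 0.1 units above its parent triangle's center:
p = local_to_global((0.0, 0.0, 0.1), (0, 0, 0), (2, 0, 0), (0, 2, 0))
```

Because the frame is rebuilt from the posed triangle each frame, a splat expressed in local coordinates automatically rotates and translates with the mesh.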
The density control strategy derived from the original work ensures that the optimal number of splats are created for the purpose, while also ensuring that each triangle in the FLAME mesh has at least one splat assigned.
For training, a single optimization pass is used across all subjects, employing the Adam optimizer. The learning rate is set to a broad 5e-3 value (typically the highest and least precise value in a training round, and one which will generally garner the widest and highest-level features from the source data).
In addition to training the Gaussian Splats themselves, the parameters of the FLAME model are fine-tuned, so that the translation, expression and joint (articulation) parameters are optimized for each time-step. For these processes, a fine learning rate of 1e-6 is used for translation, 1e-5 for rotation, and 1e-3 for expression.
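The per-parameter learning rates above are naturally expressed as optimizer parameter groups (PyTorch-style). The group names below are placeholders; only the rates themselves come from the description above:

```python
# Schematic Adam parameter groups mirroring the stated learning rates.
# In a real setup, each "params" entry would hold actual tensors.
param_groups = [
    {"name": "gaussian_splats",   "lr": 5e-3},  # broad, high-level features
    {"name": "flame_translation", "lr": 1e-6},  # fine-tuned per time-step
    {"name": "flame_rotation",    "lr": 1e-5},
    {"name": "flame_expression",  "lr": 1e-3},
]

lrs = {g["name"]: g["lr"] for g in param_groups}
```

Keeping the FLAME parameters on much finer rates than the splats reflects their role: they are nudged per time-step to stay aligned with the footage, rather than learned from scratch.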
Data and Tests
The recordings each contain 16 frontal and profile views of a subject, from which 11 video sequences were chosen for the tests, each with around 200 time-steps. The images were downsampled to 802 x 550px. In the videos, participants were asked to recreate target expressions in ten of the sequences, and to perform at will in the final session. A tenth of the material was held back as a validation split.
The three criteria for the creation of head avatars were novel view synthesis, where the output features a notably different head pose to the source material; self-reenactment, where an avatar is driven by novel poses that were not trained into the model; and cross-identity reenactment, where unseen poses and expressions from one identity are transposed onto another – effectively deepfake puppetry.
For baseline comparison, GaussianAvatars was compared to three former analogous systems: Instant Volumetric Head Avatars (INSTA), which warps points nearest to any given triangle in a FLAME mesh, and makes use of NVIDIA’s InstantNGP (I-NGP) NeRF-based framework; PointAvatar, which does not use FLAME, but employs a points-based representation system quite similar in concept to Gaussian Splats, and which uses deformation fields to create movement from a canonical default; and AvatarMAV, which makes use of voxel grids, and which again uses deformations from a canonical stance, this time driven by a 3D morphable model (3DMM).
Initially, qualitative comparisons were run for novel view synthesis in a self-to-self identity reenactment.
Results from qualitative tests for novel-view synthesis in a self-reenactment scenario. Please refer to source for further examples and better resolution.
The researchers assert that PointAvatar’s results demonstrate dotted artifacts. They also state:
‘Our method outperforms state-of-the-art methods by producing significantly sharper rendering outputs. We obtain precise reconstruction of details such as reflective light on eyes, hair strands, teeth, etc. Our results for self-reenactment show more accurate expressions compared to baselines.’
The authors indicate that some of the misalignments evident in the INSTA results are due to incorrect FLAME tracking, which their system can obviate by splitting Gaussian Splats as necessary during optimization.
A quantitative measurement of this test, using the Peak Signal-to-Noise Ratio (PSNR), Structural Similarity Index (SSIM) and Learned Perceptual Image Patch Similarity (LPIPS) metrics, also verifies the general superiority of the new method:
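Of the three metrics, PSNR is the simplest to state: a log-scaled ratio of the maximum possible signal to the mean squared error between images. SSIM and LPIPS require dedicated libraries, but a minimal PSNR sketch (for flattened images with pixel values in [0, 1]) looks like this:

```python
import math

def psnr(pred, target, max_val=1.0):
    """Peak Signal-to-Noise Ratio in decibels; higher is better."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0:
        return float("inf")  # identical images
    return 10 * math.log10(max_val ** 2 / mse)

# Four pixels, one of which differs by 0.1:
score = psnr([0.5, 0.5, 0.5, 0.5], [0.5, 0.5, 0.5, 0.4])
```

Because PSNR is purely pixel-wise, it penalizes sharp outputs that are slightly misaligned with the target — which is precisely the caveat the authors raise about their self-reenactment scores below.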
The paper states:
‘Our approach outperforms others by a large margin regarding metrics for novel-view synthesis. Our method also stands out in self-reenactment, with significantly lower perceptual differences in terms of [LPIPS].
‘Note that self-reenactment is based on tracked FLAME meshes that may not perfectly align with the target images, thus bringing disadvantages to our results with more visual details regarding pixel-wise metrics such as PSNR.’
The next test was for cross-identity reenactment (though here the same examples do not feature in the supplementary video as appear in the paper itself):
Click to play. Cross-identity comparison between GaussianAvatars and prior methods.
Of these results, the authors comment:
‘Our avatars accurately reproduce eye blinks and mouth movements from source actors showing lively, complex dynamics such as wrinkles. [INSTA] suffers from aliasing artifacts when the avatars move beyond the occupancy grid of [I-NGP] optimized for training sequences.
‘The movement of results from [PointAvatar] is not precise because its deformation space is not guaranteed to be consistent with FLAME. [AvatarMAV] exhibits large degradations in reenactment due to a lack of deformation priors.’
The researchers concede that GaussianAvatars is currently limited by an inability to change lighting from that which appears in the source video (though a number of unrelated initiatives suggest that this is a solvable challenge), and that the method, like many others, lacks a way to recreate and integrate head-hair (and, again, other work, such as neural strands, offers potential solutions in this regard).
The results demonstrated by GaussianAvatars represent, arguably, not only the most impressive innovation on the original Gaussian Splat research, but one of the most promising methodologies for a truly instrumentalized and reproducible neural face control workflow.
The VFX industry is currently tantalized by the immense possibilities of neural technologies, goaded on by growing demand and keen investors, and eager to secure leads in key segments of visual effects pipelines that have been dominated by the constraints of the CGI approach for decades.
But, at times, it seems a risky, almost alchemic pursuit, due to the intractability of latent codes, and the great effort needed to corral the power of trained models into highly-targeted and prescribed workflows, with little scope for spurious experimentation and ‘teachable moments’, in the context of tight production deadlines.
If just one method of the kind showcased here can prove itself a durable and resilient tool in neural VFX, to the point where it becomes as ‘routine’ a recourse as an established Nuke plugin, it would likely have a buoyant effect on the professional AI VFX research scene in general.
Reducing the intractability of large latent codes down to precise and obedient units such as Gaussian Splats could be the key to such a potentially profitable increase in confidence in AI’s potential for the effects industry.