Since the advent of Gaussian Splats, in August of this year, the image synthesis research community has clearly embraced this innovative approach to neural recreation of people, things and scenery.
Currently, the daily submissions list at the Computer Vision section of Arxiv, and at other platforms, features a growing frequency of splat-related papers, as the interval lengthens between initial publication and follow-on projects.
One noteworthy development this month is a general increase in the number of research projects that attempt neural simulation of entire human bodies – a sub-strand of synthesis research that, to date, has been dominated by Neural Radiance Fields (NeRF), despite the relative rigidity of that older technology.
As with NeRF, Gaussian Splatting is, by default, only capable of creating explorable static neural scenes; and, as with NeRF, the sector is rapidly developing ways around this limitation. Last week we took a look at the first noteworthy splat-based facial deepfake system; and this week, among the slew of potential new academic projects capable of supporting full-body deepfakes, a new offering led by the Max Planck Institute for Informatics is proposing an economical and ingenious method of generating Gaussian humans that can be controlled in real time.
From the project page of the new paper, examples of novel viewpoint rendering of humans captured from multi-view video, and interpreted via parametric meshes and Gaussian Splats into virtual humans that can not only recreate the motion depicted in the source videos, but can adapt to novel motion input by the end user. Source: https://vcai.mpi-inf.mpg.de/projects/ash/
The authors of the new work have devised a way to animate the Gaussian figures by interpreting the Splats in 2D space, through the use of a parametric human template (a common bridging method between CGI and neural workflows). Since each representative Gaussian Splat scene (i.e., each potential ‘frame’) is a complete model in itself, the alternative would be equivalent to animating the Mona Lisa by painting 24 separate masterpieces per second.
The method, called ASH, is capable of real-time translation and rendering, and, the new paper reports, achieves notably superior results in tests against similar approaches, including the only other analogous real-time approach currently in existence.
The authors have, additionally, devised a user interface that allows one to impose skeletal poses and motion onto captured human data and have the Splatted representations perform novel movement that was not in the original capture data, as well as allowing the user to thoroughly explore the recreation – in a GUI that begins to resemble the CGI Poser/Daz applications that have facilitated the creation of moving CGI humans for over twenty years.
Click to play. The ASH player, currently in a rudimentary state, runs in a browser and allows the viewer to control Splatted humans even with movements that did not feature in the original footage from which their appearance was compiled – though, naturally, there are limitations to how effectively truly extraordinary novel movements could be rendered (for instance, if the person did not remove their outer jacket in the source footage, such a movement could not convincingly be represented in this instance).
Combined with the clearly growing capacity for Gaussian Splats to become editable, projects of this nature seem set to bring the flexibility of CGI to a neural representation framework in a way that NeRF has struggled to do over the last few years.
The paper states:
‘[Our] animatable human avatar is parameterized using Gaussian splats. However, naively learning a mapping from skeletal pose to Gaussian parameters in 3D is intractable given the limited compute budget when constraining ourselves to real-time performance.
‘Thus, we propose to attach the Gaussians onto a deformable mesh template of the human. Given the mesh’s [UV] parameterization, it allows learning the Gaussian parameters efficiently in 2D texture space. Here, each texel covered by a triangle represents a Gaussian.
‘Thus, the number of Gaussians remains constant, which is in stark contrast to the original formulation.’
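The texel idea in the quote above can be illustrated with a minimal sketch. It is not the authors' code: the function names, the single-triangle setup and the tiny texture resolution are all assumptions made for illustration. Each texel whose center falls inside a UV triangle becomes one Gaussian, and its 3D center is found by interpolating the triangle's 3D vertices with the texel's barycentric weights.

```python
from typing import List, Sequence, Tuple

Vec2 = Tuple[float, float]
Vec3 = Tuple[float, float, float]


def barycentric(p: Vec2, a: Vec2, b: Vec2, c: Vec2) -> Tuple[float, float, float]:
    """Barycentric weights of 2D point p with respect to triangle (a, b, c)."""
    det = (b[1] - c[1]) * (a[0] - c[0]) + (c[0] - b[0]) * (a[1] - c[1])
    w0 = ((b[1] - c[1]) * (p[0] - c[0]) + (c[0] - b[0]) * (p[1] - c[1])) / det
    w1 = ((c[1] - a[1]) * (p[0] - c[0]) + (a[0] - c[0]) * (p[1] - c[1])) / det
    return w0, w1, 1.0 - w0 - w1


def texel_gaussian_centers(
    uv_tri: Sequence[Vec2], xyz_tri: Sequence[Vec3], resolution: int
) -> List[Vec3]:
    """One 3D Gaussian center per texel covered by the UV triangle (hypothetical helper)."""
    centers: List[Vec3] = []
    for row in range(resolution):
        for col in range(resolution):
            # Texel center in the unit UV square.
            p = ((col + 0.5) / resolution, (row + 0.5) / resolution)
            w = barycentric(p, uv_tri[0], uv_tri[1], uv_tri[2])
            if min(w) < 0.0:
                continue  # this texel is not covered by the triangle
            # Interpolate the triangle's 3D vertices with the same weights.
            centers.append(tuple(
                sum(w[k] * xyz_tri[k][axis] for k in range(3)) for axis in range(3)
            ))
    return centers
```

Because coverage is decided in UV space, the number of Gaussians depends only on the texture layout and not on how the mesh is currently deformed, which is the 'constant count' property the quote describes.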
Click to play. Free viewpoint rendering in ASH.
The new paper is titled ASH: Animatable Gaussian Splats for Efficient and Photoreal Human Rendering, and comes from five researchers across the Max Planck Institute for Informatics, ETH Zurich, Freiburg University, and the Saarbrucken Research Center for Visual Computing, Interaction and AI.
Before we take a deeper look at this interesting new project, let’s consider the difference between Gaussian Splatting and prior neural techniques, so that an examination of the researchers’ techniques will make more sense.
A Gaussian Splat is in some ways analogous to a pixel, which is a single minimal unit of color in an image, and which has no transparency (or alpha) value in ‘flat’ formats such as JPG, but may have transparency in formats which support it, such as PNG. Since pictures are two-dimensional, a pixel only exists in X/Y space (i.e., horizontal and vertical).
In 3D space, in traditional (though more recent) CGI methodologies, a voxel has been, for some time, the 3D equivalent of a pixel. A voxel is a mathematical or parametric object or entity positioned in full XYZ (3D) space. It has color and (optional) alpha transparency, and, unlike a bitmap-textured CGI mesh, it is calculated as volume, rather than simply wrapping image textures around virtual wire meshes (where the inside of the resultant object is conceptually ‘hollow’).
In Neural Radiance Fields, or NeRF, the unit in question is calculated by tracing the path from a camera’s point-of-view down to the last opaque point where the ‘ray’ can travel no further (i.e., there is no more empty space between the camera and the object, and the surface of the object is now 100% opaque).
Click to play. The NeRF capture process is similar to CGI ray-tracing, building up an interpretive neural network composed of pixel values with 3D (instead of just 2D) coordinates, and with transparency (alpha) channels, so that glass and empty or ‘cut-out’ sections of geometry can be correctly interpreted. Source: https://www.youtube.com/watch?v=JuH79E8rdKc
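The ray-marching described above can be sketched with the standard volume-rendering accumulation used by NeRF-style methods. This is a simplified illustration, not code from any particular NeRF implementation: the function name, the fixed step size and the early-stop threshold are assumptions. Each sample along the ray contributes color in proportion to its opacity and to how much light still gets through from the samples in front of it.

```python
import math
from typing import List, Sequence, Tuple


def composite_ray(
    densities: Sequence[float],
    colors: Sequence[Tuple[float, float, float]],
    step: float,
) -> Tuple[List[float], float]:
    """Accumulate color front-to-back along one ray (hypothetical helper)."""
    transmittance = 1.0  # fraction of light that has not yet been blocked
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb in zip(densities, colors):
        alpha = 1.0 - math.exp(-sigma * step)  # opacity of this sample
        weight = transmittance * alpha
        for channel in range(3):
            pixel[channel] += weight * rgb[channel]
        transmittance *= 1.0 - alpha
        if transmittance < 1e-4:
            break  # the ray has hit something effectively opaque
    return pixel, transmittance
```

A sample in empty space (density zero) contributes nothing, while a very dense sample absorbs the ray, so anything behind it is ignored.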
A NeRF calculates such values from multiple pictures of the same object, scene or person, so that all these interpretations can be collated into a neural representation:
Click to play. Multiple photos combine to provide an explorable neural environment in NeRF. Source: https://www.youtube.com/watch?v=DJ2hcC1orc4
However, these ‘virtual pixels’ in NeRF, though they can be calculated in XYZ 3D space, and though they can have whatever transparency the scene requires, besides requiring onerous compute resources and storage, are all bound to the ray-tracing action of the multiple viewpoints from the data capture.
A Gaussian Splat, instead, is a neural* representation unit that is not limited in this way – not only can it be assigned anywhere in XYZ/3D space, but it can as necessary multiply and subdivide into additional splats, as coverage requires.
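As a rough illustration of what such a unit contains, the sketch below models a single splat as a center, per-axis scales, a rotation, an opacity and a color, and evaluates how strongly it contributes at a given 3D point. The class and function names are hypothetical, and real implementations store and optimize these parameters in batched tensors rather than one object at a time.

```python
import math
from dataclasses import dataclass
from typing import List, Tuple


@dataclass
class Splat:
    mean: Tuple[float, float, float]      # center, free to sit anywhere in XYZ space
    scale: Tuple[float, float, float]     # standard deviation along each local axis
    rotation: Tuple[float, float, float, float]  # unit quaternion (w, x, y, z)
    opacity: float
    color: Tuple[float, float, float]


def rotation_matrix(q: Tuple[float, float, float, float]) -> List[List[float]]:
    """3x3 rotation matrix of a unit quaternion (w, x, y, z)."""
    w, x, y, z = q
    return [
        [1 - 2 * (y * y + z * z), 2 * (x * y - w * z), 2 * (x * z + w * y)],
        [2 * (x * y + w * z), 1 - 2 * (x * x + z * z), 2 * (y * z - w * x)],
        [2 * (x * z - w * y), 2 * (y * z + w * x), 1 - 2 * (x * x + y * y)],
    ]


def contribution(splat: Splat, point: Tuple[float, float, float]) -> float:
    """Opacity-weighted Gaussian falloff of the splat at a 3D point."""
    rot = rotation_matrix(splat.rotation)
    offset = [point[i] - splat.mean[i] for i in range(3)]
    # Express the offset in the splat's local frame (multiply by R transposed).
    local = [sum(rot[r][c] * offset[r] for r in range(3)) for c in range(3)]
    mahalanobis_sq = sum((local[c] / splat.scale[c]) ** 2 for c in range(3))
    return splat.opacity * math.exp(-0.5 * mahalanobis_sq)
```

Scale and rotation together define an ellipsoid, which is why a single splat can cover a thin, stretched region that would otherwise need many small samples.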
Since the latent space of trained human AI representations can be hard to control, a growing number of human synthesis projects are using specially-designed CGI interfaces as a bridge between trained data and user control. These are essentially old-school CGI humans in a canonical (or basic, default) pose that can be rigged to equivalent perceived captured data points (such as faces or full bodies). As mentioned earlier, this method has already been used to create a Splat-based deepfake process.
Among other reasons, DDC (Deep Dynamic Characters, the deformable human template on which ASH builds) was chosen over other popular parametric models because it is unusually capable of reproducing large and flowing movement of clothes, such as the motion of a billowing dress as a woman turns dynamically.
The initial data for ASH is gathered in the form of video of individual people in movement, taken from multiple cameras, from which bitmapped data is processed and skeletal pose is inferred. The authors state:
‘Our goal is to generate motion-controllable, photorealistic renderings of humans learned solely from multi-view RGB [videos]. Specifically, ASH takes the skeletal motions and a virtual camera view as input at inference, and produces high-fidelity renderings in real-time (∼30 fps). To this end, we propose to model the dynamic character with 3D Gaussian splats, parametrized as texels in the texture space of a deformable template mesh.
‘This texel-based parameterization of 3D Gaussian splats enables us to model the mapping from skeletal motions to the Gaussian splat parameters as a 2D image-2-image translation task.’
So what is happening in ASH is that the high-compute dynamics of the interpreted Splats are being transliterated into a more flexible and lightweight environment, courtesy of the DDC human mesh, and it is this adroit implementation that allows for real-time operation.
The paper states:
‘ASH generates high-fidelity rendering given a skeletal motion and a virtual camera view. A motion-dependent, canonicalized template mesh is generated with a learned deformation network. From the canonical template mesh, we can render the motion-aware textures, which are further adopted for predicting the Gaussian splat parameters with two 2D convolutional networks, i.e., the Geometry and Appearance Decoder, as the texels in the 2D texture space.
‘Through UV mapping and DQ skinning, we warp the Gaussian splats from the canonical space to the posed space. Then, splatting is adopted to render the posed Gaussian splats.’
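The final warping step can be pictured with a small sketch. Note the substitution: the paper uses dual quaternion (DQ) skinning, whereas the example below uses plain linear blend skinning, which is simpler to write down and shares the same basic idea of moving each canonical point by a weighted mix of bone transforms. All names are hypothetical, and only Gaussian centers are moved here; a full system would also need to rotate each Gaussian's orientation.

```python
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]
Mat3 = Sequence[Sequence[float]]
Bone = Tuple[Mat3, Vec3]  # (rotation matrix, translation) for one bone


def apply_bone(bone: Bone, point: Vec3) -> Vec3:
    """Rigidly transform a point by a single bone."""
    rotation, translation = bone
    return tuple(
        sum(rotation[r][c] * point[c] for c in range(3)) + translation[r]
        for r in range(3)
    )


def skin_centers(
    centers: Sequence[Vec3],
    bones: Sequence[Bone],
    weights: Sequence[Sequence[float]],
) -> List[Vec3]:
    """Warp canonical Gaussian centers to posed space by blending bone transforms."""
    posed: List[Vec3] = []
    for center, w in zip(centers, weights):
        moved = [apply_bone(bone, center) for bone in bones]
        posed.append(tuple(
            sum(w[j] * moved[j][axis] for j in range(len(bones)))
            for axis in range(3)
        ))
    return posed
```

Linear blending is known to shrink volume around strongly twisted joints, which is one common reason for preferring dual quaternion skinning in practice.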
Click to play. Skeletal motion can be imposed upon the learned Splat representation, and the scene interactively explored.
Data and Tests
Due to the challenging nature of the proposition, training was divided into two phases: a ‘warm-up’ phase that uses data from the DynaCap dataset (from the DDC project that is enabling the CGI human instrumentality), together with footage specially captured by the researchers.
At this stage, frames are sampled evenly across the source data, and initial 3D Gaussian Splat parameters are learned, which serve as faux ground truth for the subsequent stages.
In the second phase, the motion-aware decoder (see architectural flow image above) is trained further on the entire sequence of data, with L1 and Structural Similarity Index (SSIM) terms serving as the loss functions that measure how far the rendered output departs from the source frames.
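A combined L1 and SSIM objective of this kind can be sketched as follows. This is an illustration under stated assumptions rather than the paper's training code: the SSIM here is computed once over the whole image instead of over local windows (as standard SSIM implementations do), the images are flat lists of grayscale values in [0, 1], and the 0.2 weighting is a placeholder.

```python
from typing import Sequence


def l1_loss(pred: Sequence[float], target: Sequence[float]) -> float:
    """Mean absolute difference between two images."""
    return sum(abs(p - t) for p, t in zip(pred, target)) / len(pred)


def global_ssim(
    pred: Sequence[float], target: Sequence[float], dynamic_range: float = 1.0
) -> float:
    """Simplified SSIM from whole-image statistics (no sliding window)."""
    n = len(pred)
    mu_p = sum(pred) / n
    mu_t = sum(target) / n
    var_p = sum((p - mu_p) ** 2 for p in pred) / n
    var_t = sum((t - mu_t) ** 2 for t in target) / n
    cov = sum((p - mu_p) * (t - mu_t) for p, t in zip(pred, target)) / n
    c1 = (0.01 * dynamic_range) ** 2  # stabilizing constants from the SSIM definition
    c2 = (0.03 * dynamic_range) ** 2
    return ((2 * mu_p * mu_t + c1) * (2 * cov + c2)) / (
        (mu_p ** 2 + mu_t ** 2 + c1) * (var_p + var_t + c2)
    )


def photometric_loss(
    pred: Sequence[float], target: Sequence[float], ssim_weight: float = 0.2
) -> float:
    """Weighted sum of L1 and (1 - SSIM); the weight is an assumed placeholder."""
    return (1.0 - ssim_weight) * l1_loss(pred, target) + ssim_weight * (
        1.0 - global_ssim(pred, target)
    )
```

The L1 term penalizes per-pixel color error, while the SSIM term rewards matching local contrast and structure, so the two tend to be used together.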
In accordance with the prior DDC methodology, the researchers held back four camera views from the obtained data, in order to use these to assess the final model’s resilience to novel data.
Regarding the original material created by the authors, two sequences were recorded showing subjects performing various dynamic actions such as jogging, dancing, and jumping, to obtain a total of 27,000 frames of training data and 7,000 frames of post-training (split) test data.
The tests covered both novel view synthesis (video below) and novel pose synthesis, where the system is required to make the neural actors perform motions that were not included in the source captures. The authors pitted ASH against four systems (not all of which are represented in video or image results, though all are accounted for elsewhere): DDC itself, the only tested method that is capable, as ASH is, of real-time operation; Template-free Animatable Volumetric Actors (TAVA), a hybrid method that manipulates implicit trained fields in canonical space (i.e., it warps a ‘default’ pose); Neural Sparse Voxel Fields (‘Neural actor’ – NA), which – unusually – explores the use of traditional inverse kinematics for a human representation instrumentalized by a parametric human mesh; and HDHumans, which likewise models neural humans based on the feature map obtained from a deformable mesh template.
Metrics used were Peak Signal-to-Noise Ratio (PSNR) and Learned Perceptual Image Patch Similarity (LPIPS), assessed at 1k resolution, and averaged across every tenth frame. Results were differentiated between subjects wearing ‘clinging’ clothes (denoted as Tight Outfits, which are less challenging to reproduce), and those with more billowy apparel (denoted as Loose Outfits, where accurate reproduction is more problematic for systems of this nature).
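For readers unfamiliar with the first of these metrics, the sketch below computes PSNR for one frame and averages it over every tenth frame of a sequence. The function names and the flat-list image format are assumptions made for illustration; LPIPS is not shown, since it depends on a pretrained network.

```python
import math
from typing import Sequence


def psnr(pred: Sequence[float], target: Sequence[float], max_value: float = 1.0) -> float:
    """Peak Signal-to-Noise Ratio in decibels; higher means closer to the target."""
    mse = sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)
    if mse == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def mean_psnr(
    pred_frames: Sequence[Sequence[float]],
    target_frames: Sequence[Sequence[float]],
    stride: int = 10,
) -> float:
    """Average PSNR over every `stride`-th frame of a sequence."""
    pairs = list(zip(pred_frames, target_frames))[::stride]
    return sum(psnr(p, t) for p, t in pairs) / len(pairs)
```

Because PSNR is logarithmic, a gain of a few decibels corresponds to a substantial drop in mean squared error.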
Click to play. Tests against rival programs for novel view synthesis, where the performed motion comes from training data, but the viewpoint does not.
For a quantitative comparison for novel view synthesis (motion taken from the training data, novel viewpoint), the authors note that ASH outperforms the real-time DDC and the prior non-real-time methods by a considerable margin.
For novel pose, ASH was able to achieve the highest PSNR and the second-best LPIPS score (after Neural Actor, a non-real-time method) among the tested frameworks, for tight outfits – but outperformed all rivals for subjects with loose clothing.
The authors comment:
‘[DDC] is the only competing method with real-time capability. Although it captures coarse motion-aware appearances, its output tends to be blurry and lacks detail. ASH matches the real-time capability as DDC, while generating renderings with much finer details.’
Click to play. Novel pose synthesis tests.
Though the HDHumans method achieves comparable results to ASH, the authors observe, it requires extensive sampling for volume rendering, and takes seconds to render one single frame in contrast to the real-time performance that ASH is currently capable of.
In terms of limitations, the authors concede that ASH does not extract detailed geometry from the Splats, but that this could be addressed in future work through backpropagation, making the process less of a one-way street. However, presumably, this will present an optimization challenge, in terms of maintaining real-time performance.
Neither does the system model extreme topological changes, such as the opening of a jacket, which deforms the jacket and reveals previously hidden human detail. This, the researchers state, could also be addressed in later work, through the adaptive adding and removal of Gaussian Splats (native Splat functionality that is not used in ASH, which currently assigns a fixed number of Splats and does not divide or multiply them on demand).
Does this kind of thing qualify as a ‘deepfake’? Unlike last week’s Gaussian Avatars, ASH currently has no functionality for transposition of identity. However, since it can manifestly make trained avatars do things that the source actors did not do in the training videos, and since movement and action can be swapped over and manually edited as necessary, simply by altering the skeletal poses, it certainly does seem to qualify.
In terms of risk, the need for multi-viewpoint video capture largely obviates the possibility of abusing such a system, which currently requires the explicit cooperation of the subject. Presumably, further down the road, the ability to infer similar functionality from static images (as NeRF does) or non-synchronous video (multiple diverse clips of the same subject) could change this situation.
The ultimate downstream objective of such research, at least for VFX professionals, is the facile capture of subjects with the minimal necessary information, and the transformation of this data (either via Gaussian Splats or later technologies or iterations) into neural-based equivalent methods of CGI’s current ability to completely model the human form – and to arrive, perhaps, at neural recreations of people that can not only perform any action on command, but which adhere adequately to motion physics – and which can survive a close-up!
In an ideal world it would be better to do this kind of thing at 24-Mona-Lisas-Per-Second – i.e., hardware and storage resources would eventually become adequate to perform native real-time volume rendering. As it stands, we seem set for six or more months of Splat-based papers centering around sleight-of-hand proxy schemes such as ASH, and, perhaps later, for genuine optimizations of full-volume operations, of the kind that PlenOctrees, NVIDIA’s Instant NGP, and later similar projects achieved for NeRF.
* Technically it’s a rasterization unit rather than a neural unit, but in all current Splat implementations that are of any power or interest to the synthesis community, it ends up as a neural unit, passed through standard training processes.
Amended Friday, December 15, 2023 13:33:47 EET to clarify ‘neural unit’