If time were no object, Neural Radiance Fields (NeRF) might by now have made greater inroads into potential commercial implementations – particularly in the field of human avatars and facial recreation.
As it stands, even the fastest recent implementations of NeRF-based facial recreation from user-provided video take around twenty minutes to train, which is a long wait given the narrow window in which casual consumer interest can be captured and consolidated.
Therefore a great deal of effort has been expended by the NeRF-avatar research subsector over the past 18 months to speed up the process of creating usable and dynamic NeRF face/head avatars, for possible use in AR/VR environments, virtual communications, and in other applications.
The latest breakthrough, just published in a new paper from the Department of Automation at China’s Tsinghua University, offers a usable NeRF avatar in two minutes, and a state-of-the-art NeRF avatar in five minutes.
With increased research along these and similar lines, near-instantaneous photoreal self-representations seem to be on the horizon within the next year or so, depending on the compromises that will need to be struck between processing/training time, quality and versatility of the final result.
The new work is titled ManVatar: Fast 3D Head Avatar Reconstruction Using Motion-Aware Neural Voxels. The speed increase was achieved through the use of multiple 4D tensors in a 3D Morphable Face Model (3DMM).
As we’ve discussed before, 3DMMs are ‘regular’, parametric CGI models which are used as an interface to more problematic neural representations of faces, such as the photogrammetry-based NeRF, and the arcane latent space of Generative Adversarial Networks (GANs).
The source for the trained ManVatar representation is, as with prior works, a monocular (i.e. not 3D or dual-lensed) portrait video, such as one might take of oneself on a smartphone.
The captured head pose and facial expressions are mapped onto a 3DMM template by identification of facial landmarks (using OpenSeeFace), and pre-processed. Each expression is then converted into a voxel grid, which is similar to the grid of pixels in a JPEG or other type of image, except that the mapping is 3D and volumetric.
The sum of these calculated expressions is then converted and averaged into a complete motion voxel grid, in which, obviously, the voxels (represented in the squares of the images above) may not necessarily remain at their initial fixed points.
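The idea of combining per-expression voxel grids into a single motion grid can be sketched in a few lines of plain Python. This is only an illustration, not the authors' code: it assumes that the combination is a simple weighted sum driven by expression coefficients, and the names `blend_voxel_grids`, `bases` and `coeffs` are hypothetical. Each grid is flattened to a list of floats for brevity.

```python
def blend_voxel_grids(bases, coeffs):
    """Combine per-expression voxel grids into one motion grid.

    bases:  list of voxel grids, each flattened to a list of floats
    coeffs: one expression coefficient per grid (hypothetical weights)
    """
    if not bases or len(bases) != len(coeffs):
        raise ValueError("need one coefficient per voxel grid")
    size = len(bases[0])
    if any(len(grid) != size for grid in bases):
        raise ValueError("all voxel grids must have the same size")

    blended = [0.0] * size
    for grid, weight in zip(bases, coeffs):
        for i, value in enumerate(grid):
            # Each expression contributes in proportion to its coefficient.
            blended[i] += weight * value
    return blended


# Two toy 2x2x2 grids (8 voxels each), standing in for two expressions.
smile = [1.0] * 8
jaw_open = [0.0, 2.0] * 4
motion_grid = blend_voxel_grids([smile, jaw_open], [0.5, 0.25])
```

A real system would store a feature vector per voxel rather than a single number, and the paper's exact way of combining the grids may differ from the weighted sum assumed here.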
ManVatar also calculates a ‘canonical’ appearance – a representation of the face that contains a ‘neutral’ pose and expression – a ‘base’ against which deformations (i.e. changes in facial and head pose) can be calculated.
The processed data in this workflow is finally passed to a very slim two-layer multilayer perceptron (MLP), which produces the final portrait image via volume rendering. Training large MLPs has been the traditional bottleneck in NeRF generation; using such a shallow MLP for ManVatar, and concentrating the effort on derived voxels instead, is key to the speed of the system.
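For readers unfamiliar with volume rendering, the sketch below shows the standard NeRF-style compositing step for a single camera ray, in plain Python. It is a generic illustration rather than ManVatar's implementation: the function name `composite_ray` is hypothetical, and the per-sample densities and colors are assumed to have already been produced by whatever network or voxel lookup precedes this step.

```python
import math


def composite_ray(densities, colors, deltas):
    """Blend samples along one ray into a single RGB value.

    densities: volume density at each sample, ordered near to far
    colors:    (r, g, b) tuple at each sample
    deltas:    distance covered by each sample along the ray
    Returns the composited color and the leftover transmittance.
    """
    transmittance = 1.0  # fraction of light not yet absorbed
    rgb = [0.0, 0.0, 0.0]
    for sigma, color, delta in zip(densities, colors, deltas):
        # Opacity of this segment of the ray.
        alpha = 1.0 - math.exp(-sigma * delta)
        weight = transmittance * alpha
        for channel in range(3):
            rgb[channel] += weight * color[channel]
        # Light that survives this segment carries on to the next one.
        transmittance *= 1.0 - alpha
    return rgb, transmittance


# An empty ray (zero density everywhere) contributes no color.
empty_rgb, empty_t = composite_ray(
    [0.0, 0.0], [(1.0, 0.0, 0.0), (0.0, 1.0, 0.0)], [0.1, 0.1]
)
```

Because this step has no trainable parameters, the cost of a pipeline of this kind lies in producing the densities and colors, which is where a shallow MLP helps.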
The optimized nature of the workflow means that at inference time (i.e., when processing is done and it’s time to animate the head), it’s now possible to generate photoreal portraits from mere expression coefficients and base head poses; and very, very quickly.
Prior approaches have not sought or been able to separate facial expressions from the base geometry of the captured subject. This has obstructed previous attempts at fast convergence, due to the high volume of data entailed in this enmeshment.
Instead, ManVatar creates pose and expression as divergences from a base ‘neutral’ default, allowing for a more lightweight implementation, where the voxel grids are doing the work previously undertaken, at greater expense of time and resources, by MLPs.
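The principle of treating an expression as a divergence from a neutral base can be sketched as follows: a sample point is shifted by a motion offset, and the shifted point is then looked up in the canonical voxel grid. This is a simplified assumption about how such a lookup might work, not the paper's actual pipeline; `trilinear_sample`, `sample_canonical` and the scalar-valued grid are hypothetical stand-ins.

```python
def trilinear_sample(grid, x, y, z):
    """Interpolate a scalar voxel grid at continuous index coordinates.

    grid[i][j][k] holds the value at integer position (i, j, k); every
    axis needs at least two voxels. Out-of-range points are clamped.
    """
    dims = (len(grid), len(grid[0]), len(grid[0][0]))
    base, frac = [], []
    for coord, n in zip((x, y, z), dims):
        coord = min(max(coord, 0.0), n - 1.0)  # clamp into the grid
        lower = min(int(coord), n - 2)         # lower corner of the cell
        base.append(lower)
        frac.append(coord - lower)

    value = 0.0
    for di, wx in ((0, 1.0 - frac[0]), (1, frac[0])):
        for dj, wy in ((0, 1.0 - frac[1]), (1, frac[1])):
            for dk, wz in ((0, 1.0 - frac[2]), (1, frac[2])):
                corner = grid[base[0] + di][base[1] + dj][base[2] + dk]
                value += wx * wy * wz * corner
    return value


def sample_canonical(canonical, point, offset):
    """Shift a point by a motion offset, then read the canonical grid."""
    warped = [p + o for p, o in zip(point, offset)]
    return trilinear_sample(canonical, *warped)


# Toy 2x2x2 canonical grid whose value equals the x index.
canonical = [[[float(i) for _ in range(2)] for _ in range(2)] for i in range(2)]
neutral = sample_canonical(canonical, (0.25, 0.5, 0.5), (0.0, 0.0, 0.0))
deformed = sample_canonical(canonical, (0.25, 0.5, 0.5), (0.5, 0.0, 0.0))
```

With a zero offset the lookup returns the neutral value; a non-zero offset reads a different part of the same canonical grid, so only the offsets need to change between expressions.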
The process is further optimized by background and body/neck removal, which produces a representation from the lower neck upwards. The researchers found that the motion-aware neural voxels obtained by the process were useful not only in representing expressions, but as contributors to the ‘base’, neutral expression of the canonical pose – a further optimization of resources.
Method and Tests
The tests were conducted on an NVIDIA 3090 GPU, with the model trained for 10,000 iterations under the Adam optimizer. In an increasingly common practice in computer vision, initial training took place on images at 256×256px resolution, with the last 4,000 iterations using 512×512px resolution.
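The coarse-to-fine schedule described above amounts to a simple rule on the iteration counter. The sketch below uses the figures reported in the article (10,000 iterations, the last 4,000 at 512×512px); the function and constant names are hypothetical, and the exact switch-over logic in the authors' code is not known.

```python
TOTAL_ITERATIONS = 10_000
HIGH_RES_ITERATIONS = 4_000  # the final stretch of training


def resolution_for_iteration(step):
    """Return the square training resolution in pixels for a 0-based step."""
    if not 0 <= step < TOTAL_ITERATIONS:
        raise ValueError("step outside the training run")
    switch_point = TOTAL_ITERATIONS - HIGH_RES_ITERATIONS
    return 256 if step < switch_point else 512


# Resolution used at every step of the run.
schedule = [resolution_for_iteration(s) for s in range(TOTAL_ITERATIONS)]
```

Under these assumptions the first 6,000 steps see the cheaper low-resolution images, and only the final 4,000 pay the cost of full-resolution rendering.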
Eight training videos were used, including four from the HDTF dataset. The remaining videos were created by the researchers using a hand-held mobile phone.
ManVatar was tested against three comparable SOTA approaches: Deep Video Portraits (DVP), which synthesizes 2D images instead of reconstructing a full head model; I M Avatar, which creates an implicit Signed Distance Field (SDF) based on a FLAME model; and NerFACE, which also reconstructs a NeRF model from images with 3DMM-based data, similar to ManVatar.
All three competing methods were trained to the same level of convergence as ManVatar (i.e., trained enough so that the models were considered usable and high-quality). Though it presents no corresponding graph, the paper reports that I M Avatar and DVP each took an entire day to converge; that NerFACE took 12 hours; and that ManVatar took five minutes.
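To put those figures in proportion, the reported training times can be converted into simple ratios. The snippet below assumes that ‘an entire day’ means 24 hours, which the paper's wording may not strictly imply; the names used are illustrative only.

```python
# Reported time to convergence, in minutes.
TRAINING_MINUTES = {
    "I M Avatar": 24 * 60,  # "an entire day", assumed to mean 24 hours
    "DVP": 24 * 60,         # same assumption
    "NerFACE": 12 * 60,
    "ManVatar": 5,
}


def times_longer_than_manvatar(method):
    """How many times longer a method trains than ManVatar."""
    return TRAINING_MINUTES[method] / TRAINING_MINUTES["ManVatar"]


ratios = {
    name: times_longer_than_manvatar(name)
    for name in TRAINING_MINUTES
    if name != "ManVatar"
}
```

On these assumptions, I M Avatar and DVP take 288 times as long to train as ManVatar, and NerFACE 144 times as long.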
In qualitative terms, the authors assert that NerFACE achieves comparable results to ManVatar, but at the cost of a far longer training time.
The authors state:
‘The results validate that ManVatar achieves the highest render quality while the training time is far less than the other methods. IMAvatar reconstructs an implicit model based on a FLAME template, yet the expressiveness is insufficient. Therefore, they can hardly learn person-specific expression details. DVP inherits the GAN framework and relies on a 2D convolutional network to generate images. But in many cases, the generated details are not appropriate.’
Quantitative tests were also conducted on four popular metrics: Mean Squared Error (MSE); Peak Signal-to-Noise Ratio (PSNR); Structural Similarity Index (SSIM); and Learned Perceptual Image Patch Similarity (LPIPS). Here, ManVatar achieved comparable results to NerFACE on SSIM and LPIPS, and superior results across the other metrics.
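The two simplest of these metrics are easy to compute by hand. The sketch below implements MSE and PSNR in plain Python over flattened pixel lists, assuming pixel values normalized to the range 0 to 1; it is not the evaluation code used in the paper. SSIM and LPIPS are left out because they need considerably more machinery (LPIPS relies on a pretrained network).

```python
import math


def mse(pred, target):
    """Mean squared error between two equally sized pixel lists."""
    if not pred or len(pred) != len(target):
        raise ValueError("inputs must be non-empty and the same length")
    return sum((p - t) ** 2 for p, t in zip(pred, target)) / len(pred)


def psnr(pred, target, max_value=1.0):
    """Peak signal-to-noise ratio in decibels (higher is better)."""
    error = mse(pred, target)
    if error == 0.0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / error)


# Every pixel is off by 0.1, so the MSE is about 0.01.
score = psnr([0.0, 0.0], [0.1, 0.1])
```

Lower is better for MSE and LPIPS, while higher is better for PSNR and SSIM, which is worth keeping in mind when reading the comparison.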
Further tests were conducted regarding training speed, this time against two directly comparable NeRF-based methods: NeRFBlend-Shape and, again, NerFACE. In this case, the finish line was defined by the time it took ManVatar to complete convergence.
For these tests, the researchers used an official video from NeRFBlend-Shape as a guideline to its development through training.
The researchers note that despite the stated claims of 12 hours for training, NerFACE was able to converge within ‘a few hours’. Nonetheless, this is still a huge increase over the five-minute training time for ManVatar. The authors note that the final three minutes are primarily for ‘finish’, and that the ManVatar avatar is essentially usable after only two minutes.