NVIDIA Offers Real-Time Neural People Through a NeRF and GAN Pipeline

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

A new collaboration between the University of California, Stanford, and NVIDIA has produced a system capable of turning people into neural representations in real time:

The user's webcam feed is transformed into a neural radiance fields representation in real-time. See the source videos for better resolution and coherence. Source: https://research.nvidia.com/labs/nxp/lp3d/

Though impressive in itself, the capability is more significant when considering its possible use in VR and AR applications, where the individual can be seen from multiple possible viewpoints, authentically represented.

The new system can accurately infer a novel and unseen point of view, in real-time.

Though the new paper avoids the topic, perhaps to avoid controversy (a growing trend in facial synthesis research publications), the new framework can also potentially transform individuals into other individuals, allowing them to drive synthetic or even artistic personalities, and potentially offering deepfake capabilities:

A real person is used as a 'driver' video to animate the actions of a clearly false personality, rendered in an artistic style.

This transformative process is obtained from a single RGB image source, such as a webcam, and requires no additional cameras in order to infer accurate geometry. Further, it can run effectively on an NVIDIA RTX 3090 GPU (which, while it is a beefy and expensive piece of kit, still qualifies as ‘consumer’ hardware).

The method, according to the researchers behind it, is three orders of magnitude faster than the current state-of-the-art NeRF-based approaches for real-time neural rendering.

The system is primed on synthetic images, which are generated dynamically by a prior framework. The authors note:

‘[We] warm up the model by training over 30k iterations without the adversarial loss and continue to train the model with the full loss functions in Eqn. 5 over 220k iterations. Since we sample two camera poses per [iteration], we effectively use over 16 million images during the training, which is not obtainable from real images (nor even physically-based rendered images) in practice.’
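
The quoted figure can be sanity-checked with simple arithmetic. In the sketch below, the iteration counts and the two sampled camera poses per step come from the excerpt above, while the batch size of 32 is a hypothetical assumption, introduced purely to make the arithmetic concrete (it is not stated in the excerpt):

```python
# Back-of-the-envelope check on the '16 million images' claim.
# Iteration counts and the two-poses-per-iteration detail come from the
# paper excerpt; the batch size is an illustrative assumption only.
warmup_iters = 30_000
adversarial_iters = 220_000
poses_per_iter = 2
assumed_batch_size = 32  # hypothetical; not stated in the excerpt

total_images = (warmup_iters + adversarial_iters) * poses_per_iter * assumed_batch_size
print(f"{total_images:,} synthetic training images")  # 16,000,000
```

Under that assumed batch size, the count lands exactly on 16 million, which is consistent with the authors' claim that such a volume of images could never be sourced from real photography.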

Trialed against prior methods, the (unnamed) new approach improves on earlier work in this area in terms of both latency and quality:

The new method offers improved geometry and identity-fidelity, in addition to a leap to real-time rendering, on consumer-level hardware (albeit very good hardware). Source: https://arxiv.org/pdf/2305.02310.pdf

One interesting take on the possible application of the new work is that it is unusually capable of generating accurate novel views at extreme angles. It may therefore serve not only as a possible source of synthetic data (i.e., to train models where the data has gaps in viewpoint coverage), but could also function as a temporally coherent rendering system for generative models such as Stable Diffusion, which has long sought such a solution.

The system does not ‘specialize’ in human faces, however, and the researchers also tested it on the generation of cat photos:

The new NVIDIA system is arbitrary, in terms of subject matter, and a wide range of possible subjects can be recreated.

The authors state, furthermore, that the principles employed in the work could be applied to any kind of possible neural representation, including depiction of scenes or objects, transcending its possibilities as a deepfake-style process.

The system uses Neural Radiance Fields (NeRF) to handle the rendering end of the pipeline, and Generative Adversarial Networks (GANs) to produce the extraordinary number of synthetic faces needed to prime the process.

The new paper is titled Real-Time Radiance Fields for Single-Image Portrait View Synthesis, and comes from six researchers spanning Stanford, UoC, and NVIDIA Research.

Approach

The new system both improves upon and incorporates Efficient Geometry-aware 3D Generative Adversarial Networks (EG3D), a prior collaboration between Stanford and NVIDIA, launched in 2022.

From the accompanying video from the 2022 release of EG3D, we see the underlying inferred geometry powering the neural reproduction of people. Source: https://www.youtube.com/watch?v=cXxEwI7QbKg

This earlier publication was less wary of exploring the transformative potential of converting real-world data into manipulable neural representations:

Once the original stream is converted into a neural representation, the parameters are known and can be manipulated. Arguably, 'neuralization' is the new take on the 'digital revolution' of the 1990s and 2000s.

EG3D trains a triplane-based 3D GAN on a collection of 512x512px single-view images. An NVIDIA StyleGAN2 generator is used to map a noise vector and a conditioning (i.e., virtual) camera to a triplane representation.

Any point within a neural depiction is characterized by its location in X/Y/Z coordinates (similar to CGI-based methods) and by its direction – the viewpoint of the end user, a ‘window’ on the scene.
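
The position-plus-direction parameterization described above can be sketched as a toy query function. Everything here is purely illustrative: a real NeRF evaluates a trained neural network at each point, whereas this sketch substitutes a simple analytic formula to show the shape of the interface:

```python
import math

def query_radiance_field(x: float, y: float, z: float,
                         theta: float, phi: float) -> tuple:
    """Toy stand-in for a radiance-field query. Input: a 3D location
    (x, y, z) and a viewing direction (theta, phi). Output: an RGB
    colour and a volume density (sigma)."""
    # Density depends only on position (view-independent geometry)...
    sigma = math.exp(-(x * x + y * y + z * z))
    # ...while colour may vary with viewing direction, which is how
    # NeRF models view-dependent effects such as specular highlights.
    r = 0.5 + 0.5 * math.cos(theta)
    g = 0.5 + 0.5 * math.sin(phi)
    b = 0.5
    return (r, g, b), sigma
```

A renderer would call such a function at many sample points along each camera ray and composite the results into a pixel.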

Source: https://arxiv.org/pdf/2112.07945.pdf via marktechpost.com

In full-fledged 3D representations, memory usage is very high, since the scene must be entirely (explicitly) rendered in order to become explorable. By contrast, implicit representations frame this geometry as a continuous function, which occupies little memory, but which is far more effortful to infer and explore. Effectively, the new system finds a way around these twin bottlenecks, primarily through pre-processing.
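
A back-of-the-envelope comparison makes the memory trade-off concrete. The grid resolution, channel count, and parameter count below are illustrative assumptions chosen for the sketch, not figures from the paper:

```python
# Rough memory comparison: an explicit voxel grid vs an implicit network.
# All figures here are illustrative assumptions, not from the paper.
voxel_resolution = 512
channels = 4            # RGB + density
bytes_per_value = 4     # float32
explicit_bytes = voxel_resolution ** 3 * channels * bytes_per_value

mlp_parameters = 600_000   # a small NeRF-style MLP; order of magnitude only
implicit_bytes = mlp_parameters * bytes_per_value

print(f"explicit grid: {explicit_bytes / 2**30:.1f} GiB")
print(f"implicit MLP:  {implicit_bytes / 2**20:.1f} MiB")
```

Under these assumptions, the explicit grid needs on the order of gigabytes while the implicit function fits in a few megabytes; the price of the compact form is that every query requires a network evaluation.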

The EG3D framework facilitates rendering at 42fps from a triplane representation on an RTX 3090 GPU, equaling SOTA GAN methods in terms of the realism of the faces obtained, and thus becomes the core engine of the new method.

The innovation with the new system is that it can directly map an unposed (i.e., arbitrary) image to a canonical triplane 3D representation that is then ‘decoded’ through NeRF. This requires only a single feedforward neural network. Ordinarily, this goal is accomplished by the far more time/resource-intensive process of GAN inversion, which ‘projects’ source images into the latent space of the network.
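
The triplane lookup at the heart of this representation can be sketched in a few lines of pure Python. The function names and the nested-list data layout are illustrative, not taken from the paper: a 3D point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the three feature vectors are summed before being decoded by a small NeRF-style network:

```python
def sample_plane(plane, u, v):
    """Bilinearly sample a 2D feature plane (an HxW grid of feature
    vectors, as nested lists) at continuous coordinates (u, v) in [0, 1]."""
    h, w = len(plane), len(plane[0])
    x, y = u * (w - 1), v * (h - 1)
    x0, y0 = int(x), int(y)
    x1, y1 = min(x0 + 1, w - 1), min(y0 + 1, h - 1)
    fx, fy = x - x0, y - y0
    dim = len(plane[0][0])
    return [
        plane[y0][x0][c] * (1 - fx) * (1 - fy)
        + plane[y0][x1][c] * fx * (1 - fy)
        + plane[y1][x0][c] * (1 - fx) * fy
        + plane[y1][x1][c] * fx * fy
        for c in range(dim)
    ]

def query_triplane(planes, x, y, z):
    """Project a 3D point onto the XY, XZ and YZ planes, sample each,
    and sum the three feature vectors (the standard triplane lookup)."""
    xy, xz, yz = planes
    f1 = sample_plane(xy, x, y)
    f2 = sample_plane(xz, x, z)
    f3 = sample_plane(yz, y, z)
    return [a + b + c for a, b, c in zip(f1, f2, f3)]
```

The appeal of the scheme is that three modest 2D feature maps stand in for a full 3D volume, so a single feedforward pass can emit the whole representation at once.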

The challenge at hand is to create a canonical 3D representation from a single image that is both flexible and accurate to the source identity (or object characteristics). The authors note that to date, prior pipelines have found these two goals to be at odds with each other.

The inference phase of the new system.

In the first inference phase, pictured above, a (Google) DeepLabV3 module extracts low-res features from the source image, which are then fed to a (NVIDIA) SegFormer vision Transformer (ViT) and a Convolutional Neural Network (CNN).

SegFormer quickly obtains a canonical 3D representation; however, it cannot accurately reproduce high-frequency details such as birthmarks or strands of hair. This is handled by the next phase of the new system’s pipeline:

The second half of the core methodology of the new system.

The triplane encoder is now trained with synthetic data (in this case, face data) generated by a frozen EG3D generator (i.e., ‘frozen’ in the sense that it is not being affected by the throughput of data, but is being used as a ‘read only’ component, having already been trained, as described above).

At each gradient step, two images from the source data are synthesized (the source data is the high-resolution image outlined in red in the illustration above). The processed image is further conditioned into a total of eight images, and evaluated by a generative adversarial objective (i.e., a discriminator process).
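
In heavily simplified form, the training scheme described above can be sketched as follows. Scalars stand in for images and networks, and every function here is a hypothetical placeholder; the sketch preserves only the structure of the scheme, in which a frozen generator supplies paired views of the same synthetic identity and the encoder learns to reproduce the held-out view:

```python
import random

def frozen_eg3d_render(identity, pose):
    """Frozen, pretrained EG3D generator: synthesizes a 'view' (a scalar
    here, an image in the real system)."""
    return identity + 0.1 * pose

def encoder_render(weight, image, pose):
    """Trainable encoder: lifts an 'image' to a representation and
    re-renders it from a new pose. Its single parameter is 'weight'."""
    return weight * image + 0.1 * pose

def training_step(weight, lr=0.05):
    identity = random.random()
    pose_a, pose_b = random.random(), random.random()  # two poses per step
    source = frozen_eg3d_render(identity, pose_a)      # input view
    target = frozen_eg3d_render(identity, pose_b)      # held-out view
    predicted = encoder_render(weight, source, pose_b)
    # Stand-in for the full reconstruction + adversarial objective:
    error = predicted - target
    grad = 2 * error * source
    return weight - lr * grad
```

Because the ‘dataset’ is drawn fresh from the frozen generator at every step, the supply of training pairs is effectively unlimited, which is what makes the 16-million-image figure quoted earlier attainable.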

The authors observe:

‘Note that the rendering, upsampling, and dual discriminator modules are all fine-tuned from the pretrained EG3D. However, the dual discriminator in our pipeline doesn’t rely on any real data; instead, we train this discriminator to differentiate between images rendered from our encoder model and images rendered from the frozen EG3D.’

The images are subject to on-the-fly augmentation in order to produce convincing human output; because all of the data conditioning this process is synthetic, the system would otherwise, by default, render an unconvincing face.

Results on FFHQ and AFHQ on both the full-fledged and lightweight model for the new system, demonstrating novel views of conditioned geometry.

The authors have developed two encoders for different use cases: Ours and Ours (LT). The former requires an A100 with 40GB of VRAM, while the latter can run on a 3090, with 24GB of VRAM. They contain 87 million and 63 million parameters, respectively, and differ only in the resolution of the intermediate feature maps. Ours runs at 22ms on an A100, and 40ms on a 3090, while the light version runs at only 16ms on a 3090.
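
Converted into frame rates, those per-frame latencies (all reported in the paper, and restated here) put even the heavier model comfortably into real-time territory:

```python
# Frame rates implied by the reported per-frame latencies.
latencies_ms = {
    "full model, A100": 22,
    "full model, RTX 3090": 40,
    "lightweight model, RTX 3090": 16,
}
for setup, ms in latencies_ms.items():
    print(f"{setup}: {1000 / ms:.0f} fps")
```

At 40ms per frame, the full model on a 3090 sits at 25fps, while the lightweight variant exceeds 60fps on the same card.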

Data and Tests

To test the new system, the researchers chose three comparable frameworks: Samsung’s Realistic One-shot Mesh-based Head Avatars (ROME), HeadNeRF, and EG3D-PTI – the last of which, like the new system, contains an unconditional EG3D generator, combined with Pivotal Tuning Inversion (PTI).

The systems were evaluated for three factors: 2D image reconstruction, evaluated via diverse metrics, including LPIPS, Deep Image Structure and Texture Similarity (DISTS) and SSIM, as well as likeness; general image quality, as evaluated via Fréchet Inception Distance (FID); and 3D reconstruction quality.

The authors note that misalignment between the ground truth and the synthesized output, caused by the off-the-shelf pose estimator used, downgraded the new system’s results under SSIM and Peak Signal-to-Noise Ratio (PSNR), both reconstruction metrics. Of this, the authors state:

‘Nonetheless, we report SSIM results in the main paper and include PSNR results in the supplement along with an analysis of alignment issues. In the end, our experiments qualitatively and quantitatively support that our method achieves the state-of-the-art results on in-the-wild portraits as well as multiview 3D scan datasets.’
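
PSNR, one of the affected reconstruction metrics, is simple enough to define in a few lines. This sketch operates on flat lists of pixel values rather than image tensors; because the comparison is strictly pixel-for-pixel, even a slight spatial misalignment between an otherwise faithful reconstruction and its ground truth drags the score down, which is the failure mode the authors describe:

```python
import math

def psnr(reference, reconstruction, max_value=1.0):
    """Peak Signal-to-Noise Ratio between two equally-sized images,
    given as flat lists of pixel values in [0, max_value].
    Higher is better; identical images score infinity."""
    mse = sum((a - b) ** 2 for a, b in zip(reference, reconstruction)) / len(reference)
    if mse == 0:
        return float("inf")
    return 10 * math.log10(max_value ** 2 / mse)
```

For example, a reconstruction whose pixels are each off by 0.1 on a [0, 1] scale scores 20dB, regardless of whether the error stems from poor synthesis or from a small alignment shift.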

Datasets used were FFHQ, H3DS, and AFHQv2 Cats – the latter a feline dataset.

Qualitative results from the testing rounds.

Regarding these results, the authors state:

‘While HeadNeRF and ROME provide adequate shapes and images, they need image segmentation as a preprocess, and struggle with obtaining photorealistic results. Despite the 20 mins of fine tuning, EG3D-PTI does not ensure the reconstruction looks photorealistic when viewed from a non-input [view]. In contrast, our method reconstructs the entire portrait with accurate photorealistic details.’

Further qualitative results. The authors note that the competing frameworks struggle to achieve lateral views, in comparison to the better results from the new system.

The authors also observe that the geometry output by ROME and HeadNeRF is not as faithful as that of the new system, and that these frameworks often reconstruct only part of the head (see image below):

The differing quality of mesh reconstruction across the frameworks.

The paper features more qualitative result images than we are able to include here, so please refer to it for further details.

In the quantitative round, with the exception of SSIM (which the authors have addressed – see also below), the new system leads over the former approaches:

Results from the quantitative round.

Here, the authors assert:

‘[Our] model significantly outperforms the baselines on all the metrics except SSIM; our SSIM score is only marginally lower than EG3D-PTI despite the aforementioned issue of the image misalignment and the fact that EG3D-PTI directly optimizes the pixels for the evaluation view. The geometry evaluation [on] H3DS in which we compare the depths of the ground truth from the input view as predicted by each model validates that our models produce more accurate 3D geometry.’

Conclusion

In projects such as this, we are arguably witnessing a process akin to the development of the earliest video compression codecs, or to the evolution of the digitization of information, which began in 1679, but had to wait until the 1980s to truly become an actionable and pervasive technology.

The ability to ‘neuralize’ people, as well as objects and environments, may eventually jettison traditional 2D formats (such as video and static images), as future capture devices will create neural representations natively and without the need for interstitial ‘flat’ imagery.

This landmark change would convert multimedia from a fixed, rasterized format into a medium as editable as text currently is in a word processor.

Besides its incremental improvement over EG3D and the other prior networks challenged in the new paper, what’s notable about the work is the degree to which it downplays the transformative potential of neural capture. This reticence has grown in image synthesis research since Stable Diffusion and the ChatGPT series turned AI from a cultural curiosity into an apparently existential threat to the livelihoods of multiple professions, and particularly since the artistic and creative community has developed a spirit of resistance to the rapidly developing capabilities of these systems.
