A new collaboration between the University of California, Stanford University, and NVIDIA has proposed a system capable of turning people into neural representations in real time:

Though impressive in itself, the capability is more significant when considering its possible use in VR and AR applications, where the individual can be seen from multiple possible viewpoints, authentically represented.

Though the new paper avoids the topic, perhaps to avoid controversy (a growing trend in facial synthesis research publications), the new framework can also potentially transform individuals into other individuals, allowing them to drive synthetic or even artistic personalities, and offering possible deepfake capabilities:

This transformative process is obtained from a single RGB image source, such as a webcam, and has no need of additional cameras in order to infer accurate geometry. Further, it can run effectively on an NVIDIA RTX 3090 GPU (which, while it is a beefy and expensive piece of kit, still qualifies as ‘consumer’ hardware).
The method, according to the researchers behind it, is three orders of magnitude faster than the current state-of-the-art NeRF-based approaches for real-time neural rendering.
The system is primed on synthetic images, which are generated dynamically by a prior framework. The authors note:
‘[We] warm up the model by training over 30k iterations without the adversarial loss and continue to train the model with the full loss functions in Eqn. 5 over 220k iterations. Since we sample two camera poses per [iteration], we effectively use over 16 million images during the training, which is not obtainable from real images (nor even physically-based rendered images) in practice.’
Trialed against prior methods, the (unnamed) new approach is able to improve on earlier work in this area in terms of both latency and quality:

One interesting take on the possible application of the new work is that it is unusually capable of generating accurate novel views at extreme angles. This means it may serve not only as a possible source of synthetic data (i.e., to train models where the data has gaps in viewpoint coverage), but could also function as a temporally coherent rendering system for generative models such as Stable Diffusion, which has long sought such a solution.
The system does not ‘specialize’ in human faces, however, and the researchers also tested it on the generation of cat images:

The authors state, furthermore, that the principles employed in the work could be applied to any kind of neural representation, including the depiction of scenes or objects, transcending its possibilities as a deepfake-style process.
The system uses Neural Radiance Fields (NeRF) to handle the rendering end of the pipeline, and Generative Adversarial Networks (GANs) to produce the extraordinary number of synthetic faces needed to prime the process.
The new paper is titled Real-Time Radiance Fields for Single-Image Portrait View Synthesis, and comes from six researchers spanning Stanford, UoC, and NVIDIA Research.
Approach
The new system both improves upon and incorporates Efficient Geometry-aware 3D Generative Adversarial Networks (EG3D), an earlier collaboration between Stanford and NVIDIA, launched in 2022.

This earlier publication was less wary of exploring the transformative potential of converting real-world data into malleable neural representations:

EG3D trains a triplane-based 3D GAN on a collection of 512x512px single-view images. An NVIDIA StyleGAN2 generator is used to map a noise vector and a conditioning (i.e., virtual) camera to a triplane representation.
Any point within a neural depiction is characterized by its location in X/Y/Z coordinates (as in CGI-based methods) and by its viewing direction – the viewpoint of the end user, a ‘window’ on the scene.
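For readers curious about what a triplane lookup actually involves, below is a minimal PyTorch sketch (not the authors' code) in which a 3D query point is projected onto three axis-aligned feature planes, each plane is sampled bilinearly, and the results are summed into a per-point feature; the plane resolution, channel count, and summation-based fusion are illustrative assumptions.

```python
import torch
import torch.nn.functional as F

def sample_triplane(planes, points):
    """
    planes: (3, C, H, W) feature maps for the XY, XZ and YZ planes.
    points: (N, 3) query coordinates, assumed normalised to [-1, 1].
    Returns per-point features of shape (N, C), obtained by bilinearly
    sampling each plane at the point's 2D projection and summing.
    """
    # Project each 3D point onto the three axis-aligned planes.
    xy = points[:, [0, 1]]
    xz = points[:, [0, 2]]
    yz = points[:, [1, 2]]
    feats = 0
    for plane, coords in zip(planes, (xy, xz, yz)):
        # grid_sample expects a (1, H_out, W_out, 2) sampling grid in [-1, 1].
        grid = coords.view(1, -1, 1, 2)
        sampled = F.grid_sample(plane.unsqueeze(0), grid,
                                mode='bilinear', align_corners=False)
        feats = feats + sampled.view(plane.shape[0], -1).t()   # (N, C)
    return feats

# Toy usage: 32-channel planes at 256x256, queried at 1024 random points.
planes = torch.randn(3, 32, 256, 256)
points = torch.rand(1024, 3) * 2 - 1
features = sample_triplane(planes, points)   # (1024, 32)
```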

In full-fledged explicit 3D representations, memory usage is very high, since the scene must be entirely (explicitly) rendered in order to become explorable. By contrast, implicit representations frame this geometry as a continuous function, which occupies little memory, but which is far more effortful to infer and explore. Effectively, the new system finds a way around these twin bottlenecks, primarily through pre-processing.
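As a rough illustration of why implicit representations trade memory for compute, the sketch below is a generic NeRF-style renderer (not the paper's implementation): a continuous density/colour function is queried at many samples along each camera ray and the results are alpha-composited into a single pixel. The sample count, near/far bounds and `field_fn` callable are placeholder assumptions.

```python
import torch

def render_ray(field_fn, origin, direction, near=0.1, far=4.0, n_samples=64):
    """
    Generic NeRF-style volume rendering of one ray.
    field_fn: callable mapping (N, 3) points -> (density (N,), rgb (N, 3)).
    """
    # Sample depths uniformly between the near and far bounds.
    t = torch.linspace(near, far, n_samples)
    points = origin + t[:, None] * direction           # (N, 3) query points
    density, rgb = field_fn(points)                    # many network queries per pixel

    # Convert densities to alpha values and composite front-to-back.
    delta = t[1] - t[0]
    alpha = 1.0 - torch.exp(-density * delta)          # (N,)
    transmittance = torch.cumprod(
        torch.cat([torch.ones(1), 1.0 - alpha + 1e-10])[:-1], dim=0)
    weights = alpha * transmittance
    return (weights[:, None] * rgb).sum(dim=0)         # final pixel colour
```

Nothing about the scene is stored explicitly here; the cost is that every pixel requires dozens of function evaluations, which is the bottleneck the triplane hybrid is designed to relieve.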
The EG3D framework facilitates rendering at 42fps from a triplane representation on an RTX 3090 GPU, equaling SOTA GAN methods in terms of the realism of the faces obtained, and thus becomes the core engine of the new method.
The innovation with the new system is that it can directly map an unposed (i.e., arbitrary) image to a canonical triplane 3D representation that is then ‘decoded’ through NeRF. This requires only a single feedforward neural network. Ordinarily, this goal is accomplished by the far more time/resource-intensive process of GAN inversion, which ‘projects’ source images into the latent space of the network.
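The difference between the two routes can be sketched schematically: GAN inversion iteratively optimises a latent code until the generator reproduces the input, while an encoder-based approach needs only one forward pass. The `generator` and `encoder` modules below are stand-ins, not the paper's networks, and the step count and learning rate are arbitrary.

```python
import torch

def invert_via_optimisation(generator, target, steps=500, lr=0.01):
    """Classic GAN inversion: optimise a latent until G(z) matches the target."""
    z = torch.randn(1, 512, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(steps):                     # hundreds of generator passes
        loss = torch.nn.functional.mse_loss(generator(z), target)
        opt.zero_grad()
        loss.backward()
        opt.step()
    return z

def invert_via_encoder(encoder, target):
    """Encoder route: a single feedforward pass, amenable to real-time use."""
    with torch.no_grad():
        return encoder(target)                 # e.g. a canonical triplane
```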
The challenge at hand is to create a canonical 3D representation from a single image that is both flexible and accurate to the source identity (or object characteristics). The authors note that to date, prior pipelines have found these two goals to be at odds with each other.

In the first inference phase, pictured above, a (Google) DeepLabV3 module extracts low-res features from the source image, which are then fed to a (NVIDIA) SegFormer vision Transformer (ViT) and a Convolutional Neural Network (CNN).
SegFormer quickly obtains a canonical 3D representation; however, it cannot accurately reproduce high-frequency details such as birthmarks or strands of hair. This is handled by the next phase of the new system’s pipeline:

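A schematic of the two-branch idea described above is sketched below: a low-resolution branch captures the canonical global structure, while a high-resolution branch recovers fine detail. Every sub-module here is an illustrative stand-in (simple convolutions rather than DeepLabV3, SegFormer or the paper's actual CNN), and fusing the branches by addition is an assumption for the sake of brevity.

```python
import torch
import torch.nn as nn

class TwoBranchTriplaneEncoder(nn.Module):
    """Schematic coarse/detail encoder producing triplane features."""
    def __init__(self, feat_ch=256, plane_ch=32):
        super().__init__()
        # Stand-in for the low-resolution feature extractor.
        self.backbone = nn.Conv2d(3, feat_ch, kernel_size=7, stride=4, padding=3)
        # Stand-in for the branch producing the coarse canonical triplane.
        self.coarse_branch = nn.Conv2d(feat_ch, 3 * plane_ch, kernel_size=3, padding=1)
        # Stand-in for the branch recovering high-frequency detail.
        self.detail_branch = nn.Sequential(
            nn.Conv2d(3, 3 * plane_ch, kernel_size=3, stride=4, padding=1),
            nn.ReLU(),
            nn.Conv2d(3 * plane_ch, 3 * plane_ch, kernel_size=3, padding=1),
        )

    def forward(self, image):                    # image: (B, 3, H, W)
        feats = self.backbone(image)             # low-resolution features
        coarse = self.coarse_branch(feats)       # coarse canonical structure
        detail = self.detail_branch(image)       # high-frequency residual
        planes = coarse + detail                 # additive fusion is an assumption
        b, _, h, w = planes.shape
        return planes.view(b, 3, -1, h, w)       # (B, 3, plane_ch, h, w)
```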
The triplane encoder is now trained with synthetic data (in this case, face data) generated by a frozen EG3D generator (‘frozen’ in the sense that its weights are not updated by the data passing through it; it is used as a ‘read-only’ component, having already been trained, as described above).
At each gradient step, two images from the source data are synthesized (the source data is the high-resolution image outlined in red in the illustration above). The processed image is further conditioned into a total of eight images, and evaluated by a generative adversarial objective (i.e., a discriminator process).
The authors observe:
‘Note that the rendering, upsampling, and dual discriminator modules are all fine-tuned from the pretrained EG3D. However, the dual discriminator in our pipeline doesn’t rely on any real data; instead, we train this discriminator to differentiate between images rendered from our encoder model and images rendered from the frozen EG3D.’
The images are subject to on-the-fly augmentation so that the system can learn to render convincing human images (since all of the data conditioning this process is synthetic, and would otherwise, by default, yield an unconvincing face).
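A condensed sketch of what one such training step might look like is below. All modules (`frozen_g`, `encoder`, `renderer`, `discriminator`), the pose sampler and the augmentation callable are placeholders, and the simple L1-plus-adversarial objective shown is a simplifying assumption rather than the paper's exact loss.

```python
import torch
import torch.nn.functional as F

def training_step(frozen_g, encoder, renderer, discriminator,
                  opt_e, opt_d, sample_pose, augment):
    """One schematic optimisation step with a frozen synthetic-data generator."""
    z = torch.randn(1, 512)                        # random synthetic identity
    cam_a, cam_b = sample_pose(), sample_pose()    # two camera poses per step

    with torch.no_grad():                          # the generator stays frozen
        planes_gt = frozen_g(z)                    # 'ground-truth' triplane
        img_a = renderer(planes_gt, cam_a)         # view fed to the encoder
        img_b = renderer(planes_gt, cam_b)         # held-out target view

    planes_pred = encoder(augment(img_a))          # single feedforward pass
    recon = renderer(planes_pred, cam_b)

    # Encoder update: reconstruction plus a non-saturating adversarial term.
    loss_e = F.l1_loss(recon, img_b) + F.softplus(-discriminator(augment(recon))).mean()
    opt_e.zero_grad(); loss_e.backward(); opt_e.step()

    # The discriminator only ever compares renders of the frozen generator
    # against renders of the encoder's prediction; no real photographs appear.
    loss_d = (F.softplus(-discriminator(augment(img_b))).mean() +
              F.softplus(discriminator(augment(recon.detach()))).mean())
    opt_d.zero_grad(); loss_d.backward(); opt_d.step()
```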

The authors have developed two encoders for different use cases: Ours and Ours (LT). The former requires an A100 with 40GB of VRAM, while the latter can run on a 3090 with 24GB of VRAM. They contain 87 million and 63 million parameters, respectively, and differ only in the resolution of the intermediate feature maps. Ours runs in 22ms on an A100 and 40ms on a 3090, while the light version runs in only 16ms on a 3090.
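For context, per-frame latencies of this kind are typically measured with CUDA events, as in the minimal sketch below; the `model` object, input resolution, and warm-up/iteration counts are arbitrary placeholders rather than the authors' benchmarking protocol.

```python
import torch

def measure_latency_ms(model, input_shape=(1, 3, 512, 512), iters=100):
    """Average GPU latency per forward pass, in milliseconds."""
    device = torch.device('cuda')
    model = model.to(device).eval()
    x = torch.randn(*input_shape, device=device)

    with torch.no_grad():
        for _ in range(10):                   # warm-up passes
            model(x)
        torch.cuda.synchronize()

        start = torch.cuda.Event(enable_timing=True)
        end = torch.cuda.Event(enable_timing=True)
        start.record()
        for _ in range(iters):
            model(x)
        end.record()
        torch.cuda.synchronize()

    return start.elapsed_time(end) / iters    # elapsed_time() returns milliseconds
```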
Data and Tests
To test the new system, the researchers chose three comparable frameworks: Samsung’s Realistic One-shot Mesh-based Head Avatars (ROME), HeadNeRF, and EG3D-PTI, which, like the new system, contains an unconditional EG3D generator, combined with Pivotal Tuning Inversion (PTI).
The systems were evaluated on three factors: 2D image reconstruction, measured via diverse metrics including LPIPS, Deep Image Structure and Texture Similarity (DISTS), SSIM, and likeness to the source identity; general image quality, measured via Fréchet Inception Distance (FID); and 3D reconstruction quality.
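For readers wanting to run this style of paired-image evaluation on their own outputs, a minimal sketch using the widely available lpips and scikit-image packages is below; the paper does not specify these particular implementations, and the image shape and value range are assumptions (FID and DISTS are computed over sets of images with their own dedicated packages).

```python
import torch
import lpips                                  # pip install lpips
from skimage.metrics import structural_similarity, peak_signal_noise_ratio

def paired_metrics(pred, target):
    """
    pred, target: float32 arrays of shape (H, W, 3) with values in [0, 1].
    Returns LPIPS (lower is better), SSIM and PSNR (higher is better).
    """
    loss_fn = lpips.LPIPS(net='alex')         # perceptual similarity network

    # LPIPS expects (N, 3, H, W) tensors scaled to [-1, 1].
    to_tensor = lambda im: torch.from_numpy(im).permute(2, 0, 1)[None] * 2 - 1
    lp = loss_fn(to_tensor(pred), to_tensor(target)).item()

    ssim = structural_similarity(pred, target, channel_axis=-1, data_range=1.0)
    psnr = peak_signal_noise_ratio(target, pred, data_range=1.0)
    return lp, ssim, psnr
```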
The authors note that misalignments between the ground truth and the synthesized output, caused by the off-the-shelf pose estimator used, downgraded the new system’s results under SSIM and Peak Signal-to-Noise Ratio (PSNR), a standard reconstruction metric. Of this, the authors state:
‘Nonetheless, we report SSIM results in the main paper and include PSNR results in the supplement along with an analysis of alignment issues. In the end, our experiments qualitatively and quantitatively support that our method achieves the state-of-the-art results on in-the-wild portraits as well as multiview 3D scan datasets.’
Datasets used were FFHQ, H3DS, and AFHQv2 Cats – the latter a feline dataset.

Regarding these results, the authors state:
‘While HeadNeRF and ROME provide adequate shapes and images, they need image segmentation as a preprocess, and struggle with obtaining photorealistic results. Despite the 20 mins of fine tuning, EG3D-PTI does not ensure the reconstruction looks photorealistic when viewed from a non-input [view]. In contrast, our method reconstructs the entire portrait with accurate photorealistic details.’

The authors also observe that the geometry output by ROME and HeadNeRF is not as faithful as that of the new system, and that these methods often reconstruct only part of the head (see image below):

The paper features more qualitative result images than we are able to include here, so please refer to it for further details.
In the quantitative round, with the exception of SSIM (which the authors have addressed; see also below), the new system leads the former approaches:

Here, the authors assert:
‘[Our] model significantly outperforms the baselines on all the metrics except SSIM; our SSIM score is only marginally lower than EG3D-PTI despite the aforementioned issue of the image misalignment and the fact that EG3D-PTI directly optimizes the pixels for the evaluation view. The geometry evaluation [on] H3DS in which we compare the depths of the ground truth from the input view as predicted by each model validates that our models produce more accurate 3D geometry.’
Conclusion
In projects such as this, we are arguably witnessing a process akin to the development of the earliest video compression codecs, or to the evolution of the digitization of information, which began in 1679, but had to wait until the 1980s to truly become an actionable and pervasive technology.
The ability to ‘neuralize’ people, as well as objects and environments, may eventually jettison traditional 2D formats (such as video and static images), as future capture devices will create neural representations natively and without the need for interstitial ‘flat’ imagery.
This landmark change will convert multimedia from a fixed and rasterized format into a medium as editable as text currently is in word processors and plain text documents.
Besides its incremental improvement over EG3D and the other prior networks challenged in the new paper, what’s notable about the work is the degree to which it downplays the transformative potential of neural capture. This reticence has grown in image synthesis research since the advent of Stable Diffusion and the ChatGPT series turned AI from a cultural curiosity into an apparently existential threat to the livelihoods of multiple professions, and particularly since the artistic and creative community has developed a spirit of resistance against the emerging and rapidly developing capabilities of these systems.