NeRF Breaks Free From Being an ‘Animated Photo’

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

With all the current clamor around Stable Diffusion, Neural Radiance Fields (NeRF) is not getting a lot of love lately. Though hardly a ‘legacy’ technology (it only came on the scene in 2020), it’s arguably seen as a hamstrung approach to neural rendering, compared to the rich and explorable latent space of a latent diffusion network, or even a Generative Adversarial Network (GAN).

Resistant to editing, hard to include in a deepfakes pipeline, and more suited to urban scenes and static representations than human synthesis, NeRF is nonetheless perhaps the most accurate neural representation technology currently available for the human form – but it’s not the most imaginative.

Unlike Stable Diffusion, you can’t ‘search’ the latent space of a NeRF for hidden treasures, since a NeRF representation is pretty much limited to whatever material was present in the photos from which it derived its network. If that material is of a woman in a studio, you won’t be able to intervene in the trained model produced from those source images (at least not in any easy or meaningful way). To paraphrase John Hammond in Jurassic Park (1993), NeRF is ‘kind of a ride’.

NVIDIA's Instant NeRF derives a complex and explorable neural scene, complete with realistic depth of field, from just four 'real' photos – but resolution, expression accuracy and mobility remain major challenges to high-resolution inference. Source: https://www.youtube.com/watch?v=DJ2hcC1orc4

PersonNeRF

However, a new collaboration between the University of Washington and Google Research offers hope that NeRF could be developed into a richer and more disentangled space. PersonNeRF offers a neural radiance field representation that’s trained not on several viewpoints of the same subject (usually captured at the same moment), as in the traditional NeRF pipeline, but rather on multiple disparate photos of the same person, taken from varying views and in various different types of clothing.

PersonNeRF allows the user to explore a 'cube' of all the generalized facets from multiple variegated photo sources. Source: https://grail.cs.washington.edu/projects/personnerf/

In the test case for PersonNeRF, the researchers gathered a small multi-year dataset of photos of Swiss former professional tennis player Roger Federer, and trained it into a network capable of generalizing Federer’s image, based on the input images.

PersonNeRF generalizes well enough on even limited source data that the user can specify camera view, body pose and appearance (including clothes, so long as the clothes are exemplified in the training dataset). Source: https://arxiv.org/pdf/2302.08504.pdf

Unlike a system such as Stable Diffusion, which has millions of broadly similar images from which to concoct new poses and configurations on a theme, PersonNeRF is limited to representations that were included in the dataset. But the proof of concept it offers signifies that Neural Radiance Fields are amenable to a generalized and explorable space, with the potential to train on a much larger and more varied set of images in order to increase the diversity of possible outputs.

The new paper is titled PersonNeRF: Personalized Reconstruction from Photo Collections, and comes from four researchers variously associated with the University of Washington and Google Research.

Approach

The starting point for PersonNeRF was the HumanNeRF project, mostly from the same research group. HumanNeRF was able to convert people depicted in YouTube videos into discrete and explorable neural representations:

Extracting a human into a NeRF space with HumanNeRF, by the same research group. Source: https://www.youtube.com/watch?v=GM-RoZEymmw

According to the researchers, the new work is directly evolved from HumanNeRF, but with some limitations removed in order to allow the system to generalize a labeled individual from multiple and only semi-related photographs, similar to the way that latent diffusion and GAN systems broadly extract features from generic and wide-ranging source material.

The paper notes that the project’s central insight is that multiple and varied photos of a person can be resolved into a single canonical space, i.e., a single ‘reference entity’ from which desired ‘divergences’ (of pose, dress, etc.) can be made. This is arguably the closest NeRF has yet gotten to encoding truly ‘abstract’ concepts into a neural representation.

Developing the essential 'canonical space' of the Roger Federer 'entity'.

PersonNeRF removes the mapping of non-rigid components from HumanNeRF, and uses only skeletal motion in its novel regularization formula. The project borrows from ideas developed in the 2022 RegNeRF initiative, encouraging geometric smoothing through a ‘depth smoothness loss’ on rendered depth maps. The authors note that this method can introduce ‘haze’ artifacts from semi-transparent geometry (i.e., the space between depicted objects in the NeRF), which have to be remediated by a dedicated opacity loss.
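The general shape of these two regularizers can be sketched as follows. This is a minimal NumPy illustration of the underlying ideas – a neighboring-pixel smoothness penalty on a rendered depth map, and a binary-entropy penalty that pushes accumulated opacities away from hazy intermediate values – not the authors' exact formulation:

```python
import numpy as np

def depth_smoothness_loss(depth):
    """Penalize differences between neighboring pixels of a rendered
    (H, W) depth map, encouraging locally smooth geometry."""
    dx = np.abs(np.diff(depth, axis=1))   # horizontal neighbor differences
    dy = np.abs(np.diff(depth, axis=0))   # vertical neighbor differences
    return float(dx.mean() + dy.mean())

def opacity_loss(alpha, eps=1e-6):
    """Push accumulated ray opacities toward 0 or 1, discouraging the
    semi-transparent 'haze' that smoothness terms can introduce."""
    a = np.clip(alpha, eps, 1.0 - eps)
    return float(np.mean(-(a * np.log(a) + (1.0 - a) * np.log(1.0 - a))))
```

A perfectly flat depth map incurs zero smoothness penalty, while half-transparent 'haze' (opacity near 0.5) maximizes the entropy term, so the optimizer is steered toward clean, opaque surfaces.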

The collections of photos used in the Federer and other tests for the paper are subdivided into appearance sets that denote photos taken around the same period.

From the paper, examples of the diverse photos used in the project.

Instead of optimizing each desired facet (appearance consistency and pose consistency) in a separate network, the training centers on a single multi-layer perceptron (MLP) for canonical appearance, into which all the labeled material is passed, training on the entire set of body poses in a single workflow.

The canonical MLP is ‘inspired’ by the architecture of the 2021 NeRF in the Wild project, with each appearance set bound to a single appearance embedding vector. This vector is concatenated with a novel pose embedding vector, which conditions the system’s pose correction module for each appearance set.
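The conditioning scheme can be sketched roughly as below; the table names and dimensions are purely illustrative assumptions, not the paper's values. The key idea is that each appearance set owns one learnable appearance vector, which is concatenated with a pose embedding before being fed onward:

```python
import numpy as np

rng = np.random.default_rng(0)
n_sets, d_app, d_pose = 5, 8, 4   # illustrative sizes, not the paper's

# one learnable embedding per appearance set (photos from one period)
appearance_embeddings = rng.normal(size=(n_sets, d_app))
pose_embeddings = rng.normal(size=(n_sets, d_pose))

def conditioning_vector(set_id):
    """Concatenate the appearance and pose embeddings for one appearance
    set; this vector would condition the pose correction module."""
    return np.concatenate([appearance_embeddings[set_id],
                           pose_embeddings[set_id]])
```

Because every appearance set shares the same canonical MLP and differs only in these compact embedding vectors, the network is forced to factor out what is common to the person from what varies between photo sessions.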

Training and Development

For the central work on the Federer dataset, the researchers collected photos by searching for particular associated sporting events across years spanning 2009 to 2020, with each event yielding 19-24 photos; each set was labeled accordingly.

Body pose and camera pose on the dataset were estimated by SPIN, though the researchers had to intervene manually in cases of occlusion, such as where part of Federer’s body was obstructed by a racket or other non-intrinsic items. Without removing such items, they would have become essentially incorporated into the ‘Federer entity’.

The system was trained with the Adam optimizer, at varying learning rates for the canonical MLP and the rest of the network. Optimization, the paper notes, takes 200,000 iterations per game, or 600,000 iterations for all games, trained into a single network.
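A per-group learning-rate setup of this kind can be sketched as follows. The Adam update itself is standard, but the group names and learning-rate values here are illustrative assumptions, not figures from the paper:

```python
import numpy as np

def adam_step(p, g, m, v, t, lr, b1=0.9, b2=0.999, eps=1e-8):
    """One standard Adam update for a single parameter array."""
    m = b1 * m + (1 - b1) * g
    v = b2 * v + (1 - b2) * g * g
    m_hat = m / (1 - b1 ** t)          # bias-corrected first moment
    v_hat = v / (1 - b2 ** t)          # bias-corrected second moment
    return p - lr * m_hat / (np.sqrt(v_hat) + eps), m, v

# hypothetical parameter groups with distinct learning rates
lrs = {"canonical_mlp": 5e-4, "rest_of_network": 5e-5}
params = {k: np.array([1.0]) for k in lrs}
state = {k: (np.zeros(1), np.zeros(1)) for k in lrs}

for t in range(1, 201):
    for name in params:
        grad = 2.0 * params[name]      # gradient of a toy quadratic loss
        params[name], m, v = adam_step(params[name], grad,
                                       *state[name], t, lrs[name])
        state[name] = (m, v)
```

After these steps the higher-rate 'canonical_mlp' group has moved noticeably further toward the toy optimum than the slower group, which is the practical effect of assigning the canonical MLP its own learning rate.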

The advantages of generalizing on a single network: pose consistency is improved, and the network is able to resolve poses with unseen appearances, because it can draw on a greater number of related embeddings across the breadth of the input data. This is a level of generalization uncommon to NeRF-based projects.

Tests

The researchers compared PersonNeRF to their prior effort HumanNeRF, running tests on the compiled Federer datasets. Since HumanNeRF operates under more restricted parameters, the authors similarly limited PersonNeRF for a fair comparison. Each network was trained for 200,000 iterations.

Since there is no strict ground truth for synthesized and truly novel images, the researchers resorted to Fréchet Inception Distance (FID) as an arbitrating metric for the purposes of quantitative comparison.
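FID measures the Fréchet distance between two Gaussians fitted to Inception-network features of the real and generated image sets. Leaving the feature extraction aside, the distance itself can be written as a short NumPy sketch:

```python
import numpy as np

def psd_sqrt(M):
    """Matrix square root of a symmetric positive semi-definite matrix,
    via eigendecomposition."""
    w, V = np.linalg.eigh(M)
    return V @ np.diag(np.sqrt(np.clip(w, 0.0, None))) @ V.T

def fid(mu1, sigma1, mu2, sigma2):
    """Frechet distance between N(mu1, sigma1) and N(mu2, sigma2):
    ||mu1 - mu2||^2 + Tr(S1 + S2 - 2 (S1^0.5 S2 S1^0.5)^0.5)."""
    s1h = psd_sqrt(sigma1)
    covmean = psd_sqrt(s1h @ sigma2 @ s1h)
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(sigma1 + sigma2 - 2.0 * covmean))
```

Identical feature distributions score zero, and the score grows as the generated images drift from the real ones in mean or covariance – which is why a lower FID indicates better synthesis.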

Of the results, the authors state:

‘[Our] method outperforms HumanNeRF on all datasets by comfortable margins. The performance gain is particularly significant when visualizing the [results]. Our method is able to create consistent geometry, sharp details, and nice renderings, while HumanNeRF tends to produce irregular shapes, distorted textures, and noisy images, due to insufficient inputs.’

HumanNeRF, the authors note of these results, produces errors in areas where the regions are occluded from the input view, while the greater generalization capabilities of PersonNeRF are able to accommodate the data gap.

Conclusion

PersonNeRF is the first Neural Radiance Field system I’ve seen with the capacity to generalize a subject in the same way that a GAN or latent diffusion representation can. Unlike most comparable GAN or LDM systems, the subject matter in the network is not trained alongside vast swathes of related and unrelated data, so explorability is limited to such data as has been chosen for training.

Nonetheless, it’s easy to imagine that later NeRF architectures adopting this approach could increase the amount of data and the scope of the labels to form systems where ‘editing’ (NeRF’s primary disadvantage) is enabled by simply accessing a parameter of the trained system.
