NeRF Breaks Free From Being an ‘Animated Photo’

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

With all the current clamor around Stable Diffusion, Neural Radiance Fields (NeRF) is not getting a lot of love lately. Though hardly a ‘legacy’ technology (it only came on the scene in 2020), it’s arguably seen as a hamstrung approach to neural rendering, compared to the rich and explorable latent space of a latent diffusion network, or even a Generative Adversarial Network (GAN).

Resistant to editing, hard to include in a deepfakes pipeline, and more suited to urban scenes and static representations than human synthesis, NeRF is nonetheless perhaps the most accurate neural representation technology currently available for the human form – but it’s not the most imaginative.

Unlike Stable Diffusion, you can’t ‘search’ the latent space of a NeRF for hidden treasures, since a NeRF representation is pretty much limited to whatever material was present in the photos from which it derived its network. If that material is of a woman in a studio, you won’t be able to intervene in the trained model produced from those source images (at least not in any easy or meaningful way). To paraphrase John Hammond in Jurassic Park (1993), NeRF is ‘kind of a ride‘.

NVIDIA's Instant NeRF derives a complex and explorable neural scene from just four 'real' photos, complete with realistic depth of field – though resolution, expression accuracy and mobility remain major challenges to high-resolution inference. Source: https://www.youtube.com/watch?v=DJ2hcC1orc4

PersonNeRF

However, a new collaboration between the University of Washington and Google Research offers hope that NeRF could be developed into a richer and more disentangled space. PersonNeRF offers a neural radiance field representation that’s trained not on several viewpoints of the same thing (usually taken at the same moment), as is the traditional pipeline with NeRF; but rather on multiple abstract photos of the same person, from varying views and wearing various different types of clothing.

PersonNeRF allows the user to explore a 'cube' of all the generalized facets from multiple variegated photo sources. Source: https://grail.cs.washington.edu/projects/personnerf/

In the test case for PersonNeRF, the researchers gathered a small multi-year dataset of images of Swiss former professional tennis player Roger Federer, and trained it into a network capable of generalizing Federer's appearance from the input images.

PersonNeRF generalizes well enough on even limited source data that the user can specify camera view, body pose and appearance (including clothes, so long as the clothes are exemplified in the training dataset). Source: https://arxiv.org/pdf/2302.08504.pdf

Unlike a system such as Stable Diffusion, which has millions of abstract and similar images from which to concoct new poses and configurations on a theme, PersonNeRF is limited to representations that were included in the dataset. Nonetheless, the proof of concept that it offers signifies that Neural Radiance Fields are amenable to a generalized and explorable space, with the potential to train on a much larger and more varied set of images in order to increase the diversity of possible outputs.

The new paper is titled PersonNeRF: Personalized Reconstruction from Photo Collections, and comes from four researchers variously associated with the University of Washington and Google Research.

Approach

The starting point for PersonNeRF was the HumanNeRF project, mostly from the same research group. HumanNeRF was able to convert people depicted in YouTube videos into discrete and explorable neural representations:

Extracting a human into a NeRF space with HumanNeRF, by the same research group. Source: https://www.youtube.com/watch?v=GM-RoZEymmw

According to the researchers, the new work is directly evolved from HumanNeRF, but with some limitations removed in order to allow the system to generalize a labeled individual from multiple and only semi-related photographs, similar to the way that latent diffusion and GAN systems broadly extract features from generic and wide-ranging source material.

The paper notes that the project’s central insight is that multiple and varied photos of a person can be resolved into a single canonical space, i.e., a single ‘reference entity’ from which desired ‘divergences’ (of pose, dress, etc.) can be made. This is arguably the closest NeRF has yet gotten to encoding truly ‘abstract’ concepts into a neural representation.

Developing the essential 'canonical space' of the Roger Federer 'entity'.

PersonNeRF removes the mapping of non-rigid components from HumanNeRF, and uses only skeletal motion in its novel regularization formula. The project borrows from ideas developed in the 2022 RegNeRF initiative, encouraging geometric smoothing through a 'depth smoothness loss' on rendered depth maps. The authors note that this method can produce 'haze' artifacts from semi-transparent geometry (i.e., the space between depicted objects in the NeRF), which must be remediated by a dedicated opacity loss.
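To make these two regularizers concrete, here is a minimal numpy sketch of a RegNeRF-style depth smoothness penalty on a rendered depth patch, paired with a binary-entropy opacity penalty that pushes per-ray accumulated opacity towards fully opaque or fully transparent. The function names and exact formulations are illustrative assumptions, not the paper's implementation:

```python
import numpy as np

def depth_smoothness_loss(depth_patch):
    """RegNeRF-style smoothness penalty: squared differences between
    horizontally and vertically adjacent samples in a rendered depth patch."""
    dh = depth_patch[:, 1:] - depth_patch[:, :-1]   # horizontal neighbours
    dv = depth_patch[1:, :] - depth_patch[:-1, :]   # vertical neighbours
    return float(np.sum(dh ** 2) + np.sum(dv ** 2))

def opacity_loss(alphas, eps=1e-6):
    """Binary-entropy penalty pushing per-ray accumulated opacity towards
    0 or 1, discouraging the semi-transparent 'haze' described above."""
    a = np.clip(alphas, eps, 1.0 - eps)
    return float(np.mean(-(a * np.log(a) + (1 - a) * np.log(1 - a))))

# A perfectly flat depth patch incurs zero smoothness penalty.
flat = np.ones((4, 4))
print(depth_smoothness_loss(flat))  # 0.0
```

Note that the smoothness term alone is happy to flatten depth through semi-transparent regions, which is precisely why a separate opacity term is needed to keep geometry solid.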

The collections of photos used in the Federer and other tests for the paper are subdivided into appearance sets that denote photos taken around the same period.

From the paper, examples of the diverse photos used in the project.

Instead of optimizing each desired facet (appearance consistency and pose consistency) in a separate network, the training centers on a single multi-layer perceptron (MLP) for canonical appearance, into which all the labeled material is passed, training on the entire set of body poses in a single workflow.

The canonical MLP is ‘inspired’ by the architecture of the 2021 NeRF in the Wild project, with each appearance set bound to a single appearance embedding vector. This vector is concatenated with a novel pose embedding vector, which conditions the system’s pose correction module for each appearance set.
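The embedding scheme can be sketched as follows. The dimensions, the names `appearance_table` and `pose_embedding`, and the use of plain numpy arrays in place of learnable parameters are all illustrative assumptions rather than the paper's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)

N_SETS, APP_DIM, POSE_DIM = 5, 8, 8   # illustrative sizes

# One appearance embedding per appearance set (photos taken around
# the same period), in the style of NeRF in the Wild; in a real
# system these rows would be optimized during training.
appearance_table = rng.normal(size=(N_SETS, APP_DIM))
pose_embedding = rng.normal(size=POSE_DIM)

def condition_vector(set_id):
    """Concatenate a set's appearance embedding with the pose embedding;
    the result conditions the pose-correction module for that set."""
    return np.concatenate([appearance_table[set_id], pose_embedding])

v = condition_vector(2)
print(v.shape)  # (16,)
```

The point of the per-set table is that switching `set_id` swaps the rendered appearance (clothing, lighting era) while the shared canonical MLP keeps the underlying identity fixed.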

Training and Development

For the central work on the Federer dataset, the researchers collected photos by searching for particular associated sporting events across a limited span of years, from 2009 to 2020. Each event yielded 19-24 photos per year, and each set was labeled accordingly.

Body pose and camera pose on the dataset were estimated by SPIN, though the researchers had to intervene manually in cases of occlusion, such as where part of Federer’s body was obstructed by a racket or other non-intrinsic items. Had such items not been removed, they would have become essentially incorporated into the ‘Federer entity’.

The system was trained with the Adam optimizer, at varying learning rates for the canonical MLP and the rest of the network. Optimization, the paper notes, takes 200,000 iterations per game, or 600,000 iterations for all games, trained into a single network.
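A minimal sketch of what per-group learning rates under Adam look like, assuming (purely for illustration) a lower rate for the canonical MLP than for the rest of the network; the specific rates and the tiny numpy Adam implementation below are not taken from the paper:

```python
import numpy as np

def adam_step(param, grad, state, lr, beta1=0.9, beta2=0.999, eps=1e-8):
    """One Adam update for a single parameter array."""
    state["t"] += 1
    state["m"] = beta1 * state["m"] + (1 - beta1) * grad
    state["v"] = beta2 * state["v"] + (1 - beta2) * grad ** 2
    m_hat = state["m"] / (1 - beta1 ** state["t"])   # bias correction
    v_hat = state["v"] / (1 - beta2 ** state["t"])
    return param - lr * m_hat / (np.sqrt(v_hat) + eps)

def new_state(shape):
    return {"t": 0, "m": np.zeros(shape), "v": np.zeros(shape)}

# Two parameter groups with different learning rates: a slower rate
# for the canonical MLP than for the rest of the network (hypothetical
# values, chosen only to illustrate the mechanism).
groups = {
    "canonical_mlp": {"param": np.ones(3), "lr": 5e-5, "state": new_state(3)},
    "rest":          {"param": np.ones(3), "lr": 5e-4, "state": new_state(3)},
}

grad = np.ones(3)  # dummy gradient, identical for both groups
for g in groups.values():
    g["param"] = adam_step(g["param"], grad, g["state"], g["lr"])

# After one step the higher-lr group has moved further from its start.
```

In a framework such as PyTorch the same effect is achieved by passing parameter groups with distinct `lr` values to the optimizer, rather than hand-rolling the update as above.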

The advantages of generalizing on a single network: pose consistency is improved, and the network is able to resolve poses with unseen appearances, because it can draw on a greater number of related embeddings across the breadth of the input data. This is a level of generalization uncommon to NeRF-based projects.

Tests

The researchers compared PersonNeRF to their prior effort HumanNeRF, running tests on the compiled Federer datasets. To keep the comparison fair against HumanNeRF’s more restricted parameters, the authors similarly limited PersonNeRF. Each network was trained for 200,000 iterations.

Since there is no strict ground truth for synthesized and truly novel images, the researchers resorted to Fréchet Inception Distance (FID) as an arbitrating metric for the purposes of quantitative comparison.
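FID fits a Gaussian to the Inception features of each image set and measures the Fréchet distance between the two Gaussians. A minimal numpy sketch, taking precomputed feature means and covariances as input (the feature-extraction step with an Inception network is omitted, and this is a reference formula rather than the authors' evaluation code):

```python
import numpy as np

def sqrtm_psd(mat):
    """Matrix square root of a symmetric positive-semidefinite matrix
    via eigendecomposition."""
    w, v = np.linalg.eigh(mat)
    return (v * np.sqrt(np.clip(w, 0.0, None))) @ v.T

def fid(mu1, cov1, mu2, cov2):
    """Frechet Inception Distance between two feature Gaussians:
    ||mu1 - mu2||^2 + Tr(cov1 + cov2 - 2 (cov1 cov2)^(1/2))."""
    s1 = sqrtm_psd(cov1)
    # Tr((cov1 cov2)^(1/2)) == Tr((s1 cov2 s1)^(1/2)) for PSD inputs,
    # and the inner product is symmetric, so eigh applies safely.
    covmean_tr = np.trace(sqrtm_psd(s1 @ cov2 @ s1))
    diff = mu1 - mu2
    return float(diff @ diff + np.trace(cov1) + np.trace(cov2) - 2 * covmean_tr)

# Identical distributions give an FID of zero; lower is better.
mu, cov = np.zeros(2), np.eye(2)
print(round(fid(mu, cov, mu, cov), 6))  # 0.0
```

Lower FID means the distribution of synthesized images sits closer to the distribution of real ones, which is why it serves as a proxy when no pixel-level ground truth exists for novel views.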

Of the results, the authors state:

‘[Our] method outperforms HumanNeRF on all datasets by comfortable margins. The performance gain is particularly significant when visualizing the [results]. Our method is able to create consistent geometry, sharp details, and nice renderings, while HumanNeRF tends to produce irregular shapes, distorted textures, and noisy images, due to insufficient inputs.’

HumanNeRF, the authors note of these results, produces errors in regions occluded from the input view, while the greater generalization capability of PersonNeRF is able to accommodate the data gap.

Conclusion

PersonNeRF is the first Neural Radiance Field system I’ve seen with the capacity to generalize a subject in the way that a GAN or latent diffusion representation can. Unlike most comparable GAN or LDM systems, the subject matter in the network is not trained alongside vast swathes of related and unrelated data, so explorability is limited to the data chosen for training.

Nonetheless, it’s easy to imagine that later NeRF architectures adopting this approach could increase the amount of data and the scope of the labels, to form systems where ‘editing’ (NeRF’s primary disadvantage) is enabled simply by accessing a parameter of the trained system.
