Editing Neural Radiance Fields with DreamBooth

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

Though latent diffusion models – such as Stable Diffusion – have extraordinary generative capabilities, they have no native ability to produce temporally coherent video. Once a text prompt is sent down a random path by a random seed in a latent diffusion system, it leads to one unique frame – but the next frame could be notably different. Even if the next frame is only slightly different, the results are inevitably a little ‘psychedelic’.

Temporal inconsistency is a native bugbear in image-to-image procedures in Stable Diffusion. Source: https://old.reddit.com/r/StableDiffusion/comments/14k2zxm/harry_squatter/

Despite the general public expectation that these roadblocks will be solved imminently, and despite the sleight-of-hand involved in ‘workaround’ text-to-video techniques such as Runway’s Gen 2 system, the research sector is beginning to consider latent diffusion models less in the light of standalone systems, and more as possible adjunct technologies for existing neural systems that do have the capacity to produce temporally coherent video content.

The sector went through exactly the same process of premature excitement, disappointment and adaptation 4-5 years ago, when it became clear that Generative Adversarial Networks (GANs) had the same innate drawback. Since then, GANs have become components in broader systems, many based around CGI control interfaces.

One such ‘stable’ technology is Neural Radiance Fields (NeRF), which can compose complex 3D neural representations from a small handful of images.


NVIDIA's Instant NeRF derives a complex and explorable neural scene from just four 'real' photos, complete with realistic depth of field. Source: https://www.youtube.com/watch?v=DJ2hcC1orc4

Though NeRF is well capable of producing consistent static or moving neural scenes, it comes with its own set of drawbacks – not the least of which is that it is extraordinarily difficult to edit a NeRF. This is due to the way that the scene’s volume is created by ray-tracing estimated pixel positions, which ‘bakes’ texture into the same space as the geometry for the evaluated 3D coordinates. Essentially, unlike CGI, the texture is the geometry in NeRF.

The NeRF capture process is similar to CGI ray-tracing, building up an interpretive neural network composed of pixel values with 3D (instead of just 2D) coordinates, and with transparency (alpha) channels, so that glass and empty or ‘cut-out’ sections of geometry can be correctly interpreted. Source: https://www.youtube.com/watch?v=JuH79E8rdKc
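The entanglement described above can be seen in the standard volume-rendering sum that NeRF uses to color a pixel. The sketch below is a minimal illustration in plain Python, not code from any NeRF implementation; the function name `composite_ray` and the toy sample values are assumptions made for this example.

```python
import math

def composite_ray(densities, colors, step):
    """Alpha-composite samples along one ray, front to back.

    densities: volume density (sigma) at each sample
    colors:    (r, g, b) tuple at each sample
    step:      distance between neighbouring samples
    """
    transmittance = 1.0          # fraction of light not yet absorbed
    pixel = [0.0, 0.0, 0.0]
    for sigma, rgb in zip(densities, colors):
        alpha = 1.0 - math.exp(-sigma * step)   # opacity of this segment
        weight = transmittance * alpha
        for channel in range(3):
            pixel[channel] += weight * rgb[channel]
        transmittance *= 1.0 - alpha
    return pixel, transmittance


# Toy ray: empty space, then a dense red surface, then a green sample behind it.
pixel, remaining = composite_ray(
    densities=[0.0, 50.0, 50.0],
    colors=[(0.0, 0.0, 1.0), (1.0, 0.0, 0.0), (0.0, 1.0, 0.0)],
    step=0.5,
)
```

Because each sample's color is weighted by its own density and by everything in front of it, the pixel comes out almost purely red here: the green sample is hidden, and the zero-density blue sample contributes nothing. Changing 'texture' or 'shape' means changing the same set of values.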

There is no way to stratify these two processes at creation time – they’re intrinsically bound together, leading to deeply entangled neural representations. By default, changing anything about the geometry or texture is akin to removing salt from soup.

In theory, there is nothing to prevent ‘pre-modifying’ NeRF content by altering the contributing images before they are ingested into the NeRF training stage. However, as we have observed, Stable Diffusion cannot produce truly consistent imagery, which means that the resulting NeRF would also ‘shimmer’ and demonstrate inconsistencies. The same applies to GAN image editing.

That only leaves the use of traditional CGI image-editing – an arguably pointless procedure, since CGI itself can accomplish all that NeRF can without needing to navigate the mysteries of a latent space.

DreamEditor: DreamBooth as a NeRF-Editing System

A new collaboration between China and the US is seeking to create a better bridge between the stability of NeRF and the generative versatility of LDMs, in the form of DreamEditor – a framework that uses the Google-designed DreamBooth interpretative system to help select and transform specific areas of a NeRF.

DreamEditor allows NeRF facets to be edited with text-prompts, via Stable Diffusion and DreamBooth. Source: https://arxiv.org/pdf/2306.13455.pdf

The system utilizes the text-encoder of a pretrained LDM to isolate the areas of the NeRF that will be replaced, based only on the contents of text instructions.

Selecting areas with text.

Though this may seem trivial to those familiar with selection tools in image-editing applications such as Photoshop, or CGI applications such as Maya or Cinema4D, it is a notable achievement for a purely neural procedure, where the areas to be selected do not exist in a purely linear space, but rather in a concept/pixel construct which has to be addressed semantically.

The new method achieves consistency by fine-tuning the LDM with DreamBooth, and through an elaborate system of optimization and targeting. In qualitative and quantitative tests, including a human survey, DreamEditor was able to generally improve on analogous prior systems, in some cases by a notable degree.

Further examples of specific aspect-editing.

Previous systems have achieved a similar scope by entirely reinterpreting images, instead of selecting sections and changing the contents; but this approach comes at a cost of coherence and continuity as the user explores the NeRF, or else risks a well-generalized but poorly-executed edit.

Though the new work mentions a number of supplementary videos, including 360-degree traversals, these are not currently available. We have contacted the authors regarding this.

The new paper is titled DreamEditor: Text-Driven 3D Scene Editing with Neural Fields, and comes from five authors across China’s Sun Yat-sen University, and the University of Pennsylvania in the USA.

Approach

After the creation of the NeRF, the DreamEditor framework is divided into three stages. First, the original Neural Radiance Field is converted into a mesh-based neural field, via the Marching Cubes method, which ‘polls’ the estimated x/y/z coordinates of the NeRF and translates them into a conventional coordinate space, such as might be found in a traditional CGI environment. This process makes selection possible, since there are now rational and addressable places in which to make a selection.

Marching cubes capture coordinates from neural geometry and convert them into conventional mesh-based coordinates. Source: https://polycoding.net/marching-cubes/part-1/
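The two per-cell steps at the heart of Marching Cubes can be sketched as follows. This is a partial illustration only: a full implementation also needs the 256-entry triangle lookup table, which is omitted here. The sphere field, the corner ordering and the function names are assumptions for the example (implementations differ on corner order), and none of this is taken from the DreamEditor code.

```python
def sphere_sdf(x, y, z, radius=1.0):
    """Signed distance to a sphere: negative inside, positive outside."""
    return (x * x + y * y + z * z) ** 0.5 - radius

# Corner offsets of a cell in one fixed order (an assumption; orderings vary).
CORNERS = [(0, 0, 0), (1, 0, 0), (1, 1, 0), (0, 1, 0),
           (0, 0, 1), (1, 0, 1), (1, 1, 1), (0, 1, 1)]

def cube_index(origin, size, field, iso=0.0):
    """Pack the inside/outside state of the 8 corners into one byte."""
    index = 0
    for bit, (dx, dy, dz) in enumerate(CORNERS):
        value = field(origin[0] + dx * size,
                      origin[1] + dy * size,
                      origin[2] + dz * size)
        if value < iso:
            index |= 1 << bit
    return index

def edge_crossing(p0, p1, v0, v1, iso=0.0):
    """Linearly interpolate where the surface crosses the edge p0-p1."""
    t = (iso - v0) / (v1 - v0)
    return tuple(a + t * (b - a) for a, b in zip(p0, p1))

# A cell straddling the sphere surface: four corners inside, four outside.
index = cube_index((0.75, -0.25, -0.25), 0.5, sphere_sdf)

# Where the surface cuts an edge running along the x axis.
start, end = (0.0, 0.0, 0.0), (2.0, 0.0, 0.0)
crossing = edge_crossing(start, end, sphere_sdf(*start), sphere_sdf(*end))
```

In a complete implementation, `index` would select which edges of the cell carry triangle vertices, and `edge_crossing` would place each vertex; repeating this over a grid yields the explicit mesh.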

Next, the text-to-image model is customized to the NeRF content using DreamBooth, for a sparse 500 iterations, and the generated cross-attention maps are utilized to identify the correct editing area. Lastly, the target NeRF object is directly edited via text-prompts through a latent diffusion model, using the Score Distillation Sampling (SDS) loss introduced by DreamFusion.

Conceptual workflow for DreamEditor.

Essential to this process is the distillation of the initial obtained NeRF. Here the authors have followed the basic methodology of the prior method NeuMesh, which adopts the Marching Cubes conversion mechanism. During this process, the original NeRF becomes the teacher in a teacher-student model, where the original NeRF’s rays are randomly sampled and converted into exemplar coordinates in conventional space. Eikonal loss is added on the sampled points to regularize them.
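The Eikonal regularizer mentioned above penalizes a distance field whose gradient norm strays from 1, which is the defining property of a true signed distance function. A minimal sketch follows, using finite differences instead of automatic differentiation to stay dependency-free; the two test fields and all function names are assumptions for this example, not the authors' code.

```python
def gradient_norm(field, point, h=1e-4):
    """Central-difference estimate of ||grad field|| at a 3D point."""
    total = 0.0
    for axis in range(3):
        forward = list(point)
        backward = list(point)
        forward[axis] += h
        backward[axis] -= h
        partial = (field(*forward) - field(*backward)) / (2.0 * h)
        total += partial * partial
    return total ** 0.5

def eikonal_loss(field, points):
    """Mean squared deviation of the gradient norm from 1."""
    return sum((gradient_norm(field, p) - 1.0) ** 2 for p in points) / len(points)

def true_sdf(x, y, z):
    """Exact signed distance to a unit sphere (gradient norm is 1)."""
    return (x * x + y * y + z * z) ** 0.5 - 1.0

def stretched_field(x, y, z):
    """Same zero level set, but distances are doubled (gradient norm is 2)."""
    return 2.0 * true_sdf(x, y, z)

samples = [(0.3, 0.2, 0.1), (1.5, -0.4, 0.7), (-0.8, 0.9, 0.2)]
loss_sdf = eikonal_loss(true_sdf, samples)
loss_stretched = eikonal_loss(stretched_field, samples)
```

Both fields describe the same sphere surface, but only the first is a valid distance field, so only the second is penalized. In training, this term is added to the distillation loss so that the student field stays well-behaved away from the surface.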

In the second stage, the text-to-image model is fine-tuned on the NeRF’s input views with DreamBooth, running on Hugging Face, which adapts the model to the scene. The fine-tuned model is then used to generate the necessary 2D masks for each rendered view (remember that the 3D geometry consists of concatenated ‘flat’ pictures).

The noisy generated images from DreamBooth are fed back into the diffusion model (i.e., Stable Diffusion) for denoising, which is the native and core functionality of an LDM. Attention maps are extracted for the target keywords (e.g., ‘giraffe’) which will form part of the transformative process. The various attention maps are optimized and regularized, to eventually form a composite aggregated attention map.
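As a rough illustration of how several attention maps for one keyword might be combined into a single 2D mask, the sketch below averages the maps, rescales the result to the range 0 to 1, and thresholds it. This is a deliberately simplified stand-in: the paper's own optimization and regularization are more elaborate, and the averaging scheme, the 0.5 threshold, the toy values and the function names are all assumptions.

```python
def aggregate_attention(maps):
    """Average several same-sized 2D attention maps element-wise."""
    rows, cols = len(maps[0]), len(maps[0][0])
    return [[sum(m[r][c] for m in maps) / len(maps) for c in range(cols)]
            for r in range(rows)]

def normalize(attention):
    """Min-max scale a 2D map into the range [0, 1]."""
    flat = [v for row in attention for v in row]
    low, high = min(flat), max(flat)
    span = (high - low) or 1.0   # avoid dividing by zero on a constant map
    return [[(v - low) / span for v in row] for row in attention]

def to_mask(attention, threshold=0.5):
    """Binarize a normalized map: 1 marks pixels inside the edit region."""
    return [[1 if v >= threshold else 0 for v in row] for row in attention]

# Two toy 3x3 maps for one keyword, e.g. taken from two denoising steps.
step_a = [[0.1, 0.2, 0.1], [0.2, 0.9, 0.8], [0.1, 0.7, 0.9]]
step_b = [[0.0, 0.1, 0.2], [0.1, 0.7, 0.9], [0.2, 0.8, 0.7]]
mask = to_mask(normalize(aggregate_attention([step_a, step_b])))
```

Here the keyword's attention is concentrated in the lower-right of the view, so only those four pixels end up in the mask.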

Further examples of DreamEditor's transformative capacities.

Elevation and azimuth angles are sampled at 45-degree intervals, in order to provide enough representative coverage. The system back-projects all the areas that will be affected by the edit straight into the mesh, and the editing area is thus delineated.
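The viewpoint sampling can be pictured as placing cameras on a sphere around the object. In the sketch below, only the 45-degree step comes from the description above; the elevation range, the orbit radius, the z-up axis convention and the function names are assumptions made for illustration.

```python
import math

def camera_position(elevation_deg, azimuth_deg, radius=1.0):
    """Convert spherical angles to a camera position orbiting the origin."""
    elevation = math.radians(elevation_deg)
    azimuth = math.radians(azimuth_deg)
    return (
        radius * math.cos(elevation) * math.cos(azimuth),
        radius * math.cos(elevation) * math.sin(azimuth),
        radius * math.sin(elevation),   # z is 'up' in this sketch
    )

def sample_views(step_deg=45, elevations=(-45, 0, 45), radius=1.0):
    """List (elevation, azimuth, position) for every sampled viewpoint."""
    views = []
    for elevation in elevations:
        for azimuth in range(0, 360, step_deg):
            position = camera_position(elevation, azimuth, radius)
            views.append((elevation, azimuth, position))
    return views

views = sample_views()
```

Under these assumptions there are 24 viewpoints (3 elevations by 8 azimuths). A mask would be computed for each rendered view and then projected back onto the mesh vertices visible from that camera.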

In the last step, the editing region is optimized using the SDS loss from the 2022 DreamFusion text-to-3D initiative, which also leverages NeRF. This helps the scene to conform to the injected text prompt. To accomplish this, random renders are passed to the LDM (since the objective is representative generalization and not absolute conformity) together with the selected text prompt, and the resulting gradients are backpropagated into the neural field (i.e., the original NeRF environment). Since DreamFusion uses Google’s unreleased Imagen framework, the SDS loss is calculated via the open source Stable Diffusion instead.
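The SDS update can be summarized as: add noise to a render, ask the diffusion model to predict that noise given the prompt, and use the weighted difference between predicted and actual noise as the gradient. The toy below shows that loop in plain Python. It makes heavy assumptions: `toy_noise_predictor` is a hypothetical stand-in for the text-conditioned denoiser (it behaves as though the prompt described exactly one target image), the weighting is one common choice, and the 'render' is optimized directly, where a real pipeline would push the gradient back through the renderer into the neural field.

```python
import math
import random

def toy_noise_predictor(noisy, alpha_bar, target):
    """Hypothetical stand-in for the text-conditioned denoising network."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x_t - a * t) / s for x_t, t in zip(noisy, target)]

def sds_gradient(render, target, rng):
    """One SDS gradient with respect to the rendered pixels."""
    alpha_bar = rng.uniform(0.02, 0.98)          # random noise level
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    noise = [rng.gauss(0.0, 1.0) for _ in render]
    noisy = [a * x + s * e for x, e in zip(render, noise)]
    predicted = toy_noise_predictor(noisy, alpha_bar, target)
    weight = 1.0 - alpha_bar                     # w(t), one common choice
    return [weight * (p - e) for p, e in zip(predicted, noise)]

rng = random.Random(0)
render = [0.0, 0.0, 0.0, 0.0]    # stands in for a rendered view
target = [0.2, 0.8, 0.5, 1.0]    # what the toy predictor steers toward
start_error = sum((x - t) ** 2 for x, t in zip(render, target))
for _ in range(200):
    grad = sds_gradient(render, target, rng)
    render = [x - 0.5 * g for x, g in zip(render, grad)]
end_error = sum((x - t) ** 2 for x, t in zip(render, target))
```

Note that the gradient skips the denoiser's own Jacobian: the difference between predicted and injected noise is treated as the direction in which the render should move, which is what makes the approach tractable with a large pretrained model.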

Data and Tests

The researchers conducted various rounds of quantitative and qualitative tests on a variety of subjects, including single-object scenes, human faces and indoor environments. To test the general applicability of the system, the scenes were drawn from diverse datasets featuring different levels of complexity: Large Scale Multi-view Stereopsis Evaluation (DTU); BlendedMVS; Facebook’s Common Objects In 3D (CO3D); and Geometric Learning with 3D Reconstruction (GL3D). To evaluate edit quality, 50 viewpoints around the editing scene were sampled.

Baselines used for comparison were D-DreamFusion; Stable Diffusion and DreamBooth; and Instruct-NeRF2NeRF (IN2N). The researchers created a dedicated metric for the study, titled CLIP Text-Image directional similarity, and used it alongside a human evaluation (this basic approach was formerly used in IN2N). Four scenes and 20 editing operations were covered in the tests.
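A directional similarity score of this kind is usually computed as the cosine between two 'edit directions' in CLIP space: the change between the image embeddings before and after the edit, and the change between the text embeddings of the source and target captions. The sketch below shows only the arithmetic; the three-dimensional vectors are toy stand-ins for real CLIP embeddings, and the function names are assumptions.

```python
import math

def cosine(u, v):
    """Cosine similarity between two vectors of equal length."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def directional_similarity(img_before, img_after, txt_before, txt_after):
    """Cosine between the image-space and text-space edit directions."""
    image_direction = [b - a for a, b in zip(img_before, img_after)]
    text_direction = [b - a for a, b in zip(txt_before, txt_after)]
    return cosine(image_direction, text_direction)

# Toy embeddings for a 'horse' to 'giraffe' edit.
img_horse, img_giraffe = [1.0, 0.0, 0.2], [0.0, 1.0, 0.2]
txt_horse, txt_giraffe = [0.9, 0.1, 0.0], [0.1, 0.9, 0.0]

# An edit that moves the image the same way the caption moved.
score_good = directional_similarity(img_horse, img_giraffe, txt_horse, txt_giraffe)

# An edit that changes the image in a direction unrelated to the caption.
img_unrelated = [1.0, 0.0, 0.9]
score_none = directional_similarity(img_horse, img_unrelated, txt_horse, txt_giraffe)
```

A score near 1 means the image changed in the way the text change asked for; a score near 0 means the image changed, but not in the requested direction.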

In all tests, the 2021 NeuS framework was used to generate the vanilla NeRF neural environment/object. Unusually, the version of Stable Diffusion employed was the base V2 model (which garnered the opprobrium of the user community at the time of release, and was quickly succeeded by V2.1).

During the distillation process, the Adam optimizer was used over 100,000 iterations, implemented in PyTorch.

Images from the qualitative results are featured in the above panel illustrations in this article.

Regarding the qualitative round, the authors state:

‘Results demonstrate that our method can effectively perform targeted editing of neural fields in various scenes…our method can accurately determine the horse sculpture [image below] as the editing region, subsequently turning it into a deer or giraffe with high-quality textures and geometry.’

In quantitative tests, for both the novel CLIP-based metric and the user study, the new system obtained notably higher results than the rival systems:

Results from the quantitative tests.

Regarding DreamEditor’s superiority in the human-evaluated round, the authors observe:

‘[Our] method receives over 81.1% of the votes, surpassing the other methods by a significant margin. This further demonstrates DreamEditor can achieve much higher user satisfaction across various scenes.’

Editing results from the tests.

The authors concede that the method has some limitations, chiefly in the form of the ‘Janus problem’, which affects DreamFusion, and has a knock-on effect on the functionality of DreamEditor. This issue can cause multiple faces to be rendered in certain eventualities (the Janus problem is named after the two-faced Roman god). Additionally, DreamEditor cannot directly model environmental lighting, which would therefore require some consideration in the editing of the input material. Finally, self-occlusions can occasionally be problematic, since these are hard to factor into the workflow.

Conclusion

The neural image synthesis scene is currently akin to the three wise monkeys, in that each notable framework (NeRF, GAN, and latent diffusion models) is missing at least one vital element necessary for a discrete and self-contained generative system that would be capable of producing temporally coherent and structurally consistent neural representations.

Fine-tuning methods such as DreamBooth offer at least a rudimentary approximation of the reproducibility and consistency associated with traditional mesh and texture-mapping approaches from CGI; but these represent, as with DreamEditor, bottlenecks and relatively clumsy workarounds.

That the first temporally consistent generative system will be a composite of diverse technologies, rather than a discrete and self-contained solution, seems inevitable at this point, barring some unpredictable breakthrough that uses an alternate and entirely novel fourth approach.

The main question is, arguably: which combination of the existing approaches is likely to be the most rational and effective?

NeRF remains the likeliest ‘host’ environment for the superior editability and versatility of text-to-image systems, yet is hamstrung by the extent to which its output is ‘baked’ and difficult to penetrate, edit or even affect; GAN has become eminently editable in a series of bold and impressive research initiatives over the last 18 or so months, but has no ability to understand time; and latent diffusion models, including Stable Diffusion, are in a similar position, in addition to lacking some of the fidelity that’s achievable in GAN.

Most solutions that seek to integrate superior generative systems such as LDMs into a NeRF workflow achieve the integration through quite tortuous bridging methods, such as 3DMM-based parametric CGI meshes. If a version of NeRF could be developed that natively addressed the need to incorporate ‘standard’ and addressable X/Y/Z space at training time, a new and potentially fruitful canvas would be available to ancillary image-editing systems. Time will tell if this kind of less ‘hermetic’ version of NeRF is achievable, or if complex bridging systems will continue to be needed.
