Over the last several months, a trend has emerged in neural synthesis papers towards the use of ‘traditional’ CGI techniques as a layer of familiar, user-friendly instrumentality for AI-based VFX pipelines – pipelines that wish to take advantage of the extraordinary realism of generative AI while retaining a greater level of control in production scenarios with demanding deadlines.
This hijacking of old-school CGI for AI purposes is not new – for well over twenty years, parametric models such as 3D Morphable Models (3DMMs), Skinned Multi-Person Linear Model (SMPL) and – more recently – the 2017 FLAME framework, have been used in neural rendering workflows and experiments.
Traditional CGI powers a neural approach with deepfake human avatars. Source: https://www.youtube.com/watch?v=lVEY78RwU_I
Many people do not understand the difference between traditional CGI human representations and AI-generated faces and bodies. Though we’ll be addressing this in a dedicated article soon, the core difference is that traditional CGI uses mathematical techniques that date back 40-50 years to ‘glue’ textures of people’s faces and clothes, etc., to a math-based wire-frame.
If you can imagine sculpting a person out of clothes-hanger wire, covering it in papier-mâché and painting it to look like a person, old-school CGI is the digital equivalent of that.
By contrast, modern generative systems such as Generative Adversarial Networks (GANs), Neural Radiance Fields (NeRF) and Stable Diffusion, train neural models on hundreds, thousands, or even millions of images, sometimes attaching word-based concepts to this training data.
If you were asked to make a drawing of ‘a man’ or ‘a woman’, every person you have ever actually seen would contribute something to that non-specific drawing – which is roughly how AI has learned to do this too.
So, the resulting trained AI model represents a latent space in which all this data has been deeply assimilated. The hyper-real data in the latent space can then be exploited for human representations that are far more realistic than older CGI techniques can achieve.
The trouble is that getting exactly what you want out of a trained latent space can be like attempting a three-point turn in an oil-tanker, since a latent space has no native instrumentality. It is essentially a lake of data with no boats, and no really good maps that show where the fish are.
Conversely, you can control the hell out of CGI, down to the finest motion and the very last pixel – but no matter how hard VFX pipelines try, the results frequently end up in the ‘uncanny valley’, because (among other reasons) the processes that lead to the result are labored and unnatural.
Using CGI to Edit Generative AI
Though there is currently a growing amount of interest in using CGI-based heads and bodies to control or interpret neural data, one recent paper has gone a step further – by proposing a workflow that transforms an AI-generated image into a CGI-style mesh/model, lets the user change many things about the model, and then re-inserts the result back into the image.
The new approach, which comes from New York University and Intel Labs, is called Image Sculpting, and uses an extraordinary variety of techniques, including NeRF and DreamBooth, to accomplish this highly flexible interstitial stage.
The paper states:
‘The framework supports precise, quantifiable, and physically-plausible editing options such as pose editing, rotation, translation, 3D composition, carving, and serial addition. It marks an initial step towards combining the creative freedom of generative models with the precision of graphics pipelines.’
Click to play. Examples of ways in which Image Sculpting can operate on 2D generations that have been interpreted into a standard 2D space. See full video at end of article for more. Source: https://www.youtube.com/watch?v=qdk6sVr47MQ
The paper does not cover the generation of the source image, which could be from any source (and which, logically, could also be a real photo). Once the image is chosen, the central subject matter (such as an astronaut, in the main examples provided) is isolated using semantic segmentation – specifically with the wildly popular 2023 Segment Anything (SAM) framework, from Meta AI Research and FAIR.
Since the resulting NeRF object is not suitable for subsequent deformation or editing processes, its volume is converted into a mesh using the threestudio project, which provides a Signed Distance Function (SDF) representation.
This process involves the Marching Cubes algorithm, which evaluates occupancy and vacancy across the target space and extracts an isosurface that can be texture-mapped.
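A minimal sketch of the occupancy idea, assuming a simple sphere SDF as a stand-in for the system's learned distance field: the SDF is sampled on a voxel grid, and voxels where it is negative count as ‘inside’. Marching Cubes then walks the cells of that grid and emits triangles wherever the sign flips.

```python
import numpy as np

# Sample a signed distance function (SDF) on a voxel grid.
# Negative values lie inside the surface, positive values outside;
# the isosurface sits where the SDF crosses zero.
res = 64
axis = np.linspace(-1.0, 1.0, res)
x, y, z = np.meshgrid(axis, axis, axis, indexing="ij")
sdf = np.sqrt(x**2 + y**2 + z**2) - 0.5  # toy sphere of radius 0.5

# Occupancy evaluation: which voxels are inside the surface
occupied = sdf < 0.0
print(occupied.sum(), "of", sdf.size, "voxels occupied")

# Marching Cubes inspects each cell's eight corners and triangulates
# the zero-crossing; in practice a call such as
#   verts, faces, _, _ = skimage.measure.marching_cubes(sdf, level=0.0)
# returns the texture-mappable isosurface.
```

The print statement is only a sanity check; the useful output of the real step is the triangle mesh.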
Both the discrete surface and its texture are obtained with techniques developed for the 2020 NVIDIA publication Modular Primitives for High-Performance Differentiable Rendering.
By this stage, we are truly in the pre-AI era, as we now have an old-school mesh that can be manipulated via a number of venerable techniques, some of which date back certainly to the early 1990s, if not earlier. One such is space deformation, where a ‘halo’ of control area surrounds the mesh. As the halo is warped, the mesh comes along for the ride, and is likewise warped.
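A minimal sketch of the space-deformation idea, using a trilinear control lattice (the weighting scheme here is an illustrative assumption, not the paper's exact formulation): displacements assigned to the ‘halo’ of control corners are interpolated down onto the mesh vertices, so warping the cage warps the mesh.

```python
import numpy as np

def lattice_deform(points, corner_offsets):
    """Trilinear 'halo' deformation: points inside a unit cube follow
    displacements assigned to the cube's eight control corners."""
    # corner_offsets: (2, 2, 2, 3) array of displacement vectors
    out = np.zeros_like(points)
    for i, p in enumerate(points):
        u, v, w = np.clip(p, 0.0, 1.0)
        disp = np.zeros(3)
        for a in (0, 1):
            for b in (0, 1):
                for c in (0, 1):
                    # trilinear weight of corner (a, b, c) at (u, v, w)
                    weight = ((u if a else 1 - u) *
                              (v if b else 1 - v) *
                              (w if c else 1 - w))
                    disp += weight * corner_offsets[a, b, c]
        out[i] = p + disp
    return out

# Warp the cage: pull the (1,1,1) corner outward. Mesh vertices near
# that corner 'come along for the ride'; distant ones barely move.
offsets = np.zeros((2, 2, 2, 3))
offsets[1, 1, 1] = [0.5, 0.5, 0.5]
verts = np.array([[0.9, 0.9, 0.9], [0.1, 0.1, 0.1]])
moved = lattice_deform(verts, offsets)
```

The vertex near the pulled corner moves substantially, while the opposite vertex is almost untouched – the essence of cage-based editing.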
Once this rigging is set up, each bone will have a local effect on its nearest geometry as it is manipulated.
In this way, moving the bones in the arm will likewise move the arm, while rotating any neck bones that have been set up will rotate the entire head (and not just the neck, because bones need a hierarchy, and the head bone will necessarily be a ‘child’ of the neck bone, and will follow its movements, as with the structure of a real body).
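The parent/child behavior described above can be sketched as forward kinematics over a tiny 2D skeleton (the bone names and offsets here are hypothetical): each bone stores a local offset and rotation, and world transforms are composed parent-first, so rotating the neck carries the head with it.

```python
import numpy as np

def rot(theta):
    """2D rotation matrix."""
    c, s = np.cos(theta), np.sin(theta)
    return np.array([[c, -s], [s, c]])

# A toy hierarchy: spine -> neck -> head. Each bone has a parent,
# a local offset from the parent's joint, and a local rotation.
bones = {
    "spine": {"parent": None,    "offset": np.array([0.0, 1.0]), "theta": 0.0},
    "neck":  {"parent": "spine", "offset": np.array([0.0, 0.5]), "theta": 0.0},
    "head":  {"parent": "neck",  "offset": np.array([0.0, 0.3]), "theta": 0.0},
}

def world_position(name):
    """Compose transforms down the hierarchy, parent-first."""
    bone = bones[name]
    if bone["parent"] is None:
        return bone["offset"], rot(bone["theta"])
    parent_pos, parent_rot = world_position(bone["parent"])
    world_rot = parent_rot @ rot(bone["theta"])
    return parent_pos + parent_rot @ bone["offset"], world_rot

# Rotate only the neck by 90 degrees: the head joint swings sideways,
# while the neck joint itself stays put.
bones["neck"]["theta"] = np.pi / 2
head_pos, _ = world_position("head")
```

Note that the neck joint's own position does not change; only everything downstream of it rotates, exactly as with a real body.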
However, as we can see in the image on the left below, the resultant deformations are far from the quality of the original image, and far from the quality that the new system can ultimately obtain (image right, below). There is a lot of procedure to follow yet.
Since the user may have deformed the original image significantly, image-to-image techniques are unlikely to be able to transfer the original quality back to the edited image. Instead, one needs a NeRF-style procedure that can evaluate the potential 3D properties of a single image – but which can provide better ‘hallucination’ and abstraction of detail.
In the case of Image Sculpting, the authors used DreamBooth, which was the premier method of inserting custom content into Stable Diffusion until the advent of Low Rank Adaptation (LoRA).
With DreamBooth (and, usually, with LoRA), the user provides a handful of images, which are used to train a model. The model, usually aided by text annotations and/or ‘trigger’ words, can associate the features learned from the images with similar examples from the millions trained into the host diffusion model, and in this way learn to create accurate images of the subject from novel angles that were not present in the training data.
Here the paucity of data is extreme, since DreamBooth is being given one sole image – the crudely-manipulated user content illustrated above. Nonetheless, this is enough for the system to attempt to recreate the detail from the original image.
The authors note:
‘In our application, we train DreamBooth using just a single example, which is the input image. Notably, this one-shot approach with DreamBooth also effectively captures the detailed texture, thereby filling in the textural gaps present in the coarse rendering.’
The researchers experimented with a variety of techniques for the improvement of the crude manipulation, including SDEdit (see comparison images below), but in the end opted for a combination of DreamBooth and the use of depth data obtained by the Depth functionality in the popular Stable Diffusion ancillary system ControlNet.
Regarding the role of ControlNet’s depth functionality, the paper states:
‘We use depth [ControlNet] to preserve the geometric information of user editing. The depth map is rendered directly from the deformed 3D model, bypassing the need for any monocular depth estimation. For the background region, we don’t use the depth map. This depth map serves as a spatial control signal, guiding the geometry generation in the final edited images.’
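The key point in the quote above is that depth is read straight off the deformed geometry rather than estimated from pixels. A minimal orthographic z-buffer sketch over a vertex cloud illustrates this (a real pipeline rasterizes triangles, but the principle is the same):

```python
import numpy as np

def render_depth(verts, size=32):
    """Minimal orthographic depth render: project each vertex's x,y to
    a pixel and keep the nearest z per pixel. Depth comes straight from
    the deformed geometry -- no monocular depth estimation involved."""
    depth = np.full((size, size), np.inf)
    # map x, y in [-1, 1] to pixel indices
    px = ((verts[:, 0] + 1) * 0.5 * (size - 1)).astype(int)
    py = ((verts[:, 1] + 1) * 0.5 * (size - 1)).astype(int)
    for x, y, z in zip(px, py, verts[:, 2]):
        depth[y, x] = min(depth[y, x], z)
    return depth

# Two vertices project to the same pixel: the nearer one wins.
verts = np.array([[0.0, 0.0, 2.0],
                  [0.0, 0.0, 1.0],
                  [0.5, 0.5, 3.0]])
dmap = render_depth(verts)
```

Pixels the subject never touches stay at infinity, which corresponds to the paper's choice not to apply the depth map to the background region.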
The researchers observe, however, that this use of Depth is not adequate to refine certain detail aspects of an image, and for this reason the process is further augmented with the use of DDIM inversion, which re-projects the image back into the trained DreamBooth latent space, so that the image is once again passed through the same kind of denoising process that originally produced the source image.
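What makes DDIM inversion possible is that the DDIM update with eta = 0 is deterministic, so it can be run in either direction. A toy numpy sketch (with a fixed stand-in for the network's noise prediction, and an invented cumulative-alpha schedule) shows the round-trip property:

```python
import numpy as np

rng = np.random.default_rng(0)
abar = np.linspace(0.99, 0.1, 10)  # toy cumulative alpha schedule

def ddim_step(x, eps, t_from, t_to):
    """Deterministic DDIM update between two noise levels (eta = 0).
    With t_to < t_from this denoises; with t_to > t_from it re-noises,
    i.e. performs DDIM inversion back into the latent trajectory."""
    x0 = (x - np.sqrt(1 - abar[t_from]) * eps) / np.sqrt(abar[t_from])
    return np.sqrt(abar[t_to]) * x0 + np.sqrt(1 - abar[t_to]) * eps

x = rng.standard_normal(4)
eps = rng.standard_normal(4)  # stand-in for the model's predicted noise

denoised = ddim_step(x, eps, t_from=9, t_to=0)
restored = ddim_step(denoised, eps, t_from=0, t_to=9)
# With the same predicted noise, inversion exactly undoes the step.
```

In the real system the noise prediction comes from the fine-tuned DreamBooth model at every step, which is what re-projects the edited image into that model's latent space.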
The authors comment:
‘Note that our method differs from the original Plug-and-Play use cases: we use feature injection to preserve the geometry during the coarse-to-fine process rather than translating the image according to a new text prompt.’
Finally, the improved image needs to be effectively reintegrated into the context of the original and complete source image. If the user has made notable changes, there are going to be holes that used to be occupied by the central subject, and anything less than assiduous integration is likely to look like a crude copy-and-paste operation in Photoshop.
This challenge is overcome in the new project by masking the central subject during the denoising steps mentioned above, and through the use of the most recent iteration of Stable Diffusion, SDXL, which operates at a native 1024×1024px resolution. SDXL, unlike its predecessors, includes a refiner module, which is retained throughout the entire pipeline of the new project, and which, the authors claim, reduces artefacts at the final stage.
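The masking idea is the standard diffusion-inpainting trick, sketched minimally below (in the actual pipeline this blend is applied in latent space at each denoising step, with the original suitably noised to match the current step):

```python
import numpy as np

def masked_blend(generated, original, mask):
    """Keep the generated content inside the subject mask and the
    original image outside it, so the background survives the edit."""
    return mask * generated + (1.0 - mask) * original

rng = np.random.default_rng(1)
original = rng.random((8, 8))
generated = rng.random((8, 8))
mask = np.zeros((8, 8))
mask[2:6, 2:6] = 1.0  # subject region

composite = masked_blend(generated, original, mask)
```

Softening the mask edges (e.g. with a blur) is the usual way to avoid the hard ‘copy-and-paste’ seam described above.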
Data and Tests
Image Sculpting is the first project to attempt to transform neural data into manipulable, traditional 3D data as an interstitial stage, and therefore does not have the usual slew of competitors to contend against in tests. Its interest to the community stands, arguably, rather as a potential proof-of-concept, in a field which is still attempting to steer that three-point turn with less precise methods such as semantic manipulation (i.e., by attempting to interfere with and control the way that systems such as Stable Diffusion associate words with images in the denoising process).
Nonetheless, the paper’s authors conducted a round of tests against prior frameworks with broadly similar scope, if rather different approaches.
For the tests, as stated above, the authors adopted the methodology of Three Studio for the initial NeRF representation, and used NVIDIA Instant-NGP to extract a usable model from the NeRF. For the one-shot DreamBooth stage, SDXL-1.0 was fine-tuned using LoRA for 800 steps at a learning rate of 1e-5 (generally the lowest and most fine-grained learning rate practicable).
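The paper does not name its training code, but a one-shot DreamBooth-LoRA fine-tune with the reported hyperparameters might look like the following, assuming Hugging Face diffusers' example script (the paths and instance prompt are hypothetical):

```shell
# Hypothetical one-shot DreamBooth-LoRA fine-tune of SDXL-1.0 matching
# the paper's reported settings: 800 steps at a learning rate of 1e-5.
accelerate launch train_dreambooth_lora_sdxl.py \
  --pretrained_model_name_or_path="stabilityai/stable-diffusion-xl-base-1.0" \
  --instance_data_dir="./single_source_image" \
  --instance_prompt="a photo of sks astronaut" \
  --resolution=1024 \
  --learning_rate=1e-5 \
  --max_train_steps=800 \
  --output_dir="./image_sculpting_lora"
```

Here `instance_data_dir` would contain only the single source image, per the paper's one-shot approach.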
For the feature injection stage, the researchers used all of the self-attention layers of the SDXL decoder, and only the first block of its upsampling decoder. Background inpainting was provided by the Adobe Firefly engine's generative fill capability.
We do not have space here to reproduce the complete range of general qualitative tests provided in the paper, but present some select examples below, and refer the reader to page six of the source paper for better resolution and greater detail:
The authors state:
‘Qualitatively, our method combines the creative freedom of generative models with the precision of graphics pipelines to achieve precise, quantifiable, and physically plausible outcomes for object editing across a variety of scenarios.’
The authors contend that 3DIT is less effective on real and complex images because the system is trained on synthetic data, and less able to straddle domains.
The paper states:
‘This comparison reveals that these methods encounter difficulties with complex pose manipulations because they are constrained to the 2D domain.’
They also created a new dataset called Sculpting Bench, to evaluate the editing capabilities of Image Sculpting. The dataset consists of 28 images across six categories of editing: carving; pose editing; rotation; translation; composition editing; and serial addition. Metrics used were DINO and DreamBooth’s CLIP-I.
To further evaluate the geometric accuracy of the system, a new metric was devised, called D-RMSE, which measures discrepancies between the depth maps of the rough renderings (prior to enhancement) and the enhanced renders.
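The paper describes D-RMSE as a depth-map discrepancy; a plain root-mean-square error over the two depth maps captures the idea (the exact normalization used in the paper is not reproduced here and this is only a sketch):

```python
import numpy as np

def d_rmse(depth_coarse, depth_enhanced):
    """Root-mean-square error between the depth map of the rough
    (pre-enhancement) render and that of the enhanced output. A low
    value means enhancement preserved the user's edited geometry."""
    return float(np.sqrt(np.mean((depth_coarse - depth_enhanced) ** 2)))

coarse = np.ones((4, 4))
enhanced = np.ones((4, 4))
enhanced[0, 0] = 1.4  # a small local geometric drift
score = d_rmse(coarse, enhanced)
```

An enhancement stage that ‘hallucinates’ new geometry would push this score up even if the texture metrics improved, which is why it complements DINO and CLIP-I.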
The authors assert:
‘[Without] any enhancement, the textural quality metrics (DINO and CLIP-I scores) are quite low. SDEdit effectively preserves the edited geometry with a low D-RMSE, yet the visual quality significantly deteriorates compared to the original [image].
‘Our method offers a more advantageous balance, significantly enhancing texture quality as demonstrated by higher DINO and CLIP-I scores, while preserving geometric consistency, evidenced by a low D-RMSE score. We observe that both feature injection and depth control contribute to enhanced geometric consistency and can lead to further improvement when used together.’
The paper concludes by noting that the system has some limitations, such as its current inability to render at higher resolutions, and observes that the addition of upscaling routines (a common measure in many generative workflows) could alleviate this.
For further visual examples and explanations of the system, please see the official project video embedded at the end of this article.
The primary obvious use for the system proposed would seem to be the generation of synthetic data, where multiple views of a single generated image would enable systems such as DreamBooth and LoRA to develop a more rounded visual concept from a single source image.
In one sense, Image Sculpting is analogous to the much older organic modeling system ZBrush, which allows users to operate on billions of polygons, and then translates this ungovernable amount of data into something that a CGI pipeline can actually handle.
This trend towards the deep integration of traditional CGI approaches into text/image AI-based approaches can only be stopped, it seems, by new breakthroughs in the ability of Large Language Models and multimodal trained systems to more explicitly follow instructions, and to restrict their hallucinative capabilities to the ground truth presented to them (i.e., in a source image), instead of attempting to ‘improve’ the source data with related data from the enormous and variegated latent spaces of systems such as Stable Diffusion.