Using Traditional CGI to Re-Sculpt Generative AI

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

Over the last several months, a growing trend has emerged in neural synthesis papers: the increasing use of ‘traditional’ CGI techniques as a layer of familiar, user-friendly instrumentality for AI-based VFX pipelines that want to exploit the extraordinary realism of generative AI while retaining a greater level of control in production scenarios with demanding deadlines.

This hijacking of old-school CGI for AI purposes is not new – for well over twenty years, parametric models such as 3D Morphable Models (3DMMs), the Skinned Multi-Person Linear Model (SMPL) and, more recently, the 2017 FLAME framework have been used in neural rendering workflows and experiments.

One impressive recent example of this is the generation of controllable deepfake human avatars, which uses a FLAME model to control a texture comprised of numerous individual Gaussian Splats.

Traditional CGI powers a neural approach with deepfake human avatars. Source: https://www.youtube.com/watch?v=lVEY78RwU_I

CGI-AI…?

Many people do not understand the difference between traditional CGI human representations and AI-generated faces and bodies. Though we’ll be addressing this in a dedicated article soon, the core difference is that traditional CGI uses mathematical techniques that date back 40-50 years to ‘glue’ textures of people’s faces, clothes and so on to a mathematically defined wireframe mesh.

If you can imagine sculpting a person out of clothes-hanger wire, covering it in papier-mâché and painting it to look like a person, old-school CGI is the digital equivalent of that.

By contrast, modern generative systems such as Generative Adversarial Networks (GANs), Neural Radiance Fields (NeRF) and Stable Diffusion, train neural models on hundreds, thousands, or even millions of images, sometimes attaching word-based concepts to this training data.

If you can imagine being asked to make a drawing of ‘a man’ or ‘a woman’, you can also imagine how many of the people you have actually seen in your life would contribute to that non-specific drawing; this, in essence, is how AI has learned to do it too.

So, the resulting trained AI model represents a latent space in which all this data has been deeply assimilated. The hyper-real data in the latent space can then be exploited for human representations that are far more realistic than older CGI techniques can achieve.

The trouble is that getting exactly what you want out of a trained latent space can be like attempting a three-point turn in an oil-tanker, since a latent space has no native instrumentality. It is essentially a lake of data with no boats, and no really good maps that show where the fish are.

Conversely, you can control the hell out of CGI, down to the finest motion and the very last pixel – but no matter how hard VFX pipelines try, the results frequently end up in the ‘uncanny valley’, because (among other reasons) the processes that lead to the result are labored and unnatural.

Degrees of 'uncanniness' in CGI representations. Source: https://arxiv.org/pdf/2306.16233.pdf

Using CGI to Edit Generative AI

Though there is currently growing interest in using CGI-based heads and bodies to control or interpret neural data, one recent paper has gone a step further, proposing a workflow that transforms an AI-generated image into a CGI-style mesh/model, lets the user re-pose, deform and otherwise manipulate that model, and then re-inserts the result back into the image.

The new system uses multiple technologies to extract the core subject and transform it into a deformable 3D mesh, before replacing the original content in a new image. Source: https://arxiv.org/pdf/2401.01702.pdf

The new approach, which comes from New York University and Intel Labs, is called Image Sculpting, and uses an extraordinary variety of techniques, including NeRF and DreamBooth, to accomplish this highly flexible interstitial stage.

The paper states:

‘The framework supports precise, quantifiable, and physically-plausible editing options such as pose editing, rotation, translation, 3D composition, carving, and serial addition. It marks an initial step towards combining the creative freedom of generative models with the precision of graphics pipelines.’

Click to play. Examples of ways in which Image Sculpting can operate on 2D generations that have been interpreted into a standard 3D space. See full video at end of article for more. Source: https://www.youtube.com/watch?v=qdk6sVr47MQ

The new work is titled Image Sculpting: Precise Object Editing with 3D Geometry Control. It comes with a project site and an accompanying video (embedded at the end of this article). Four of the five contributing researchers are from NYU; the fifth is from Intel Labs.

Method

The paper does not cover the generation of the source image, which could come from any source (and which, logically, could also be a real photo). Once the image is chosen, the central subject (such as an astronaut, in the main examples provided) is isolated using semantic segmentation – specifically with the wildly popular 2023 Segment Anything (SAM) framework from Meta AI Research (FAIR).

Simulation of the segmentation process provided by the Segment Anything (SAM) framework.
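
For readers who want a concrete sense of this step, the following is a minimal sketch of point-prompted segmentation with Meta’s publicly released segment-anything package; the checkpoint file, image path and click coordinates are illustrative assumptions rather than details taken from the paper.

```python
# Minimal sketch: isolating a subject with Segment Anything (SAM).
# Assumes the official 'segment-anything' package and a downloaded ViT-H
# checkpoint; the image path and the point prompt are illustrative.
import numpy as np
import cv2
from segment_anything import sam_model_registry, SamPredictor

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)

image = cv2.cvtColor(cv2.imread("astronaut.png"), cv2.COLOR_BGR2RGB)
predictor.set_image(image)

# A single foreground click on the subject: (x, y) coordinates, label 1 = foreground
masks, scores, _ = predictor.predict(
    point_coords=np.array([[512, 384]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best_mask = masks[np.argmax(scores)]

# Matte the subject: keep foreground pixels, zero out the background
matted = (image * best_mask[..., None]).astype(np.uint8)
cv2.imwrite("astronaut_matted.png", cv2.cvtColor(matted, cv2.COLOR_RGB2BGR))
```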

Once a discrete and matted image has been obtained, it is interpreted into a NeRF object using the Score Distillation Sampling (SDS) employed in the Magic123 framework, implemented via NVIDIA’s InstantNGP technology.

The Magic123 pipeline for extracting 3D-aware NeRF objects from images. Source: https://arxiv.org/pdf/2306.17843.pdf
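
At its core, Score Distillation Sampling uses a frozen diffusion model as a critic of the NeRF’s renders: noise is added to a rendered view, the diffusion model predicts that noise, and the discrepancy is pushed back into the NeRF’s parameters. The sketch below is a simplified, generic SDS update in PyTorch; render, unet, text_emb and the weighting are placeholders and simplifications, not the Magic123 or Image Sculpting implementation.

```python
# Simplified sketch of a single Score Distillation Sampling (SDS) step, in the
# spirit of DreamFusion/Magic123. 'nerf', 'render', 'unet' and 'text_emb' are
# placeholders for the actual renderer and (frozen) diffusion model.
import torch

def sds_step(nerf, render, unet, text_emb, alphas_cumprod, optimizer):
    optimizer.zero_grad()
    x = render(nerf)                          # differentiable render, (1, 3, H, W)
    t = torch.randint(20, 980, (1,))          # random diffusion timestep
    a_t = alphas_cumprod[t].view(1, 1, 1, 1)
    noise = torch.randn_like(x)
    x_noisy = a_t.sqrt() * x + (1 - a_t).sqrt() * noise
    with torch.no_grad():                     # the diffusion prior stays frozen
        noise_pred = unet(x_noisy, t, text_emb)
    w = 1.0 - a_t                             # a common weighting choice
    grad = w * (noise_pred - noise)           # SDS gradient w.r.t. the render
    x.backward(gradient=grad)                 # push the gradient into the NeRF
    optimizer.step()
```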

Since the resulting NeRF object is not suitable for subsequent deformation or editing processes, its volume is converted into a mesh using the Three Studio project, which provides a Signed Distance Function (SDF).

This process involves the Marching Cubes algorithm, which evaluates occupancy and vacancy in the target space and provides an isosurface that can be texture-mapped.

The SDS-evaluated image/object is transformed first into a NeRF and then into an SDF, where coarse geometry is estimated.
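
As an illustration of this mesh-extraction step, the following minimal sketch runs Marching Cubes over a sampled density/SDF grid using scikit-image; the grid file, resolution and iso-level are assumptions for demonstration purposes, not values from the paper.

```python
# Minimal sketch: extracting a triangle mesh from a sampled volume with
# Marching Cubes, via scikit-image, and exporting it with trimesh.
import numpy as np
from skimage import measure
import trimesh

# 'volume' would be the NeRF density (or SDF) sampled on a regular 3D grid
volume = np.load("density_grid.npy")              # e.g. shape (256, 256, 256)

# level=0.0 is the natural choice for an SDF; a raw density field instead
# needs a hand-picked occupancy threshold
verts, faces, normals, _ = measure.marching_cubes(volume, level=0.0)

mesh = trimesh.Trimesh(vertices=verts, faces=faces, vertex_normals=normals)
mesh.export("subject_mesh.obj")                   # isosurface ready for texturing
```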

Both the discrete surface and its texture are obtained via techniques developed for the 2020 NVIDIA publication Modular Primitives for High-Performance Differentiable Rendering.

By this stage we are truly back in the pre-AI era, since we now have an old-school mesh that can be manipulated via a number of venerable techniques, some of which certainly date back to the early 1990s, if not earlier. One such technique is space deformation, where a ‘halo’ of control area surrounds the mesh. As the halo is warped, the mesh comes along for the ride, and is likewise warped.

With space deformation, manipulation of the exaggerated catchment area will affect the source mesh.
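
The sketch below is a toy illustration of the space-deformation idea: each mesh vertex follows a weighted blend of the displacements applied to the surrounding control ‘cage’. Real systems use more principled weights (free-form deformation lattices, mean-value or harmonic coordinates); the inverse-distance weighting here is only a stand-in.

```python
# Toy space deformation: a coarse cage of control points surrounds the mesh,
# and each vertex moves as a weighted blend of the control-point displacements.
import numpy as np

def deform(vertices, cage_rest, cage_deformed, eps=1e-8):
    # weights: each vertex is influenced most strongly by its nearest cage points
    d = np.linalg.norm(vertices[:, None, :] - cage_rest[None, :, :], axis=-1)
    w = 1.0 / (d + eps)
    w /= w.sum(axis=1, keepdims=True)         # normalise weights per vertex
    offsets = cage_deformed - cage_rest       # how the user has warped the 'halo'
    return vertices + w @ offsets             # the mesh comes along for the ride
```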

Another technique, and one that is better suited to animating a character, is Linear Blend Skinning, in which the user creates a basic bone structure inside the mesh.

Bones control related local topography. Source: https://web.stanford.edu/class/cs248/pdf/class_13_skinning.pdf

Once this rigging is set up, each bone will have a local effect on its nearest geometry as it is manipulated.
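
Linear Blend Skinning can be stated compactly: each deformed vertex is a weighted sum of that vertex as transformed by each of the bones that influence it. A minimal NumPy version, with the 4×4 bone matrices (relative to the rest pose) and the per-vertex skin weights assumed to be given, might look like this:

```python
# Minimal Linear Blend Skinning (LBS): 'bone_mats' has shape (B, 4, 4) and holds
# each bone's transform relative to the rest pose; 'weights' has shape (V, B)
# and sums to 1 for every vertex.
import numpy as np

def linear_blend_skinning(vertices, weights, bone_mats):
    V = vertices.shape[0]
    homo = np.concatenate([vertices, np.ones((V, 1))], axis=1)   # (V, 4) homogeneous
    # transform every vertex by every bone, then blend by the skin weights
    per_bone = np.einsum("bij,vj->vbi", bone_mats, homo)         # (V, B, 4)
    blended = np.einsum("vb,vbi->vi", weights, per_bone)         # (V, 4)
    return blended[:, :3]
```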

From the new paper, an example of a 'rigged' creature derived from the NeRF that was itself extracted from a single AI-generated image.

In this way, moving the bones in the arm will likewise move the arm, while rotating any neck bones that have been set up will rotate the entire head, and not just the neck: bones sit in a hierarchy, so the head bone is necessarily a ‘child’ of the neck bone and follows its movements, as in a real body.
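
The parent/child behaviour described above is plain forward kinematics: a bone’s world transform is its parent’s world transform composed with its own local transform, so rotating the neck carries the head along with it. A small sketch, assuming bones are ordered so that parents precede their children:

```python
# Forward kinematics over a bone hierarchy: world[i] = world[parent] @ local[i].
import numpy as np

def world_transforms(local_mats, parents):
    """local_mats: (B, 4, 4) local bone transforms; parents[i] is the index of
    bone i's parent, or -1 for the root bone."""
    world = [None] * len(local_mats)
    for i, p in enumerate(parents):           # parents are assumed to precede children
        world[i] = local_mats[i] if p < 0 else world[p] @ local_mats[i]
    return np.stack(world)
```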

However, as we can see in the image on the left below, the resultant deformations are far from the quality of the original image, and far from the quality that the new system can ultimately obtain (image right, below). There is a lot of procedure to follow yet.

On the left, the outcome of user manipulation, which does not yet have the requisite quality of the original AI-generated image that was the source for the subject.

Since the user may have deformed the original image significantly, image-to-image techniques are unlikely to be able to transfer the original quality back to the edited image. Instead, one needs a NeRF-style procedure that can evaluate the potential 3D properties of a single image – but which can provide better ‘hallucination’ and abstraction of detail.

In the case of Image Sculpting, the authors used DreamBooth, which was the premier method of inserting custom content into Stable Diffusion until the advent of Low-Rank Adaptation (LoRA).

With DreamBooth (and, usually, with LoRA), the user provides a handful of images, which are used to train a model. The model, usually aided by text annotations and/or ‘trigger’ words, can associate the features learned from the images with similar examples from the millions trained into the host diffusion model, and in this way learn to create accurate images of the subject from novel angles that were not present in the training data.

Here the paucity of data is extreme, since DreamBooth is being given a single image – the crudely-manipulated user content illustrated above. Nonetheless, this is enough for the system to attempt to recreate the detail of the original image.
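
Conceptually, this one-shot fine-tuning uses the ordinary diffusion noise-prediction objective, with the single coarse render standing in for an entire dataset. The sketch below illustrates the idea only; unet, encode_image, encode_text and the ‘sks’ trigger prompt are placeholders, not the authors’ code.

```python
# Conceptual one-shot DreamBooth-style step: the standard noise-prediction loss,
# applied to a single image paired with a trigger-word prompt. All model handles
# are placeholders.
import torch
import torch.nn.functional as F

def one_shot_step(unet, encode_image, encode_text, image, alphas_cumprod,
                  optimizer, prompt="a photo of sks astronaut"):
    optimizer.zero_grad()
    with torch.no_grad():
        latents = encode_image(image)                 # e.g. VAE latents, (1, 4, h, w)
        text_emb = encode_text(prompt)
    t = torch.randint(0, 1000, (1,), device=latents.device)
    a_t = alphas_cumprod[t].view(1, 1, 1, 1)
    noise = torch.randn_like(latents)
    noisy = a_t.sqrt() * latents + (1 - a_t).sqrt() * noise
    pred = unet(noisy, t, text_emb)
    loss = F.mse_loss(pred, noise)                    # predict the added noise
    loss.backward()
    optimizer.step()
    return loss.item()
```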

The authors note:

‘In our application, we train DreamBooth using just a single example, which is the input image. Notably, this one-shot approach with DreamBooth also effectively captures the detailed texture, thereby filling in the textural gaps present in the coarse rendering.’

The researchers experimented with a variety of techniques for improving the crude manipulation, including SDEdit (see comparison images below), but in the end opted for a combination of DreamBooth and depth data obtained via the depth functionality of the popular Stable Diffusion ancillary system ControlNet.

Diverse results of the final retouching method across varying approaches, including the use of prior technique SDEdit. Please see original paper for better resolution.

Regarding the role of ControlNet’s depth functionality, the paper states:

‘We use depth [ControlNet] to preserve the geometric information of user editing. The depth map is rendered directly from the deformed 3D model, bypassing the need for any monocular depth estimation. For the background region, we don’t use the depth map. This depth map serves as a spatial control signal, guiding the geometry generation in the final edited images.’

An example of ControlNet's ability to extract depth information from an image. Source: https://huggingface.co/lllyasviel/sd-controlnet-depth
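
As a general illustration of depth-conditioned generation, the sketch below uses the diffusers library with the SD-1.5 depth ControlNet cited above. Image Sculpting itself works with SDXL, and renders its depth map directly from the deformed 3D model rather than estimating it, but the control mechanism is the same.

```python
# Minimal sketch of depth-conditioned generation with ControlNet via diffusers;
# the prompt and file names are illustrative.
import torch
from PIL import Image
from diffusers import StableDiffusionControlNetPipeline, ControlNetModel

controlnet = ControlNetModel.from_pretrained(
    "lllyasviel/sd-controlnet-depth", torch_dtype=torch.float16
)
pipe = StableDiffusionControlNetPipeline.from_pretrained(
    "runwayml/stable-diffusion-v1-5", controlnet=controlnet, torch_dtype=torch.float16
).to("cuda")

depth_map = Image.open("deformed_mesh_depth.png")   # depth rendered from the 3D model
result = pipe(
    prompt="an astronaut",                          # illustrative prompt
    image=depth_map,                                # the spatial control signal
    num_inference_steps=30,
).images[0]
result.save("depth_guided.png")
```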

The researchers observe, however, that this use of depth conditioning alone is not adequate to refine certain aspects of detail in an image, and for this reason the process is further augmented with DDIM inversion, which projects the image back into the trained DreamBooth latent space, so that it once again passes through the same kind of denoising process that originally produced the source image.
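
DDIM inversion simply runs the deterministic DDIM update backwards, re-noising the latents along the trajectory the model would have used to denoise them. A simplified sketch, with unet and text_emb as placeholders:

```python
# Simplified DDIM inversion: walk the deterministic DDIM update from low noise
# to high noise, recording the 'inverted' latent that would denoise back to the
# input. 'unet' and 'text_emb' are placeholders.
import torch

@torch.no_grad()
def ddim_invert(latents, unet, text_emb, alphas_cumprod, timesteps):
    x = latents
    for t_cur, t_next in zip(timesteps[:-1], timesteps[1:]):   # increasing noise
        a_cur, a_next = alphas_cumprod[t_cur], alphas_cumprod[t_next]
        eps = unet(x, t_cur, text_emb)                         # predicted noise
        x0_pred = (x - (1 - a_cur).sqrt() * eps) / a_cur.sqrt()
        x = a_next.sqrt() * x0_pred + (1 - a_next).sqrt() * eps
    return x                                                   # inverted latent
```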

The authors comment:

‘Note that our method differs from the original Plug-and-Play use cases: we use feature injection to preserve the geometry during the coarse-to-fine process rather than translating the image according to a new text prompt.’

Finally, the improved image needs to be effectively reintegrated into the context of the original, complete source image. If the user has made notable changes, there will be holes where the central subject used to be, and anything less than assiduous integration is likely to look like a crude copy-and-paste job in Photoshop.

This challenge is overcome in the new project by masking the central subject during the denoising steps mentioned above, and through the use of the most recent iteration of Stable Diffusion, SDXL, which operates at a native 1024×1024px resolution. SDXL, unlike its predecessors, includes a refiner module, which is retained throughout the entire pipeline of the new project and which, the authors claim, reduces artefacts at the final stage.
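
A generic way to achieve this kind of mask-aware denoising, in the spirit of blended latent diffusion or RePaint (and not necessarily the authors’ exact implementation), is to reset the region outside the subject mask to appropriately re-noised background latents at every step:

```python
# Generic masked-blending step: keep generated content inside the subject mask,
# and re-noised original background latents outside it.
import torch

def blend_step(x_t, bg_latents, mask, a_t):
    noise = torch.randn_like(bg_latents)
    # bring the untouched background up to the current noise level
    bg_noisy = a_t.sqrt() * bg_latents + (1 - a_t).sqrt() * noise
    # mask == 1 inside the edited subject, 0 in the background
    return mask * x_t + (1 - mask) * bg_noisy
```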

Data and Tests

Image Sculpting is the first project to attempt to transform neural data into manipulable, traditional 3D data as an interstitial stage, and therefore does not have the usual slew of competitors to contend with in tests. Its interest to the community arguably lies instead in its value as a proof of concept, in a field that is still attempting to steer that three-point turn with less precise methods such as semantic manipulation (i.e., attempting to interfere with and control the way that systems such as Stable Diffusion associate words with images in the denoising process).

Nonetheless, the paper’s authors conducted a round of tests against prior frameworks with broadly similar scope, if rather different approaches.

For the tests, as stated above, the authors adopted the methodology of Three Studio for the initial NeRF representation, and used NVIDIA Instant-NGP to extract a usable model from the NeRF. For the one-shot DreamBooth stage, SDXL-1.0 was fine-tuned using LoRA for 800 steps at a learning rate of 1e-5 (a notably low, fine-grained learning rate).
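
A hedged sketch of what such a LoRA set-up might look like with a recent diffusers/PEFT pairing is given below; the rank, alpha and target modules are illustrative defaults, while the 800 steps and the 1e-5 learning rate are the values reported in the paper.

```python
# Illustrative LoRA configuration for an SDXL UNet; assumes a recent diffusers
# release whose model classes expose add_adapter() via the PEFT integration.
import torch
from peft import LoraConfig

def configure_lora(unet):
    unet.requires_grad_(False)                      # freeze the base SDXL UNet
    lora_config = LoraConfig(
        r=16,                                       # assumed rank
        lora_alpha=16,                              # assumed scaling
        target_modules=["to_q", "to_k", "to_v", "to_out.0"],  # attention projections
    )
    unet.add_adapter(lora_config)                   # only LoRA weights stay trainable
    params = [p for p in unet.parameters() if p.requires_grad]
    return torch.optim.AdamW(params, lr=1e-5)       # learning rate from the paper

# The returned optimizer would then be stepped for ~800 iterations, per the paper.
```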

For the feature injection stage, the researchers used all the self-attention layers of the SDXL decoder, and only the first block of its upsampling decoder. Background inpainting was provided by the generative fill capability of the Adobe Firefly engine.

We do not have space here to reproduce the complete range of general qualitative tests provided in the paper, but present some select examples below, and refer the reader to page six of the source paper for better resolution and greater detail:

Qualitative sample results from tests purely on the new system. See the paper for more examples at better resolution.

The authors state:

‘Qualitatively, our method combines the creative freedom of generative models with the precision of graphics pipelines to achieve precise, quantifiable, and physically plausible outcomes for object editing across a variety of scenarios.’

The researchers additionally tested Image Sculpting against the prior Object 3DIT approach, which attempts similar types of transformations based solely on language-based instructions.

Comparison of the new approach to Object 3DIT.

The authors contend that 3DIT is less effective on real and complex images because the system is trained on synthetic data, and less able to straddle domains.

Additionally, the new approach was trialed against the pose-editing capabilities of the DragDiffusion framework, bolstered by the OpenPose module in ControlNet.

Comparison with DragDiffusion.

The paper states:

‘This comparison reveals that these methods encounter difficulties with complex pose manipulations because they are constrained to the 2D domain.’

The authors also compared their method to the InstructPix2Pix framework and to OpenAI’s DALL-E 3 architecture, and found that these alternative methods struggled to follow instructions as faithfully as direct manipulation in the new approach allows:

Comparisons with InstructPix2Pix and DALL-E 3. We can see that the prior methods are incapable of adding just a single cherry, as per the instruction.

They also created a new dataset called Sculpting Bench, to evaluate the editing capabilities of Image Sculpting. The dataset consists of 28 images across six categories of editing: carving; pose editing; rotation; translation; composition editing; and serial addition. Metrics used were DINO and DreamBooth’s CLIP-I.

To further evaluate the geometric accuracy of the system, a new metric was devised, called D-RMSE, which measures discrepancies between the depth maps of the rough renderings (prior to enhancement) and the enhanced renders.
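
In essence, D-RMSE is the root-mean-square error between the two depth maps; a minimal sketch (the paper’s exact normalisation may differ):

```python
# Minimal D-RMSE sketch: RMSE between the depth map of the coarse render and
# that of the enhanced result.
import numpy as np

def d_rmse(depth_coarse, depth_enhanced):
    diff = depth_coarse.astype(np.float64) - depth_enhanced.astype(np.float64)
    return float(np.sqrt(np.mean(diff ** 2)))
```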

Quantitative comparison against SDEdit, including ablative comparisons.

The authors assert:

‘[Without] any enhancement, the textural quality metrics (DINO and CLIP-I scores) are quite low. SDEdit effectively preserves the edited geometry with a low D-RMSE, yet the visual quality significantly deteriorates compared to the original [image].

‘Our method offers a more advantageous balance, significantly enhancing texture quality as demonstrated by higher DINO and CLIP-I scores, while preserving geometric consistency, evidenced by a low D-RMSE score. We observe that both feature injection and depth control contribute to enhanced geometric consistency and can lead to further improvement when used together.’

The paper concludes by noting that the system has some limitations, such as its current inability to render at higher resolutions, and observes that the addition of upscaling routines (a common measure in many generative workflows) could alleviate this.

For further visual examples and explanations of the system, please see the official project video embedded at the end of this article.

Conclusion

The most obvious use for the proposed system would seem to be the generation of synthetic data, where multiple views of a single generated image would enable systems such as DreamBooth and LoRA to develop a more rounded visual concept from a single source image.

In one sense, Image Sculpting is analogous to the much older organic modeling system ZBrush, which allows users to operate on billions of polygons, and then translates this ungovernable amount of data into something that a CGI pipeline can actually handle.

This trend towards the deep integration of traditional CGI approaches into text- and image-based AI systems can only be halted, it seems, by new breakthroughs in the ability of Large Language Models and multimodal systems to follow instructions more faithfully, and to restrict their hallucinatory capabilities to the ground truth presented to them (i.e., in a source image), instead of attempting to ‘improve’ the source data with related data from the enormous and variegated latent spaces of systems such as Stable Diffusion.
