CGI-Style Object Control With Stable Diffusion

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

Though it is relatively easy to put custom content (such as your own likeness, via systems like LoRA and DreamBooth) into the Stable Diffusion generative text-to-image system, it is far more difficult to get a consistent result with each image. Even when the text-prompt stays the same, the system will make an entirely new pass into the trained latent space of the model with each attempt, and will pick up various other random ad hoc facets on the way.

This makes it difficult to produce a multi-view sheet of a trained object or character, where the item or person in question is recognizably the same entity in each view – or to trust that a version chosen from such a sheet will look the same when it appears in the intended usage.

This happens, among other reasons, because latent diffusion models (LDMs) start off a generation by producing random noise. This is iterated, based on the input from text, and from the text/image component in the system, until the final version will hopefully reflect the prompt. There is no way to dispense with the randomness without also dispensing with the versatility and inventiveness which makes the system useful in the first place.
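
The role of the starting noise can be illustrated with a toy sketch. The code below is not Stable Diffusion: `initial_latent` is a hypothetical helper that only mimics the way a sampler draws its starting point from a seeded Gaussian, and the 4×8×8 shape is an arbitrary stand-in for a real latent tensor. It shows that a fixed seed reproduces the same starting noise, while a new seed yields a different one, even though nothing about the prompt has changed.

```python
import random

def initial_latent(seed, channels=4, height=8, width=8):
    """Draw a toy 'latent' of Gaussian noise from a seeded generator.

    Hypothetical helper: real samplers draw a far larger tensor,
    but the dependence on the seed is the same in principle.
    """
    rng = random.Random(seed)
    return [
        [[rng.gauss(0.0, 1.0) for _ in range(width)] for _ in range(height)]
        for _ in range(channels)
    ]

same_a = initial_latent(seed=42)
same_b = initial_latent(seed=42)
other = initial_latent(seed=43)

print(same_a == same_b)  # same seed: the starting noise is reproduced
print(same_a == other)   # new seed: the starting noise changes
```

Fixing the seed makes a single generation repeatable, but it does not make the same object look the same across different prompts or viewpoints, which is the problem that the work discussed here addresses.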

This particular issue largely defines the shortfall between generative AI and traditional CGI (which is increasingly being used to plug the gaps in neural rendering). With CGI, you have reliable meshes and reliable textures, and with generative systems such as LDMs and Generative Adversarial Networks (GANs), you can never be certain exactly what you are going to get, which inhibits professional take-up of such systems.


One possible solution to this challenge comes in the form of a new collaboration between Japanese and Chinese universities and Tencent. The approach offers a seminal method of reliably injecting consistent neural objects into a render, so that they can be depicted from multiple angles without changing appearance.

Examples from CustomNet. Source:

The new system, titled CustomNet, is an extension and development of the Columbia University/Toyota research project Zero-1-to-3, which does not quite have the capabilities of the new offering.

The new approach was able to improve notably both on the original work, and on several prior competing methods, offering some indication that it is increasingly becoming possible to have some kind of reliable injectable model that’s comparable to the consistency of CGI – even if there is a long way to go yet.

The new paper is called CustomNet: Zero-shot Object Customization with Variable-Viewpoints in Text-to-Image Diffusion Models, and comes from six researchers across Tsinghua Shenzhen International Graduate School, the University of Tokyo, and Tencent PCG. The initiative has an associated project page.


CustomNet is, the authors assert, the first method to offer control over location (i.e., where the inserted object appears within the picture), viewpoint and background in a customized synthesis pipeline. Unusually, the method does not use segmentation to isolate the inserted element, but rather re-synthesizes a proposed background, in a similar way to the reference-only ControlNet adjunct for Stable Diffusion, which faithfully reproduces source images while changing them to support a user’s input prompt.

The system adapts the view-conditioned diffusion method from Zero-1-to-3, which in itself offers improved object stability by fine-tuning a base Stable Diffusion model on synthetic data:

From the original Zero-1-to-3 paper, examples of consistent object appearance across a range of viewpoints and ancillary circumstances. Source:

A pre-trained CLIP model encodes the appearance of the background-free reference object into an embedding that contains high-level semantic data on the object. This embedding is then concatenated with the viewpoint parameter (which is supplied by the Zero-1-to-3 framework) and passed to a slender multi-layer perceptron (MLP), to form a joint object embedding.
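
The concatenate-then-project step can be sketched in a few lines of plain Python. Everything specific here is an assumption made for illustration: the 16-dimensional stand-in for the CLIP embedding, the four-value viewpoint encoding (polar offset, sine and cosine of the azimuth offset, radius offset), the two-layer MLP, and the random weights, which a real model would learn during training.

```python
import math
import random

def linear(x, weights, bias):
    """Dense layer: y = W x + b, with W stored as a list of rows."""
    return [sum(w * v for w, v in zip(row, x)) + b for row, b in zip(weights, bias)]

def relu(x):
    return [max(0.0, v) for v in x]

def init_layer(rng, n_in, n_out):
    """Random weights for illustration only; a real model learns these."""
    scale = 1.0 / math.sqrt(n_in)
    weights = [[rng.uniform(-scale, scale) for _ in range(n_in)] for _ in range(n_out)]
    return weights, [0.0] * n_out

def joint_object_embedding(clip_embedding, viewpoint, layers):
    """Concatenate appearance and viewpoint, then project with a small MLP."""
    x = list(clip_embedding) + list(viewpoint)
    for i, (weights, bias) in enumerate(layers):
        x = linear(x, weights, bias)
        if i < len(layers) - 1:  # no activation after the final layer
            x = relu(x)
    return x

rng = random.Random(0)
EMBED_DIM = 16  # assumption: stand-in for the CLIP embedding width
VIEW_DIM = 4    # assumption: polar, sin/cos azimuth, and radius offsets

clip_embedding = [rng.gauss(0.0, 1.0) for _ in range(EMBED_DIM)]
azimuth = math.radians(30.0)
viewpoint = [math.radians(10.0), math.sin(azimuth), math.cos(azimuth), 0.0]

layers = [
    init_layer(rng, EMBED_DIM + VIEW_DIM, 32),
    init_layer(rng, 32, EMBED_DIM),
]
embedding = joint_object_embedding(clip_embedding, viewpoint, layers)
print(len(embedding))  # the joint embedding keeps the appearance embedding's width
```

Projecting back to the original embedding width means that the joint embedding can be consumed wherever the appearance embedding alone would have been, while now carrying the requested viewpoint as well.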

Conceptual architecture for CustomNet.

In the ‘generation branch’ of the image above (middle left), we see the process by which the location of the object is controlled. The user specifies a location, distinguished as a bounding box within the process, and the reference object is then resized into these constraints, and becomes a reference image for the workflow – though at this stage it is ‘floating’ against a null background. By contrast, the original Zero-1-to-3 approach can only place the object centrally in the output.
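
The placement step amounts to resizing the object into a user-chosen box on an otherwise empty canvas. The sketch below works on small integer grids rather than real images, and the function names, the `(top, left, height, width)` box format, and the nearest-neighbour resize are all illustrative assumptions rather than details taken from the paper.

```python
def resize_nearest(image, new_h, new_w):
    """Nearest-neighbour resize of a 2D grid (list of rows)."""
    old_h, old_w = len(image), len(image[0])
    return [
        [image[r * old_h // new_h][c * old_w // new_w] for c in range(new_w)]
        for r in range(new_h)
    ]

def place_in_box(obj, canvas_h, canvas_w, box, background=0):
    """Resize `obj` to fill `box` = (top, left, height, width) on an empty canvas."""
    top, left, box_h, box_w = box
    if box_h <= 0 or box_w <= 0:
        raise ValueError("bounding box must have a positive size")
    if top < 0 or left < 0 or top + box_h > canvas_h or left + box_w > canvas_w:
        raise ValueError("bounding box must lie inside the canvas")
    canvas = [[background] * canvas_w for _ in range(canvas_h)]
    resized = resize_nearest(obj, box_h, box_w)
    for r in range(box_h):
        canvas[top + r][left:left + box_w] = resized[r]
    return canvas

# A 2x2 'object' placed into a 4x4 box in the lower-right of an 8x8 canvas.
obj = [[1, 2],
       [3, 4]]
canvas = place_in_box(obj, canvas_h=8, canvas_w=8, box=(4, 4, 4, 4))
for row in canvas:
    print(row)
```

The zeros left around the box correspond to the ‘null background’ described above, which the later stages of the pipeline are expected to fill.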

For background control, CustomNet offers two approaches: generation-based, and composition-based. In the former, a target background is synthesized from the user’s text prompt. In the latter, a specific source background image is reprocessed into the final output.

For the composition-based approach, the input channels for the architecture’s UNet are extended by concatenating the user-provided background image into the standard Stable Diffusion inpainting pipeline.
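
Extending the input channels in this way means stacking extra feature maps alongside the latent that is being denoised. The sketch below shows only that stacking, using nested lists in place of tensors. The channel counts follow the familiar layout of the Stable Diffusion inpainting model (four latent channels, four channels for the encoded conditioning image, and one mask channel), and it is an assumption that CustomNet uses the same arrangement.

```python
def zeros(channels, height, width):
    return [[[0.0] * width for _ in range(height)] for _ in range(channels)]

def concat_channels(*tensors):
    """Concatenate channel-first tensors (lists of 2D grids) along the channel axis."""
    height, width = len(tensors[0][0]), len(tensors[0][0][0])
    out = []
    for t in tensors:
        if len(t[0]) != height or len(t[0][0]) != width:
            raise ValueError("all inputs must share the same spatial size")
        out.extend(t)
    return out

H = W = 32  # assumption: latent size for a 256x256 image at 8x downsampling
noisy_latent = zeros(4, H, W)       # the latent being denoised
background_latent = zeros(4, H, W)  # the encoded user-supplied background
mask = zeros(1, H, W)               # the region where the object should appear

unet_input = concat_channels(noisy_latent, background_latent, mask)
print(len(unet_input), len(unet_input[0]), len(unet_input[0][0]))
```

Because the background arrives as extra input channels rather than being pasted in afterwards, the network sees it at every denoising step and can adapt the object to it.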

Data and Tests

Most candidate datasets for this objective do not come with backgrounds, but rather feature objects floating over a blank background. One example of this is the Objaverse dataset, which features over 800,000 individual entities and objects.

Examples from the Objaverse dataset, which features isolated objects. Source:

In order to make Objaverse suitable for CustomNet, the researchers created 250,000 image/text data pairs through the use of the Segment Anything Model (SAM) and the BLIP2 captioning system, which isolated and annotated each subject, respectively.

The authors state*:

‘Specifically, for a natural image, we first segment the foreground object using SAM model. Then we synthesize a novel view of the object using Zero-1-to-3 with randomly sampled relative viewpoints. The textual description of the image can be also obtained using the BLIP2 model. In this way, we can synthesize a large amount of data pairs from natural image datasets, like [OpenImages].

‘Meanwhile, the model trained with these data can synthesize more harmonious results with these natural images.’
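
The three stages named in the quote can be laid out as a short pipeline sketch. The `segment`, `novel_view` and `caption` callables below are placeholders for SAM, Zero-1-to-3 and BLIP2; their signatures are hypothetical, the stub implementations only build strings, and the viewpoint ranges are arbitrary. Treating the original image as the training target, with the synthesized view as the reference, is also an assumption about how the pairs are used.

```python
import random
from dataclasses import dataclass
from typing import Any, Callable, Tuple

Viewpoint = Tuple[float, float, float]

@dataclass
class TrainingPair:
    reference_view: Any   # synthesized novel view of the segmented object
    viewpoint: Viewpoint  # relative viewpoint that was sampled
    target_image: Any     # the original natural image
    caption: str          # text description of the image

def build_pair(
    image: Any,
    segment: Callable[[Any], Any],
    novel_view: Callable[[Any, Viewpoint], Any],
    caption: Callable[[Any], str],
    rng: random.Random,
) -> TrainingPair:
    """Run one image through the three stages named in the quote."""
    foreground = segment(image)
    viewpoint = (
        rng.uniform(-30.0, 30.0),    # assumption: polar offset in degrees
        rng.uniform(-180.0, 180.0),  # assumption: azimuth offset in degrees
        0.0,                         # assumption: camera distance unchanged
    )
    return TrainingPair(
        reference_view=novel_view(foreground, viewpoint),
        viewpoint=viewpoint,
        target_image=image,
        caption=caption(image),
    )

# Stand-in stages so that the sketch runs without any model weights.
def fake_segment(image):
    return f"foreground({image})"

def fake_novel_view(foreground, viewpoint):
    return f"view({foreground}, azimuth={viewpoint[1]:.1f})"

def fake_caption(image):
    return f"a photo described from {image}"

pair = build_pair("photo_001.jpg", fake_segment, fake_novel_view, fake_caption,
                  random.Random(0))
print(pair)
```

Passing the stages in as callables keeps the sketch independent of any particular library, so that real models could be substituted without changing the structure of the pipeline.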

(It may seem odd that a dataset of objects with blank backgrounds would need segmentation, but the SAM isolation was needed in order to provide masks into which background material could be inserted – and the lack of existing background would presumably make this a trivial task for Segment Anything.)

The model was trained on the 250,000 pairs obtained, using the AdamW optimizer at a learning rate of 2×10⁻⁶ for 500,000 steps, at a formidable batch size of 96. The training was severely resource-intensive, even by the standard of the experimental image synthesis sector, requiring six days of training across eight NVIDIA V100 GPUs, each with 32GB of VRAM.
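
Some back-of-envelope arithmetic puts these figures in perspective. The calculation below uses only the numbers quoted above, and rests on three assumptions: that the 250,000 pairs were the entire training set, that the batch was split evenly across the eight GPUs, and that the six days were continuous wall-clock time.

```python
steps = 500_000
batch_size = 96
pairs = 250_000
gpus = 8
days = 6

samples_seen = steps * batch_size            # total samples processed
passes_over_data = samples_seen / pairs      # equivalent number of epochs
per_gpu_batch = batch_size // gpus           # samples per GPU per step
seconds_per_step = days * 24 * 3600 / steps  # average wall-clock time per step
gpu_hours = gpus * days * 24                 # total GPU time consumed

print(f"samples seen:     {samples_seen:,}")
print(f"passes over data: {passes_over_data:.0f}")
print(f"per-GPU batch:    {per_gpu_batch}")
print(f"seconds per step: {seconds_per_step:.4f}")
print(f"GPU-hours:        {gpu_hours:,}")
```

Under these assumptions the model would see each pair roughly 192 times, at a little over one second per step, for a total of 1,152 GPU-hours.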

Former approaches tested against CustomNet were Textual Inversion; DreamBooth; ELITE; the encoder-based Google project GLIGEN; and BLIP-Diffusion.

Qualitative comparison against alternative prior frameworks.

Of these results, the authors state:

‘We see that the zero-shot methods GLIGEN, ELITE, BLIP-Diffusion, and the optimization-based method Textual Inversion are far from the identity consistent with the reference object. [DreamBooth] and the proposed CustomNet achieve highly promising harmonious customization results, while our method allows the user to control the object viewpoint easily and [obtain] diverse results.

‘In addition, our method does not require time-consuming model fine-tuning and textual embedding optimization.’

The frameworks were also tested quantitatively, using 26 different prompts to render customized images randomly three times on 50 different objects, with the visual similarity estimated through metric evaluation routines incorporated into the CLIP image encoder, and also through the 2021 self-distillation with no labels (DINO) encoder.

For CLIP, similarity was also measured directly (‘CLIP-T’ in results, as opposed to ‘CLIP-I’), with the metric standards being identity preservation (‘ID’), viewpoint preservation (‘View’) and fidelity to the prompt (‘Text’).
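
Scores of this kind are conventionally computed as the cosine similarity between embedding vectors: image against image for CLIP-I and DINO-I, and prompt text against image for CLIP-T. The sketch below shows that calculation on toy three-dimensional vectors, which stand in for real encoder outputs, and is not drawn from the authors' evaluation code. The final line counts the images per method, on the assumption that every prompt is rendered three times for every object.

```python
import math

def cosine_similarity(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        raise ValueError("cosine similarity is undefined for a zero vector")
    return dot / (norm_a * norm_b)

def mean_similarity(pairs):
    """Average cosine similarity over (embedding_a, embedding_b) pairs."""
    scores = [cosine_similarity(a, b) for a, b in pairs]
    return sum(scores) / len(scores)

# Toy 3-dimensional 'embeddings'; real encoders output hundreds of dimensions.
reference_image = [1.0, 0.0, 0.0]
generated_close = [0.9, 0.1, 0.0]  # points in nearly the same direction
generated_far = [0.0, 1.0, 0.0]    # orthogonal to the reference

print(cosine_similarity(reference_image, generated_close))
print(cosine_similarity(reference_image, generated_far))
print(mean_similarity([(reference_image, generated_close),
                       (reference_image, generated_far)]))

# Images per method implied by the protocol: prompts x repeats x objects.
images_per_method = 26 * 3 * 50
print(images_per_method)
```

A score near 1 indicates that two embeddings point in almost the same direction, which is why higher values are read as better identity preservation or better agreement with the prompt.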

Quantitative results against prior frameworks.

The authors note that CustomNet achieves better preservation of identity, using DINO-I and CLIP-I, than other methods.

A user-study was also conducted, in which 2,700 opinions were obtained in regard to identity similarity (how well the synthesized images reflected the source material), view variation (fidelity to the target view, and the extent to which the view seemed both novel and faithful), and text alignment (how well the synthesized image reflected the perceived intent of the prompt).

The authors note that most participants in the study favored CustomNet over prior methods in all three aspects, with the new framework favored at 78.78%, 64.67%, and 67.84%, respectively. These results can be seen in the right-hand section of the results table above.

CustomNet was also compared to three previous inpainting methods: Paint-by-Example; AnyDoor; and SD-Inpainting (a method native to Stable Diffusion).

Inpainting comparisons with prior approaches.

Paint-by-Example and AnyDoor are able to inpaint the reference object based on a user-supplied background image, but the authors contend that the results seen in the source paper (the code is not currently available) demonstrate a rote copy-pasting effect, similar to pre-AI Photoshopping, rather than a true integration.

The paper observes that CustomNet is able to obtain ‘more harmonious results’, thanks to its bespoke data construction pipeline, which integrates realistic backgrounds into the generative functionality of the trained model, and contends that workflows developed from the native (‘cut out’) configuration of data in Objaverse are likely to obtain inferior results by comparison.

The new method can offer more fine-grained integrated backgrounds.

The researchers attribute CustomNet’s powers of disentanglement to the dual cross-attention employed in the process (see architecture schematic above).
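
The general idea of a dual cross-attention layer can be sketched as two separate attention passes over the image tokens, one conditioned on the text and one on the object embedding. The code below is a deliberately reduced illustration: it uses a single head with no learned projections, and merging the two branches by a residual sum is an assumption, not a detail confirmed by the paper.

```python
import math

def softmax(xs):
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def cross_attention(queries, context):
    """Single-head attention with identity projections (keys = values = context)."""
    dim = len(queries[0])
    out = []
    for q in queries:
        scores = [sum(a * b for a, b in zip(q, k)) / math.sqrt(dim) for k in context]
        weights = softmax(scores)
        out.append([sum(w * v[i] for w, v in zip(weights, context)) for i in range(dim)])
    return out

def dual_cross_attention(image_tokens, text_tokens, object_tokens):
    """Attend to text and object embeddings separately, then add both to the input.

    Assumption: the residual sum used to merge the branches is illustrative.
    """
    from_text = cross_attention(image_tokens, text_tokens)
    from_object = cross_attention(image_tokens, object_tokens)
    return [
        [x + t + o for x, t, o in zip(xs, ts, obs)]
        for xs, ts, obs in zip(image_tokens, from_text, from_object)
    ]

image_tokens = [[0.0, 0.0], [1.0, 1.0]]
text_tokens = [[1.0, 0.0]]    # a single 'background' token
object_tokens = [[0.0, 1.0]]  # a single 'object' token

print(dual_cross_attention(image_tokens, text_tokens, object_tokens))
```

Keeping the two conditions in separate branches means that each contributes its own update to the image tokens, which is one plausible reading of how the background description and the object identity are kept apart.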

In concluding, the paper concedes that CustomNet is currently limited to the 256×256px resolution of the Zero-1-to-3 system, and that it is not capable at this time of performing non-rigid transformations, or of affecting object styles (though the latter is arguably a trivial consideration).

Further comparative examples.


As a proof of concept, CustomNet is an interesting addition to the small but growing number of projects that are interested in achieving reproducible object identity in serial generations – even if the examples given currently have the quality of relatively basic CGI.

Any method that makes some progress in this regard without resorting to actual CGI, in the form of adjunct mesh-based systems such as 3DMMs, will be of interest to the research community.

Whatever our opinion of the quality of CustomNet itself, its stated aim is one of the most important in generative image synthesis research today, since the capability to consistently manipulate a photorealistic neural reference object from frame to frame could potentially end the long struggle to achieve flicker-free and consistent video generation.

* My conversion of the authors’ citations to hyperlinks, where the apposite link is not already included in preceding text.
