Better Stable Diffusion Inpainting by Learning to Remove Real Objects


About the author


Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.


Over the last few years, the potential for synthetic data to fuel the current revolution in computer vision systems and generative image systems has excited both the research sector and the commercial sector that hopes to profit from new innovations in this field.

Synthetic data can be created at will, comes with correct annotations, and involves none of the legal precarity of web-scraped datasets such as the LAION dataset that powers Stable Diffusion.

However, the features extracted from synthetic data do not have the authentic characteristics of real-world data, and may not be adequately photorealistic to bridge a domain gap. Thus, when a model is fed with synthetic data at training time, and then asked to operate on real data at inference time (i.e., when the model is supposed to be fully functional and capable of performing operations on novel, unseen data), the inauthentic nature of the training material can lead to sub-optimal results.

For example, a number of projects have been proposed in the last couple of years that offer end-users the chance to edit images using only text – for instance, to take an image of an empty lake, supply the text prompt ‘boat with two fishermen’, and have the system superimpose this element on the original image; but if the data the model was trained on was synthetic, such as CGI-based data, the interposed element may not be entirely convincing.

Paint By Inpaint

The trouble with training models that can inpaint images in this way is that one usually has to develop a large paired dataset – a collection of images showing a scene with a certain element present, and then without it (or vice versa).

During the course of training, the model then builds up a generalized understanding of the concept of inpainting, through seeing such twinned examples multiple times, and developing a capacity to either inpaint new data or remove it from the image.

Prior works of this kind have had to inject completely synthetic data into images in order to create high-volume datasets for systems of this nature, often failing to resolve the aforementioned domain gap issues.

Recently, however, a new project from Israel has taken a different approach to the creation of frameworks of this type – by removing real objects from real photos, and teaching a generative system how to, effectively, put them back again.

Examples of photorealistic inpainting from the 'paint by inpaint' system developed by Israeli researchers. Source: https://arxiv.org/pdf/2404.18212

In this way, the system is learning from objects that were genuinely in the picture when the photograph was taken, and consequently has an improved capacity to inject credibly photoreal image facets into novel data, later on.

The researchers for the new system, titled Paint by Inpaint, have used multiple layers of Large Language Models (LLMs), along with a host of secondary metrics and methods, to develop a model that can change parts of an image convincingly without the need for the end-user to create or upload a mask – instead, the system understands through semantic recognition, based on the prompt and on evaluation of the target picture to be edited, what the novel element is, and where in the image it should be placed.

Examples of paired data, from the extensive dataset that powers the new work.

The dataset developed to power the new work therefore features paired images: the complete original, and a version from which one element has been removed via a semi-automated inpainting method using Stable Diffusion; the end user, however, is able to make edits solely through text instructions.

The authors state:

‘[Removing] objects (Inpaint) is significantly simpler than its inverse process of adding them (Paint), attributed to the utilization of segmentation mask datasets alongside inpainting models that inpaint within these masks.

‘Capitalizing on this realization, by implementing an automated and extensive pipeline, we curate a filtered large-scale image dataset containing pairs of images and their corresponding object-removed versions. Using these pairs, we train a diffusion model to inverse the inpainting process, effectively adding objects into images.

‘Unlike other editing datasets, ours features natural target images instead of synthetic ones; moreover, it maintains consistency between source and target by construction.’

In extensive qualitative and quantitative tests, and in a human user study, the authors contend, results indicate that Paint by Inpaint represents the state of the art in text-based image editing in diffusion models.

The new paper is titled Paint by Inpaint: Learning to Add Image Objects by Removing Them First, and comes from four researchers across the Weizmann Institute of Science and the Technion-Israel Institute of Technology. There is an associated project site, but this does not currently contain additional or supplementary material.

Method and Data

The dataset developed for the project is titled Paint by Inpaint Edit (PIPE), and consists of around one million image pairs, together with related (automatically generated) text instructions. To generate the images, a two-stage procedure is adopted:

Schema for generation of dataset images for PIPE.

The pairs of source and target images in PIPE are derived from object segmentation datasets that have already decomposed the elements in the images into distinct entities, removing the need to annotate the data from scratch. The datasets from which elements were combined include COCO and Open-Images, with additional segmentation masks provided by LVIS.

Examples from the COCO dataset, which pre-solves the thorny task of identifying classes for objects in images. Source: https://arxiv.org/pdf/1405.0312

The merged data finally comprises 889,230 images spanning 1,400 classes.
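For readers who want to see what the raw ingredients look like, the sketch below shows how per-object binary masks can be pulled from COCO-style instance annotations with pycocotools. This is purely illustrative: the annotation path and the helper function are assumptions, not the authors' actual pipeline.

```python
# Illustrative only: extracting per-object binary masks from COCO-style
# instance annotations with pycocotools. The annotation path is a placeholder.
from pycocotools.coco import COCO
import numpy as np

coco = COCO("annotations/instances_train2017.json")  # assumed annotation file

def object_masks(image_id):
    """Yield (class_name, binary_mask) pairs for every annotated object."""
    ann_ids = coco.getAnnIds(imgIds=[image_id], iscrowd=False)
    for ann in coco.loadAnns(ann_ids):
        class_name = coco.loadCats([ann["category_id"]])[0]["name"]
        mask = coco.annToMask(ann)  # HxW array, 1 inside the object
        yield class_name, mask.astype(np.uint8)
```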

The next task in development was to remove the identified objects from the images using the Stable Diffusion inpainting model. The pre-existing segmentation masks were filtered with CLIP, by calculating the similarity between each segmented object and its known class name; abnormal or occluded views are filtered out through this process, and dilation morphology is used to ensure complete coverage of the targeted object.
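A minimal sketch of those two steps is given below, assuming the widely available openai/clip-vit-base-patch32 checkpoint and OpenCV for the dilation; the similarity threshold and kernel size are placeholder values, not the figures used in the paper.

```python
# Illustrative sketch (assumptions: CLIP variant, similarity threshold, and
# dilation kernel size are placeholders, not the paper's actual values).
import cv2
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def keep_mask(image: Image.Image, mask: np.ndarray, class_name: str,
              threshold: float = 0.23) -> bool:
    """Keep a segment only if its crop is CLIP-similar to its class name."""
    ys, xs = np.where(mask > 0)
    crop = image.crop((xs.min(), ys.min(), xs.max(), ys.max()))
    inputs = proc(text=[f"a photo of a {class_name}"], images=crop,
                  return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img_emb = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt_emb = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float(img_emb @ txt_emb.T) >= threshold

def dilate_mask(mask: np.ndarray, kernel_size: int = 15) -> np.ndarray:
    """Grow the mask so the inpainting region fully covers the object boundary."""
    kernel = np.ones((kernel_size, kernel_size), np.uint8)
    return cv2.dilate(mask, kernel, iterations=1)
```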

The prompts that are subsequently used in the removal process are all designed so that the object will be removed, rather than replaced with an alternate object, or merely modified. The authors explain:

‘[We] guide the model by equipping it with positive and negative prompts which are designed to replace objects with non-objects (e.g., background). The positive prompt is set to “a photo of a background, a photo of an empty place”, while the negative prompt is defined as “an object, a <class>”, where <class> denotes the class name of the object.

‘Empirically, combining these prompts demonstrates improved robustness of the removal process. During the inpainting process, we utilize 10 diffusion steps and generate 3 distinct outputs for each input.’
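Under those stated settings, the removal step could look roughly like the sketch below, using the diffusers StableDiffusionInpaintPipeline. The checkpoint name is an assumption; the prompts, the 10 steps, and the 3 candidates follow the quoted description.

```python
# Illustrative sketch of the removal step with diffusers. The prompts, 10 steps,
# and 3 candidates follow the paper's description; the checkpoint name is assumed.
import torch
from diffusers import StableDiffusionInpaintPipeline
from PIL import Image

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting", torch_dtype=torch.float16
).to("cuda")

def remove_object(image: Image.Image, mask: Image.Image, class_name: str):
    """Inpaint the masked object away, returning three candidate removals."""
    positive = "a photo of a background, a photo of an empty place"
    negative = f"an object, a {class_name}"
    result = pipe(
        prompt=positive,
        negative_prompt=negative,
        image=image,
        mask_image=mask,
        num_inference_steps=10,   # 10 diffusion steps, per the paper
        num_images_per_prompt=3,  # 3 distinct outputs per input
    )
    return result.images
```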

Dataset filtering stages for PIPE development.

CLIP is also used in the post-removal verification process, by calculating the standard deviation of the CLIP embeddings for the inpainted areas (a metric which the authors dub ‘CLIP Consensus’). A second pass is also performed, to ensure that the target object has been completely removed, i.e., by examining the inpainted region for CLIP similarity to the obliterated object, whereby any significant match reveals a failure case.
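A rough sketch of both checks is given below. The thresholds, pooling choices, and CLIP variant are assumptions; only the general idea (measuring the spread of CLIP embeddings across the inpainted candidates, then checking the region against the removed class) follows the paper's description.

```python
# Illustrative sketch of the 'CLIP Consensus' spread measure and the removal
# verification pass. Thresholds and pooling choices are assumptions.
import numpy as np
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

clip = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
proc = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_image_embedding(region: Image.Image) -> np.ndarray:
    """Normalized CLIP image embedding of a cropped inpainted region."""
    inputs = proc(images=region, return_tensors="pt")
    with torch.no_grad():
        emb = clip.get_image_features(**inputs)
    emb = emb / emb.norm(dim=-1, keepdim=True)
    return emb[0].numpy()

def clip_consensus(candidate_regions) -> float:
    """Spread of embeddings across the inpainted candidates; a low value
    suggests the candidates agree on what the region should contain."""
    embs = np.stack([clip_image_embedding(r) for r in candidate_regions])
    return float(embs.std(axis=0).mean())

def object_still_present(region: Image.Image, class_name: str,
                         threshold: float = 0.25) -> bool:
    """Second pass: a high similarity to the removed class marks a failure case."""
    inputs = proc(text=[f"a photo of a {class_name}"], images=region,
                  return_tensors="pt", padding=True)
    with torch.no_grad():
        out = clip(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float(img @ txt.T) >= threshold
```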

At this stage, there are now two entities per entry in the PIPE database – the original image, and a version of that image where an object has been removed. Though a raw class name is also included in this data, this is not enough to train a semantically-versatile inpainting system, and it’s necessary to include far more granular and comprehensive annotation.

To this end, an additional 1,878,919 object addition instructions are generated for training purposes, by three methods. Firstly, a class name-based approach, where the existing class name is transformed into the instruction ‘add a [class]’.

The second approach makes use of Vision Language Models (VLMs) combined with LLMs. An automated pipeline is implemented, where the masked-out object is passed to the CogVLM model, which is capable of augmenting simple descriptions into semantically-rich annotations.

CogVLM can transform and extend base descriptions into semantically rich instructions and augmented descriptions and annotations. Source: https://arxiv.org/pdf/2311.03079

The caption produced by CogVLM is subsequently reformatted into a direct instruction (rather than a description) using the Mistral-7B language model, with diverse examples of varying lengths supplied by the authors, so that the system does not overfit to any single length or particular description.

Finally, a manual reference method is adopted, where three datasets containing object references (RefCOCO, RefCOCO+, and RefCOCOg) are used as the basis for novel instructions.
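Taken together, the three routes just described might look something like the sketch below. The template wording and the rewrite prompt are assumptions for illustration; the paper's actual prompts for CogVLM and Mistral-7B are not reproduced here.

```python
# Illustrative sketch of the three instruction routes described above. The
# template wording and the rewrite prompt are assumptions, not the paper's.

def class_name_instruction(class_name: str) -> str:
    """Route 1: trivial template instruction from the raw class name."""
    return f"add a {class_name}"

def caption_to_instruction(caption: str, rewrite_llm) -> str:
    """Route 2: rewrite a VLM caption of the masked-out object into a direct
    instruction. `rewrite_llm` is any callable wrapping an LLM (e.g. Mistral-7B);
    in practice, few-shot examples of varying length help avoid a single phrasing."""
    prompt = (
        "Rewrite the following object description as a short editing instruction.\n"
        "Description: a small brown dog sitting on the grass\n"
        "Instruction: add a small brown dog sitting on the grass\n"
        f"Description: {caption}\n"
        "Instruction:"
    )
    return rewrite_llm(prompt).strip()

def referring_expression_instruction(ref_expression: str) -> str:
    """Route 3: build an instruction from a RefCOCO-style referring expression."""
    return f"add {ref_expression}"
```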

The authors note that the resulting augmented dataset offers a new benchmark in regard to the sheer volume of images and related editing instructions:

Comparison between PIPE and analogous datasets.

At this stage, the PIPE dataset is complete. We can see in the examples below that instructions such as ‘add a bird’ are disingenuous, since the paired data has not actually added a random bird (as in previous datasets), but has instead taken one away, and presents the original image as if it were the edited result.

PIPE images where the apparent 'addition' of an element is actually the original and unaltered source image.

However, there is no way for the training process to know about this; nor, if it cared about anything, would it care.

Training and Tests

Training of the model took place across what is increasingly becoming a standard configuration, using eight NVIDIA A100 GPUs (whether the 40GB or 80GB model is not specified in the paper) and an effective batch size of 4096, once the array of connected GPUs and gradient accumulation steps are taken into account – one of the highest effective batch sizes to have surfaced in the recent literature (though it should be emphasized that each individual GPU is assigned a batch size of 128).

The learning rate for the main model is 5×10⁻⁵, while the input image resolution is set to 256×256px, and training is undertaken for 60 epochs*.
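For context, the stated figures only add up once gradient accumulation is factored in; the back-of-the-envelope check below infers the accumulation factor from the numbers in the paragraphs above, rather than quoting it from the paper.

```python
# Inferred, not quoted: accumulation factor implied by the stated batch figures.
gpus = 8
per_gpu_batch = 128
effective_batch = 4096
accumulation_steps = effective_batch // (gpus * per_gpu_batch)
print(accumulation_steps)  # -> 4 gradient-accumulation steps
```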

The training itself is effectively a highly-modified fine-tuning of the Stable Diffusion V1.5 model, still the most widely-used version of the system, though the diffusion process is conditioned heavily by Classifier-Free Guidance (CFG), with additional inputs from the specific source images (since the final model is intended to edit existing images rather than generate entirely new images).

The authors note that several aspects of the training process are similar to the current state-of-the-art approach to text-based image editing, InstructPix2Pix (IP2P), against which the system would ultimately be tested. The most important difference between the two approaches is that IP2P uses synthetic data, while the new approach uses only real data.
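For readers unfamiliar with IP2P's conditioning scheme, the sketch below shows the two-scale classifier-free guidance it introduced, combining an unconditional prediction, an image-conditioned prediction, and a fully conditioned prediction. The function signature and scale values are placeholders; this follows the IP2P formulation, which the new model's setup resembles, rather than code from the new paper.

```python
# Sketch of IP2P-style two-scale classifier-free guidance. `predict_noise` is a
# stand-in for the conditioned U-Net; the guidance scale values are placeholders.
def guided_noise(predict_noise, z_t, image_cond, text_cond,
                 s_image=1.5, s_text=7.5):
    """Blend unconditional, image-only, and image+text noise predictions."""
    e_uncond = predict_noise(z_t, image_cond=None, text_cond=None)
    e_image  = predict_noise(z_t, image_cond=image_cond, text_cond=None)
    e_full   = predict_noise(z_t, image_cond=image_cond, text_cond=text_cond)
    return (e_uncond
            + s_image * (e_image - e_uncond)
            + s_text * (e_full - e_image))
```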

InstructPix2Pix contributes much of the basic methodology of the new method, but relies on artificial data. Source: https://arxiv.org/pdf/2211.09800

The benchmarks chosen for tests of the new system, besides a test set from the authors’ own PIPE dataset, were a subset of 750 images from the COCO validation split; the MagicBrush dataset, a manually annotated collection; and the Object Placement Assessment (OPA) dataset. Images were filtered to be appropriate to the task at hand, i.e., the addition or removal of objects.

The four prior frameworks against which the new approach would be tested were IP2P; VQGAN-CLIP; Hive; and Stochastic Differential Editing (SDEdit).

Model-based metrics used were the CLIP and DINO image encoders, which evaluate the similarity between the edited images and the ground-truth target images. For model-free metrics, simple L1 and L2 distances were used.

In accordance with the MagicBrush methodology, the DreamBooth metric CLIP-T was also used, to estimate the alignment between the textual descriptions derived from the edit instructions and the final edited images.
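In rough terms, the model-free and model-based metrics reduce to pixel distances and embedding cosine similarities, as sketched below; the choice of encoders and any cropping of the edited region are assumptions about typical practice, not the authors' exact evaluation code.

```python
# Illustrative sketch of the metric families named above. Encoder choices and
# region cropping are assumptions, not the authors' exact evaluation code.
import numpy as np
import torch

def l1_l2(edited: np.ndarray, target: np.ndarray):
    """Model-free pixel distances between the edited image and the ground truth."""
    diff = edited.astype(np.float32) - target.astype(np.float32)
    return float(np.abs(diff).mean()), float((diff ** 2).mean())

def cosine_similarity(a: torch.Tensor, b: torch.Tensor) -> float:
    """Model-based similarity (CLIP-I or DINO) between two image embeddings;
    CLIP-T uses the same measure between a text embedding and an image embedding."""
    a = a / a.norm(dim=-1, keepdim=True)
    b = b / b.norm(dim=-1, keepdim=True)
    return float((a * b).sum(dim=-1).mean())
```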

Results on the PIPE test set found the new method mostly in the lead:

Results against the authors' own PIPE dataset.

The authors comment:

‘[Our] model significantly surpasses the baselines, confirming to its high consistency level. Furthermore, it exhibits a higher level of semantic resemblance to the target ground truth image, as reflected in the CLIP-I and DINO scores. As for CLIP-T, IP2P obtains marginally better results.’

Next, the researchers tested their system against the MagicBrush dataset:

Results against the MagicBrush dataset.

The authors state:

‘[Our] model achieves the best results in both model-free and model-based similarities (L1, L2, CLIP-I, and DINO) with the target image. However, while it generally matches the performance of other methods, it does not outperform VQGAN-CLIP in aligning with the text (CLIP-T), which is expected given that VQGAN-CLIP maximizes an equivalent objective during the editing process.

‘Following the approach of [MagicBrush], we also fine-tuned our model on the small finetuning training object-addition subset of MagicBrush, comparing it against IP2P, which was similarly fine-tuned. In this setting, our model outperforms IP2P in four out of five metrics, while achieving an equal CLIP-T score.’

Finally, for the quantitative round, the researchers tested their method against OPA, definitively leading the results:

Results against OPA.

The paper states:

‘The quantitative evaluations across the benchmarks demonstrate that our model consistently outperforms competing models, affirming not only its high-quality outputs but also its robustness and adaptability across various domains.’

A qualitative round was added, comparing the performance of the new system to prior approaches:

Qualitative tests. Please refer to the source paper for better resolution.

Of these results, the authors state:

‘[The] proposed model, in contrast to competing approaches, seamlessly adds synthesized objects into images in a natural and coherent manner, while maintaining consistency with the original images prior to editing.

‘Furthermore, the examples, along with those in [image above] demonstrate our model’s ability to generalize beyond its training classes, successfully integrating items such as a “princess”, “steamed milk”, and “buttoned shirt”.’

A user study was conducted as well, in which the participants were asked the same questions outlined in the development of the MagicBrush dataset. The first direction was ‘Compare the edit instruction with the actual changes made in the edited images. Select one edit that most accurately and consistently implements the edit instruction.’

Results from the first set of directions in the user study. The new method is indicated in red, and the baseline in blue. Please refer to the source paper for better resolution.

The second direction was ‘Select one edited image that exhibits the best image quality. (Some aspects you may consider, such as the preservation of visual fidelity from the original image, seamless blending of edited elements with the original image, and the overall natural appearance of the modifications, etc.)’.

Results from the second set of directions in the user study. The new method is indicated in red, and the baseline in blue. Please refer to the source paper for better resolution.

Of the user study, the authors comment:

‘Overall, as our study indicates, our method leads to better results for human perception. Interestingly, as expected due to how PIPE was constructed, our model maintains a higher level of consistency with the original images in both its success and failure cases.’

Conclusion

As director Ridley Scott once said (referring to his use of shellfish to create the original facehugger’s anatomy in Alien), ‘real is always better’, and the PIPE dataset and method seems an obvious and worthwhile advance on the state of the art – not least because there is no ambiguity as to the provenance and legality of the imagery being erased or inserted, since it is all contained in the single source image.

In terms of potential use in movie and TV visual effects, practitioners may need more precise placement than most such systems are currently capable of; but if semantically-based replacement methods became reliable enough, they would surely be welcome additions to the tool-set.

* For the training against the MagicBrush dataset, the settings were a little different, with a learning rate of 10⁻⁶, a batch size of 8 per GPU, no gradient accumulation, and a training length of 250 epochs.

  My substitution of hyperlinks for the authors’ inline citations.
