Getting Real Transparency Into Stable Diffusion


At present, the general trend in generative synthesis is toward highly scalable and effective systems that have potential as closed-source, API-accessible and highly rentable services. The growing zeitgeist of paranoia around newer deepfake-capable methods such as Stable Diffusion is giving the companies that fund research a tacit remit not to release code or model weights, as – unusually – Stability.ai has done, and continues to do, for the Stable Diffusion range.

Only a relatively small percentage of new AI synthesis projects are directly aimed at the VFX or professional photography sector, which has less interest in generating viral, one-off text-to-video memes, and is far more concerned with new developments which offer the same kind of granular control over neurally-generated content that the scene has enjoyed since the advent of CGI some thirty years ago.

One particular luxury, which changed the face of both desktop publishing and VFX workflows, was the ability to use semi-transparent layers in image and video pipelines. Prior to the advent of Photoshop, After Effects, and the other compositing and editing applications that rose to rapid adoption in the early 1990s, the industry had had to make do with photo-chemical methods for background removal such as blue-screen matting and Disney’s better (but more expensive) yellow-based sodium vapor method.

Now, backgrounds could be knocked out in seconds, or rotoscoped out with overall less tedium and labor than the benighted ‘cut-out squad’ had experienced in the 1960s for the making of Stanley Kubrick’s 2001: A Space Odyssey.

The advent of machine learning-based image and video synthesis presents VFX professionals with output that is ‘glued together’ once more, necessitating the use of old-school methods to separate the elements for compositing.

Further, rendering elements separately in diffusion-based systems is not usually a workable solution, because the entangled nature of an AI-generated image means that it is difficult to obtain accurate reflections or shadows in this way – risking that the result will look like an out-of-context cheap cut-and-paste in Photoshop.

So, while these new technologies are very exciting, and while their potential cannot be ignored, the lack of a native alpha channel is something of a hindrance, with the primary resort currently being semantic segmentation, usually now an iteration of the Segment Anything (SAM) model. However, such methods remain too far in the semantic (image/text) domain to offer the industrial reliability of older and more established non-AI approaches.
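
For context, the ad hoc workflow that such research seeks to improve on typically looks something like the sketch below, which uses Meta's open source segment-anything package to cut a subject out of an already-generated image. The checkpoint path and click coordinates are placeholder assumptions, and the result is a hard-edged binary mask rather than true transparency:

```python
import numpy as np
from PIL import Image
from segment_anything import sam_model_registry, SamPredictor

# The ad hoc status quo: generate an image, then cut the subject out after the
# fact with SAM and fake an alpha channel from the predicted mask.
# The checkpoint path and click coordinate below are placeholders.

image = np.array(Image.open("generated.png").convert("RGB"))

sam = sam_model_registry["vit_h"](checkpoint="sam_vit_h_4b8939.pth")
predictor = SamPredictor(sam)
predictor.set_image(image)

# One foreground click roughly on the subject of interest.
masks, scores, _ = predictor.predict(
    point_coords=np.array([[512, 384]]),
    point_labels=np.array([1]),
    multimask_output=True,
)
best = masks[np.argmax(scores)]                           # highest-scoring mask

rgba = np.dstack([image, (best * 255).astype(np.uint8)])  # hard-edged alpha
Image.fromarray(rgba, mode="RGBA").save("cutout.png")
```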

LayerDiffuse

Thus it has been of interest to us to see a recent project from Stanford University offering a method that can introduce isolated layers into a Stable Diffusion workflow – generated images which contain alpha channel transparency, but which can still reflect (even literally) other elements that are generated separately into other layers or other images:

In this example from the new paper, the reflections and shadow for the mirror ball literally reflect the alternate background component, yet exist with the object in a separate alpha channel. Source: https://arxiv.org/pdf/2402.17113.pdf

Titled LayerDiffuse, the new approach is capable of transforming any Latent Diffusion Model (LDM – i.e., Stable Diffusion) into a transparency-capable system, via the use of LoRA adaptation, and through intensive training on off-the-shelf commercial imagery which contains transparency, from outlets such as Adobe Stock.

Generated output from LayerDiffuse captures fine details such as hair, and challenging transparency effects such as glass.

In a user survey, results from the system scored similarly to expensive commercial stock photos that contain layers. Additionally, the system can be refined through the use of the highly popular ControlNet adjunct system for Stable Diffusion, which allows factors such as outlines, depth information and normal maps to contribute to the quality of the generated image:

ControlNet can be used in conjunction with LayerDiffuse.

The paper states:

‘We show that latent transparency can be applied to different open source image generators, or be adapted to various conditional control systems to achieve applications like foreground/background-conditioned layer generation, joint layer generation, structural control of layer contents, etc.

‘A user study finds that in most cases (97%) users prefer our natively generated transparent content over previous ad-hoc solutions such as generating and then matting. Users also report the quality of our generated transparent images is comparable to real commercial transparent assets like Adobe Stock.’

The paper is titled Transparent Image Layer Diffusion using Latent Transparency, and comes from two researchers at Stanford.

Method

The new system offers a ‘latent transparency’ approach that can enable the SDXL version of Stable Diffusion to output transparent images, in addition to multiple transparent layers.

This involves a kind of ‘steganographic’ approach, wherein transparency is encoded and decoded obliquely by external adjunct models, removing the need for the kind of extensive base fine-tuning of the original model (which can have a deleterious effect on quality, since it disturbs and tends to compromise the original weights that were trained into the model at great expense of money and time).
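
Purely by way of illustration, the core idea might be sketched as follows, with the alpha channel folded into a small learned offset added to the frozen VAE's latent. The module names and interfaces here are our own assumptions, not the authors' published code:

```python
import torch

# Hedged sketch of 'latent transparency': the alpha channel is hidden as a small
# learned offset added to the frozen VAE's latent, so that the base model never
# needs invasive fine-tuning. `frozen_vae` and `transparency_encoder` are
# hypothetical modules used for illustration only.

def encode_with_transparency(frozen_vae, transparency_encoder, rgb, alpha):
    """rgb: (B, 3, H, W) in [-1, 1]; alpha: (B, 1, H, W) in [0, 1]."""
    with torch.no_grad():
        base_latent = frozen_vae.encode(rgb)       # untouched Stable Diffusion latent

    rgba = torch.cat([rgb, alpha], dim=1)
    offset = transparency_encoder(rgba)            # learned 'steganographic' perturbation
    return base_latent + offset                    # adjusted latent carrying the alpha
```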

Further examples of the ability of LayerDiffuse to isolate rendered subject matter while accounting for the total effect of background influences.

In line with the principles of prior projects such as Invertible Image Rescaling and Explaining and Harnessing Adversarial Examples, LayerDiffuse infiltrates a perturbation into the latent space of the core model (in this case, Stable Diffusion), but this time with the intention not of hiding an entire image or other kinds of data, but of hiding alpha channel capabilities.

Though some fine-tuning of the core Stable Diffusion model is necessary to facilitate transparency, the researchers state that it was essential to ensure that the original Variational Autoencoder (VAE) and the modified model share the same latent distribution closely enough that none of the typical artifacts which characterize fine-tuned output become apparent. In other words, the features of the novel adjunct systems have to accord with the original system to a great extent.

This can be measured by a bespoke ‘harmfulness’ metric, which measures any such artifacts that occur in decoding a latent image, compared to base performance – a metric that is used extensively throughout the process.
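
One plausible way to formalize such a measurement (our own hedged sketch, not the paper's exact definition) is to compare the frozen VAE's reconstruction error before and after the latent adjustment:

```python
import torch
import torch.nn.functional as F

# Hedged 'harmfulness'-style measurement: how much worse does the frozen VAE
# reconstruct the image when decoding the adjusted latent rather than the
# latent it produced itself? `frozen_vae` is a hypothetical module.

@torch.no_grad()
def harmfulness(frozen_vae, rgb, adjusted_latent):
    base_latent = frozen_vae.encode(rgb)
    baseline_err = F.mse_loss(frozen_vae.decode(base_latent), rgb)      # normal VAE error
    adjusted_err = F.mse_loss(frozen_vae.decode(adjusted_latent), rgb)  # error after perturbation
    # Positive values mean the hidden transparency is degrading the decode.
    return (adjusted_err - baseline_err).item()
```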

Architectural schema for the LayerDiffuse system.

With this explicit regulation in place, fine-tuning to accommodate the nascent transparency component can be undertaken. Wherever possible, the models in question are frozen and unaffected by these external processes.

A transparency encoder is trained from scratch, handling an additional alpha channel tacked onto the RGB channels that Stable Diffusion already operates in. Then a subsequent latent transparency decoder is trained that interprets the latent adjusted in the previous stage, together with the RGB reconstruction, and extracts the transparent image from the perturbed latent space.

The quality of this process, the paper states, can be augmented further with a PatchGAN discriminator loss.
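
For readers unfamiliar with the term, a PatchGAN discriminator judges realism per image patch rather than per whole image. A generic PyTorch sketch of such a network (not the authors' exact architecture) follows:

```python
import torch.nn as nn

# Illustrative PatchGAN-style discriminator (as popularized by pix2pix), of the
# general kind the paper says can sharpen the decoded transparent image.
# This is a generic sketch, not the authors' network.

class PatchDiscriminator(nn.Module):
    def __init__(self, in_channels=4, base=64):        # 4 channels = RGBA
        super().__init__()
        layers, ch = [], base
        layers += [nn.Conv2d(in_channels, ch, 4, stride=2, padding=1),
                   nn.LeakyReLU(0.2, inplace=True)]
        for _ in range(2):
            layers += [nn.Conv2d(ch, ch * 2, 4, stride=2, padding=1),
                       nn.InstanceNorm2d(ch * 2),
                       nn.LeakyReLU(0.2, inplace=True)]
            ch *= 2
        layers += [nn.Conv2d(ch, 1, 4, stride=1, padding=1)]  # per-patch real/fake logits
        self.net = nn.Sequential(*layers)

    def forward(self, x):
        return self.net(x)   # (B, 1, h, w) map of patch scores
```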

Additionally, multiple layers can extend this base model through shared attention and the use of LoRAs.

The workflow for the base diffusion training and the LoRA-augmented layer model training.

In this case, the ‘background’ and ‘foreground’ LoRAs operate independently but in tandem, via shared attention (i.e., identical criteria are applied across two parallel processes).
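
A simplified sketch of what such attention-sharing can look like is given below, with each branch attending over the concatenated keys and values of both branches; shapes and names are illustrative assumptions rather than the paper's implementation:

```python
import torch
import torch.nn.functional as F

# Simplified 'shared attention' between two parallel layer branches: each branch
# attends over the union of both branches' keys/values, so the foreground and
# background LoRAs stay mutually consistent. Shapes are illustrative.

def shared_attention(q_fg, k_fg, v_fg, q_bg, k_bg, v_bg):
    k_all = torch.cat([k_fg, k_bg], dim=1)   # (B, N_fg + N_bg, D)
    v_all = torch.cat([v_fg, v_bg], dim=1)
    out_fg = F.scaled_dot_product_attention(q_fg, k_all, v_all)
    out_bg = F.scaled_dot_product_attention(q_bg, k_all, v_all)
    return out_fg, out_bg
```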

Alternative settings for the VAE’s influence on the UNet can also provide methods to encode foreground, background, or any other number of layer conditions, and to generate blended images directly.

Alternate architectures for diverse applications.

Data and Tests

The base dataset used in training and tests for LayerDiffuse comprised 20,000 HQ PNG images with transparency, either purchased or downloaded for free from five online stock image sources (with usage licenses permitting projects of this sort).

Examples of stock images obtained for the project.

Using randomly-sampled examples from this initial cache of data, an SDXL VAE for Stable Diffusion was trained with latent transparency, at a batch size of 8, with the core model itself then trained using the same parameters, assuring latent distribution agreement between the VAE and the model.

After this, for 25 successive rounds, 10,000 random samples were generated using the previous model, based on random prompts supplied by LAION-POP, a subset of the LAION-5B dataset.

Of these, 1000 were manually cherry-picked to add to the training dataset, with the newest sample given double the chance of appearing in the training batch for the next round. This brought the total number of images to 45,000.

Next, five million text/image sample pairs were generated without further interaction, and filtered to a minimum score of 5.5, using the LAION Aesthetic threshold from LAION-5B, resulting in a million sample pairs.

Any samples without transparency were removed, along with any that were completely transparent. All images were then captioned with the open source multimodal GPT4-style captioning system LLaVA.

Lastly, the VAE and base diffusion model were fine-tuned again for 15,000 iterations on the one million images obtained.

To obtain a multi-layer dataset (where the layers can exist on separate strata, similar to Photoshop levels), an enhanced schema was adopted which includes text, foreground layer and background layer elements.

Multi-layer training schema.

A mixture of ChatGPT and LLaMA2 was used for the 900,000 requests for apposite text prompts for these images. These systems were asked to add the word ‘nothing’ to the background prompt, to characterize transparent areas semantically.

After the existing models from the non-layer version were used to generate images with transparency, the Stable Diffusion Inpainting model was used to inpaint all pixels with an alpha below a value of 1, in order to obtain intermediate prompts for the completed images.

The background alpha was then inverted and inpainted again with the background prompt, to obtain the relevant hindmost layer. This process was repeated one million times to obtain a million layer pairs.
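
The sketch below illustrates this two-stage inpainting step using the public diffusers inpainting pipeline; the model ID, prompts and alpha threshold are our own assumptions, since the authors' exact pipeline is not reproduced here:

```python
import numpy as np
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

# Illustrative two-stage layer extraction; model ID, prompts and the alpha
# threshold are assumptions for the sketch, not the authors' published settings.

pipe = StableDiffusionInpaintPipeline.from_pretrained(
    "runwayml/stable-diffusion-inpainting"
)

def mask_from_alpha(rgba: Image.Image, invert: bool = False) -> Image.Image:
    """White where inpainting should happen; by default, every not-fully-opaque pixel."""
    alpha = np.array(rgba.getchannel("A"))
    mask = alpha < 255
    if invert:
        mask = ~mask
    return Image.fromarray((mask * 255).astype(np.uint8))

full_prompt = "a cat sitting on a park bench"          # placeholder prompt
background_prompt = "an empty park bench, nothing"     # 'nothing' marks transparency

rgba = Image.open("foreground_with_transparency.png").convert("RGBA").resize((512, 512))
rgb = rgba.convert("RGB")

# 1) Inpaint every pixel whose alpha is below 1 to obtain the completed image.
intermediate = pipe(prompt=full_prompt, image=rgb,
                    mask_image=mask_from_alpha(rgba)).images[0]

# 2) Invert the alpha and inpaint the foreground region with the background
#    prompt, recovering the hindmost layer.
background_layer = pipe(prompt=background_prompt, image=intermediate,
                        mask_image=mask_from_alpha(rgba, invert=True)).images[0]
```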

For training, the AdamW optimizer was used at a learning rate of 1e-5 (the lowest practicable rate) for the VAE and base model. A network rank of 256 was used for LoRA training, for all layers.

For the human-in-the-loop filtering, each round of triage comprised 10,000 iterations at a batch size of 16.
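
Gathered into one place, the reported hyperparameters amount to something like the following configuration sketch; the placeholder module merely stands in for the real SDXL VAE, UNet and layer LoRAs:

```python
import torch
import torch.nn as nn

# The reported training hyperparameters in one place. The placeholder module
# below stands in for the real SDXL VAE/UNet and layer LoRAs.

model = nn.Linear(8, 8)                                      # placeholder network

optimizer = torch.optim.AdamW(model.parameters(), lr=1e-5)   # VAE and base model

LORA_RANK = 256                      # network rank for LoRA training, all layers
BATCH_SIZE_BASE = 8                  # VAE / base diffusion fine-tuning
BATCH_SIZE_HITL = 16                 # human-in-the-loop filtering rounds
ITERATIONS_PER_HITL_ROUND = 10_000
FINAL_FINETUNE_ITERATIONS = 15_000
```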

The systems were trained on four NVIDIA A100 GPUs, each with 80GB of VRAM, acting in concert via NVLink. The training took seven days, and was paused whenever human intervention was needed for the next round of optimization, with a total of 14.5 days of A100 compute time used. The cost was around 1,000 US dollars.

In terms of tests, the novelty of the system meant that there were no directly comparable prior systems to trial against LayerDiffuse, with the experiments thus limited to qualitative and ablative tests (the latter not covered here).

Qualitative results for LayerDiffuse. Please refer to source paper for superior resolution and detail.

Of the first qualitative results presented in the paper, using the base (rather than the layered) model, the researchers state:

‘These results showcase the model’s capability to generate natively transparent images that yield high-quality glass transparency, hair, fur, and semi-transparent effects like glowing light, fire, magic effect, etc. These results also demonstrate the model’s capability to generalize to diverse content topics.’

Qualitative rounds are also presented for the multi-layer model:

Qualitative results from tests on the layered model. Please refer to the source paper for better resolution and detail, and for extensive additional results that we do not have space to feature here.

Here the paper states:

‘These results showcase the model’s capability to generate harmonious compositions of objects that can be blended together seamlessly.

‘The layers are not only consistent with respect to illumination and geometric relationships, but also demonstrate the aesthetic quality of Stable Diffusion (e.g., the color choice of the background and foreground follows a learned distribution that looks harmonious and aesthetic).’

Conditional layer generation was also tested, where layer results have a deeper relationship to each other, but are nonetheless discretely represented (if not merged):

A limited selection of the paper's results for conditional generation. Please refer to the paper for additional results in this round, and for better resolution and detail.

Here the authors comment:

‘We can see that the model is able to generate consistent composition with coherent geometry and illumination. In the “bulb in the church” example, the model tries to generate a aesthetic symmetric design to match the foreground. The [“sitting on bench”] [example demonstrates] that the model is able to infer the interaction between foreground and background and generate corresponding geometry.’

Additional qualitative tests were undertaken for iterative generation, where composition can be achieved over an arbitrary number of layers. In this case, for each layer, all previously-generated layers are blended into a single RGB output and used to inform the background-conditioned foreground model.
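
A minimal sketch of this iterative loop is given below, with standard 'over' alpha compositing used to flatten the accumulated layers; the generate_foreground_layer function is a hypothetical stand-in for the background-conditioned foreground model:

```python
from PIL import Image

# Sketch of iterative layer generation: every new foreground layer is generated
# conditioned on the flattened composite of all previous layers.

def generate_foreground_layer(prompt: str, background: Image.Image) -> Image.Image:
    # Hypothetical stand-in for the background-conditioned foreground model;
    # it returns a fully transparent layer so that the sketch runs end-to-end.
    return Image.new("RGBA", background.size, (0, 0, 0, 0))

def composite_over(background: Image.Image, layer_rgba: Image.Image) -> Image.Image:
    """Standard 'over' alpha compositing of an RGBA layer onto an RGB canvas."""
    canvas = background.convert("RGBA")
    canvas.alpha_composite(layer_rgba)
    return canvas.convert("RGB")

canvas = Image.new("RGB", (1024, 1024), "white")          # starting background
prompts = ["a wooden table", "a cat on the table", "a book in front of the cat"]

layers = []
for prompt in prompts:
    # All previously generated layers, blended into a single RGB image, inform
    # the next background-conditioned foreground generation.
    layer = generate_foreground_layer(prompt, background=canvas)
    layers.append(layer)
    canvas = composite_over(canvas, layer)
```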

Examples of iterative generation in LayerDiffuse.

The paper states:

‘We [observe] that the model is able to interpret natural language in the context of the background image, e.g., generating a book in front of the cat. The model displays strong geometric composition [capabilities], e.g., composing a human sitting on a box.’

Finally, the system was tested with ControlNet providing structural guidance:

Integration of environment is evident and controllable in LayerDiffuse with ControlNet.

Regarding the ControlNet tests, the authors state:

‘We can see that the model is able to preserve the global structure according to the ControlNet signal to generate harmonious compositions with consistent illumination effects.

‘We also use a “reflective ball” example to show that the model is able to interact with the content of the foreground and background to generate consistent illumination like the reflections.’

To demonstrate that LayerDiffuse offers more potential than simply substituting SAM for segmentation, the authors produced samples with three segmentation/matting methods: PPMatting, which is reported to have achieved the highest matte precision to date; Matting Anything, a fine-tuned derivative of SAM; and ViTMatte, a high-quality matting method based on vision transformers.

Tests against 'pure' segmentation/matting reveal the distinction between LayerDiffuse's approach and straight extraction.

The paper comments here:

‘We can see that several types of patterns are difficult for matting approaches, e.g., semi-transparent effects like fire, pure white fur against a pure white background, shadow [separation]. For semi-transparent contents like fire and shadows, once these patterns are blended with complicated background, separating them becomes a nearly impossible task.

‘To obtain perfectly clean elements, probably the only method is to synthesize elements from scratch, using a native transparent layer generator. We further notice the potential to use outputs of our framework to train matting models.’

The researchers conducted a user study comparing the results from Stable Diffusion in conjunction with PPMatting and Matting Anything (it is not stated why ViTMatte was excluded from this study), in which participants were asked to rate images from LayerDiffuse against these two alternatives.

Further (‘Group 2’ in table below), the authors conducted another study in which participants were asked to compare the quality of LayerDiffuse-generated images against commercial stock.

Results from the user study comparing matting approaches.

Against the prior matting methods, LayerDiffuse largely won out here, with a 97% preference from study participants. Of the comparison to ‘real’ stock images, the authors comment:

‘[The] preference rate of our method is very close to commercial stock (45.3% v.s. 54.7%). Though the high-quality paid content from commercial stock is still preferred marginally.

‘This result indicates that our generated transparent content is competitive to commercial sources that require users to pay for each image.’

The paper contains extensive additional ablation and general test images, and we refer the reader to it, since there is not space here to reproduce all the samples.

Conclusion

It would be easy to conclude that LayerDiffuse represents a complete solution to entanglement (the extent to which elements within a generated image are essentially ‘glued together’, and difficult to pick apart). It is not quite that, since even the layered model produces images which contain reflections and environmental clues that relate to other layers.

Nonetheless, LayerDiffuse offers hope that something more controllable than post facto segmentation may be available to AI VFX production pipelines in the future. Starved as the sector is of truly granular control methods, this paper is of notable interest to VFX professionals.
