New System Enables Tight Compositions and High-Res Output in Stable Diffusion

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

A researcher from the Knowledge Engineering Institute (Instituto de Ingeniería del Conocimiento) of the Autonomous University of Madrid has developed a method that offers greater compositional control over the output of text-to-image prompts in Stable Diffusion, and the ability to produce image sizes up to 4K, while still maintaining the ability to run on consumer-level hardware.

Additionally, the method, called Mixture of Diffusers (MoD), which composes multiple Stable Diffusion models inside the latent space to form ‘orchestrated’ image generations, can make use of the CPU instead of the more expensive (and usually less abundant) VRAM in the user’s graphics card.

Top right, the placement map for the output of different prompts; below left, the various contributing prompts, color-coded to the areas indicated in the map. Source: https://arxiv.org/pdf/2302.02412.pdf

In the image above, we can see the mapped areas (top right) where the image elements are intended to be imposed into the picture, together with the multiple separate prompts that are defining the images (bottom left, each color-coded to the mapping). In the center, we see the generated result; the seams between the six generated images have been created in the latent space, and would be practically impossible to replicate by attempting to ‘stitch together’ six separate generations, not least because no one image would segue into the borders of its adjacent image in so smooth a way.
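Conceptually, each color-coded area in the map is just a rectangle paired with its own prompt. A minimal sketch of such a region map (the `Region` class, field names, and example prompts here are illustrative, not taken from the paper's code):

```python
from dataclasses import dataclass

@dataclass
class Region:
    """A rectangular canvas area tied to its own text prompt (illustrative)."""
    row0: int   # top edge, in pixels
    col0: int   # left edge, in pixels
    row1: int   # bottom edge (exclusive)
    col1: int   # right edge (exclusive)
    prompt: str

def overlap(a: Region, b: Region) -> bool:
    """True if two regions share any pixels -- where latent blending occurs."""
    return a.row0 < b.row1 and b.row0 < a.row1 and a.col0 < b.col1 and b.col0 < a.col1

# Two hypothetical regions placed side by side with a shared vertical band,
# in the spirit of the six color-coded prompts in the figure
sky = Region(0, 0, 512, 768, "a photo of a clear blue sky")
castle = Region(0, 512, 512, 1280, "a fantasy castle on a hill")
```

Where two regions overlap, both prompts contribute to the shared band, which is what allows the seams to be resolved in the latent space rather than by pixel stitching.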

The ability to corral multiple prompts into a single unified output also offers the possibility of creating smooth transitions between styles, as demonstrated in the image below, which cycles through variegated artist domains in a graduated way that would be challenging with less intrinsic methods:

Cycling through artistic styles with Mixture of Diffusers.

MoD can also replicate the outpainting functionality that may be familiar to users of DALL-E 2, where the canvas bounds can be arbitrarily extended in any direction, while keeping faith with the content of the original, smaller image, and/or extending the semantic content of the extra material with alternate or amended prompts:

Outpainting with guide image in Mixture of Diffusers (MoD).

High-Res Output

The system orchestrates any number of existing model instances without requiring greater resources by default. If resources are low, the elements that will enter into the final image generation can be created sequentially; if resources are abundant, quicker results can be obtained by the parallel processing of models.

Mixture of Diffusers defines multiple regions to which the user can assign prompts, with each pixel region mapped to a latent region whose size is divided by the upscaling factor U in the encoder.
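For Stable Diffusion's VAE, the upscaling factor is U = 8, so a pixel-space rectangle maps to a latent rectangle one-eighth the size in each dimension. A minimal sketch of that mapping (the function name is illustrative):

```python
def pixel_rect_to_latent(rect, upscale_factor=8):
    """Map a pixel-space rectangle (top, left, bottom, right) to latent coords.

    Stable Diffusion's VAE encoder downsamples each spatial dimension by a
    factor of 8, so every latent coordinate is the corresponding pixel
    coordinate divided by that factor.
    """
    if any(v % upscale_factor for v in rect):
        raise ValueError("region edges should align with the upscaling factor")
    return tuple(v // upscale_factor for v in rect)

# A 512x512 pixel region becomes a 64x64 latent region
latent_rect = pixel_rect_to_latent((0, 0, 512, 512))
```

This is why region boundaries are most naturally specified in multiples of 8 pixels: they then fall exactly on latent-grid lines.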

MoD also offers the ability to edit existing images, and allows for direct and specific placement of an element:

Precise element placement is possible in MoD, as well as SD-Edit-style editing of existing images, by converting them into latent vectors.

Regarding the system’s ability to generate images at resolutions that are typically beyond a domestic Stable Diffusion set-up, the author comments:

‘In our experiments we were able to [enable] high-resolution image generation up to 4K images (3840×2160) while still using a 16GB RAM GPU.’

Though the images are essentially stitched together, the fact that they are composed in the latent space instead of in pixel space means that semantic gradients between the images (‘overlap’ areas that are 256 pixels wide/tall) can much more effectively transition from one image to another, while retaining overall compositionality.
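The cross-fade over an overlap band can be pictured with a toy one-dimensional example (illustrative only: the paper blends per-region noise predictions in the latent space, whereas this sketch blends plain signals with a linear ramp):

```python
import numpy as np

def blend_1d(tile_a, tile_b, overlap):
    """Cross-fade two 1-D signals over an `overlap`-sample shared band.

    tile_a's last `overlap` samples coincide with tile_b's first `overlap`
    samples; a linear ramp takes tile_a's weight from 1 down to 0 across the
    band while tile_b's weight rises from 0 to 1, so the weights always sum
    to one and the transition is smooth rather than a hard seam.
    """
    n = len(tile_a) + len(tile_b) - overlap
    w_a = np.ones(len(tile_a))
    w_a[-overlap:] = np.linspace(1.0, 0.0, overlap)
    w_b = np.ones(len(tile_b))
    w_b[:overlap] = np.linspace(0.0, 1.0, overlap)

    out = np.zeros(n)
    wsum = np.zeros(n)
    out[:len(tile_a)] += w_a * np.asarray(tile_a, float)
    wsum[:len(tile_a)] += w_a
    out[n - len(tile_b):] += w_b * np.asarray(tile_b, float)
    wsum[n - len(tile_b):] += w_b
    return out / wsum
```

In MoD the analogous weighting is applied to each region's noise prediction at every denoising step, which is why the transition carries semantic content rather than merely averaging pixels.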

The overall image remains subject to a specific compositional tone. For instance, the paper observes, in the outpainting example above (the image of a building extended upwards), the building photo has been used as a guide image within the purview of a wider prompt, and is not simply a discrete element ‘glued’ into another picture.

Though no indications have been given of the extent to which transferring MoD’s operations to the CPU affects latency at inference time, the paper comments:

‘We should also note that when using a Latent Diffusion Model (such as Stable Diffusion in our experiments) the encoder and decoder networks do have a memory cost that grows linearly with the number of pixels in the images being processed. Still, since these networks are reasonable fast to evaluate, we can run the encoding and decoding processes in CPU, [where] RAM is [a] cheaper resource.’

Beyond Mere Semantics

As the new preprint points out, Stable Diffusion not only has difficulty placing elements where users want them to appear within the image, but also struggles to include every specified element in a prompt. By default, the system prioritizes facets mentioned early in a prompt; when the prompt is very long, SD tends to jettison the lower-priority elements (i.e., those that appear later in the sentence).

To this end, a number of systems, including the popular AUTOMATIC1111 distribution, allow users to add prompt weights to laggard words, so that something akin to equal priority can be given to a sequence of word tokens.

For example, the contributing weight of the prompt elements in a cat sunbathing on a beach can be roughly equalized by adding growing parentheses around the trailing words, i.e., a cat (sunbathing) on a ((beach)). By the same device, extra attention can be assigned to any particular prompt element that should not be considered ‘expendable’.
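The parenthesis convention can be modeled in a few lines. Below is a simplified, illustrative sketch: the `prompt_weights` helper is hypothetical, and the 1.1-per-level boost follows AUTOMATIC1111's commonly cited convention, but this is not that project's actual parser (which also supports `[...]` de-emphasis and explicit `(word:1.5)` weights):

```python
def prompt_weights(prompt, boost=1.1):
    """Assign each word a weight of boost**depth, where depth is the number
    of parentheses enclosing it (simplified model of the AUTOMATIC1111
    emphasis convention, which multiplies attention by ~1.1 per level)."""
    weights = {}
    depth = 0
    word = ""
    word_depth = 0
    for ch in prompt + " ":
        if ch in "()":
            if word:                       # flush word before depth changes
                weights[word] = round(boost ** word_depth, 3)
                word = ""
            depth += 1 if ch == "(" else -1
        elif ch == " ":
            if word:
                weights[word] = round(boost ** word_depth, 3)
                word = ""
        else:
            if not word:                   # remember depth where word began
                word_depth = depth
            word += ch
    return weights

weights = prompt_weights("a cat (sunbathing) on a ((beach))")
```

Under this toy model, `sunbathing` receives a 1.1× boost and `beach` a 1.21× boost relative to the unparenthesized words.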

Since such semantic tricks have long been considered a crude workaround by Stable Diffusion users, a new tranche of research has formed in the last six months, aimed at allowing more precise control over layout and editability of text-to-image generations. In the last few weeks alone, a number of such approaches have been published, such as Attend-and-Excite, GANalyzer, Text-driven image Manipulation framework via Space Alignment (TMSA), and GLIGEN.

Approach

MoD works by making use of an arbitrary number of noise prediction models that gradually build up a global noise prediction (i.e., the encompassing ‘tone’ of the image).

One limitation of the system is that it operates only on rectangular areas, though the author envisages that a masking scheme could be incorporated. However, since the gradients between the sub-generations are handled in the latent space, there is some accommodation for non-rectangular content in any case, as the system accounts for irregular shapes in the ordinary course of calculating the 256px of ‘shared space’.

A 4K 'composite' image (3840x2160px) produced from 8x11 models arranged into a grid, each generated by the prompt 'Magical diagrams and runes written with chalk on a blackboard, elegant, intricate, highly detailed, smooth, sharp focus, vibrant colors, artstation, stunning masterpiece'. Source: https://albarji-mixture-of-diffusers-paper.s3.eu-west-1.amazonaws.com/4Kchalkboard.png

The paper notes that the overall noise estimation generated from the sub-generations takes a form that’s similar to Classifier Free Guidance (CFG) in Stable Diffusion – a popular system where fidelity to the user’s text-prompt can be traded off against the quality of output (though the results can be unpleasant if the model in question does not support the prompted content).

The final overall noise prediction (the combined ‘map’ that will produce the orchestrated image) is calculated from a weighted mixture of each individual noise prediction among the arbitrary number of prompt-rectangles.
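In simplified form, that weighted mixture can be sketched as follows (a toy NumPy illustration, not the paper's implementation; each prediction and weight map is assumed to cover the full canvas and to be zero outside its own region):

```python
import numpy as np

def mix_noise_predictions(preds, weights):
    """Combine per-region noise predictions into one global prediction.

    preds:   list of full-canvas arrays, zero outside each region
    weights: matching list of weight maps (also zero outside each region)

    The global prediction at every location is the weight-normalized sum of
    the contributing regional predictions, mirroring the paper's weighted
    mixture over an arbitrary number of (possibly overlapping) regions.
    """
    num = sum(w * p for w, p in zip(weights, preds))
    den = sum(weights)
    return num / np.maximum(den, 1e-8)   # guard against empty locations
```

At locations covered by a single region the regional prediction passes through unchanged; in overlap bands the result is a smooth weighted average of the contributing predictions.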

The system is very economical:

‘In practice, for efficiency reasons, all [models] use the same underlying noise-prediction [network]. Because of this, using more than one model does not necessarily result in a larger memory footprint, as only one model is loaded into memory and the model calls can be made sequentially. Alternatively, if ample memory is available, all [calls] can be batched together for better GPU efficiency.’
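That trade-off (one shared network called once per region sequentially, or a single batched call when memory allows) can be illustrated with a toy stand-in for the denoising model; the `predict_noise` function below is purely illustrative, standing in for a real U-Net call:

```python
import numpy as np

def predict_noise(latents):
    """Stand-in for a single denoising U-Net call on a batch of latents.

    A real system would invoke the shared noise-prediction network here;
    this toy just scales its input so the two strategies can be compared.
    """
    return 0.5 * latents

regions = [np.full((4, 4), float(i)) for i in range(3)]

# Low memory: call the single shared model once per region, sequentially.
sequential = [predict_noise(region[None])[0] for region in regions]

# Ample memory: stack all regions into one batch and call the model once.
batched = predict_noise(np.stack(regions))
```

Both strategies produce identical predictions; batching simply exploits the GPU's parallelism when there is room for the larger activation footprint.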

An increasing rarity in the image generation space, the Mixture of Diffusers project has been made available on GitHub.
