A researcher from the Knowledge Engineering Institute of the University of Madrid has developed a method that offers greater compositional control over the output of text-to-image prompts in Stable Diffusion, and the ability to produce image sizes up to 4K, while still maintaining the ability to run on consumer-level hardware.
Additionally, the method, called Mixture of Diffusers (MoD), which composes multiple Stable Diffusion models inside the latent space to form ‘orchestrated’ image generations, can make use of the CPU instead of the more expensive (and usually less abundant) VRAM in the user’s graphics card.
In the image above, we can see the mapped areas (top right) where the image elements are intended to be imposed into the picture, together with the multiple separate prompts that are defining the images (bottom left, each color-coded to the mapping). In the center, we see the generated result; the seams between the six generated images have been created in the latent space, and would be practically impossible to replicate by attempting to ‘stitch together’ six separate generations, not least because no one image would segue into the borders of its adjacent image in so smooth a way.
The ability to corral multiple prompts into a single unified output also offers the possibility of creating smooth transitions between styles, as demonstrated in the image below, which cycles through variegated artist domains in a graduated way that would be challenging with less intrinsic methods:
MoD can also replicate the outpainting functionality that may be familiar to users of DALL-E 2, where the canvas bounds can be arbitrarily extended in any direction, while keeping faith with the content of the original, smaller image, and/or extending the semantic content of the extra material with alternate or amended prompts:
The system orchestrates any number of existing model instances without requiring greater resources by default. If resources are low, the elements that will enter into the final image generation can be created sequentially; if resources are abundant, quicker results can be obtained by the parallel processing of models.
MoD also offers the ability to edit existing images, and allows for direct and specific placement of an element:
Regarding the system’s ability to generate images at resolutions that are typically beyond a domestic Stable Diffusion set-up, the author comments:
‘In our experiments we were able to [enable] high-resolution image generation up to 4K images (3840×2160) while still using a 16GB RAM GPU.’
Though the images are essentially stitched together, the fact that they are composed in the latent space instead of in pixel space means that semantic gradients between the images (‘overlap’ areas that are 256 pixels wide/tall) can much more effectively transition from one image to another, while retaining overall compositionality.
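To put the overlap figure in latent terms: Stable Diffusion's autoencoder downsamples each side by a factor of 8, so a 256-pixel overlap corresponds to 32 latent units, and a 3840×2160 canvas to a 480×270 latent grid. The sketch below lays out overlapping tiles along one axis. The 512-pixel (64-latent) tile size and the helper name tile_starts are assumptions made for illustration, not details taken from the paper.

```python
def tile_starts(canvas: int, tile: int, overlap: int) -> list[int]:
    """Start offsets of `tile`-sized windows covering `canvas` along one axis.

    Neighbouring windows share at least `overlap` units; the last window is
    placed flush with the canvas edge, so it may overlap its neighbour more.
    """
    if not 0 <= overlap < tile:
        raise ValueError("overlap must be non-negative and smaller than tile")
    if canvas <= tile:
        return [0]
    stride = tile - overlap
    starts = list(range(0, canvas - tile, stride))
    starts.append(canvas - tile)  # final window ends exactly at the edge
    return starts


LATENT_SCALE = 8  # Stable Diffusion's VAE downsampling factor per side

# Assumed tile size of 512 px and the article's 256 px overlap, in latent units.
cols = tile_starts(3840 // LATENT_SCALE, 512 // LATENT_SCALE, 256 // LATENT_SCALE)
rows = tile_starts(2160 // LATENT_SCALE, 512 // LATENT_SCALE, 256 // LATENT_SCALE)
print(len(cols), len(rows))  # 14 8
```

Under these assumed settings a 4K canvas would be covered by a 14×8 grid of tiles; a different tile size or overlap changes the count accordingly.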
The overall image remains subject to a specific compositional tone. For instance, the paper observes, in the outpainting example above (the image of a building extended upwards), the building photo has been used as a guide image within the purview of a wider prompt, and is not simply a discrete element ‘glued’ into another picture.
Though no indications have been given of the extent to which transferring MoD’s operations to the CPU affects latency at inference time, the paper comments:
‘We should also note that when using a Latent Diffusion Model (such as Stable Diffusion in our experiments) the encoder and decoder networks do have a memory cost that grows linearly with the number of pixels in the images being processed. Still, since these networks are reasonable fast to evaluate, we can run the encoding and decoding processes in CPU, [where] RAM is [a] cheaper resource.’
Beyond Mere Semantics
As the new preprint points out, Stable Diffusion not only has difficulty in placing elements where users want them to appear inside the image, but also struggles to include every single specified element in a prompt. By default, the system prioritizes facets that are mentioned early inside a prompt; in cases where the prompt is very long, SD tends to jettison the lower-prioritized prompt elements (i.e., those that appear later in the sentence).
To counter this, a number of systems, including the popular AUTOMATIC1111 distribution, allow users to add prompt weights to laggard words, so that something akin to equal priority can be given to a sequence of word tokens.
For example, the contributing weight of the prompt elements in ‘a cat sunbathing on a beach’ can be roughly equalized by wrapping the trailing words in progressively more parentheses, i.e., ‘a cat (sunbathing) on a ((beach))’. By the same device, extra attention can be assigned to any particular prompt element that should not be considered ‘expendable’.
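As a rough illustration of how nested parentheses translate into weights: in AUTOMATIC1111's emphasis syntax, each enclosing pair of parentheses multiplies a word's attention by 1.1. The sketch below handles only that plain nested-parentheses case; the function name paren_weights is hypothetical, and the real parser supports further syntax (such as square brackets and explicit numeric weights) that is not modeled here.

```python
def paren_weights(prompt: str, factor: float = 1.1) -> list[tuple[str, float]]:
    """Return (word, weight) pairs, where weight = factor ** nesting depth.

    Simplified sketch: only '(' and ')' are interpreted; everything else is
    treated as ordinary text split on whitespace.
    """
    pairs: list[tuple[str, float]] = []
    chars: list[str] = []
    depth = 0

    def flush() -> None:
        # Emit the word collected so far at the current nesting depth.
        if chars:
            pairs.append(("".join(chars), round(factor ** depth, 4)))
            chars.clear()

    for ch in prompt:
        if ch == "(":
            flush()
            depth += 1
        elif ch == ")":
            flush()
            depth = max(depth - 1, 0)  # ignore unbalanced closing brackets
        elif ch.isspace():
            flush()
        else:
            chars.append(ch)
    flush()
    return pairs


print(paren_weights("a cat (sunbathing) on a ((beach))"))
# [('a', 1.0), ('cat', 1.0), ('sunbathing', 1.1), ('on', 1.0), ('a', 1.0), ('beach', 1.21)]
```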
Since such semantic tricks have long been considered a crude workaround by Stable Diffusion users, a new tranche of research has formed in the last six months, aimed at allowing more precise control over layout and editability of text-to-image generations. In the last few weeks alone, a number of such approaches have been published, such as Attend-and-Excite, GANalyzer, Text-driven image Manipulation framework via Space Alignment (TMSA), and GLIGEN.
MoD works by making use of an arbitrary number of noise prediction models that gradually build up a global noise prediction (i.e., the encompassing ‘tone’ of the image).
One limitation of the system is that it operates only on rectangular areas, though the author envisages that a masking mechanism could be added. However, since the gradients between the sub-generations are handled in the latent space, there is some accommodation in any case for non-rectangular content, in terms of the system accounting for irregular shapes in the ordinary course of calculating the 256px of ‘shared space’.
The paper notes that the overall noise estimation generated from the sub-generations takes a form that’s similar to Classifier-Free Guidance (CFG) in Stable Diffusion, a popular system where fidelity to the user’s text prompt can be traded off against the quality of output (though the results can be unpleasant if the model in question does not support the prompted content).
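For readers unfamiliar with it, the standard form of classifier-free guidance combines an unconditional and a prompt-conditioned noise prediction, pushing the result away from the former and towards the latter. The sketch below shows that general formula only; it is not MoD-specific code, and the function and argument names are illustrative.

```python
import numpy as np


def cfg_noise(eps_uncond: np.ndarray, eps_cond: np.ndarray,
              guidance_scale: float) -> np.ndarray:
    """Standard classifier-free guidance combination of two noise predictions.

    guidance_scale = 0 returns the unconditional prediction, 1 returns the
    conditioned one, and larger values extrapolate further towards the prompt.
    """
    return eps_uncond + guidance_scale * (eps_cond - eps_uncond)


# Toy latents standing in for real network outputs (shape: channels, h, w).
uncond = np.zeros((4, 8, 8))
cond = np.ones((4, 8, 8))
guided = cfg_noise(uncond, cond, guidance_scale=7.5)
```

Higher guidance scales follow the prompt more literally, which is the trade-off against output quality mentioned above.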
The final overall noise prediction (the combined ‘map’ that will produce the orchestrated image) is calculated from a weighted mixture of each individual noise prediction among the arbitrary number of prompt-rectangles.
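A weighted mixture of this kind can be sketched for a single denoising step as follows. This is a minimal illustration under stated assumptions rather than the paper's implementation: the tent-shaped weight mask, the region tuple layout, and the fake_predict stand-in (which replaces a real noise-prediction network) are all hypothetical.

```python
import numpy as np


def tent(n: int) -> np.ndarray:
    """1-D weights that peak at the centre and stay positive at the edges."""
    x = np.arange(n) + 0.5
    return np.minimum(x, n - x) / (n / 2)


def fake_predict(latents: np.ndarray, prompt: str) -> np.ndarray:
    """Stand-in for a noise-prediction network: a constant set by the prompt."""
    return np.full(latents.shape, float(len(prompt)))


def mix_noise(latents: np.ndarray, regions, predict=fake_predict) -> np.ndarray:
    """Blend per-region noise predictions into one canvas-wide prediction.

    latents: array of shape (channels, height, width).
    regions: iterable of (top, left, height, width, prompt) rectangles.
    """
    _, h, w = latents.shape
    total = np.zeros(latents.shape, dtype=float)
    weight_sum = np.zeros((h, w), dtype=float)
    for top, left, rh, rw, prompt in regions:
        rows, cols = slice(top, top + rh), slice(left, left + rw)
        eps = predict(latents[:, rows, cols], prompt)
        mask = np.outer(tent(rh), tent(rw))  # per-position weight for this region
        total[:, rows, cols] += eps * mask
        weight_sum[rows, cols] += mask
    if np.any(weight_sum == 0):
        raise ValueError("regions must cover the whole canvas")
    return total / weight_sum  # normalise so the weights sum to 1 at each position


canvas = np.zeros((4, 4, 8))
regions = [(0, 0, 4, 6, "a"), (0, 2, 4, 6, "abc")]  # overlap on columns 2-5
mixed = mix_noise(canvas, regions)
```

Positions covered by a single rectangle keep that rectangle's prediction unchanged, while positions in the overlap receive a blend that shifts gradually from one prediction to the other.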
The system is also economical with memory:
‘In practice, for efficiency reasons, all [models] use the same underlying noise-prediction [network]. Because of this, using more than one model does not necessarily result in a larger memory footprint, as only one model is loaded into memory and the model calls can be made sequentially. Alternatively, if ample memory is available, all [calls] can be batched together for better GPU efficiency.’
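The trade-off described in the quote can be sketched as two ways of calling one shared network: region by region, or stacked into a single batch. The shared_network function below is a hypothetical stand-in with toy arithmetic, and the scalar ‘conditioning’ values are placeholders for real prompt embeddings; batching in this form also assumes that all regions have the same size.

```python
import numpy as np


def shared_network(latents: np.ndarray, cond: np.ndarray) -> np.ndarray:
    """Hypothetical stand-in for the single shared noise-prediction network.

    latents: (batch, channels, height, width); cond: (batch,) one scalar each.
    """
    return 0.5 * latents + cond[:, None, None, None]


def predict_sequential(crops, conds):
    """One call per region: only one crop's activations are held at a time."""
    return [shared_network(crop[None], np.array([c]))[0]
            for crop, c in zip(crops, conds)]


def predict_batched(crops, conds):
    """A single call over all regions: more memory, fewer network calls."""
    out = shared_network(np.stack(crops), np.array(conds))
    return list(out)


rng = np.random.default_rng(0)
crops = [rng.normal(size=(4, 8, 8)) for _ in range(3)]
conds = [0.1, 0.2, 0.3]
seq = predict_sequential(crops, conds)
bat = predict_batched(crops, conds)
```

Both paths produce the same predictions; the choice only affects how much memory is needed at once and how many separate calls are made.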