It is quite hard to make Stable Diffusion do exactly what you want. This first year of the text-to-image revolution that followed the generative image architecture’s release in August of 2022 has, to a certain extent, been an ‘exploratory’ period; a time for users to gain some idea of the billions of possibilities for image (and even video) creation that are baked into even the basic, ‘stock’ Stable Diffusion models, with thousands more custom-crafted creations shared every day through resources such as civit.ai.
However, more and more, as the first rush of awe finally begins to tarnish a little, using Stable Diffusion can feel more like zapping through the biggest package of TV channels ever – there’s something for everyone; there’s even something real close to what you had in mind; but, as the Rolling Stones once pointed out, you can’t usually get exactly what you want.
The problem is entanglement – the extent to which the symbiosis of words and images trained into the models can become bound up and hard to pick apart. Whatever you ask for seems to come with something else that you didn’t ask for.
Because of entanglement, a Stable Diffusion model trained on (for instance) a particular celebrity may begin to associate a number of irrelevant image factors with the core concept being summoned up by a text prompt. For example, if a model trained on images of Tom Cruise happens to have many photos of Cruise against a green studio background, that background is likely to keep cropping up in image generations that say nothing about a green background. If the model is a little overfitted, even negative prompts won’t be able to keep that green wall out of your images.
This can be mitigated a little by tedious annotation in adjunct systems such as Low-Rank Adaptation (LoRA), which, unlike standard DreamBooth methodologies, not only produces a much smaller and more portable model, but also allows the user to label every item in every image in the training dataset.
The idea here is that if the model understands what each item is, it will, when trained, only reproduce those items if you specifically ask it to, and it won’t assume that something that Tom Cruise is holding or wearing in a number of pictures should be associated with the central Tom Cruise concept.
Given this approach, if automated object recognition systems such as WD14, BLIP and CLIP were more reliable, this might be the answer to disentanglement; as it stands, they can only offer a rough ‘first pass’ of tagging and labeling, and the user still faces hours of tedious curation, even for a very modest dataset of 100 images (which is fairly standard for a LoRA personalization model). At the hyperscale level of popular training datasets such as LAION, this level of manual intervention is inconceivable.
Changing the Rules
Now a new paper from China is offering a method of improving the fidelity of results to a user’s prompt, without custom-trained, per-use-case secondary solutions such as LoRA or DreamBooth, by altering the fundamental way that Stable Diffusion processes the user’s commands through the model’s latent space.
In the example above, a stylized source image is provided (far left), together with the prompts ‘a dog in a bucket’ and ‘a motorcycle’. Since there’s really nothing in the source image that contains any elements from the text prompt, the user’s expectation is that the system will adopt the style of the source image. In the second-from-left column, we can see that SD has simply taken the text prompt literally, and pretty much jettisoned the source image style.
In a simpler method (third column from left), which uses only the first half of the researchers’ new approach, we see that the style, but not the text content, has been represented in the output.
In the final column, using the full method proposed, both the image style and the text content have been represented equally – a facility that Stable Diffusion, as users will know, is very unlikely to produce by default.
In tests, the researchers found that the new approach, titled StyleAdapter, consistently achieves a better balance between the input elements, compared to former approaches, without the need to train LoRA files, but rather by interfering with the core functionality of Stable Diffusion in a multipart module that is broadly applicable to all uses.
The new paper is called StyleAdapter: A Single-Pass LoRA-Free Model for Stylized Image Generation, and comes from seven researchers across The University of Hong Kong, ARC Lab, Tencent PCG, the University of Macau, the Shenzhen Institute of Advanced Technology, and Shanghai AI Laboratory.
StyleAdapter uses two sequential approaches to achieve this decoupling: a Simple Combination (SC) module, which consists of a two-path cross-attention module (TPCA); and a group of three diverse adjunct treatments: a) ‘shuffling’ the style reference image (i.e., interfering with the way that CLIP treats generated patches at inference time); b) suppressing the related CLIP class embedding; and c) providing multiple images (similar to a tiny training dataset, but an ad hoc collection intended to influence a sole image).
The first phase, the Simple Combination stage (TPCA module), consists of two parallel cross-attention modules, and edits the feature prompt that’s automatically generated by CLIP in Stable Diffusion so that a predefined learnable embedding, based on the user input, is added to the process. The embedding is passed through three Transformer blocks (see image below).
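To make the two-path idea concrete, here is a minimal, dependency-free sketch of how a TPCA-style fusion might work: one attention path attends over the text (prompt) tokens, as in stock Stable Diffusion, while a second path attends over the learned style embedding, with the two outputs summed. The single-head formulation, the additive fusion and the `style_weight` parameter are illustrative assumptions, not the paper’s exact implementation.

```python
import math

def softmax(xs):
    """Numerically stable softmax over a list of scores."""
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def cross_attention(query, keys, values):
    """Single-head scaled dot-product cross-attention for one query vector.
    query: list[float]; keys/values: list of vectors of the same dimension."""
    d = len(query)
    scores = [sum(q * k for q, k in zip(query, key)) / math.sqrt(d) for key in keys]
    weights = softmax(scores)
    return [sum(w * v[i] for w, v in zip(weights, values)) for i in range(len(values[0]))]

def two_path_cross_attention(query, text_tokens, style_tokens, style_weight=1.0):
    """TPCA sketch (assumption): the text path mirrors Stable Diffusion's
    frozen prompt cross-attention; the style path attends over the learned
    style embedding; the outputs are fused by weighted addition."""
    text_out = cross_attention(query, text_tokens, text_tokens)
    style_out = cross_attention(query, style_tokens, style_tokens)
    return [t + style_weight * s for t, s in zip(text_out, style_out)]

# Usage: a toy U-Net feature query attending over two text tokens and one style token.
q = [1.0, 0.0]
fused = two_path_cross_attention(q, [[1.0, 0.0], [0.0, 1.0]], [[0.5, 0.5]])
```

Setting `style_weight=0.0` recovers the plain text path, which is one way to see that the frozen original cross-attention is left undisturbed by the added style path.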
‘This approach can achieve a desirable stylization effect. However, it faces two major challenges: 1) the prompt loses controllability over the generated content, and 2) the generated image inherits both the semantic and style features of the style reference images, compromising its content fidelity.’
For one of the twin modules, Stable Diffusion’s original cross-attention representation is retained during the process, and the relevant layers frozen, so that the augmenting modules do not interfere with it. The second module is implemented with the same schema as SD, and is trained to conform to the style reference of the embedding.
The three subsequent decoupling strategies, the authors found, mitigate the limitations of the initial two-path, Transformer-based approach.
The shuffling process, the authors explain, can aid separation of style and content, but at some cost, which is why it needs to be used in tandem with additional methods:
‘[Shuffling] not only disturbs the semantic information but also breaks the coherence of the style and textural information in the image, which affects the learning of style features.
‘Therefore, to avoid multiple cropping, instead of shuffling the raw style references, our shuffling design conforms to the design of the vision model.’
The Class Embedding Suppression stage, the second part in the tripartite second module phase, effectively removes CLIP’s judgement on the embedding, which allows for substitution of the augmented embedding.
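The shuffling and class-embedding-suppression treatments can be sketched as simple operations on CLIP’s vision tokens. The token layout (a global class token plus per-patch tokens) matches CLIP’s ViT design, but the exact operations below are illustrative assumptions rather than the paper’s implementation:

```python
import random

def decouple_style_features(patch_tokens, class_token,
                            shuffle=True, suppress_class=True, seed=0):
    """Sketch of two decoupling treatments on CLIP vision features:
    - shuffling permutes the order of the patch tokens, disturbing spatial/
      semantic structure while preserving per-patch style statistics;
    - suppressing the class embedding drops CLIP's global semantic summary
      of the reference image, so only style-bearing patch features remain.
    """
    tokens = list(patch_tokens)
    if shuffle:
        rng = random.Random(seed)  # seeded for reproducibility in this sketch
        rng.shuffle(tokens)        # permute patch order, not patch contents
    if suppress_class:
        return tokens              # class token removed entirely
    return [class_token] + tokens

# Usage: four toy one-dimensional patch tokens and a class token.
patches = [[0.0], [1.0], [2.0], [3.0]]
style_tokens = decouple_style_features(patches, class_token=[9.0])
```

The intuition is that the same set of patch features survives (so texture and palette information is intact), while the ordering and the global class summary that carry semantic content are destroyed or removed.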
The final part of the second module is the use of multiple style references, where the vision model in CLIP extracts corresponding features from all the collected images.
This is perhaps the most innovative part of StyleAdapter. Though ControlNet allows for the input of multiple images when performing image-to-image operations, this is more to do with structure than semantic meaning. By contrast, the multiple style references in StyleAdapter really do amount to a one-shot model training pipeline aimed at a single generation, trawling several images for common styles and averaging out an output style from this concatenation.
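At its simplest, ‘averaging out an output style’ from several references amounts to pooling the per-image CLIP features into one embedding. Mean pooling, shown below, is an assumed simplification; the paper’s StyEmb fusion is more elaborate:

```python
def pooled_style_embedding(reference_features):
    """Pool per-image style feature vectors into a single style embedding
    by element-wise mean (a deliberate simplification of StyleAdapter's
    multi-reference fusion)."""
    n = len(reference_features)
    dim = len(reference_features[0])
    return [sum(f[i] for f in reference_features) / n for i in range(dim)]

# Usage: two toy 2-dimensional per-image features.
style = pooled_style_embedding([[1.0, 2.0], [3.0, 4.0]])  # → [2.0, 3.0]
```

The effect is that idiosyncrasies of any single reference image are diluted, while features common to the whole set – the shared ‘style’ – dominate the pooled embedding.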
Data and Tests
Training data for the tests for StyleAdapter was supplied by the LAION-AESTHETICS dataset. The authors constructed a test set from this collection of prompts, images and style references.
ChatGPT was used to generate various prompts, with low-quality prompts manually filtered out, to arrive at a total of 50 prompts. Since some of the methods tested – CAST and StyTR – require additional images as content input, all of the tests were constructed to accommodate this. Again, 50 images were selected.
The authors curated eight sets of style references obtained from the internet, each with 5-14 images, used as multi-reference inputs. One representative image was also selected from each of these sets as the ‘target’ image/concept. Thus the tests make use of 400 test pairs in total.
Stable Diffusion V1.5 was used for the tests, along with the text and vision encoders from CLIP, implemented with a sizable Vision Transformer (ViT), with a patch size of 14.
The authors fixed the parameters of the original Stable Diffusion installation and associated CLIP modules, and only updated the weights of the novel StyEmb module and the cross-attention modules. The optimizer was Adam, run at a learning rate of 8×10⁻⁶, at a batch size of 8.
The tests were run on an NVIDIA Tesla 32G-V100 GPU. Input and style images were resized, respectively, to 512x512px and 224x224px. For data augmentation, random cropping, resizing, horizontal flipping and rotation were used, among others, notably increasing the diversity of available training images.
To apply qualitative and quantitative metrics to results, the authors conducted a human user study, and also used a CLIP-based metric to compute the cosine similarity between the source prompt and the generated images (denoted in results as Text-Sim), and also between the generated image and the source style image (called Style-Sim in results).
Additionally, Fréchet Inception Distance (FID) was used as an evaluative metric.
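Both Text-Sim and Style-Sim reduce to the same operation: cosine similarity between two CLIP embeddings. The sketch below assumes the embeddings have already been extracted (prompt and images passed through CLIP’s text and vision encoders respectively); only the metric itself is shown:

```python
import math

def cosine_similarity(a, b):
    """Cosine similarity between two embedding vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def text_sim(prompt_embedding, generated_embedding):
    """Text-Sim: similarity of the source prompt to the generated image,
    both in CLIP embedding space (embedding extraction not shown)."""
    return cosine_similarity(prompt_embedding, generated_embedding)

def style_sim(generated_embedding, style_embedding):
    """Style-Sim: similarity of the generated image to the style reference."""
    return cosine_similarity(generated_embedding, style_embedding)

# Usage: identical embeddings score 1.0; orthogonal embeddings score 0.0.
assert text_sim([1.0, 0.0], [1.0, 0.0]) == 1.0
```

A good stylization method should score well on both metrics simultaneously; scoring high on one while collapsing on the other is precisely the content/style imbalance the paper is addressing.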
For the first round of tests, frameworks tried included CAST, StyTR, and three methods based on Stable Diffusion: Inversion-Based Style Transfer with Diffusion Models (InST); Textual Inversion (TI); and LoRA – as well as base Stable Diffusion results.
As we can see from the initial results in a qualitative test round, the StyleAdapter method (second column from left) clearly adopts both the style of the image and the content of the text prompt, especially in the case of the ‘monkey’ example. By contrast, the other methods either output literal interpretations of the prompt or else get totally lost in the attempt to interpret the visual style provided in the source image.
The authors state:
‘While traditional style transfer methods such as [CAST] and [StyTr2] mainly focus on color transfer, and diffusion-based methods like [SD] and [InST] struggle to balance content and style, the results obtained with our StyleAdapter contain more style details from the reference images, such as brushstrokes and textures, and better match the prompt content.’
Next, the researchers tested the ability of the systems to interpret multiple image sources into a single image, under the constraints mentioned earlier.
The authors comment:
‘Our proposed method is comparable to [LoRA] in style, but it performs better in text similarity, according to the higher score of Text-Sim and the generated tie and rainbows responding to the prompts in the visualized results, which demonstrates that our StyleAdapter achieves a better balance between content similarity, style similarity, and generated quality in objective metrics.’
For the user study, the researchers selected 35 random results from the generations and polled 24 workers at AIGC, who were asked to rate them for text similarity, style similarity and general quality, for a total of 2520 votes.
Results (see table above) confirm that the output from StyleAdapter was preferred in regard to all three aspects.
The authors concede that StyleAdapter cannot yet match the potential quality of a LoRA assigned to the same task, but believe that further work in generalization could bring about parity in time.
Though StyleAdapter is not yet able to match the quality of LoRA, it remains, arguably, one of the most interesting offerings in a year that has been replete with disentanglement solutions. Since it can hook into existing and popular frameworks such as ControlNet, it could potentially become a default configuration.
Arguably, since it represents something better than default SD, and has an eye on approaching the quality of LoRA, StyleAdapter could even be a viable candidate to replace the factory settings of a Stable Diffusion installation, or to become bound into an official distribution.
This would largely depend on the runtime VRAM requirements of an optimized installation. However, since the advent of the high-resolution SDXL format for Stable Diffusion, the community has already resigned itself to the likelihood that the future of home generative AI will become more resource intensive, requiring better GPUs with larger amounts of VRAM.