Custom Styles in Stable Diffusion, Without Retraining or High Computing Resources


About the author


Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.


A researcher from Spain has developed a new method for users to generate their own styles in Stable Diffusion (or any other publicly-accessible latent diffusion model) without fine-tuning the trained model or needing access to exorbitant computing resources, as is currently the case with Google’s DreamBooth and with Textual Inversion – both of which are primarily intended to insert objects or people into the Stable Diffusion universe, rather than to impose environmental ambience or styles (i.e. ‘in the style of Van Gogh/Kubrick/Mapplethorpe’, etc.).

Customized styles created by altering the behavior of the CLIP interpreter inside Stable Diffusion, in a non-destructive way that requires minimal computing resources compared to existing methods. Source:

The new method does not specifically require Stable Diffusion, and would in theory work equally well on other noise-based image generation architectures, such as DALL-E 2, Imagen or Parti, if one had the same kind of extraordinary access to them that the open-sourcing of Stable Diffusion has allowed.

The new system works by having the user train a novel, distinct adjunct file almost instantly on a limited number of photos and a single text embedding (rather than a text embedding for each photo, as is the case with the competing methods).

Since the objective, in most cases, is to recreate a style rather than a specific object, only a single phrase or word is necessary, because the intention is for the user-created style to permeate and completely influence the image that results from the text prompt.

Boosting Existing Styles

The system is titled aesthetic gradients, and is not only capable of imposing novel styles that the latent diffusion model is unaware of, but also can ‘boost’ existing styles which are (in the opinion of the end-user) too scantly-represented in the dataset that trained the model which powers the latent diffusion architecture.

In the case at hand, that architecture is Stable Diffusion, whose model was trained on various subsets of LAION-5B, in a collection dubbed LAION-aesthetics.

In experiments, the paper’s researcher ‘augmented’ some already-existent styles in Stable Diffusion by adding additional image material. 


In the above image, which apparently shares a frozen seed across all renders, the left-most image is basic Stable Diffusion output. The prompt for the middle image appends a ‘style-summoning’ keyword, but the result is mostly unaffected by it. The final image, far right, which uses the aesthetic gradients approach, changes notably, because it invokes far more associated material for the prompt than exists in the standard Stable Diffusion model – material that has been supplied by the user.

In other words, if you add even more Van Gogh data to Stable Diffusion, as adjunct material, your ‘…in the style of Van Gogh’ output will be much more…well, Van Gogh-y; and you won’t, apparently, have to trash the conventional functionality of the model by fine-tuning it (a ‘hijacking’ exemplified by the recent Waifu Diffusion ‘fork’ model); buy a high-end video card; or resort to hiring external cloud-based GPU resources from the likes of Google Colab Pro, among others.

The method, which has been released on GitHub as a Stable Diffusion fork, only modifies the weights of the CLIP encoder in Stable Diffusion – the functionality that associates images with their labels, and acts as an interpretive layer between the user’s text prompt and the synthesized image, which is based on similar word/image associations trained into the model.


Aesthetic gradients essentially intervene in the standard prompt>CLIP>noise>image process by interposing the aesthetic embedding generated by the user (i.e., by their contributed images and single text definition). 

The contributed images are ‘averaged out’ in the pipeline and finally normalized to the ‘unitary norm’ of the standard Stable Diffusion text2img process – augmenting rather than substituting it.
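As a minimal sketch of that averaging-and-normalization step – assuming the user’s images have already been passed through CLIP’s image encoder, and with an illustrative 768-dimensional embedding size – the aesthetic embedding amounts to:

```python
import numpy as np

def aesthetic_embedding(image_embeds: np.ndarray) -> np.ndarray:
    """Average a set of CLIP image embeddings and normalize the result
    to unit norm, producing a single 'aesthetic embedding'."""
    e = image_embeds.mean(axis=0)      # 'average out' the contributed images
    return e / np.linalg.norm(e)       # rescale to the unit norm

# Hypothetical example: 8 user-contributed images, 768-dim CLIP embeddings
rng = np.random.default_rng(0)
e = aesthetic_embedding(rng.normal(size=(8, 768)))
```

Because the output is a single unit-norm vector on the same scale as the prompt embedding, it can augment the standard text2img pipeline rather than replace it.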

The paper notes:

‘The similarity between the two embeddings, computed as the dot product ceᵀ, can be used to measure the agreement between CLIP representation of the textual prompt and the preferences of the user. 

‘Thus, the previous expression can be used as a loss and we can perform gradient descent with respect to CLIP text encoder weights to drive the prompt representation towards the aesthetics of the user.’

In this way, the ‘standard’ output of Stable Diffusion is essentially being used as a loss metric, so that CLIP’s weights can be modified by gradient descent to steer the final result towards the user’s preference rather than the default preference.

In the author’s experiments, this process was very efficient, requiring only 20 gradient steps to make the embedding compatible with the standard CLIP encoder in Stable Diffusion (though the exact hardware used was not specified).
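A toy illustration of that optimization loop, with a single linear layer standing in for the CLIP text encoder (the layer, dimensions, and learning rate here are all hypothetical stand-ins, not the paper’s actual configuration):

```python
import torch

torch.manual_seed(0)
dim = 768  # illustrative embedding size

# Unit-norm aesthetic embedding e, derived from the user's images
e = torch.nn.functional.normalize(torch.randn(dim), dim=0)

# Stand-in for the CLIP text encoder; only these weights are updated
text_encoder = torch.nn.Linear(dim, dim, bias=False)
prompt = torch.randn(dim)  # stand-in for the encoded text prompt

before = float(text_encoder(prompt).detach() @ e)  # agreement before personalization

opt = torch.optim.SGD(text_encoder.parameters(), lr=0.01)
for _ in range(20):          # the paper reports ~20 gradient steps suffice
    c = text_encoder(prompt)
    loss = -(c @ e)          # negative dot product: maximize agreement c·eᵀ
    opt.zero_grad()
    loss.backward()
    opt.step()

after = float(text_encoder(prompt).detach() @ e)   # agreement after personalization
```

After the loop, the prompt representation has been driven towards the user’s aesthetic embedding, while the diffusion model itself remains untouched.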

The author notes:

‘The resulting [representation] is more aligned to the user preference, while preserving the original semantics…Note that only the weights of the CLIP text encoder are modified, nor the visual encoder nor any other component of the diffusion model.’

The paper also observes that since the final output requires just a single embedding, the user saves storage space, and that this economy makes sharing much easier. 
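To put that economy in perspective: a single embedding is just one vector, occupying a few kilobytes on disk, versus the multi-gigabyte checkpoint produced by fine-tuning the whole model. A quick sanity check (the 768-dimensional size is illustrative):

```python
import io
import numpy as np

e = np.zeros(768, dtype=np.float32)  # a single aesthetic embedding (illustrative size)

buf = io.BytesIO()
np.save(buf, e)                      # serialize as a .npy payload
size_bytes = buf.getbuffer().nbytes  # roughly 3 KB, versus gigabytes for a full checkpoint
```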

Diverse Tests

Though the very short paper gives scant details of the creation or training process, it does provide diverse examples of the system in use.

In initial tests, the researcher employed two sets of images: SAC8+, which is a subset of Simulacra Aesthetic Captions.

The original Simulacra Aesthetic Captions dataset contains over 238,000 synthetic images generated by AI models such as GLIDE and Stable Diffusion, covering 40,000 user-contributed text prompts, and aesthetically-scored by real people in order to generate image/caption/rating triplets. Source:

…and LAION7+, a subset of LAION Aesthetics v1, which scores images for aesthetic appeal via a model originally derived from human ratings and generalized so that it can be applied to arbitrary images:

LAION 7+ showcases the aesthetic scoring at the heart of Stable Diffusion's popularity. Source:

In the latter case, users can see for themselves the difference that LAION’s aesthetic score makes by activating the ‘aesthetic score’ drop-down menu (the feature is off by default) for ‘cat’ at the CLIP retrieval site for LAION.

'Cat' results, live, with and without aesthetic scoring. Source:

For LAION7+, images were filtered for a rating of 7 (out of 10) or higher. 

The author tested several aesthetic embeddings with a collection of prompts (extensively detailed in the source paper) of diverse complexity and length, and observes that SAC8+ produces more ‘fantasy-like’ imagery, while LAION7+ produces more floral patterns – exemplifying the extent to which the user can potentially gain control over a suitable environment for the objects and people depicted (see ‘Potential VFX Applications’ below):

Demonstrations of the extent to which LAION breeds flowers, and SAC breeds fantasy imagery. Please refer to the source paper for many more examples.

If ‘attractive’ pictures are the primary objective (and sometimes ‘accuracy’ or ‘photorealism’ may be more important), the new system quantifiably improves the aesthetic appeal of output pictures, at least for Stable Diffusion:

Aesthetic scores with and without aesthetic gradients, as scored by the open source Simulacra Aesthetic Models in the experiments.

The paper emphasizes that the personalized model produces improved aesthetic scores without in any way modifying the source model or architecture.

Please refer to the paper itself for further qualitative results, including the appendix material, which includes tests wherein existing and unknown terms were calculated into embeddings, and either enabled new types of style to emerge, or augmented the aesthetic appeal and/or detail of existing styles.

Diverse custom styles applied via the new method, compared to standard Stable Diffusion output.

In these secondary experiments, 100 images each for the terms cloudcore, gloomcore and glowwave were scraped from Pinterest using these target terms, while five images of paintings by the 19th-century Russian Romantic painter Ivan Aivazovsky were added to another dedicated embedding.

Potential VFX Applications

As is often the case with image synthesis system innovations (particularly those based on Stable Diffusion), there are other potential applications for these user-specified styles beyond simply generating one-shot, attractive pictures.

For instance, in the problematic task of achieving temporal coherence over a series of contiguous frames in Stable Diffusion, one of the chief issues is that lighting and environment are difficult to control as the sequence progresses. 

Potentially, a lightweight system such as aesthetic gradients could allow a user to cheaply and quickly integrate very specific environmental conditions that are easily summoned into Stable Diffusion by the apposite embedded phrase, effectively creating a ‘locked set’ in which other creations, locked into their own styles by other stochastic methods, can expect consistent lighting and reflections.
