Improving Stable Diffusion With Better Captions

About the author

Martin Anderson

I'm Martin Anderson, a writer occupied exclusively with machine learning, artificial intelligence, big data, and closely-related topics, with an emphasis on image synthesis, computer vision, and NLP.

A new paper from Google Research offers a way to substantially reduce entanglement and color contamination, and to improve object placement and various other user bugbears, in the Stable Diffusion generative text-to-image model – by using secondary language models to rewrite all the text tags that accompanied the images when they were first scraped from the web, so that the descriptions are actually meaningful.

The 'self-serving' captions provided for these images, taken from the LAION dataset that powers Stable Diffusion, are not much use in helping machine learning systems understand what is really in these pictures. Underneath, in green, we see more meaningful interpretations, from the Google LLM PaLi, and from the RECAP system developed from PaLi, for the new work. Source: https://arxiv.org/pdf/2310.16656.pdf

The ‘Alttext’ (in red in the image above, taken from the new paper) is what the original uploaders of the image added as an alt-text attribute – an accompanying annotation, specified when the internet was developing in the 1990s, and intended to aid the use of screen readers (for instance, for vision-impaired users).

Systems such as Stable Diffusion train on such a high number of images that it is impossible to manually caption them, or to correct errant captions – therefore, the CLIP-based systems that power these generative models are forced to rely on whatever the original uploaders wrote as descriptive text.

In the example above, we see that the text ‘Home Design Ideas’ has been added to the image of a motorcycle, even though it is clearly irrelevant. This phrase occurs elsewhere in the LAION image dataset that powers Stable Diffusion, in equally irrelevant contexts:

A recurrence of the 'Home design ideas' spam campaign. Source: https://jalammar.github.io/illustrated-stable-diffusion/

The central idea behind such mendacious tagging is to increase a web page's search rank (how near the top of the results it is placed) by hijacking a popular search term, so that ‘all roads lead’ to the page where the image is hosted.

Though this kind of SEO chicanery was quite effective 15-20 years ago, a quick image search (on any search engine) for the phrase ‘Home Design Ideas’ will confirm that such unrelated images are no longer presented in results for this valuable search term.

However, machine learning systems do not get to benefit from the improved spam-filtering tactics that search engines have been constantly refining for decades. If, during training, you tell a generative AI that the pixels of a motorcycle are related to home décor, it is likely to believe the association, and, at least occasionally, to reproduce that inappropriate pairing in images.

To combat this, the researchers of the new work fine-tuned the existing V1.4 Stable Diffusion model using exactly the same filtered LAION images that the model was originally trained on, but this time with novel and improved captions processed via the PaLi language model (also from Google Research).

Passing existing captions through a language model pipeline that rewrites and improves the caption.

In tests, the revised Stable Diffusion was able to generate images with multiple (and correct) colors reflecting what the user prompted, instead of allowing the first-mentioned color to bleed over into other objects in the image.

In the example below, for the prompt ‘Two flowers, one is blue and the other is green’, we can see on the left-hand side of the image that Stable Diffusion V1.4 cannot separately apply the colors specified in the prompt, whereas the modified model can:

The altered V1.4 model is additionally able to interpret placement commands literally, instead of merely placing objects in the positions where it has seen them in training images.

One example, shown below, demonstrates the difference between baseline V1.4 interpretation of ‘A pizza near a pineapple’, and the way that the modified model renders it:

On the left, in the image above, we see that Stable Diffusion V1.4 has seen a lot of Hawaiian pizza, but either has never seen a pineapple near a pizza (rather than cut up on top of it), or else, if it did, that image was not described properly in the alt tag on the web resource that was used.

By creating a clearer description of each identified object in the dataset, each entity then becomes notably more disentangled, and does not so easily get drawn back into cultural associations such as the one illustrated above.

Though the new work from Google Research is far from the first to propose improving on dataset captions, and while this particular pursuit is now of growing interest to the generative AI and computer vision sector, there are very few academic departments that can summon up the processing power necessary to prove such approaches at scale.

In the case of the new system being proposed, for instance, the 1.4 model was retrained at an enormous batch size of 512 (presuming that the paper’s authors have not confused batch size with training image dimensions), which requires hardware resources that are typically beyond the reach of the average computing lab.

It is hard to overstate what a problem miscaptioning is in generative AI, or to exaggerate the scores – perhaps hundreds – of papers and lines of research that could happily be abandoned for more fruitful endeavors if the text in image/text pairs was accurate instead of self-serving, truncated, absent, or actively deceptive (see ‘The SEO trap’ below).

A system such as the one being proposed here, capable of transforming bad captions into truly useful ones, would be arguably the biggest leap forward for generative AI since the launch of Stable Diffusion, DALL-E 2, and the spate of LLMs and generative AIs that have come into existence over the last 18 months.

The new publication states:

‘We show that by relabeling the corpus with a specialized automatic captioning model and training a text-to-image model on the recaptioned dataset, the model benefits substantially across the board.’

The new paper is titled A Picture is Worth a Thousand Words: Principled Recaptioning Improves Image Generation, and comes from five authors at Google Research.

The SEO Trap

There is a certain irony to this innovative and interesting outing from Google Research, since the effective (rather than official) behavior of Google Search’s ranking criteria has long encouraged web marketers and SEO consultants to optimize image captions for KPIs, rather than for their original purpose in the W3C specification – to describe the image textually, nominally for vision-impaired users.

We can see in the Alttext above (first row under each image) that whoever labeled these images did so with a marketing imperative, and that the tags as written contain zero actual description of what is happening in the image. Underneath, we see a default PaLi interpretation, and, respectively, a short and long version of the new RECAP system's altered description, which now has some relevance to the pictures.

While the official W3C advice and Google’s own guidelines recommend that image captions should be pertinent and descriptive, there is a tacit ‘wink’ about this among SEO practitioners, who have very little interest in providing agnostic semantic descriptions either for vision-impaired people or for AI systems – but who have a definite interest in associating images on their web-pages with trending terms, or else wish to artificially devise such associations.

While the official line from W3C and Google adjures that image alt tags should be descriptive, there are some rare moments of honesty on the internet where the truth emerges – that alt tags are often used tactically, as marketing tools, even when the resulting text no longer describes the image correctly. Source: https://archive.li/mExYx

This is easy enough for anyone to test with a desktop browser, since hovering over images in web pages will, in some browser configurations, show a tooltip revealing the alt text supplied for the image. Alternatively, one can see the image/text pairs used in the web-scraped, hyperscale LAION dataset (and thus Stable Diffusion) directly, at a dedicated site.
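
For anyone who prefers to automate the check, a minimal sketch such as the following (using the common requests and BeautifulSoup libraries, with a placeholder URL) will print the alt text of every image on a page; it is not from the paper, and is merely an illustration of how easy the attribute is to inspect:

```python
# Illustrative only (not from the paper): print the alt text of every image
# on a page. The URL is a placeholder; requires requests and beautifulsoup4.
import requests
from bs4 import BeautifulSoup

html = requests.get("https://example.com/some-page", timeout=10).text
soup = BeautifulSoup(html, "html.parser")

for img in soup.find_all("img"):
    # Many images carry no alt attribute at all; others carry SEO copy
    # rather than a description of the picture.
    print(repr(img.get("alt")), "->", img.get("src"))
```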

Examples of non-apposite captions that ended up in Stable Diffusion via the LAION dataset. Source: https://rom1504.github.io/clip-retrieval/

Method

Though the summary thus far largely encapsulates the value of the new paper, let’s take a selective look at some of the methodologies and technologies used in the new system, titled RECAP.

To ensure that the results were not specious, the researchers meticulously recreated the superficial circumstances of the original training of the V1.4 Stable Diffusion model, though they did not train the model from zero, but rather fine-tuned it, which involves loading the original model and training it further.

They therefore selected a subset of ten million photos from the LAION-2B-en ‘improved aesthetics’ dataset, using exactly the same filtering criteria as were originally employed, but holding back 10,000 photos for training validation.
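
As a rough illustration of that split (the record structure below is a hypothetical stand-in, not the authors' data format), the holdout is simply a fixed slice of the shuffled, filtered subset:

```python
# Schematic only: hold back 10,000 items from a filtered subset for validation.
# `filtered_records` is a hypothetical stand-in for the ~10M (image_url, alttext)
# pairs that passed the original LAION filtering criteria; scaled down here.
import random

filtered_records = [(f"https://example.com/img_{i}.jpg", f"alt text {i}")
                    for i in range(100_000)]

random.seed(0)
random.shuffle(filtered_records)

validation_set = filtered_records[:10_000]
training_set = filtered_records[10_000:]
print(len(training_set), len(validation_set))  # 90000 10000
```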

Though the Google Research Pathways Language and Image (PaLi) model was used as the transformative element, the researchers first obtained 100 manually-generated captions from human raters, who were asked to follow, respectively, the instructions ‘Describe what you see in each image using 1-2 detailed sentences’ and ‘Describe what you see in each image using a single short sentence’.

The constrained length of these responses was dictated by the limitations of the CLIP text encoder, which has a context size of 77 tokens.
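
The limit is easy to verify with the Hugging Face transformers tokenizer for the CLIP text encoder that Stable Diffusion V1.4 ships with (this snippet is illustrative, and not part of the paper's code):

```python
# Illustrative only: the CLIP text encoder used by Stable Diffusion V1.4
# (openai/clip-vit-large-patch14) has a context window of 77 tokens;
# anything beyond that is truncated away.
from transformers import CLIPTokenizer

tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")
print(tokenizer.model_max_length)  # 77

long_caption = "a very long and detailed caption " * 50
ids = tokenizer(long_caption, truncation=True,
                max_length=tokenizer.model_max_length)["input_ids"]
print(len(ids))  # capped at 77, including start/end tokens
```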

With this diminutive dataset, the researchers trained PaLi for 300 steps at a learning rate of 4e-5, a dropout rate of 0.1, and a batch size of 64, with an equal mix between short and long captions. The data was exposed to the system multiple times, sometimes with the shorter caption, and sometimes the longer.
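
Since PaLi itself is not publicly available, the following is only a schematic of that fine-tune in PyTorch, with a toy placeholder model; the hyperparameters (300 steps, learning rate 4e-5, dropout 0.1, batch size 64, and the random pairing of each image with either its short or long caption) are the ones reported in the paper, while everything else is an assumption:

```python
import random
import torch
from torch import nn

# `ToyCaptioner` is a placeholder standing in for PaLi, which is not public;
# only the hyperparameters below are taken from the paper.
class ToyCaptioner(nn.Module):
    def __init__(self):
        super().__init__()
        self.dropout = nn.Dropout(p=0.1)          # dropout rate from the paper
        self.proj = nn.Linear(512, 512)

    def forward(self, image_feats, captions):
        # A real captioner would return a token-level cross-entropy loss;
        # this placeholder just returns a differentiable dummy loss.
        return self.dropout(self.proj(image_feats)).mean()

# Hypothetical pool of 100 human-rated images, each with a short and a long caption.
pool = [{"image": torch.randn(512),
         "short": "a dog on a lawn",
         "long": "a brown dog lying on a sunlit lawn next to a red ball"}
        for _ in range(100)]

model = ToyCaptioner()
optimizer = torch.optim.AdamW(model.parameters(), lr=4e-5)  # learning rate from the paper

for step in range(300):                                     # 300 training steps
    batch = random.choices(pool, k=64)                      # batch size 64
    image_feats = torch.stack([item["image"] for item in batch])
    # Each time an image recurs, pair it at random with its short or long caption.
    captions = [random.choice([item["short"], item["long"]]) for item in batch]
    loss = model(image_feats, captions)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
```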

Top: sets of short and long captions provided by the human raters; bottom: example alternative output from the PaLi model trained on the human responses.

This small model became the basis for a far more extensive fine-tuning of the Stable Diffusion V1.4 model, which was trained for an additional 250,000 steps (with another iteration trained to 1 million steps, for certain examples), this time at a lower, more conservative learning rate of 1e-5.

Both the UNet (image) and CLIP text-encoder weights were updated by the training, and an even mix of short and long RECAP rewritings of the captions was used.
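
The authors presumably used Google-internal infrastructure, but a minimal sketch of the same setup in the Hugging Face diffusers library might look like the following: both the UNet and the CLIP text encoder receive gradients, the optimizer runs at a learning rate of 1e-5, and the loss is the standard noise-prediction objective (data loading, the mixing of short and long captions, and distributed training are all omitted):

```python
# A sketch only, not the authors' code: fine-tuning Stable Diffusion V1.4 so
# that both the UNet and the CLIP text encoder are updated, at lr=1e-5, on the
# standard noise-prediction loss. Uses the Hugging Face diffusers library.
import torch
import torch.nn.functional as F
from diffusers import AutoencoderKL, DDPMScheduler, UNet2DConditionModel
from transformers import CLIPTextModel, CLIPTokenizer

repo = "CompVis/stable-diffusion-v1-4"
vae = AutoencoderKL.from_pretrained(repo, subfolder="vae")
unet = UNet2DConditionModel.from_pretrained(repo, subfolder="unet")
text_encoder = CLIPTextModel.from_pretrained(repo, subfolder="text_encoder")
tokenizer = CLIPTokenizer.from_pretrained(repo, subfolder="tokenizer")
noise_scheduler = DDPMScheduler.from_pretrained(repo, subfolder="scheduler")

vae.requires_grad_(False)  # the VAE is conventionally left frozen
params = list(unet.parameters()) + list(text_encoder.parameters())
optimizer = torch.optim.AdamW(params, lr=1e-5)  # learning rate reported in the paper

def training_step(pixel_values, captions):
    """One step; `captions` is assumed to be a pre-mixed batch of short and
    long RECAP rewrites supplied by the dataloader."""
    ids = tokenizer(captions, padding="max_length", truncation=True,
                    max_length=tokenizer.model_max_length,
                    return_tensors="pt").input_ids
    latents = vae.encode(pixel_values).latent_dist.sample() * 0.18215  # SD V1.x scaling
    noise = torch.randn_like(latents)
    t = torch.randint(0, noise_scheduler.config.num_train_timesteps, (latents.shape[0],))
    noisy_latents = noise_scheduler.add_noise(latents, noise, t)
    text_embeds = text_encoder(ids)[0]
    noise_pred = unet(noisy_latents, t, encoder_hidden_states=text_embeds).sample
    loss = F.mse_loss(noise_pred, noise)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```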

Data and Tests

To ensure that the test was fair, the model was also trained again in exactly the same configuration, but this time using the original (Alttext) captions scraped from the web for LAION, rather than the RECAP rewrites. This was necessary because fine-tuning a model will irrevocably and notably alter its original weights, which would make a comparison between the fine-tuned and original model an unfair one.

The same random seeds were used across all models, and the DDIM sampling method was used for generation, with 50 inference steps and a Classifier-Free Guidance (CFG) scale of 7.5 (though the fine-tuning does not dictate any particular sampling method or configuration thereof).
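
In the diffusers library, those settings would correspond to something like the following (an illustration of the stated configuration, not the authors' evaluation harness):

```python
# Illustrative only: DDIM sampling, 50 inference steps, CFG scale 7.5, and a
# fixed seed so that compared models start from identical noise.
import torch
from diffusers import StableDiffusionPipeline, DDIMScheduler

pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4", torch_dtype=torch.float16).to("cuda")
pipe.scheduler = DDIMScheduler.from_config(pipe.scheduler.config)

generator = torch.Generator(device="cuda").manual_seed(42)  # same seed across models
image = pipe("A pizza near a pineapple",
             num_inference_steps=50,
             guidance_scale=7.5,
             generator=generator).images[0]
image.save("pizza_near_pineapple.png")
```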

The metrics used were based on the Text-to-Image Synthesis Evaluation (TISE) initiative, and were computed on Microsoft’s MS-COCO validation dataset, with particular emphasis on the Fréchet Inception Distance (FID) evaluation method.
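
For reference, FID fits a Gaussian to Inception-network features of the real and generated image sets and measures the distance between the two distributions; lower is better:

```latex
\mathrm{FID} = \lVert \mu_r - \mu_g \rVert_2^{2}
             + \operatorname{Tr}\!\left( \Sigma_r + \Sigma_g - 2\,(\Sigma_r \Sigma_g)^{1/2} \right)
```

where μ and Σ denote the feature mean and covariance for the real (r) and generated (g) image sets respectively.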

The tests also included bespoke evaluation methods for Semantic Object Accuracy, which measures whether the objects named in the prompt actually appear in the generated image; Counting Alignment error estimation, which checks whether the number of instances of each requested object matches the number specified in the prompt; and Positional Alignment, which evaluates how well the spatial arrangement of objects in the image conforms to the prompt’s specifications (see the ‘pineapple’ example outlined earlier).
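
The paper does not spell out implementations here, but as a rough illustration, a counting-alignment error of the kind described might be computed along these lines (the detector supplying the counts, and the simple averaging used, are assumptions rather than the exact TISE formulation):

```python
# Illustrative only: score how far detected object counts in a generated image
# stray from the counts requested in the prompt. The averaging scheme is an
# assumption, not the exact TISE formulation, and the detected counts would in
# practice come from an object-detection model.
def counting_alignment_error(requested: dict, detected: dict) -> float:
    errors = [abs(detected.get(obj, 0) - count) for obj, count in requested.items()]
    return sum(errors) / len(errors)

# Prompt: 'three apples and two cups'; the image actually shows 2 apples, 2 cups.
print(counting_alignment_error({"apple": 3, "cup": 2}, {"apple": 2, "cup": 2}))  # 0.5
```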

Initial results for automated metrics on the RECAP model, with the improved captions pitted against the original captions.

The authors state:

‘In all the metrics, we see no improvement in the Alttext model compared to the baseline, proving the improvements stem from the captions themselves and not from the additional training.’

Additionally, the output images were evaluated by human raters in a qualitative round, using pictures generated from 200 random prompts drawn from the MS-COCO validation set, and then again on the more challenging DrawBench prompt set.

Four images were created from different seeds for each prompt, with the seeds consistent across the models used.

Metrics used were a) the percentage of successful image generations across all seeds and models, and b) the percentage of prompts for which at least one image generation was successful.

Results from images presented to human raters.

Here the authors comment:

‘We see a relative 64.3% improvement in successful image generation on MS-COCO, and a 41.7% improvement on DrawBench. We also see a relative improvement of 42.1% in successful prompt generation on MS-COCO and 37.5% improvement on DrawBench.

‘The Alttext model showed minor improvement on the MS-COCO dataset (12%-13%) and did not improve the DrawBench dataset.’

For a further qualitative round, comparisons were made using the longer-trained (1 million-step) model. The results, the authors assert, outperform the Alttext versions.

Qualitative comparisons for Alttext vs. RECAP. Please refer to source paper for better resolution.

Besides outlining the improvement in positional representation and color fidelity (the earlier ‘pineapple’ and ‘flowers’ examples, not visualized again here), the paper states:

‘[RECAP] also better handles cases where different modifiers are applied to multiple entities (e.g. “A red bench and a yellow clock”). The base model will treat the sentence as a bag of words, applying all modifiers to all entities or ignore some of them.’

Though we do not generally cover ablation studies, in this case some of the results are noteworthy – particularly a comparison of image quality between models trained on long versus short versions of the RECAP captions, since a central contention of the work is that models are generally starved of adequate descriptive content, and that a larger amount of pertinent text could aid superior representations.

In general, a blend of long and short captions in training ('RECAP Mix') achieves overall better results.

Ultimately, the paper concedes, a mix of long and short captions obtains the best overall results, though shorter captions yield better FID scores and longer captions achieve superior semantic representation. This is in accord with dropout and masked training, which deliberately obscure parts of the data so that the model generalizes more flexibly, and memorization (rote ‘pasting’ of training data) is avoided.

This accords as well with the practical experience of anyone who has ever trained a machine learning model, in that the earlier checkpoints tend to be very flexible, and the later checkpoints tend to be less flexible but more detailed and accurate – a fundamental characteristic of current training architectures.

In conclusion, the authors surmise that it would be interesting to conduct a full-scale training from zero, instead of a fine-tuning – an effort so formidable and costly that even Google Research hesitated to take it on speculatively, it seems.

They further consider that diverse mixtures of the three different types of captions studied for the work may yield additional improvements, and that alternative recaptioning setups not constrained by CLIP’s 77-token limit offer further possibilities for improved flexibility and accuracy of output.

Conclusion

On a personal and most subjective note, perusing the comparisons offered in the paper, I observe that where objects are successfully disentangled via RECAP, their representations seem slightly ‘undercooked’ in comparison to the Alttext entangled ones – especially in the ‘pineapple’ RECAP examples, which have the characteristic color blow-out and garish aspect associated with training that has not entirely been successful:

Zooming in on the Alttext and RECAP generations in the study.

In the few examples given, besides the fact that the main RECAP visuals use the 1m-trained model, no specific details about FID scoring are provided. Additionally, the PDF is quite compressed, making a fair comparison difficult.

In any case, even with rigidly repeated comparisons, Stable Diffusion cannot be relied upon for consistent quality of output. Nonetheless, it is possible that the ‘entangled’ embeddings remain more mature, and that the original model depicts these objects best in those annoying and entangled contexts, and has difficulty maintaining the same fidelity and accuracy when the associated object is separated from its native context.

In truth, one would need an expensive from-zero training in order to ascertain how effectively better captioning could discretize individual elements while retaining quality; and that’s a formidable proposition.
