High Resolution (And High Accuracy) Stable Diffusion With a Relatively Simple Hack


The progress of the image synthesis scene is characterized by fits and starts – something fundamental comes along only very rarely, and the research sector tends to spend years offering refinements and incremental improvements, rather than actually developing something of weight, substance and utility.

In the case of Latent Diffusion Models, best exemplified by the runaway success of the open source Stable Diffusion system, the papers subsequent to its release in 2022 are dominated by such ‘toying around’ with the essentials of the framework.

There are exceptions – the advent of DreamBooth and LoRA allowed customization and personalization of the base Stable Diffusion models, while the ControlNet ancillary system offered users a way to rein in the chaos of diffusion-based workflows, and to create very specific imagery without being confined to the semantic ambiguities of text prompts.

In fairness, these developments alone were more significant than any follow-on work to the revelatory releases of Generative Adversarial Networks (GANs) and Neural Radiance Fields (NeRF), both of which astonished the world initially, and neither of which, despite years of effort and investment from the commercial and academic scenes, has resulted in a practical or profitable synthesis system.

However, the general run of innovation around LDMs offers innovations less epochal than DreamBooth and LoRA; most of the papers seeking to rein in the chronic entanglement, the lack of temporal stability, and the other bugbears of Stable Diffusion tend to involve the imposition of burdensome secondary technologies, such as CGI (especially in the form of 3DMM-style frameworks such as FLAME), along with a heavy cumulative weight of other synthesis libraries.

Upsample Guidance

By contrast, last week brought forth a paper from Korea that claims to have found a way to fundamentally improve not just Stable Diffusion, or indeed specifically image-based generative systems, but potentially any generative system that uses denoising as its root method.

To boot, the method devised by the researchers is essentially just a modification of the way that Stable Diffusion already works – one which not only allows the system to output effectively at higher resolutions, but which also apparently improves accuracy and fidelity to the text prompt used to create the image:

Diffusion results without (above) and with (below) the new Upsample Guidance (UG) technique. Source: https://arxiv.org/pdf/2404.01709.pdf

The improvement in prompt fidelity is a fortunate by-product, it seems, of the researchers’ primary aim of getting Stable Diffusion to work natively at higher resolutions than the data that it was trained on.

There is a logic to this: the 512x512px images sampled from the LAION subset used to train what continues to be the most popular SD model, V1.5, mean that Stable Diffusion tends to think in 512×512 patches.

This means that if you ask for ‘A lion eating a banana in a restaurant’ (see leftmost example above), and tell Stable Diffusion to give you a 1024x1024px image, it will natively tend to concatenate two 512x512px images, and do what it can to stay faithful to the prompt within those limitations.

If, instead, as the Korean researchers have apparently achieved, you can scale up those early 512x512px signals into a larger dimension space, Stable Diffusion does not need to perform this background ‘stitching’ together of 512px panels, and then tortuously try and make the prompt work.

In the leftmost two examples in the image above, we can see evidence of this ‘paneling’, in the native and unaided Stable Diffusion generations:

Freed from 'thinking in 512px', Stable Diffusion can produce high-resolution output without compromising the layout, and without the scores of ancillary layout systems that have beset the literature since SD was released.

It is difficult to say whether the significance of this new approach, titled upsample guidance (UG), lies more in the ease with which native high resolution output can be achieved, or whether the added fidelity to the prompt is the more notable development.

In either case, the system is rare in that it does not require fine-tuning of the base model, or any of the dozens of layout systems that have arisen in the last 18 months, all of which seek to use the stubborn tendency of Stable Diffusion to ‘think inside the box’ to some kind of apparent advantage.

Further examples of native Stable Diffusion rendering behavior on the left, with more cohesive high-res results, using UG, on the right.

The paper states*:

‘[Upsample] guidance can be universally introduced to any types of diffusion model, including pixel-space, latent-space, or even video diffusion model.

‘Moreover, it is fully compatible with any diffusion models or all previously proposed techniques that improve or control diffusion models such as SDEdit, ControlNet, LoRA, and IP-Adapter.

‘Surprisingly, our method can even allows to generate higher resolution images that never shown in the training dataset, such as 64² resolution images of CIFAR-10 dataset, which has 32² resolution images.’

The new paper is titled Upsample Guidance: Scale Up Diffusion Models without Training, and comes from three researchers across Seoul National University, Seoul’s Center for Theoretical Physics and Artificial Intelligence Institute, the School of Computational Science, and the Korea Institute for Advanced Study.

An example of the improvement in upscaling quality achievable under the new system.

Method

Latent diffusion models such as Stable Diffusion, though typically trained at 512x512px resolution, perform numerous upsampling and downsampling operations which are not pixel-dependent, i.e., which scale the latent embeddings of the emerging pictures up and down, as required.

The authors of the new work point out that the generation process can take a downsampled image as input during the creation process:

A demonstration of the consistency between resolutions at varying stages of the denoising process.

The downsampled image is visualized in the image above (center image), and the paper notes that it is significantly different from the embedding of the trained resolution (image on the right).

The key to the new method is to ‘steal’ the signal-to-noise ratio (SNR) match from the ‘wrong’ version of the sequence of images, and re-inject it into a later stage (by which time it would normally have been transformed, or effectively destroyed).

Since it is not easy to identify which generation in the sequence will hold these qualities, the image is selected on a time basis: it is known at what temporal point in the sequence of generations this valuable image will crop up, and at that stage it is captured.

With some adjustment, the latent code can be scaled to the target power, and in making this calculation, the difference between the two equivalent noise predictions becomes the guidance for the process.
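
To make that concrete, here is a minimal PyTorch sketch of the general shape of such a guidance step – an interpretation of the paper's description, not the authors' code. The denoiser `eps_model`, the pooling-based downsample (standing in for the paper's SNR-matched downscaling, which is omitted here), and the weight `ug_scale` are all assumptions for illustration:

```python
import torch
import torch.nn.functional as F

def upsample_guided_noise(eps_model, x_t, t, m=2, ug_scale=1.0):
    """Minimal sketch of an upsample-guidance step (an interpretation of the
    paper, not the authors' implementation).

    eps_model : callable (latent, timestep) -> predicted noise
    x_t       : noisy latent at the *target* (higher) resolution, [B, C, H, W]
    t         : current timestep tensor, [B]
    m         : upscale factor relative to the trained resolution
    ug_scale  : assumed weight for the guidance term
    """
    # Naive prediction at the target resolution.
    eps_hi = eps_model(x_t, t)

    # Downsample the latent to the trained resolution. The paper additionally
    # rescales this input (its SNR matching) so that the noise level agrees
    # with what the model saw in training; that correction is omitted here.
    x_lo = F.avg_pool2d(x_t, kernel_size=m)

    # Prediction at the trained resolution, brought back up to the target size.
    eps_lo = F.interpolate(eps_model(x_lo, t), scale_factor=m, mode="nearest")

    # The high-resolution prediction, downsampled and re-upsampled, so that the
    # two terms below differ only in how the model 'saw' the image.
    eps_hi_lowpass = F.interpolate(
        F.avg_pool2d(eps_hi, kernel_size=m), scale_factor=m, mode="nearest")

    # The difference between the two 'equivalent' noise predictions acts as a
    # guidance signal pushing the high-resolution result towards consistency
    # with the trained low-resolution behaviour.
    return eps_hi + ug_scale * (eps_lo - eps_hi_lowpass)
```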

Upsample guidance: the model processes the image at two different resolutions in parallel and combines the resulting noise predictions.

The paper states:

‘[Recognizing] the need for adjustments to ensure consistency among noise predictors at various resolutions, we substitute the term about trained resolution with the adjusted noise [predictor]…

‘…This model parallelly sees and predicts noises at both resolutions.’

The interpolation between the ‘naïve’ sampling from the target resolution and the parallel sampling at the trained resolution must be considered carefully, and it is this interpolation that the authors dub ‘upsample guidance’ – the heart of the process.

Referring to Classifier-Free Guidance (CFG), the researchers state:

‘Similar to how CFG incorporates the shift from unconditional to conditional noise, UG represents the influence pushing the model towards consistency with the trained low-resolution component.’
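
Purely as an illustration of that analogy (the paper gives the precise formulation), the two corrections can be pictured as stacking in the same way; all of the inputs and scales below are placeholders rather than names or values from the paper:

```python
def guided_prediction(eps_uncond, eps_cond, eps_lowres_consistent,
                      cfg_scale=7.5, ug_scale=1.0):
    """Illustrative only: CFG and UG pictured as two stacked corrections.
    All three inputs are assumed to be noise predictions computed elsewhere
    (the last being the low-resolution-consistent term from the sketch above);
    the scales are placeholder hyperparameters, not values from the paper."""
    # CFG: shift from the unconditional towards the conditional prediction.
    eps = eps_uncond + cfg_scale * (eps_cond - eps_uncond)
    # UG: additionally shift towards agreement with the trained low resolution.
    eps = eps + ug_scale * (eps_lowres_consistent - eps_cond)
    return eps
```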

The authors assert that this very general methodology faces additional challenges in an LDM such as Stable Diffusion*:

‘[In] the context of LDMs, the pixel space undergoes a transformation into the latent space using a nonlinear variational autoencoder (VAE).

‘Consequently, it is crucial to proceed with caution, as the latent space of a downsampled image at the target resolution may not align with the latent space of the resolution it was originally trained on. The outcomes of downsampling in latent space and subsequently decoding back into pixel space are shown [in the image below].’

Examples of artifacts from the encoder/decoder of an LDM, which occur when an image downsampled in latent space is decoded back into pixel space; the VAE's nonlinearity would otherwise degrade upsample guidance results.

The researchers overcame this limitation with the aforementioned ‘capturing’ of a processed image early enough in the process to anticipate the degradation introduced by the VAE – which, in strict terms, is a matter of predicting the timing.
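
One hypothetical way to picture that timing constraint – not the authors' actual schedule – is a guidance weight that is simply switched off after an early portion of the reverse process:

```python
def gated_ug_scale(step, total_steps, cutoff_fraction=0.5, base_scale=1.0):
    """Hypothetical timing gate for upsample guidance in an LDM: apply the
    guidance only for the first part of the reverse (denoising) process and
    switch it off afterwards. The cutoff fraction is illustrative, not a
    value taken from the paper."""
    return base_scale if step < cutoff_fraction * total_steps else 0.0
```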

Data and Tests

Experiments undertaken for the new technique included spatial and temporal upsampling of video generation output, as well as static images. The authors emphasize that this is only one application of a principle which may well have wider-reaching implications:

‘The core concept behind upsample guidance lies in the SNR matching during the downsampling process. As a result, it can be extended to diverse data generation tasks, not confined to images alone. Moreover, its compatibility extends to any pre-trained model, conditional generation, and application techniques.’

Since the upsampling operation described in the paper is in effect a fairly simple linear operation on the predicted noise, it is applicable in a wide range of scenarios. For an initial test, the researchers employed models pre-trained on CIFAR-10 and on CelebA-HQ at 256×256px.

They also tested 2x-upsampling via UG on Stable Diffusion V1.5, and further checked its capacity on a fine-tuned model, using different aspect ratios and image conditioning approaches.

Each image in the test examples figured below (which the authors concede are slightly ‘cherry-picked’) is generated from the same initial noise. Where the resolution varied, the seed noise was resized and its variance adjusted to suit.
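
A rough sketch of that kind of seed-sharing is given below; the bilinear resize and the empirical variance correction are illustrative assumptions, not necessarily the authors' exact procedure:

```python
import torch
import torch.nn.functional as F

def resize_seed_noise(noise, m=2):
    """Reuse one initial noise seed across resolutions: resize it to the
    target shape, then correct its variance (interpolation correlates
    neighbouring values and therefore shrinks the per-pixel spread).
    noise: [B, C, H, W] Gaussian seed at the trained resolution."""
    resized = F.interpolate(noise, scale_factor=m, mode="bilinear",
                            align_corners=False)
    # Empirically restore the original standard deviation.
    return resized * (noise.std() / resized.std())

# Example: the same 64x64 latent seed, reused for a 2x larger generation.
seed = torch.randn(1, 4, 64, 64)
seed_2x = resize_seed_noise(seed, m=2)   # -> [1, 4, 128, 128]
```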

The bulk of the results from the initial test. Please refer to the source paper for better resolution.

This round also included tests with the ControlNet ancillary guidance system for Stable Diffusion:

Results on ControlNet.

The authors comment:

‘Upon carefully examining some samples from CIFAR-10, UG sometimes alters coarse contents (overall colors and shapes) between 32² and 64² resolutions, with details emerging at higher resolutions that were not present at lower ones. This suggests that UG does more than just interpolation or sharpening; it actually generates new meaningful features.’

The authors further measured the impact of their intervention on the general generation process, using an NVIDIA RTX 3090 GPU at 1024x1024px resolution. They conclude that the effect on generation times is minimal, noting also that the additional calculation is only applied in certain circumstances – and that, in some cases, generation time is therefore completely unaffected.

The minimal overall impact of UG on generation times.

However, in the specific case of LDMs, the decoding process also takes additional time, and the researchers surmise that when step sampling is shorter (i.e., fewer steps are needed, a strong trend in the current literature), the impact of UG is even further reduced.

The paper states*:

‘With the recent advancements in sampling methods leading to a reduction in the number of inference steps, our method becomes more competitive, requiring only ≤ 10% additional computation cost within 20 inference steps.’

The authors also tested the capabilities of the system’s native and content-preserving upscale approach on the AnimateDiff framework, a generative video model that incorporates a motion module into a text-to-image model.

The authors note:

‘In AnimateDiff, a video is represented as a sequence of color, time, width, and height in latent space, basically a tensor with the shape [C, T, W, H].

‘While we can upsample in the spatial dimensions [W, H] as above, it’s also possible to upsample in the temporal dimension T, increasing the number of frames by a factor of m. Assuming UG gives robustness for temporal resolution, we expect an increase in frames per second rather than an extension of time length, similar to the case with images.’
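
In tensor terms, that difference amounts to which axes of the [C, T, W, H] latent are rescaled. The following is a hedged sketch of the resizing step only (not AnimateDiff's or the authors' code):

```python
import torch
import torch.nn.functional as F

def resize_video_latent(latent, m=2, temporal=True):
    """Resize a video latent shaped [C, T, W, H] (batch omitted), either along
    the temporal axis (more frames) or the spatial axes (larger frames).
    Illustrative only; the guidance computation itself is unchanged."""
    x = latent.unsqueeze(0)  # [1, C, T, W, H] for torch's 3D interpolation
    scale = (m, 1, 1) if temporal else (1, m, m)
    x = F.interpolate(x, scale_factor=scale, mode="trilinear",
                      align_corners=False)
    return x.squeeze(0)

# Example: 16 trained frames -> 32 frames at the same spatial size.
latent = torch.randn(4, 16, 64, 64)
longer = resize_video_latent(latent, m=2, temporal=True)  # -> [4, 32, 64, 64]
```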

Spatial and temporal video upsampling with UG, via the AnimateDiff framework. Please refer to the source paper for better resolution.

Of these results, the researchers comment††:

‘For spatial upsampling, issues like multiple subjects appearing and misalignment with text prompts were resolved thanks to UG, indicating that spatial UG works similarly in video generation as it does for images.

‘For temporal upsampling, we kept the spatial size constant and generated 32 frames, double the 16 frames AnimateDiff was trained on. Without UG, there was a complete failure in maintaining temporal consistency, and sometimes even adjacent frames lost continuity.

‘However, with UG, the videos were overall consistent at a level similar to the trained temporal resolution, and greater continuity was also appeared in the subject’s movements. This difference is more pronounced when viewing the videos in playback rather than as listed frames’

The paper observes that it is crucial to keep the guidance scale at zero during the mid-stage resampling operations, in order to prevent artifacts from appearing in the results. However, it also notes that if a certain level of artifacts can be tolerated, better fidelity to the text prompt becomes possible, and that choosing a workable trade-off between prompt fidelity and image quality offers a new control scale for the entire process, similar to CFG itself.

Some select examples (please see the paper for full results) of the comparison between traditional Lanczos resampling, the more recent (and very commonly used) CodeFormer, and the new UG method. It is important to refer to the source paper for better resolution.

The authors conclude:

‘The computational cost of UG is marginal, and ongoing research aimed at reducing inference steps further minimize the portion of time consumption due to UG in LDMs.

‘We consider our method a universally beneficial add-on for generating high-resolution samples due to its ease of implementation and cost-effectiveness.’

Conclusion

This is one of the most interesting papers to have come out of the very crowded upsampling research stream in quite a while, devoid as it is of bolt-on systems, fine-tuning, 3DMM and related CGI-based adjuncts, and all the other painful paraphernalia that has defined recent research directions.

Though it is a little disappointing to find that CLIP imposes on the system a trade-off between artifact generation and prompt fidelity, we are already well-used to these decisions in the CFG system, which generally degrades quality in favor of fidelity as the CFG scale rises.

The results for prompt accuracy are outstanding, in a field beset by far less elegant layout-based solutions, while the simplicity of methodology for what appears to be very effective upscaling is rare in the current run of publications.

* My substitution of the researchers’ inline citations for hyperlinks.

My emphases.

†† It should be noted that, at this time, no video samples for this experiment have been provided.
