Google Research has revealed a new type of framework for text-to-image synthesis, based on the Transformers architecture, rather than latent diffusion.

The project is titled Muse, and is essentially a patchwork assembly of multiple prior works (many also from Google Research) – though in itself it constitutes the first major generative system to leverage transformers so directly, and to abandon latent diffusion, despite the current huge popularity of that architecture.
The Transformers architecture itself was debuted by Google Brain (amongst others) in mid-2017, and has since formed only components of some popular image synthesis frameworks, rather than serving as the central enabling technology – as it does in the new work.
The project’s most notable breakthrough is in how quickly it can perform inference (i.e., how long you have to wait for your ‘photo of a bear drinking coffee’). In this respect, Muse definitively beats Stable Diffusion, as well as Google’s own prior Imagen text-to-image model:

The improved inference times are attributable to various factors, not the least of which is parallel decoding – not a mainstay of existing text-to-image systems – which optimizes how tokens are handled during generation (tokens are the discrete units into which data is split: for text, typically a word or word-fragment, while in Muse images are likewise represented as sequences of discrete tokens). To quote the source paper from which this aspect of Muse is derived, ‘at each iteration, the model predicts all tokens simultaneously in parallel but only keeps the most confident ones.’
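The broad mechanics of that scheme can be illustrated with a short sketch; the `model` interface, the mask token and the cosine schedule below are illustrative assumptions in the spirit of MaskGIT-style decoding (discussed further down), not Muse’s actual implementation:

```python
import math
import torch

def parallel_decode(model, tokens, mask_id, steps=12):
    # Minimal sketch of confidence-based parallel decoding in the MaskGIT
    # style, assuming `model` maps a (1, seq_len) tensor of token ids to
    # (1, seq_len, vocab) logits and `mask_id` marks positions still to be
    # filled. The cosine schedule and step count are illustrative assumptions.
    seq_len = tokens.shape[1]
    for step in range(1, steps + 1):
        logits = model(tokens)                       # predict every position at once
        probs = logits.softmax(dim=-1)
        confidence, predictions = probs.max(dim=-1)  # best guess + its confidence

        still_masked = tokens == mask_id
        if not still_masked.any():
            break

        # Positions committed in earlier iterations are never revisited.
        confidence = confidence.masked_fill(~still_masked, float("-inf"))

        # Cosine schedule: the fraction of tokens left masked shrinks each step.
        target_masked = int(math.cos(math.pi / 2 * step / steps) * seq_len)
        num_masked = int(still_masked.sum())
        num_to_commit = min(num_masked, max(1, num_masked - target_masked))

        # '...the model predicts all tokens simultaneously in parallel
        #  but only keeps the most confident ones.'
        top_idx = confidence.flatten().topk(num_to_commit).indices
        flat = tokens.flatten().clone()
        flat[top_idx] = predictions.flatten()[top_idx]
        tokens = flat.view_as(tokens)
    return tokens
```

Because all positions are predicted at every pass, the number of forward passes stays a small constant rather than growing with the number of tokens – the essence of the speed advantage over sequential, token-by-token decoders.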

Among a complex slew of tests, output from Muse was subjected to human evaluation. In an evaluation round requiring a 3-user consensus, results from text prompts for Muse were preferred by raters in 70.6% of cases, which the authors attribute to Muse’s superior prompt fidelity.

The authors estimate that Muse is over ten times faster at inference time than Google’s prior Imagen-3B or Parti-3B models, and three times faster than Stable Diffusion (v1.4). They attribute the latter improvement to the ‘significantly higher number of iterations’ that Stable Diffusion requires for inference.
The new Google paper notes that the improved speed does not impose any loss of quality, in comparison to recent and popular systems, and further observes that Muse is capable of out-of-the-box maskless editing, outpainting and inpainting.
Mask-free editing is, in particular, a ‘Holy Grail’ for Stable Diffusion, which struggles to make selective prompt-based changes to an existing image, largely due to the sector-wide issue of entanglement.

Muse addresses a frequent bugbear of systems such as Stable Diffusion and DALL-E 2 – prompt fidelity, including properties such as cardinality (depicting the correct number of requested objects): to what extent should the output image honor one part of the prompt relative to another, without resorting to hacks, workarounds and other types of third-party instrumentality that aren’t conducive to an easy user experience?

The parts of a text-prompt that get ‘preference’ can greatly affect the quality of the output in a generative system. In Stable Diffusion, for instance, the earliest words in a long text-prompt will be prioritized, and later or ancillary parts of the prompt may be ignored entirely as the architecture’s ability to coherently synthesize the prompt is challenged by complex requests.
The authors of the new paper note that Muse is well-disposed to include the entire content of even quite lengthy prompts – something that remains a shortcoming in rival systems.
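One concrete mechanism behind this in Stable Diffusion v1.x is the 77-token context window of its CLIP text encoder: anything beyond that limit is simply cut off before the diffusion model ever sees it (attention weighting within the window is a further factor). The snippet below, using the Hugging Face tokenizer for the same CLIP model, demonstrates the truncation; the over-long prompt is purely illustrative:

```python
from transformers import CLIPTokenizer

# Stable Diffusion v1.x conditions on OpenAI's CLIP ViT-L/14 text encoder,
# whose context window is 77 tokens; later prompt content is discarded.
tokenizer = CLIPTokenizer.from_pretrained("openai/clip-vit-large-patch14")

prompt = ("a photo of a bear drinking coffee at a wooden table in a sunlit "
          "kitchen, wearing a red scarf, next to a stack of pancakes, ") * 4

full = tokenizer(prompt)["input_ids"]                               # no truncation
clipped = tokenizer(prompt, truncation=True, max_length=77)["input_ids"]

print(len(full), len(clipped))   # well over 77 tokens in, only 77 survive
```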

The authors note that Muse’s approach to image editing is less laborious than many recent attempts at the challenge, stating*:
‘This method works directly on the (tokenized) image and does not require “inverting” the full generative process, in contrast with recent zero-shot image editing techniques leveraging generative models.’
Inpainting, by contrast, is a more structured procedure, in which the user explicitly masks out a section of a picture and applies changes only to that area, much as in Photoshop. Within that area, a new or revised text-prompt can alter the content of the picture while retaining the broader context:

Muse’s sampling procedure facilitates inpainting ‘for free’ (as the authors describe it). The input image (which may be a real or previously synthesized picture) is downsampled and re-upsampled (256px/512px), and both versions are converted into low-resolution and high-resolution tokens. The system then masks out the tokens corresponding to the region to be edited, and parallel sampling – conditioned on the remaining unmasked tokens and a user-supplied text-prompt – fills in the masked region, providing the actual inpainting.
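Expressed as a loose sketch, that token-level workflow might look like the following; every callable and name here is a hypothetical placeholder standing in for Muse’s unreleased components, not its published API:

```python
import torch
import torch.nn.functional as F

MASK_ID = -1  # hypothetical sentinel id for a masked token position

def region_to_token_grid(pixel_mask, token_hw):
    # Downsample a (1, 1, H, W) pixel-space mask to the token-grid resolution.
    return F.interpolate(pixel_mask.float(), size=token_hw, mode="nearest").squeeze(1).bool()

def inpaint(image_512, pixel_mask, text_embedding,
            tokenize_low, tokenize_high, detokenize_high,
            decode_low, decode_high):
    # Hypothetical sketch of Muse-style token inpainting; the tokenizers,
    # detokenizer and the two parallel-decoding routines passed in are
    # placeholders for the base and super-resolution models.

    # 1. Tokenize the input at both resolutions; the 256px version is
    #    obtained by downsampling the 512px image.
    image_256 = F.interpolate(image_512, size=(256, 256), mode="bilinear")
    low_tokens = tokenize_low(image_256)    # e.g. a (1, 16, 16) grid of VQ codes
    high_tokens = tokenize_high(image_512)  # e.g. a (1, 64, 64) grid of VQ codes

    # 2. Mask out the tokens that fall inside the user-selected region.
    low_tokens[region_to_token_grid(pixel_mask, low_tokens.shape[-2:])] = MASK_ID
    high_tokens[region_to_token_grid(pixel_mask, high_tokens.shape[-2:])] = MASK_ID

    # 3. Parallel sampling, conditioned on the unmasked tokens and the text
    #    prompt, fills the masked region at low and then high resolution.
    low_tokens = decode_low(low_tokens, text_embedding)
    high_tokens = decode_high(high_tokens, text_embedding, low_tokens)

    # 4. Decode the completed high-resolution tokens back to pixels.
    return detokenize_high(high_tokens)
```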
Outpainting is a broadly similar procedure, except that it usually involves extending the image beyond its current borders. Nonetheless, token-based context remains the key to generating imagery that stays consistent with the original.


Approach
As mentioned above, the central approach of Muse is based on Google’s prior work MaskGIT, a masked image modeling technique which debuted in February of 2022. The MaskGIT paper leveled some criticism against prior Transformer-centric image synthesis systems, broadly comparing their sequential, token-by-token generation to the line-by-line process of obtaining a picture with a flatbed scanner.
Muse instead generates image tokens in parallel, and its image decoder architecture is conditioned on embeddings obtained from a pre-trained and frozen large language model (LLM), T5-XXL. The authors observe*:
‘In agreement with Imagen, we find that conditioning on a pre-trained LLM is crucial for photorealistic, high quality image generation. Our models (except for the VQGAN quantizer) are built on the Transformer architecture.’
In the image below, we can see the parallel processes in action for the conceptual architecture of Muse, with both the higher and lower-res versions of the input image being run through differing routines which extract and reconstruct semantic tokens, until high-res tokens are available for image creation and manipulation:

The VQ tokenizer for the lower-resolution model is pre-trained on 256×256px images, producing a 16×16 latent grid of tokens. The resulting sequence is then masked at a variable (rather than constant) rate per sample, and the model is trained with a cross-entropy loss to predict the masked image tokens; the same scheme is then used to create higher-resolution masked tokens.
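A minimal sketch of that training step follows, assuming the frozen VQ tokenizer has already produced a flattened 16×16 grid of codes per image; the cosine-style sampling of the masking rate follows MaskGIT-like schedules and is an assumption rather than Muse’s exact choice:

```python
import math
import torch
import torch.nn.functional as F

def masked_training_step(transformer, vq_tokens, text_embedding, mask_id, optimizer):
    # Sketch of masked-token training with a variable masking rate.
    # `vq_tokens` is assumed to be a (batch, 256) tensor of codes from the
    # frozen VQ tokenizer (a 16x16 grid, flattened); `transformer` is assumed
    # to return logits of shape (batch, 256, vocab_size).
    batch, seq_len = vq_tokens.shape

    # A different masking rate is drawn for every sample in the batch
    # (variable, rather than constant, per-sample rate).
    rate = torch.cos(torch.rand(batch, 1) * math.pi / 2)   # in (0, 1]
    mask = torch.rand(batch, seq_len) < rate               # True = position is masked

    inputs = vq_tokens.masked_fill(mask, mask_id)
    logits = transformer(inputs, text_embedding)

    # Cross-entropy is computed only on the positions the model must fill in.
    loss = F.cross_entropy(logits[mask], vq_tokens[mask])

    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```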
To a certain extent, the researchers behind Muse do not know exactly how the T5-XXL LLM contributes to the results that the framework is capable of, and state*:
‘Our hypothesis is that the Muse model learns to map these rich visual and semantic concepts in the LLM embeddings to the generated images; it has been shown in recent work that the conceptual representations learned by LLM’s are roughly linearly mappable to those learned by models trained on vision tasks.
‘Given an input text caption, we pass it through the frozen T5-XXL encoder, resulting in a sequence of 4096 dimensional language embedding vectors. These embedding vectors are linearly projected to the hidden size of our Transformer models (base and super-res).’
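That conditioning path can be sketched as follows; the smaller ‘t5-large’ checkpoint and the 2048-dimensional projection below stand in for the (far larger) T5-XXL encoder and Muse’s actual hidden sizes, purely for illustration:

```python
import torch
from transformers import T5EncoderModel, T5Tokenizer

# A frozen T5 encoder produces per-token embeddings for the caption, which are
# then linearly projected to the hidden size of the image transformers.
tokenizer = T5Tokenizer.from_pretrained("t5-large")   # stand-in for T5-XXL
encoder = T5EncoderModel.from_pretrained("t5-large").eval()
for p in encoder.parameters():
    p.requires_grad_(False)                           # the language model stays frozen

# Hypothetical hidden size for the base/super-res models; only the projection
# (and the image transformers themselves) would be trained.
projection = torch.nn.Linear(encoder.config.d_model, 2048)

caption = "a photo of a bear drinking coffee"
inputs = tokenizer(caption, return_tensors="pt")
with torch.no_grad():
    text_embeddings = encoder(**inputs).last_hidden_state   # (1, seq_len, d_model)

conditioning = projection(text_embeddings)   # attended to by the image models
print(conditioning.shape)
```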
Cloistered Genius?
There are two interesting considerations here: one is that this approach is quite revolutionary, since it departs entirely from diffusion models and performs well on the home territory of Stable Diffusion (and, to a lesser extent, of Generative Adversarial Networks, or GANs, which have lately been eclipsed by diffusion approaches). Google Research releases papers frequently, and it can be hard to tell which are going to have a major impact on the image synthesis scene.
For instance, the release of DreamBooth late last year was almost lost in a conference-season avalanche of new releases from the research arm of the search giant. As it transpired, the relatively dry DreamBooth documentation and code were eventually spun into what may be the biggest and most controversial development in the short history of deepfakes, since the term was coined in 2017.
The second consideration is that Google, once again, is apparently too afraid of repercussions to release the code. After announcing that neither the source code for Muse nor any public demo will be made available ‘at this time’, the new paper states:
‘[we] do not recommend the use of text-to-image generation models without attention to the various use cases and an understanding of the potential for harm. We especially caution against using such models for generation of people, humans and faces.’
Barring a change of heart on this, it seems that Muse, like many other ‘powerful’ frameworks from Google, may end up simply being yet another benchmark for subsequent image synthesis frameworks (which may not be released either, for the same reasons); though it seems probable that the company is at least interested in developing API-only access to such systems, once the inevitable user workarounds for its restrictions have been adequately nailed down.
* My conversion of the authors’ inline citations to hyperlinks.