Google has unveiled a new text-to-image model called Muse, which it says generates high-quality images on par with competing models such as DALL-E 2 and Imagen, but significantly faster.
According to Google, a 256 by 256 image can be generated in as little as 0.5 seconds on a TPUv4 chip, compared to 9.1 seconds with Imagen, a diffusion model from Google touted as offering an “unprecedented degree of photorealism” and a “deep level of language understanding”.
TPUs, or Tensor Processing Units, are custom chips developed by Google as dedicated AI accelerators. TPUv4 was released last year.
At its heart, Muse is a 900-million parameter model that generates images with a masked generative transformer, rather than through pixel-space diffusion or autoregressive decoding. In a nutshell, Muse is trained to predict randomly masked image tokens.
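For the technically inclined, here is a minimal sketch of what that training objective looks like in PyTorch. Google has not released Muse’s code, so the `transformer` and tokenizer here are hypothetical stand-ins; the point is only to illustrate the masked-token idea.

```python
# Minimal sketch of masked image-token training (illustrative; not Google's code).
# Assumes a VQ-style tokenizer has already turned each image into a 1D sequence
# of discrete token IDs, and `transformer` returns logits over the codebook.
import torch
import torch.nn.functional as F

def masked_token_loss(transformer, image_tokens, text_embedding, mask_id, mask_ratio=0.5):
    """Randomly mask image tokens; train the model to predict the hidden ones."""
    batch, seq_len = image_tokens.shape
    mask = torch.rand(batch, seq_len) < mask_ratio       # pick positions to hide
    inputs = image_tokens.masked_fill(mask, mask_id)     # replace them with [MASK]
    logits = transformer(inputs, text_embedding)         # (batch, seq_len, codebook)
    # The loss is computed only on the masked positions.
    return F.cross_entropy(logits[mask], image_tokens[mask])
```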
Apart from being faster, Google says the state-of-the-art (SOTA) model behind Muse generates excellent images, citing a FID score of 7.88 and a CLIP score of 0.32 – FID (Fréchet Inception Distance) measures image generation quality, with lower being better, while CLIP measures how closely generated images align with their text prompts, with higher being better.
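As a rough illustration of what a CLIP score captures, the snippet below embeds an image and a prompt with a public CLIP checkpoint and takes their cosine similarity. This shows the general idea, not the exact evaluation pipeline used in the Muse paper.

```python
# Rough illustration of a CLIP score: cosine similarity between CLIP's image
# and text embeddings, using a public OpenAI checkpoint via Hugging Face.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")

def clip_score(image: Image.Image, prompt: str) -> float:
    inputs = processor(text=[prompt], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    img = out.image_embeds / out.image_embeds.norm(dim=-1, keepdim=True)
    txt = out.text_embeds / out.text_embeds.norm(dim=-1, keepdim=True)
    return float((img * txt).sum())
```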
A new approach
Under the hood, Muse is more efficient due to its use of discrete tokens and fewer sampling iterations. Crucially, the Google researchers behind Muse say this speed does not come at a cost: “The efficiency improvement of Muse, however, does not come at a loss of generated image quality or semantic understanding of the input text prompt.”
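The “fewer sampling iterations” part comes from parallel decoding: every masked token is predicted at once, and the most confident predictions are locked in over a couple of dozen steps, rather than the hundreds of steps a diffusion model takes. Below is a sketch of that loop; the `transformer` and the cosine unmasking schedule (borrowed from the related MaskGIT work) are my assumptions, not Muse’s published code.

```python
# Sketch of MaskGIT-style parallel decoding, the idea behind Muse's speed.
# All tokens start masked; each step predicts every masked token in parallel
# and keeps only the most confident predictions.
import math
import torch

def parallel_decode(transformer, text_embedding, seq_len, mask_id, steps=24):
    tokens = torch.full((1, seq_len), mask_id, dtype=torch.long)
    for step in range(steps):
        logits = transformer(tokens, text_embedding)     # (1, seq_len, codebook)
        confidence, predicted = logits.softmax(dim=-1).max(dim=-1)
        still_masked = tokens == mask_id
        # Cosine schedule: fraction of tokens left masked after this step.
        keep_masked = math.cos(math.pi / 2 * (step + 1) / steps)
        num_to_unmask = int(still_masked.sum()) - int(keep_masked * seq_len)
        if num_to_unmask <= 0:
            continue
        # Unmask the highest-confidence positions among those still masked.
        confidence = confidence.masked_fill(~still_masked, -1.0)
        idx = confidence.topk(num_to_unmask, dim=-1).indices
        tokens[0, idx[0]] = predicted[0, idx[0]]
    return tokens
```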
Moreover, the use of a pre-trained large language model (LLM) gives Muse the fine-grained language understanding needed for high-fidelity image generation. The LLM in question is Google’s own T5, which stands for Text-To-Text Transfer Transformer; Muse conditions its image generation on T5’s text embeddings.
According to Synced, Muse inherits rich information such as objects, actions, visual properties, and spatial relationships from its T5 embeddings, and is hence better positioned to apply these concepts to generated images. In short, Muse generates images that better match text inputs.
“Muse generates images that reflect different parts of speech in input captions, including nouns, verbs, and adjectives. Furthermore, we present evidence of multi-object properties understanding, such as compositionality and cardinality, as well [as] image style understanding,” wrote the researchers.
Mask-free editing and other tricks
In addition, Muse directly enables image editing applications without the need to fine-tune or invert the model. Specifically, Muse offers mask-free editing of user-supplied images out of the box, achieved by iteratively resampling image tokens conditioned on a text prompt.
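Conceptually, that resampling loop might look something like the sketch below: tokenize the user’s image, then repeatedly re-mask a small random subset of tokens and let the text-conditioned transformer resample them, nudging the image toward the new prompt while leaving the rest intact. The tokenizer, transformer, and ratios here are hypothetical placeholders, since the actual code is unreleased.

```python
# Illustrative sketch of mask-free editing by iterative resampling
# (hypothetical helpers; the real Muse procedure is not public).
import torch

def mask_free_edit(transformer, tokenizer, image, prompt_embedding,
                   mask_id, rounds=20, resample_ratio=0.1):
    tokens = tokenizer.encode(image)                     # (1, seq_len) token IDs
    for _ in range(rounds):
        remask = torch.rand_like(tokens, dtype=torch.float) < resample_ratio
        noised = tokens.masked_fill(remask, mask_id)
        logits = transformer(noised, prompt_embedding)   # text-conditioned logits
        sampled = torch.distributions.Categorical(logits=logits).sample()
        tokens = torch.where(remask, sampled, tokens)    # resample masked spots only
    return tokenizer.decode(tokens)                      # back to pixel space
```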
Muse can also perform inpainting and outpainting. Inpainting fills in a selected region of an image – removing an object by generating the background behind it, for instance – while outpainting extends an existing image outward with AI-generated content.
In one example of mask-free editing given by the researchers, Muse swapped the background of a park scene for San Francisco, New York City, or Paris. In another, a car parked beside a cabin was removed, then replaced with a “beat up pickup truck” and a “horse tied to a post”.
Its language abilities also make precise edits possible. In a further example, a picture of a slice of cake and a cup of coffee was edited into one showing a croissant, with flower latte art in the coffee.
No code released
The sheer versatility of Muse is compelling even in the face of an increasingly crowded field. However, though Google released a research paper about Muse, it did not release the code or any demo of the project.
This was flagged as concerning by at least one observer. Eva Rtology, an art curator who does ML consulting, wrote: “This is a surprising move from Google, considering that it has been a major proponent of open-source projects. Such a project could have been a huge leap forward for the generative AI space, but as it stands, Google has chosen to keep it under wraps for the time being.”
As I wrote recently in “The Price of AI”, major AI tools today are open-sourced, but the race to develop more capable AI may see organizations holding back. This could end up hurting the field, slowing progress and breeding new monopolies that put a brake on the openness that has powered AI research to date.
In the meantime, you can read more about Muse in the blog here, or access the research paper “Muse: Text-To-Image Generation via Masked Generative Transformers” here.
Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].
Image credit: iStockphoto/Muse