Another day, another new text-to-image AI model. But unlike other recent offerings, Stable Diffusion will have fewer limitations on the type of images that users can generate, and will run on consumer GPUs.
Why does the ability to run on a consumer GPU matter? For a start, the massive amount of computing required to train large models has meant that the best models to date are typically built by large tech firms that keep them to themselves. While some of these models can be accessed by researchers, their sheer size often means that only well-resourced labs have the hardware to run them even when granted access.
The model for everyone
Built by a startup called Stability AI, Stable Diffusion is billed as the text-to-image model for everyone.
“Stable Diffusion is a text-to-image model that will empower billions of people to create stunning art within seconds. It is a breakthrough in speed and quality meaning that it can run on consumer GPUs,” said Emad Mostaque, the founder of Stability AI, in a blog post.
Apart from offering access online, the startup will release specific models under a license that allows them to be downloaded and used for any purpose. Stable Diffusion is understood to run on graphics cards with around 5GB of VRAM and can generate 512x512 images within seconds.
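To put those hardware claims in perspective, here is a minimal sketch of what generating an image on a consumer GPU could look like using Hugging Face’s diffusers library. The model identifier, half-precision loading, and attention slicing below are illustrative choices aimed at fitting into roughly 5GB of VRAM, not official guidance from Stability AI.

```python
import torch
from diffusers import StableDiffusionPipeline

# Load the released weights in half precision to shrink the memory
# footprint on a consumer GPU (the model id is illustrative).
pipe = StableDiffusionPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# Attention slicing trades a little speed for lower peak memory use.
pipe.enable_attention_slicing()

# Generate a 512x512 image from a text prompt.
image = pipe(
    "a photograph of an astronaut riding a horse",
    height=512,
    width=512,
).images[0]
image.save("astronaut.png")
```

On cards with more memory, the half-precision and slicing steps can simply be dropped.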
For users who prefer to access it online, Stable Diffusion also appears to impose fewer restrictions on use. While it incorporates keyword filters to prevent blatant misuse, Stability AI doesn’t have a policy against images of public figures and has been reported to be more permissive than most. It is up to users to do as they will, according to Mostaque.
Training Stable Diffusion
Mostaque shed some light on Stable Diffusion in his blog post. Like OpenAI’s DALL-E 2, Stable Diffusion is a diffusion model: it gradually builds a coherent image from pure noise, refining it over successive steps until it matches the given text description.
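In deliberately simplified form, the sampling loop behind a diffusion model looks something like the sketch below. The denoiser argument stands in for the trained network, and the fixed step size is a toy substitute for the more careful update rules that real samplers use.

```python
import torch

def sample(denoiser, text_embedding, steps=50, shape=(1, 3, 512, 512)):
    # Start from pure Gaussian noise.
    x = torch.randn(shape)
    # Walk backwards through the noise schedule: at each step the model
    # predicts the noise present in x, conditioned on the text, and we
    # subtract a little of it, nudging x toward a coherent image.
    for t in reversed(range(steps)):
        predicted_noise = denoiser(x, t, text_embedding)
        x = x - (1.0 / steps) * predicted_noise
    return x  # an image that should match the text description
```

In practice, Stable Diffusion runs this loop in a compressed latent space and then decodes the result into pixels, which is part of what makes it fast enough for consumer hardware.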
The core model was trained on LAION-Aesthetics, a soon-to-be-released subset of LAION-5B, a large-scale dataset consisting of 5.85 billion CLIP-filtered image-text pairs. LAION-Aesthetics was itself assembled with a new CLIP-based model that selected images according to how “beautiful” they were, based on ratings assigned by alpha testers.
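How might such aesthetic filtering work? A rough sketch, assuming a small regression head trained to map CLIP image embeddings to human ratings; the head, threshold, and file names here are hypothetical.

```python
import torch
import clip  # OpenAI's CLIP package
from PIL import Image

# Hypothetical aesthetic head: a tiny model trained to map CLIP image
# embeddings (512-dim for ViT-B/32) to a human "beauty" rating.
aesthetic_head = torch.nn.Linear(512, 1)

# Load CLIP on the CPU for simplicity.
model, preprocess = clip.load("ViT-B/32", device="cpu")

def aesthetic_score(path):
    image = preprocess(Image.open(path)).unsqueeze(0)
    with torch.no_grad():
        emb = model.encode_image(image)
        emb = emb / emb.norm(dim=-1, keepdim=True)  # normalize embedding
        return aesthetic_head(emb).item()

# Keep only the images whose predicted rating clears a threshold.
keep = [p for p in ["a.jpg", "b.jpg"] if aesthetic_score(p) > 0.5]
```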
Stable Diffusion was trained on a cluster of 4,000 Nvidia A100 GPUs running on AWS for a month. The resultant model was tested at scale with over 15,000 beta testers creating two million images a day, according to Mostaque.
According to a report on TechCrunch, CompVis, a machine vision and learning research group at the Ludwig Maximilian University of Munich, oversaw the training, while Stability AI donated the compute power.
Making money from AI
So how will Stability AI make money? Speaking to TechCrunch, Mostaque shared that his company will make money by training private models for customers and acting as a general infrastructure layer.
He also claims that Stability AI has other commercializable projects in the works, including AI models for generating audio, music, and even video.
“We will provide more details of our sustainable business model soon with our official launch, but it is basically the commercial open source software playbook: services and scale infrastructure,” said Mostaque.
“We think AI will go the way of servers and databases, with open beating proprietary systems – particularly given the passion of our communities.”
The future for humans
So where does the ready availability of increasingly advanced AI models leave us? While many of the generated images show telltale signs of algorithmic creation, such as mismatched limbs or inaccurate reflections, they look passable to a casual viewer.
This means that malicious actors could potentially use Stable Diffusion to generate photorealistic images for general misinformation or propaganda. Moreover, AI models like Stable Diffusion will only get better.
Finally, as image-generation software improves in leaps and bounds, can creatives survive the coming onslaught of AI-generated imagery? Over at photography website PetaPixel, it was pointed out that Stable Diffusion appears to generate far better landscape images than DALL-E 2.
Notably, Stability AI doesn’t assert rights over images created by Stable Diffusion. And according to instructions on the Stable Diffusion GitHub page, it is possible to input a base image for text-guided image-to-image translation and upscaling, along the lines sketched below.
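As a rough illustration of that image-to-image workflow, the diffusers library exposes an equivalent pipeline. The model id, file names, and strength value below are illustrative assumptions rather than the repository’s own scripts.

```python
import torch
from PIL import Image
from diffusers import StableDiffusionImg2ImgPipeline

pipe = StableDiffusionImg2ImgPipeline.from_pretrained(
    "CompVis/stable-diffusion-v1-4",
    torch_dtype=torch.float16,
).to("cuda")

# A rough sketch or photo to use as the base image.
init_image = Image.open("sketch.png").convert("RGB").resize((512, 512))

# strength controls how far the result may drift from the base image:
# low values stay close to it, values near 1.0 mostly ignore it.
result = pipe(
    prompt="a fantasy landscape, matte painting",
    image=init_image,
    strength=0.75,
).images[0]
result.save("landscape.png")
```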
Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].