Meta Unveils Text-to-Video AI Generator

Meta last week announced Make-A-Video, a new AI-powered system that turns text prompts into high-quality video clips. Built on Meta AI’s recent progress in generative technology research, Make-A-Video is touted as offering new opportunities for creators and artists.

Make-A-Video

The announcement comes just months after the releases of various text-to-image services such as DALL-E 2 in April and Stable Diffusion in August, offering a glimpse of the breathtaking pace of progress in AI-generated media.

According to Meta, Make-A-Video requires just a few words or lines of text to create one-of-a-kind videos “full of vivid colors, characters, and landscapes”. The system can also create videos from still images, or take existing videos and generate similar new ones.

“Make-A-Video research builds on the recent progress made in text-to-image generation technology built to enable text-to-video generation… Make-A-Video lets you bring your imagination to life by generating whimsical, one-of-a-kind videos with just a few words or lines of text,” noted the Make-A-Video website.

To generate a video, users merely need to type in prompts such as “A teddy bear painting a portrait” or “A golden retriever eating ice cream on a beautiful tropical beach at sunset, high resolution”.

Credit: Meta AI

Even Meta CEO Mark Zuckerberg chimed in. He wrote in a Facebook post: “This is pretty amazing progress. It's much harder to generate video than photos because beyond correctly generating each pixel, the system [must also] predict how they'll change over time.”

Cutting-edge AI

According to the white paper (PDF) released by the Meta AI team behind Make-A-Video, text-to-video generation lags behind text-to-image generation for two main reasons: a lack of large-scale datasets with high-quality text-video pairs, and the complexity of modeling higher-dimensional video data.

To get around this, Meta combined data from three open-source image and video datasets to train its model. Text-image datasets of labeled still images let the AI learn the names of objects and what they look like, while unlabeled video datasets taught it how objects move. This broke the dependency on paired text-video data for text-to-video generation.

Or, as Ars Technica explained it, Meta took image synthesis models trained on captioned still images and applied unlabeled video training data, so the model learns a sense of where a text or image prompt might exist in space and time. This made it possible for Make-A-Video to predict what comes after an image and display the scene in motion for a short period.
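To make that two-stage recipe concrete, here is a minimal toy sketch of the idea. To be clear, this is not Meta’s code: the class names, layer choices, and dimensions below are invented purely for illustration, and the real system uses large diffusion models rather than these tiny stand-in networks.

import torch
import torch.nn as nn

class TextToImage(nn.Module):
    """Toy stand-in for a text-to-image generator (stage 1: learns
    appearance from captioned still images)."""
    def __init__(self, text_dim=64, img_channels=3):
        super().__init__()
        self.decode = nn.Sequential(
            nn.Linear(text_dim, 8 * 8 * img_channels),
            nn.Unflatten(1, (img_channels, 8, 8)),
        )

    def forward(self, text_emb):          # (batch, text_dim) -> (batch, C, 8, 8)
        return self.decode(text_emb)

class TemporalExtension(nn.Module):
    """Toy temporal layer that expands one frame into T frames (stage 2:
    learns motion from unlabeled video, so no text-video pairs are needed)."""
    def __init__(self, img_channels=3, num_frames=5):
        super().__init__()
        self.num_frames = num_frames
        # A 3D convolution mixes information across the new time axis.
        self.mix_time = nn.Conv3d(img_channels, img_channels,
                                  kernel_size=3, padding=1)

    def forward(self, frame):             # (batch, C, H, W) -> (batch, C, T, H, W)
        video = frame.unsqueeze(2).repeat(1, 1, self.num_frames, 1, 1)
        return self.mix_time(video)

# Stage 1 weights come from text-image training and are then frozen.
image_model = TextToImage()
for p in image_model.parameters():
    p.requires_grad = False

# Stage 2: only the temporal layers ever see (unlabeled) video data.
video_head = TemporalExtension()
text_emb = torch.randn(2, 64)             # pretend text embeddings
video = video_head(image_model(text_emb)) # shape (2, 3, 5, 8, 8): five frames
print(video.shape)

The point of the split is visible in the training setup: the spatial model only ever needs captioned images, and the temporal layers only ever need raw video, which is exactly how the dependency on paired text-video data is broken.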

To be clear, Meta has shared just over a dozen AI-generated videos so far. Moreover, each clip is no longer than five seconds and contains no audio. This makes it difficult to tell how easy it is to generate usable videos, especially since the system is not currently available to the public.

For its part, Meta said it is restricting access because it wants to be thoughtful about how it builds next-generation AI systems. For now, you can sign up to indicate interest should Meta ever decide to make it available.

The road ahead

Understandably, Meta’s reveal reignited concerns that sophisticated text-to-image and text-to-video tools will fuel rampant misinformation by lowering barriers to the creation of fake media. (OpenAI removed its waitlist for DALL-E 2 last week, making the service available to anyone willing to verify their email address and mobile number.)

Videos created with Make-A-Video currently carry a small “Meta AI” logo in the bottom right-hand corner, which is easy to remove with a video editing tool. Some commentators have called for larger watermarks that cannot be removed, though such watermarks would arguably render AI-generated videos useless.

What Meta decides to do might not matter in the long run, however. The existence of Make-A-Video will only spur other AI labs either to create their own text-to-video systems with equivalent capabilities, or to build improved systems on top of Meta’s research.

And if this isn't enough to process, a report on Ars Technica raised the ethical question of commercial media being used for non-commercial academic research, only to be subsequently incorporated into a commercial AI product.

Specifically, researcher Simon Willison discovered that Meta used over 10 million videos scraped from Shutterstock without permission, while researcher Andy Baio found another 3.3 million videos from YouTube.

Baio expands on this in a thought-provoking commentary titled “AI Data Laundering: How Academic and Nonprofit Researchers Shield Tech Companies from Accountability”.

Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].

Image credit: DALL-E 2