Continually feeding an AI model with AI-generated content can cause its output quality to erode, according to a new research paper.
In a study published earlier this month, scientists at Rice University and Stanford University concluded that training AI models exclusively on the outputs of generative AI is not a good idea. They titled their report: “Self-consuming generative models go MAD”.
Train generative AI models on enough AI-generated content, and the resulting model will apparently turn out worse. This holds for ChatGPT-like systems and image generators alike. For images, the researchers found that training generative AI models on synthetic data progressively amplifies artefacts.
A plague of synthetic data
As synthetic data from generative models proliferates on the Internet and in standard training datasets, future models will likely be trained on some mixture of real and synthetic data, forming an autophagous (“self-consuming”) loop, the authors note.
Moreover, AI builders are always hungry for more training material. This creates a temptation to lean on “synthetic data”, that is, data generated by other AI models, as a substitute.
“Seismic advances in generative AI algorithms for imagery, text, and other data types has led to the temptation to use synthetic data to train next-generation models,” explained the researchers.
The researchers found that training on a mixture of a fixed real dataset and synthetic data, without sampling bias, still reduces both the quality and diversity of the synthetic outputs over successive generations, albeit more slowly than in a fully synthetic loop.
“Our primary conclusion across all scenarios is that without enough fresh real data in each generation of an autophagous loop, future generative models are doomed to have their quality (precision) or diversity (recall) progressively decrease,” they wrote.
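To make the idea of an autophagous loop concrete, here is a deliberately simple toy sketch in Python. It is not the paper’s experimental setup (which involves image and text generators); it just fits a 1-D Gaussian to data, samples from the fit, refits on those samples, and repeats, optionally mixing a fraction of fresh real data back in each generation. The function name and the real_fraction parameter are illustrative assumptions, not anything from the paper.

```python
# Toy illustration of a self-consuming ("autophagous") training loop.
# Fit a 1-D Gaussian, sample from it, refit on the samples, repeat.
import numpy as np

rng = np.random.default_rng(0)

def autophagous_loop(generations=40, n=200, real_fraction=0.0):
    """Return the fitted standard deviation after each generation."""
    real_data = rng.normal(loc=0.0, scale=1.0, size=n)   # the "real" distribution
    mu, sigma = real_data.mean(), real_data.std()
    history = []
    for _ in range(generations):
        synthetic = rng.normal(mu, sigma, size=n)         # sample the current model
        k = int(real_fraction * n)                        # optionally inject fresh real data
        fresh = rng.normal(0.0, 1.0, size=k)
        train = np.concatenate([synthetic[: n - k], fresh])
        mu, sigma = train.mean(), train.std()             # refit on the mixture
        history.append(sigma)
    return history

# Without fresh real data, the fitted spread follows a multiplicative random
# walk that loses its anchor to the real distribution; mixing fresh real data
# in each generation keeps the estimate tied to it.
print("fully synthetic :", [round(s, 3) for s in autophagous_loop(real_fraction=0.0)][::10])
print("20% fresh real  :", [round(s, 3) for s in autophagous_loop(real_fraction=0.2)][::10])
```

The only point of the sketch is that a self-consuming loop drifts away from the real distribution unless fresh data keeps flowing in, which mirrors the paper’s precision-and-recall argument quoted above.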
Going MAD
The researchers coined the term “MAD” for this process, drawing an analogy with mad cow disease.
The authors noted how the practice of feeding cattle the remains of other cattle led to mad cow disease, which was brought under control only through massive intervention.
“Lest an analogous malady disrupt the AI future, and to coin a term, it seems prudent to understand what can be done to prevent generative models from developing Model Autophagy Disorder (MAD).”
The race for fresh training data is real. Restrictions are going up everywhere as platforms from Twitter to Reddit bear the brunt of data scraping for training new AI models.
For now, the research raises the very real possibility that future AI models will not improve significantly, or may even get worse, given how difficult it is to reliably detect and remove AI-generated content from the data used to train new models.
Will we have to dump our AI-tainted tools as they succumb to inevitable model collapse at some point in the future? You can access the paper here (pdf).
Paul Mah is the editor of DSAITrends. A former system administrator, programmer, and IT lecturer, he enjoys writing both code and prose. You can reach him at [email protected].
Image credit: iStockphoto/yuriyzhuravov