According to a new study published in Nature, training AI models on datasets generated by AI itself can lead to "model collapse," in which the models produce increasingly absurd results over successive generations.
In one example, a model began with a text on European architecture in the Middle Ages and, by the ninth generation, ended up spouting nonsense about rabbits.
The study, led by Ilia Shumailov of Google DeepMind, formerly a postdoctoral researcher at the University of Oxford, found that AI models tend to under-sample less common lines of text in their output, meaning that subsequent models trained on that output cannot carry those nuances forward. Training new models on the output of previous models in this way creates a recursive loop. In an accompanying article, Emily Wenger, an assistant professor of electrical and computer engineering at Duke University, illustrated model collapse with the example of a system tasked with generating images of dogs.
"The AI model will gravitate toward recreating the dog breeds that are most common in its training data, so it may overrepresent the Golden Retriever compared to the Petit Basset Griffon Vendéen, given the relative prevalence of the two breeds," he said.
"If subsequent models are trained on an AI-generated dataset that over-represents Golden Retrievers, the problem will get worse. With several rounds of overrepresented Golden Retrievers, the model will forget that other dog breeds like the Petit Basset Griffon Vendéen exist and create images of only Golden Retrievers."
Eventually, the model will break down, rendering it unable to generate meaningful content. While she notes that an overrepresentation of Golden Retrievers may be a relatively harmless outcome, the same collapse process is a serious problem for producing meaningfully representative output that includes less common ideas and ways of writing.
"This is the problem at the heart of the collapse of the model," he said.
doi: https://doi.org/10.1038/d41586-024-02420-7