An interesting publication from VentureBeat predicts an ominous future for artificial intelligence's large language models (LLMs):
As those who follow the growing industry and its underlying research know, the data used to train the large language models (LLMs) and other models that support products like ChatGPT, Stable Diffusion, and Midjourney comes originally from human sources – books, articles, photos, and so on – created without the help of artificial intelligence.
Now, as more and more people use AI to produce and publish content, an obvious question arises:
What will happen as AI-generated content proliferates online and AI models begin to be trained by it, rather than human-generated content?
A team of researchers from the UK and Canada looked at this very problem and recently published a paper on arXiv, the open-access preprint server.
What they found is troubling for current AI technology and its future:
"We find that using model-generated content in training causes irreversible defects in the resulting models." Looking specifically at the probability distributions of text-to-text and image-to-image AI generation models, the researchers concluded that "learning from data generated by other models causes model collapse – a degenerative process in which, over time, models forget the truth. This process is inevitable, even in cases with near-ideal conditions for long-term learning."
In an email to VentureBeat, Ilia Shumailov, one of the study's authors, said: "We were surprised to notice how quickly model collapse can occur: models can quickly forget most of the initial data they learned from in the first place."
In other words: as an AI model is trained on more AI-generated data, it performs worse over time, producing more errors in the answers and content it generates.
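The feedback loop behind this degradation can be illustrated with a toy simulation (a hypothetical sketch for intuition, not the researchers' actual experiment): fit a simple Gaussian "model" to data, generate a new dataset from the fit, refit on that generated data, and repeat. Because each generation is trained only on the previous generation's output, estimation error compounds and the distribution's spread withers – a single-distribution analogue of model collapse.

```python
import random
import statistics

def fit_and_resample(data, n_samples, rng):
    # "Train" a toy model: fit a Gaussian by maximum likelihood.
    mu = statistics.fmean(data)
    sigma = statistics.pstdev(data)  # MLE estimate (divides by n), biased low
    # "Generate" training data for the next generation from the fitted model.
    return [rng.gauss(mu, sigma) for _ in range(n_samples)]

rng = random.Random(0)
n = 50  # small sample per generation, so estimation error is visible
data = [rng.gauss(0.0, 1.0) for _ in range(n)]  # generation 0: "human" data

variances = []
for generation in range(500):
    data = fit_and_resample(data, n, rng)
    variances.append(statistics.pvariance(data))

print(f"variance after generation 1:   {variances[0]:.3g}")
print(f"variance after generation 500: {variances[-1]:.3g}")
```

Each refit slightly underestimates the variance and loses the tails of the previous generation's samples, so over many generations the fitted distribution contracts toward its mean – the toy counterpart of a model "forgetting" rare but true data.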
As another of the study's authors, Ross Anderson, professor of security engineering at the University of Cambridge and the University of Edinburgh, wrote in a blog post discussing the work:
"Just like we've filled the oceans with plastic trash and the atmosphere with carbon dioxide, we're going to fill the internet with blah blah. This will make it harder to train newer models from human-generated data, giving the advantage to companies that already do or that control access to human data."