As those who follow the growing AI industry and its underlying research know, the data used to train large language models (LLMs) and the other models that support products like ChatGPT, Stable Diffusion, and Midjourney is originally derived from human sources – books, articles, photos, and so on – created without the help of artificial intelligence.

Now, as more and more people use AI to produce and publish content, an obvious question arises:

What will happen as AI-generated content proliferates online and AI models begin to be trained on it, rather than on human-generated content?

A team of researchers from the UK and Canada looked at this very problem and recently published a paper on the open-access preprint server arXiv.

"We find that using model-generated content in training causes irreversible defects in the resulting models," the researchers write. Looking specifically at the probability distributions of text-to-text and image-to-image generative AI models, they concluded that "learning from data generated by other models causes model collapse – a degenerative process in which, over time, models forget the truth. This process is inevitable, even for cases with near-ideal conditions for long-term learning."

Ilia Shumailov, one of the paper's authors, said in an email to VentureBeat, "We were surprised to notice how quickly model collapse can occur: Models can quickly forget most of the initial data they learned from in the first place."

In other words: as an AI model is trained on more and more AI-generated data, it performs worse over time, producing more errors in the answers and content it generates.
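To make the idea concrete, here is a toy sketch of this kind of feedback loop. It is not the researchers' experiment: the "model" is just a bell curve (a Gaussian) fitted to a handful of numbers, and the function name `simulate_collapse`, the sample size, and the number of generations are all assumptions chosen for illustration. Each generation is fitted only to samples produced by the previous one, so no fresh "human" data ever enters the loop.

```python
import random
import statistics


def simulate_collapse(generations=200, sample_size=10, seed=0):
    """Repeatedly refit a Gaussian to samples drawn from the previous fit.

    Hypothetical toy model, not the setup used in the paper.
    Returns the fitted standard deviation after each generation,
    with the starting distribution N(0, 1) at index 0.
    """
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # generation 0: stand-in for human-made data
    history = [sigma]
    for _ in range(generations):
        # "Publish" content: draw samples from the current model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # "Train" the next model only on that generated content.
        mu = statistics.fmean(samples)
        sigma = statistics.pstdev(samples)
        history.append(sigma)
    return history


if __name__ == "__main__":
    history = simulate_collapse()
    for gen in (0, 10, 50, 100, 200):
        print(f"generation {gen:3d}: std = {history[gen]:.6f}")
```

With small samples, the fitted spread tends to shrink from one generation to the next, so later "models" cover less and less of the original distribution. A larger sample size slows the drift in this toy setting but does not remove it.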

As another of the paper's authors, Ross Anderson, professor of security engineering at the University of Cambridge and the University of Edinburgh, wrote in a blog post discussing the work:

"Just as we have filled the oceans with plastic trash and the atmosphere with carbon dioxide, we will fill the Internet with blah blah. This will make it harder to train newer models from human-created data, giving the advantage to those who have already done so or who control access to human data."

