



Artificial intelligence systems face an unexpected problem when they are fed their own output: they begin to produce meaningless results, disconnected from reality. Described as a form of digital cannibalism, this situation marks a new breaking point in AI training. The study, published in the journal Nature by University of Oxford researchers, mathematically demonstrated how models "collapse" when trained on their own output. As ever more of the internet's content is generated by AI, this problem is becoming not just a technical detail but a systemic risk that threatens the entire AI ecosystem. Model collapse could have consequences serious enough to bring the development of next-generation AI systems to a standstill.
Model collapse refers to the progressive, lasting decline in the performance of generative AI models when they are trained on content produced by earlier AI models. The phenomenon echoes a widely accepted principle in AI development: a model is only as good as the data it is trained on.
The research team, led by Dr. Ilia Shumailov of the University of Oxford, described model collapse in two stages in their study, published in Nature in 2024. In early model collapse, the model begins to lose information at the tails of the data distribution. At this stage minority data and rare examples disappear first, yet overall performance metrics may show no marked decline. In late model collapse the situation is far more serious: the model loses a large part of its performance, confuses concepts, and the diversity of its outputs is almost completely eliminated.
The researchers found that models trained on AI-generated content, also known as synthetic data, produced outputs with less variation than the original data distributions. Any error in one model's output is carried into the training of the next model, which then adds its own errors on top. As this cycle progresses, errors accumulate and lead to irreversible damage.
The mechanism behind model collapse lies in the nature of modern AI training pipelines. Large language models are initially trained on human-generated data collected from the internet. But with the proliferation of AI tools, an increasingly large share of web content is now produced by machines.
This transformation of the internet poses a critical problem. As future models scour the web for training data, they are increasingly likely to encounter AI output rather than original human-made content. As a result, models are unwittingly trained on content produced by themselves or by their predecessors.
The process works like this: the first-generation model is trained on real data and produces output with a certain margin of error. These outputs spread across the internet and mix into training datasets. When the second-generation model is trained, its dataset contains both original data and the erroneous outputs of the first model. The second model learns the first model's errors and adds its own. By the third generation, the data pollution has compounded dramatically.
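The generational cycle described above can be sketched in a few lines of Python. This is a hypothetical toy illustration, not the Nature paper's actual setup: a "model" here is just a Gaussian fit, each generation is fitted only to samples drawn from the previous generation's fit, and the original data is discarded each round, so estimation error compounds.

```python
import random
import statistics

def train_generations(n_generations=10, sample_size=50, seed=7):
    """Each generation is 'trained' only on samples drawn from the
    previous generation's fitted Gaussian -- original data is discarded."""
    rng = random.Random(seed)
    mu, sigma = 0.0, 1.0  # the true "real world" distribution
    history = []
    for _ in range(n_generations):
        # Synthetic data drawn from the current generation's model.
        samples = [rng.gauss(mu, sigma) for _ in range(sample_size)]
        # The next generation refits on that synthetic data alone.
        mu = statistics.fmean(samples)
        sigma = statistics.stdev(samples)
        history.append((mu, sigma))
    return history

history = train_generations()
# Each generation's (mean, std) drifts in a random walk away from (0, 1):
# one generation's estimation error becomes the next one's ground truth.
print(history[0], history[-1])
```

With a small sample size the drift is visible within a handful of generations; the fitted parameters wander because there is no fresh real data to anchor them.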
As Zakhar Shumaylov, an AI researcher at the University of Cambridge, points out, extreme care must be taken about what goes into training data; otherwise, things go wrong in a mathematically provable way. Models drift ever further from real-world data, and eventually their output turns into content with no relation to reality.
Model collapse affects different AI architectures in different ways, but the result is the same in all of them: loss of performance and erosion of reliability.
Collapse in Large Language Models (LLMs): Model collapse in language models manifests as increasingly irrelevant, meaningless, and repetitive text output. University of Oxford researchers obtained striking results in their experiment with OPT-125M, an open-source model developed by Meta. The researchers trained successive generations of the model on data generated by the previous generation, starting from an English text about medieval architecture. By the ninth generation, the model was producing nonsensical content about rabbits with differently colored tails. Lexical, syntactic, and semantic diversity narrowed with each new generation, and performance declined markedly on tasks requiring creativity.
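The shrinking diversity can be illustrated with a toy text model (a hypothetical sketch, vastly simpler than the OPT-125M experiment): fit a bigram model on a short passage, generate text from it, retrain on the generated text, and repeat. Because each generation can only emit words that appeared in its own training data, the vocabulary can never grow, and rarely sampled words progressively vanish.

```python
import random
from collections import defaultdict

def fit_bigram(words):
    """Map each word to the list of words observed to follow it."""
    model = defaultdict(list)
    for a, b in zip(words, words[1:]):
        model[a].append(b)
    return model

def generate(model, rng, length):
    """Sample a word sequence by walking the bigram table."""
    word = rng.choice(list(model.keys()))
    out = [word]
    for _ in range(length - 1):
        successors = model.get(word)
        # Restart from a random known word if we hit a dead end.
        word = rng.choice(successors) if successors else rng.choice(list(model.keys()))
        out.append(word)
    return out

corpus = ("the abbey nave was built in the romanesque style while the "
          "later chapter house shows early gothic vaulting and rare "
          "stained glass panels survive in the clerestory windows").split()

rng = random.Random(0)
vocab_sizes = []
words = corpus
for generation in range(8):
    model = fit_bigram(words)       # train on the previous generation's text
    words = generate(model, rng, 200)
    vocab_sizes.append(len(set(words)))
print(vocab_sizes)  # distinct-word count per generation; it can only shrink
```

The monotone shrinkage is guaranteed by construction here; in real LLMs the mechanism is statistical rather than absolute, but the direction is the same.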
Distortion in Visual Generation Models: Model collapse is particularly noticeable in image-generating AI. Visual quality declines, diversity shrinks, and fine detail is lost. In an experiment with Variational Autoencoder (VAE) models, a dataset of varied handwritten digits was used; after repeated cycles of training, later generations produced digits that looked increasingly similar to one another. In another study, a Generative Adversarial Network (GAN) model trained on a variety of facial images began producing more and more homogeneous faces over time.
Degeneration in Gaussian Mixture Models: Gaussian mixture models (GMMs), which divide data into clusters, are also affected by model collapse. The researchers found that the performance of a GMM tasked with splitting data into two clusters dropped significantly after a few dozen cycles. The model's perception of the underlying data distribution shifted over time, and by the 2,000th iteration its output showed almost no variance.
Model collapse has tangible and costly consequences for organizations that integrate AI systems into their business processes. The scope of these impacts spans a wide range from customer service to critical diagnostic systems.
Errors in decision-making processes carry great risks for businesses. AI systems affected by model collapse may make incorrect recommendations or present faulty analyses. For example, AI-assisted medical diagnostic tools may fail to detect rare diseases, because during model collapse low-probability cases are forgotten by earlier generations and effectively deleted from training datasets. Even if a patient has a rare disease, the system may overlook it.
The user experience also suffers. Systems undergoing model collapse can ignore real human interactions and preferences at the tails of the distribution. Consider a recommendation system for online shoppers: if a consumer prefers pistachio-green shoes but the system constantly recommends the best-selling black and white shoes, the consumer may move to another platform. By focusing on popular preferences, the system becomes unable to satisfy individual, unusual tastes.
Perhaps the most dangerous long-term effect of model collapse is a narrowing of the information landscape. If widely used AI systems go through model collapse and consistently produce narrower outputs, "long tail" ideas could be erased from society's collective awareness. Scientists today turn to AI-powered research tools to inform their work, but tools affected by model collapse may surface only highly cited studies, depriving users of the kind of obscure material that can lead to important discoveries.
Model collapse is just one of several model-degradation phenomena observed in machine learning. They share similarities, but there are important distinctions between them.
Catastrophic Forgetting: Both model collapse and catastrophic forgetting involve AI systems losing information. But catastrophic forgetting occurs when a single model "forgets" previously learned information as it learns something new: applied to a task that requires the old knowledge, its performance drops. Model collapse, by contrast, involves performance degradation across successive model generations; it is distinct from knowledge loss within a single model.
Mode Collapse: Despite the similar name, mode collapse is a phenomenon specific to GAN models. GANs consist of two components: a generator and a discriminator. The generator produces synthetic data that is statistically similar to real data, while the discriminator acts as a continuous check on the process, flagging data that looks unreal. Mode collapse occurs when the generator's output lacks variety and the discriminator fails to detect the defect.
Model Drift: Model drift is the decline in a machine learning model's performance caused by changes in the data or in the relationships between input and output variables. Patterns learned from historical data can become stale: if a model's old training data no longer matches the incoming data, the model cannot interpret that data correctly. Model collapse is different because it involves cyclically training new models on AI-generated data.
Performative Prediction: The researchers compared model collapse in generative AI models with performative prediction in supervised learning models. Both involve earlier model outputs contaminating later training sets. Performative prediction occurs when a supervised model's output influences real-world outcomes in a way consistent with the model's prediction, which in turn influences future model outputs and creates a "self-fulfilling prophecy."
AI developers and organizations can implement a variety of strategies to prevent model collapse. These approaches span a wide range, from data management to AI governance.
Protecting Non-AI Data Sources: High-quality original data sources can supply variation that AI-generated data may lack. Ensuring that AI models continue to be trained on such human-sourced data helps preserve their ability to account for low-probability events. When a consumer prefers an unusual product, or a scientist draws on a rarely cited study, the appropriate output may not be widespread or popular, but it is still the most accurate.
Determining Data Provenance: It can be difficult to distinguish model-generated data from human-generated data in information ecosystems, but coordination between LLM developers and AI researchers can help provide access to information about data origins. The Data Provenance Initiative, a collective of AI researchers from MIT and other universities, has audited more than 4,000 datasets. Such collaborations are critical to maintaining access to clean, reliable training data.
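Provenance-aware pipelines can be as simple as attaching an origin tag to every record and enforcing a cap on synthetic content in the final mix. The sketch below is a hypothetical illustration; the `Record` type, the `origin` field values, and the ratio logic are assumptions, not the Data Provenance Initiative's actual tooling.

```python
from dataclasses import dataclass

@dataclass(frozen=True)
class Record:
    text: str
    origin: str  # assumed labels: "human" or "synthetic"

def build_training_mix(records, max_synthetic_ratio=0.3):
    """Keep all human-origin records; admit synthetic records only up to
    a fixed fraction of the resulting mix."""
    human = [r for r in records if r.origin == "human"]
    synthetic = [r for r in records if r.origin == "synthetic"]
    # Solve k / (len(human) + k) <= ratio for the synthetic budget k.
    budget = int(max_synthetic_ratio * len(human) / (1 - max_synthetic_ratio))
    return human + synthetic[:budget]

corpus = [Record("a", "human"), Record("b", "synthetic"),
          Record("c", "human"), Record("d", "synthetic"),
          Record("e", "human")]
mix = build_training_mix(corpus, max_synthetic_ratio=0.3)
print(len(mix))  # 3 human records plus a budget of 1 synthetic record -> 4
```

The hard part in practice is assigning the `origin` label reliably in the first place, which is exactly what provenance audits aim to make possible.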
Utilizing Data Accumulation Methods: According to one study, AI developers can avoid performance degradation by training on real data together with accumulated synthetic data from multiple generations. This accumulation is the opposite of replacing the original data entirely with AI-generated data. The researchers showed that accumulating synthetic data across generations alongside the original data was effective in preventing model collapse.
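The accumulate-versus-replace distinction can be sketched in a toy simulation (a hypothetical setup, not the cited study's experiment). Under "replace", each generation trains only on the latest synthetic batch; under "accumulate", the real data and every earlier batch stay in the pool, anchoring later fits.

```python
import random
import statistics

def simulate(strategy, n_generations=5, n_real=200, n_synth=200, seed=1):
    """Run generational refitting of a Gaussian under one of two data
    strategies: 'replace' discards the pool each round, 'accumulate' keeps it."""
    rng = random.Random(seed)
    real = [rng.gauss(0.0, 1.0) for _ in range(n_real)]
    pool = list(real)
    spreads = []
    for _ in range(n_generations):
        mu = statistics.fmean(pool)
        sigma = statistics.stdev(pool)
        batch = [rng.gauss(mu, sigma) for _ in range(n_synth)]  # synthetic data
        pool = batch if strategy == "replace" else pool + batch
        spreads.append(sigma)
    return pool, spreads

pool_acc, _ = simulate("accumulate")
pool_rep, _ = simulate("replace")
print(len(pool_acc), len(pool_rep))  # 200 real + 5 batches = 1200, vs 200
```

Under accumulation the 200 real samples remain in the training pool at every generation, so the fitted distribution stays tethered to them; under replacement the real data is gone after the first round.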
Better Use of Synthetic Data: As AI developers explore data accumulation, they can also benefit from improvements in the quality of synthetic data produced specifically for machine learning training. Advances in data generation algorithms can increase the reliability and usefulness of synthetic data. In healthcare, for example, well-constructed synthetic data can expose models in training to a wider range of scenarios.
Applying AI Governance Tools: AI governance tools can help AI developers and companies reduce the risk of declining AI performance. These tools provide oversight and control over AI systems, and can include automated detection of bias, drift, performance issues, and anomalies, potentially catching model collapse before it affects an organization's bottom line.
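One simple, illustrative monitoring signal such a tool might track (this is a generic sketch, not any specific product's feature) is the distinct n-gram ratio of a model's recent outputs: a falling ratio suggests increasingly repetitive, collapse-like generations. The threshold below is purely illustrative.

```python
def distinct_ratio(texts, n=2):
    """Fraction of all n-grams across the outputs that are unique.
    Lower values indicate more repetitive output."""
    total, seen = 0, set()
    for text in texts:
        words = text.split()
        for i in range(len(words) - n + 1):
            seen.add(tuple(words[i:i + n]))
            total += 1
    return len(seen) / total if total else 0.0

def check_collapse(texts, threshold=0.5):
    """Return the diversity score and whether it falls below the alert line."""
    score = distinct_ratio(texts)
    return score, score < threshold

varied = ["the quick brown fox", "a slow green turtle swims"]
repetitive = ["the same old line", "the same old line", "the same old line"]
print(check_collapse(varied))      # high ratio, no alert
print(check_collapse(repetitive))  # low ratio, alert fires
```

In a production pipeline this kind of metric would be computed over rolling windows of model output and fed into the same alerting machinery used for drift and bias monitoring.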
Today, leading technology companies and AI platforms are developing proactive measures against model collapse. AI governance platforms offer tools to monitor, test, and validate models throughout their lifecycle. These systems are able to detect anomalies at an early stage by constantly monitoring data quality metrics.
Data provenance tracking systems record which sources training data comes from, separating human-generated from AI-generated content. This allows models to be trained on a balanced data mix, protected from synthetic data pollution. In addition, regular model performance audits and diversity checks assess whether outputs still reflect real-world distributions.
Model collapse poses a critical threat to the sustainability of the AI ecosystem. The study, published in the journal Nature by researchers from the universities of Oxford, Cambridge and Toronto, issued an important warning to the industry by proving this problem mathematically. At a time when content on the Internet is increasingly generated by artificial intelligence, urgent measures need to be taken to feed next-generation models with healthy data.
Strategies such as data provenance tracking, the use of high-quality synthetic data, AI governance tools, and the protection of original human-sourced data form effective lines of defense against model collapse. Organizations need to take this threat seriously and develop proactive data management policies to protect the value of their AI investments and ensure the long-term reliability of their systems.