
When AI Cannibalizes Its Data

2025/2/18

Short Wave

People
Ilya Shumailov
Regina Barber
Topics
Regina Barber: Generative AI is now everywhere, used in all kinds of settings, such as Google searches, suggested tools on TikTok, and customer service chats. Large language models like DeepSeek R1 and ChatGPT can generate images, video, and other kinds of content, but they also face the risks of data bias and model collapse. We need to understand these risks and explore possible solutions. Ilya Shumailov: To train large language models, we need huge numbers of human-written examples; we essentially have the models read the entire internet. But as generative AI spreads, more and more of the content on the internet is itself machine-generated, which means models may end up consuming their own synthetic output, producing data bias and model collapse. Models go wrong for three main reasons: data-associated errors, structural bias in the learning regime, and the design of the model itself. Hardware limitations also introduce empirical errors. When a model keeps learning from its own generated data, improbable events gradually disappear and the model becomes more and more confident, eventually leading to collapse. As a researcher, I am actively exploring data filtering methods to make sure the data a model ingests represents the underlying data distribution and to prevent model collapse. I am confident we can solve this problem and keep pushing AI technology forward.

Transcript

This message comes from Fred Hutch Cancer Center, whose discovery of bone marrow transplants has saved over a million lives worldwide. Learn how this and other breakthroughs impact the world at fredhutch.org slash lookbeyond. You're listening to Short Wave from NPR.

It seems like these days generative AI is everywhere. It's in my Google searches. It's suggested as a tool on TikTok. It's running customer service chats. And there are a lot of forms that generative AI can take, like it can create images or video. But the ones that have been in the news recently, DeepSeek R1, OpenAI's ChatGPT, Google Gemini, Apple Intelligence, all of those are large language models.

A large language model is kind of like the predictive text feature in your phone, but on steroids. Large language models are statistical beasts that learn from...

That's Ilya Shumailov. He's a computer scientist, and he says in order to teach these models, scientists have to train them on a lot of human-written examples. Like, they basically make the models read the entire internet.

Which works for a while. But nowadays, thanks in part to these same large language models, a lot of the content on our internet is written by generative AI. If you were today to sample data from internet randomly, I'm sure you'll find that a bigger proportion of it is generated by machines. But this is not to say that the data itself is bad. The main question is how much of this data is potentially downstream dangerous.

In the spring of 2023, Ilya was a research fellow at the University of Oxford. And he and his brother were talking over lunch. They were like, OK, if the internet is full of machine-generated content and that machine-generated content goes into future machines, what's going to happen? Quite a lot of these models, especially back at the time, they're relatively low quality. So there are errors there.

And there are biases, there are systematic biases inside of those models. And thus, you can kind of imagine the case where rather than learning useful context and useful concepts, you can actually learn things that don't exist. They are purely hallucinations. Ilya and his team did this research study indicating that eventually, any large language model that learns from its own synthetic data would start to degrade over time, producing results that got worse and worse and worse and worse.

So today on the show, AI model collapse. What happens when a large language model reads too much of its own content? And could it limit the future of generative AI? I'm Regina Barber, and you're listening to Short Wave, the science podcast from NPR.

This message comes from Capital One. Say hello to stress-free subscription management. Easily track, block, or cancel recurring charges right from the Capital One mobile app. Simple as that. Learn more at CapitalOne.com slash subscriptions. Terms and conditions apply.

This message comes from Charles Schwab. When it comes to managing your wealth, Schwab gives you more choices, like full-service wealth management and advice when you need it. You can also invest on your own and trade on Thinkorswim. Visit Schwab.com to learn more.

Okay, Ilya, before we get into the big problem of like model collapse, I think we need to understand why these errors are actually happening. So can you explain to me what kinds of errors do you get from a large language model and like how do they happen? Why do they happen?

So there are three sources, three primary sources of error that we still have. So the very first one is basically just data associated errors. And usually those are questions along the lines of, do we have enough data to approximate a given process? So if some things happen very infrequently in your underlying distribution, your model may get a wrong perception that

some things are impossible. Wait, what do you mean, they're impossible? Like, an example I've seen on Twitter was if you Google for a baby peacock, you'll discover pictures of birds that look

relatively realistic, but they are not peacocks at all. They are completely generated and you will not find the real picture. But if you try learning anything from it, of course, you're going to be absorbing this bias. Right. You're like telling me now that there's like a lot of fake baby peacock images, but machines don't know that, right? They're just going to think, great, this is a baby peacock. And also there's not that many like real baby peacock images to compare it to.

Exactly. And those are the kinds of errors that you don't normally see that often because they are so improbable, right? And if people are going to start reporting things to you and saying, oh, your model is wrong here, they're likely to notice things that on average are wrong. But if they're wrong in some small part of the internet that nobody really cares about, then it's very unlikely that you will even notice that you're making a mistake. And usually this is the problem because as the number of dimensions grows,

you will discover that the volume in the tails is going to grow disproportionately. Not just babies, but baby birds. Not just baby birds, but baby peacocks. Yeah, exactly. So as a result, you'll discover that you need to capture quite a bit. Okay, so that's one kind of problem, a data problem. What are the other two? On top of it, we have errors that come from learning regimes and from the models themselves. So on learning regimes, we are all training our models, all of them,

are structurally biased. So basically to say that your model is going to be good, but it's unlikely to be optimal. So it's likely to have some errors somewhere. And this was the error source number two. And error source number three is that the actual model design, what shape and form your model should be taking is very much alchemy. Nobody really knows

why stuff works, but we kind of just know empirically that stuff works. It's like a black box. We don't know how it's making these decisions. We don't know, like you said, where in that order it's making those decisions, you know. Yeah.

Yeah, which parts of the model are responsible for what? We don't know the fundamental underlying bias of a given model architecture. What we observe is that there is always some sort of an error that is introduced by those architectures. Right, right. Okay, so the three places errors could come from is like one, the model itself, two, the way it's trained, right? And three, the data or the lack of data that it's trained on.

Exactly. And then we also have empirical errors from, for example, hardware. So we also have practical limitations of hardware with which we work. And those errors also exist.
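To make that first error source concrete, here is a minimal sketch, not from the episode itself, of how rare events go missing from finite data. It assumes a toy categorical distribution with a deliberately rare "baby peacock" class standing in for the real data; the class names and probabilities are made up purely for illustration.

```python
# Toy illustration of data-associated errors: outcomes that are rare in the
# true distribution often never show up in a finite training sample, so a
# model estimated from that sample treats them as (nearly) impossible.
import numpy as np

rng = np.random.default_rng(0)

# Hypothetical "world" of 5 outcomes; the last one is genuinely rare.
outcomes = ["cat", "dog", "pigeon", "duckling", "baby peacock"]
true_p = np.array([0.40, 0.30, 0.20, 0.098, 0.002])

# Draw a finite training set and estimate the distribution from raw counts.
sample = rng.choice(len(outcomes), size=500, p=true_p)
counts = np.bincount(sample, minlength=len(outcomes))
estimated_p = counts / counts.sum()

for name, p_true, p_est in zip(outcomes, true_p, estimated_p):
    print(f"{name:12s}  true={p_true:.3f}  estimated={p_est:.3f}")

# With 500 samples, the rare class typically gets a count of 0 or 1, so the
# estimated model thinks it barely exists, and anything trained on this
# model's output inherits that blind spot. In higher dimensions the tails
# hold even more of the volume, which is why the problem gets worse.
```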

Let's talk about how those errors build. What happens when they start to build upon each other? Can you describe that outcome to me? Yes, certainly. So what we observe in simple theoretical models is that two main phenomena happen. The very first phenomena that happens is

it's really hard to approximate improbable events, in part because you don't encounter them very often. So you may discover that you're collecting more and more data, and a lot of this data looks very similar to what you already possess. So you're not discovering too much information. But importantly, you're not discovering this infrequent data points. So those tail events, they kind of disappear. And then the other thing that happens is that

The first time you made this error and underestimated your improbable events, when you fit the model on top of this, it's unlikely to recover from this taking place. Okay, so over time you start to lose the more unique occurrences and all the data starts to look more similar to the average.

Originally improbable events are even more improbable for the subsequent model, and it kind of like snowballs out of control until the whole thing just collapses fully to near zero variance. So instead of this bell curve, you just have like a point in the middle. You just have a whole bunch of stuff in the middle. Exactly.
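Here is a minimal sketch of that collapse toward near zero variance, under the simplifying assumption that a one-dimensional Gaussian stands in for a model: each generation is fit only to a finite sample drawn from the previous generation's fit. The sample size and generation count are arbitrary choices for illustration, not values from the research.

```python
# Toy sketch of model collapse: generation t+1 is fit to a finite sample
# drawn from generation t's fitted distribution, with no fresh real data.
import numpy as np

rng = np.random.default_rng(0)

n_samples = 50            # a small per-generation sample exaggerates the effect
mu, sigma = 0.0, 1.0      # generation 0: the real, human-data distribution

for gen in range(201):
    if gen % 20 == 0:
        print(f"gen {gen:3d}: mean={mu:+.3f}  std={sigma:.3f}")
    data = rng.normal(mu, sigma, n_samples)  # "training data" sampled from the current model
    mu, sigma = data.mean(), data.std()      # the next model is just a fit to that sample

# Any single run is noisy, but the log of the fitted std has a negative drift,
# so over many generations the distribution narrows: tail events are
# under-sampled, each refit loses a little spread, and the errors compound
# instead of averaging out. That is the bell curve squeezing toward a point.
```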

Exactly. And the thing is, you can theoretically describe this. It's actually very simple. And you can run these experiments however many times you want. And you'll discover that even if you have a lot of data, if you keep on repeating this process (and you can also bound the rate at which this collapses), you end up always in a state where your improbable events kind of disappear. In practice, when we grab large language models...

We observed that they become more confident in the predictions that they are making. So basically, the improbable events here are going to be things that the model is not very confident about. And normally, it would not make predictions about it. So when you're trying to generate more data out of a language model in order for another language model to learn from it, over time, basically, it becomes more and more confident. And then...

It basically, during the generation setup, it gets stuck very often in these repetitive loops. I know this isn't exactly the same, but it makes me think of the telephone game. You know, when you, like, tell somebody a phrase or, like, a couple sentences, and then, like, the next person tells a person the same two sentences, and then, like, the next person says the same two sentences. And it usually gets, like, more and more garbled as it goes down the line.

I think this comparison kind of works. Yes. So this is the first thing. It's the improbable events. And then the second thing that happens is your models are going to produce errors. So misunderstandings of the underlying phenomena. Right. And as a result...

what you will see is that those errors start propagating as well. And they are relatively correlated. If all of your models are using the same architecture, then it's likely to be correlatedly wrong in the same kinds of way. So whenever it sees errors, it may amplify the same errors that it's observing. Yeah, I mean, I'm looking at some of the image output of these models that are trained on their own data right now. And we'll link these images in the show notes, but

I'm looking at, like, somebody's handwriting of, like, zero to nine, and you know, it's not perfect. It's handwriting. But as it gets regenerated by the models over and over, like 15 times, they're just dots, right? Like, they're not distinguishable. You can't even tell they're numbers, like which one is which.

Yeah, so approximations of approximations of approximations end up being very imprecise. As long as you can bound the errors of your approximations, it's okay, I guess. But yeah, in practice, because machine learning is very empiric, quite often we can't. Oh, I love these images. This is so good, Ilya. This is so good. Yeah. So an important thing to say here is that...

The settings we talk about here are relatively hypothetical in a sense that we are not in the world in which, you know, today we can build a model and tomorrow they disappear. That is not going to happen.

We already have very good models and the way forward is having even better models and there's no doubt about it. Okay. So like you said, you know, ChatGPT isn't going to disappear tomorrow. What are researchers doing to avoid the problem of model collapse? Like as a computer scientist, what do you think the solution is?

I mean, there are many different solutions. You'll find a lot of different papers that are exploring what are the most effective mitigations. And it's mostly data filtering of different kinds.

and basically making sure that the data that ends up being ingested by the models is representative of the underlying data distribution. And whenever we hit this limit and we see that our model diverges into some sort of a training direction, the trajectory that is making the model worse, I promise you people will stop training of the models, retract back a couple of steps, maybe add additional data of certain kind and then keep on training.

Right. Because we can always go back to previous models, nothing stopping us. And then we can always spend more effort getting high quality data. Or paying more people to create high quality data. Yeah. So model collapse is not going to magically kill the models tomorrow. We just need to change the way we build stuff.
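As a rough sketch of that kind of mitigation, keeping the training mix representative rather than purely synthetic, the toy Gaussian loop from earlier can be modified so that every generation retains a fixed share of the original human data. The 20 percent fraction and the setup are assumptions for illustration only, not a description of how any production model is trained.

```python
# Same toy Gaussian loop as before, but each generation's training set keeps
# a fixed slice of the original ("human") data alongside the synthetic samples.
import numpy as np

rng = np.random.default_rng(0)

n_samples = 50
real_fraction = 0.2                        # assumed share of real data kept each round
real_pool = rng.normal(0.0, 1.0, 10_000)   # fixed pool standing in for human-written data

mu, sigma = 0.0, 1.0
for gen in range(201):
    if gen % 20 == 0:
        print(f"gen {gen:3d}: mean={mu:+.3f}  std={sigma:.3f}")
    n_real = int(n_samples * real_fraction)
    synthetic = rng.normal(mu, sigma, n_samples - n_real)          # current model's output
    anchored = rng.choice(real_pool, size=n_real, replace=False)   # retained human data
    data = np.concatenate([synthetic, anchored])
    mu, sigma = data.mean(), data.std()

# Because part of every training set still comes from the true distribution,
# the fitted std stays bounded away from zero instead of drifting toward it.
```

In practice, as Shumailov says, the levers are data filtering, rolling back to earlier checkpoints, and investing in more high-quality human data; the toy version only shows why keeping representative real data in the mix matters.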

So this is not all doom and gloom. I am quite confident we'll solve this problem. I like that perspective. Ilya, thank you so much for talking with us today. Thank you very much for having me. It was a pleasure.

If you want to see some of the images I was looking at, you know, see the consequences of AI model collapse for yourself, we'll link to those in our show notes. Also, make sure you never miss a new episode by following us on whichever podcasting platform you're listening from. This episode was produced by Hannah Chin and edited by showrunner Rebecca Ramirez. Hannah and Tyler Jones checked the facts. Jimmy Keeley was the audio engineer. Beth Donovan is our senior director and Colin Campbell is our senior vice president of podcasting strategy.

I'm Regina Barber. Thank you for listening to Short Wave, the science podcast from NPR.

This message comes from Capella University. With Capella's FlexPath learning format, you can set your own deadlines and learn on your schedule. A different future is closer than you think with Capella University. Learn more at capella.edu.

Support for the following message comes from LinkedIn Ads. As a B2B marketer, you know how noisy the digital ad space can be. If your message isn't targeted to the right audience, it just disappears into the noise. By using LinkedIn Ads, you can reach professionals who are more likely to find your ad relevant. Target them by job title, industry, company, and more. Get a $100 credit on your next campaign at linkedin.com slash results. Terms and conditions apply.