
When AI Cannibalizes Its Data

2025/2/18

Short Wave

People
Ilya Shumailov
Regina Barber
Topics
Regina Barber: Generative AI is everywhere now, used in all kinds of settings: Google Search, TikTok tool recommendations, customer-service chat, and more. Large language models such as DeepSeek R1 and ChatGPT can generate content in many forms, including images and video, but they also carry the risks of data bias and model collapse. We need to understand these risks in depth and explore ways to address them.

Ilya Shumailov: To train a large language model, we need enormous numbers of human-written examples; in effect, we have the model read the entire Internet. But as generative AI spreads, more and more of the Internet's content is itself AI-generated, which means models may end up consuming their own synthetic output, producing data bias and model collapse. Models go wrong for three main reasons: data-related errors, structural biases in the learning mechanism, and flaws in the model design itself. Hardware limitations introduce empirical error as well. When a model keeps learning from its own generated data, improbable events gradually disappear and the model grows ever more confident, ultimately leading to collapse. As a researcher, I am actively exploring data-filtering methods to ensure that the data a model ingests represents the underlying distribution, in order to prevent model collapse. I believe we can solve this problem and keep advancing AI.
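The collapse dynamic Shumailov describes can be made concrete with a toy simulation. The sketch below is a minimal, hypothetical setup, not the episode's method: each generation fits a one-dimensional Gaussian to samples drawn from the previous generation's fit, so the "improbable events" in the tails gradually stop appearing.

```python
# Toy model-collapse simulation (a hypothetical sketch, not from the episode).
# Each generation "trains" (fits a Gaussian) on data sampled from the
# previous generation's model instead of on real data.
import numpy as np

rng = np.random.default_rng(0)

n_samples = 100       # assumed per-generation training-set size
n_generations = 1000  # assumed number of retraining rounds

# Generation 0: "human-written" data, drawn from the true N(0, 1).
data = rng.normal(0.0, 1.0, n_samples)

for gen in range(1, n_generations + 1):
    mu, sigma = data.mean(), data.std()      # fit this generation's model
    data = rng.normal(mu, sigma, n_samples)  # next generation trains on synthetic data
    if gen % 200 == 0:
        print(f"generation {gen:4d}: fitted std = {sigma:.3f}")
```

Because each refit inherits the previous generation's sampling error, the fitted standard deviation tends to drift toward zero over many generations: tail events stop being sampled, so they stop being learned, and the model grows confidently narrow, which is the overconfidence described above.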


Shownotes

Asked ChatGPT anything lately? Talked with a customer service chatbot? Read the results of Google's "AI Overviews" summary feature? If you've used the Internet lately, chances are you've consumed content created by a large language model. These models, like DeepSeek-R1 or OpenAI's ChatGPT, are kind of like the predictive text feature in your phone on steroids. In order for them to "learn" how to write, the models are trained on millions of examples of human-written text. Thanks in part to these same large language models, a lot of content on the Internet today is written by generative AI. That means that AI models trained nowadays may be consuming their own synthetic content ... and suffering the consequences.

View the AI-generated images mentioned in this episode.

Have another topic in artificial intelligence you want us to cover? Let us know by emailing [email protected]!

Listen to every episode of Short Wave sponsor-free and support our work at NPR by signing up for Short Wave+ at plus.npr.org/shortwave.

Learn more about sponsor message choices: podcastchoices.com/adchoices

NPR Privacy Policy
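To unpack the "predictive text on steroids" analogy, here is a deliberately tiny sketch; the corpus, function name, and sampling scheme are all hypothetical, chosen only for illustration. It learns next-word frequencies from human-written text and then generates new text by sampling from them.

```python
# A minimal bigram language model (hypothetical illustration, not the
# episode's method): count which word follows which, then generate text
# by sampling from those counts, one word at a time, like autocomplete.
import random
from collections import Counter, defaultdict

corpus = "the cat sat on the mat and the dog sat on the rug".split()

# Count how often each word follows each other word.
next_counts = defaultdict(Counter)
for prev, nxt in zip(corpus, corpus[1:]):
    next_counts[prev][nxt] += 1

def generate(start, length=8):
    """Sample a continuation one word at a time."""
    word, out = start, [start]
    for _ in range(length):
        counts = next_counts.get(word)
        if not counts:
            break
        words, weights = zip(*counts.items())
        word = random.choices(words, weights=weights)[0]
        out.append(word)
    return " ".join(out)

print(generate("the"))  # e.g. "the cat sat on the rug and the dog"
```

A real LLM replaces the count table with a neural network over tokens and trains on vastly more text, but the generate-by-sampling loop is the same, which is why AI-written output can flow back into the next model's training data.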