
When an AI internet search competes against a human internet search

2025/5/1

Marketplace All-in-One

People
Megan McCarty Carino
Selena Deckelmann
Topics
Megan McCarty Carino: I've noticed that, as AI technology develops rapidly, web crawlers scraping data from sites like Wikipedia are putting enormous strain on them. For example, when Jimmy Carter died, traffic to Wikipedia surged and its servers briefly became overloaded. That wasn't only because more people were visiting; large numbers of AI bots were also scraping data to train AI models, pushing the servers beyond capacity. In addition, AI bots crawl in a different pattern than human users: they pull down everything rather than focusing on specific content the way people do, which adds further load on the servers. This is not an isolated case; many websites face similar problems. Some sites have started taking measures to block crawlers, but these often have little effect and have even spawned an industry of anti-crawling technologies, which may ultimately make it harder for human users to find information efficiently.

Selena Deckelmann: As the Wikimedia Foundation's chief product and technology officer, I feel firsthand the enormous pressure that AI bot scraping puts on our infrastructure. Wikipedia has a sophisticated caching system that can absorb sudden spikes in traffic, but AI bots crawl differently than human users: they scrape everything, which drives system load far beyond what we expect. Wikipedia's data is especially valuable for training large language models because it is generated by human beings and is very good at answering the kinds of questions humans ask. We understand that AI companies need this data to train their models, but we are asking them to work with us and to support Wikipedia's infrastructure so that Wikipedia can continue to exist. We want them to follow our Creative Commons open licenses, properly attribute Wikipedia content, and consider paying to support Wikipedia's future, for example by using our Wikimedia Enterprise product. We believe that both commercial and non-commercial uses can support Wikipedia's mission of distributing free knowledge worldwide in perpetuity, but large commercial companies need to recognize that the value of their products depends on Wikipedia's human-generated knowledge and to contribute to its infrastructure. We hope to find a balance that meets AI companies' data needs while ensuring that human users can still access Wikipedia's information easily.


Transcript


Wikipedia is crawling with AI bots. From American Public Media, this is Marketplace Tech. I'm Megan McCarty Carino. When Jimmy Carter died late last year, the foundation that runs Wikipedia noticed something unusual. The flood of interest in the late president created a content bottleneck, slowing load times for about an hour.

Wikipedia is built to handle spikes in traffic like this, according to the foundation. But it's also dealing with a surge of bots, scraping the site to train AI models and clogging up its servers. That's according to Selena Deckelmann, chief product and technology officer at the Wikimedia Foundation.

This is traffic that doesn't necessarily follow what other people are looking at. So we have a very sophisticated caching system, so that when something happens, like someone passes away or there's some important event happening in the world that many people want to learn about, we cache that result. But what the crawlers do is they just look at everything. They're not as interested in what other human beings in that moment are interested in. And so what that does is it causes all of the systems to load a lot more data than they normally would.

Web crawlers have been a part of the internet since the internet came online. That's how we find web pages when we search Google. But how has AI really changed the volume and intensity of traffic from bots? What we've seen is just a massive increase in interest in crawling the entire internet and creating, maybe, just a treasure trove of everything that's on the internet.

And what that's for is teaching large language models about, you know, what is on the internet and giving them the ability to answer questions that people might have, such as, you know, someone who's using a chatbot. You've interacted with ChatGPT or some other kind of chatbot: you ask it a question, and its ability to respond is based on its exposure to all this training data. So over time, as these models have become more popular and have been developed and deployed at all of the major websites that someone might encounter, those bots need training data. They need to be taught about the world through this collection of data. So that's largely what we think is driving it.

And as part of that, what we've noticed is that the data that comes from Wikipedia and other projects that we support, because it's generated by human beings, is even more valuable to be trained on, because it is very good at answering the kinds of questions that human beings ask.

What are the implications of all this for the foundation's infrastructure? The most important thing for us right now is to communicate with the people who are operating scrapers and to ask them to collaborate with us.

You know, we actually believe that the data that has been collected by all of these incredible volunteers should be part of the global information ecosystem, and that training on it is within the licenses that we have. They're called Creative Commons licenses; it's openly licensed content. But our request is that the companies that are relying on this information make every effort to support the continued existence of it, which means supporting these editors. It also means following a few other aspects of these licenses, such as including some kind of attribution. You know, we think that responsible product design choices, like properly attributing Wikimedia content and other openly licensed content, will help in sharing back and ensuring that other people consider participating in these Commons projects.

And we also ask that companies think about paying to support the future of Wikipedia. Commercial companies, they can use something like Wikimedia Enterprise, which is our paid-for product that enables them to reuse the content and supports the infrastructure in more effective ways. Yeah, because ultimately, what does this increased strain mean for the usability of Wikimedia's sites?

I think the main effect for us when we exceed our capacity is that it impacts human access to knowledge. People rely on these information sources every day. And so our job is to try to find ways to make sure that they have access to it, even as commercial companies and even just, like, researchers will be accessing this data, to find ways for us to coexist. And one of those ways is to learn about more responsible ways of accessing the data, other than just kind of massively scraping the sites.

We'll be right back. You're listening to Marketplace Tech. I'm Megan McCarty Carino. We're back with Selena Deckelmann, chief product and technology officer at the Wikimedia Foundation. As you noted, Wikipedia has this really unique model. A lot of the content is volunteer generated or volunteer edited.

And then you have these web crawlers, often in service of for-profit companies, training their AI, which also comes with some ethical concerns of its own. Are there tensions there? Well, like I said, I believe that both the commercial and non-commercial uses of this support our mission, which is to distribute free knowledge in perpetuity worldwide.

We think that the internet itself is a place for, you know, exploration, connecting with other people, sharing knowledge, and it's also a place for commerce. So, you know, the license was designed from the beginning to support those use cases. And I can't deny that there is tension there. But I think, for us, where we can support, you know, whatever evolution is coming for the internet in all of these changes with AI, where we can best collaborate on that, is in thinking of it this way: our content is free, but the infrastructure is not.

And these large-scale commercial reusers, they really need to recognize that the value of their products, it depends on this knowledge, on this human-generated knowledge, which then supports a wider information ecosystem. And, you know, a system that ultimately we think can be used for bettering humanity can be used for helping people know more. And, you know, that fits, I think, well within our mission.

So what does the foundation need in order to be able to scale up to meet this demand? The primary thing that we need right now is for folks who are writing scrapers to think a little bit about how they're doing that: to, you know, use our best practices, to communicate with us, and to identify themselves. Sometimes these scrapers go haywire, and it might just actually be a mistake, so, you know, giving us a way of contacting them is really important. And then, like I said, finding ways of supporting the future of Wikipedia, through attribution and through working with us on Wikimedia Enterprise, those are the best ways right now.

That was Selena Deckelmann at the Wikimedia Foundation.

We've got more from the Wikimedia Foundation on how much AI scraping is straining its infrastructure at MarketplaceTech.org. In a blog post, the foundation stresses it's not just the amount of bot traffic, but the randomness of the traffic, that creates problems. In fact, Wikimedia says bots are responsible for 65% of the site's most expensive traffic.

It's a problem a lot of other websites are having, and not all of them are taking the generous approach of Wikimedia. MIT Technology Review reports some are putting up barriers to web crawlers, bits of code that tell them to go away. But because these instructions are often ignored, this has spurred a cottage industry of anti-crawling technologies to detect, block, and charge bots.
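
Those "bits of code" are typically directives in a site's robots.txt file under the Robots Exclusion Protocol. As a minimal, hypothetical sketch (the crawler name here is illustrative, not a bot any particular site actually blocks), such a file might read:

User-agent: ExampleAIScraper
Disallow: /

User-agent: *
Allow: /

The first rule asks that one crawler to stay off the entire site while leaving it open to everyone else. Compliance is voluntary, though, which is why ignored directives have fed the blocking and bot-detection services described above.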

The upshot for actual human users? This could make it a lot harder for us to find and access what we're looking for on the internet. Jesus Alvarado produced this episode. I'm Megan McCarty Carino, and that's Marketplace Tech. This is APM.

If there's one thing we know about social media, it's that misinformation is everywhere, especially when it comes to personal finance. Financially Inclined from Marketplace is a podcast you can trust to help you get serious about your money so you can build the life you've always dreamed of. I'm the host, Yanely Espinal, and each week I ask experts important money questions, like how to negotiate job offers, how to choose a college that you can afford, and how to talk about money with friends and family. Listen to Financially Inclined wherever you get your podcasts.