Wikipedia is crawling with AI bots. From American Public Media, this is Marketplace Tech. I'm Megan McCarty Carino. When Jimmy Carter died late last year, the foundation that runs Wikipedia noticed something unusual. The flood of interest in the late president created a content bottleneck, slowing load times for about an hour.
Wikipedia is built to handle spikes in traffic like this, according to the foundation. But it's also dealing with a surge of bots scraping the site to train AI models and clogging up its servers. That's according to Selena Deckelman, chief product and technology officer at Wikimedia Foundation.
This is traffic that doesn't necessarily follow what other people are looking at. So we have a very sophisticated caching system so that when something like someone passes away or there's some important event happening in the world that many people want to learn about, we cache that result. But what the
crawlers do is they just look at everything. They're not as interested in what other human beings in that moment are interested in. And so what that does is it causes all of the systems to load a lot more data than they normally would.
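The caching point above can be illustrated with a small simulation. This is only a sketch under stated assumptions, not a description of Wikimedia's actual caching system: the `LRUCache` class, the cache capacity, the page counts, and the 90/10 traffic split are all hypothetical values chosen for illustration. It compares a human-like request stream, where most requests go to a few trending pages, with a crawler-like stream that walks through every page.

```python
import random
from collections import OrderedDict


class LRUCache:
    """Toy least-recently-used cache that only tracks hits and misses."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.store = OrderedDict()
        self.hits = 0
        self.misses = 0

    def get(self, page):
        if page in self.store:
            # Cache hit: mark the page as recently used.
            self.store.move_to_end(page)
            self.hits += 1
        else:
            # Cache miss: in a real system this request would reach
            # the backend servers. Store the page, evict the oldest.
            self.misses += 1
            self.store[page] = True
            if len(self.store) > self.capacity:
                self.store.popitem(last=False)

    def hit_rate(self):
        total = self.hits + self.misses
        return self.hits / total if total else 0.0


def simulate(requests, capacity=100):
    """Replay a list of page ids through a fresh cache; return the hit rate."""
    cache = LRUCache(capacity)
    for page in requests:
        cache.get(page)
    return cache.hit_rate()


# Hypothetical numbers, chosen only for illustration.
N_PAGES = 10_000
N_REQUESTS = 20_000
rng = random.Random(0)

# Human-like traffic: about 90% of requests go to 50 trending pages.
human = [
    rng.randrange(50) if rng.random() < 0.9 else rng.randrange(N_PAGES)
    for _ in range(N_REQUESTS)
]

# Crawler-like traffic: visit every page in order, ignoring popularity.
crawler = [i % N_PAGES for i in range(N_REQUESTS)]

human_rate = simulate(human)
crawler_rate = simulate(crawler)
print(f"human-like hit rate:   {human_rate:.2f}")
print(f"crawler-like hit rate: {crawler_rate:.2f}")
```

In this toy setup the human-like stream is served mostly from cache, while the crawler-like stream almost never is, which is the pattern described above: requests that ignore what people are currently reading fall through to the backend.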
Web crawlers have been a part of the Internet since the Internet came online. That's how we find web pages when we search Google. But how has AI really changed the volume and intensity of traffic from bots? What we've seen is just a massive increase in interest in crawling the entire Internet and creating
maybe just like a treasure trove of everything that's on the internet. And what that's for is teaching
large language models about, you know, what is on the internet and giving them the ability to answer questions that people might have, such as, you know, someone who's using a chatbot. You've interacted with ChatGPT, some other kind of a chatbot. You ask it a question and its ability to respond is based on its exposure to all this training data. So over time, as these models have become more popular, you know, they've
been deployed at all of the major websites that someone might encounter. Those bots, they need training data. They need to be taught about the world through this collection of data. So that's largely what we think is driving it. And as part of that, what we've noticed is that the data that comes from Wikipedia and other projects that we
support, because it's generated by human beings, it's even more valuable to be trained on because it is very good at answering the kinds of questions that human beings ask.
What are the implications of all this for the foundation's infrastructure? The most important thing for us right now is to communicate with the people who are operating scrapers and to ask them to collaborate with us. You know, we actually believe that the data that has been collected by all of these incredible volunteers,
that it should be part of the global information ecosystem, that training on it is within the licenses that we have. They're called Creative Commons licenses. It's openly licensed content. But our requests are that the companies that are relying on this information, that they make every effort to support the continued existence of it, which is supporting these editors. And it's also following a few other aspects of these licenses, which are that
they include some kind of attribution. You know, we think that responsible product design choices like properly attributing Wikimedia content and other openly licensed content, it'll help in sharing back and ensuring that other people consider participating in these Commons projects.
And we also ask that companies think about paying to support the future of Wikipedia. Commercial companies, they can use something like Wikimedia Enterprise, which is our paid-for product that enables them to reuse the content and supports the infrastructure in more effective ways. Yeah, because ultimately, what does this increased strain mean for the usability of Wikimedia's sites?
I think the main effect for us when we exceed our capacity is that it impacts human access to knowledge. People rely on these information sources every day. And so our job is to try to find ways to make sure that they have access to it, even as commercial companies and even just like researchers will be accessing this data, just find ways for us to coexist. And one of those ways is to
learn about more responsible ways of accessing the data other than just like kind of massively scraping the sites. We'll be right back. You're listening to Marketplace Tech. I'm Megan McCarty Carino. We're back with Selena Deckelman, Chief Product and Technology Officer at Wikimedia Foundation. As you noted, Wikipedia has this really unique model. A lot of the content is volunteer generated or volunteer edited.
And then you have these web crawlers, often in service of for-profit companies, training their AI, which also comes with some ethical concerns of its own. Are there tensions there? Well, like I said, I believe that both the commercial and non-commercial uses of this support our mission, which is to distribute free knowledge in perpetuity worldwide. We think that
the internet itself, it's a place for, you know, exploration, connecting with other people, sharing knowledge, and it's also a place for commerce. So, you know, the license was designed from the beginning to support those use cases. And I can't deny that there is tension there. But I think for us, where we think
we can support, you know, whatever evolution is coming for the internet in all of these changes with AI, where we can best collaborate on that is thinking of our systems this way: our content is free, but the infrastructure is not.
And these large-scale commercial reusers, they really need to recognize that the value of their products, it depends on this knowledge, on this human-generated knowledge, which then supports a wider information ecosystem. And, you know, a system that ultimately we think can be used for bettering humanity can be used for helping people know more. And, you know, that fits, I think, well within our mission.
So what does the foundation need in order to be able to scale up to meet this demand? The primary thing that we need right now is for folks who are writing scrapers to think a little bit about how they're doing that, to, you know, use our best practices, to communicate with us and identify themselves so that in a moment where, you know, sometimes these
scrapers go haywire, it might just actually be a mistake. So, you know, giving us a way of contacting them is really important. And then, like I said, finding ways of supporting the future of Wikipedia through attribution, through working with us on Wikimedia Enterprise, those are the best ways right now. That was Selena Deckelman at the Wikimedia Foundation.
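As a rough illustration of what "identify themselves" and "giving us a way of contacting them" can look like in practice, a scraper can send a descriptive User-Agent header with every request. The sketch below is an assumption-laden example, not an official template: the helper `build_request`, the bot name, the info URL, and the email address are all hypothetical, and the exact format a site prefers should be taken from that site's own guidelines. No network request is made here.

```python
import urllib.request


def build_request(url, bot_name, version, info_url, contact_email):
    """Build a request whose User-Agent names the bot and how to reach its operator.

    All argument values used below are placeholders for illustration.
    """
    user_agent = f"{bot_name}/{version} ({info_url}; {contact_email})"
    return urllib.request.Request(url, headers={"User-Agent": user_agent})


# Hypothetical bot details; replace with real ones before crawling anything.
req = build_request(
    url="https://example.org/some-page",
    bot_name="ExampleResearchBot",
    version="0.1",
    info_url="https://example.org/bot-info",
    contact_email="bot-operator@example.org",
)

# urllib stores header names capitalized as "User-agent".
print(req.get_header("User-agent"))
```

With a header like this, a site operator who sees a crawler misbehaving has somewhere to send a message instead of only being able to block it.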
We've got more from Wikimedia Foundation on how much AI scraping is straining its infrastructure at MarketplaceTech.org. In a blog post, they stress it's not just the amount of bot traffic, but the randomness of the traffic that creates problems. In fact, Wikimedia says bots are responsible for 65% of the site's most expensive traffic.
It's a problem a lot of other websites are having, and not all of them are taking the generous approach of Wikimedia. MIT Technology Review reports some are putting up barriers to web crawlers, bits of code that tell them to go away. But because these instructions are often ignored, it's spurred a cottage industry of anti-crawling technologies to detect, block, and charge bots.
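One common form of the "bits of code that tell them to go away" is a robots.txt file, and a well-behaved crawler checks it before fetching a page. The sketch below uses Python's standard `urllib.robotparser` on an inline example file. The robots.txt contents and the bot names `ExampleAIBot` and `SomeOtherAgent` are made up for illustration and do not come from any real site. As noted above, these rules are voluntary, so nothing here stops a crawler that simply ignores them.

```python
from urllib.robotparser import RobotFileParser

# A made-up robots.txt: one named bot is turned away, everyone else is allowed.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
# parse() takes an iterable of lines, so no network access is needed.
parser.parse(ROBOTS_TXT.splitlines())

url = "https://example.org/article"
ai_allowed = parser.can_fetch("ExampleAIBot", url)
other_allowed = parser.can_fetch("SomeOtherAgent", url)

print("ExampleAIBot allowed:", ai_allowed)
print("SomeOtherAgent allowed:", other_allowed)
```

A polite crawler would skip the URL whenever `can_fetch` returns `False` for its own user agent.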
The upshot for actual human users? This could make it a lot harder for us to find and access what we're looking for on the internet. Jesus Alvarado produced this episode. I'm Megan McCarty Carino, and that's Marketplace Tech. This is APM.