
Reddit Sues Anthropic for Secretly Scraping Data

2025/6/7

AI Chat: ChatGPT & AI News, Artificial Intelligence, OpenAI, Machine Learning


People

Host

Podcast host and content creator focused on electric vehicles and energy.

Topics

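Host: Reddit is suing Anthropic, alleging that it scraped Reddit's data without authorization to train its AI models. Reddit explicitly barred AI crawlers from its data through its robots.txt file, but Anthropic ignored those rules. Reddit argues that Anthropic's conduct violates user privacy, because data scraped without permission could be reverse engineered to reveal users' identities. Reddit has already signed data licensing agreements with Google and OpenAI and wants Anthropic to reach a similar deal. In my view, the lawsuit is fundamentally about money, although privacy is also raised to win public support. Sam Altman, the CEO of OpenAI, holds a stake in Reddit, which complicates matters further. Reddit claims that Anthropic's bots have kept scraping its data since 2024, accessing the site more than 100,000 times. Anthropic denies the claims and says it will defend itself vigorously.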

Transcript

It's so funny because right now there's a ton of drama going on between Reddit and Claude. But what's interesting is that at the same time, there's also drama going on between Windsurf and Claude, or I guess Anthropic, right? And while some people think these two are completely unrelated, I believe they're quite directly correlated, and I will explain why.

So the drama with Reddit right now is that they're accusing Anthropic, and by accusing I mean they're launching a lawsuit, of not paying for training data. Essentially, Anthropic is training on Reddit's data without having any permission to. And the way a site opts out, or doesn't give permission, is pretty easy for an AI company or any sort of media company, I guess: they put a robots.txt file on the site essentially saying that AI models cannot scrape their data. But their data is still on Google, so it's tricky. In any case, Anthropic has not listened to the terms of service and is allegedly continuing to scrape it. We'll get a little bit more into the accusations in a second.
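To make the robots.txt mechanism concrete, here is a minimal sketch of how a well-behaved crawler could check a site's rules before fetching a page, using Python's standard `urllib.robotparser`. The robots.txt contents, the bot names `ExampleAIBot` and `SomeSearchBot`, and the `example.com` URL are all invented for illustration; they are not Reddit's actual file or any real crawler's user agent.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt contents, invented for illustration only.
# It blocks one AI crawler entirely while allowing every other bot.
EXAMPLE_ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""


def is_allowed(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    # parse() takes the file as a list of lines, so no network access is needed.
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    url = "https://example.com/r/some_thread"  # hypothetical URL
    print("ExampleAIBot allowed:", is_allowed(EXAMPLE_ROBOTS_TXT, "ExampleAIBot", url))
    print("SomeSearchBot allowed:", is_allowed(EXAMPLE_ROBOTS_TXT, "SomeSearchBot", url))
```

Note that robots.txt is a voluntary convention: nothing in the file technically stops a crawler that chooses to ignore it, which is why a dispute like this one ends up turning on terms of service and the courts.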

Before we do, I wanted to mention: if you ever want to try the latest model out of Anthropic, and you're like, oh man, they're scraping Reddit without permission and their model is getting so much better, how much better is it getting, you might ask? Well, you can go over to AIbox.ai, my startup, and check out our playground. We have the top 10 or 20 AI models all on there: image, audio, text, from Anthropic, Google, DeepSeek, Cohere, Meta, Microsoft, everybody. You can access all of the top models on one platform, and you can chat with all of them in the same chat, which is kind of cool.

So you can go try Sonnet 3.5 and ask it a bunch of questions. Mid-conversation, you can switch to ChatGPT if you're like, hey, I just don't like the way Claude talks about this specific topic, I want to check it over here. Or maybe you're using something from OpenAI and it's not giving you a response. It's being a little too cagey and you're like, okay, I need it to be a little bit less censored. You can go use Grok and try getting a better answer from that. All of this and more, including using images or getting images and audio generated inside the same conversation, is super, super useful. You can go check it all out on AIbox.ai. There's a link in the description. Okay, back to what's going on with Reddit.

Reddit has officially filed a complaint. They filed it in a Northern California court on Wednesday. And they essentially said that Anthropic is making unauthorized use of their site's data for commercial purposes, right? They're monetizing their AI model, blah, blah, blah, and this violates Reddit's user agreement. This is actually interesting because it's the first time that a big tech company is legally challenging an AI model provider, right?

And it's kind of interesting because of what the other big tech companies are doing. Meta is not going to sue anybody because, well, they blocked off their data from training a long time ago, so they have their exclusive data sets. Google's making Gemini, and they're not suing anybody either. For example, I'm pretty sure OpenAI used YouTube to train their Sora model, and that's sus and against Google's terms. But I think Google has also done that, and maybe even some other shadier things. Well, shady is such a tricky word. They've trained off of YouTube, probably without giving users the ability to opt out as well. So they know they'd be the pot calling the kettle black, and they'd probably get in trouble. Even if they put it in the terms of service, users wouldn't be happy. It wouldn't be a good look. So they're just avoiding that whole lawsuit, and OpenAI has not been served any sort of lawsuit from Google for training their video model. Microsoft as well is doing stuff with OpenAI, right?

So a lot of the big tech companies are making these AI models themselves. They're publishers and model creators at the same time, so they're not really suing anyone, because it's all kind of a gray-area Wild West. There are a handful of companies that don't have their own AI models, Reddit being one of them, that are launching lawsuits. I believe X and xAI were probably launching a few lawsuits too.

Elon has his own beef with OpenAI, but those lawsuits seem to have calmed down or been canceled or something. I don't know. They kind of go on and off. But in any case, Reddit is one of the first big tech companies legally challenging AI models. And the reason why is that Reddit is literally making money, and pretty significant money, from its data. I believe they signed a licensing deal with Google to include Reddit data in Google Gemini for $300 million, and I believe that might be just for one year. So they're making some serious money.

OpenAI has also signed a licensing deal with Reddit for an undisclosed amount. So Reddit obviously wants Anthropic to do the same thing, right? They're like, look, Anthropic, if you want the data, come over. We're signing licensing deals with Google for Gemini and with OpenAI.

But Anthropic has not done that. And so this is what Reddit's lawyer specifically said: "We will not tolerate profit-seeking entities like Anthropic commercially exploiting Reddit content for billions of dollars without any return for Redditors or respect for their privacy."

One of the big ways they kind of spin this, I don't know if spin is the word, but one of the big ways they frame this is user privacy. They're saying you're pulling in all this user data, you're not really separating it, and you don't have adequate privacy protections in place. If you get a licensing deal directly with Reddit, like the ones OpenAI and Google have, Reddit gives out the data but keeps the usernames of all the Reddit users protected. They'll encrypt all of those, so you get the content, but you don't know who said it. But if you're just scraping, you probably get both who said it and the content. And once the data set goes in, there are probably weird ways to reverse engineer it, where you could say something like, "Give me a response about this topic and come up with a fictional username," and maybe the fictional username turns out to be an actual person's username. So that's where things get a little confusing, a little shady.

Now, that being said, Anthropic could also just obfuscate the usernames themselves. That could be done on their end as well, and I don't know if it is. But it's definitely good for the lawyers, and good for public opinion, I think, to drum this up as a privacy breach or privacy issue when really it's just about the money. I'm not saying anyone's right or wrong on this, but I think it's fair to say it's definitely all about the money.
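As a rough illustration of what obfuscating usernames could look like, here is a minimal sketch that replaces each author name with a keyed hash (HMAC-SHA256) before a comment goes into a data set. This is keyed hashing rather than the encryption mentioned above, and it is not a description of what Reddit, Anthropic, or any licensee actually does. The record layout (`author`, `body`), the function names, and the key are all assumptions made for the example.

```python
import hashlib
import hmac


def pseudonymize(username: str, secret_key: bytes) -> str:
    """Map a username to a stable pseudonym using a keyed hash (HMAC-SHA256)."""
    digest = hmac.new(secret_key, username.encode("utf-8"), hashlib.sha256).hexdigest()
    # Truncate for readability; without the key the mapping cannot be recomputed.
    return f"user_{digest[:16]}"


def scrub_comment(comment: dict, secret_key: bytes) -> dict:
    """Return a copy of a comment record with the author replaced by a pseudonym."""
    cleaned = dict(comment)  # leave the caller's record untouched
    cleaned["author"] = pseudonymize(comment["author"], secret_key)
    return cleaned


if __name__ == "__main__":
    key = b"hypothetical-secret-key"  # assumption: kept private by the data owner
    record = {"author": "example_redditor", "body": "I think this is a great point."}
    print(scrub_comment(record, key))
```

The same username always maps to the same pseudonym, so threads stay coherent, but the real name never enters the data set. A sketch like this only covers the author field; usernames quoted inside the comment text itself would need separate handling.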

This is where the plot gets a little bit thicker, though. Sam Altman is the CEO of OpenAI, of course, the main competitor to Anthropic, and he actually owns an 8.7% stake in Reddit. That makes him the third-largest shareholder, and he was once a member of their board of directors.

With all of this going on, Reddit apparently told Anthropic to stop scraping, and Anthropic, quote unquote, "refused to engage" on that. I mean, Reddit probably just wanted some money from them. But Anthropic has responded and said, "We disagree with Reddit's claims and we will defend ourselves vigorously." I feel like this is just what every company says when they're about to go to court: they deny it and say they'll defend themselves vigorously. But Reddit claims that since they told Anthropic to stop scraping back in 2024, Anthropic's bot has continued to scrape at least 100,000 more times. So the problem persisted, and we'll see if it ever stops. This is definitely very interesting, and there's a lot of controversy in the space.

In any case, if you learned anything new from the episode, make sure to leave a rating and review. Thank you so much for tuning in today. Make sure to go check out AIbox.ai and I will catch you in the next episode.

