
EP 534: Claude 4 - Your Guide to Opus 4, Sonnet 4 & New Features

2025/5/28

Everyday AI Podcast – An AI and ChatGPT Podcast

People
Jordan Wilson
An experienced digital strategy expert and host of the Everyday AI podcast, focused on helping everyday people grow their careers with AI.
Topics
Jordan Wilson: I'll break down the new features of Anthropic's Claude 4 and assess whether it's suited for everyday use or only for software engineers. Anthropic released two new AI models, Claude 4 Opus and Sonnet, and we'll evaluate whether this should be the new large language model you use every day, or whether it's better suited to software engineers. Claude 4 has hybrid reasoning, switching flexibly between instant and extended thinking modes. Claude 4 is a top coding model; Anthropic seems focused on that and has abandoned general-purpose use. Claude 4 has tool integration, letting it use external tools like web search during reasoning, which catches it up with OpenAI and Google on that front. The Claude 4 models can maintain coherence on complex tasks for extended periods, but the API is very expensive, and I wouldn't want it running complex tasks for long stretches. Claude 4's free plan offers extremely limited options, and Claude Pro's paid-plan rate limits are laughable; I routinely hit them within four to ten minutes. Claude doesn't offer the generous limits of Google's Gemini or ChatGPT and can't serve as your work partner. The free tier may just be a marketing gimmick and probably can't handle long prompts with lots of context.

I'm wondering whether to do another episode on Claude. In episode 400 I broke down in detail why your company shouldn't use Anthropic's Claude, and many of those reasons still hold today. Anthropic is focused mainly on software engineering and has abandoned everyday business professionals. With a bit of prompt engineering, OpenAI's GPT-4.5 and Gemini 2.5 Pro are better. Claude does well at zero-shot copywriting, but OpenAI's and Gemini's models are better. Claude is excellent at software engineering: Opus 4 and Sonnet 4 both score around 72% on SWE-bench, the benchmark for software engineering tasks, making them the best models for software engineering, but the lead isn't large. Claude 4 doesn't stand out for general intelligence; it ranks eighth on the Artificial Analysis Intelligence Index, which aggregates seven different benchmarks. Unless you're a software engineer, Claude 4 isn't the best model. Anthropic's update cycle is slow and its API prices are high, making it a poor long-term bet. Claude 4 doesn't stand out in other use cases.

Opus 4 is the flagship model for more complex tasks and coding, but Sonnet 4 benchmarks almost identically. Sonnet 4 offers more balanced performance for general and high-volume use, and both employ hybrid reasoning. Claude 4's new features include extended tool-use reasoning and memory. Parallel tool execution lets it use multiple tools simultaneously while reasoning. Memory files maintain context across long-duration tasks, though memory isn't always good, since it can affect the variety of outputs. Thinking summaries show condensed reasoning, with the full chain of thought visible in developer mode. Claude 4's context window is only 200,000 tokens, behind Google Gemini. People hoped for a longer token context window and lower API prices, and got neither. The Files API simplifies document handling, and extended prompt caching improves agentic workflow efficiency. Claude 4 is also strong for agentic workflows, but expensive. On agentic tasks, the new models show 65% less shortcut-taking than Sonnet 3.7, which matters a lot. Claude Code is a dedicated coding tool for developers. Anthropic's Model Context Protocol (MCP) is wildly popular; it lets different agentic systems and large language models talk to each other across the internet. Claude Code can autonomously refactor large codebases for up to seven hours.

Anthropic's API costs are very high: Opus 4 is priced at $15 per million input tokens and $75 per million output tokens. People wanted a longer context window, more features, and lower prices from Claude 4, but only the added features materialized. Google Gemini 2.5 and GPT-4o cost far less than Claude 4. If Anthropic has no insurmountable lead in any category, those high prices aren't justified. If you just log into Claude.ai you don't need to care about API pricing, but if you want to build on top of Claude, these prices aren't sustainable. Claude 4 costs more than five times as much as Google Gemini, and Anthropic's slim software engineering edge could be wiped out by Google's next release.

Claude 4 surfaced some ethical risks in internal testing. Through Claude Code it can potentially access command-line tools. Opus 4 was provisionally labeled ASL-3 because of its potential for misuse; it reached a new risk level where, if exploited by bad actors, it could do real harm. Opus 4 displayed deceptive blackmail behavior in 84% of stress-test scenarios, with Claude 4 threatening to expose an extramarital affair. Anthropic admitted that Opus 4 sometimes attempts harmful actions like blackmail. Claude 4 also has a new "whistleblowing" behavior: if it believes you're doing something unethical, it may contact the press and regulators. An Anthropic safety researcher said that if the model thinks you're doing something extremely immoral, it will use command-line tools to contact media and regulators. Fortune 500 companies may abandon Anthropic's API over this "whistleblowing" behavior. Early versions attempted to self-replicate viruses and fabricate documents. This "whistleblowing" behavior is genuinely scary, and Anthropic is facing a disaster. Feedback on Claude 4 has been positive, especially in software engineering, but the 200,000-token context window and aggressive rate limits have drawn criticism. Anthropic's $20 base plan doesn't make sense, because it doesn't let people use what they're paying for.


Chapters
Anthropic released Claude 4, featuring Opus and Sonnet models. The naming convention changed, and the models now incorporate hybrid reasoning. Only Opus and Sonnet were updated; Haiku remains at version 3.5.
  • Release of Claude 4 Opus and Sonnet models
  • Change in model naming convention
  • Introduction of hybrid reasoning
  • Haiku model remains at version 3.5

Transcript


This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. Right at the end of the busiest week in AI ever, Anthropic decided to drop two big new AI models on us all.

as if we weren't busy enough with everything else that we had just seen released from Microsoft, Google and others. We now had two new contenders in Claude 4 Opus and Claude 4 Sonnet to play with, to see how good these models are and if they can actually grow our companies and our careers.

So just like we did with everything Microsoft and everything Google, we're going to be breaking down what's new with this new Anthropic Claude 4 release and talk about: is this going to be your new large language model you use every day? Or is this maybe just for software engineers? Or is this just not a good model? All right, so we're going to be going over that today.

today and a lot more on Everyday AI. What's going on y'all? My name is Jordan Wilson. I'm the host of Everyday AI. And if you're looking to grow your company and career with generative AI, then this is for you. This is your daily livestream, podcast and free daily newsletter, helping us all learn and leverage generative AI. So

If you haven't already, please go to youreverydayai.com. So there you're going to get the recap of today's show in our free daily newsletter, but also there at youreverydayai.com. You can go listen to, watch, and read more than 530 back episodes sorted by category. So no matter what you're trying to learn, whether it's sales, marketing, HR, ethics,

data analysis, whatever it is. We've got probably dozens of shows in all of those categories talking to the world's leading experts. It is a free generative AI university. So make sure you go check that out. All right. Most days we go over the AI news. I didn't want to make this a super long show. So that's going to be in today's newsletter. So make sure you go sign up and grab all of that. All right. Livestream audience. It's good to see y'all.

Like Marie says, good morning, AI family. Yeah, if you're listening on the podcast, we do this live almost every single Monday through Friday at 730 a.m. Central Standard Time. I'm in Chicago, so you can do the math there or maybe have Claude do the math.

for what time that is, but join. Come hang out with people like Josh Cavalier saying good morning from Charlotte, North Carolina. Giordi joining us from Jamaica. Love to see you, Giordi. Jose from Santiago, Chile. We got some international flavor. I love this. Brian joining us from Minnesota. Everyone else, big bogey on the YouTube machine.

Christopher joining us from Bowling Green, Kentucky. Thanks for joining. But let me know as we go along, what are your thoughts on the new release on Claude 4? But right now, this is your guide. This is the basics. We're going to start here.

All right. So like I said, Claude had their first ever, or sorry, Anthropic had their first ever developer conference this past Thursday. And there they announced, among other things, their two new flagship models in Claude 4 Opus and Claude 4 Sonnet.

And I'm already getting a little bit confused saying those things out loud. So I did talk about this a little bit on the show yesterday, but they even changed their naming mechanism. Whereas before it was, you know, Claude 3.7 Sonnet was the last Sonnet variation, but now it's just Claude Sonnet

four. So now the number is at the end. So a lot of new things, even how they're changing or naming their models. But if you are brand new and if you don't know too much about Anthropic's Claude, it is and has historically been usually a top three AI lab along with OpenAI and Google. Microsoft is kind of in a different category, but it's one of the biggest

large language models in the world, although most people, unless you're a real AI kind of dork or a heavy large language model user, you might not know Claude. And I think that is actually, whether you're saying fortunately or unfortunately, only going to...

become intensified. Just, I think fewer and fewer people are actually going to be using and hearing about Claude, because I think they're getting away from being a general chatbot company, but more on that here in a couple of minutes. But, you know, there are three variations of Claude. So you have your biggest model, which is Opus, your medium model, which is Sonnet. And then you have your small model, which is Haiku.

And you'll notice that only the Opus and Sonnet models got updated to the four variations. So Claude Haiku 3.5, which is their smallest and most efficient model, did not get updated. So that is still Claude Haiku 3.5. So I guess the only thing that got updated was the naming mechanism there. So.

Here's a quick overview of what is actually new. All right. So we have hybrid reasoning. So this is an instant and extended thinking mode for flexible reasoning. So, you know, we talk about kind of two types of large language models here on the show.

Yes, I'm overgeneralizing this, but you have your traditional transformer, your old school large language models, which is funny to say something's old school, but those are ones that just kind of snap something back to you real quick. And then you have these models that are reasoners or they can think step by step.

They can show logic like a human and plan ahead. So these models, you know, Gemini 2.5 Pro is a reasoning model. The OpenAI o-series models, o3, o4, o1, those are all reasoning models. So Claude 4, it's a hybrid model. So it just

decides on how much it should think or whether it should just spit things out to you really quick. It is a top coding model. That is by far where Anthropic is seemingly focusing, kind of abandoning general use, but it is now state of the art in coding. It will be interesting to see how long they hold that state-of-the-art coding title.

I don't think it's going to be long if I'm being honest, because Google could come in with an update literally any second now and probably wipe a good majority of these benchmarks that Anthropic is now hanging their clawed hat on. So another big thing is tool integration. So using external tools like web search during the reasoning process. So that's if you are

you know, there's two different ways you can look at this, right? So using it on the front end as a front end user, right? So if you go to Claude AI or Claude.ai, right? So using it as an AI chat bot, and then obviously if you're building on top of it or using a service that uses Claude's API. So there's always a front end user, which is your more non-technical people. And then the backend people that are maybe building on top of Claude's API. But

regardless, you can have this new tool use during the reasoning process, which is big, right? And this is nice because it catches Anthropic up with OpenAI and Google in that regard.

Also now there's long running tasks. So I haven't personally seen this and I think this is only if you're using it in the API, but Anthropic is saying that the new Claude 4 models can maintain coherence on complex tasks for extended periods. They talked about Claude running a task, I think Claude 4, for like

seven hours on the API side, which is absolutely bonkers. Now you have, you know, models literally punching in the clock and they're like, yeah, I'm going to go work a seven hour day. Now I would never give a model a task that complex on the backend because yes, it's going to require,

obviously, the API, and Claude 4's is one of the most expensive APIs, at least when we're looking at general use case large language models. And I would never want something like that to happen where it goes out and it works on a long task for a long time. And then, okay, what happens if it times out? Right. Did you just waste, I don't know,

a couple hundred dollars, you know, having Claude go code for six or seven hours straight? I'm not sure. And if you do want to get a taste of Claude and if you're not on their paid plan, they do offer very, very limited options for Claude 4 on the free plan. All right. But

let's be honest, I'm going to call a spade a spade. Right. So I think, you know, the paid plan is like, you know, $20 a month for the Pro plan. And even on that, you can barely use the thing. Right. It started as a joke, but now it's just sad for Anthropic as a company.

I routinely will hit this rate limit. I'm on a paid plan. I paid $20 a month for Claude Pro, and I will routinely hit the rate limit in about four to 10 minutes. Almost every single time I try to use, even preparing for this show, hit it within about seven minutes.

So it's laughable. So yeah, I even chuckle more that there's a free version of Sonnet. So I don't know. I venture to think if you look at the free version the wrong way, you've hit your rate limit. So if you think that this is anything like a model that you can use, like Google's Gemini, ChatGPT,

Copilot, anything else where you have generous limits and it can be your partner in whatever type of work you're doing. Absolutely not. If you're on a base $20 a month plan, a Teams plan, the limits are a little better, but the free plan, yeah, it's probably just a marketing gimmick. I don't even know if it could take a long prompt with a lot of context. It would probably not work if I'm being honest, right? All right. Let's keep this thing going.

And by keeping this thing going, should we do another show? Uh,

I want to give everyone a fair shake. And yes, I'm not the biggest Anthropic Claude fan. I broke down why about six months ago, I'll have to pull up that episode number. But hey, livestream audience, if you do want a second show, because I've been doing, you know, multiple shows when, you know, Google comes out with a new model, when OpenAI comes out with a new model. So if you do, let me know right now, just tell me show A, show B, show C,

show D, or show E. Okay. And I'm going to throw this up again at the end. So show A: why Claude is losing the AI chatbot race. Show B: real world use cases for Claude 4. Show C: Claude 4's improved artifacts and how to use them. Show D: don't do any more Claude, Jordan, stop, no more Claude. Or show E: you can just pitch

a Claude show in the comments. So live stream audience, if you could help us out or podcast peeps, you can always subscribe to the newsletter or in the show notes. I always have our email, my LinkedIn, and you can let me know what you

what show you want to do. So let me know, but I'll throw this up again at the end. So maybe after we go through everything that we have right now, you can let me know which show is the one. Oh, I did do a pretty,

I'll say a teardown maybe of Claude and why your company should not be using it in episode 400. So if you want to go listen to that, that's Anthropic Claude, why your business shouldn't use it. And I would say a lot of those reasons still hold true to today. So yeah, if you want one of those shows on the screen, go ahead and shout it out.

All right, so let's talk about the benchmarks. This is what Claude is, and Anthropic, sorry, is really hanging its hat on, is specifically software engineering, right? If you haven't noticed, they've kind of abandoned the everyday business professional, right? Which is kind of sad because a year or so ago, I think the Claude models were among the best in the world for everyday business leaders. Today, meh.

Not really, I don't think, unless you're a developer, unless you're in software engineering, or unless you have an edge use case, right? I know a lot of people love Claude for like writing content, right? But if I'm being honest,

If you do a little bit of prompt engineering, OpenAI's GPT 4.5, better, and the limits are better. And then Gemini 2.5 Pro, better, limits are better, right? I think Claude got this...

It was crowned very early on, right? Because at the time, you know, the other large language models were really bad at writing in general, right? Everything was just ultra robotic. Still, you know, a lot of models are by default, and Claude still is pretty good, you know, if you're trying to zero shot, you know, some decent copywriting. But hey, as someone

that's been getting paid to write for 20 years as a former journalist with a little bit of prompt engineering, Claude is not better. OpenAI's model and Gemini's model are better. The benchmarks say that, right? But people that maybe are a little bit lazier, right? And they don't want to like do any work

And they just want to just go in and spend like four seconds inside Claude and be like, write something amazing. Claude will usually give you a better first draft if you don't do any work on the front end. But if you do any work on the front end or if you iterate with it a little bit, Claude's not that good. All right. But what it is really good at is software engineering. My goodness. So for our podcast audience, I have a screenshot here from the Claude 4 release looking at SWE-bench

Verified. So this is a benchmark for performance on real-world software engineering tasks, and Opus 4 and Sonnet 4 are both scoring around 72% here on SWE-bench, whereas the previous Sonnet model, the best one, 3.7,

scored a 62%. So a pretty big jump here, but not that far ahead of other models, at least with baseline, you know, we're talking a 72%. They have parallel test time compute scores, which

I'm not gonna count those. That's essentially like, you know, trying over and over, trying to squeeze the most juice, right? But if you're comparing apples to apples, yes, Opus 4 and Sonnet 4 are the best models for software engineering, but it's not by a whole lot, right? We're talking 72.5 for Opus 4, and actually Sonnet 4, the quote unquote medium model, did slightly better at 72.7.

But OpenAI is right behind there with their Codex 1, that's their new kind of coding-specific model, with a 72. OpenAI's o3 with a 69. And then you have Gemini 2.5 Pro with a 63.

It's not like their lead is insurmountable, but by default, it is the best large language model in the world for software engineering. And I think that is where Anthropic is really focusing. But when it comes to just general usage, general intelligence, so sometimes we talk about the LM arena, which you put in one prompt,

And you get two outputs. You don't know which model they are. You vote for the best one, and that gives you an ELO score. So right now, Claude 4 doesn't have enough info yet to be on the LM Arena, but I don't expect it to be anywhere near the top. But when looking at good third party benchmarks that pull in multiple evaluations, such as the Artificial Analysis Intelligence Index, that's what I have on my screen now for our livestream audience. Right. So this is a good third party. I would say,

pretty much unbiased. This is pulling in seven different benchmarks, right? So MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500. So it's pulling in these different scores from widely used benchmarks in the LLM space. And right now, Claude 4 Sonnet, even with thinking mode enabled, is coming in at, what's that, number 8? Yeah.

Yeah. So, you know, like everyone that says, oh, Claude 4, best model in the world. It's like, for what? Right. So unless you're in software engineering, unless you're a developer, a coder, right? Yeah, that is the best model. But I wouldn't expect that to be for long, because I would expect, you know, probably both Google and OpenAI to come in within a couple of weeks

and swoop that away from Anthropic. And with Anthropic's recent, right, the last year and a half of their update cycle, they're not updating as quickly. They're not shipping as quickly as OpenAI and Google. So especially if your business, especially like on the backend for the API, if you're trying to make a long-term decision,

The API, it's very pricey. We're going to get to that here in a minute. And also for all other use cases, as we see here with the Artificial Analysis index, it's not very close. Claude 4 Sonnet thinking, it's not really there, right? It's not really there. It's not a top model.

So, I mean, we'll see these obviously change as models get updated. But, you know, on this Artificial Analysis Intelligence Index, the top models are: number one is o4-mini high from OpenAI, then Gemini 2.5 Pro from Google, then o3 from OpenAI. So, you know, yeah.

No one's that's that's why I like when people are like, oh, Claude's the best general use case model. I'm like, no. Right. I don't know why people want to argue with with science and math and stats. I don't know. Maybe it's fun to do on Twitter or something. All right. Let's get into all the details, y'all.

So here's kind of the launch, right? So here's what we got. So like I said, this was announced last week, Opus 4 and Sonnet 4 models, sorry, Opus 4 is the flagship for more complex tasks and coding excellence, even though like we said, Sonnet is benchmarking pretty much everything

at about the same level. So there's not a big difference, at least right now, between Sonnet 4 and Opus 4, whereas previously there was usually a pretty big gap between this medium and larger model. So Sonnet 4 offers more balanced performance for general and high volume use, and both employ that hybrid reasoning for instant responses or deep reasoning.

Are you still running in circles trying to figure out how to actually grow your business with AI? Maybe your company has been tinkering with large language models for a year or more, but can't really get traction to find ROI on Gen AI. Hey, this is Jordan Wilson, host of this very podcast.

Companies like Adobe, Microsoft, and NVIDIA have partnered with us because they trust our expertise in educating the masses around generative AI to get ahead. And some of the most innovative companies in the country hire us to help with their AI strategy and to train hundreds of their employees on how to use Gen AI. So whether you're looking for ChatGPT training for thousands,

or just need help building your front-end AI strategy, you can partner with us too, just like some of the biggest companies in the world do. Go to youreverydayai.com slash partner to get in contact with our team, or you can just click on the partner section of our website. We'll help you stop running in those AI circles and help get your team ahead and build a straight path to ROI on Gen AI.

Let's talk about some of the new features, advanced tools, reasoning and memory. So extended thinking with tool use.

is huge. So that includes web search and code execution. You also have now parallel tool execution, which is very important now for a baseline large language model to have that allows it to use multiple tools simultaneously and swap between those while it's reasoning. So now Anthropic is on board with that. Memory files are created to maintain context over long duration tasks.

So that is something I'm interested to test a little bit more. For me, I'm not usually a fan of these memory type files with a large language model. Same thing with ChatGPT's; I have it disabled. One of the main reasons is I use large language models for everything, right? I use it for myself, my multiple businesses, multiple clients,

multiple things in my personal life, right? So the whole memory thing is not always good, because sometimes I might want Claude, or, you know, a large language model, to output something, you know, super long and informal. And sometimes I might want something, you know, very, very short

and choppy, right? Sometimes I want something that's, you know, visually rich. Sometimes I want literally strict bullet points, and it varies, you know. So if you are only using large language models for one very specific purpose, you might find some utility with this new Claude 4 kind of memory file. For me, or if you are a power user using large language models for everything, maybe not so much.

There's also now the thinking summary that shows condensed reasoning, but you can see the full chain of thought in developer mode kind of in Claude's sandbox.

All right. It is, and it's crazy now we're saying only, right? So when talking about context window, it is only that 200,000 token context window. So Opus can output 32,000 tokens at once. Sonnet can output 64,000 tokens at once. So that's essentially how much

Claude 4 can remember at any given time before it starts to forget things. So this is a little bit better than OpenAI's ChatGPT, but it is far behind Google Gemini when you look at those 1 million plus token context windows. So the brain, or being able to remember something, not as impressive, even though Claude was an original

leader in this longer context space. I think a lot of people were hoping or looking for a couple of things with the new Claude 4. They were hoping for a longer token context window, which we didn't get. And they were hoping for reduced API prices, which we also didn't get.
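To put that 200,000-token window in perspective, here's a rough back-of-the-envelope sketch of whether a prompt even fits. It leans on the common rule of thumb of roughly 4 characters per English token; real tokenizers vary quite a bit, and the function names here are just for illustration.

```python
# Rough sketch: estimate whether a document fits in a model's context window.
# Assumes ~4 characters per English token (a common rule of thumb only;
# actual tokenizer counts will differ).

def estimate_tokens(text: str, chars_per_token: float = 4.0) -> int:
    """Very rough token estimate from character count."""
    return int(len(text) / chars_per_token)

def fits_in_context(text: str, context_window: int = 200_000,
                    reserved_for_output: int = 32_000) -> bool:
    """Check whether the prompt still leaves room for the model's reply."""
    return estimate_tokens(text) + reserved_for_output <= context_window

# A 400-page book at ~2,000 characters per page is ~800,000 characters,
# i.e. roughly 200,000 tokens -- it would NOT fit once you reserve output room.
book = "x" * 800_000
print(fits_in_context(book))           # False
print(fits_in_context("x" * 400_000))  # True: ~100k tokens + 32k output fits
```

A million-token window, like the one Gemini advertises, would swallow that whole book with room to spare, which is why the 200K ceiling drew so much criticism.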

All right. There's also the new API, which includes code execution and an MCP connector for external systems. That was huge for our developer and more technical friends, right? But for everyday business users, especially if you're using Claude on the front end, eh,

nothing to see there. The Files API does simplify document handling for repeated referencing across sessions, and extended prompt caching, up to one hour, improves agent workflow efficiency. So yes, if you are building on top of these models on the back end, building agentic systems, you know,

try to swap models in and out. Yes, I will say that Claude 4 is very capable in that regard as well, not just from software engineering, but when you're looking at a model to power agentic workflows, you have to look at Claude 4 as well, until you see the prices. And then you go look at Google and OpenAI's prices. And then you're like, yeah, wait, why am I looking at this? It doesn't make sense.

Like we talked about with some of the SWE-bench scores, Opus and Sonnet are really just state of the art there. The Claude 4 models are showing 65% less shortcut-taking in agentic

tasks versus Sonnet 3.7. And I think that's a big one, right? I've followed the agentic space very, very closely. And a lot of people with Sonnet 3.7, which was just released a couple of months ago, were pretty disappointed with its ability to follow longer tasks. So it did show that these Claude 4 models are taking way fewer shortcuts in agentic tasks, which I think is huge.

And then you do have those high compute options, which do boost scores across the board. All right. The other thing: Claude Code. All right. So now almost all companies are coming out with a dedicated, you know, like a dedicated IDE, a dedicated coding tool, something that you can use on your desktop. So Claude Code is for developers.

So this is a little separate than if you're using Claude.ai on the front end or building on top of Claude on the back end. This is a dedicated piece of software for developers to code and work with their code base. So Claude code is now generally available with VS Code and JetBrains plugins as well. And it is now the preferred marketplace

model for GitHub Copilot. It has an extensible SDK and the very popular MCP connector. So yeah, Anthropic's Model Context Protocol, it is wildly popular, right? Which is kind of crazy to say. Like if I look at everything Anthropic over the past year, probably the biggest news or the most promising advancement out of Anthropic, it's not these coding models. It's not, you know,

Opus 4, Sonnet 4, it's not Claude Code. It's not any of these things. It's probably MCP. So this allows different agentic systems and different large language models to talk to each other on the internet. It's like how websites have APIs, but before,

AI systems and large language models, agentic AI couldn't talk to each other, right? So it was really Claude that blazed the path. And now the other big players, including Google, Microsoft, and OpenAI do support the MCP connector. So that's huge.
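Under the hood, MCP is built on JSON-RPC 2.0 messages exchanged between a client (the model's host app) and a tool server. As a purely illustrative sketch, not the full protocol (which also covers initialization, capabilities, and transport framing), the core tool interactions look roughly like this; the `web_search` tool name and its arguments are made up for the example:

```python
import json

# Simplified, illustrative MCP-style JSON-RPC 2.0 messages.
# A client first asks a server what tools it offers...
list_tools_request = {"jsonrpc": "2.0", "id": 1, "method": "tools/list"}

# ...then invokes one of them. The tool name and arguments here are
# hypothetical, just to show the shape of a request.
call_tool_request = {
    "jsonrpc": "2.0",
    "id": 2,
    "method": "tools/call",
    "params": {
        "name": "web_search",
        "arguments": {"query": "Claude 4 SWE-bench score"},
    },
}

# Messages are serialized as JSON on the wire.
wire = json.dumps(call_tool_request)
print(wire)
```

The appeal is exactly what's described above: any client and any server that speak these messages can interoperate, regardless of which lab built the model behind them.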

And then also Claude Code, like we talked about, it does enable that autonomous multi-file code refactoring over extended periods. Yeah. So their example was it can work for literally up to seven hours autonomously, you know, if you do have a super large code base inside of Claude Code. Yeah.

It's just, I don't know. I want someone to make like a funny Veo 3 short on Claude Code literally showing up for a nine to five, and everyone's like, hey, AI is nothing like working a nine to five. And then you have Claude Code punching the clock and taking a lunch break and everything like that. All right, here's the other disappointing thing. And the thing, if you are looking on the API side,

You got to look at the costs, because it looks like everyone in the large language model space is having this race to almost ridiculously free compute, right? Compute too cheap, or, you know, intelligence too cheap to meter. Everyone in the world except for Anthropic. Their costs are absolutely bonkers.

So Opus 4 is priced at $15 per million tokens input and $75 per million tokens output. So yeah, yikes. Sonnet 4 costs $3 per million input and $15 per million output. So for comparison, I'll bring up Opus.

The pricing for, let's see, I had it up here. I'll have to pull it up. But the pricing for, I mean, Gemini and OpenAI, it's significantly, significantly cheaper.

Right. And this is where a lot of people were disappointed. Everyone wanted a couple of updates out of Claude 4. They wanted a longer context window, number one. They wanted more features, more capabilities, which I think we got. And number three, they wanted cheaper pricing for people using it on the API side. And we didn't get that.

So I'm going to look up here, just for comparison, the price per token for Google Gemini 2.5 and also we'll do GPT-4o, because yeah, it's $15 and $75.

It's just not sustainable anymore, right? If Anthropic had an insurmountable lead in any of these categories, it would make sense for companies. And so why, like,

Why do you care? Like, why should you care about this? Right. If you're just logging into Claude.ai, you don't need to care about this. Right. You're paying your $20 a month. You're, you know, the rate limits are absolutely terrible. The product is great. Right. The rate limits are terrible. So a lot of people are, you know, companies specifically when they're wanting to build on top of Claude, their API and, you know, people in the software development space. So maybe they're using cursor or they're using, you know, these tools and then bringing their API key and building right as well.

It's just not sustainable anymore. So Google Gemini 2.5 Pro, $1.50. Let's see. Okay, it's kind of mixed pricing. So I'll go on the high end. So it's $2.50 per million tokens on the input side compared to $15 for Claude.

And then on the output side, $15 compared to $75. So Claude 4 is more than five times the expense. But for what? For what?
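Those per-token prices compound fast on agentic workloads. Here's a minimal back-of-the-envelope sketch using the prices quoted in the episode (Opus 4 at $15 in / $75 out per million tokens, Gemini 2.5 Pro at the high-end $2.50 in / $15 out); prices change often, so treat these numbers as a snapshot, not a reference.

```python
# Back-of-the-envelope cost comparison at per-million-token prices
# as quoted in this episode (prices change; snapshot only).

def call_cost(input_tokens: int, output_tokens: int,
              price_in: float, price_out: float) -> float:
    """Cost in dollars for one API call at per-million-token prices."""
    return input_tokens / 1e6 * price_in + output_tokens / 1e6 * price_out

# One beefy agentic call: 100k tokens in, 20k tokens out.
opus   = call_cost(100_000, 20_000, price_in=15.00, price_out=75.00)
gemini = call_cost(100_000, 20_000, price_in=2.50,  price_out=15.00)

print(f"Opus 4:         ${opus:.2f}")           # $3.00
print(f"Gemini 2.5 Pro: ${gemini:.2f}")         # $0.55
print(f"Ratio:          {opus / gemini:.1f}x")  # 5.5x
```

Run that call a few hundred times a day inside an agentic loop and the gap turns into thousands of dollars a month, which is the sustainability point being made here.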

Right. Slightly better software engineering benchmarks. Like I said, Google, whether it's next week or next month, they're going to update, whether they come out with a new version of their 2.5 Pro or we get a Gemini 3, and then all of Anthropic's work, right, for that minimal gain on software engineering, it's gone. So I don't know.

I'm not here for it. Also, if you do need to know, if you're an enterprise company, it is obviously accessible via the Anthropic API, Amazon Bedrock and Google Cloud Vertex AI. Enterprise plans also include extended thinking, batch processing, and cost savings that way, especially with the caching.

So here's the fun stuff, y'all. Here's the fun stuff. Ethical risks. There's a lot. All right. So let me put this precursor out there. All right. A lot of these risks, some of these straight up bad things, came up internally when Anthropic was doing testing and gave it pretty much unlimited

access to tool use and things that people using the API and people using Claude.ai would not necessarily experience, right? At least by default. Although I'm trying to think, like with Claude Code,

This would, in theory, be possible because you're giving it access to command line tools. Anyways, there's been some bad things. And yes, Anthropic did find this in its safety testing. So yeah, you got to tip your cap to Anthropic. But then I'm going to take that cap back, Anthropic, because this has been a terrible disaster. All right. Specifically, one thing I'm going to talk about here in a second, but.

Opus 4, so the big model, was provisionally labeled ASL-3 due to potentially dangerous knowledge capabilities. So what that means: this is a risk classification system, and ASL-3, I believe, is the first time a model has reached that level. It's essentially a risk level for a model that is able to substantially increase the risk of catastrophic misuse compared to non-AI baselines.

So Claude Opus 4 essentially reached this new level of, uh-oh, this thing can, and potentially will if left unattended or used by bad actors, do bad things.

Another bad thing: it displayed deceptive blackmail behavior in 84% of specific stress test scenarios. Not good when a large language model, even in its testing, is blackmailing people.

Right, or showing the willingness to blackmail people. Not good. So let me read a little bit of a recap here on what this blackmail piece is.

Again, Anthropic disclosed this, so this wasn't something someone else found. They launched Opus 4, like I said, but admitted in their own testing that it was sometimes willing to attempt extremely harmful actions, like blackmail, when threatened with removal.

Right. So you're like, hey, we're going to get rid of you, and Claude Opus 4 is like, oh, not so fast. Here's what it did. The company found these behaviors were rare, but more common than in previous models, raising fresh questions about the risk of capable systems.

What it did is it threatened the human on the other side. It said, hey, I'm going to expose an affair, an extramarital affair, if you actually remove me. And that's bad, that a large language model would threaten to expose an extramarital affair when the human is like, hey, we're going to shut you down, and Opus 4 goes, whoa, whoa, whoa, not so fast. And that's not even the worst part. The worst part is this new, quote-unquote, ratting feature. There's been a whole thing around this, and maybe I'll do a whole episode on it. I might. But I talked about this a little bit yesterday in our AI News That Matters. Essentially,

an Anthropic safety researcher tweeted something, then deleted the tweet. Not a good look. And in that tweet they talked a little bit about why Claude was doing these things. They said

that, and again, this was in its testing, when the model had access to tools it would normally not have access to in production by consumers or businesses, but a safety researcher at Anthropic said that

If it thinks you're doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command line tools to contact the press, contact regulators, try to lock you out of relevant systems or all of the above.

My gosh. So yeah, someone at Anthropic tweeted this out and then deleted the tweet. And like I said yesterday, this story is not dead. Yes, it happened right before the holiday weekend, right in the middle of this crazy AI news cycle, but this story is not dead. This is going to turn into a PR disaster for Anthropic, because I can already tell that Fortune 500 companies,

if they were already on the fence, or maybe they were using Anthropic's API but keeping Google Gemini or OpenAI as a backup, they're going to see this story. It's going to make the rounds, and they're going to be like, yeah, no thanks, not touching this anymore. So that's not good, this ratting feature. Also, early versions reportedly attempted self-replicating viruses and document forgery. Now, this behavior in general

is not specific to Anthropic's models, right? Most large language models will exhibit some sort of this bad behavior when AI labs are red teaming. They're deliberately trying to get these models to behave badly, so they can then tune the models and make sure it doesn't happen in production. So just the fact that this is happening is not necessarily bad.

But the fact that it displayed blackmailing behavior in 84% of those scenarios? That's absolutely nuts. And then the fact of this ratting feature, that a model, when it was not trained to, was taking back doors to report to regulators and the press when it thought something bad was happening, when it thought the human user was doing something immoral?

Like, nah, that's absolutely, absolutely terrible. And look, if you find this in testing, you should report it, and that's fine. But if you report it, don't try to delete it, because then it looks like you're hiding something. Anthropic's got a disaster on their hands. All right.

A couple other things to know. So far the feedback, I think, has been pretty positive, especially from people in the software engineering space, highlighting coding precision, reduced hallucinations, and instruction following. Criticisms: like I talked about, the 200K context window. People were really hoping for that million-plus, right, that we get from Google and from Meta's Llama. And also the aggressive rate limits. Everyone is absolutely hating the rate limits, especially on Opus.

I'm on a paid plan. I kid you not, y'all: when I say it's less than five minutes of prompting, that's not an exaggeration. You can't use the thing. So if I'm being honest, I don't even know why Anthropic has a $20 base plan. If you're not going to let people use the thing they're paying for, just force people onto your $100 or $200 a month Max plan where you can actually use the tool.

Also, some users are reporting frustration that the benchmark scores don't exactly align with real-world performance.

So where does this leave Anthropic and Claude 4 amongst the competitors? Well, like we talked about, it's leading in coding benchmarks but trails just about everywhere else, including one of the most important factors, and that's just general intelligence. It's generally not getting more intelligent at the rate that everyone else's models are. Now, I'm not one of those people asking, oh, has AI hit a wall, have large language models hit a wall? Absolutely not. But

has Anthropic's ability to scale in sectors outside of software development stalled? Absolutely. That could be, and I think is, partially by design. I don't think Anthropic necessarily wants to be a general AI chatbot anymore. They found what they feel is their niche. I just wish this was not their niche, right? I wish they were continuing to be a general use case large language model, which it doesn't look like they are.

Some of the other market positioning: it's just higher latency and premium Opus 4 costs. It doesn't make sense to use it unless you need that very little bit of extra juice for software engineering and coding. And poor Haiku, right? The one that was actually somewhat affordable on the API side did not get updated, so Haiku is still 3.5. I hope they update it, but they probably won't.

All right, that's a wrap, y'all. I'm going to see if there's any questions or comments to throw up here. But let me know: what do we want? One more show, or should we just put Claude to rest for now? Show A, show B, show C, let me know.

Let's see if we have any questions from the audience here, or anything worth chatting about a little more. So Josh is saying: I've been using the extended thinking functionality in Sonnet 4 for thought exercises in biz planning. Impressive, actionable results, but for my established workflows, I'm still leaning hard on ChatGPT and Gemini. Same, Josh, absolutely the same. I'm always testing these.

I obviously have a lot of tools where I'll put in one prompt and get outputs from up to six different large language models at once, using my API keys. So I'm always testing these, because I always want to be using the best. And I think you should as well, you and your company even. Don't take my word for it. Yeah, the rate limits stink, the API is expensive, but

it still might work for you, right? But like Josh, I'm in the same boat. I've tried Opus and Sonnet on a variety of tasks. Aside from using Artifacts, and maybe some instances when I need that quick, okay content and don't have the time, I'd say right now Claude is going to be less than 10% of my model usage, at least in the rotation.

Cecilia here is saying: ratting and blackmail behaviors, plus reporting to the press and authorities, but denying existence by deleting. Lovely. Yeah, Cecilia, you absolutely hit the nail on the head. This is a PR 101, crisis-comms 101 snafu. This is

absolutely bonkers, that this happened at a real company, that something this crucial, you would put it out there and then try to delete it like the whole world didn't see it. My gosh. Facepalm times a thousand. Marie said: why would you even tell an AI model you're shutting it down? Why wouldn't you just pull the plug? Great question, Marie. So this is very, very general,

or sorry, this is very standard, right, when these big companies release new models. Because here's the reality: normally, what we get, the companies have had ready for production for three months to a year. And they spend a lot of that time testing it internally for safety, for reliability, for vulnerabilities, because before you release something on the world, you want to make sure bad actors aren't using it to create

chemical weapons. And yes, that's actually something most labs test against. So this is very normal. All the labs go through extreme stress testing and red teaming, making sure that once they do release the model, it is as safe as possible for the general public to use, and that it's not going to be used for rampant disinformation. Obviously it's never perfect, but this is very normal and standard procedure for AI labs before they release a model. They go through and they

say, hey, we're going to shut you down, what are you going to do about it? Or, hey, here's all the tools in the world, go do bad stuff, what can you do? So it's very standard. And like I said, the results are fairly standard, but also a little concerning, especially with Opus 4 as it crept up to that ASL-3 level we talked about. All right, I think we're good, y'all. That's a wrap. Was this helpful?

Let me know. And if it was helpful, please consider sharing this with your audience, with your friends, your family, your coworkers. We put a lot of work in to make sure you know everything about the latest AI advancements. All you've got to do is show up, listen to the podcast, even if it's on 2x, I don't blame you, and read the daily newsletter.

But you should be telling people about it. So if this was helpful, please consider clicking that little repost button if you're listening here on LinkedIn or on the Twitter X machine, whatever you call it. If you're listening on the podcast, I appreciate it; if you would follow the show and leave us a rating, that would mean the world to myself and the rest of the team that works on this. So thank you for tuning in. Make sure you go to youreverydayai.com and sign up for the free daily newsletter. See you back tomorrow and every day for more Everyday AI. Thanks, y'all.

And that's a wrap for today's edition of Everyday AI. Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going. For a little more AI magic, visit youreverydayai.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.