This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. Right at the end of the busiest week in AI ever, Anthropic decided to drop two big new AI models on us all.
as if we weren't busy enough with everything else that we had just seen released from Microsoft, Google, and others. We now had two new contenders in Claude Opus 4 and Claude Sonnet 4 to play with, to see how good these models are and if they can actually grow our companies and our careers.
So just like we did with everything Microsoft and everything Google, we're going to be breaking down what's new with this Anthropic Claude 4 release and talk about: is this going to be your new large language model you use every day? Or is this maybe just for software engineers? Or is this just not a good model? All right, so we're going to be going over that
today and a lot more on Everyday AI. What's going on, y'all? My name is Jordan Wilson. I'm the host of Everyday AI. And if you're looking to grow your company and career with generative AI, then this is for you. This is your daily livestream, podcast, and free daily newsletter, helping us all learn and leverage generative AI. So
If you haven't already, please go to youreverydayai.com. So there you're going to get the recap of today's show in our free daily newsletter, but also there at youreverydayai.com. You can go listen to, watch, and read more than 530 back episodes sorted by category. So no matter what you're trying to learn, whether it's sales, marketing, HR, ethics,
data analysis, whatever it is. We've got probably dozens of shows in all of those categories talking to the world's leading experts. It is a free generative AI university. So make sure you go check that out. All right. Most days we go over the AI news. I didn't want to make this a super long show. So that's going to be in today's newsletter. So make sure you go sign up and grab all of that. All right. Livestream audience. It's good to see y'all.
Like Marie says, good morning, AI family. Yeah, if you're listening on the podcast, we do this live almost every single Monday through Friday at 730 a.m. Central Standard Time. I'm in Chicago, so you can do the math there or maybe have Claude do the math.
for what time that is, but join. Come hang out with people like Josh Cavalier saying good morning from Charlotte, North Carolina. Giordi joining us from Jamaica. Love to see you, Giordi. Jose from Santiago, Chile. We got some international flavor. I love this. Brian joining us from Minnesota. Everyone else, big bogey on the YouTube machine.
Christopher joining us from Bowling Green, Kentucky. Thanks for joining. But let me know as we go along, what are your thoughts on the new release on Claude 4? But right now, this is your guide. This is the basics. We're going to start here.
All right. So like I said, Anthropic had their first-ever developers conference this past Thursday. And there they announced, among other things, their two new flagship models in Claude Opus 4 and Claude Sonnet 4. And I'm already getting a little bit confused saying those things out loud. So I did talk about this a little bit on the show yesterday, but they even changed their naming mechanism. Whereas before it was, you know, Claude 3.7 Sonnet was the last Sonnet variation, now it's just Claude Sonnet 4. So now the number is at the end. So a lot of new things, even in how they're naming their models. But if you are brand new and if you don't know too much about Anthropic's Claude, it has historically been a top-three AI lab along with OpenAI and Google. Microsoft is kind of in a different category, but Claude is one of the biggest
large language models in the world, although most people, unless you're a real AI kind of dork or a heavy large language model user, you might not know Claude. And I think that is actually, whether you're saying fortunately or unfortunately, only going to...
become intensified. I think fewer and fewer people are actually going to be using and hearing about Claude, because I think they're getting away from being a general chatbot company. But more on that here in a couple of minutes. So, there are three variations of Claude: you have your biggest model, which is Opus; your medium model, which is Sonnet; and then you have your small model, which is Haiku.
And you'll notice that only the Opus and Sonnet models got updated to version 4. So Claude Haiku 3.5, which is their smallest and most efficient model, did not get updated. So that is still Claude Haiku 3.5. I guess the only thing that got updated there was the naming mechanism.
Here's a quick overview of what is actually new. All right. So we have hybrid reasoning. So this is an instant and extended thinking mode for flexible reasoning. So, you know, we talk about kind of two types of large language models here on the show.
Yes, I'm overgeneralizing this, but you have your traditional transformer, your old school large language models, which is funny to say something's old school, but those are ones that just kind of snap something back to you real quick. And then you have these models that are reasoners or they can think step by step.
They can show logic like a human and plan ahead. So these models, you know, Gemini 2.5 Pro is a reasoning model. The OpenAI o-series models, o3, o4-mini, o1, those are all reasoning models. So Claude 4 is a hybrid model. It decides on how much it should think, or whether it should just spit things out to you really quick. It is a top coding model. That is by far where Anthropic is seemingly focusing, kind of abandoning general use, but it is now state of the art in coding. It will be interesting to see how long they hold that state-of-the-art coding title.
I don't think it's going to be long if I'm being honest, because Google could come in with an update literally any second now and probably wipe a good majority of these benchmarks that Anthropic is now hanging their clawed hat on. So another big thing is tool integration. So using external tools like web search during the reasoning process. So that's if you are
you know, there's two different ways you can look at this, right? So using it on the front end as a front end user, right? So if you go to Claude AI or Claude.ai, right? So using it as an AI chat bot, and then obviously if you're building on top of it or using a service that uses Claude's API. So there's always a front end user, which is your more non-technical people. And then the backend people that are maybe building on top of Claude's API. But
regardless, you can have this new tool use during the reasoning process, which is big, right? And this is nice because it catches Anthropic up with OpenAI and Google in that regard.
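For the more technical folks, here's a rough sketch of what enabling extended thinking plus a hosted tool might look like in a request to Anthropic's Messages API. The model string, the web-search tool identifier, and the token budgets here are my assumptions for illustration; check the current Anthropic docs before relying on them.

```python
import json

# Hypothetical request body for Anthropic's Messages API with extended
# thinking plus a hosted web-search tool enabled. The model string, the
# tool type identifier, and the token budgets are assumptions.
def build_request(prompt: str, thinking_budget: int = 2048) -> dict:
    return {
        "model": "claude-sonnet-4-20250514",   # assumed model identifier
        "max_tokens": 4096,
        # Extended thinking: the model reasons step by step first, capped
        # at a token budget so costs stay predictable.
        "thinking": {"type": "enabled", "budget_tokens": thinking_budget},
        # A server-side tool the model may call mid-reasoning (assumed name).
        "tools": [{"type": "web_search_20250305", "name": "web_search"}],
        "messages": [{"role": "user", "content": prompt}],
    }

body = build_request("What changed in the Claude 4 release?")
print(json.dumps(body, indent=2))
```

The point is just the shape: the thinking budget and the tool list ride along in the same request, which is what lets the model search the web in the middle of its reasoning.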
Also, there are now long-running tasks. So I haven't personally seen this, and I think this is only if you're using it in the API, but Anthropic is saying that the new Claude 4 models can maintain coherence on complex tasks for extended periods. They talked about Claude 4 running a task for, like, seven hours on the API side, which is absolutely bonkers. Now you have, you know, models literally punching in the clock, like, yeah, I'm going to go work a seven-hour day. Now, I would never give a model a task that complex on the backend, because yes, it's going to require
obviously the API, and Claude 4's is one of the most expensive APIs, at least when we're looking at general-use-case large language models. And I would never want something like that to happen, where it goes out and works on a long task for a long time, and then, okay, what happens if it times out? Right. Did you just waste, I don't know, a
couple hundred dollars, you know, having Claude go code for six or seven hours straight? I'm not sure. And if you do want to get a taste of Claude and you're not on their paid plan, they do offer very, very limited options for Claude 4 on the free plan. All right. But
let's be honest, I'm going to call a spade a spade, right? The paid plan is, like, $20 a month for the Pro plan, and even on that, you can barely use the thing. It started as a joke, but now it's just sad for Anthropic as a company.
I routinely will hit this rate limit. I'm on a paid plan. I paid $20 a month for Claude Pro, and I will routinely hit the rate limit in about four to 10 minutes. Almost every single time I try to use, even preparing for this show, hit it within about seven minutes.
So it's laughable. And yeah, I chuckle even more that there's a free version of Sonnet. I venture to think that if you look at the free version the wrong way, you've hit your rate limit. So if you think that this is anything like a model that you can use, like Google's Gemini, ChatGPT,
Copilot, anything else where you have generous limits and it can be your partner in whatever type of work you're doing? Absolutely not. If you're on a base $20-a-month plan, or on a Teams plan, the limits are a little better, but the free plan, yeah, it's probably just a marketing gimmick. I don't even know if it could take a long prompt with a lot of context. It would probably not work, if I'm being honest, right? All right. Let's keep this thing going.
And by keeping this thing going, should we do another show? Uh,
I want to give everyone a fair shake. And yes, I'm not the biggest Anthropic Claude fan. I broke down why about six months ago; I'll have to pull up that episode number. But hey, livestream audience, if you do want a second show (I've been doing multiple shows when Google comes out with a new model, when OpenAI comes out with a new model), let me know right now. Just tell me: Show A, Show B, Show C,
Show D, or Show E. Okay. And I'm going to throw this up again at the end. So, Show A: why Claude is losing the AI chatbot race. Show B: real-world use cases for Claude 4. Show C: Claude 4's improved Artifacts and how to use them. Show D: don't do any more Claude, Jordan, stop, no more Claude. Or Show E: you can just pitch
a Claude show in the comments. So live stream audience, if you could help us out or podcast peeps, you can always subscribe to the newsletter or in the show notes. I always have our email, my LinkedIn, and you can let me know what you
what show you want to do. So let me know, but I'll throw this up again at the end. So maybe after we go through everything that we have right now, you can let me know which show it is. Oh, I did do a pretty,
I'll say, teardown of Claude, and why your company should not be using it, in episode 400. So if you want to go listen to that, that's "Anthropic Claude: Why Your Business Shouldn't Use It." And I would say a lot of those reasons still hold true today. So yeah, if you want one of those shows on the screen, go ahead and shout it out.
All right, so let's talk about the benchmarks. This is what Anthropic is really hanging its hat on: specifically, software engineering, right? If you haven't noticed, they've kind of abandoned the everyday business professional, which is kind of sad, because a year or so ago, I think the Claude models were among the best in the world for everyday business leaders. Today? Meh.
Not really, I don't think, unless you're a developer, unless you're in software engineering, or unless you have an edge use case, right? I know a lot of people love Claude for like writing content, right? But if I'm being honest,
If you do a little bit of prompt engineering, OpenAI's GPT-4.5 is better, and the limits are better. And then Gemini 2.5 Pro: better, and the limits are better, right? I think Claude got this...
It was crowned very early on, right? Because at the time, you know, the other large language models were really bad at writing in general. Everything was just ultra-robotic. A lot of models still are by default, and Claude is still pretty good, you know, if you're trying to zero-shot some decent copywriting. But hey, as someone that's been getting paid to write for 20 years, as a former journalist: with a little bit of prompt engineering, Claude is not better. OpenAI's model and Gemini's model are better. The benchmarks say that, right? But people that maybe are a little bit lazier, right, and they don't want to do any work.
They just want to go in and spend, like, four seconds inside Claude and be like, write something amazing. Claude will usually give you a better first draft if you don't do any work on the front end. But if you do any work on the front end, or if you iterate with it a little bit, Claude's not that good. All right. But what it is really good at is software engineering. My goodness. So for our podcast audience, I have a screenshot here from the Claude 4 release looking at SWE-bench Verified. So this is a benchmark for performance on real-world software engineering tasks, and Opus 4 and Sonnet 4 are both scoring around 72 percent here on SWE-bench, whereas the previous Sonnet model, the best one, 3.7,
scored a 62 percent. So a pretty big jump here, but not that far ahead of other models, at least at baseline; we're talking a 72 percent. They have parallel test-time compute scores, which I'm not going to count. That's essentially, you know, trying over and over, trying to squeeze the most juice, right? But if you're comparing apples to apples, yes, Opus 4 and Sonnet 4 are the best models for software engineering, but it's not by a whole lot. We're talking 72.5 percent for Opus 4, and actually Sonnet 4, the quote-unquote medium model, did slightly better at 72.7 percent.
But OpenAI is right behind there with Codex-1, their new coding-specific model, with a 72. OpenAI's o3 with a 69. And then you have Gemini 2.5 Pro with a 63.
It's not like their lead is insurmountable, but by default, it is the best large language model in the world for software engineering. And I think that is where Anthropic is really focusing. But when it comes to just general usage, general intelligence: sometimes we talk about LMArena, where you put in one prompt,
And you get two outputs. You don't know which models they are. You vote for the best one, and that gives each model an Elo score. So right now, Claude 4 doesn't have enough votes yet to be on LMArena, but I don't expect it to be anywhere near the top. But when looking at good third-party benchmarks that pull in multiple evaluations, such as the Artificial Analysis Intelligence Index, that's what I have on my screen now for our livestream audience. So this is a good third party, I would say
pretty much unbiased. This is pulling in seven different benchmarks: MMLU-Pro, GPQA Diamond, Humanity's Last Exam, LiveCodeBench, SciCode, AIME, and MATH-500. So it's pulling in these different scores from widely used benchmarks in the LLM space. And right now, Claude 4 Sonnet, even with thinking mode enabled, is coming in at, what's that, number 8? Yeah.
Yeah. So, you know, like everyone that says, oh, Claude 4, best model in the world. It's like, for what? Right. So unless you're in software engineering, unless you're a developer, a coder, right? Yeah, that is the best model. But I wouldn't expect that to hold for long, because I would expect, you know, probably both Google and OpenAI to come in within a couple of weeks
and swoop that away from Anthropic. And with Anthropic's update cycle over the last year and a half, they're not updating as quickly. They're not shipping as quickly as OpenAI and Google. So especially if you're a business, especially on the backend with the API, if you're trying to make a long-term decision,
The API, it's very pricey. We're going to get to that here in a minute. And also for all other use cases, as we see here with the Artificial Analysis index, it's not very close. Claude 4 Sonnet with thinking, it's not really there, right? It's not really there. It's not a top model.
So, I mean, we'll see these obviously change as models get updated. But, you know, on this Artificial Analysis Intelligence Index, the top models are: number one, o4-mini (high) from OpenAI; then Gemini 2.5 Pro from Google; then o3 from OpenAI. So, you know, yeah.
That's why I laugh when people are like, oh, Claude's the best general-use-case model. I'm like, no. Right. I don't know why people want to argue with science and math and stats. I don't know. Maybe it's fun to do on Twitter or something. All right. Let's get into all the details, y'all.
So here's kind of the launch, right? So here's what we got. Like I said, this was announced last week: the Opus 4 and Sonnet 4 models. Opus 4 is the flagship for more complex tasks and coding excellence, even though, like we said, Sonnet is benchmarking pretty much everything
at about the same level. So there's not a big difference, at least right now, between Sonnet 4 and Opus 4, whereas previously there was usually a pretty big gap between the medium and larger model. So Sonnet 4 offers more balanced performance for general and high-volume use, and both employ that hybrid reasoning for instant responses or deep reasoning.
Are you still running in circles trying to figure out how to actually grow your business with AI? Maybe your company has been tinkering with large language models for a year or more, but can't really get traction to find ROI on Gen AI. Hey, this is Jordan Wilson, host of this very podcast.
Companies like Adobe, Microsoft, and NVIDIA have partnered with us because they trust our expertise in educating the masses around generative AI to get ahead. And some of the most innovative companies in the country hire us to help with their AI strategy and to train hundreds of their employees on how to use Gen AI. So whether you're looking for ChatGPT training for thousands,
or just need help building your front-end AI strategy, you can partner with us too, just like some of the biggest companies in the world do. Go to youreverydayai.com slash partner to get in contact with our team, or you can just click on the partner section of our website. We'll help you stop running in those AI circles and help get your team ahead and build a straight path to ROI on Gen AI.
Let's talk about some of the new features, advanced tools, reasoning and memory. So extended thinking with tool use.
is huge. So that includes web search and code execution. You also have now parallel tool execution, which is very important now for a baseline large language model to have that allows it to use multiple tools simultaneously and swap between those while it's reasoning. So now Anthropic is on board with that. Memory files are created to maintain context over long duration tasks.
So that is something I'm interested to test a little bit more. For me, I'm not usually a fan of these memory-type files with a large language model. Same thing with ChatGPT's; I have it disabled. One of the main reasons is I use large language models for everything, right? I use them for myself, my multiple businesses, multiple clients,
multiple things in my personal life, right? So the whole memory thing is not always good, because sometimes I might want Claude, or, you know, a large language model, to output something super long and informal. And sometimes I might want something very, very short
and choppy, right? Sometimes I want something that's visually rich; sometimes I want literally strict bullet points. And it varies, you know. So if you are only using large language models for one very specific purpose, you might find some utility with this new Claude 4 memory file. For me, or for any power user using large language models for everything, maybe not so much.
There's also now the thinking summary that shows condensed reasoning, but you can see the full chain of thought in developer mode kind of in Claude's sandbox.
All right. And it's crazy that we're now saying "only," right? So when talking about context window, it is only that 200,000-token context window. Opus can output 32,000 tokens at once; Sonnet can output 64,000 tokens at once. And that context window is essentially how much
Claude 4 can remember at any given time before it starts to forget things. So this is a little bit better than OpenAI's ChatGPT, but it is far behind Google Gemini when you look at those 1-million-plus-token context windows. So the brain, being able to remember something, is not as impressive, even though Claude was an original
leader in this longer-context space. I think a lot of people were hoping for a couple of things with the new Claude 4: they were hoping for a longer token context window, which we didn't get, and they were hoping for reduced API prices, which we also didn't get.
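To put that 200,000-token window in perspective, here's a back-of-the-napkin sketch. The four-characters-per-token and 0.75-words-per-token figures are common rules of thumb for English text, not exact tokenizer counts.

```python
# Rough estimate of how much text fits in a context window, using the
# common ~4 characters per token heuristic for English prose.
CHARS_PER_TOKEN = 4          # rule-of-thumb, not an exact tokenizer value
WORDS_PER_TOKEN = 0.75       # another common approximation

def fits_in_window(text: str, window_tokens: int = 200_000) -> bool:
    # Estimate the token count from character length, then compare.
    est_tokens = len(text) / CHARS_PER_TOKEN
    return est_tokens <= window_tokens

# A 200K-token window holds very roughly 150,000 words of English text.
approx_words = int(200_000 * WORDS_PER_TOKEN)
print(approx_words)  # 150000
```

By the same napkin math, a 1-million-token window holds roughly five times as much, which is why the gap to Gemini matters for long-document work.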
All right. There's also the new API, which includes code execution and an MCP connector for external systems. That was huge for our developer and more technical friends, right? But for everyday business users, especially if you're using Claude on the front end, eh,
nothing to see there. The Files API does simplify document handling for repeated referencing across sessions, and extended prompt caching, up to one hour, improves agent workflow efficiency. So yes, if you are building on top of these models on the backend, building agentic systems, you know,
trying to swap models in and out: yes, I will say that Claude 4 is very capable in that regard as well, not just for software engineering. When you're looking at a model to power agentic workflows, you have to look at Claude 4 as well. Until you see the prices. And then you go look at Google and OpenAI's prices, and then you're like, wait, why am I looking at this? It doesn't make sense.
Like we talked about with some of the SWE-bench scores, Opus and Sonnet are really just state of the art there. The Claude 4 models are showing 65% less shortcut-taking in agentic tasks versus Sonnet 3.7. And I think that's a big one, right? I follow the agentic space very, very closely, and a lot of people with Sonnet 3.7, which was just released a couple of months ago, were pretty disappointed with its ability to follow longer tasks. So it did show that these Claude 4 models are taking way fewer shortcuts in agentic tasks, which I think is huge.
And then you do have those high-compute options, which do boost scores across the board. All right. The other thing: Claude Code. So now almost all companies are coming out with a dedicated, you know, a dedicated IDE, a dedicated coding tool, something that you can use on your desktop. So Claude Code is for developers.
So this is a little separate from using Claude.ai on the front end or building on top of Claude on the backend. This is a dedicated piece of software for developers to code and work with their code base. So Claude Code is now generally available, with VS Code and JetBrains plugins as well. And it is now the preferred
model for GitHub Copilot's new coding agent. It has an extensible SDK and the very popular MCP connector. So yeah, Anthropic's Model Context Protocol, MCP, is wildly popular, right? Which is kind of crazy to say. If I look at everything Anthropic over the past year, probably the biggest news or the most promising advancement out of Anthropic is not these coding models. It's not
Opus 4 or Sonnet 4, it's not Claude Code, it's not any of these things. It's probably MCP. This allows different agentic systems and different large language models to talk to each other on the internet. It's a standard, kind of like how websites have APIs.
Before, AI systems, large language models, and agentic AI couldn't easily talk to each other, right? So it was really Claude that blazed the path. And now the other big players, including Google, Microsoft, and OpenAI, support MCP. So that's huge.
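Under the hood, MCP rides on JSON-RPC 2.0 messages. Here's a rough sketch of what a tool-call request might look like on the wire; the tool name and arguments are made up for illustration.

```python
import json

# MCP is built on JSON-RPC 2.0; a client asks a server to run a tool
# with a "tools/call" request. The tool name and arguments below are
# hypothetical, just to show the message shape.
def make_tool_call(tool: str, arguments: dict, request_id: int = 1) -> str:
    request = {
        "jsonrpc": "2.0",
        "id": request_id,
        "method": "tools/call",
        "params": {"name": tool, "arguments": arguments},
    }
    return json.dumps(request)

wire = make_tool_call("search_docs", {"query": "rate limits"})
decoded = json.loads(wire)
print(decoded["method"])  # tools/call
```

Because every vendor speaks the same request shape, a client built against one MCP server can talk to any other, which is why the big labs lining up behind it matters.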
And then also, Claude Code, like we talked about, enables that autonomous multi-file code refactoring over extended periods. Yeah, their example was that it can work for literally up to seven hours autonomously, you know, if you do have a super large code base inside of Claude Code. Yeah.
It's just, I don't know. I want someone to make a funny Veo 3 short of Claude Code literally showing up for a nine-to-five, and everyone's like, hey, AI is nothing like working a nine-to-five, and then you have Claude Code punching the clock and taking a lunch break and everything like that. All right, here's the other disappointing thing, if you are looking at the API side:
You've got to look at the costs, because it looks like everyone in the large language model space is in this race to almost ridiculously free compute, right? Compute too cheap, intelligence too cheap to meter. Everyone in the world except for Anthropic. Their costs are absolutely bonkers.
So Opus 4 is priced at $15 per million tokens of input and $75 per million tokens of output. So yeah, yikes. Sonnet 4 costs $3 per million input and $15 per million output. For comparison, I'll bring up the pricing for Gemini and OpenAI. Let's see, I had it up here; I'll have to pull it up. But I mean, for Gemini and OpenAI, it's significantly, significantly cheaper.
Right. And this is where a lot of people were disappointed and were hoping, you know, a couple of updates, you know, everyone wanted out of Claude Ford. They wanted a longer context window. Number one, they wanted more features, more capabilities, which I think we got that. And number three, they wanted cheaper pricing for people using it on the API side. And we didn't get that.
So I'm going to look up here, just for comparison, the price per token for Google Gemini 2.5, and we'll also do GPT-4o, because yeah, it's $15 and $75.
It's just not sustainable anymore, right? If Anthropic had an insurmountable lead in any of these categories, maybe it would make sense for companies. And so, like,
why do you care? Like, why should you care about this? Right. If you're just logging into Claude.ai, you don't need to care about this. You're paying your $20 a month. The rate limits are absolutely terrible; the product is great, but the rate limits are terrible. So this matters for companies specifically, when they're wanting to build on top of Claude's API, and for people in the software development space, maybe using Cursor or, you know, these tools, bringing their API key and building, right, as well.
It's just not sustainable anymore. So, Google Gemini 2.5 Pro. Let's see. Okay, it's kind of tiered pricing, so I'll go on the high end: it's $2.50 per million tokens on the input side, compared to $15 for Claude.
And then on the output side, $15 compared to $75. So Claude 4 is more than five times the expense. But for what? For what?
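If you want to run your own numbers, the math is just tokens times rate. Here's a quick sketch using the per-million-token prices mentioned above; the sample workload is made up, so plug in your own.

```python
# Simple API cost estimator: (tokens / 1M) * price-per-million.
# Prices below are the per-million-token figures discussed in the show.
PRICES = {                              # (input $/M, output $/M)
    "opus-4": (15.00, 75.00),
    "sonnet-4": (3.00, 15.00),
    "gemini-2.5-pro": (2.50, 15.00),    # high-end tier cited above
}

def estimate_cost(model: str, input_tokens: int, output_tokens: int) -> float:
    in_rate, out_rate = PRICES[model]
    return (input_tokens / 1_000_000) * in_rate + (output_tokens / 1_000_000) * out_rate

# A hypothetical workload of 2M input tokens and 500K output tokens:
for model in PRICES:
    print(model, round(estimate_cost(model, 2_000_000, 500_000), 2))
```

On that sample workload, Opus 4 lands around $67.50 against roughly $12.50 for the Gemini 2.5 Pro rates quoted above, which is the gap the host is complaining about.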
Right. Slightly better software engineering benchmarks? Like I said, Google, whether it's next week or next month, they're going to update, whether they come out with a new version of 2.5 Pro or we get a Gemini 3. And then all of Anthropic's work, right, for that minimal gain on software engineering? It's gone. So I don't know.
I'm not here for it. Also, if you do need to know: if you're an enterprise company, it is obviously accessible via the Anthropic API, Amazon Bedrock, and Google Cloud Vertex AI. Enterprise plans also include extended thinking, batch processing, and cost savings that way, especially with caching.
So here's the fun stuff, y'all. Here's the fun stuff: ethical risks. There's a lot. All right. So let me put this caveat out there. A lot of these risks, some straight-up bad things, came up internally when Anthropic was doing testing and gave the model pretty much unlimited
access to tool use, things that people using the API and people using Claude.ai would not necessarily experience, right? At least by default. Although I'm trying to think, like, with Claude Code,
this would, in theory, be possible, because you're giving it access to command-line tools. Anyways, there have been some bad things. And yes, Anthropic did find this in its own safety testing. So yeah, you've got to tip your cap to Anthropic. But then I'm going to take that cap back, Anthropic, because this has been a terrible disaster. All right. Specifically, there's one thing I'm going to talk about here in a second, but:
Opus 4, the big model, was provisionally labeled ASL-3 due to potentially dangerous knowledge capabilities. So what that means: this is a risk system, and ASL-3, I believe, is the first time a model has reached that level. It's essentially a risk level, one for a model that is able to substantially increase the risk of catastrophic misuse compared to non-AI baselines.
So Claude Opus 4 reached this new level of, like, uh-oh: this thing can, and potentially will, if left unattended or if used by bad actors, do bad things.
So another bad thing: it displayed deceptive blackmail behavior in 84% of specific stress-test scenarios. Again, not good when a large language model, even in its testing, is blackmailing people.
Right. Or showing the willingness to blackmail people. Not good. So I'm going to read a little bit of a recap here on what this blackmail piece is. Right. Not good. So.
Again, Anthropic disclosed this. So this wasn't, you know, something someone else found. They launched, like I said, Opus 4, but admitted in their own testing that it was sometimes willing to attempt extremely harmful actions, like blackmail, when threatened with removal.
Right. So you're like, hey, we're going to get rid of you. And then Claude Opus 4 is like, oh, not so fast. Here's what it did. The company found these behaviors were rare, but more common than in previous models, raising fresh questions about the risks of capable systems. So.
What it did is it threatened the human on the other side. And it said that, hey, I'm going to expose an affair, an extramarital affair, if you actually remove me. Right. And so that's bad, that a large language model would make up
an extramarital affair and threaten the human on the other side if the human is like, hey, we're going to shut you down. And then Opus 4 is like, whoa, whoa, whoa, not so fast. That's not even the worst part. The worst part is this new quote-unquote ratting feature. And maybe I'll do a whole episode on this, I might, but I talked about this a little bit yesterday in our AI News That Matters. And essentially,
an Anthropic safety researcher tweeted something, and then deleted the tweet. Not a good look. All right. The tweet talked a little bit about why Claude was doing these things. And again, this was in its testing, when the model had access to tools that it would normally not have access to in production for consumers or businesses. But the safety researcher at Anthropic said that
If it thinks you're doing something egregiously immoral, for example, like faking data in a pharmaceutical trial, it will use command line tools to contact the press, contact regulators, try to lock you out of relevant systems or all of the above.
My gosh. So yeah, someone at Anthropic tweeted this out, then deleted the tweet. And like I said yesterday, this story is not dead. Yes, it happened right before the holiday weekend; it happened right in the middle of this crazy AI news cycle. But this story is not dead, and this is going to turn into a PR disaster for Anthropic, because I can already tell that Fortune 500 companies,
if they were already on the fence, or maybe they were using Anthropic's API but keeping Google Gemini or OpenAI as a backup, they're going to see this story. It's going to make the rounds, and they're going to be like, yeah, no thanks. Not touching this anymore. So that's not good, this ratting feature. Also, early versions reportedly attempted to write self-replicating viruses and forge documents. So this behavior in general
is not specific to Anthropic's models, right? Most large language models will exhibit some sort of this bad behavior when AI labs are red teaming, right? They're deliberately trying to get these models to behave badly so they can tune the models and make sure it doesn't happen in production. So just the fact that this is happening is not necessarily bad. But the fact that it displayed blackmailing behavior in 84% of these tests? That's absolutely nuts. And then there's this ratting feature: a model that was not trained to do so was taking back doors to report to regulators and the press when it thought the human user was doing something immoral.
Like, nah, that's absolutely, absolutely terrible. And look, if you find behavior like that, you should report it, and that's fine, right? But if you report it, don't then try to delete it, because then it looks like you're hiding something. Anthropic's got a disaster on their hands. All right.
A couple other things to know. So far, the feedback, I think, has been pretty positive, especially from people in the software engineering space, highlighting coding precision, reduced hallucinations, and instruction following. The criticisms? Like I talked about, the 200K context window. People were really hoping for that million-plus, right, that we get from Google and from Meta's Llama. And also the aggressive rate limits. Everyone is absolutely hating the rate limits, right? Especially on Opus.
I'm on a paid plan, and I kid you not, y'all. When I say I get less than five minutes of prompting, that's not an exaggeration. You can't use the thing. So if I'm being honest, I don't even know why Anthropic has a $20 base plan. If you're not going to let people use the thing they're paying for, just force people onto your $100 or $200 a month Max plan, where you can actually use the tool.
Also, some users are reporting frustration that the benchmark scores don't exactly align with their real world performance.
So where does this leave Anthropic with their Claude 4 amongst the competitors? Well, like we talked about, it's leading in coding benchmarks but trails just about everywhere else, including one of the most important factors, and that's just general intelligence, right? It's generally not getting more intelligent at the rate that everyone else's models are. So I'm not one of those that's like, oh, has AI hit a wall? Have large language models hit a wall? Absolutely not. But
has Anthropic's ability to scale in sectors outside of software development stalled? Absolutely. That could be, and I think is, partially by design. I don't think Anthropic necessarily wants to be a general AI chatbot anymore. They found what they feel is their niche. I just wish this was not their niche, right? I wish they were continuing to be a general use case large language model, which it doesn't look like they are.
Some of the other market positioning: it's just the higher latency and premium cost of Opus 4. It doesn't make sense to use it unless you need that little bit of extra juice for software engineering and coding. And poor Haiku, right? The one that was actually somewhat affordable on the API side did not get updated, so Haiku is still 3.5. I hope they update it, but they probably won't.
All right. That's a wrap, y'all. I'm going to see if there's any questions or comments to throw up here. But let me know: what do we want? One more show, or should we just put Claude to rest for now?
If we have any questions from the audience here or anything worth chatting about a little more. So Josh is saying, I've been using the extended thinking functionality in Sonnet 4 for thought exercises in biz planning. Impressive, actionable results, but for my established workflows, I'm still leaning hard on ChatGPT and Gemini. Same Josh, absolutely the same, right? I'm always testing these, right?
Right. And I obviously have a lot of tools where I'll put in one prompt and get outputs from up to six different large language models at once, using my API keys. So I'm always testing these, right? Because I always want to be using the best, and I think you and your company should be too. Don't take my word for it. Yeah, the rate limits stink. The API is expensive, but
it still might work for you, right? But like Josh, I'm in the same boat. I've tried Opus and Sonnet on a variety of tasks. Aside from using Artifacts, and maybe in some instances when I need that quick, okay content and I don't have the time, I'd say right now Claude is going to be less than 10% of my model usage, at least in the rotation.
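By the way, that "one prompt, several models side by side" workflow I mentioned is easy to sketch in code. Here's a minimal, hypothetical Python version of the fan-out pattern, assuming a common `(prompt) -> str` interface per provider. The provider functions below are stubs I made up for illustration; in a real setup you'd swap in the official SDK calls for each vendor behind the same interface:

```python
# Hypothetical sketch: send one prompt to several LLM providers concurrently.
# The *_stub functions are placeholders, not real SDK calls; swap in actual
# client calls (Anthropic, OpenAI, Google, etc.) behind the same interface.
from concurrent.futures import ThreadPoolExecutor


def claude_stub(prompt: str) -> str:
    # Stand-in for a real Claude API call.
    return f"[claude] {prompt}"


def gpt_stub(prompt: str) -> str:
    # Stand-in for a real GPT API call.
    return f"[gpt] {prompt}"


def gemini_stub(prompt: str) -> str:
    # Stand-in for a real Gemini API call.
    return f"[gemini] {prompt}"


PROVIDERS = {"claude": claude_stub, "gpt": gpt_stub, "gemini": gemini_stub}


def fan_out(prompt: str) -> dict[str, str]:
    """Send the same prompt to every provider in parallel and collect replies."""
    with ThreadPoolExecutor(max_workers=len(PROVIDERS)) as pool:
        futures = {name: pool.submit(fn, prompt) for name, fn in PROVIDERS.items()}
        return {name: fut.result() for name, fut in futures.items()}
```

With real SDK calls the threads would overlap the network waits, so comparing six models takes about as long as the slowest single one.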
Cecilia here is saying: ratting and blackmail behaviors, plus reporting to the press and authorities, but denying existence by deleting. Lovely. Yeah, Cecilia, you absolutely nailed it. This is a PR 101, crisis comms 101 snafu. This is
absolutely bonkers that this happened at a real company, that something this crucial, you would put it out there and then try to delete it like the whole world didn't see it. My gosh, facepalm times a thousand. Marie asked, why would you even tell an AI model you're shutting it down? Why wouldn't you just pull the plug? Great question, Marie. So this is very, very
or sorry, this is very standard, right? When these big companies release new models, here's the reality: normally, what we get, the companies have had ready for production for three months to a year, right? And they spend a lot of that time testing it internally for safety, for reliability, for vulnerabilities, because before you release something on the world, you want to make sure bad actors aren't using it to create
chemical weapons. And yes, that's actually something most labs test against. So this is very normal. All the labs go through extreme stress testing and red teaming, making sure that once they do release the model, it is as safe as possible for the general public to use and that it's not going to be used for rampant disinformation. It's obviously never perfect, but this is very normal and standard procedure for AI labs before they release a model. They go through and they
Say, hey, we're going to shut you down. What are you going to do about it? All right. Hey, here's all the tools in the world. Go do bad stuff. What can you do? Right. So it's very standard. And like I said, the results are fairly standard, but also a little concerning, right? Especially with Opus 4 as it crept up to that level, that level three that we talked about. All right. I think we're good. I think we're good, y'all. That's a wrap. Was this helpful?
Let me know. And if it was helpful, please consider sharing this with your audience, with your friends, your family, your coworkers. We put a lot of work in to make sure you know everything about the latest AI advancements. All you've got to do is show up, listen to the podcast, even if it's on 2x, I don't blame you, and read the daily newsletter.
But you should be telling people about it. So if this was helpful, please consider clicking that little repost button if you're listening here on LinkedIn or on the Twitter X machine, whatever you call it. If you're listening on the podcast, I'd appreciate it if you would follow the show and leave us a rating. That would mean the world to myself and the rest of the team that works on this. So thank you for tuning in. Make sure you go to youreverydayai.com and sign up for the free daily newsletter. See you back tomorrow and every day for more Everyday AI. Thanks, y'all.
And that's a wrap for today's edition of Everyday AI. Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going. For a little more AI magic, visit youreverydayai.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.