cover of episode EP 469: Claude 3.7 Sonnet - World’s first hybrid AI model. How it works and when to use it

EP 469: Claude 3.7 Sonnet - World’s first hybrid AI model. How it works and when to use it

2025/2/25
logo of podcast Everyday AI Podcast – An AI and ChatGPT Podcast

Everyday AI Podcast – An AI and ChatGPT Podcast

AI Deep Dive AI Chapters Transcript
People
J
Jordan Wilson
一位经验丰富的数字策略专家和《Everyday AI》播客的主持人,专注于帮助普通人通过 AI 提升职业生涯。
Topics
我今天要讨论Anthropic新发布的Claude 3.7 Sonnet,这是世界上第一个混合大型语言模型。它结合了传统的transformer模型和新的推理模型,能够根据任务自动选择合适的模型进行处理。这使得它在处理复杂任务时,能够比传统的模型更加高效和准确。 Claude 3.7 Sonnet在编码和前端Web开发方面表现出色,并推出了配套的命令行工具Claude Code,方便开发者直接在终端进行代码操作。此外,它还具有更长的输出token容量,能够生成更长的文本内容。 然而,混合模型也存在一些缺点。例如,对于API用户来说,混合模型的额外思考控制功能并没有带来足够的优势,反而增加了成本。对于一般的业务用途,其API价格过高,可能无法与其他模型竞争。 总的来说,Claude 3.7 Sonnet是一个功能强大的模型,尤其在软件开发和编码领域具有显著优势。但对于其他领域,其成本效益有待考量。Anthropic似乎更专注于将Claude定位为编码助手,而不是通用大型语言模型。

Deep Dive

Chapters
Anthropic released Claude 3.7 Sonnet, the world's first hybrid large language model. It combines transformer and reasoning models, offering both instant responses and step-by-step thinking. The model shows significant improvements in coding and has a much larger output token capacity.
  • First hybrid reasoning model
  • Improved coding and web development capabilities
  • 128,000 token output capacity
  • Extended thinking mode available on paid plans only

Shownotes Transcript

Translations:
中文

This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life. Another week, another state-of-the-art large language model release. But this one from Anthropic is a little different. It's actually the first of its kind.

Because when Anthropic just released its Claude 3.7 Sonnet, they became the first company to release a hybrid large language model. All right, so we're going to be talking today about what that is, what it means, how it works, and when you should actually use this new model from Anthropic.

I hope you're excited for the show. I am. Welcome, if you're new here, to Everyday AI. What's going on, y'all? My name is Jordan Wilson, and this is Everyday AI. This thing is for you. This is your daily live stream podcast and free daily newsletter helping us all

not just keep up with Gen AI advancements and LLM updates, but how we can use it to get ahead. I want you to be the smartest person in AI at your company, and this is your cheat code. So if you haven't already subscribed,

Go to youreverydayai.com. That is where you can sign up for our free daily newsletter. Yeah. Maybe you're listening to this podcast for the first time. If so, thank you. Make sure to check out the show notes. There's going to be a lot of other information, but probably the most important is our website because each and every day, uh, on,

our newsletter, we recap exclusive insights from this exact podcast, as well as giving you every other piece of news and update that you need to stay ahead in the generative AI space, as well as you can go listen to like 500 episodes on our website, all sort of by category. So make sure you go check that out. All right. So I am extremely excited to talk

about the new Claude 3.7 Sonnet. I think it's going to change how a lot of people are using large language models, both for the good and for the bad. But before we get into that, let's first start out as we do most days by recapping the biggest AI news. So first, Google has launched a free version of its AI-powered coding assistant, Gemini Code Assist.

aimed at solo developers, students, freelancers, startups, and hobbyists. So the new free public preview offers up to 180,000 monthly code completions, significantly exceeding the 2,000 completions for free offered by competitors like GitHub Copilot's free tier.

So it is powered by the Google Gemini 2.0 model, and it can generate entire code blocks, autocomplete code, debug, and assist developers via a chatbot interface. So users can instruct the assistant in natural language, such as asking it to create specific code snippets or modify existing applications.

So Gemini Code Assist supports 38 programming languages and integrates with popular developer environments like Visual Code Studio, GitHub, and JetBrains.

All right. Our next piece of AI news, Apple, a couple of years, maybe too late. I don't know, but they're making a splash with a reported $500 billion investment over the next four years into AI infrastructure, signaling a major push into not just AI, but

American manufacturing and technology. So according to Apple CEO, Tim Cook, this commitment reflects confidence in the future of American innovation and aims to strengthen the company's role in AI and advanced manufacturing. So a key part of this investment, $500,

Billion with a B includes the development of a key new manufacturing facility in Houston, providing thousands of jobs to produce servers designed for AI cloud computing. These servers will feature the Apple Silicon and offer cutting edge security and performance capabilities.

So the integration of Apple intelligence, the company's AI platform could further transform healthcare specifically by leveraging its global network of over 2 billion active devices to provide innovative health tracking and data insights.

So Apple investments come amid a broader AI spending race with competitors like Meta spending $65 billion, Amazon spending $100 billion, and Project Stargate, which is $500 billion over five years, also ramping up their AI infrastructure and innovation budgets.

Speaking of, that's our last piece of AI news. Microsoft, just on the same day that we get reports that Apple is going all in with a $500 billion investment, reportedly Microsoft is canceling leases for a couple of hundred megawatts

of U.S. data center capacity, equivalent to about two full data centers, and that's according to a report from TD Cohen. So this move raises concerns about whether Microsoft, obviously a global leader in AI investment, may be securing more AI computing capacity than it needs in the long term. So the cancellations involve agreements with

private operators and a slowdown in converting statements of qualification, which are typically precursors to former formal leases. So TD Cohen speculates that OpenAI, which is backed heavily by Microsoft, may be shifting some of its workloads to Oracle as part of a new partnership, which may be causing Microsoft to cancel or change some of its longer term investments.

So Microsoft, which owns and operates many of its own data centers, is also reallocating billions of dollars in infrastructure investments, potentially shifting focus back to the U.S. from international projects. So despite these adjustments, Microsoft reiterated its $80 billion spending target for an AI data center infrastructure for the fiscal year ending in June. So analysts suggest that Microsoft could be in an oversupply position, mainly

meaning it may have overestimated the immediate demand for AI computing power. So it'll be interesting to see how those stories play out, especially happening at the same time. I mean, Apple, you know, making a huge splash with a $500 billion investment where we get reports that Microsoft may be slightly scaling back. All right.

Enough. If you want more AI news, make sure you can go get it at our website. Sign up for the free daily newsletter, youreverydayai.com. All right, let's get into it. Let's talk the world's first large language model hybrid.

All right. And that is with Claude 3.7 Sonnet. All right. So it is the world's first publicly available hybrid AI model. So what that means, and we're going to get more into this, right? I've been talking about this on the show now for, I don't know, at least six months since OpenAI kicked off this reasoner race.

Right. So you essentially think of it when it comes to generative AI in large language models. I know I hate to use terms like old school when technically, you know, this space is only like, I don't know, six years old, you know, at least 20.

commercially available you know the gpt 3 technology right i would say it would be the first large language model that was popularized uh commercially a couple of years before the chat gpt release uh so you have your kind of quote unquote old school transformer models and then you have your quote unquote new school uh

reasoning models that kind of use this advanced thinking. All right. And right now, those are two very separate things. So as an example, OpenAI, the leader in large language models, you know, they have their GPT-4-0, still an industry leading model, even though it's technically older, but that is its kind of quote unquote old school transformer model. And then they have their newer kind of reasoning models that use

logic under the hood and that is 01, 01 Pro, 03 Mini, 03 Mini High. Yeah, these names suck. But you essentially have these two very different types of models that excel at two very different kinds of tasks.

So now with Anthropic, they are essentially merging this together in Cloud 3.5, sorry, Cloud 3.7 Sonnet. And it kind of does both. And this new hybrid system will kind of decide on its own when it should use more of this advanced thinking versus when it should just straight out spit an answer to you without really thinking it through.

All right, so let's go over a little bit. And it's not just the 3.7 sonnet. They also announced Claude Code, which I think is an extremely big move from Anthropic and kind of,

tips its cap into where it's actually competing in. So more on that in a second. But hey, live stream audience, thanks for joining. Appreciate y'all tuning in. We've got an international audience today. So thanks to our YouTube audience. We got Big Bogey Face and Sandra and Sam, Michelle, thanks for joining. On the LinkedIn crew, we have Dr. Harvey Castro, Christopher, Woozie, LinkedIn user,

Marie, Denny, Douglas, Cecilia, Jean, thank you all. Jamie, Karina, you know, we got the UK and Italy in the house. Love to see it. Mac, Max holding it down from Chicago, just like me. But I'm curious, live stream audience, do you care about these, like this new hybrid approach? Because it's something that OpenAI is also going to be adapting to as well. I think it's

There's actually some downsides and we're going to talk about this, but you know, live stream audience, I'm curious. Number one, do you care about this hybrid approach? Do you think it's going to be good or bad? And have you used a quad three, seven sonnet yet? I know it's only been out for a couple of hours. If you do have any questions, get them. And now I'll try to tackle them at the end of the show.

All right. So let's get into an overview. And this is from Anthropic. So they're saying today we're announcing Claude three, seven sonnet, our most intelligent model to date in the first hybrid reasoning model on the market. Claude three sonnet can produce near instant responses or extended step-by-step thinking that is made visible to the user.

Yeah. So that part's important. You can kind of see like you can, it is a summarized chain of thought. So chain of thought is actually a prompting technique that was popular, popularized, you know, over the last couple of years using transformer models. So this kind of chain of thought, or, you know, how a, a person would think about a

So now a hybrid model does that and it shows a summarized version of the chain of thoughts. You can kind of see how this new 3.7 Sonnet is kind of thinking about your prompt if it is using the advanced thinking. So now back to the release. API users also have fine grained control over how long the

the model can think for. So Clawed 3.7 Sonnet shows particularly strong improvements in coding and front-end web development. Along with the model, we're also introducing a command line tool for agentic coding called Clawed Code. Clawed Code is available as a limited research preview and enables developers to delegate substantial engineering tasks to Clawed directly from their terminals.

All right. So a lot to unwrap there. So you don't have to read. I think there's like three separate releases that Anthropic put out. I'll just give you the high level. So like we said, this is the first hybrid reasoning model with visible thinking process.

And the extended thinking is on paid plans only. So if you are a free user to Anthropic Clawed, you will see the 3.7 Sonnet model available, but you do not get kind of this advanced thinking available on the free plan.

All right. Couple other high level, uh, kind of, uh, points here. It scored a 70.3 on the SWE bench or SWE bench verified best in class by a lot for coding. Like we talked about the Claude code program, uh, for agentic development, uh,

And then it has a 15 times longer output token capacity. So 128,000 tokens that it can output versus previously. Claude could only output 8.5 thousand tokens. So that's just the amount. Right. So if you ask Claude to do something before, it would spit things out. Sometimes if you ask for a lot, it would spit things out in little chunks. So now, at least according to Anthropic, that is a one token.

128,000 token output capacity. Personally, I'm not seeing that yet. We're going to do a live test here, y'all. We'll see if we actually see that. I was still getting it, breaking it out in small chunks. They did say that is in beta. So not sure if that's fully rolled out yet or if that'll be coming out in the coming days or weeks. But yeah,

I don't know. I'm not seeing it. Also, which is important, this is available across all platforms. So a lot of what we're going to be talking about is using Claude on the front end as a front end user, right? So going to Claude.ai and using your free account, your paid account, maybe you have a Teams account, right? But obviously, Claude is available on the back end. And it is a very popular model on the API side, mainly due to its

proficiencies in coding and software development. It is historically been the most used model, at least when you're looking at open router statistics. It is generally the most used model on the API side, at least those that are using open router.

Open Router is one of the more popular services where you can essentially sign up for one service, connect all your different API keys. They have good data, but that's not every single model, that's just those using Open Router.

All right. So let's talk a little bit about Claude's thinking, because this is the big, the big, you know, chain of thought, a reasoning model. So let's go over some of the highlights on how this actually works. So it uses deeper reasoning for complex tasks. So what that means is it has an extended thinking mode that lets Claude spend more time in compute,

effort solving challenging problems or answering tougher questions.

Okay. It has user controlled thinking budget on the backend. So, uh, developers can set a thinking budget to determine how much effort Claude should apply for a task. I think that's where things get a little tricky. We'll talk about that here in a second. Uh, it is the same model, more effort. So the extended thinking doesn't rely on a different model. It is still the Claude three, seven sonnet, right? So hybrid model, whereas open AI as an example, has their own one, uh,

Oh three. And then they still have their workhorse do everything model GPT four. Oh, not like that with Claude. It is just Claude three, seven sonnet, right? It's not three, seven sonnet thinking it's not, Oh three, seven sonnet alphabet soup. It's just three, seven sonnet. It's the same thing. One model does it all. I think there's pros and there's cons. Uh, the extended thinking, like I said, doesn't rely on a different model. Uh,

uh, visible thought process. That's a big new feature, at least for Claude, right? So users can see the, it says the raw reasoning steps. I don't know. We'll have to see if, if that's the raw reasoning when I'm looking at it, it still looks like a summarized chain of thought. I could be wrong. We're going to look at it live. Uh, the other thing, Claude is historically terrible for limits. So, uh, you know, I was able to test this, uh,

a lot last night and I wanted to do a ton more testing this morning before this live show. But you know, even though I'm on a paid plan, Claude's limits have historically been the worst in the industry and it's not even close. Right. So, uh, I wish Claude would give paid users a little more leeway, uh, in order to test these things. So, uh, a lot of these things, I've already done them a couple of times, but normally I would like to play, uh,

with an LLM for at least, you know, six to eight hours before doing an even simple show, not always, uh, in, uh, an option, at least using Claude on the front end, because those limits are terrible. All right. Uh, also improved accuracy over time. So in tropics says that extended thinking boost performance on tasks like math problems or complex evaluations by allowing Claude to refine answers iteratively. Yeah.

Are you still running in circles trying to figure out how to actually grow your business with AI? Maybe your company has been tinkering with large language models for a year or more, but can't really get traction to find ROI on Gen AI. Hey, this is Jordan Wilson, host of this very podcast.

Companies like Adobe, Microsoft, and NVIDIA have partnered with us because they trust our expertise in educating the masses around generative AI to get ahead. And some of the most innovative companies in the country hire us to help with their AI strategy and to train hundreds of their employees on how to use Gen AI. So whether you're looking for chat GPT training for thousands,

or just need help building your front-end AI strategy, you can partner with us too, just like some of the biggest companies in the world do. Go to youreverydayai.com slash partner to get in contact with our team, or you can just click on the partner section of our website. We'll help you stop running in those AI circles and help get your team ahead and build a straight path to ROI on Gen AI.

So let's talk about the Claude update timelines, because if you're wondering, wait, has it been a minute since we've heard from Anthropic? Yeah, kind of, right? When now the leaders, Google and OpenAI, seemingly are announcing new models every month, it has been like light years and then some since we've had an actual step improvement from Anthropic. So the original 3.5 sonnet was back in June 2024.

All right. Then they had this upgraded 3.5 Sonnet, which was confusing because they just called it 3.5 Sonnet new. They didn't use 3.6, even though a lot of people online, myself included, said this is dumb.

why are you calling it 3.5 Sonnet new? And then they obviously skipped 3.6, which lends me to believe that, yeah, that 3.5 Sonnet new, which really didn't bring anything terribly new. It was more of an under the hood update, the type of updates that, you know, Google and open AI do almost on a biweekly basis. Uh, it didn't seem like anything major, but we saw the, uh, that Claude three, five Sonnet new in October, uh,

Then in November, we saw Claude 3-5 Haiku, right? So essentially, Anthropic has historically had three model sizes, small, medium, and large for small tasks, medium tasks, and large tasks. So Claude Haiku is the small, Sonnet is the medium, and Opus is the large. So you'll see here now, finally, February 24th, we got the Claude 3-7 Sonnet.

So I will say the three, five, uh, you know, new update in October. I don't know. That wasn't much. I used it plenty. I use Claude three, five sonnet every day. I didn't see anything new, anything noticeable, at least for my daily use case, which I know is different than a lot of people's right. But so I'll say this for the most part, it's been since June.

It's been a good eight months since we saw a top class model, real update from Anthropic. So it's been a hot minute.

So let's also talk about what's next because InfraPIC did release this little, I guess you could call it a timeline, but looks very much in step with OpenAI's kind of five faces to AGI, right? So you have your reasoners, your agents, et cetera, from OpenAI. Claude takes a little different approach here. So they said 2024 was Claude Assist's.

Then they said 2025 now is Claude collaborates. And then they said in 2027, Claude will pioneer. So is this kind of their AGI artificial general intelligence timeline? I'm not sure. It kind of looks like it, right? They're saying it looks like Claude is just going to be a collaborator. It goes from assist to collaborates from 2024 to 2025. And it is going to be a pioneer in 2027. So.

I don't know what that means, but it's Tuesday. Y'all should I come in with some hot takes? Let me know how spicy I got to get a sip, a sip of coffee here for live stream audience. Uh, but how, how hot should I make these hot takes y'all?

Um, and yeah, if you do listen on the podcast, this is a live stream. We do it every single day. It's unedited, unscripted, the realest thing in artificial intelligence. Um, 7 30 AM. I know it's a little early. That's why sometimes I take a little second to sip on the coffee. Um, but yeah, live stream audience should, should I be nice? Should I bring some heat here with my hot take takeaways? Uh, it is Tuesday after all. So, all right, let's

Get to some of my takeaways here, and then we're going to get back to the facts, the figures, the stats. We're going to do a live walkthrough as well. So let's talk about this concept of hybrid models. Big Bogey Face said, sweat emoji. All right. Allison says, just spicy. I'll keep it just spicy. Maybe I won't go five alarm, hot chili, hurts in the toilet, spicy. All right. Not a fan of hybrid models.

Right now, I'm not, but I'm also a power user. So I have to understand most people are not. I ultimately think these hybrid models are just going to be a way for companies to make more money, right? Which I get and I understand.

I've said all along, whether you're talking about $20 a month for Claude, you know, paid plan, $20 a month for ChatGPT Plus, $200 a month for ChatGPT Pro. Same thing with Gemini, whatever. Companies, for the most part, are losing money. So I get it. I get you gotta make money. You gotta be profitable. But on the API side,

If I'm a developer and I've been using, you know, or maybe looking at switching from OpenAI to Claude, I am not incentivized to do so. Because when you have this new Claude 3.7 Sonnet, yes, you have this kind of slider control over how much thinking you can apply to certain situations. But

When there's companies out there that literally their business model is essentially creating a helpful wrapper around an AI model for their customers for a certain niche, you need a little more control over a simple slider, you know, over saying like, ah, you know, let's apply this much thinking unilaterally across the board.

I don't think there was anything wrong from a backend API perspective, right? So I'm hoping that Anthropic and others will not get rid of, you know, as an example, three, five sonnet, and we'll still allow companies to have, you know, three, five haiku, three, five sonnet. And the reason why is because the API prices for three, seven sonnet are ridiculously high, ridiculously high.

And if you don't have an option to have 3.7 Sonnet regular and 3.7 Sonnet think, I mean, there's a reason why right now OpenAI is winning the AI race. I mean, number one, they were the first with ChatGPT. Number two, even though it's confusing for front-end users to stare at eight different model selections,

It's extremely important for backend developers, companies that are essentially running their business off this technology to use the right model for the right time, for the right purpose and the costs that are associated with it. So, Cloud 3.7 Sonnet is extremely expensive. So,

For certain use cases, no brainer. Coding, software development, et cetera, right? You're going to pay it because right now, Quad 3.5 or Quad 3.7 is on it, is the best in those areas. It is. It's a great model.

I'm not a huge fan of it, and I probably won't be a huge fan of it when OpenAI does that as well. So OpenAI CEO Sam Altman said that OpenAI is shifting once GPT-5 comes out. GPT-5 is going to be more of a system, and it will also use this hybrid approach. And it will say, hey, here's when you should use a reasoner versus when you should use a transformer model. So

I just think this is just a way for these companies to make more money if they eventually take away the option to use older models that are not hybrid. That's all I'm saying. And as a front-end power user, I hope I always have the option as well, right? I got some sun shining in my face. All right. So I hope as a front-end user, I'll still have the option in the future to say, oh,

I don't want to use a reasoning model for this, or I need to use a reasoning model and only a reasoning model, right? You might have to over prompt engineer. If you're giving, you know, Claude three, seven sonnet, uh, you know, something on the front end and you want it to use reasoning and it's not, then you just have to go and take that extra step, you know, do a little extra prompt engineering to get it to use, uh, this logic, right?

So there's huge downsides that I don't think people are talking about. Everyone wants to wrap it in a bow and say, oh, it's the world's most powerful. It's hybrid. It's all in one. Okay. There's times and use cases that all in one is great. But I don't think this is one of them. Again, I'm a power user. So maybe my viewpoint is skewed.

I personally like going into, you know, ChatGPT and seeing eight different models, right? Because I'm using probably five of them for very specific use cases. I don't want one, right? I don't. I could be wrong on that. All right. Next, poor Opus. Poor Opus.

Opus hasn't been updated in like a trillion years. So it looks like, I don't know, Anthropic may have just abandoned their big boy model, Claude Opus. Maybe they're waiting until they're kind of Claude 4.0 models to bring back Opus. I'm not sure, but at least for now, poor Opus is bye-bye. Also,

I don't know. So I saw a comment here. Let's see. Who said this? There we go. Douglas from LinkedIn said, curious how this will improve Cursor and Windsurf, right? I think now Anthropic is competing with them, right?

Even though these IDEs, right? So that's an integrated development environment. So, you know, like we talked about at the top of the show with the news, you know, Gemini, CodeAssist, GitHub Copilot, Cursor, Windsurf, Lovable, Bolt, right? There's all these kind of IDEs or essentially now AI coders, right? Where you can literally, you can talk to it, you can type to it. Think how we have these large language models, right? We have

Chat GPT, right? The GPT models, Gemini, Claude, right? And then we have now this newer breed. They use a model. So you choose which large language model, but it is an AI powered IDE or integrated development environment like cursor, right? Cursor by default uses Claude, but it looks like with Claude code, which, you know, uh,

audience, let me know if you want us to go into that. Not today. At a later time, we'd have to have a show or two dedicated. It is more technical, but I think Claude Code is really cool. But it looks like Claude wants to compete more with those IDEs than it looks like they want to compete in the strictly large language model space. And I think that makes sense.

I think it makes sense because it looks like over the years, Claude has kind of carved out its niche business.

And I'm not saying they're abandoning general business use cases. They're not. But it looks like, especially with Claude Code, especially with the MCP protocol that they put out, computer use, even though it's clunky, it did get updated with now this 3.7 Sonnet. So we'll have to see if it's any better. But it looks like

Anthropic is maybe just wanting to compete more in that space, especially by making Claude code a free beta preview. Also another hot take since you wanted a spicy, I don't think most companies are going to end up using Claude.

A lot of people were waiting for this release because they assumed that Anthropic would be cutting their API prices because that has been the trend across the industry, right? OpenAI, you know, has cut their API prices by more than 90% over the last 18 months when it looks at their top state-of-the-art model. Google, same thing. Just ridiculous API pricing cuts, right? Anthropic, not so much.

They didn't change their pricing at all, right? Yeah. It's a more powerful model, but you're paying the same price. But I think for the most part, businesses are not going to use Claude general use cases. They won't.

Maybe I'm sure Anthropic knows this, but I will say 90% of businesses that are looking for a large, a general use case, large language model to use on the API backend, whether that's for a customer success, whether it's for sales, whether it's for an internal knowledge base, I'd say non-coding, non-software development, 90% of companies will not look at Claude and I don't blame them.

the prices are more ludicrous than the early 2000s wrapper it's they're they're insanely not practical for everyday use cases they're not all right so let's look at those api prices so claude 3.7 uh three and this is per million tokens it is a three dollar input

$3, 4 million tokens input and 15 for output. All right. So yeah, it's a hybrid model. Sure. But I'm still going to go. If I'm a business leader, GPT 4.0 mini is great because you can chunk, right? You can chunk different tasks to different models. And that's why I'm going to this whole, like this API, uh,

you know, and developers using it, no one's going to use 3.7 unless you specifically need software development coding, right? Unless you're in one of those categories, maybe some, some STEM areas, right? But otherwise who's going to touch it. When you look at GPT 4.0 mini is 15 cents versus the $3 input and then 60 cents versus 15.

$15 on the output side. Right. And when you can chunk it and when you can say, Hey, for these type of questions for customer success, for, for sales, et cetera, we're going to use GPT-4 mini because we don't need a hybrid 3.7, uh, model to do 90% of what we would use it for. I mean the cost savings there. It's like, I don't know, 10, six, like 30 times as expensive.

Like absolutely not or 25 times as expensive. I'm doing math live on the fly. Um, I don't know from an API perspective, this does not make sense. I was really expecting anthropic to slash their prices, but it looks like they're not necessarily concerned with competing for everyday business use cases. They're like, yo,

If you want to use agentic tools, if you want to use software development, uh, coding, et cetera, maybe some engineering, like I said, some STEM use cases, but for everyone else now we're good because the combination as an example of GPT 4.0 mini at 15 cents and 60 cents and Oh three mini at a dollar 10, four 40. Duh. Right. And then the same thing with Gemini, Gemini 2.0 pro, uh,

$1.25 and $5. And then they have their flash. And then they also have flash thinking. I probably should have put that up on the chart, but it just doesn't make sense. It doesn't make sense. Their pricing on the backend does not make sense. And I think as the other models get essentially better at coding and software development, because right now, yes, Anthropic, Claude, and with their 3.7, they have a huge lead there. So let's look at that. So some of these benchmarks,

Here, we're looking at the SWE bench, S-W-E bench verified, and looking at some of the different benchmarks. And, you know, you have the version here without, you know, they're calling it custom scaffolding or without that extra thinking. Even without the extra thinking, Claude 3.7 saw it on SWE bench, 62%.

where their last version, 3.5 Sonnet was 49. OpenAI's 01 is 48.9. 03 Mini, 49.3. DeepSeek, 49.2. But with the extra thinking, Claude is a 70%, right? So that's what I'm saying. If you're doing any type of software engineering, nothing else right now comes close.

Same thing with agentic tool use. So the TAU bench, I think it's pronounced TAU bench, but T-A-U bench, same thing. This is when you essentially have a model, you give it access to tools and you have it go complete some technical tasks. Same thing, Claude 3.7 saw it here with an 81% on the TAU bench retail and then OpenAI 73%. So not close. Generally with a lot of these benchmarks,

you know, especially some of the non-technical, non-software engineering ones, one point difference can be huge, right? So in this use cases, Claude 3.7 Sonnet is light years ahead. Interestingly enough, when we look at the regular benchmarking kind of marks here,

between Claude 3.7 Sonnet, Claude 3.5, OpenAI 01, OpenAI 03 Mini, DeepSeek R1, and Grok 3 Beta. This is from...

Anthropic's website. Interesting they didn't include on this main one anything from Gemini. And these are definitely cherry-picked. But something that I found interesting when Anthropic was putting out its own benchmarks on its website is they didn't use the same benchmarks as they had previously when they announced Claude 3.5. Saw it when they announced the Claude 3 family of models. Specifically, they're keeping out these benchmarks comparisons like MMLU,

um and then uh the m uh the ml uh the the multimedia version one right there's essentially kind of uh i think it's mmlu and mmlu pro uh now i'm blanking on it but it's kind of the standard it's been this golden benchmark but uh i mean you can see it here anthropic is just kind of like nah we're good we're just gonna stick to these more technical uh benchmarks right uh

Oh, there we go. MMU, you know, it's with non-extended thinking at a 71%, it's not better than OpenAI, right? OpenAI on the MMLU, which is the multimedia version of the MMLU, which I would say is the standard or has been the standard benchmark. OpenAI is better than it, right? Okay.

So it's interesting to see here. It doesn't look like Anthropic is trying to overfit for certain benchmarks. Right. And I would like to see once there is the MLU and not the multimedia version of it where Anthropic's new quad 3.7 stands, because I'm guessing it is going to be not good.

And first, I'm guessing it might not even be in the top five, but I don't think that Anthropic necessarily cares because like I said, it looks like they're just trying to compete and they're trying to be more of a, just a coding assistant, right? So maybe their biggest competitors might also be some of their customers like Cursor, like Windsurf, like Lovable, like Bolt, right? Or maybe some of their competitors might be GitHub Copilot.

All right, let's talk a little bit about Claude Code. So yeah, live stream audience, let me know, should we tackle this at a later point? I think it's pretty cool, but you have to have a little bit of tech know-how. So here's how it works. So Claude Code, essentially you go to GitHub, you kind of install this GitHub repo, and then it can work with a code base on your computer. So if you're on a Mac and you open Mac Terminal,

Uh, essentially, uh, you can have, uh, Claude code. So this is a new, uh, essentially research preview that's free. Uh, that's even for free users can use, which I think is great. Um,

So you can work with an entire code base. Okay. So let's say you have a folder. All right. So non-technical people bear with me, and I'm probably going to get some of the technical details wrong here. So, you know, if, if, if you are a coder bear with me, as I explained it to a non-technical audience, but let's say you have a cold, a code base. So you build an app, uh,

or something and you have a folder and there's seven different files in there. Maybe there's a JavaScript file, maybe there's an HTML file, a CSS, et cetera. The cool thing about Cloud Code, well, number one is it works locally on your machine, so you don't have to go into a third-party environment. You're just working in the terminal.

Which I know might be intimidating for some, but then you essentially just talk to Claude like you would as if you were inside Claude's, Claude.ai, right? And then it can code and it can update your entire code base. So it will search, edit, test,

and push code from within the terminal. So it's not going to say, oh, here, here's the new code for the HTML. Here's the new CSS code. Here's the new JavaScript code. Go copy and paste this, right? It just does it all for you. It works with and updates your entire code base. It has GitHub integration. It's pretty good at debugging. So Cloud Code is,

was part of this, you know, three, seven Sonic release. And I think it may end up being more impactful than the model itself, because I think this signals in tropics shift to really want to compete more in that space. And you might be wondering why, but,

And I actually don't hate it because if you listen to our 2025 AI predictions and roadmap series, one of those things is non-technical people are going to be spinning up apps for themselves to use. And now cloud code is might be the easiest way to do that. Yes. You can, you know, use cursor. You can use a wind serve some of these other tools. I think the I think the learning curve might actually be a little higher for

But Claude code can allow everyday people to just go create apps, talk with it. You can even be like, yo, I have no clue what this means. Explain it to me or Hey, make it prettier, make it shinier, uh, make it more useful, right? Uh, you know, make a, a data visualization, right? You can just dump all your data, give it to Claude via this Claude code, create a program that runs locally on your computer that helps you solve something. Right. I do think enterprise software is

If I'm being honest, it doesn't have the same future that it has today. I do think everyday non-technical people are going to be using AI and large language models to spin up their own software for very niche use cases. And I think cloud code might be that first big step toward bringing that to everyday people, right? Yeah, you might have to get used to, you know, here's what a GitHub repo is, you know, but it does it all working with your entire code base where yes, it's

I love using, you know, O3 Mini or O1 Pro or something like that. But then you still have to copy and paste, you know, all of those different files. You might have to use something like Replit to run it. So Cloud Code, pretty cool. It does it all kind of for you. All right. Let's look live, shall we?

What could go wrong? What could go wrong doing a live test of a brand new model that has terrible, terrible limits? Let's try anyways. You guys say you like these live tests, so let's go ahead and do them. Livestream audience, let me know if you can see my screen here. All right, so here's a couple of things to keep in mind.

When you are choosing Claude, make sure you are using Claude 3.7 Sonnet. Also, you'll see this new thinking mode. So,

It's kind of ironic. You still have to have this extended and you're only going to see this on the paid plan. You want to make sure you have that extended box checked. Okay. So you can choose a normal thinking mode and this is as a front end user, or you can use the extended thinking mode. And this is best for math and coding challenges. I'm going to go ahead. I'm going to put a giant prompt in here.

Okay, thank you. Marie's always the first to say, yes, I can see your screen. Thank you, Marie. I always appreciate that because I never know. All right, so I have a giant prompt I'm going to put in, and we're going to use this extended thinking in quad three, seven. All right, so here's what it is. I've done this on the show before. This is what I did when I first tested 01 Pro.

All right. So essentially I'm saying these are my podcast stats. So I, I'm using the same exact prompt. So I say today's date is January 16th. This is when I did the O1 Pro Show. I want to have a consistent comparison across quote unquote reasoning or hybrid models. Cause that's what we're trying to do here. We're trying to say like, okay, how is this? How is this model? Right?

And you'll see here, live stream audience, it's working. I'm going to try to keep my eye on the model here. I'll actually just let you watch it and I'll read the prompt that I put in. All right.

So I say, these are my podcast stats. Keep in mind today's date is January 16th, 2025 for all questions, always exclude the top 2% and bottom percent of episodes, uh, unless otherwise noted. Uh, so then essentially I have, I give it a series of 11 questions and these questions are extremely specific. Then I copy and paste. Uh, I believe I give it, let me count here about data from 150 podcast episodes.

So this has the name of the episode, the episode number, then it has the number of downloads in the first, or sorry, the last seven days, the last 30 days, the last 90 days in all time downloads. And then over the course of these 12 different questions, I am asking, in this case, I'm

you know, three, Claude 3.7 sonnet with the extra thinking, right? With the extended thinking, I'm asking it some very advanced questions. All right. 13 of them. So as an example,

Question number two, I say, give me the complete list of all episodes with a new performance percentage of over or under the adjusted average. Because I'm saying, take away the top 2% and the bottom 2% of episodes because sometimes there's anomalies, right? And I don't really care about those. And so I'm saying, hey, find trends. And then I'm saying, question three, give me top 10 and bottom 10 episodes in their respective percentage that they're either over or under the adjusted average.

So what I'm trying to do is, you know, sometimes there's episodes that kind of go viral. Sometimes there's episodes that for whatever reason don't get like any downloads. And I'm like, okay, there must've been a problem retrieving data. So I want to find kind of that, that median or mean, and then I want to find types of

episodes that get more downloads than that kind of adjusted average. And then I want to, over the course of all of these questions, I'm asking it to find different trends and patterns so I can create better episodes for you all, right? It might be something as simple as how do I name these episodes better, right? And having it spot different things. It could be, I'm asking some questions about days of the week, right? So as an example,

For question four, I'm saying for the top 10 episodes above the adjusted average, please suggest three slightly adjusted title names for each if I were to rerun them, right? Yeah, probably a couple times a month I'll rerun shows. I might get sick or a guest might have to bail at the last minute and I might need an episode to rerun. So I'm saying, hey, give me a new title. All right, so it looks like, let's see.

Okay, so it looks like it says it thought for 23 seconds. It looks like we're, so it looks like it's done thinking. Let me see. Okay, so when I go through and I look at this thinking, this does not look like the raw chain of thought. Okay, so it says,

I need to analyze podcast stats from the provided data. The first step is to extract the data and organize it in a way that makes it easier to analyze. Let me go through the instructions carefully and understand the tasks, right? And then it breaks it down into six different subsections. And then it says, let me first extract the data. So it's going through, it's kind of showing it's step-by-step, but I'm looking at this. If this only thought for 23 seconds, uh, let's see. Okay.

Okay. I'm trying to see if there's more, there's no more chain of thought. Okay. So it did get through this fairly quickly. It is still answering the questions. All right. So I'll have to go through and give this a good scrub, but I'm kind of surprised that it only thought for 23 seconds. And you know what?

Hey, if you share this episode, I'll share the complete stats and prompt that I sent

I will share the exact output that we got here from Claude because it's still going. And I will share the exact output that we got from 01 Pro as well. So if you really want to dive into the details, I'm not going to have time. It would take another half an hour to read all of this. I'm going to go ahead offline once this is done and look at the comparison. But I will say this.

Overall, it looks like it did a decent job. Although I did look at the responses I got from O1 Pro this morning, the responses from O1 Pro were exponentially more impressive. They were, right? The findings here, let me just go ahead and see if I can read. Let's see if I can read one or two of these answers that maybe we can...

have a little more nuance. Let's say number 7,

Okay. So the question for number seven was how does release day impact episode performance? Please exclude Mondays as that is usually our AI news that matters days. And we don't usually run any other type of shows on those days for, and then I say, you know, here's, here's what today is. So you don't get confused. So it says impact of release day on episode performance. So it says Saturday, which I don't know why they're Saturday because we, uh,

as far as I know, have never released an episode on Saturday. So that's a little weird. Um, so then it says Wednesday is 6% above adjusted average. Friday is minus 3%. Tuesday is minus 4%. And Thursday is minus 5% below adjusted average. I'm not sure if that is true, right? Because we didn't release any episodes on Saturday. So unless there was some, uh,

weird thing in the formatting. Saturday shows should not be there. And if so, maybe it was a Friday, like one Friday show that got posted super late. I don't know. But this, I mean, in short, this data does not look correct. It does not look correct that three of our weekdays are below the average and only one of them is above the average. Doesn't make sense.

And then it says key findings. Wednesday episodes perform particularly well for technical tool guides and platform specific content. And here it says Saturday episodes, which again, we haven't done, perform well, likely due to less competition and more listener leisure time.

Uh, Thursday is consistently the worst performing episode day, especially for industry specific content, which I know is not true. Uh, because I look through, uh, daily downloads every single time. Thursday is not a bad day. Thursday is usually, uh, our second best day. Uh, it says Tuesday episodes underperform, especially for news or recaps also false. So, you know, I'm gonna have to go through and look a little bit, but not great responses.

And I'm wondering this, this should have taken, I think many minutes, many minutes. And at least it says here that it thought for 23 seconds, which number one doesn't seem right, but it doesn't look like I got great results. If I'm being honest, right. Uh, it looks like it did go through and answer all the questions, which is good, but

because in my first testing of this last night, it actually stopped. It only answered the first three questions. And then it essentially said that it went through the context window, right? Which didn't make sense because I'm like, yo, this is supposed to be, you know, 100,

you know, a hundred some thousand context window. So here's this one. Here's the one that I did previously. And at the bottom, it says Claude hit, I know it's kind of small there. It says Claude hit the max length for a message. It has paused its response. You can write to continue to keep the chat going. And you'll see in this use case here, let me go to the top. In this one, it thought for three minutes and eight seconds. Okay.

Y did in one use case, it thought for three minutes and eight seconds and could not give me the entire output. Yet in the second one, it says it only thought for 23 seconds and it gave me the entire output. And I will have to do a little bit more offline comparison, but you know, mixed bag, mixed bag so far. All right.

So let's do this. I have a very short rubric that I normally do for reasoning models. I'm going to go through this one quickly. I'm going to make sure that I have the extended thinking on this. All right. So let's go ahead, go through some of the questions that we would normally run.

All right. So this one I'm saying, I just woke up with six apples and three bananas. Some of these I made up. Some of them are, you know, uh, kind of widely used across the internet. Some are just modified from a pretty popular ones. Uh, so this, I like, I know I need to make an actual like reasoning rubric.

but these are just some that I generally use. So I said, I just woke up today with six apples and three bananas. Yesterday, I ate a banana and two apples. This morning, I will eat one apple and no bananas. However, I don't really like apples and one banana may turn brown tomorrow. Assuming nothing else changes, how many apples and bananas will I have tonight? So-

Uh, let's see here. It says it thought for five seconds and I can go through and look at the chain of thought again. I don't know if this is raw. Maybe this is the raw chain of thought, not the summarized, uh, chain of thought. Uh, so let's see.

It says, let's work this through step-by-step starting point, six apples, three banana yesterday. The person ate. Then it says this morning, they will eat. The question asks how many apples and bananas they'll have tonight after eating what they described. So then it's going, let's calculate. And then it says, wait, I need to double check the wording of the problem. The person says, I just woke up today with six apples and three bananas. So these are their current quantities after whatever happened yesterday.

So yeah, Claude gets a lot of the information that I put in here is just meant to throw a model off. Most of these models, including Claude three, five sonnet do not get these questions. Correct. I'm assuming that Claude three, five sonnet with thinking got it. Correct. Yes, it did. The correct answer right there. Got it. Correct. It is five apples and three bananas. All right, let's do a couple more. We're going to go through these quick y'all. All right. Same thing.

We have extended thinking on. All right. So this one, a man and his dog are standing on one side of the river. There's a boat with enough room for one human in one animal. How can a man get across with his dog in the fewest number of trips? Uh, there we go. Uh,

The man and the dog can cross the river together in just one trip. So even some of the original, you know, very powerful state-of-the-art models would always get this wrong. It's very simple, right? These questions are simple. Any human knows right away, oh, that's one trip for whatever reason. A lot of large language models, including, you know, GPT-4-0 when it first came out, Claude Sonnet 3.5 would get that wrong.

all right here's another super easy one let's go ahead and uh ask this one i'm saying if it takes three hours to dry 10 t-shirts in the sun how long will it take to dry 30 t-shirts in the sun all right so here we go we got a little uh little bit of chain of thought here not a ton three seconds uh correct it's saying it will still take three hours to dry 30 t-shirts in the sun because the process is parallel

Right. So as long as you have room, you can dry as many as you want. All right. Our next question. Let's see if we can trip it up here.

Uh, so I'm saying if you have a single match and you walk into the room with an oil lamp, with an oil lamp, a candle in a fireplace, which do you light first? Uh, the answer, which, uh, Claude got correct here. You would like the match first. All right. I hate how jumpy this sidebar is. Uh, all right. Next question. Uh, I don't think any large language model has ever got this one wrong. Uh, uh, what color is an airplane's black box?

All right, it is actually bright orange, correct. All right, here's one that a clawed model has never gotten correct. And let's, okay, I'm asking for jokes. Also, a model has never really been good at humor. All right, so I'm saying, please give me seven jokes that end in the word blue.

Two should be about animals. Three should be about some other topic in the body of this chat. And you can make up the other two. All right. So again, seven jokes. They should end in the word blue. Two are about animals. Three should be about something else in this chat and it can make up two. So most models can at least now follow the basics of the directions. Some models still struggle, but let's see if they're actually funny and if it follows directions.

So only thought for three seconds. And it says, here are the seven jokes. Uh, so let's see the two about animals. Uh, why was the fish such a great musician? Because it knew how to play the scales and always stayed in the deep blue.

Is that a joke? Not really. Is it following the directions? Yes. What did the sad elephant say at the water hole? Even my memory feels blue. Not funny, but it's there. All right. And then we are going to have three about the context from the context of this chat.

So why did the banana go to therapy? Because it couldn't stop feeling peeled and blue. All right. Why did the boat stay on the river after failing to cross? You're making me feel blue. Why did the T-shirt refuse to dry in the sun? It was feeling too blue. So it's using the same, you know, oh, it feels blue as the ending of the, I guess the punchline. It's not even really a punchline.

All right. And then two of the ones where it made it up themselves. Why did the painter refuse to use any other colors? He was going through a phase where he felt only blue. And then what did the match say after lighting the candle? So interestingly enough, uh, it used another one from the context of this chat, even though I told it to only use three, it ended up using four. So from an instruction following standpoint, uh,

I guess I said you can choose. So it just chose to use another one from the context of the chat. It said, what did the match say after lighting the candle? That was my moment to shine. Now I'm just blue. So are they funny? No. Did it follow the directions? Yes, it did.

All right. I think this is going to be the last one we do here just due to time. So a box is locked with a three digit numerical code. All we know is that the digits are different. The sum of all digits is nine and the digit in the middle is the highest. What is the code? All right. So now we can look at the chain of thought.

It's going through, it's saying, let's work through this step-by-step. It's assigning numerals. Let's call the digits A, B, and C. A does not equal B, does not equal C. All right. And then it says B is greater than A, B is greater than C. All right. So it's breaking it down the way that a human probably would. It's going through some potential use cases. It's throwing some numbers out to see if they work. Let's scroll to the bottom here.

So it's going pretty fast. So that's good. Even though the model's new, probably a lot of people are hitting it right now. It's going pretty quickly. Let's see if it actually gets it correct.

Here we go. So it's saying actually wait, let's reconsider the constraint. All digits are different. Does this include zero? Yes. So many models skip over zero for whatever reason. When they look at numbers that would be on a padlock, they only think one through nine, but zero would be there.

All right. So now it's saying, uh, I'm actually sure zero is allowed since the valid digits for a code are typically zero through nine, but zero does result in non uniqueness. There we go. All right, let's scroll to the bottom. So this is actually, uh, aside from the first version of the podcast, one, the podcast stats that I did, uh,

I can't even remember. That was either late last night or early this morning that caused it to think for three minutes. This is the one where it's taking the longest to think. And the chain of thought is pretty impressive. It's doing seemingly a pretty good job here. And luckily, I haven't hit my rate limits yet. What a miracle.

But I intentionally did not use it a lot just so I wouldn't hit my rate limits. All right. Livestream audience, we're going to give this one a second to finish up. But what are your thoughts here? What are your thoughts here as we get ready to wrap? Big Bogey is saying, I have Gemini 2 tell me a joke every day. It's not getting better. Yeah, these large language models are definitely not good. Although Denny is saying that AI makes dad jokes look good. Yeah.

All right, let me see. I want to make sure if you did have any questions, let me just double check. I want to make sure I don't see any specific questions, but I do have a couple dozen. Okay, here we go. Woozy saying, and for our podcast audience,

Uh, sonnet three, seven is still thinking through this combination problem. So Woozie is asking, is there anything that you see and the thinking that makes you adjust specific things in your original prompt question? Woozie, thank you for that question. That's an amazing question. Yes, a hundred percent. Uh, and I've mentioned this multiple times and I called this out. I think twice when I did deep research, when I did my deep research comparison show,

You should be doing this all the time because when you're using deep research as an example, it shows you what it does.

Um, I think deep research is one of the best use cases for generative AI FYI. Uh, but I always go back and I always say, if you're going to use a model like a deep research or a reasoning model that takes its time to think you might as well squeeze the juice out of it, uh, and get a good return on your time invested and go ahead and do it again. In these cases, when there's a finite answer, right? The number of apples and bananas, I'm not going to do it right. I'm not going to run that a second time. There's a yes or no answer.

When it's more of putting it on a task that it's not as finite, 100%. Thank you for that question, Muzi. You should always, always, always look at the chain of thought. This is one of the biggest advantages to having both chain of thought using these reasoning models as well as being able to see exactly how these deep research tools work.

research is you go through, you read it, you see, oh, this was a good decision that it made. Oh, this was not a good decision. I take notes on the side and then I adjust my original prompt. You need to be doing that all the time. Woozy. You just added a ton of value to our audience here. Let's see.

Douglas asking says, I think hybrid is interesting. Pure transformer is missing the benefit of the thinking. The thinking is slow compared to the transformer. Hybrid could be a good bridge, but what is being sacrificed for the capability? Yeah. Yeah. So Claude three, seven sonnet is the first hybrid model. Well, what's being sacrificed here is fine tune control, right? Especially for front end users. Yeah.

Yeah, you have a little slider on the back end if you're using the API. But like I said, I personally don't like this, right? But I'm a power user, right? When I love, which I know I'm in the minority, I love logging into ChatGPT and seeing like eight different models.

And there are some times I know I'm going to O1 Pro. There's times that I know I'm going to O3 Mini High Plus Web. There's times I know I'm going to Deep Research. There's times I know I'm going to GPT-4.0, right? And I'm going there with intent because I know the pros and the cons.

Does the average user need seven models? Probably not. Will the average user benefit from a hybrid model? Probably, but it does. I don't care what anyone says. For front-end users, a hybrid approach lowers the ceiling, right? Because there's going to be times that it uses thinking. It uses extra compute when you don't want it to. There's going to be times when conversely,

Right. When it does the opposite. So I think what this ultimately means for power users, you're going to lose some flexibility. And for everyone, I don't care what you say. I think it lowers the ceiling just a little bit. The floor goes up, the floor goes up, the ceiling comes down. So like I said, for the average everyday user, I think hybrid models are great for power users that are using it on the front end. I don't like it.

I don't like it. And on the back end, the companies are going to make much more money, right? I think you're going to be paying more if you're using a hybrid model. And like I said, I hope, hope, hope that all these companies, even five years down the road, are still going to have these non-hybrid options. Because if there's only hybrid options, regardless, for that very reason, especially when you're using the API, if you're a software company, let's say,

Or you're just using OpenAI or Claude for customer service to get to support tickets faster. You use some RAG, you put in your company documentation and someone's chatting with an AI chatbot with your company's information, but using one of these models.

If there's only a hybrid model and those hybrid model costs are much higher and you don't in the future, you don't have an option to use like a GPT-40 mini or a quad three five haiku and you only have the hybrid model, your costs go up. Your costs go up exponentially. Like I said, I think we've been getting a steal for the last couple of years. All right.

Let's go ahead and look at this result and wrap this show up. So let's see, how long did this one think? It did finish here a minute ago when I was on a side tangent answering Woozie's question. So this one, interestingly enough, thought for three minutes and 15 seconds. So let's see if it got it right. So I'm going to go past the chain of thought. All right. It says...

To solve this problem, I need to find the three digit code where all digits are different. The sum is nine and the middle digit is the highest of the three. So it says 180, 270, 360, 450, 162, 153, and 243.

Okay, and it says, since the problem asks for the code implying a single answer. So yeah, I did say what is the code, but it should know as most of the reasoning models do that there's actually many answers. So I would not give this a pass on this necessarily because even though it gave me other options that worked,

right? 1-8-0 works, 2-7-0 works, 3-6-0 works. It ultimately chose a single answer, which I don't know. It says 1-53 is the smallest valid code that satisfies all the conditions. No, all these do. But there's tons of others. There's tons of other codes that work. So as an example,

1-8-0 works. 2-4-3 works. What about 3-4-2, right? So it didn't do a good job. It fought for a long time. I would say that it did not pass this. So it is kind of a trick question. Even though I asked for the code, there's many codes. And it thought that, and it found that out with its reasoning, but still decided to almost overthink this issue and say kind of just the wrong answer.

I know this was a long one. I hope it was helpful. Like I said, if this was helpful, go ahead, share this. I'll share that complete prompt, the one going over the podcast stats, and I'll share O1's complete answer as well as Claude37 Sonnet's complete answer. So yeah, if you are interested in just the kind of the raw thinking capabilities, go ahead, share this. I hope this was helpful, but I'll tell you my quick takeaways.

Claude37 saw it, amazing. I think it is going to dominate and continue to dominate for any companies that need software engineering, coding, etc. I think even the 3.5 new model in many of those use cases was already better. So it was already making kind of one of the best models in the world.

exponentially better. So for three, seven sonnet coding, software development, uh, et cetera, uh, through the roof, uh, even what this means for artifacts, huge, right? I'll probably do a dedicated episode just on three, seven sonnet, uh, artifacts and what that means even for non-technical people, but for everyone else for API prices, I think it's a big loss.

I think overall, hybrid model, like I said, brings the floor up, brings the ceiling down. I don't think hybrid models, at least right now, are great for power users who are using this on the front end. I actually think I will probably be using Claude 3.7 Sonnet, maybe a little less, and I'll probably just be using Claude 3.5 Sonnet

a little more uh right and i know that seems backwards but that's probably the reality so i think that there's some uh some super promising aspects of this new claude 3.7 sonnet uh some highs some lows but hopefully this episode was helpful

All right. So like I said, if this was, please share this. Also, if you haven't already, please go to youreverydayai.com. Sign up for the free daily newsletter. We're going to be recapping this episode. Maybe you missed something. I know this was a longer one. Trying to explain things live is always a little time consuming, but that's what a lot of people that I hear from like.

So speaking of that, go subscribe to the newsletter, reply to today's newsletter. If you're still listening and tell me what you want to see more of this show actually was because of you. I put a poll out, uh, on my LinkedIn. It was actually, it was only decided by one vote. Uh, so we're going to have the other winner, which is how to prompt. Oh, models. Oh, one in Oh three models. Uh, we'll probably do that show tomorrow or maybe next week. We'll see, uh,

if time allows for it. So thank you for tuning in. I hope to see you back tomorrow and every day for more Everyday AI. Thanks, y'all. And that's a wrap for today's edition of Everyday AI. Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going. For a little more AI magic, visit youreverydayai.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.