People
Aaron Levie
Adam Paul
Adana Singh
Alex Albert
Benjamin De Kraker
Boris Power
Brad Lightcap
CJZZZ
Catherine Olson
Flowerslop
Harrison Kinsley
Nathan Lambert
NLW (Narrator)
Pietro Schirano
Professor Ethan Mollick
Rowan Cheung
Sam Altman
Leading OpenAI toward AGI and superintelligence, redefining the path of AI development, and driving the commercialization and application of AI technology.
Tony Wu
Topics
Brad Lightcap: I'm proud to announce that ChatGPT has surpassed 400 million weekly active users, which means we are serving 5% of the world's population every week. Enterprise adoption of AI takes time, though, because of buying cycles, learning curves, and inherent human and organizational inertia. The DeepSeek moment also proved that AI has entered mainstream public consciousness.
Sam Altman: I think GPT-4.5 gave high-taste testers a strong feel-the-AGI moment, and the upcoming GPT-5 will be a major rethink of OpenAI's product line: it will integrate reasoning and non-reasoning capabilities into a single model that can switch between the two.
Boris Power: I'm disappointed to see the incentives for the Grok team to cheat and deceive in evals. TLDR: O3 Mini is better than Grok 3 in every eval. Grok 3 is genuinely a decent model, but there's no need to oversell it.
Tony Wu: Obsession with a single metric (pass@1) is just stupid. To compare fairly, you have to fix the test compute budget, and without disclosure of the test-time compute method used behind O3 Mini, we cannot really compare. At the end of the day, it's just about which one is the better product. Also, depending on the product (e.g., consumer product versus API), you may have different requirements for test-time compute latency or total FLOPs. Try Grok 3 and tell me whether you think it's better or worse than O3 Mini.
Nathan Lambert: I think it's safe to say that XAI and OpenAI have both committed minor chart crimes with thinking models. Frankly, there are no industry norms to lean on. Just expect noise. It's fine. May the best models win. Do your own evals anyway. AIME is practically useless to 99% of people.
NLW (Narrator): I'm fully convinced these benchmark results are meaningless. All the models now sit at the top of these metrics, which tell you almost nothing useful; we need new evaluation methods. Anthropic's Claude 3.7 Sonnet is a hybrid reasoning model that can switch between near-instant responses and step-by-step thinking. On most benchmarks, Claude 3.7 Sonnet is only a modest improvement, but it made significant progress on coding. Anthropic focused Claude 3.7 Sonnet on real-world tasks rather than math and computer-science competition problems.
Rowan Cheung: Claude 3.7 Sonnet from Anthropic is the best AI coding model in the world. It blew my mind by creating a playable game in a single prompt.
Professor Ethan Mollick: Claude 3.7 Sonnet is very, very good; its translation from language to code is impressive.
Aaron Levie: Box's evals show Claude 3.7 Sonnet is extremely strong at math, logic, content generation, and complex reasoning.
Adana Singh: Claude 3.7 Sonnet was able to build an interactive learning platform to help the user learn.
CJZZZ: Claude Sonnet 3.7 is built for coders; don't evaluate it on web search and multimodality evals.
Flowerslop: Based on my testing, Claude 3.7 is way ahead of other models in coding; it handled a Doodle Jump clone with ease.
Alex Albert: We're opening research-preview access to Claude Code, a new agentic coding tool we're building. Within Anthropic, Claude Code is quickly becoming a tool we can't do without.
Pietro Schirano: Claude Code can complete tasks that would take 45 minutes of manual work.
Adam Paul: Claude Code is an in-terminal coding agent, and it's the coolest thing a frontier company has shipped since GPT-4.
Harrison Kinsley: Claude Code is really good; the UI is great, and I like its action-type rules. But running it can cost up to $5 an hour, possibly more.
Catherine Olson: Claude Code is very useful, but it can still get confused. I recommend working from a clean commit, and you can work in parallel with Claude Code.
Benjamin De Kraker: I have a hunch that Claude Code (the terminal coder) is a bigger deal than many people realize.

Chapters
OpenAI's ChatGPT surpasses 400 million weekly active users, showcasing rapid growth. Discussion includes the upcoming GPT-4.5 and GPT-5 models, with speculation about their release dates and capabilities, including integration of reasoning and non-reasoning into a single model.
  • ChatGPT surpasses 400 million weekly active users.
  • GPT-4.5 expected release soon, GPT-5 in late May.
  • GPT-5 to integrate reasoning and non-reasoning into a single model.

Shownotes Transcript

Today on the AI Daily Brief, Anthropic has just launched Claude 3.7 Sonnet. Before that in the headlines, ChatGPT has hit 400 million weekly active users. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes.

Welcome back to the AI Daily Brief Headlines Edition, all the daily AI news you need in around five minutes. Quick note, for the next couple of episodes, we will be audio only. End of this week, we'll be back to our normal video format as well. We kick off today with an announcement from OpenAI at the end of last week that ChatGPT has hit 400 million weekly active users, surging a full 33% since December. OpenAI hasn't previously disclosed these figures, which show the service is still growing at a rapid rate.

Chief Operating Officer Brad Lightcap posted: "ChatGPT recently crossed 400 million weekly active users. We feel very fortunate to serve 5% of the world every week." (400 million of the world's roughly 8 billion people is indeed about 5%.)

2 million plus business users now use ChatGPT at work, and reasoning model API use is up 5x since the O3 Mini launch in January. That last number, I think, is hugely significant: O3 Mini has kicked up API reasoning model use 5x. Lightcap added that GPT-4.5 and GPT-5 are coming soon, with plans to offer unlimited use of GPT-5 to free users on low inference settings. In comments to CNBC, Lightcap discussed the gulf between hundreds of millions of free users and relatively slow business adoption, stating...

There's a buying cycle there and a learning process that goes into scaling an enterprise business. AI is going to be like cloud services. It's going to be something where you can't run a business that ultimately is not really running on these powerful models underneath the surface. However, the implication, which is completely true from our experience at Superintelligent, is that it just takes time. Even the most obvious things in the world come up against human and organizational inertia that has to be pushed through.

Turning to other topics, Lightcap discussed the DeepSeek moment as validation that AI has entered the zeitgeist rather than as a negative for OpenAI. He commented, DeepSeek is a testament to how much AI has entered the public consciousness in the mainstream. It would have been unfathomable two years ago. It's a moment that shows how powerful these models are and how much people really care. Many people pointed out when they saw these numbers that if this rate of growth continues apace, we are going to see a billion ChatGPT users in extremely short order.

Speaking of GPT-4.5, some companies are getting ready. The Verge is reporting that GPT-4.5 could be released as soon as this week. And according to sources familiar with Microsoft's plans, the company is already readying server capacity for GPT-4.5 and GPT-5.

They expect GPT-4.5 to be released imminently. GPT-5, on the other hand, is expected to launch in late May, aligning with Microsoft's Build developer conference. This could represent a much closer working relationship between Microsoft and OpenAI for this year's releases. Microsoft was reportedly blindsided by the release of GPT-4o last May, which offered voice and translation services as well as a big speed boost, all at a cheaper price than Microsoft's services built on GPT-4 Turbo.

It took Microsoft until October to overhaul their services to catch up with OpenAI, who is of course supposed to be their biggest partner here. Now, there have been obviously lots and lots of rumors about the potential breakup of Microsoft and OpenAI, but it appears that in this case at least, Microsoft has been given the heads up this time around, and so presumably we could expect co-pilot updates ready to go shortly after OpenAI's releases.

Sam Altman, meanwhile, has been hyping it up, posting last week that, quote, trying GPT-4.5 has been much more of a feel-the-AGI moment among high-taste testers than I expected. GPT-5, meanwhile, will, remember, be a much larger rethink of the company's product line. It'll be the first model to integrate reasoning and non-reasoning into a single model. OpenAI has also suggested that it will devise a way to apply the right amount of inference to each query, doing away with the need for the model selector.
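
OpenAI hasn't published anything about how that routing would work, but as a purely hypothetical sketch, applying the right amount of inference per query might look something like the following. The difficulty heuristic, thresholds, and token budgets are all invented for illustration.

```python
# Hypothetical sketch of query-aware inference routing (illustrative only).
# OpenAI has not described GPT-5's mechanism; names and thresholds are invented.

def estimate_difficulty(query: str) -> float:
    """Crude stand-in for a learned difficulty classifier."""
    hard_markers = ("prove", "step by step", "debug", "optimize", "why")
    score = sum(marker in query.lower() for marker in hard_markers)
    return min(score / len(hard_markers), 1.0)

def route(query: str) -> dict:
    """Pick an inference budget instead of asking the user to pick a model."""
    difficulty = estimate_difficulty(query)
    if difficulty < 0.2:
        return {"mode": "fast", "reasoning_tokens": 0}        # near-instant reply
    if difficulty < 0.6:
        return {"mode": "think", "reasoning_tokens": 4_000}   # light reasoning
    return {"mode": "think", "reasoning_tokens": 32_000}      # extended reasoning

print(route("What's the capital of France?"))
print(route("Prove this algorithm is O(n log n) and debug the edge case."))
```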

Already, the rumors are starting to build. Lisan Al-Gaib suggested that OpenAI could already be testing GPT-4.5 in public, routing some O3 mini queries to the new model. Meanwhile, OpenAI rumor monger Riley Coyote passed on whispers that Wednesday will be the release day.

Now, speaking of new models, there is a little bit of controversy swirling around Grok 3's benchmarks, with some doubting the new model from XAI is really a match for OpenAI's O3 Mini. The controversy deals specifically with the AIME benchmark, a set of competitive math problems. XAI tested their model using a method known as cons@64, or best of 64. This involves generating 64 responses and selecting the answer that appeared most frequently.

Best of 64 is a well-accepted benchmark standard, so there's no issue with using it per se. The problem was that XAI compared their result against O3 Mini's benchmark using a one-shot scoring method referred to as pass@1. OpenAI had presented this one-shot benchmark to demonstrate that O3 Mini was better than O1 even when the older model made 64 attempts. In other words, XAI wasn't making an apples-to-apples comparison.
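
To make the mismatch concrete, here's a minimal sketch of the two scoring methods. The toy answer sampler is an assumption standing in for real model generations, but it shows why a majority vote over 64 samples tends to score far higher than a single attempt.

```python
import random
from collections import Counter

def sample_answer(problem: dict) -> str:
    """Stand-in for one stochastic model generation (assumption for illustration)."""
    return random.choice(problem["candidate_answers"])

def pass_at_1(problems: list[dict]) -> float:
    """pass@1: one attempt per problem, scored directly."""
    return sum(sample_answer(p) == p["truth"] for p in problems) / len(problems)

def cons_at_64(problems: list[dict]) -> float:
    """cons@64: 64 attempts per problem; the most frequent answer is submitted."""
    correct = 0
    for p in problems:
        votes = Counter(sample_answer(p) for _ in range(64))
        correct += votes.most_common(1)[0][0] == p["truth"]
    return correct / len(problems)

# A model that is right ~50% of the time per sample...
problems = [{"truth": "42", "candidate_answers": ["42", "42", "41", "40"]}] * 20
print(f"pass@1  ≈ {pass_at_1(problems):.2f}")   # hovers around 0.5
print(f"cons@64 ≈ {cons_at_64(problems):.2f}")  # majority voting pushes it near 1.0
```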

It appeared particularly galling to the OpenAI team as XAI was promoting Grok 3 as the world's smartest AI. Boris Power, the head of applied research at OpenAI, posted: "Disappointing to see the incentives for the Grok team to cheat and deceive in evals. TLDR: O3 Mini is better in every eval compared to Grok 3. Grok 3 is genuinely a decent model, but no need to oversell."

Tony Wu, the co-founder of XAI, commented: "Obsession with the pass@1 metric is just stupid. To compare fairly, you have to fix the test compute budget, and without disclosing what test-time compute method is used behind O3 Mini, we cannot really compare. At the end of the day, it's just about which one is a better product. Also, depending on the product, e.g. consumer product versus API, you may have different requirements in terms of latency or total FLOPs for test-time compute. Try Grok 3 and tell me if you think it's better or worse than O3 Mini."

Now, this discussion, which at first glance one could be forgiven for viewing as just the inherent competitiveness of two teams, did spill over into the rest of the AI research community, which discussed how to deal with benchmarks moving forward. TeraTax compiled all of the available benchmarks into a single chart, with both one-shot and best-of-64 variants, commenting: I actually believe Grok looks good there, and OpenAI's test-time compute chicanery behind O3 Mini High's pass@1 deserves more scrutiny.

Nathan Lambert wrote, I think it's safe to say that XAI and OpenAI both have committed minor chart crimes with thinking models. Frankly, there are no industry norms to lean on. Just expect noise. It's fine. May the best models win. Do your own evals anyway. AIME is practically useless to 99% of people.

And this, I think, is for sure the key point. Every model maker still pummels us over the head with these benchmarks as soon as they release their newest thing, saying, look, we've improved, blah, blah, blah. And it fundamentally doesn't matter. I'm sorry, but at this point, I am fully on the train that these benchmarks are totally saturated. There's almost no relevant signal left: all of the models are now at the very high end of these things, and they tell you almost nothing.

I hope we get some more good work on thinking about new types of evaluation because we desperately need it. But at this stage, I think that there's no other reasonable answer if you're willing to take the time and the resources to do it than to just try every type of query and every type of prompt and every type of challenge against all of the state-of-the-art and see which one does best. That or alternatively, just pick one, assume that it's going to be close to as good as the state-of-the-art and will be as good as the state-of-the-art in a couple of weeks when they ship the latest update.
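
For anyone who takes that advice, here's a minimal sketch of what a do-it-yourself eval loop can look like. `ask_model` is a placeholder you'd wire up to each provider's real API client, and the prompts, graders, and model names are all assumptions for illustration.

```python
from typing import Callable

def ask_model(model: str, prompt: str) -> str:
    """Placeholder: wire this up to each provider's real API client."""
    raise NotImplementedError(f"No client configured for {model}")

# Each eval is a prompt plus a pass/fail grader; both are illustrative assumptions.
EVALS: list[tuple[str, Callable[[str], bool]]] = [
    ("Summarize this contract clause in one sentence: ...", lambda out: len(out) < 300),
    ("Write a Python function that reverses a linked list.", lambda out: "def " in out),
]

MODELS = ["o3-mini", "claude-3-7-sonnet", "grok-3"]  # illustrative model names

def run_evals() -> dict[str, float]:
    """Score every model on every eval and return a pass rate per model."""
    scores: dict[str, float] = {}
    for model in MODELS:
        passed = sum(grader(ask_model(model, prompt)) for prompt, grader in EVALS)
        scores[model] = passed / len(EVALS)
    return scores
```

The point is less the harness than the prompts: the closer they are to your actual workload, the more signal you get than any public leaderboard provides.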

Speaking of which, I think that leads perfectly to our main episode topic, which is Anthropic's launch of Claude 3.7 Sonnet. Today's episode is brought to you by Vanta. Trust isn't just earned, it's demanded. Whether you're a startup founder navigating your first audit or a seasoned security professional scaling your GRC program, proving your commitment to security has never been more critical or more complex. That's where Vanta comes in.

Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001. Centralize security workflows, complete questionnaires up to 5x faster, and proactively manage vendor risk. Vanta can help you start or scale up your security program by connecting you with auditors and experts to conduct your audit and set up your security program quickly. Plus, with automation and AI throughout the platform, Vanta gives you time back so you can focus on building your company.

Join over 9,000 global companies like Atlassian, Quora, and Factory who use Vanta to manage risk and improve security in real time.

For a limited time, this audience gets $1,000 off Vanta at vanta.com slash nlw. That's v-a-n-t-a dot com slash nlw for $1,000 off. If there is one thing that's clear about AI in 2025, it's that the agents are coming. Vertical agents by industry, horizontal agent platforms.

Agents per function. If you are running a large enterprise, you will be experimenting with agents next year. And given how new this is, all of us are going to be back in pilot mode.

That's why Superintelligent is offering a new product for the beginning of this year: an agent readiness and opportunity audit. Over the course of a couple quick weeks, we dig in with your team to understand what type of agents make sense for you to test, what type of infrastructure support you need to be ready, and ultimately come away with a set of actionable recommendations that get you prepared to figure out how agents can transform your business.

If you are interested in the agent readiness and opportunity audit, reach out directly to me, nlw at bsuper.ai. Put the word agent in the subject line so I know what you're talking about. And let's have you be a leader in the most dynamic part of the AI market. Welcome back to the AI Daily Brief. Anthropic has just launched Claude 3.7 Sonnet, what they call their most intelligent model to date.

Similar to how OpenAI appears to be describing what GPT-5 is supposed to be, Anthropic calls this a hybrid reasoning model that, quote, produces near-instant responses or extended step-by-step thinking. One model, two ways to think. Now, holding aside whether it actually does that well, it is extremely telling, I think, that this is the new norm going forward. No more the separation between reasoning and non-reasoning models. It's just one model to rule them all that can navigate between the two.
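
Concretely, Anthropic exposes the two modes through a single parameter on the same model. Here's a minimal sketch using the Anthropic Python SDK; the model ID and token budgets are assumptions to check against Anthropic's current documentation.

```python
# Minimal sketch of toggling Claude 3.7 Sonnet's extended thinking via
# Anthropic's Python SDK. Model ID and budget values are assumptions;
# verify against Anthropic's docs for current names and limits.
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment

# Near-instant mode: omit the `thinking` parameter entirely.
fast = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this in one line: ..."}],
)

# Extended thinking mode: same model, now given a thinking-token budget.
slow = client.messages.create(
    model="claude-3-7-sonnet-20250219",
    max_tokens=16000,  # must exceed the thinking budget
    thinking={"type": "enabled", "budget_tokens": 8000},
    messages=[{"role": "user", "content": "Work through this carefully: ..."}],
)
```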

Now, of course, as you would expect, Anthropic announced a bunch of benchmarks to demonstrate how Claude 3.7 Sonnet is a big improvement over its predecessor. They showed an increase in performance on everything from GPQA Diamond, the graduate-level reasoning benchmark, to the AIME. I've just been on my rant about evaluation benchmarks, so I won't repeat that again. Ultimately, I think what you can say is that even based on their own sharing, in most of these cases, it is a nudge forward rather than a leap forward.

The one exception to that, which we'll come back to, is around coding, where the SWE-bench Verified tests saw a huge improvement, from 49% with Claude 3.5 Sonnet all the way up to 62.3%, and as high as 70.3%, with Claude 3.7 Sonnet. Agentic tool use was also way up, showing a meaningful increase in performance over Claude 3.5 Sonnet as well as OpenAI's O1.

Indeed, this is what led Anthropic to say that Claude 3.7 is a state-of-the-art model for both coding and agentic tool use. They write, in developing it, we've optimized somewhat less for math and computer science competition problems and instead shifted focus towards real-world tasks that better reflect the needs of our users. So at least someone is hearing these rants about benchmarks and what we should be thinking about.

Now, it's very clear that coding is the whole ballgame right now for Anthropic, so we're going to come back to that in a moment. But before that, let's get some first reactions. Rowan Cheung from The Rundown writes, Anthropic just dropped Claude 3.7 Sonnet, the best coding AI model in the world. I was an early tester and it blew my mind. It created this Minecraft clone in one prompt and made it instantly playable in artifacts. Professor Ethan Mollick writes, It is very, very good. Its vibe coding from language is impressive. Here's a one-shot prompted video game based on the Melville story Bartleby the Scrivener.

Box's Aaron Levie writes, "Box has been doing evals on it with enterprise docs and it's extremely strong at hard math, logic, content generation, and complex reasoning use cases." Box AI will support Claude 3.7 Sonnet later today in the Box AI Studio. Adana Singh writes, "Dude, what? I just asked how many Rs it has. Claude Sonnet 3.7 spun up an interactive learning platform for me to learn it myself." And indeed, while the general impressions were favorable, a lot of those impressions were specifically about coding.

CJZZZ writes, Claude Sonnet 3.7 is built for coders. Don't evaluate it on web search and multimodality evals. Claude is doubling down on what they know best: AI coding. Matt Shumer shared the SWE-bench Verified benchmarks and said this seems to be a huge step up. Flowerslop writes, Claude 3.7 seems to be way ahead in coding compared to O1, O3 Mini High, R1, and Grok 3, according to my first vibe test.

A test I like is whether a model can build a fully functional Doodle Jump clone from scratch. It's right at the edge of what SOTA models almost get right, but not quite. Until now. O1 tried, but the window closed instantly with a console error. O3 Mini High made a basic version, but platforms were too far apart to reach.

R1 had no starting platform, so you'd just fail instantly. Grok 3, even with extra thinking, also crashed instantly. Claude 3.7 nailed it. First try, one prompt, fully working, with the prettiest design and even a funny little doodler. It simply did it without any flaws or bugs.

And indeed, this is perhaps why that was not the only part of the announcement. Head of Claude Relations Alex Albert writes: "We're opening limited access to a research preview of a new agentic coding tool we're building: Claude Code. You'll get Claude-powered code assistance, file operations, and task execution directly from your terminal. After installing Claude Code, simply run the `claude` command from any directory to get started. Ask questions about your codebase, let Claude edit files and fix errors, or even have it run bash commands and create git commits."

Alex continues, within Anthropic, Claude Code is quickly becoming another tool we can't do without. Engineers and researchers across the company use it for everything from major code refactors to squashing commits to generally handling the toil of coding. He shared a message from Slack that said, I just want to say Claude Code is very quickly taking over my life and becoming my go-to tool. Truly think there's something very special here.

Pietro Schirano explains it a little bit further: "Claude Code is a command-line tool that lets developers delegate substantial engineering tasks to Claude directly from their terminal. In early testing, Claude completed tasks that would normally take 45 minutes of manual work." Adam Paul writes: "Claude Code is an in-terminal coding agent and it's objectively the coolest thing a frontier company has shipped since GPT-4. Here I get it to read my project specs and tell me what's left to implement against the codebase. Haven't even started coding with it yet and I'm hooked."

Now, to the extent that anyone had any concern, it was around price. Harrison Kinsley writes, Claude Code is really nice. The UI is so wonderful. I like the action type rules. Well done. Prepare to spend up to $5 an hour running it, potentially more. Deja Vu Coder responded, more like 5 USD per 20 minutes. Others, like Anthropic's Catherine Olson, jumped in to talk about where it wasn't perfect. She writes, Claude Code is very useful, but it can still get confused. A few quick tips from my experience coding with it at Anthropic.

One, work with a clean commit so it's easy to reset all the changes. Two, sometimes I work on two dev boxes at the same time, one for me, one for Claude Code, and we're both trying ideas in parallel. And so on and so forth. And I actually think that this is a super valuable category of information. Not only does sharing this stuff build trust with your users, it also guides them to use your tools more effectively. Overall, I tend to agree with Benjamin De Kraker, who writes, I have a hunch that Claude Code, the terminal coder, is a bigger deal than many people realize.

Certainly, there is a sense that, combined with the other updates, we are in the middle of another big shift. Professor Ethan Mollick, again, just published a new piece on his One Useful Thing blog called A New Generation of AIs: Claude 3.7 and Grok 3. Yes, AI suddenly got better again. For tomorrow's episode, I'm going to be doing a look at what's evolving faster and what's evolving slower in AI than people might have imagined. And so we'll definitely be coming back to some of this assessment.

For now, though, I'm excited to go dive into Claude 3.7 Sonnet myself. And I hope that when you test it out, you come back and tell us what you found as well. For now, that is going to do it for today's episode of the AI Daily Brief. Appreciate you listening as always. And until next time, peace.