Today on the AI Daily Brief, Anthropic has just launched Claude 3.7 Sonnet. Before that in the headlines, ChatGPT has hit 400 million weekly active users. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes.
Welcome back to the AI Daily Brief Headlines Edition, all the daily AI news you need in around five minutes. Quick note, for the next couple of episodes, we will be audio only. End of this week, we'll be back to our normal video format as well. We kick off today with an announcement from OpenAI at the end of last week that ChatGPT has hit 400 million weekly active users, surging a full 33% since December. OpenAI hasn't previously disclosed these figures, which show the service is still growing at a rapid rate.
Chief Operating Officer Brad Lightcap posted, ChatGPT recently crossed 400 million weekly active users. We feel very fortunate to serve 5% of the world every week.
2 million plus business users now use ChatGPT at work, and reasoning model API use is up 5x since the o3-mini launch in January. That last number, I think, is hugely significant: o3-mini has kicked up API reasoning model use 5x. Lightcap added that GPT-4.5 and GPT-5 are coming soon, with plans to offer unlimited use of GPT-5 to free users on low inference settings. In comments to CNBC, Lightcap discussed the gulf between hundreds of millions of free users and relatively slow business adoption, stating...
There's a buying cycle there and a learning process that goes into scaling an enterprise business. AI is going to be like cloud services. It's going to be something where you can't run a business that ultimately is not really running on these powerful models underneath the surface. However, the implication, which is completely true from our experience at Superintelligent, is that it just takes time. Even the most obvious things in the world come up against human and organizational inertia that has to be pushed through.
Turning to other topics, Lightcap discussed the DeepSeek moment as validation that AI has entered the zeitgeist rather than as a negative for OpenAI. He commented, DeepSeek is a testament to how much AI has entered the public consciousness in the mainstream. It would have been unfathomable two years ago. It's a moment that shows how powerful these models are and how much people really care. Many people pointed out when they saw these numbers that if this sort of growth rate continues apace, we are going to see a billion ChatGPT users in extremely short order.
Speaking of GPT-4.5, some companies are getting ready. The Verge is reporting that GPT-4.5 could be released as soon as this week. And according to sources familiar with Microsoft's plans, the company is already readying server capacity for GPT-4.5 and GPT-5.
They expect GPT-4.5 to be released imminently. GPT-5, on the other hand, is expected to launch in late May, aligning with Microsoft's Build developer conference. This could represent a much closer working relationship between Microsoft and OpenAI for this year's releases. Microsoft was reportedly blindsided by the release of GPT-4o last May. It offered voice and translation services as well as a big speed boost, all at a cheaper price than Microsoft's services built on GPT-4 Turbo.
It took Microsoft until October to overhaul its services to catch up with OpenAI, which is of course supposed to be its biggest partner here. Now, there have obviously been lots and lots of rumors about a potential breakup of Microsoft and OpenAI, but it appears that in this case at least, Microsoft has been given the heads-up this time around, and so presumably we could expect Copilot updates ready to go shortly after OpenAI's releases.
Sam Altman, meanwhile, has been hyping it up, posting last week that, quote, trying GPT-4.5 has been much more of a feel-the-AGI moment among high-taste testers than I expected. GPT-5, remember, will be a much larger rethink of the company's product line. It'll be the first model to integrate reasoning and non-reasoning into a single model. OpenAI has also suggested that it will devise a way to apply the right amount of inference to each query, doing away with the need for the model selector.
Already, the rumors are starting to build. Lisan al Gaib suggested that OpenAI could already be testing GPT-4.5 in public, routing some o3-mini queries to the new model. Meanwhile, OpenAI rumormonger Riley Coyote passed on whispers that Wednesday will be the release day.
Now, speaking of new models, there is a little bit of controversy swirling around Grok 3's benchmarks, with some doubting the new model from xAI is really a match for OpenAI's o3-mini. The controversy deals specifically with the AIME benchmark, a set of competition math problems. xAI tested their model using a method known as cons@64, or consensus@64. This involves generating 64 responses and selecting the answer that appears most frequently.
Cons@64 is a well-accepted evaluation method, so there's no issue with using it per se. The problem was that xAI compared their result against o3-mini's score on a one-shot method referred to as pass@1. OpenAI had presented that one-shot benchmark to demonstrate that o3-mini was better than o1 even when the older model made 64 attempts. In other words, xAI wasn't making an apples-to-apples comparison.
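To make the distinction concrete, here is a minimal sketch of the two scoring methods in Python. The `solve` callable and the problem format are hypothetical placeholders for illustration, not anything xAI or OpenAI has published.

```python
from collections import Counter

def pass_at_1(problems, solve):
    """pass@1: one attempt per problem; grade that single answer."""
    correct = 0
    for problem in problems:
        answer = solve(problem)  # one model call (hypothetical helper)
        correct += int(answer == problem["answer"])
    return correct / len(problems)

def cons_at_64(problems, solve, k=64):
    """cons@64: sample k answers per problem and grade the most frequent one."""
    correct = 0
    for problem in problems:
        samples = [solve(problem) for _ in range(k)]  # k independent samples
        majority, _ = Counter(samples).most_common(1)[0]
        correct += int(majority == problem["answer"])
    return correct / len(problems)
```

The point of the criticism is visible in the loops: a cons@64 score buys 64 times more test-time compute per problem than a pass@1 score, so putting the two side by side on one chart is not a like-for-like comparison.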
It appeared particularly galling to the OpenAI team because xAI was promoting Grok 3 as the world's smartest AI. Boris Power, head of applied research at OpenAI, posted: "Disappointing to see the incentives for the Grok team to cheat and deceive in evals. TL;DR: o3-mini is better in every eval compared to Grok 3. Grok 3 is genuinely a decent model, but no need to oversell."
Tony Wu, a co-founder of xAI, commented: "Obsession with the pass@1 metric is just stupid. To compare fairly, you have to fix the test-time compute budget, and without disclosing what test-time compute method is used behind o3-mini, we cannot really compare. At the end of the day, it's just about which one is a better product. Also, depending on the product, e.g. consumer product versus API, you may have different requirements in terms of latency or total FLOPs for test-time compute. Try Grok 3 and tell me if you think it's better or worse than o3-mini."
Now, this discussion, which at first glance one could be forgiven for viewing as just the inherent competitiveness of two teams, did spill over into the rest of the AI research community, which discussed how to deal with benchmarks going forward. TeraTax compiled all of the available benchmarks into a single chart, with both pass@1 and cons@64 variants, commenting, I actually believe Grok looks good there, and OpenAI's test-time compute chicanery behind o3-mini-high pass@1 deserves more scrutiny.
Nathan Lambert wrote, I think it's safe to say that xAI and OpenAI both have committed minor chart crimes with thinking models. Frankly, there are no industry norms to lean on. Just expect noise. It's fine. May the best models win. Do your own evals anyway. AIME is practically useless to 99% of people.
And this, I think, is for sure the key point. Every lab still pummels us over the head with these benchmarks as soon as they release their newest thing, saying, look, we've improved, blah, blah, blah. And it fundamentally doesn't matter. I'm sorry, but at this point I am fully on the train that these benchmarks are totally saturated. There's almost no relevant signal left; all of the models are now at the very high end of these things, and they just tell you almost nothing.
I hope we get some more good work on new types of evaluation, because we desperately need it. But at this stage, if you're willing to put in the time and resources, I think the only reasonable answer is to try every type of query, prompt, and challenge against all of the state-of-the-art models and see which one does best. Or, alternatively, just pick one, assume it's going to be close to as good as the state of the art, and trust that it will catch up to the state of the art in a couple of weeks when its maker ships the latest update.
Speaking of which, I think that leads perfectly to our main episode topic, which is Anthropic's launch of Claude 3.7 Sonnet. Today's episode is brought to you by Vanta. Trust isn't just earned, it's demanded. Whether you're a startup founder navigating your first audit or a seasoned security professional scaling your GRC program, proving your commitment to security has never been more critical or more complex. That's where Vanta comes in.
Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001, centralize security workflows, complete questionnaires up to 5x faster, and proactively manage vendor risk. Vanta can help you start or scale up your security program by connecting you with auditors and experts to conduct your audit and set up your security program quickly. Plus, with automation and AI throughout the platform, Vanta gives you time back so you can focus on building your company.
Join over 9,000 global companies like Atlassian, Quora, and Factory who use Vanta to manage risk and improve security in real time.
For a limited time, this audience gets $1,000 off Vanta at vanta.com slash nlw. That's v-a-n-t-a dot com slash nlw for $1,000 off. If there is one thing that's clear about AI in 2025, it's that the agents are coming. Vertical agents by industry, horizontal agent platforms.
Agents per function. If you are running a large enterprise, you will be experimenting with agents next year. And given how new this is, all of us are going to be back in pilot mode.
That's why Superintelligent is offering a new product for the beginning of this year. It's an agent readiness and opportunity audit. Over the course of a couple quick weeks, we dig in with your team to understand what type of agents make sense for you to test, what type of infrastructure support you need to be ready, and to ultimately come away with a set of actionable recommendations that get you prepared to figure out how agents can transform your business.
If you are interested in the agent readiness and opportunity audit, reach out directly to me, nlw at bsuper.ai. Put the word agent in the subject line so I know what you're talking about. And let's have you be a leader in the most dynamic part of the AI market. Welcome back to the AI Daily Brief. Anthropic has just launched Claude 3.7 Sonnet, what they call their most intelligent model to date.
Similar to how OpenAI appears to be describing what GPT-5 is supposed to be, Anthropic calls this a hybrid reasoning model that, quote, produces near-instant responses or extended step-by-step thinking. One model, two ways to think. Now, holding aside whether it actually does that well, it is extremely telling, I think, that this is the new norm going forward. No more separation between reasoning and non-reasoning models. It's just one model to rule them all that can navigate between the two.
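As a rough illustration of what "one model, two ways to think" means in practice, here is a minimal sketch using Anthropic's Python SDK. The exact model identifier and token budgets are assumptions on my part, so treat this as a sketch and verify the details against the current documentation.

```python
import anthropic

client = anthropic.Anthropic()  # reads ANTHROPIC_API_KEY from the environment
MODEL = "claude-3-7-sonnet-20250219"  # assumed snapshot name; check the docs

# Near-instant mode: no extended thinking, the answer comes straight back.
fast = client.messages.create(
    model=MODEL,
    max_tokens=1024,
    messages=[{"role": "user", "content": "Summarize this changelog in two sentences."}],
)

# Extended thinking mode: the same model, given a step-by-step reasoning budget.
deliberate = client.messages.create(
    model=MODEL,
    max_tokens=4096,
    thinking={"type": "enabled", "budget_tokens": 2048},  # adjustable reasoning budget
    messages=[{"role": "user", "content": "Prove that the sum of two odd integers is even."}],
)
```

The design choice the announcement highlights is that the switch between the two behaviors is a request-time parameter rather than a choice between different models in a selector.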
Now, of course, as you would expect, Anthropic announced a bunch of benchmarks to demonstrate how Claude 3.7 Sonnet is a big improvement over its predecessor. They showed an increase in performance on everything from GPQA Diamond, the graduate-level reasoning benchmark, to the AIME. I've just been on my rant about evaluation benchmarks, so I won't repeat it. Ultimately, I think what you can say is that even based on their own numbers, in most of these cases it is a nudge forward rather than a leap forward.
The one exception to that, which we'll come back to, is around coding, where the SWE-bench Verified tests saw a huge improvement, from 49% with Claude 3.5 Sonnet all the way up to 62.3%, and as high as 70.3%, with Claude 3.7 Sonnet. Agentic tool use was also way up, showing a meaningful increase in performance over Claude 3.5 Sonnet as well as OpenAI's o1.
Indeed, this is what led Anthropic to say that Claude 3.7 is a state-of-the-art model for both coding and agentic tool use. They write, in developing it, we've optimized somewhat less for math and computer science competition problems and instead shifted focus towards real-world tasks that better reflect the needs of our users. So at least someone is hearing these rants about benchmarks and what we should be thinking about.
Now, it's very clear that coding is the whole ballgame right now for Anthropic, so we're going to come back to that in a moment. But before that, let's get some first reactions. Rowan Cheung from The Rundown writes, Anthropic just dropped Claude 3.7 Sonnet, the best coding AI model in the world. I was an early tester and it blew my mind. It created this Minecraft clone in one prompt and made it instantly playable in Artifacts. Professor Ethan Mollick writes, It is very, very good. Its vibe coding from language is impressive. Here's a one-shot prompted video game based on the Melville story, Bartleby the Scrivener.
Box's Aaron Levie writes, "Box has been doing evals on it with enterprise docs and it's extremely strong at hard math, logic, content generation, and complex reasoning use cases." Box AI will support Claude 3.7 Sonnet later today in Box AI Studio. Adana Singh writes, "Dude, what? I just asked how many Rs it has. Claude Sonnet 3.7 spun up an interactive learning platform for me to learn it myself." And indeed, while the general impressions were favorable, a lot of those impressions were specifically about coding.
CJZZZ writes, Claude Sonnet 3.7 is built for coders. Don't evaluate it on web search and multimodality evals. Claude is doubling down on what they know best: AI coding. Matt Shumer shared the SWE-bench Verified benchmarks and said this seems to be a huge step up. Flowerslop writes, Claude 3.7 seems to be way ahead in coding compared to o1, o3-mini-high, R1, and Grok 3, according to my first vibe test.
A test I like is whether a model can build a fully functional Doodle Jump clone from scratch. It's right at the edge of what SOTA models almost get right, but not quite. Until now. o1 tried, but the window closed instantly with a console error. o3-mini-high made a basic version, but platforms were too far apart to reach.
R1 had no starting platform, so you'd just fail instantly. Grok 3, even with extra thinking, also crashed instantly. Claude 3.7 nailed it. First try, one prompt, fully working, with the prettiest design and even a funny little doodler. It simply just did it without any flaws or bugs.
And indeed, this is perhaps why coding was not the only part of the announcement. Head of Claude Relations Alex Albert writes: "We're opening limited access to a research preview of a new agentic coding tool we're building: Claude Code. You'll get Claude-powered code assistance, file operations, and task execution directly from your terminal. After installing Claude Code, simply run the `claude` command from any directory to get started. Ask questions about your codebase, let Claude edit files and fix errors, or even have it run bash commands and create git commits."
Alex continues, within Anthropic, Claude Code is quickly becoming another tool we can't do without. Engineers and researchers across the company use it for everything from major code refactors to squashing commits to generally handling the toil of coding. He shared a message from Slack that said, I just want to say Claude Code is very quickly taking over my life and becoming my go-to tool. Truly think there's something very special here.
Pietro Schirano explains it a little further: "Claude Code is a command-line tool that lets developers delegate substantial engineering tasks to Claude directly from their terminal. In early testing, Claude completed tasks in one pass that would normally take 45 minutes of manual work." Not Adam Paul writes: "Claude Code is an in-terminal coding agent and it's objectively the coolest thing a frontier company has shipped since GPT-4. Here I get it to read my project specs and tell me what's left to implement against the codebase. Haven't even started coding with it yet and I'm hooked."
Now, to the extent that anyone had any concern, it was around price. Harrison Kinsley writes, Claude Code is really nice. The UI is so wonderful. I like the action-type rules. Well done. Prepare to spend up to $5 an hour running it, potentially more. Deja Vu Coder responded, more like 5 USD per 20 minutes. Others, like Anthropic's Catherine Olsson, jumped in to talk about where it wasn't perfect. She writes, Claude Code is very useful, but it can still get confused. A few quick tips from my experience coding with it at Anthropic.
One, work with a clean commit so it's easy to reset all the changes. Two, sometimes I work on two dev boxes at the same time, one for me, one for Claude Code, so we're both trying ideas in parallel. And so on and so forth. And I actually think that this is a super valuable category of information. Not only does sharing this stuff build trust with your users, it also guides them to use your tools more effectively. Overall, I tend to agree with Benjamin De Kraker, who writes, I have a hunch that Claude Code, the terminal coder, is a bigger deal than many people realize.
Certainly, there is a sense that, combined with the other updates, we are in the middle of another big shift. Professor Ethan Mollick just published a new piece on his One Useful Thing blog called A New Generation of AIs: Claude 3.7 and Grok 3. Yes, AI suddenly got better again. For tomorrow's episode, I'm going to be doing a look at what's evolving faster and what's evolving slower in AI than people might have imagined, so we'll definitely be coming back to some of this assessment.
For now, though, I'm excited to go dive into Claude 3.7 Sonnet myself. And I hope that when you test it out, you come back and tell us what you found as well. For now, that is going to do it for today's episode of the AI Daily Brief. Appreciate you listening as always. And until next time, peace.