O3 is OpenAI's second-generation reasoning model. The company skipped O2 to avoid an intellectual property dispute with a large British telecom company.
O3 outperformed O1 by nearly 23 percentage points on a standard coding benchmark and surpassed OpenAI's chief scientist on Codeforces, ranking among the top 200 in the world.
O3 achieved a near-perfect score on the AIME math exam, missing only one question.
O3 crossed the 85% human-performance threshold on the ARC-AGI test, roughly tripling O1's score. This test measures a model's ability to handle novel problems that are difficult to pre-train, focusing on reasoning capabilities.
Chollet acknowledged O3 as a significant breakthrough in AI's ability to adapt to novel tasks but noted that it is not yet AGI, as there are still easy tasks it cannot solve.
O3's coding abilities suggest it could outperform 99.95% of programmers on competitive coding platforms, raising concerns about job displacement in the coding industry.
While O3 excels in competitive coding challenges, it may not be as effective in real-world programming tasks that require broader problem-solving and collaboration skills.
Deedy Das noted that O3 achieved a 25% success rate on a highly challenging math benchmark created by math professors, a feat no other model has come close to.
Harry Law argued that at $3,000 per task, O3 is already more cost-effective than hiring McKinsey, highlighting its potential as a labor-replacement tool despite its high compute costs.
Mollick argues that societal and organizational change will be slower than technological advancements due to human inertia, giving society time to adapt to AI's capabilities.
Today on the AI Daily Brief, did we just get AGI for Christmas? The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes. Hello, friends. For our last regular AI Daily Brief episode of the year, we are skipping the headlines. It's mostly small things, a couple of new AI appointments to the White House, stuff like that. Instead, we are going to spend all of our time on the big discussion from the last three days, which is whether OpenAI just gave us AGI. What's going on, guys? It has been a very interesting 12 Days of Shipmas from OpenAI. We kicked it off with the full version of O1. Maybe the biggest announcement was Sora. But then by the end of last week, it seemed like just maybe we were actually going to get an entirely new model. If you listened to my Friday episode, you heard all of the evidence that we were going to get O3. And indeed, that is what happened.
Specifically, on Friday, OpenAI announced their second generation of reasoning models, O3 and O3 Mini. Now, if you're wondering what the hell the name is about, the company skipped O2 in order to avoid an intellectual property dispute with the large British telco of the same name. Sam Altman said that the company was simply upholding its tradition of being truly bad at names.
And just to cut to the chase, while the announcement itself was relatively muted, the conversation that has followed has been all about whether this actually represents something close to AGI. So today we're going to explore all of those arguments and what we should actually think about this. Now, certainly from the numbers they shared, the model seems very good. On the standard coding benchmark SWE-bench Verified, O3 bettered O1 by almost 23 percentage points.
It also bested the company's chief scientist on the competitive coding platform Codeforces. In fact, right now, there are fewer than 200 people in the world with a better score on Codeforces, 174 to be exact.
In somewhat understated fashion, then, Altman said that the model is, quote, incredible at coding. The model also achieved a near-perfect score on the AIME math exam, missing only a single question. It achieved an 87.7% on the expert-level science benchmark GPQA Diamond, far exceeding top human performance. Still, while those benchmark results are practical and important, a significant amount of focus, at least in the announcement, was on how O3 performed on the ARC-AGI test.
That test attempts to measure a model's ability to deal with novel problems that are difficult to pre-train. It's viewed as testing reasoning capability at the bare minimum and is one plausible benchmark for when AGI has been achieved.
O3 crossed the 85% human performance threshold for AGI, tripling the score achieved by O1. This year's ARC Prize winner scored 53.5% using a fine-tuned model of a novel design, and only a handful of attempts have managed to score higher than 30%, just to give a sense of how far the bar was raised. One of the interesting things about the test is that it's relatively easy for humans to solve using basic logic and reasoning, but has so far stumped AI models.
You might have seen these grids of red and blue boxes over the past few days, and this is one of the hardest problems on the test.
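To make the format concrete, here is a minimal, hypothetical Python sketch of an ARC-style task. The grids and the mirror rule are invented for illustration and are far simpler than real test items, though the train/test structure loosely follows the JSON format of the public ARC dataset.

```python
# Toy illustration of an ARC-style task (not an actual test item).
# Each task provides a few input/output grid pairs; the solver must infer
# the transformation rule and apply it to a new input. Grids are small
# arrays of color codes.

task = {
    "train": [
        {"input": [[1, 0], [0, 0]], "output": [[0, 1], [0, 0]]},
        {"input": [[0, 0], [2, 0]], "output": [[0, 0], [0, 2]]},
    ],
    "test": [{"input": [[3, 0], [1, 0]]}],
}

def mirror_horizontally(grid):
    """The hidden rule for this toy task: flip each row left to right."""
    return [list(reversed(row)) for row in grid]

# A human spots the rule from two examples. The point of ARC-AGI is that
# every task encodes a novel rule, so it can't simply be memorized during
# pre-training.
for pair in task["train"]:
    assert mirror_horizontally(pair["input"]) == pair["output"]

print(mirror_horizontally(task["test"][0]["input"]))  # [[0, 3], [0, 1]]
```

A human infers the rule from a couple of examples in seconds; until O3, models consistently failed at exactly this kind of on-the-fly rule induction.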
François Chollet, a legend in machine learning and the creator of the test, wrote: Today, OpenAI announced O3, its next-gen reasoning model. We've worked with OpenAI to test it on ARC-AGI, and we believe it represents a significant breakthrough in getting AI to adapt to novel tasks. It scores 75.7% on the semi-private eval in low-compute mode, for $20 per task in compute, and 87.5% in high-compute mode, which costs thousands of dollars per task.
It's very expensive, but it's not just brute force. These capabilities are new territory, and they demand serious scientific attention. Now, when it comes to the question of AGI, OpenAI is not claiming that title, but they are using big language. OpenAI co-founder and president Greg Brockman wrote, O3 is a breakthrough, with a step function improvement on our hardest benchmarks. But that's different, of course, than claiming AGI.
One of the first noteworthy opinions on whether this is AGI came from Chollet himself. In the thread announcing the test results, he commented: So is this AGI?
While the new model is very impressive and represents a big milestone on the way towards AGI, I don't believe this is AGI. There's still a fair number of very easy ARC-AGI-1 tasks that O3 can't solve, and we have early indicators that ARC-AGI-2 will remain extremely challenging for O3. This shows that it's still feasible to create unsaturated, interesting benchmarks that are easy for humans yet impossible for AI, without involving specialist knowledge. We will have AGI when creating such evals becomes outright impossible.
As a total aside on the ARC Prize itself, that test is run on a fully private set of questions and must be completed using just 10 cents of compute per task. The team is committed to keeping those parameters until someone releases an open-source model that can achieve an 85% score. Chollet believes version one of this test is now saturated and no longer a useful benchmark, but expects version two to present a much greater challenge.
He added, The ARC Prize 2025 leaderboards will be the best place to monitor reproduction attempts.
Vanta automates compliance for ISO 27001, SOC 2, GDPR, and leading AI frameworks like ISO 42001 and the NIST AI Risk Management Framework, saving you time and money while helping you build customer trust. Plus, you can streamline security reviews by automating questionnaires and demonstrating your security posture with a customer-facing trust center, all powered by Vanta AI.
Over 8,000 global companies like Langchain, Leela AI, and Factory AI use Vanta to demonstrate AI trust and prove security in real time. Learn more at vanta.com slash nlw. That's vanta.com slash nlw.
If there is one thing that's clear about AI in 2025, it's that the agents are coming. Vertical agents by industry, horizontal agent platforms, agents per function. If you are running a large enterprise, you will be experimenting with agents next year. And given how new this is, all of us are going to be back in pilot mode.
That's why Superintelligent is offering a new product for the beginning of this year. It's an agent readiness and opportunity audit. Over the course of a couple quick weeks, we dig in with your team to understand what type of agents make sense for you to test, what type of infrastructure support you need to be ready, and to ultimately come away with a set of actionable recommendations that get you prepared to figure out how agents can transform your business.
If you are interested in the agent readiness and opportunity audit, reach out directly to me, nlw at bsuper.ai. Put the word agent in the subject line so I know what you're talking about. And let's have you be a leader in the most dynamic part of the AI market. Now, of course, the question of whether a thing is or isn't AGI is ultimately less relevant than how good it is at doing the things that people currently do, and what that's going to mean for jobs, the economy, and society.
The big place where this conversation was taking place was around developers. Florian Mai writes, "O3 is better than 99.95% of programmers. The public needs to wake up to what's happening so we can act responsibly. For that to happen, we first need the scientific community to acknowledge the evidence. This is the most important problem of our time." Entrepreneur Sully writes, "Yeah, it's over for coding with O3. This is mind-boggling. Looks like the first big jump since GPT-4, because these numbers make zero sense."
Still, some are pointing out that coding competitions don't necessarily translate to real-life problems. Machine learning instructor Santiago wrote, O3 is better than 99.95% of programmers at solving Codeforces problems. 99.99% of professional programmers don't need to solve Codeforces problems to make a living. There's absolutely no proof that O3 is capable of doing what those professional programmers do to make money. He continued, I'm not downplaying how much the world is changing. My argument is about what exactly performing well on software engineering benchmarks tells us, and how it relates to the current work of software engineers.
The other benchmarks were no less impressive or paradigm-shifting. Deedy Das, a VC at Menlo Ventures, tried to describe just how wild the math benchmark is, commenting, 99.99% of people cannot comprehend how insane FrontierMath is. The problems are created by math professors and are not in any training data. Math legend Terry Tao said these are extremely challenging; I think they will resist AIs for several years at least. OpenAI's O3 did 25% on this. At this stage, no other model has completed more than a single question.
And this is where we started to see some big-think, implication-type conversations. Stability AI co-founder Emad Mostaque wrote, my take on O3: the global economy is cooked. We need a new economic and societal framework. Any work that can be done on the other side of a computer screen, AI will be able to do at a fraction of the price. Harry Law, a Google DeepMind and Cambridge University alumnus, wrote, at $3,000 per task, O3 is already a more cost-effective solution than hiring McKinsey.
And while I think the analogy is not even close to perfect, there is an important point here: a lot of these numbers only seem expensive when they're placed in the context of software pricing, not so much when they're framed as labor replacement. Nick Cammarata writes, I set my AI expectations to unrealistically high bonkers AI world and I still underestimated recent progress.
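To put rough numbers on that software-versus-labor framing, here is a back-of-the-envelope sketch. The hourly rate and hours-per-task figures are assumptions invented purely for illustration, not sourced data.

```python
# Back-of-the-envelope cost framing. All labor figures below are
# assumptions for illustration only, not sourced numbers.
O3_HIGH_COMPUTE_COST = 3_000   # dollars per task, per the discussion above
ASSUMED_HOURLY_RATE = 300      # dollars/hour for senior knowledge work (assumption)
ASSUMED_HOURS_PER_TASK = 40    # human hours for a comparable task (assumption)

human_cost = ASSUMED_HOURLY_RATE * ASSUMED_HOURS_PER_TASK
print(f"Assumed human cost per task: ${human_cost:,}")        # $12,000
print(f"O3 high-compute cost per task: ${O3_HIGH_COMPUTE_COST:,}")

# Priced against software (cents per API call), $3,000 per task looks
# absurd; priced against knowledge-work labor, it can already be the
# cheaper line item.
```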
And while I don't want to go deep into the debates around what this means for the singularity and hard takeoff and all those sorts of theories, at least not in this particular episode, what's important relative to O3 is that they're part of the conversation. One thing that was notable was how big a gap there is between the insider AI conversation and what's being reported in the mainstream press.
Adam D'Angelo writes, wild that the O3 results are public and yet the market still isn't pricing in AGI. Bloomberg reported it as just another leg of the race between OpenAI and Google. The Wall Street Journal ran a feature story about delays to GPT-5 with the headline, The Next Great Leap in AI Is Behind Schedule and Crazy Expensive. And yet for all the discussion around how cooked people are and all this sort of stuff, I think it's really important to have some perspective here as well. Replit CEO Amjad Masad said, the idea that O3 will automate software engineers is silly.
Matt Griswold pointed out that the replacement of developers is progressing at a much slower pace than the advancement of the technology itself.
Professor Ethan Mollick makes a point that I make all the time. The reason everything will not change quickly, he writes, even if AI generally exceeds human capabilities across fields, is in large part the nature of systems. Organizational and societal change is much slower than technological change, even when the incentives to change quickly are there. Human social and organizational inertia is going to be a slowdown force that helps us have time to adapt.
Julia McCoy flips it around and says, hype about O3 misses the plot. This isn't about AI getting smarter. It's about humans getting freer. No more data entry, no more mundane tasks, no more trading time for money. Cushy has a similar point. If you view the O3 launch as anything less than irrefutable evidence that this is the most exciting time to be alive, you may need to take a deep breath and rekindle your optimism.
And for those trying to figure out where to spend their time now that this exists, Bojan Tunguz writes, I've been telling you for a while not even to try to compete with machines at being a better machine. Instead, try competing with humans at being a better human.
Look, call me optimistic, but at the end of the day, I just think that the entire history of human experience points towards the output of this explosion of intelligence being a massive increase in human creation. We're going to make more stuff. We're going to make more code. We're going to make more products. We're going to make more entertainment. None of which is to say that the disruptions along the way won't be painful and we do need to deal with them.
But I continue to think that the future is going to be even more exciting than the present. And that seems to me to be a pretty good way to close out 2024. Now, this will be the last regular AI Daily Brief episode of the year. From here on, I've got a number of end of year episodes, which I'm really excited about. We've got the 15 most important AI products of the year, 25 predictions for agents, and a bunch more.
For now, though, can't tell you how much I appreciate you guys watching, listening, hanging out with me here every day. I hope that you are headed into a wonderful holiday season. Until next time, peace.