This is episode number 864 on OpenAI's O3 Mini.
Welcome back to the Super Data Science Podcast. I am your host, Jon Krohn. At the time of recording, I've been completely crushed by a brutal stomach infection, and I'm heavily medicated right now to get through recording this episode. But the show must go on, so for today's Five Minute Friday-style episode, I'm skipping the preamble and jumping straight to the meat of the episode.
Today's episode will fill you in on everything you need to know about an important model OpenAI recently released to the public called O3 Mini. OpenAI's O3 Mini is a reasoning model like DeepSeek's R1 model, which I detailed two weeks ago in episode number 860. And it's also a reasoning model like the original, super-famous reasoning model O1, which made a huge splash when OpenAI released it back in September and which I covered back in episode number 820.
As a quick recap, reasoning models like O1, R1, and now O3 Mini work through problems step-by-step in the background before outputting a response to your query. Compared to models like GPT-4o and Claude 3.5 Sonnet that immediately begin streaming their outputs, reasoning models are far more effective at the same kinds of tasks that you might tackle step-by-step with pencil and paper, such as math problems or challenging coding problems.
There are two reasons why this new O3 Mini reasoning model is such an important release. First, O3 Mini has three modes: a low mode, a medium mode, and a high mode, where high carries out the most inference-time compute. When it's left quote-unquote thinking long enough in that high mode, O3 Mini achieves state-of-the-art performance relative to any other publicly available model on a number of key challenging benchmarks, including the AIME math benchmark, the Codeforces coding benchmark, and the SWE-bench Verified benchmark, which consists of challenging real-world software engineering problems. To be more explicit, this means that O3 Mini High outperforms not only O1 Mini, but also DeepSeek R1 and even OpenAI's much more expensive-to-run full-size O1 model.
Which brings me to the second reason why O3 Mini is such an important release. Because O3 Mini is relatively small, it's way cheaper than O1 to run. While O1 costs $15 per million input tokens and $60 per million output tokens, O3 Mini costs just 7% of that on both input and output. So you're getting comparable or even better performance on challenging benchmarks with O3 Mini relative to O1, at a much lower cost.
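To make those numbers concrete, here's a quick back-of-the-envelope calculation. This is just a sketch based on the prices quoted above; check OpenAI's pricing page for the exact current per-token rates.

```python
# Rough cost comparison using the per-token prices quoted in this episode.
# The 7% figure is approximate; OpenAI's pricing page has the exact rates.
O1_INPUT, O1_OUTPUT = 15.00, 60.00  # O1: USD per million tokens
RATIO = 0.07                        # O3 Mini at roughly 7% of O1's price

print(f"O3 Mini input:  ~${O1_INPUT * RATIO:.2f} per million tokens")   # ~$1.05
print(f"O3 Mini output: ~${O1_OUTPUT * RATIO:.2f} per million tokens")  # ~$4.20
```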
And note that O3 Mini is about twice the cost of running DeepSeek R1 on DeepSeek's own cloud infrastructure in China. But if you want to run R1 with a US cloud provider, O3 Mini actually costs about half as much to run. So to recap all that, the key points are that O3 Mini provides state-of-the-art performance on complex tasks that require step-by-step reasoning, all at bargain prices compared to the first generation of reasoning models.
So how can you access this powerful new O3 Mini model? Free-tier users of ChatGPT can get a taste of O3 Mini by selecting "Reason" in the chat box when you make your query. And if you have a paid ChatGPT plan, such as ChatGPT Plus, Team, or Pro, you can access the O3 Mini High model, which spends the most time on inference-time computation but also provides the state-of-the-art capabilities I've been touting throughout this episode. You can also use the OpenAI API to embed O3 Mini's reasoning capabilities into any application your heart desires.
I've got a link to instructions on how to use the API in the show notes. Depending on your exact application, you can experiment to determine whether O3 Mini Low, Medium, or High is ideal for your use case, noting that, of course, your compute time and financial cost will both go up if you opt for O3 Mini Medium, and even more so if you go for O3 Mini High.
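To give you a flavor of what that looks like, here's a minimal sketch of calling O3 Mini through OpenAI's Python SDK. It assumes the openai package is installed and an OPENAI_API_KEY environment variable is set; the model name and the reasoning_effort parameter follow OpenAI's o3-mini documentation, but double-check the instructions linked in the show notes for the exact, current usage.

```python
# Minimal sketch: querying O3 Mini via OpenAI's Python SDK.
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # "low", "medium", or "high"
    messages=[
        {
            "role": "user",
            "content": "Write a function that returns the nth Fibonacci number.",
        }
    ],
)

print(response.choices[0].message.content)
```

Dialing reasoning_effort up from "low" to "high" is exactly the trade-off just described: more inference-time compute, and therefore more latency and cost, in exchange for stronger step-by-step reasoning.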
Ultimately, this O3 Mini release isn't as earth-shattering for me as the DeepSeek R1 release was a few weeks ago, because R1 is provided open-source while O3 Mini is completely proprietary. This means that you have even more flexibility with R1 to adapt it to your heart's content and to run it on whatever infrastructure you desire.
But OpenAI does have another card up their sleeve that will presumably be released to the public soon, and that's quite exciting indeed: O3. So this whole episode I've been talking about O3 Mini, but OpenAI are presumably about to release the full-size O3 model. And that one, O3, has performance that absolutely crushes all other models available today, including DeepSeek R1 and, of course, its predecessor, the full-size OpenAI O1 model, on all the complex and important reasoning benchmarks. If you watch the video version of today's episode, I've got some charts to show this big delta for O3 relative to all other existing models today. This includes the AIME math benchmark, which I mentioned earlier in this episode; to go into a bit more detail, AIME stands for American Invitational Mathematics Examination. So yeah, on that benchmark, OpenAI O3 gets a score of 96.7,
which is far better than DeepSeek R1, which came in at 79.8 and was the next closest model other than O3 Mini High, which came in at 87.3. On coding, on the Codeforces benchmark, again, O3 absolutely crushes all other existing models. It gets an Elo rating of over 2700, while the next closest models are O3 Mini High at about 2100 and DeepSeek R1 at about 2000.
Then there's the SWE-bench Verified benchmark, the software engineering benchmark; I've got a link to more details on that benchmark in the show notes. It covers complex real-world software engineering problems, and again, the OpenAI O3 model absolutely crushes all other existing models. Note that none of these results are independently verified yet; these are all stats from OpenAI themselves, so maybe take them with a grain of salt, but I think they've been pretty reliable on this kind of thing with their historical releases. And yeah, so on the SWE-bench Verified benchmark, OpenAI O3 gets a score of almost 72, whereas the next best models, O3 Mini High, DeepSeek R1, and OpenAI O1, come in at 49. It's a huge delta that would be very noticeable in a real-world application. And then finally, there's a fourth and final benchmark here, which is related to being able to answer natural language questions in English: the graduate-level Google-proof Q&A benchmark, GPQA.
And on this GPQA benchmark, the delta isn't as stark as on the math and programming benchmarks, but again, OpenAI O3 comes out on top with a score of almost 88, whereas the next best model is OpenAI O3 Mini High at 80. So still a big delta, especially as you get closer and closer to 100.
So, yeah, exciting things to come. Another week, another major breakthrough in AI capabilities. I hope your brain is tingling with ideas for how you can streamline your own activities, as well as potentially build world-changing applications, with increasingly powerful and exponentially less expensive AI models at your fingertips. If not, try chatting with an LLM to get some ideas. All right, that's it for today's episode. If you enjoyed it or know someone who might, consider sharing this episode with them, leave a review of the show on your favorite podcasting platform, or tag me in a LinkedIn post with your thoughts. And if you aren't already, be sure to subscribe to the show. But most importantly, I just hope you'll keep on listening. Until next time, keep on rocking it out there, and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.