This is episode number 864 on OpenAI's O3 Mini.
Welcome back to the Super Data Science Podcast. I am your host, Jon Krohn. At the time of recording, I've been completely crushed by a brutal stomach infection, and I'm heavily medicated right now to get through recording this episode. But the show must go on, so for today's Five Minute Friday-style episode, I'm skipping the preamble and jumping straight to the meat of the episode.
Today's episode will fill you in on everything you need to know about an important model OpenAI recently released to the public called O3 Mini. OpenAI's O3 Mini is a reasoning model like DeepSeek's R1 model, which I detailed two weeks ago in episode number 860. And it's also a reasoning model like the original, super-famous reasoning model O1, which made a huge splash when OpenAI released it back in September and which I covered back in episode number 820.
As a quick recap, reasoning models like O1, R1, and now O3 Mini work through problems step-by-step in the background before outputting a response to your query. Compared to models like GPT-4o and Claude 3.5 Sonnet that immediately begin streaming their outputs, reasoning models are far more effective at the same kinds of tasks that you might tackle step-by-step with pencil and paper, such as math problems or challenging coding problems.
There are two reasons why this new O3 Mini reasoning model is such an important release. First, O3 Mini has three modes: a low mode, a medium mode, and a high mode, where high carries out the most inference-time compute. When it's left quote-unquote thinking long enough in that high mode, O3 Mini achieves state-of-the-art performance relative to any other publicly available model on a number of key challenging benchmarks, including the AIME math benchmark, the Codeforces coding benchmark, and the SWE-bench Verified benchmark, which consists of challenging real-world software engineering problems. To be more explicit, this means that O3 Mini High outperforms not only O1 Mini, but also DeepSeek R1 and even OpenAI's much more expensive-to-run full-size O1 model.
Which brings me to the second reason why O3 Mini is such an important release. Because O3 Mini is relatively small, it's way cheaper than O1 to run. While O1 costs $15 per million input tokens and $60 per million output tokens, O3 Mini costs just 7% of that on both input and output. So you're getting comparable or even better performance on challenging benchmarks with O3 Mini relative to O1, at a much lower cost.
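To make those numbers concrete, here's a quick back-of-the-envelope calculation. This is just a sketch based on the prices quoted above; check OpenAI's pricing page for the exact current per-token rates.

```python
# Rough cost comparison using the per-token prices quoted in this episode.
# The 7% figure is approximate; OpenAI's pricing page has the exact rates.
O1_INPUT, O1_OUTPUT = 15.00, 60.00  # O1: USD per million tokens
RATIO = 0.07                        # O3 Mini at roughly 7% of O1's price

print(f"O3 Mini input:  ~${O1_INPUT * RATIO:.2f} per million tokens")   # ~$1.05
print(f"O3 Mini output: ~${O1_OUTPUT * RATIO:.2f} per million tokens")  # ~$4.20
```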
And note that O3 Mini is about twice the cost of running DeepSeek R1 on DeepSeek's own cloud infrastructure in China. But if you want to run R1 with a US cloud provider, O3 Mini actually costs about half as much to run. So to recap all that, the key points are that O3 Mini provides state-of-the-art performance on complex tasks that require step-by-step reasoning, all at bargain prices compared to the first generation of reasoning models.
So how can you access this powerful new O3 Mini model? Free-tier users of ChatGPT can get a taste of O3 Mini by selecting "Reason" in the chat box when you make your query. And if you have a paid ChatGPT plan, such as ChatGPT Plus, Team, or Pro, you can access the O3 Mini High model, which spends the most time on inference-time computation but also provides the state-of-the-art capabilities I've been touting throughout this episode. You can also use the OpenAI API to embed O3 Mini's reasoning capabilities into any application your heart desires.
I've got a link to instructions on how to use the API in the show notes. Depending on your exact application, you can experiment to determine whether O3 Mini Low, Medium, or High is ideal for your use case, noting that, of course, your compute time and financial cost will both go up if you opt for O3 Mini Medium, and even more so if you go for O3 Mini High.
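To give you a flavor of what that looks like, here's a minimal sketch of calling O3 Mini through OpenAI's Python SDK. It assumes the openai package is installed and an OPENAI_API_KEY environment variable is set; the model name and the reasoning_effort parameter follow OpenAI's o3-mini documentation, but double-check the instructions linked in the show notes for the exact, current usage.

```python
# Minimal sketch: querying O3 Mini via OpenAI's Python SDK.
# Assumes `pip install openai` and an OPENAI_API_KEY environment variable.
from openai import OpenAI

client = OpenAI()  # picks up OPENAI_API_KEY from the environment

response = client.chat.completions.create(
    model="o3-mini",
    reasoning_effort="high",  # "low", "medium", or "high"
    messages=[
        {
            "role": "user",
            "content": "Write a function that returns the nth Fibonacci number.",
        }
    ],
)

print(response.choices[0].message.content)
```

Dialing reasoning_effort up from "low" to "high" is exactly the trade-off just described: more inference-time compute, and therefore more latency and cost, in exchange for stronger step-by-step reasoning.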
Ultimately, this O3 Mini release isn't as earth-shattering for me as the DeepSeek R1 release was a few weeks ago, because R1 is provided open-source while O3 Mini is completely proprietary. This means that you have even more flexibility with R1 to adapt it to your heart's content and to run it on whatever infrastructure you desire.
But OpenAI does have another card up their sleeve that will presumably be released to the public soon, and that's quite exciting indeed: O3. So this whole episode I've been talking about O3 Mini, but OpenAI are presumably about to release the full-size O3 model. And that one, O3, has performance that absolutely crushes all other models available today, including DeepSeek R1 and, of course, its predecessor, the full-size OpenAI O1 model, on all the complex and important reasoning benchmarks. If you watch the video version of today's episode, I've got some charts to show this big delta for O3 relative to all other existing models today. This includes the AIME math benchmark, which I mentioned earlier in this episode; to go into a bit more detail, AIME stands for American Invitational Mathematics Examination. So yeah, on that benchmark, OpenAI O3 gets a score of 96.7,
which is far better than DeepSeek R1, which came in at 79.8 and was the next closest model other than O3 Mini High, which came in at 87.3. On coding, on the Codeforces benchmark, again, O3 absolutely crushes all other existing models. It gets an Elo rating of over 2700, while the next closest models are O3 Mini High at about 2100 and DeepSeek R1 at about 2000.
Then there's the SWE-bench Verified benchmark, the software engineering benchmark; I've got a link to more details on that benchmark in the show notes. It covers complex real-world software engineering problems, and again, the OpenAI O3 model absolutely crushes all other existing models. Note that none of these results are independently verified yet; these are all stats from OpenAI themselves, so maybe take them with a grain of salt, but I think they've been pretty reliable on this kind of thing with their historical releases. And yeah, so on the SWE-bench Verified benchmark, OpenAI O3 gets a score of almost 72, whereas the next best models, O3 Mini High, DeepSeek R1, and OpenAI O1, come in at 49. It's a huge delta that would be very noticeable in a real-world application. And then finally, there's a fourth and final benchmark here, which is related to being able to answer natural language questions in English: the graduate-level Google-proof Q&A benchmark, GPQA.
And on this GPQA benchmark, the delta isn't as stark as on the math and programming benchmarks, but again, OpenAI O3 comes out on top with a score of almost 88, whereas the next best model is OpenAI O3 Mini High at 80. So still a big delta, especially as you get closer and closer to 100.
So, yeah, exciting things to come. Another week, another major breakthrough in AI capabilities. I hope your brain is tingling with ideas for how you can streamline your own activities, as well as potentially build world-changing applications, with increasingly powerful and exponentially less expensive AI models at your fingertips. If not, try chatting with an LLM to get some ideas. All right, that's it for today's episode. If you enjoyed it or know someone who might, consider sharing this episode with them, leave a review of the show on your favorite podcasting platform, or tag me in a LinkedIn post with your thoughts. And if you aren't already, be sure to subscribe to the show. But most importantly, I just hope you'll keep on listening. Until next time, keep on rocking it out there, and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.