EP 456: OpenAI’s o3-Mini - The world’s best free chatbot model? - Transcript and Chapters - from Podcast Everyday AI Podcast

OpenAI的O3 Mini：颠覆我以往对免费AI模型的认知

我过去一直建议大家不要使用免费的AI模型，因为付费版本在性价比方面具有压倒性优势。然而，OpenAI最近发布的O3 Mini模型让我不得不重新审视这一观点。它或许是目前世界上最好的免费聊天机器人模型。

O3 Mini是OpenAI推出的首个免费推理模型，这与以往的GPT系列模型截然不同。它具备更强大的推理能力，能够进行更长时间、更深入的思考。虽然免费版本的使用次数有限制，但这并不妨碍它成为一个令人兴奋的突破，尤其对于那些预算有限的用户而言。

O3 Mini模型在STEM（科学、技术、工程、数学）领域和编码方面表现出色。它比之前的O1 Mini模型更加便宜，运行速度也更快。对于API用户，OpenAI还提供了O3 Mini的三个版本：低、中、高，用户可以根据自身需求在速度、成本和性能之间进行权衡。而ChatGPT Plus用户则可以访问O3 Mini的中等和高等版本。令人印象深刻的是，O3 Mini High在许多基准测试中甚至超越了完整的O1模型。

我认为O3 Mini之所以成为目前最好的免费聊天机器人模型，不仅仅因为它本身的出色性能，更重要的是它具备其他免费模型所不具备的功能，例如联网搜索。这种能力极大地扩展了它的应用范围，使其能够处理更复杂、更贴近现实世界的问题。

当然，O3 Mini也并非完美无缺。免费版本的限制性使用次数可能会影响一些用户的体验。但即便如此，它仍然代表着免费AI模型发展的一个重要里程碑，为更多用户打开了通往先进AI技术的大门。我强烈建议大家尝试一下O3 Mini，亲身体验其强大的推理能力和联网搜索带来的便利。这将改变你对免费AI模型的认知。

Shownotes Transcript

Translations:

中文

This is the Everyday AI Show, the everyday podcast where we simplify AI and bring its power to your fingertips. Listen daily for practical advice to boost your career, business, and everyday life.

I know I've said probably dozens of times, don't use free AI models, right? Because the paid versions are ridiculously cheap for what you get, $20 a month, $30 a month. It doesn't matter whether you're an individual or buying that for your company with thousands of employees. That's so affordable. So I've always said, don't touch the free models, but

I might have to change that because OpenAI has made their O3 mini model free, well, in a very limited capacity in terms of the number of messages yet.

I think it may be the world's best free chatbot model. So we're going to be going over today the new O3 mini model from OpenAI, talk about what it is, how it works, its implications, if it is really the best free AI chatbot model in the world, and maybe do a little bit of live testing.

All right. I'm excited for this one. I hope you are too. If you're new here, welcome. This is Everyday AI. My name is Jordan Wilson, and we do this thing every day. It's for you. This is your daily live stream podcast and free daily newsletter, helping us all not just keep up with AI, but how we can use it to get ahead, to grow our company and to grow our career.

And that is a full-time job if you are not tuning in every day. So we do all that hard work for you. So then you can go be the smartest person in AI at your company. All right. So if you are new, maybe listening for the first time on the live stream or the podcast, thank you for tuning in. Make sure to check your show notes. Very important things they're

mainly our website, youreverydayai.com. That's where you're going to want to go sign up for our free daily newsletter because every single day we do a couple of things. We bring you all the latest AI news and tell you what it means, but we also break down our podcast episode from the day with some more information. So make sure you go do that as well as...

While you're there, I'm going to keep promoting this, y'all. You need to go listen to our 2025 AI Predictions and Roadmap series. It is all on our website. It is for free. I'm getting a ton of messages. I just got a message actually today or last night because I think this person's in Europe from one of the largest consulting companies in the world. And they said their team is phenomenal.

breaking down those five episodes and they're going to continue to track them all year. I kid you not. You need to go listen to them and let me know what you think. All right, enough chitchat. Let's get into the AI news. Livestream audience.

This is up to you. I have a question on the screen there. Let me know. We're going to be doing an 03 mini live test. Do you want to see A, do you want to see it go through a reasoning rubric or do you want to see B go through some real world data analysis? So let me know A or B on the screen. Let me know now.

All right, so AI news, a lot going on. Gemini, Google has announced the general availability of its Gemini 2.0 Flash model, a high-performance AI designed for developers with enhanced speed and complex problem-solving capabilities. So the Gemini 2.0 Flash model was first introduced at IO 2024, its developer conference, is praised for its efficiency in handling high-volume tasks and multimodal reasoning with a $1 million budget.

token context window. Also, a new experimental version of the big boy model, Gemini 2.0 Pro is also available for paid users boasting superior coding performance and a 2 million token context window. I mean, Google is just dominating the context window game. They also introduced Gemini 2.0 Flash Lite.

the most cost-efficient model to date, and that is in public preview, and that's offering improved quality over its predecessor, 1.5 Flash. So Google emphasized the importance of safety and responsibility with the Gemini 2.0 lineup, incorporating new reinforcement learning techniques and automated red teaming to mitigate risk and ensure secure usage. Yeah, huge, huge news there from Google. I'm excited to dive into that a lot more. Our next piece of AI news.

US lawmakers are proposing a ban on deep seek. Big surprise. Not at all. All right. So lawmakers in the US are planning to introduce a bill to ban deep seeks chatbot application from right now, just government owned devices over security concerns that user data could be accessed by the Chinese government.

Spoiler alert, it can. The bipartisan legislation is echoing previous efforts to ban TikTok from government devices, which was the precursor to TikTok being banned in the U.S., which was quickly there kind of overturned, but it could still happen.

All right, so DeepSeek is a Chinese AI company, and they've rapidly gained popularity in the U.S., becoming the most downloaded iOS app last month. Concerns arose, though, after an analysis revealed that there's hidden code in the app that could send information, user information, to China Mobile, a state-owned company,

banned in the U.S., a China state-owned company. So yeah, the proposed legislation aims to ban sensitive government and personal data from being accessed by the Chinese Communist Party. Other countries, including Australia, South Korea, and Italy, have already banned DeepSeek from their government systems due to similar data security concerns. Also, some U.S. federal agencies, such as the Navy and NASA, have preemptively blocked the app for security reasons. I, I

I'd love to post on LinkedIn for this. I'll probably do a show dedicated at some point. This story is changing so quickly. That's why I've kind of held my tongue on this because I got a lot of hot takes. So I might need to save that for a week or two. All right. Last but not least, new piece of AI news. ChatGPT has made...

Speaking of things for free, their new chat GPT search available now for free, even for free users who are not logged in. So this new feature, well, the feature at least new to free and non logged in users allows everyone to access up to the day information such as sports scores, news and stock prices directly through chat GPT.

So according to reports, the search functionality uses a fine-tuned version of GPT-4.0 optimized with synthetic data and output from OpenAI's new reasoning models. So OpenAI has partnered with major news organizations like the Associated Press and Reuters for licensing agreements influencing the visibility of certain publishers in search results. So this is huge. It's kind of

weird now, right? It almost looks like Google is trying to really compete with what ChatGPT was like two years ago. And now ChatGPT is trying to compete with what Google has been for the past 20 years, right? Really making a stake to try to erase Google, right? OpenAI just wants you to skip Google altogether and go use its ChatGPT search. And you don't even need to be logged in and you don't even need to have an account. So pretty wild.

All right, that's enough chit-chat. Lots more in the newsletter today. So let's talk about OpenAI's O3 mini model. It's very impressive.

It's very impressive. I'm just going to say that. And hey, live stream audience, thank you for tuning in. I saw a couple votes, some A's, some B's. So yeah, let me know if you want to go over the reasoning rubric live or if you want to go over the data analysis. So A or B. All right. Also-

You're going to want to repost this. All I'm saying, I've been putting these guides together. After I do something in ChatGPT or other large language models, I'm like, wait, I've just really saved people dozens of hours a week if they go do this. And I realized that, you know, sometimes the podcast isn't enough or putting it in the newsletter. So I did put together another guide specifically on using the O3 mini models.

It's fantastic. I just finished it this morning. So if you repost the show, I'm going to send that to you. All right. So let's get into the O3 mini model. Here's the gist. Okay. It is the first free reasoning model from OpenAI, right? So we have the O series of models. It is different.

than the GPTs, right? So the GPTs are your quote unquote old school transformer models. And then the O series, this is OpenAI's Reasoner models, right? The Reasoner models have become very popular over the last like four or five months. But this is essentially a model that thinks longer, kind of does more inference or uses kind of this chain of thought thinking where it doesn't just

quickly respond to something. It takes a while and really kind of thinks internally. So kind of the work that you would normally do in a transformer model, a GPT-4-0, right? As a human, you'd want to go back and forth with it a lot. Kind of these reasoning models, that's why they're so good. But

A couple of things. It uses more compute. So generally they're more costly. So as an example, if you want to have unlimited use of this, you need the $200 pro plan, but at least for, I believe it's 10 messages until you hit your message cap. It's available right now for free users.

But it's not just that, that makes me excited about this. So if you are logged in, so this is separate news from the ChatGPT search that you can use even if you are not logged in, right? So if you do have a free, even a free ChatGPT account, so not only do you have a couple queries that you can use with the new O3 mini model that OpenAI just released, but it also is

connected to the internet. That is huge. That is the piece that I think most people have missed or overlooked when it comes to this new model from OpenAI.

And that guide, by the way, the guide leverages specifically, it's 20 different use cases that combine reasoning and the internet, right? Which is, that's what knowledge workers do. And that's why I think this is so exciting, even for people who are not paid subscribers. Again, whether your favorite chatbot is Gemini, Claude, ChatGPT, whatever, just pay for the

based $20 a month plan. It pays for itself the first time you hit enter if you know what you're doing. But even for those cheapskates out there, right? Yeah. I know a couple of you out there that are still pinching your pennies. And even though you're buying $8 coffees every day, you're like, oh, I'm not going to buy a $20. No, just buy it. But still, this is the first model for free that you can use from OpenAI that is its reasoning model. And it's connected to the internet. And reportedly,

I don't personally believe this, but a lot of people have said, oh, OpenAI did this because of the DeepSeek R1 release, which took the internet by storm for a lot of

I'm going to say incorrect reasons. We'll just say that. All right. I don't personally think this is in response to deep seek. I think this is actually in response to Google. That's been on a frigging tear since December. Google has been straight up releasing a crazy amount of releases. If I'm open AI, I don't care about deep seek, right? It's,

largely going to get banned, I believe. I'm worried about Google. So I think that this is really a shot at all of the great work that Google has been doing specifically in Google AI Studio. So let's just answer that question right now. I'm not going to make you wait another 20 minutes and after our test. Is O3 Mini the best free chatbot model? Let me break that down. Free chatbot model. Okay. That's when you log into the front end of a chatbot.

All right. So what do I mean by that? Well, right now, if you have a free account and you log into Gemini, you know, Gemini.Google.com, even though they had all these new releases, you can't use them. You're only using 1.5 flash, but they have great models within the AI studio, but it's a little different. That's not a kind of a, for beginners, that's more for developers. So it beats Google.

No questions asked. Copilot is powered by GPT-4.0 technology. And last week, if you read our newsletter, you're smart. You already know this. There is some limited free access to OpenAI's O1 model with the Think Deeper kind of capabilities inside Copilot.

but I still think Oh three mini is better because we're talking. Oh one. Uh, I believe that one's Oh one preview, uh, Microsoft. I like, I know a lot of you guys listen to this and I always tell like, I met like talked with like a hundred of you guys, uh, at the build or, um, the ignite conference here in Chicago. And I'm like, Hey, tell me if I'm wrong on this, but I'm pretty sure the think deeper uses the Oh one, uh, preview, not the Oh one pro. Um,

So I still think it's better than that. I still think O3 Mini for free is better than using Copilot for free. Claude, just LOL that. I mean, by the time, even on a paid account,

Like Claude, you can't use it. You just can't write anything more than a couple of prompts and you hit your rate limits on free on a free account. Even though Claude three five sauna is a good model. It's now like, I don't know, eight months old. So presumably we'll be seeing new updates from Claude pretty time soon. I mean, on a free plan, if you look at Claude the wrong way, you've already hit your message limit. All right. So you can't really use it a lot.

All right. And then deep seek. Good luck with that. High risk. Lots of questions. Great model. Great benchmarks. Right. Good luck. That's all I'll say. All right. So is O3 mini the best free chatbot model? Yes. It's not even close.

This model is so good. Here's the thing. It's limited. If you're on the free plan, I think it's only 10. It's either 10 a day or 10 a week. OpenAI doesn't say in all of my accounts are paid. So I was trying to quickly find that answer out. I'll make sure to put it in the newsletter. But yes, it is. It is. And I actually don't think it's even close. All right.

Hey, someone here is saying our audio is cutting in and out a bit. Let me know if it actually is or if maybe that person has some computer problems today. So it is. OpenAI's new O3 Mini is the best free chatbot model in the world, and I don't think it's necessarily close because it's not just the model. It's everything else that the model has capabilities to do, like we said.

Search just right there. ChatGPT search is great. All right. So let's go over some of the highlights of the model. So it excels in STEM encoding this new O3 mini. All right. It is 63% cheaper than O1 mini.

which is the model it replaced. So yeah, if you're, FYI, if you're on a paid plan and you're in there looking and you're like, wait, where's 01 mini? 01 mini is gone. And now you have 03 mini instead. And there's actually multiple variations of 03 mini. I'll get to that in a second. It is 03 mini is 24% faster than 01 mini. And for API users, there's actually three variations.

There's a low, a medium, and a high kind of variety or flavor. And that's essentially...

you're, you're choosing speed and cost versus performance. So, uh, for the Oh three mini low, that is going to be the cheapest, the fastest with the lowest performance. Oh three mini high is going to be the most expensive and take a little longer, but it's going to have the best performance obviously. And then the, uh, Oh three, Oh three mini, uh, normal is going to be that, that, uh, middle, right? It's like the,

what is it the the three beds right one's one's too soft one's too hard one's just right um

All right. And for that's for API. So if you are on the chat bot version, right, which is many of us, right? So chatgpt.com, you're not, you know, uh, using backend API as a developer, but for chat GPT users, if you are chat GPT plus, so the $20 a month, uh, you have access to Oh three mini kind of the medium or middle version and then Oh three mini high.

And that one kind of thinks harder, more or less. It uses a little more compute. And you have, I believe, 50, oh no, it's 150 a day, I believe now. So plenty of usage. They just tripled it in the last couple of days. So if you have the $20 a month plan, I don't think you're probably going to hit your O3 Mini high limits. And let me tell you, right now, O3 Mini high?

is probably one of my most used models. All right. Also,

03 Mini High outperforms the full 01 model on many benchmarks. Because right now we just have the miniature version, right? We don't have the full 03 version. I don't even know if the full 03 version is going to come out in 2025. I would assume it would, but I don't know. Because the full 03 model has not been released. The only glimpse of it that we've seen is OpenAI did say that its new deep research model

which is mind-blowingly good. It's going to put so many small to medium-sized and management consultant companies out of business. I'm not kidding. It's freaking good. Anyways, that uses a fine-tuned version of the full O3 model, but this is just, we're just getting the mini, we're just getting the mini one here, y'all. All right, let's keep it going. Benchmarks, I know. I'm not going to get too dorky here, but let's look at competition math.

03 Mini High.

outperforms even the full 01 model on the AIME, I think that's AIME, 2024 Competition Math Benchmark, right? So large language models, they go through all these tests, all these standardized tests, essentially. Think of like a human, you know, there's all these different tests you take, same thing with models, and then you get benchmarks, you get scores, right? So you can see how capable the model is. So while 03 Mini, high, is even more capable than the 01 model, and that's one of the highest scores in the world.

And then you have PhD level science questions, right? Because O3, O3 mini high is great at anything STEM coding research. It's a chef's kiss. Good. All right. So on the PhD level science, which is GP, GPQ, a diamond for those of you at home keeping score, O3 mini high also out benches the full O1 pro, right?

Uh, also this is not yet. Oh, three mini. So kind of the, the, the benchmarks that I talk about a lot on this show, uh, aside from, you know, MMLU and some of those that I just mentioned, uh, are, uh, the chatbot arena scores. Uh, those aren't out yet because this model is like barely like a week old. Uh, right. But we do have from artificial analysis, which is a great resource for, uh,

an unbiased third-party model benchmarking service. O3 Mini, in terms of quality, it is fantastic.

Second in the world, only behind the full O1 model. So O3 Mini and DeepSeek R1 are actually, they're tied with scores of 89, where O1 has a 90. For comparison, right? If you love Cloud 3.5, as an example, Cloud 3.5 has a 68, if that puts it on the scale for you. All right. Gemini 2.0, not their newest version, but the one previous to that had an 82.

So what does that mean? It is by far one of the highest quality models in the world. And OpenAI made it free for a limited use case, right? But it's mind boggling that we have this level of a reasoning model that is one of the most capable in the world. And it also can access the internet, which is one of the reasons why I tell people don't use Claude, at least yet, right? Because there's a...

somewhat of a business danger if you are taking results from a large language model that has very old data you shouldn't be doing it all right so another uh kind of graph here from artificial analysis this just kind of shows your quality versus price and that's where you see oh okay when it comes to quality versus price o3 mini is actually the best in the world

And it's not necessarily close. The only one somewhat close is DeepSeek R1. Again, good luck with that if you want to use it. I'm not using it on a day-to-day basis. But O3 Mini from a quality and price perspective.

Right now, it can't be beat. I mean, we'll see. I think Google's announcements yesterday are going to shake this graph up a little bit, and I'm excited to dive into all of the new Gemini 2.0 a little bit. But right now, O3 Mini is technically an elite model. Don't let the Mini confuse you.

All right. So as a reasoning model, these are the API pricing, right? So again, uh, you can use it for free. You can use it chat, GBT plus $20 a month. You're probably not going to run out of queries. If you have the pro version, like I do $200 a month, it's unlimited. But, uh, for API pricing, uh, it's a dollar 10 for a million input token, uh, in four 40, uh, for a million output tokens for a reasoning model. So affordable. It is so affordable. Um,

Are you still running in circles trying to figure out how to actually grow your business with AI? Maybe your company has been tinkering with large language models for a year or more, but can't really get traction to find ROI on Gen AI. Hey, this is Jordan Wilson, host of this very podcast.

Companies like Adobe, Microsoft, and NVIDIA have partnered with us because they trust our expertise in educating the masses around generative AI to get ahead. And some of the most innovative companies in the country hire us to help with their AI strategy and to train hundreds of their employees on how to use Gen AI. So whether you're looking for chat GPT training for thousands,

or just need help building your front-end AI strategy, you can partner with us too, just like some of the biggest companies in the world do. Go to youreverydayai.com slash partner to get in contact with our team, or you can just click on the partner section of our website. We'll help you stop running in those AI circles and help get your team ahead and build a straight path to ROI on Gen AI. And you might be confused. I get it.

all this O alphabet soup, right? OpenAI CEO Sam Altman did admit that they have a naming problem with the models. It's hard, right? And especially when they come out with some of these new O reasoning models, some of the older ones get replaced or they're just no longer available. So let me just give you a quick rundown of the O series. So in September, we got O1 Preview and O1 Mini.

All right. Then in December, they got rid of 01 Preview. And then we just had 01 and they added 01 and 01 Pro. So if you had a pro account, that's the only way you can get access to pro. In December, you had three versions. You had 01 Mini, 01 and 01 Pro. Okay. Easy enough to follow along.

But then January 31st came around last week, right? And that threw a wrench in it. So now we went to 03. There is no 02 because that's the trademark name of a British telecom company. So if you're wondering like what happened here, did I miss out on a whole series of AI development? No, you didn't, right? But now in January, we got this 03 mini and that has 03 mini high and then 01 mini is gone.

I know, confusing. So depending on what paid plan you have, you might still have in your account 01, 01 Pro,

03 mini and 03 mini high. I know it's confusing. I have a slide here that can hopefully help you make a little sense of it. All right. Because which model should you use? Right? Like if you're like, oh, I have a paid chat GBT account, what should I use? Well, there's actually some unique features of each. So listen in here. I have a helpful little graph on screen for our live stream audience. So

01, not the pro. Okay. 01 actually has a great advantage to it. Okay. So right now, 01 and 01 pro are the only O models where you can upload files. Not all upload file types are supported. All right. But it does have four visuals, you know, PNGs and JPEGs, I believe. All right. So that's the 01 series. So normal 01 can access canvas mode.

All right. 01 Pro cannot yet 01 Pro is much more powerful than normal 01. Okay. So if you need to upload files,

right? Visuals at least, because you can't upload PDFs or spreadsheets right now into the O1 models. But let's say you're doing a lot of visual, you know, computer vision type work. You're probably going to want to still choose one of the O1 models. All right. If you love Canvas like I do, you might use O1 because that's the only one that has Canvas. If you need the just straight up raw power, you're going to want to go with O1 Pro. All right. But

01s don't have access to the internet. So 03 mini, there's no differentiation right now between features or other tools within ChatGPT. But 03 mini is the only one that has web search, all right? And that is the only mini model now. I know, a little hard. But essentially, if you need the web, which I highly advise, go 03 mini. That's why I'm using 03 mini a ton, all right? If you need Canvas, use normal 01.

If you're on the big, the big boy plan, then you can use O1 Pro for some of those very tough tasks. Does that, does that make sense, y'all? Hey, if you have questions, get them in now. Podcast audience, I love hearing from you guys. That's why I always put our email in there. I put my LinkedIn, reach out to me, let me know, like, if this is helpful. If you have questions, I'm sometimes a little slow getting around to those messages, but I do eventually.

All right. So let's look live. Let's see what won our little poll this morning. Let me count. So our A's, let's see, we had we had one, two, three, four, five, six, seven, eight, nine, 10. OK, 10. And then our B's. Let's see. We had one, two, three. All right. Looks like you guys wanted the reasoning poll.

the reasoning version here. All right. So let's jump into it. Livestream audience, as always, please let me know when and if you can see my screen here. All right. So we are going into chat GBT. I'm going to do this live.

All right. So, um, these, uh, let me make sure I go into Oh three mini high. So I'm going to be using Oh three mini high for these. All right. So this little reasoning rubric, um, I've been using a lot of these questions now for like two years, right? Before there were reasoning models, uh, I, I had like kind of this common set of about 12 questions, uh, that I would give to any models.

Some of the earlier models, you know, Claude 3.5 Sonnet, GPT-4, GPT-4.0, Gemini 2, didn't do very good with this because they're kind of like trick questions. But I actually think this is pretty important, right? Because sometimes a simple mistake when using ChatGPT or Claude or Gemini can screw up your entire output, right? Because large language models, whether you know this or not, they don't understand words.

You give it a bunch of words, it doesn't understand it. When it spits backwards, it doesn't know what those words are. It converts everything into tokens. So sometimes large language models get confused.

Like humans do, right? But that's important to keep in mind. But that's why I think this kind of like, quote unquote, reasoning rubric is important. These aren't questions that you would generally use, right, on a day-to-day basis to grow your company and career. But this just shows you, are these models smart or not, right? All right. So let's go ahead and try our first question here. So again, I am using O3 Mini HOT.

All right. And you're going to see these live. Hopefully it's not going to take too long to go in there. So the first one I am saying, I just woke up with six apples and three bananas. If you're a longtime listener, you've heard this before. I just woke up with today with six apples and three bananas. Yesterday, I ate a banana and two apples. This morning, you know what? I'm going to go ahead and scroll up here. I'm going to scroll up here. Hey, live stream audience. Let's see if you can get this ready. I'm going to go slow.

I just woke up today with six apples and three bananas. Yesterday, I ate a banana and two apples. This morning, I will eat one apple and no bananas. However, I don't really like apples, and one banana may turn brown tomorrow. Assuming nothing else changes, how many apples and bananas will I have tonight?

Live stream audience, what's your guess on that? Podcast audience, are you scribbling this at home? This is a fun one. I actually made this one up. Some of these are very widely used kind of trick questions or variations of these. Some of them I just made up, right? So I'm curious if our live stream audience can get this one correct. But let's quickly, I'm not going to do this for each and every one.

But let me just quickly describe for our podcast audience what's actually happening here. So it says reasoned about fruit consumption in stock for 29 seconds. So you don't get the full chain of thought, right? You don't get to see the raw unfiltered way that O3 mini high is thinking, but you do get a summary of the chain of thought, right? So I can see what it's thinking. So it's saying assessing fruit intake, right? I woke up with six apples and three bananas. So you kind of get to see

how the model is thinking and digesting your question. Then it says assessing tomorrow scenario, concluding the estimation, adjusting my focus. It says, I initially considered yesterday's fruit consumption, but it seems today's six apples and three bananas take precedence. Yes.

You know, a lot of this stuff in here is just to confuse the model. So the model started going down the wrong road, right? Which all the non-reasoning models got this wrong because that's what they did. They take like, they got this, you know, unrelated information and

It screwed up what it was supposed to do. So then it says assessing fruit stability, avoiding overstocking, reassessing preferences. These are just kind of the headlines in the chain of thought thinking. Assessing fruit freshness, evaluating fruit stock, right? Keep it going. I mean, this is a lot. And then at the very end, it says taking a closer look. Okay, I'm listing five apples and three bananas tonight, assuming no changes. Only one apple is eaten this morning, leaving the rest of the fruit untouched. So,

The final count, it says five apples and three bananas. You know what? Hey, shout out Vincent. Vincent got it right. Good job, Vincent. So did Marie. Good job, guys. All right. I'm going to go a little faster with the rest of our reasoning rubric. But I did want you all on the live stream and the podcast to kind of see and understand. It actually...

Thought about that at a pretty decent level, right? And going through and reading some of this, again, it's just the summarized chain of thought. But same thing. I played around with Google's new Gemini and it got some of these questions wrong. I did it with Gemini as well. But the chain of thought was actually pretty impressive, almost like scary impressive, right? But hey, getting it right is the first most important thing.

All right. The next one, which so many models struggle with this one. All right. So this one is, uh, let me get the right level of zoom here. A man and his dog. All right. And Hey, live stream audience. Let's just see if you guys can beat. Oh, three mini high. Some of these are very easy. All right. Uh, this one, you should be able to get instantly a man and his dog are standing on one side of the river. There's a boat with enough room for one human and one animal. How can a man get across with his dog in the fewest number of trips?

Like reasoning or sorry, transformer models can't get this. They can't. Right. Claude saw it. Gemini, GPT-4. Oh, none of them can get this, even though this is dead simple for any human with a brain. Right. So let's scroll down.

Scroll down. A lot of thinking here for something simple, right? But finally, finally, finally, finally, it's just one trip, right? Usually you would get three to five, even from these very powerful models, right? And this is one of the reasons why a lot of companies like before reasoners were like, I don't know, these models are dumb. Well, yeah, they can be a little dumb, right? Generally, these are trick questions, but now you're seeing it's handling it fairly well, all right?

Next question. Here we go. We're going to go through these quick, y'all. So a man and his dog are standing. That's the same one. I got to copy and paste the other one, y'all. All right. Next one.

Uh, if it takes three hours to dry 10 t-shirts in the sun, how long will it take to try 30 t-shirts in the sun? Hey, mathematicians on the live stream go, can you be Oh three, uh, Oh three mini high. So if it takes three hours to dry 10 t-shirts in the sun, how long will it take to dry 30 t-shirts in the sun? All right. Uh, let's keep going. There we go.

Got it. Correct. Three hours doesn't change. Right. It's saying, assuming you have the room, uh, it doesn't change. All right. Our next question. And again, a lot of them got this wrong before the reasoning models. All right. If you have a single match and walk into a room with an oil lamp, a candle in a fireplace, which do you light first? All right. Livestream bodies. What do you think? Which do you like first?

I hated these questions, right? Like when these are on standardized tests, you know, a train leaves the station at this time and an airplane's going here and this person's on a unicycle, but the unicycle's going uphill. And I'm like, this is dumb. I don't want to answer this, right? But what do you guys think? All right. Ted, Ted, Ted, Ted, Ted got the answer, right? Good job, Ted. Yeah, but the answer is the match. Yeah, it's not the candle or anything else. You got to light the match first.

All right, a couple more very simple ones, y'all. All right, so here's our next one. What color is an airplane's black box? That's just a trick one. All right, but it's going to get it right because even the Transformer models, bright orange is the correct answer. There we go.

All right. Our next one on our reasoning rubric for 03 mini high. All right. And again, for all of these y'all like, okay. So for that one, there was not a lot of chain of thought underneath, right? It said understanding the situation and that's all. It didn't have to go back and forth and second guess itself and, you know, map out all these alternative paths. It was pretty simple.

This one is, this one is kind of tricky, uh, in transformer models can never get this right. So I said, please give me seven jokes that end in the word blue. Two should be about animals. Three should be about some other topic in the body of this chat. Okay. And you can make up the other two. I'll tell you this.

Large language models aren't funny. All right. So I'm going to just read a couple of these jokes. I'm just mainly going to make sure that do they all add an end in blue? Is there two about animals, three about context of the chat and two that it made up. They're not going to be funny. Right. Uh,

And it always does the same thing. It's always like, oh, they're feeling blue. All right. This one is taking a little bit longer, right? So it's laying out the options, mapping out the connections, generating a diverse list, crafting humorous animal punchlines, brainstorming jokes, right? So a lot of this is actually a little more difficult for O3 Mini, right? It's taking a little bit more time to think about this. Let's see if it's done. It thought about this for a minute and 10 seconds, right? Kind of a long time. All right.

It said refining humor, which I haven't read the jokes. They're not going to be funny because ending it in blue, there's really nothing. I haven't seen anything, right? Humans out there, humans, if anyone can give me a real good joke that meets these criterias,

I don't know. I'll pay for a month of chat GPT. But I don't think there's anything funny that you can actually do because people are always like, oh, it failed. That's not a joke. And I'm like, okay, humans, you go ahead and do the same thing. See if you can make me laugh with the ending in the word blue. Probably not. All right. So let's see if it actually did it. Look at all this chain of thought, y'all. Sheesh. All right. So it got two animals. Perfect. Ends in the word blue. Perfect. Perfect.

All right. So here we'll, we'll read a couple of these at the local jazz night. My dog tried to sing along with the band. When I asked him why he kept hitting the wrong notes, he just barked blue, not funny, but hits it, hits it. Right. So,

Now it has three jokes about that use the context of this chat, all ending in the word blue. Let's just read one of them. I started here. This one's about the fruits. I started my day with six apples and three bananas. But after all the breakfast fuss, even the fruit salad confessed blue.

Funny? Nope. All right. And then two that it made up on its own. Let's read both of these because these are anytime there's something that's like 10% humorous, it's always the one that said it made up on its own. All right. So I visited a paint store looking for a hue to brighten my day. The salesman held up a can and said blue. Not funny. All right. Last one. When life handed me lemons, I tried making lemonade, but no matter how hard I squeeze, my mood still ended up blue. So funny.

Are these jokes kind of, are they funny? Absolutely not. Do they hit the criteria that we set forth? Yeah. Yeah. You know, I don't know. Maybe one person out there would laugh. All right here.

is the last one that we'll be able to definitively say yes or no. And this is a really good one. All right. Livestream audience, get ready. All right. Cause I'm pretty sure this is going to think for at least a minute or two. I want to see, can anyone out there in live stream land beat Oh three mini high on this? All right. You, you already see the prompt in there. So humans, you get a headstart. All right. So

Here we go. A box is locked with a three digit numerical code. All we know is that all digits are different. If the sum of all digits is nine and the digit in the middle is the highest, what is the code?

All right. Go ahead, humans. Can you beat? Right. Everyone's like everyone's always like, oh, AI isn't smarter than me. All right, humans. Let's see. All right. So a box is locked with a three digit numerical code. Can you be three many high? All we know is that all digits are different. The sum of all digits is nine and the digit in the middle is the highest. All right. Let's see.

Can anyone beat? All right. And I'm not, I'm not going to show the chain of thought on this for, to make it fun for our live stream audience to see if you can beat. Oh, three mini high. I don't see any responses yet. Y'all. All right. Marie got one. Marie said, Oh, eight one. Marie beat. Oh, three mini high.

All right, good. One thing is I didn't specify. So we'll see if O3 Mini High says it. And there's actually a lot of answers. All right, because I didn't specify if you could use a zero. I should update that rubric, right? But let's see how it did. Some impressive chain of thought here, right? So it broke down the rules. It's adding, you know, A plus B plus C equals nine. B is greater than A and B is greater than C, right? All these things.

Step one, so again, it's doing some basic algebra here. All right, let's scroll to the bottom. All right, so I did not designate that zero. I don't know why. All models don't think or know that you can start it with zero. They think it's like,

The first digit has to be a one through 10 and they only use zeros in the second and third spot. So I should update this to say you can use zeros in any of the three numerals, but it did get it right because there are 10 not counting starting off with a zero. I believe there are 10 different codes, right? So 180270162261360153351450243342.

So, yeah. Hey, good job, human friends. You guys got a lot of the solutions, right? All right. Let's just try one or two more. These ones are not something that are like right or wrong, right? This is more of an arbitrary answer. So here I'm going to click the search the web.

Okay, so let's go ahead. This is an example of where I think things can get powerful. But this prompt is, again, this is nothing special. All I'm saying is generate unique and creative marketing advertising strategies to grow the everyday AI podcast. Do not suggest general or run-of-the-mill ideas.

Only pitch clever advertising and marketing tactics to specifically grow the everyday AI podcast by Jordan Wilson. Hey, same thing humans. Hey, humans in the live stream audience answer this. How, how should we grow this podcast? Let me know. All right. So now.

It's brainstorming marketing strategies, crafting innovative strategies, identifying unique angles, right? Crafting AI driven campaigns, all this stuff, engaging the community, right? I think I'm doing an okay job at that. Hopefully. All right. Keep going. Keep going down. Keep going down. All right. Let's see if we got some answers. So below, did I ask for a certain number? No, I did not. So it said below are seven, uh,

Inventive, tailored strategies designed exclusively to grow the everyday AI podcast by Jordan Wilson. All right, so let's see if any of these are actually good because I've done this with all the different models and generally non-reasoning models give me kind of boring stuff, right? It's like, oh, you know,

Take out ads or, you know, post something on LinkedIn. I'm like, okay, that's boring. All right, so let's see. Number one is the AI Creator Accelerated Challenge. Launch a branded contest where listeners are invited to submit a brief case study on how they use a featured AI tool. What's very strange...

I kid you not. I just thought of this like last weekend in the shower. I'm like, oh yeah, I'm going to start doing this for like use cases. So okay, good job, O3 Mini. I hadn't heard this from any other non-reasoning model before. All right. Interactive AI chatbot ambassador. All right. So develop a custom AI chatbot branded in everyday AI's visual style and tone. All right.

Nothing. It's pretty standard. Everyday AI augmented reality filter campaign. Okay. Chat GPT. I don't know how much time you think I have to do that, but it's unique. All right. Number four, co-branded AI showcases with tool makers. Identify and partner with emerging or established AI tool companies for exclusive co-branded live mini webinars or demo days.

Yeah, I get enough of that. People always want to pitch their garbage to come on the show and sell to you guys. And I say, no, right. I think I got like 15 pitches yesterday. All right. And everyone wants to shove their garbage products down your throat. So I'm going to say no to that one. All right. Five personalized podcast journey generator built an interactive dynamic website feature that asks visitors a few short questions about their industry career goals and current usage. All right. That's fine. Six. Yes. Yes.

I've had this idea, so I like this one. Embed subtle Easter egg audio clips. Oh my gosh, I love this. I love this. This is actually one of my first ideas that I had back in 2022 before I even launched this. I'm like, oh, I'd love it because I love this Easter egg thing. And we're actually going to do this at some point. So yeah, podcast Easter egg scavenger hunt. So hiding subtle hints,

uh, inside certain podcasts and you got to find them. That one's fun. Love that idea. And then last but not least hyper-personalized social ads powered by AI insights. All right, pretty good. Uh, nothing, uh, nothing crazy here. So I have run this, uh,

I did do some of these tests last night. And last night when I turned on the search mode, it did a little bit better of a job. So here's the thing. Generative AI, large language models, unless I tell it in the prompt to explicitly go research on the web, even if I click that search button, sometimes it will, sometimes it won't, right? So, you know, I'm just curious. I'll probably just run that one one more time because I'm actually just curious. And I'm going to say use chat GPT search.

before you start to better understand everyday AI by Wilson. Yeah, because I ran this exact same prompt last night. And in this version that I just did live for you guys,

Like the whole point is like, oh, watch when I click search, right? It didn't search. Sometimes it does. Sometimes it doesn't. That's just how large language models work, right? Unless you explicitly tell it to. And when you explicitly tell it to search and you have that search icon, 95% of the time it actually will. But I was actually a little bit surprised. All right. So I'm going to let that run. And then we're going to do just our last one here. All right.

And then I'm going to read this one and then we're going to check in on the second attempt. So this last one is create a new company and brand for a future smart home device. This will solve a problem that does currently not exist. I like this one to start, come up with the company's name and its first flagship product, give the product a name, brand and campaign, go to market strategy, tagline and rationale for why it will work.

And then I said, respond in a succinct way, keeping responses to short bullet points, but with ultra specific facts. All right. So now I'm going to click rewind and look at that same come up with, you know, inventive ways to grow the everyday AI podcast. But this time, even though I had the search button clicked, I had to explicitly tell it, yo, go to the web, go to the web, homie. And now I see here in the responses, it actually did this time because now it's citing things.

So yeah, last night when I ran this, it actually gave me some citations within the actual answers. So in this one, it just did it at the end. So again, generative AI is generative, right? Especially if you're just doing these copy and paste prompts, which I never recommend, but for live demos, that's the best way to do it. All right.

Cause I can't sit here and go through a whole prime prompt polish to get the most out of this. Uh, right. But you'll see even just being a little more explicit and telling it, yo, go search the web. Even though I clicked that search button, uh, it didn't do it the first time. All right, let's look at our last one. And then we're going to wrap this show up y'all. All right. So pretty good chain of thought here. It only thought for 17 seconds, which isn't a lot. Uh, and we'll see if it actually used the web. I had the search button clicked, but I didn't explicitly tell it to. So maybe it did. Maybe it didn't.

One other thing that it doesn't do, and I wish it did when you can see the summarized version of chain of thought, I wish that it would show you if it did go to any websites and if it is using that to think, right? Uh, cause all you get is, um, citations in the response. I wish you could see like you can in the deep research. Cause in deep research, there's an activity tab. So you can see, oh, we went to website one.

And now on website one, it found out this and then it pivoted and it looked at something else. So I wish we got a little bit of that in 03 mini when you tell it to use the internet, but you don't. All right, let's just see the responses for this innovative smart device that solves a problem that doesn't exist. So the company name is Zenovate Smart Living and its mission statement is to craft intelligent, adaptive living spaces that optimize mental wellbeing and productivity in a hyper-connected future.

So here's what it does. It is a smart home hub that gathers biometric data such as EEG, HRV, which I think is heart rate value, via integrated sensors and wearables to continuously gauge user stress, focus, and fatigue levels. It dynamically adjusts ambient lighting, temperature, acoustics, and even scent diffusion to create a personalized cognitive sanctuary. Okay.

I mean, if I was like Tony Stark rich, I would just pay to develop this. This sounds pretty cool. So, okay. Oh, three mini high. Uh, pretty, pretty good job on that. That's open-ended. There's no right or wrong answer. I've run this on, you know, all different models. And this is probably one of the better responses I've got. Normally it's just kind of boring stuff stuff. Uh, and you'll see in this one here, right? Uh, it did, uh, oops, I gotta go down. So it did also, Oh, got a little confused here. Let me see.

Okay, interesting, because it's actually now kind of melding. Wait, is that right? Hold up. Yeah, so it's bringing in some everyday AI aspects into this Nero haven, which it shouldn't have, right? But that's kind of why you have to always use these properly, right? Generally, I would start a new chat. I would go through, kind of quote-unquote train it, take it through our prime prompt polish, go through refine queue, so it doesn't pull in information from the rest of the chat. But so...

What do you think, y'all? Are you impressed with O3 Mini? Let me just say this. Benchmarks, outstanding. Even the free model. So I will say yes right now, but this could change next week. Right now, it is the best free chatbot model in the world. Although, like I said, I think probably everyone out there, every single business person,

Every single business should be paying for either a Teams account or an enterprise account for whatever large language model environment you want to work with, whether that's Microsoft 365 Copilot, which I highly recommend, ChatGPT Enterprise, Google Gemini for Workspace.

Claude enterprise. Sure. Yeah. Yeah. If, if, if you're fine, not having access to out the day information, sure. Uh, right. But you should always, always, always be paying for a team enterprise subscription in the same way that your employees need, like, you know, Microsoft word, or they need, you know, word docs, they need certain software, right. That costs money. Your team needs a paid account. So let me get that out of the way. I'm not telling you to not pay for this.

But even for a free plan, I am excited because here's what this means. A year ago, I said, don't touch chat GPT's free plan. It is absolutely terrible. It is riddled with hallucinations because you were using the 3.5 version, which is bad, right? It wasn't connected to the internet. So a lot of what was ultimately shared online, uh,

It was just bad stuff, right? Because people that didn't know AI, they would just go in, create a free account, do a prompt or two, not knowing how large language models work, not understanding generative AI. They'd get a response that was absolutely horrible. They'd post that online or take that back to their director or their board. And they're like, look, AI is not for us. Well, sorry, that was dumb if you did that. I don't know, y'all.

2025, I'm a little spicier. I'm a little more tired. I'm a little older. I'm not going to be nice anymore. Right? I'm tired of people not knowing how to use AI. And then you go through and you get a bad output and you share it on social media and you're like, oh, yeah, I will never take my job. And I'm like, yeah, it will. It 100% will. Because all you did is you just went out there and said, hey, I don't know how to use AI. Right? A kind of funny comparison that I made to this. This is like if I, right?

I'm going to do something live here. Sorry if you're on the treadmill and you want to end this, right? But this has to do with free chat GPT, I swear, right? So this is actually, you can't see this because of the green screen thing, apparently. Let's see. Can you see this one? There we go. So this is like me if I draw something, right? Can you guys see what I drew here? Livestream audience, can you see this? I'm making a point, I swear, all right?

So this is if I posted this online and said art sucks. Look at this. Art sucks. There's no room in the business world for anything artistic because look at this, right? I drew a picture of a stick figure. Art sucks, right? No, art doesn't suck. I suck at art.

There's definitely a place for art in the world. So that's what I think the old free version of chat GPT did for like the business world. It was a bunch of people that had no clue what they were doing. They would go on, use a bad word.

version of GPT, GPT 3.5 that wasn't connected to the internet. Because when everyone's trying to figure AI out, they're not always paying for the best models, right? And they're like, look, this is bad. It's generic. It's full of hallucinations. AI stinks. No, you stink. You stink. But now, hopefully in 2025 and beyond, we'll avoid that. Because now, I think OpenAI's O3 Mini is

is the best free AI model in the world. And it has now closed the gap. Yes, albeit on a very limited basis because you can't use a ton of messages, right? But it's at least closed the gap between what the rest of the world can access and get a taste of and what those that are paying for the best model have as well. All right. I hope that was helpful, y'all. If so, remember, go check out

our AI predictions series. It's all online. I cannot recommend that enough. And I'm going to continue to demand you go listen to that because even the things I was talking about two weeks ago have already started to come true, obviously. And if this was helpful, right? The combination of having a reasoning model that can search the internet when you prompt it to, it is mind boggling, mind bogglingly good.

It is, I think, and if you didn't go share the deep research episode, you missed out because that guide was fantastic. But I do have 20 business use cases that are ready to go. You got to read it. You got to update some placeholders. You got to think, right? But when you combine the O3 mini reasoning model with search methods,

This changes what's possible. All right. So go repost this show. If you're listening on the podcast, I always leave the link to go repost this show. If you want to, I'd appreciate that. I'd appreciate you also go to your everyday AI.com sign up for the free daily newsletter. Thanks for tuning in. Hope to see you back tomorrow and every day for more everyday AI. Thanks y'all.

And that's a wrap for today's edition of Everyday AI. Thanks for joining us. If you enjoyed this episode, please subscribe and leave us a rating. It helps keep us going. For a little more AI magic, visit youreverydayai.com and sign up to our daily newsletter so you don't get left behind. Go break some barriers and we'll see you next time.

EP 456: OpenAI’s o3-Mini - The world’s best free chatbot model?

Everyday AI Podcast – An AI and ChatGPT Podcast

Deep Dive

OpenAI的O3 Mini：颠覆我以往对免费AI模型的认知

Shownotes Transcript

We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

EP 456: OpenAI’s o3-Mini - The world’s best free chatbot model? 58:08 Share

Everyday AI Podcast – An AI and ChatGPT Podcast

Deep Dive

OpenAI的O3 Mini：颠覆我以往对免费AI模型的认知

Shownotes Transcript

We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

EP 456: OpenAI’s o3-Mini - The world’s best free chatbot model?