OpenAI Chief Research Officer Mark Chen is here to talk about the release of GPT-4.5, the company's largest and best model yet, which is coming out today. We'll dive in right after this.
Welcome to Big Technology Podcast, a show for cool-headed, nuanced conversations about the tech world and beyond. We're joined today by Mark Chen, the Chief Research Officer at OpenAI, who's here to talk about the company's newest release, GPT-4.5. Yes, it's finally here, and it is debuting today. Mark, great to see you. Welcome to the show. Thank you so much for having me on.
Thanks for being here. This is, in four and a half years of the show, our first OpenAI interview, so hopefully the first of many. We appreciate you jumping into the water like this. And it's on big news with the release of GPT-4.5.
Yeah. So GPT-4.5, really, it signifies the latest milestone in our predictable scaling paradigm. So previous models that have fit this paradigm have been GPT-3, 3.5, 4, and now this is the latest thing. It signifies an order of magnitude improvement over the last models, kind of commensurate with the jump from 3.5 to 4. I think the question that most of our listeners are going to be asking, and certainly we've asked
on our show in the past couple of months is why isn't this GPT-5? I mean, what is it going to take to get to GPT-5?
Yeah. Well, I think GPT-5, you know, whenever we make these naming decisions, we try to keep with a sense of what the trends are. So, again, when it comes to predictable scaling, going from 3 to 3.5, you can kind of predict out what an order of magnitude of improvement, in the amount of compute that you train the model with and in terms of efficiency improvements, will buy you.
We find this model aligns with what 4.5 would be. So we want to name it what it is.
Okay. But there's been so much talk about when GPT-5 is going to come. Correct me if I'm wrong, but I think there's been a longer wait between GPT-4 and 4.5 than there was between, let's say, GPT-3.5 and 4. And I don't know, is this because we're seeing a lot of hype from OpenAI folks on Twitter about what's coming next?
Maybe this is the most impatient industry in the world, with the most impatient users in the world. But it seems to me like the expectations for GPT-5 are built up pretty high. And so I'm curious, from your perspective, do you think it's going to be hard to meet those expectations whenever that happens, whenever the GPT-5 model does come out?
Well, I don't think so. And one of the fundamental reasons is because we now have two different axes on which we can scale, right? So GPT-4.5, this is our latest scaling experiment along the axis of unsupervised learning, but there's also reasoning. And when you ask about why there seems to be a little bit bigger of a gap in release time between 4 and 4.5, we've been really largely focused on developing
the reasoning paradigm as well. So I think our research program is really an exploratory research program. We're looking into all avenues of how we can scale our models. And over the last one and a half, two years, we've really found a new, very exciting paradigm through reasoning, which we're also scaling. And so I think like GPT-5 really could be the culmination of a lot of these things coming together.
Okay. So you talk about how there's been a lot of work toward reasoning. We, of course, have seen that with o1. There's a lot of buzz about DeepSeek. And now we're talking about, again, one of the more traditional scaled-up large language models with GPT-4.5. So the big question here, I think, that was on a lot of people's minds when it came to this upcoming release. We thought it was going to be 4.5, or 5. Anyway, it doesn't matter. The big question is,
Can AI models continue to scale when you add more compute, more data, and more power to them? It seems like you have an answer to this. So I'm curious to hear your point of view on what you've learned about the scaling wall, given your development of this model, and whether we're going to hit it, whether we're already seeing some diminishing returns from scaling.
Yeah, I really have a different framing around scaling. So when it comes to unsupervised learning, you want to put in more ingredients: compute, algorithmic efficiencies, and more data. And GPT-4.5 really is proof that we can continue the scaling paradigm. And this paradigm is not the antithesis of reasoning, right? You need knowledge in order to build reasoning on top of. A model can't go in blind
and just learn reasoning from scratch. So we find these two paradigms to be fairly complementary, and we think they have feedback loops on each other. So yeah, GPT-4.5, again, it is smart in different ways from the ways that reasoning models are smart, right? When you look at the model today, it has a lot more world knowledge. When we look at comparisons against GPT-4o, you see that for everyday use cases, people prefer it by a margin of 60%. For productivity and knowledge work against GPT-4o, there's almost a 70% preference rate. So people are really responding to this model. And it's this knowledge that we can leverage for our reasoning models in the future.
So what are some examples? You talk about everyday knowledge work; what are some of the things you would use GPT-4.5 for where you would prefer it over a reasoning model? Yeah, so I...
I wouldn't say that; it's a different profile from a reasoning model. So with a larger model, it takes more time to process and think through the query, but it's still giving you an immediate response back. So this is very similar to what GPT-4 would have done for you. Whereas with something like o1, you get a model where you give it a query, and it can think for several minutes.
And I think these are fundamentally kind of different trade-offs, right? You have a model that immediately comes back to you, doesn't do much thinking, but comes up with a better answer versus a model that thinks for a while and then comes up with an answer. And we find that in a lot of areas like creative writing, for instance.
Again, this is stuff that we want to test over the next one or two months. But we find that there are areas, like creative writing, where this model outshines reasoning models. Okay, so writing; any other use cases? Yeah, so there's writing, and I think some coding use cases as well. We also find that there are some particular scientific domains where it shines in terms of the amount of knowledge it can display.
Okay, and I'm going to come back to benchmarks in a moment, but I want to stay on this scaling question, because I think there's been a lot of conversation about it in public, and it's great to be speaking with you from OpenAI to get to the bottom of what's happening. So the first question that folks have is,
You end up at this size, and you don't talk about the size of the models, which is fair. But they're big, right? This is the largest model that OpenAI has ever released, GPT-4.5. So I'm actually curious to hear: at this size, does adding similar amounts of compute and similar amounts of data get you the same returns that it used to? Or are we already starting to see the returns from adding these resources tail off?
We are seeing the same returns. And I do want to stress that GPT-4.5 is that next point on this unsupervised learning paradigm. And we're very rigorous about how we do this. We make projections based on all the models we've trained before on what performance to expect. And in this case, we put together the scaling machinery, and this is the point that lies at that next order of magnitude.
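A rough sketch of what "making projections based on all the models we've trained before" can look like: fit a power law to (compute, loss) points from smaller runs and extrapolate it one order of magnitude out. All numbers below are made up for illustration; OpenAI's actual scaling data and methodology are not public.

```python
import numpy as np

# Hypothetical (training compute, eval loss) points from smaller runs.
# None of these numbers are real; they only illustrate the method.
compute = np.array([1e21, 1e22, 1e23, 1e24])  # training FLOPs
loss = np.array([2.60, 2.30, 2.04, 1.81])     # held-out loss

# A power law loss ~ a * C^(-b) is a straight line in log-log space,
# so a linear fit on the logs recovers the exponent and prefactor.
slope, intercept = np.polyfit(np.log(compute), np.log(loss), 1)

# Extrapolate to the next order of magnitude of compute.
predicted = np.exp(intercept) * (1e25 ** slope)
print(f"exponent: {slope:.3f}, predicted loss at 1e25 FLOPs: {predicted:.2f}")
```

The "predictable" part is exactly this: if the new training run lands on the extrapolated line, the scaling paradigm is holding.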
So what's it been like getting here? I mean, again, we talked about how there was a period of time that was longer than the last interval, and part of that was focused on reasoning. But there have also been some reports that OpenAI had to start and stop a couple of times to get this to work.
And it really had to fight through some thorny issues to get it to be this step change, as you're saying. So talk a little bit about the process, and maybe you can confirm or deny some of the things that we've heard about having to start and stop again and retrain to get here. Actually, I think it's interesting that this is a point that's attributed to this model, because in developing all of our foundation models, right,
they're all experiments, right? Running all the foundation models oftentimes does involve stopping at certain parts, analyzing what's going on, and then restarting the runs. And I don't think that this is a characteristic of GPT-4.5; it's something that we've done with GPT-4, with o-series models. They are largely experiments, right? We want to go in, diagnose them in the middle, and if we want to make some interventions, we should make interventions. But
I wouldn't characterize this as something that we do for GPT-4.5 that we don't do for other models. We've already talked a little bit about reasoning versus these traditional GPT models, but it makes me think of DeepSeek and
I think you already gave a pretty compelling answer as to what you would use one of these models for versus a reasoning model. But there's another thing that DeepSeek did that is worth discussing, which is that they made their models much more efficient. And it's kind of interesting: when I talked to you about, all right, you need data, you need compute, you need power, you're like, yeah, and you need model optimizations, which is something that people often overlook. And just going back to DeepSeek for a moment,
The model optimization, the fact that they went from basically querying the entire knowledge base to a mixture of experts where they were able to sort of route the queries to certain parts of the model instead of lighting it all up,
is credited with helping them get more efficient. So I just want to turn it over to you, without commenting on what they did, or commenting if you want. But I'm actually more curious what OpenAI is doing on that front, and whether you did similar optimizations with GPT-4.5. Are you able to run these large models more efficiently? And if so, how?
Yeah, so I would say kind of the process of making a model efficient to serve, I often see as fairly decoupled from developing the core capability of the model, right? And we see a lot of work being done on the inference stack, right? I think that's something that DeepSeq did very well. And it's also something that we push on a lot, right? We care about serving these models at cheap cost to all users. And we push on that quite a bit.
So I think this is irrespective of GPT-4 or reasoning models. We're always applying that pressure to be able to do inference more cheaply. And I think we've done a good job of that over time. The costs have dropped by many orders of magnitude since we first launched GPT-4.
And so, I mean, maybe tell me if this is too elementary, but the move toward, for instance, mixture of experts. Is that more of a reasoning thing, or can you apply that in GPT-4.5? Yeah, so that is an architectural element of language models. I think pretty much all large language models today utilize mixture of experts. And it's something that applies equally to efficiency wins in language
foundation models like GPT-4 or 4.5 as it does to reasoning. So you were able to use that here as well, basically? Yeah, we've definitely explored mixture of experts as well as a number of other architectural improvements in GPT.
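For readers unfamiliar with the idea, here is a minimal, illustrative sketch of mixture-of-experts routing in plain NumPy: a small router scores the experts and only the top-k run for each token, which is the "route the query to certain parts of the model instead of lighting it all up" behavior described above. Sizes and weights are toy values, not anything from GPT-4.5 or DeepSeek.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy mixture-of-experts layer: 8 experts, but each token runs only its
# top-2, so most expert parameters stay idle on any given token.
n_experts, d_model, top_k = 8, 16, 2
experts = [rng.standard_normal((d_model, d_model)) / np.sqrt(d_model)
           for _ in range(n_experts)]
router = rng.standard_normal((d_model, n_experts)) / np.sqrt(d_model)

def moe_forward(x):
    """Route token vector x to its top-k experts and mix their outputs."""
    scores = x @ router                   # one score per expert
    chosen = np.argsort(scores)[-top_k:]  # indices of the top-k experts
    weights = np.exp(scores[chosen])
    weights /= weights.sum()              # softmax over the chosen experts
    # Only the chosen experts' matrices are ever multiplied.
    return sum(w * (x @ experts[i]) for w, i in zip(weights, chosen))

token = rng.standard_normal(d_model)
out = moe_forward(token)
print(out.shape)
```

The efficiency win is that compute per token scales with top_k, not with the total number of experts, while total parameter count (and thus capacity) can keep growing.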
Okay, great. So we have a Discord with some members of the Big Technology listener and reader group. And, you know, it's kind of interesting to be talking with you right now about an extremely large model, because a theme that the people in Discord can't stop talking about is how small, niche models are potentially going to be the future. I'll just read you one comment that we had over the past few days: "For me, the future is very much aligned with niche models existing in workflows and less so of these general purpose God models."
So clearly OpenAI has a different thesis here. And I am curious to hear your perspective on what we get with the big models versus the niche models. And do you see them in competition, or as complements? Help us think through that. Yeah, yeah. So I think one important thing is we also serve models that are smaller, right? We serve our flagship frontier models, but we also serve mini models, which are cost-efficient ways that you can access capabilities fairly close to frontier capabilities, for much lower cost. And we think that's an important part of this comprehensive portfolio here. Fundamentally at OpenAI, though, we're in the business of advancing the frontier of intelligence. And that involves developing the best models that we can. And I think what we're really motivated by is pushing that out as much as possible. We think there's always going to be use cases at the frontiers of intelligence.
We think that going from the 99.9th percentile in mathematics to the best in the world in mathematics, right? That difference means something to us. I think what the best human scientists can discover is tangibly different from what you or I can discover. So we're motivated by pushing the intelligence frontier as far as possible. And at the same time, we want to make these capabilities cheaper and more cost-effective to serve for everyone.
So we don't think the niche models will go away. We want to build these foundation models and also figure out how to deliver these capabilities at cost over time. So that's always been our philosophy. There's always going to be some juice there in those last bits of intelligence. Yeah, so let's talk about that, because we have a debate on the show often: what matters more, the products or the model? I'm on team model. We have Ranjan Roy, who comes on on Fridays. He's team product. He's basically like, just take what you have now and prioritize it. And I say, well, you could probably do more with a better model. But I have to be honest, I'm kind of at a loss for words sometimes about what getting from that 99.9th percentile in math to the best in the world in math will do. So actually I'm curious to hear your answer on this one. What does building the best model in the world do that you couldn't do otherwise? Yeah, 100%. And I think really,
It signals a shift, right? I think if you just think about, hey, you take the current models and you build the best surface for them, that's certainly something you should always be doing and exploring. I think three years ago, that looked like chat, right? We launched ChatGPT. And today, when you take the best models and the best capabilities, I think it looks a little bit more like agents. And I think reasoning and agents are very, very much coupled.
When you think about what makes a good agent, it's something that you can kind of sit back, let it do its own thing, and you're fairly confident it'll come back with something that you want. And I think reasoning is the engine that powers that. You have the model go and try something out, and if it can't succeed on the first try, it should be able to be like, oh, well, why didn't I succeed, and what's a better approach for me to do?
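The try, reflect, and retry loop Chen describes can be sketched in a few lines. The task here is a stand-in toy (searching for a number by adjusting after feedback); a real agent would call an LLM for both the attempt and the reflection steps.

```python
def solve_with_reflection(attempt, check, revise, approach, max_tries=10):
    """Generic agent loop: act, check the result, reflect, and retry."""
    for _ in range(max_tries):
        result = attempt(approach)
        ok, feedback = check(result)
        if ok:
            return result
        # Reflection step: ask why the attempt failed and revise the approach.
        approach = revise(approach, feedback)
    return None

# Toy task: find the number whose square is 49, starting from a bad guess.
attempt = lambda guess: guess
check = lambda g: (g * g == 49, "too low" if g * g < 49 else "too high")
revise = lambda g, fb: g + 1 if fb == "too low" else g - 1

print(solve_with_reflection(attempt, check, revise, approach=3))  # → 7
```

The reasoning model is what makes the check and revise steps trustworthy enough that you can "sit back and let it do its own thing."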
I think very much the capabilities are always changing, and the surface is always changing in response. We're always exploring what the best surface for the current capabilities looks like. I'm on your team here. Again, just to hammer home on this, what does that improvement in the model get you? What do you think it will enable?
Yeah, yeah. So, I mean, agents of all forms, right? When you look at stuff like deep research, for instance, it gives you the ability to essentially get a fully formed report on any single topic that you might be interested in. I've used it to even put together hour-long talks. And it goes and really synthesizes all the information out there, organizes it, comes up with lessons, and allows you to do deep discovery. It allows you to dig into almost any topic that you're interested in. So I feel like...
just the amount of information and synthesis that's available to you now is just really rapidly evolving.
So basically it's not as simple as just going and making deep research, the product, better with the model you have now. Am I reading between the lines the right way in saying that what you're expressing here is that if you make the model better, then the product is going to get better inherently? Take deep research, for instance. 100%, 100%. Yeah. And that's something that is not enabled unless you have models of a certain level of capability, both in reasoning and in the foundational unsupervised learning sense.
Okay. You know, it's interesting. There's this one question I've had in the back of my mind, and I'm just going to ask it to you again, just so I'm sure I'm clear on it. My view, maybe erroneously, was that your industry was just going to move from these massive models to the massive models with reasoning. But you're actually saying that there's a dual track here.
Yeah, yeah. So I think we're always pushing the frontier, right? And we, I think, even since, you know, five, six years ago, the prevailing way to do that was to up the scale, right? And so we've been upping the scale in unsupervised learning. We've been upping the scale in reasoning. But at the same time, right, you care about serving mini models. You care about serving models that are cost effective, that can deliver capabilities at a cheaper cost.
And that will often be sufficient for a lot of use cases, right? And the mission isn't just about pushing the biggest, most costly models. It's about having that and also a portfolio of models that people can use cheaply for their use cases.
Okay, so let's quickly talk before we leave about the upgrades that you're seeing in 4.5 compared to 4. So I'm curious if you can just run us through, at a very high level, the benchmarks it hits versus the benchmarks of the previous models. And then I'll just throw a double question in here. I've already read your blog post, and so I have an idea of what's coming. By the way, we're going to release this just as the news is released. So
it seems like you're also making a statement in some ways, saying, yes, we have the traditional benchmarks, but we also need to measure how this model works with EQ as opposed to just pure intelligence. So yeah, just hit us with the benchmark improvements, and then why you think it's important for us to look at both of these in conjunction. So, I mean, along all traditional metrics, things like GPQA and AIME, the traditional benchmarks that we track, this does signify an order of magnitude improvement
at about the same level of jump from 3.5 to 4. There's a kind of interesting focus here also on, I would say, more vibes-based benchmarks. And I think that's actually important to highlight, because every single time we've launched a model, there is a discovery process of what the interesting use cases out there are going to be. We notice here it's actually a much more emotionally intelligent model. You can see examples in the blog post later today,
like how it responds to queries about a hard situation, or advice in a particularly difficult situation. It responds in a more emotionally intelligent way. There's also, and this may be a kind of silly example, right? But if you ask any of the previous models to create ASCII art for you, they mostly just fall down. This one can do it.
Almost flawlessly. Pretty well. And so there are just so many footprints of improved capabilities. And I think things like creative writing will showcase this.
One of the things that I think I picked up in the examples that you've given so far is that it doesn't seem to feel the need to write a thesis for every response. One user was like, I'm having a hard time, and it actually responded succinctly, as a human would, as opposed to the traditional, here are three paragraphs of self-care routines you can do for yourself. Yeah, yeah, yeah. And that speaks to the emotional intelligence, right? It's not like, oh,
I see that you're feeling bad, here are five ways you could feel better, right? That just doesn't feel like a grounded, compassionate response. And here you just get something that's direct, to the point, and really invites the user to say more. So I think there's going to be a criticism; I'm anticipating it, so let's talk about it right now. People will say, okay, OpenAI was talking about these traditional benchmarks, and now it's talking about emotional intelligence. It's shifting the goalposts and wants us to pay attention to something else. What's your response there?
Well, I really don't think it's accurate to characterize this model as not hitting the benchmarks that we expect it to. When you look at the development from 3 to 3.5 to 4 to 4.5, this does hit the benchmarks that we expect. And I think the main thing is, it's all about use case discovery every time you put a new model out there.
And in many senses, GPT-4 was already very smart, right? This parallels when we were putting GPT-4 out: we saw it hit all the benchmarks that we expected it to, but what are users going to resonate with? That was the key question. And I think that's the question we're asking today with GPT-4.5 as well. And we're inviting people to be like, hey, you know, we did some early explorations. We see that it's more emotionally intelligent. We see that it's a better creative writer. But what do you see here?
Yep. All right, Mark. So I've been seeing you and we mentioned this before we started recording. I've been seeing you in all the OpenAI videos about every release. So first of all, great to speak to you live. But also,
Over the past year, we've seen a lot of exodus out of OpenAI. Maybe the media plays it up too much. Probably we do. But I am kind of curious what it's like working within OpenAI and how you see the talent bench inside the company. You recently became Chief Research Officer just a few months ago. And now, look, we have a new foundational model. So just give us a sense as to what the talent situation is inside the organization. Honestly, it's still, I think, the most world-class AI organization. I...
would say that there's a separation between the talent bar at OpenAI and any other firm out there. And when it comes to people leaving, you know, the AI landscape changes a lot. Probably more so than any other field out there. The field three months ago looked different from the field three months before that. And I think it's just natural in the development of AI that some people will have their own theses about, here's the way I want to develop AI, and go try it their own way. I think that's healthy, and it also gives an opportunity for people internally to shine. And
we've never had a shortage of people internally who are willing to step up, and we've seen that a lot. And I really just love the bench that we have here. Very cool. All right, folks, GPT-4.5 is out today for OpenAI Pro users. Next week, it's coming out for Plus, Team, Enterprise, and Edu.
Mark, great to see you. Thank you again for spending the time. You're about to go do the livestream, so I'm very grateful that you spent the time with me today. I really appreciate your time, too. Thanks for having me. Well, let's do it again soon. And folks, we shouted out the Ranjan-and-I argument; we'll go into that and more, everything we can share about GPT-4.5, coming up tomorrow on the Friday show. Thanks for listening. Thanks again to Mark and OpenAI for the interview. And we'll see you next time on Big Technology Podcast.