Michelle Pokras is one of the key people behind GPT-4.1 at OpenAI. As a post-training research lead, she played a crucial role in making these models so much better for developers. I'm Jacob Efron, and today on Unsupervised Learning, we dug into everything GPT-4.1 and more. Some of my favorite parts from my conversation with Michelle include the current and future state of agents, whether future models will be purpose-built for different groups, RFT and what it will mean for builders, and tactics for figuring out what's just out of reach for the models versus way off in the future.
We also talked about how companies can set themselves up for success with rapid AI progress and what kind of founders will win at the app layer. And we finally hit on what's next for OpenAI's agent products. It was an awesome episode with someone who's helping define the cutting edge. Before we get to the episode, I just have one plug: if you're enjoying this on Spotify or Apple Podcasts, please consider leaving a rating on the show. Ratings help us grow, which helps us continue bringing on the best guests and keep the conversations coming. Now here's Michelle Pokras.
Well, Michelle, thanks so much for coming on the podcast. Really appreciate it. Yeah, thanks for having me. Very excited to be here. Yeah, there's a ton of different things I'm excited to explore with you today around GPT-4.1. You mentioned the model has like more of a focus on real world usage and utility and less on benchmarks. And I feel like that's definitely resonated in the Twitter discourse and just people playing around with the model. How do you actually go about making that happen in practice? Yeah, it's a good question.
The real goal of this model was something that's a joy to use for developers. Often, and we're not the only ones who do this, but sometimes you optimize a model for benchmarks and it looks really great, and then you actually try to use it and you stumble over basic things like, oh, it's not following my instructions, or the formatting is weird, or the context is too short to be useful. With this model, we really focused on what have developers been telling us for a while now that they want
and how can we reproduce this feedback?
So a lot of the focus was on talking to users, getting their feedback, and then turning that into an eval that we can actually use during research. So I would say there was a pretty long lead-up before we even got to model training. We were just kind of getting our house in order on evals and understanding where actually are the biggest problems in our models. And so we actually put this in the blog post, but we have an internal instruction following eval. It's based on real API usage. It's based on what people have told us.
And this is kind of one of the North Stars while developing this model. Yeah, I'm curious because I've heard you talk about this idea of picking evals and basically going to startups and people that are building on top of the API and saying, what are the things that the models can't do? And let's try and hill climb on those. How do you go about figuring out? I'm sure everybody has their pick of 15 things they want you to optimize for. How do you go about figuring out what are the evals that matter? And any learnings on that over the course of building this? Yeah. Yeah.
I will say it's actually more of the opposite problem. They're not coming to us with, like, oh, I have these 100 evals. Please fix all of these. It's more like they're saying, ah, it's kind of weird in this one use case, and then we have to be like, what do you mean by that? We, like, actually, you know...
get some prompts going and figure it out. So I'll say a lot of the legwork has been just like talking to users and really pulling out the key insights. There's actually an interesting insight I got recently from talking to a user where it turns out our models could do better on kind of, sometimes you want to tell them, ignore everything you know about the world and only use the information in context.
This is something we would never see in an eval. Like AIME, GPQA, none of them look at this. But for this specific user, what's most important is that the model will attend only to the system instructions and ignore everything it already knows.
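To make that concrete, here is a minimal sketch of what such a context-grounding check could look like, assuming the standard OpenAI Python client; the document, question, and pass/fail rule below are illustrative, not OpenAI's internal eval.

```python
# Minimal sketch of a context-grounding eval case (illustrative, not OpenAI's internal eval).
# Idea: give the model a document that contradicts common defaults and check that the answer
# comes from the provided context only.
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

SYSTEM = (
    "Ignore everything you know about the world. "
    "Answer using only the information in the provided document."
)
DOCUMENT = "Internal handbook: our fiscal year starts on February 1."  # deliberately non-standard
QUESTION = "When does our fiscal year start?"

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": SYSTEM},
        {"role": "user", "content": f"Document:\n{DOCUMENT}\n\nQuestion: {QUESTION}"},
    ],
)
answer = response.choices[0].message.content

# Crude pass/fail grade: the answer should reflect the document, not world knowledge.
passed = "february 1" in answer.lower()
print("PASS" if passed else "FAIL", "-", answer)
```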
Back to the question of how do we determine what's most important. Basically, we see what comes up over and over again in themes with customers. Then we also use our models internally, so we have a sense for where they're not doing well. And we have internal customers building on top of our models. All of these things put together are
how we determine which set of evals to really go after. Do you have a request for evals for our listener base? Are there some areas where you're like, oh, we really would love more examples or things to test around certain areas? Yes, yes, always requesting more. I'm always pitching: we have this evals product where you can opt in and you get free inference on the evals; in exchange, we get to use them. But in particular, the things I'm interested in are more long-context, real-world evals.
It's really hard to make a long-context eval. Synthetic evals are nice for targeting really niche use cases, but if you want to answer holistically, does this work in long context, we could use more real-world ones. The other one is instruction following. This is the hardest thing to define in ML, I feel. Everyone says the model didn't follow this instruction, it's not good at this, but people actually mean hundreds of different things.
And so anything more there, I'm always interested. Did you have any favorite random evals that emerged in this process? I mean, you mentioned obviously some examples already, but any that were surprising, I guess, in things that weren't working or you thought were particularly fun ones to hill climb on? This is interesting. We tested a few different versions of 4.1.
with real alpha users and got their feedback. One customer just really preferred the first version over the fourth one, which is the one we ended up shipping. They were the only user to feel this way. All of the evals were up and to the right between these and we just could not figure out what it was. It was just some really niche use case that wasn't covered anywhere.
Hard to please everyone with these models. It's nearly impossible, but if you make something that follows instructions pretty well, then you can try to please more people by teaching them to prompt better. And then the fine-tuning offering, I think, is a really great way of pleasing more people.
A hundred percent. Well, we'll definitely dig into both of those aspects here. I'm curious: the model's been out for a few weeks now, and you were obviously testing it with plenty of people, so you had some sense of how people would use it. But it's always fun to get it in the wild and see all sorts of unexpected uses. Any unexpected things it's been able to bridge or solve that have been fun to see these last few weeks?
Yeah, I've really loved seeing a lot of the cool UIs people have been building. Actually, this is something we snuck in near the very end of training the model: much improved UI and coding capabilities. So I've seen really cool apps there. I've also loved seeing people make use of Nano. It's small and cheap and fast. And I saw, I think, Box has some
product feature where you can read 17 pages of docs. I know Aaron tweeted some results using the models, and it was a pretty impressive uplift in the core product. Yeah, it's very cool to see. The hypothesis behind Nano was: can we just spur a ton more AI adoption with
models that are cheap and fast. And it looks like the answer is yes. Like people just have demand at all points in the cost latency curve. I feel like that answer seems to have generally been yes throughout this. You know, you guys are always cutting prices and it seems to always keep spurring more demand. You know, I feel like you've been acknowledged by Sam, I know, by all sorts of folks as like, you know, really one of the ringleaders of making this whole thing happen. What is actually involved in like shipping a model like this end to end? And what work are you guys doing behind the scenes to kind of like make this happen?
Yeah, it's a great question. So obviously there's a large team behind the scenes. And so we have basically these three models are each kind of a semi-new pre-train. So we have the standard size, the mini, and the nano. So really great work from the pre-training teams. What does a semi-new pre-train mean?
Yeah, it's a good question. I mean, it's kind of like, we call it a mid-train. It's a freshness update. And so the larger one is a mid-train, but the other two are new pre-trains. And then my team works a lot on post-training. So we've been focusing a lot on, you know, how do we determine the best mix of data, or how do we determine the best parameters for RL training, or how do we determine the weighting of different rewards? And so back to, like, how this all came to be, I think
We started realizing a lot of developers had a lot of pain points with 4o. And we went, I would say, three months in on evals and figuring out what the real problems were. And then I would say the next three months was kind of a flurry of training. And so we would just run tons of experiments. Like, how does this data set work? Or what if we tweak these parameters? And then that all kind of linked up with these new pre-trains and
And then we finally had about one month of alpha testing where we were training stuff really rapidly and getting feedback and trying to incorporate that as much as possible. You know, part of this was gathering these evals. Does it feel like that set of evals is still relevant? Or do you now have to go gather a whole new set of stuff that's maybe the right stuff to hill climb on for improving upon 4.1? Yeah.
Yeah, I think the shelf life of an eval is like three months, unfortunately. Like progress is so fast. Things are getting saturated so quickly. So we're still on the hunt as always. And I think we always will be. I mean, one of the things that's so clear in the model is that you improve instruction following, you improve long context, obviously both incredibly beneficial for agents. You know, I think our listeners are always trying to figure out, like, where are we with agents? Like, how do you characterize today? Like what does work? What doesn't work? Like what is kind of the state of the field post 4.1?
I think where we are is that agents work remarkably well in well-scoped domains. So a case where you have all the right tools for the model, it's fairly clear what the user is asking for, we see that all of those use cases work really well. But now it's more about bridging the gap to the fuzzy and messy real world. It's like the user typing something into the customer support box
doesn't actually know what the agent can do and the agent maybe is missing an awareness of its own capabilities. Or maybe the agent isn't connected enough to the real world to know a certain piece of information. Honestly, I think a lot of the capabilities are there, but it's just so hard to get the context into the model. And then one area I do think we can improve is ambiguity. We should make it easier for
For developers to tune, if it's ambiguous, should the model ask the user for more information or should it proceed with assumptions? It's obviously super annoying if the model is always coming back to you and be like, should I do this? Are you sure? Can I do this? I think we need more steerability there. We've all worked with interns like that before, so I get that there's a fine balance to strike.
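One way developers can approximate that steerability today is to spell out the clarify-versus-proceed policy in the system prompt. A rough sketch, assuming the standard OpenAI Python client; the policy wording is my own illustration, not an official recommendation.

```python
# Rough sketch: steering how an agent handles ambiguity via the system prompt.
# The policy text is illustrative; tune the thresholds and wording for your own product.
from openai import OpenAI

client = OpenAI()

AMBIGUITY_POLICY = """
When a request is ambiguous:
- If the missing detail is low-stakes (formatting, tone), proceed with a sensible default
  and state the assumption you made in one sentence.
- If the missing detail changes the outcome (deleting data, spending money, contacting
  someone), stop and ask exactly one clarifying question.
Never ask more than one question per turn.
"""

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[
        {"role": "system", "content": "You are a support agent for Acme." + AMBIGUITY_POLICY},
        {"role": "user", "content": "Cancel my subscription."},  # ambiguous: which plan? effective when?
    ],
)
print(response.choices[0].message.content)
```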
You want some delegation, but not too much. It sounds like the underlying capabilities of the models in many ways aren't being fully shown just because we haven't connected enough context in or tools into the models themselves. And it seems like there's a lot of improvement on just doing that. Yeah, exactly. Yeah, I will say when we look at some of the external benchmarks for function calling or agentic tool use, when we actually dig into the failure cases, like where our models are graded incorrect,
I see that they're mostly misgraded or maybe it's ambiguous or maybe they're using a user model and the user model isn't following instructions well enough. And so we're actually struggling to find cases where the model actually just does the wrong thing.
There obviously are those, but most of the benchmarks, I would say, they are saturated. I imagine over the next 6-12 months, a lot of that stuff gets added in. There's more tools, more context. I feel like one of the gaps remains longer-term task execution. How do you think about what needs to be done to continue making progress toward some of these longer, more ambiguous, many-step tasks? Yeah. I think we need changes on the
on the engineering side and the model side. So on the engineering side, we need APIs and UIs where it's much easier to follow along with what the agent's doing, a summary of what they're up to, a way to jump in and change the trajectory. We have that in Operator. It's pretty cool. You can kind of jump in and steer. But you don't have that as much for other things in our API. And so I think that's a core capability on the engineering side. And on the modeling side, I think...
we need more robustness when things go wrong. Obviously, sometimes your API will have a 500 and the model will kind of get stuck. And so I think we're hoping to train in more robustness, and grit is another way we think about it sometimes. Another part of the models that I think everyone's noted, and obviously you have in the benchmarks, is just how much better they are at code. So I guess, to start there, how do you characterize where we are with AI and code, what works, what doesn't? Yeah.
Yeah, totally. So I think where we are for code is that 4.1 and some of our other models are remarkably good when the problem is locally scoped. So maybe you're asking the model to change some library and all of the files are near each other and it makes a lot of sense.
But we see the SWE-bench tasks that we're missing, for example, are those where the model really needs global context. It needs to reason about many different parts of the code, or maybe there are some extremely technical details in one file and you're trying to pass them into another. So I would say we're still improving that global understanding.
I also think we've made a really big improvement on the front-end coding, but I still would love to keep improving. We should not only produce front-end code that's beautiful, but a front-end engineer should be proud of it. There's some linting stuff there, and code style is another top focus area for us. And finally, I think another thing we're always going to keep improving is changing only what you ask for and not everything else.
The model should adapt to the style of your code and not inject its own style too much. On our internal evals, we see these irrelevant edits went from, I think, 9% with 4o to 2% with 4.1. But obviously 2% is not zero, and so it's something we're going to continue improving. What does that mean for how you end up using it in your day-to-day coding?
Yeah. I manage a team now, so there's not that much. Alas, the inevitable trajectory of doing well at these companies. But I do use Codex, and I have honestly still been using GitHub Copilot. It's still a great product, and I also dabble with Windsurf and Cursor. So in and out. But Codex is really cool, the way it does stuff independently. And I think...
The main model I use there is o4-mini, just for speed. You know, obviously you've kind of alluded to this: there are lots of benchmarks, and people are always debating whether benchmarks are still relevant. I think you guys even added some into 4.1. There's been this feeling in coding for a while, for example, that benchmarks don't tell the full story and you kind of know it when you use it. To what extent is that true? And what's your overall view on the state of these benchmarks today and how useful they are?
Yeah, I do think SWE-bench is still a useful benchmark. Like, the actual differences between a model that can achieve 55 versus 35 are staggering. I think the Aider evals are still super useful. But then there are the ones that are just fully saturated and not useful. Basically, you've got to...
get the most out of an eval during its lifespan and then move on and create another one. The three-month shelf life definitely is tough. Yeah, there are going to be successors to SWE-bench once that's saturated, for sure.
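That churn is also why, as Michelle notes later in the conversation, the fastest-moving teams keep their own evals ready to re-run the day a new model drops. A minimal sketch of such a harness, assuming a hypothetical cases.jsonl of prompt/expected pairs and a naive substring grader; real grading would be task-specific.

```python
# Minimal sketch of re-running a private eval set against newly released models.
# Assumes cases.jsonl lines like: {"prompt": "...", "expected": "..."}. The substring grader
# is a placeholder; swap in your own task-specific grading.
import json
from openai import OpenAI

client = OpenAI()

def run_eval(model: str, path: str = "cases.jsonl") -> float:
    """Return the pass rate of a private eval set against the given model."""
    passed = total = 0
    with open(path) as f:
        for line in f:
            case = json.loads(line)
            answer = client.chat.completions.create(
                model=model,
                messages=[{"role": "user", "content": case["prompt"]}],
            ).choices[0].message.content
            passed += case["expected"].lower() in answer.lower()  # naive grader
            total += 1
    return passed / total if total else 0.0

# Re-run the same cases against each model you're considering.
for model in ["gpt-4.1", "gpt-4.1-mini", "gpt-4.1-nano"]:
    print(f"{model}: {run_eval(model):.0%}")
```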
Yeah. I mean, one thing I think is so interesting about 4.1 is that you guys have been very explicit: this was built for developers, and there are all these evals you did to make it better for the things developers were asking you for. It kind of does beg the question of how the OpenAI model family evolves from here, because you could imagine a pre-trained model that's post-trained for different end users or domains or tasks. I'm sure you guys learned a ton building this model for this explicit end group. How do you think about that? In general, my philosophy is that
we should really lean into the G in AGI and try to make one model that's general. And so ideally, I think going forward, we're going to try to
simplify the product offering, try to have one model for both use cases, and simplify the model picker situation in ChatGPT as well. But for 4.1, we thought there was a particularly acute need, and we thought we could move a lot faster at this problem if we could decouple from ChatGPT. So this let us train models, get feedback much quicker, ship on a different timeline.
And it also let us make some interesting choices with model training. So we were able to remove some of the data sets specific to ChatGPT, and we were able to up-weight the coding data significantly. And so this is stuff you can do when you're kind of targeting a separate domain. But in general, I do expect us to simplify. And I think the models get better when the creative energies of all researchers at OpenAI are working on them.
rather than the subgroup focused on the API right now. Well, it also seems like there's been massive cross-domain generalization anyway, where in general it feels like putting it all into one model has been beneficial. But it's interesting, obviously, that this has been such a success with that more targeted approach. Yeah, there's room for both, I think. Sometimes it makes sense to eject and ship the thing for a user you know really well. Do you think that's something you guys might do again? Yeah, I think it's possible. I mean, we...
We make a lot of changes on the fly as we see what demand is there, and it's definitely possible. Well, one thing I obviously hear from folks all the time is you guys ship models very rapidly. I know the naming has always been debated ad nauseum about how many different models there are. I feel like companies are trying to stay on top of what the cutting edge of model capabilities are. Any best practices you've seen from what companies do to just stay on top of? It feels like a new model drops every month in this space.
And how would you be thinking about it if you were one of the users of these APIs? It's all back to evals, unfortunately. The most successful startups are the ones who know their use case really well, have really good evals, and then they can just spend an hour running evals on the new models when they drop. There's also, I think, the customers that are really successful are the ones who can
switch their prompts and their scaffoldings and tune them to the particular models. So that's what I would recommend. And then the other thing is to
build stuff which is maybe just out of reach of the current models, or maybe it works one out of ten times and you'd love it to be nine. If you have these kind of use cases in your back pocket, the new models drop and things just work, then you're first to market. Do you have a heuristic you use for what's just out of reach? Obviously, I feel like it's hard to tell sometimes how soon some of these things might work. Yeah, I think if you see significant improvements in fine-tuning,
Like, let's say you're getting a 10% pass rate. You can fine-tune it to 50%. It's probably not good enough for your product yet. That's something that's right on the cusp in a future model. A few months from now, we'll probably just crush it. No, that makes a ton of sense. I mean, obviously, you mentioned kind of like the, you know, being able to switch the prompts and the scaffolding. I think one thing that, like, I think a lot about on the investment side is
you know, there are lots of companies where the models can only do what they can do today, and there's all sorts of scaffolding they build around those limitations to make their products work. And then you guys release the next great model and some of that scaffolding just gets obviated. It's like, okay, cool, the models are way better at following instructions, I don't need to do all this hacky stuff because you have this long context window now. Given that, how do you think about when it does and doesn't make sense to build some of that scaffolding, or what set of scaffolding makes sense for people to focus on? I like to
take this back to your reason for being as a startup. Your reason for being is to ship value to your users and make something people want. I think it is super worth it to build the scaffolding and make your thing work. You basically are doing a few months of arbitrage before this capability is available more easily.
But I do think it's important to keep in mind future trends. So maybe build the RAG thing for now, or maybe have your instructions five times in the prompt, although not with 4.1. But just be prepared to change things. But know where things are going. So I think context windows are only going to keep improving. I think reasoning capabilities are only going to get better. Instruction following is only going to get better.
And so just have an eye to where those trends are going. Yeah. Any other like, you know, for where things are going, like tips for folks? Yeah, I think multimodal is another one.
The models are getting so natively multimodal and easy to use in those modalities. Yeah, I feel like that's been a pretty under-discussed part of 4.1; it has pretty impressive multimodal capabilities. Yeah, honestly, huge shout-out to our pre-training teams, because these new pre-trains have just significantly improved on multimodal. And I think we will continue to see these improvements. But so many things that didn't work in 4o just work now because the models have gotten better there. And so...
It's worth it to connect the model to as much information about your task as possible, even if you're getting bad results today, because tomorrow it'll get better. You mentioned fine-tuning. I think it's interesting. I feel like we've gone through this journey with fine-tuning where early on, I feel like a lot of folks were like, I don't know how helpful this actually is. And then it feels like there's been a renaissance of fine-tuning with these newer models and how helpful it actually is. Yeah.
I guess I'm curious what you've observed. Does that arc ring true to you? How should people be thinking about this? And should more people be revisiting their prior assumptions around fine-tuning? Yeah, I think I would bucket fine-tuning into two camps. The first is fine-tuning for speed and latency. And so this is still, I think, the workhorse of our SFT offering. So 4.1 works well, but you can get it at a fraction of the latency.
But then, I think we haven't seen too much of fine-tuning for frontier capabilities. You could maybe get them in a really niche domain with SFT, but with RFT, you can actually push the frontier in your specific area.
And the fine-tuning process is so data efficient that you can just make do with like 100 samples or something on the order. So our RFT offering is actually shipping to GA next week. I guess your listeners will probably hear about it when it's out. And we're really excited about that. There's some use cases where it works really well. For example, like teaching an agent about how to pick a workflow or...
how to work through its decision process. Then there's also some interesting applications in deep tech where maybe the startup or
organization has data that other folks don't have and it's really verifiable. And from that, you can get the absolute best results with RFT. I think one thing that I've been struck by at least is it feels like across the board, the number of examples you need is not massive. I think in the early days, people were like, oh, well, some of these companies sit on tens of thousands of examples and they'll just be able to out-compete. It feels like the data really does matter, but it's maybe to the tune of a lot less examples than folks might have previously thought.
Yeah, I think these two trends are making fine-tuning more interesting where it's extremely data efficient. And also, RFT is basically the same RL process we use internally for improving our models. So we just know that it works remarkably well and it's less fragile than SFT.
And so, yeah, for those reasons, I think it's going to be really useful for deep tech and some of the hardest problems. Is this the kind of thing you think everyone should play around with? Obviously there are some cases the models can handle, but take almost anything where they aren't as accurate as folks want. Is it worth trying this for any of those cases? I think my mental model is: if it's a stylistic thing, then you should probably use preference fine-tuning, which we launched somewhat recently. If it's
something more simple, like maybe you want nano to classify things and it gets
10% of cases wrong and you can close that gap with SFT, that's great. But then for the things where just no model on the market does what you need, you should turn to RFT. And you were kind of alluding to the fact that there are some things, especially when they're verifiable, that make this easier to do. Do you have any rough rules of thumb you use for when RFT fits, the types of domains or problems that RFT will be particularly effective for,
or what these more easily verifiable domains are. It's like everyone's asking this question now outside of code and math. Yeah, I think there's stuff in chip design or in biology, just stuff like drug discovery. I think those sorts of things where maybe you need to explore, but the things that work are easily verifiable. I think those will be good applications. Sure.
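To make "easily verifiable" concrete, here is a toy example of the kind of programmatic grader that makes a domain a good RFT candidate: a deterministic score of the model's answer against ground truth. The task and numbers are hypothetical, and the actual grader configuration in OpenAI's fine-tuning API has its own schema.

```python
# Toy example of a verifiable reward: deterministic scoring of a model's answer against
# ground truth. Domains where you can write a grader like this (chip timing checks, assay
# readouts, exact numeric answers) are good RFT candidates. Hypothetical task.
import re

def grade(model_output: str, reference_value: float, tolerance: float = 0.01) -> float:
    """Return 1.0 if the model's final numeric answer matches the reference, else 0.0."""
    numbers = re.findall(r"-?\d+\.?\d*", model_output)
    if not numbers:
        return 0.0
    predicted = float(numbers[-1])  # treat the last number in the output as the final answer
    return 1.0 if abs(predicted - reference_value) <= tolerance else 0.0

# Usage: a binding-affinity-style question with a known reference answer (made up here).
print(grade("The predicted affinity is roughly 7.24 kcal/mol.", reference_value=7.25))  # 1.0
print(grade("Probably around 9.1 kcal/mol.", reference_value=7.25))                     # 0.0
```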
Certainly chip design is that. Drug discovery is the perpetual awesome use case, but sometimes it takes 10 years to figure out if something actually works in people, so the feedback loop relies on interim steps in between. But eventually, it does kind of beg the question: you see in 4.1 these multimodal capabilities, and you talk about the ability to use RFT for biology. There's always been this question of whether there are going to be
standalone types of foundation models, like a robotics foundation model or a biology foundation model, as kind of a separate class of models. What's your view on that? You mentioned the G in AGI before; does it feel like we're converging in that aspect? I kind of do. I think generalization improves capabilities a lot.
I think it remains to be seen with robotics. I guess we'll know empirically if the best robotics products are their own models.
But I do kind of think so, and the trends I see here internally are that combining everything just produces a much better result. Everyone's teased that you'll soon have one model that picks, behind the scenes, what to use for people, but obviously we don't have that yet today. So I'm curious: if I'm a company figuring out which models to use, and obviously I'll probably test a bunch of them, do you have any rough rules of thumb on which models people should be choosing for the different things they're trying to do? Yeah, totally.
It's a pretty tough decision tree, so I'm excited we're going to simplify it. Here's how I think about it. In ChatGPT, I'm obviously a ChatGPT DAU, and so my... You and me both. Yeah. My main model there is 4o, and I use 4.5 sometimes for writing or creative stuff. And then o3 is what I use for the hardest math problems. I don't know, I was filing my taxes, and I wanted them done right. So that's
somewhere where I'd use o3. Does that line up with you? Are those the models you use in chat? I wasn't sure the models were good enough yet to trust my taxes to them, so I haven't yet done them. But maybe I should; if you're saying it's good enough, that's great. Next year, I will totally go ahead and do that. I'm more double-checking my CPA.
Verify with a trusted source. But yes, that definitely lines up on the consumer side. And then I'm curious for the enterprise users, obviously I feel like you always want to go as fast and cheap as you can, but I think folks are still trying to figure out exactly when to reach for each different kind of model. Yeah, totally. So yeah, how I think about it there is
developers should just start with 4.1 and see if it works well for their use case. If it does and you're looking for faster, then I would look into mini and nano and fine-tuning those. Obviously, mini next and then nano as the smallest model. And then if some things are just out of reach for 4.1, then I would push for o4-mini and see if you can
get sufficient reasoning capabilities out of it. And then you go to o3. And then if that's not working, then you go to RFT with o4-mini. I guess on the other side of using these models, one thing I always enjoy is the prompting guides you guys release alongside these models, because it's always kind of funny, sometimes counterintuitive, the different things that help on the prompt side. Any particular favorite things that have emerged, like, oh, that's actually a really helpful way to prompt 4.1? Yeah.
I've found that XML, or structuring your prompts really well in general, works super well. The other thing is just telling the model to keep going.
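As a rough illustration of both tips together, here's an XML-delimited prompt with an explicit persistence instruction; the wording below is mine, and the GPT-4.1 prompting guide has the phrasing OpenAI actually tested.

```python
# Illustration of two prompting tips: XML-style structure and an explicit "keep going"
# persistence instruction. Wording is illustrative; see the GPT-4.1 prompting guide for
# the phrasing OpenAI recommends.
from openai import OpenAI

client = OpenAI()

prompt = """
<instructions>
You are a coding agent. Keep going until the user's request is completely resolved
before ending your turn. Do not hand back a partial answer or ask permission to continue.
</instructions>

<task>
Refactor the functions below to remove the duplicated validation logic.
</task>

<code>
def create_user(name): ...
def update_user(name): ...
</code>
"""

response = client.chat.completions.create(
    model="gpt-4.1",
    messages=[{"role": "user", "content": prompt}],
)
print(response.choices[0].message.content)
```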
I liked that keep-going one. It's something we're hoping to fix for the next model, but it is remarkable how much better performance you can get by telling the model, hey, please don't come back to me until you've solved the problem. So yeah, those are interesting and somewhat counterintuitive. How do you go about it? You've seen that keep-going thing, and obviously your cookbook shows a big impact. How do you then go about incorporating that into the next generation of models such that it isn't a thing anymore? Yeah.
Our post-training process can be pretty sensitive to the exact mix of data used. So you can imagine a post-training process where you train the model on one diff format, and then your users are using totally different diff formats and the model's a bit lost.
Whereas for 4.1, we trained the model on 12 or so different diff formats, everything we could think of. And so our goal is to really put out something that works really well and even document maybe the best one. So our prompting guide has diff formats we found that work well. We also want it to work well out of the box for developers who aren't going to read our docs, which I recognize as most. You want it to work anyway, even if you're not using the best. So we focus a lot on
on general prompting and general capabilities. And this way we don't, you know, kind of burn in the model a specific one.
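As a concrete illustration of pinning one format, here's a hypothetical prompt snippet that asks for unified diffs plus a cheap sanity check on replies; unified diff is just one example format, and the 4.1 prompting guide documents the formats OpenAI actually found to work best.

```python
# Illustration of pinning one patch format in the prompt and sanity-checking replies against
# it. Unified diff is used here as an example; the GPT-4.1 prompting guide lists the formats
# OpenAI tested and recommends.
DIFF_FORMAT_INSTRUCTIONS = """
When proposing code changes, reply ONLY with a unified diff:

--- a/path/to/file.py
+++ b/path/to/file.py
@@ -<start>,<count> +<start>,<count> @@
 context line
-removed line
+added line
"""

def looks_like_unified_diff(reply: str) -> bool:
    """Cheap sanity check that a model reply sticks to the requested format."""
    lines = reply.strip().splitlines()
    return (
        any(l.startswith("--- ") for l in lines)
        and any(l.startswith("+++ ") for l in lines)
        and any(l.startswith("@@") for l in lines)
    )

print(looks_like_unified_diff("--- a/app.py\n+++ b/app.py\n@@ -1 +1 @@\n-x = 1\n+x = 2"))  # True
```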
Yeah, though keep going is a great thing to say to our team internally too. So, you know, definitely it helps across the board. You've obviously mentioned evals as one thing that like the most sophisticated companies do well. I'm curious if like there's something that you've, you know, either maybe some of the OpenAI products or some techniques that like a select few companies are using really well and you're like, God, I just wish that like thousands of companies were using this or thinking about things this way. Yeah, I think some of my favorite developers to work with are those who know their problem really well.
really well and actually have evals for the whole problem but can break them down into specific subcomponents. And so they can tell me things like, the model got better at picking the right SQL table by this percentage, but it got worse at picking the right columns by this percentage. And it's like, wow, this level of granularity really helps you tease out, like,
what actually is working and what isn't. And then, you know, they can tune specific parts of this. So I guess, like, making your system modular and easy to plug different solutions into, I think, uh...
That takes a little time up front, but makes you move faster in the long run. I guess a question people are always asking is how much AI expertise will the leading AI app companies need versus just being good engineers that take your models off the shelf and know their end customer? Do you think long term, being able to have a sense of what data to apply on the fine tune or tweak your evals
Does that end up being a really important skill set for the app players? Or is it really like, no, they can take the models mostly off the shelf, or do a basic fine-tune, and the core AI research capability may be less important? Yeah, I'm really long generalists. So, people who understand the product and are really scrappy engineers who can do anything, things like that.
I honestly don't think you'll need that much expertise to combine these models and these solutions in the future. So yeah, I'm definitely much more bullish when I hear about a team of scrappy hackers than a bunch of PhDs with only research publications under their belt. There's so many exciting areas to continue pushing these models forward. What future areas of research are you most excited about to make these models better?
I'm really excited about using our models to make models better. This is particularly useful in reinforcement learning, when we can use signals from the models to figure out if the model is on the right track. I'm also excited about a more general research area: we're working on improving our
speed of iteration. The more experiments you can do, the more research gets done. And so it's a real focus right now to make sure we can run our experiments with the fewest number of GPUs. You basically want to kick off a job and know when you wake up in the morning whether this thing is working or not. Is that just a pure infrastructure problem? Not really. You also need to make sure that the things you're training
are at sufficient scale to get signal on what exactly it is you're experimenting with. So also some interesting ML problems there. Yeah. And then in terms of using the models to make models better and kind of signals if you're on the right track, where are we in that? Does that work? Or are we still kind of early innings of that? Yeah, it works remarkably well. I think...
Synthetic data has just been an incredibly powerful trend. So yeah, excited to push this more, but every more powerful model makes it easier to improve our models in the future. You guys have also shipped some really interesting agents. I think Deep Research probably most famously is a product that I use all the time. And basically, as I understand it, using reinforcement learning on a tool or set of tools until the model gets really good at using it
How do you imagine that type of approach scaling for agents at large? I guess it's kind of like a sub-variant of the question we were talking about earlier of building these specific models for end users or specifically doing RL on tools versus the G of generalization here. Yeah. So I think...
Deep Research is like a zero-to-one, or Deep Research and Operator are like zero-to-one-or-two, where you want to train the model really deeply on this specific thing. But I think what we've seen with o3 is that we can just train the model to be great at all kinds of tools, and actually, learning to use one set of tools makes it better at other sets of tools. So I don't expect too much tool-specific training going forward.
We've kind of proven that out, and now we can incorporate those capabilities broadly. Actually, that's one thing people really love about o3: it can do a lot of deep research, a lot of those capabilities, but quicker.
You can really go for Deep Research when you want the absolute best report, but if you want something in between, then o3 is a great fit for that. Yeah. And as the general models get better at using tools and doing some of these tasks,
Are there areas that you think will, like, be easier or harder? I mean, obviously, you guys have publicly said you'll have a coding agent. You know, I don't know if there's, like, as folks are thinking about, again, like, what's on the, you know, what capabilities are sooner rather than later. Any just, like, mental model you use of, like, yeah, I think these things would come before the next set of things. Yeah, I think, I mean, yeah, coding is obviously coming soon, given...
SWE-bench numbers are already exceeding what a lot of humans would get there, so I think the ability to supervise these long runs is there. In terms of other stuff, I think long workflows. What's interesting about o3 already is that when it calls developer-specified tools, they're already part of the chain of thought of the model. So the model can...
use the thoughts of the previous tool call and the output and think some more about what to do. And so I think because of that, the agentic capabilities, like maybe customer support or other sorts of things,
I personally think are there and just need to be hooked up with everything to make a cohesive product. Yeah, I mean, it seems like in many ways the capabilities of these models exceed the actual nitty-gritty implementation of hooking them up to things, getting enterprises ready to use them in some way. But I think there's always this big debate of, if you completely stopped model progress right now, is there
Just tens of trillions of dollars of value to be extracted from these models. And it seems like you're very much in the camp of yes. Yeah. I mean, I think if you think about the capabilities overhang of the internet, we still haven't saturated that.
Things are still coming online. The internet is still eating the world. And I think for AI, we haven't even saturated the capabilities of 3.5 Turbo. I still think there are billion-dollar companies that only need that level of capabilities. And so now with 4.1 and these reasoning models, I think we have...
If we truly stopped right now, I think we'd have 10 years of building at least. Sam's obviously talked about combining the model families into this GPT-5, which will probably end the really fun "point this" and "point that" naming and all that. But what actually needs to be done to combine this into a single model? It goes back to what the models are good for. So right now, the 4o series is really great for chat, and most users in the world use 4o.
They love the way it matches tone and style preferences. It's a great conversationalist. It's good at figuring out deep conversations with people. It's kind of a good sounding board. But O3 has a very different skill set. It can think through problems really hard. You don't really want the model to think for five minutes when you say hi.
And so I think the real challenge facing us on post-training and research more broadly is combining these capabilities.
So training the model to be just a really delightful chitchat partner, but also know when to reason. And this kind of plays into 4.1 a bit. I mentioned that we down-weighted some of the chat data and up-weighted coding to make coding better. So there are some zero-sum decisions in that sense, where you have to figure out what exactly you're tailoring the model for. So that's the real challenge in GPT-5 is like,
How do we strike the right balance? Yeah. I mean, it's so interesting, because I feel like one reason people have been drawn to different models in the past has been intensely personality-based: I like the personality or vibes of this model. And I'm struck that, in some sense, if you combine it all into one model, you get a median personality. Back to the earlier question, I wonder whether longer term folks will want something else, maybe accomplished through prompting or through the model learning about you, where the models themselves have all these personalities within them that can kind of emerge. Any thoughts on that?
Yeah, I think we're already going this way a bit with enhanced memory. My ChatGPT is so different from my mom's or my husband's. So I think we're going in this direction already. It's just becoming so much more useful the more it knows about you. But also, the more it knows about you, the more it can adapt to the things you like.
So I think that's actually going to be a really powerful lever for personality in the future. But we're also going to make it more steerable. So you can already use custom instructions and tell the model, like, hey, I don't like capital letters or please never ask follow-up questions. I don't like that. So I think we're going to lean more into steerability there. I think everyone should be able to kind of tweak the personality that they want. But yeah, I'm curious, like, what's...
What kind of personality are you looking for? I'm kind of still discovering, right? Well, the banter is fun, right? A little like hanging out with your fun, quirky friend who almost takes risks sometimes with the stuff they're saying. I feel like I always enjoy that. I guess I'm also curious to hit on your personal journey at OpenAI. Obviously you've done a ton of different roles within OpenAI, and
The company has had, I mean, probably a million different subchapters of growth and experience in your time there. Maybe just talk a little bit about your personal journey there. And also, how is it kind of, what feels similar and different from the early days you joined to now leading this large team here?
Yeah. So yeah, I've been here for two and a half years and I joined on the API team on the engineering side. Actually, a lot more of my background is engineering. I worked at other companies like Coinbase building their high-frequency, low-latency trading systems. So a lot more focused on back-end distributed systems.
But I did study AI in college and I worked with some professors there on research projects and I actually remember using OpenAI Gym at the time, which was super cool. But yeah, I was here for like a year and a half working on engineering and then it kind of seemed like it made sense to focus more on the model side for the API specifically.
there wasn't really enough of a focus on improving the models for developers. And I kept hearing folks wanted something like structured outputs. So that was kind of the first foray into doing research here. It's like training the models to do that and building the engineering systems. And then after that, I kind of formed this team and then moved over to research. And actually recently rebranded my team a bit. And we focus now on power users. So it's the power users research team. And
The reason for this rebrand is that we don't just focus on the API. Obviously, developers are some of our most discerning power users. They use features that other users don't know about, they know prompting our models the best, and they know the capabilities the best. But there are also power users across ChatGPT: some in Free, plenty in Plus and Pro. I'm kind of insulted I haven't been reached out to as a ChatGPT power user. I thought I might have hit the threshold, but I guess there are probably some people that use a lot more. Yeah.
I mean, yeah, we get a lot of signals from people who are using our models in this way. But also, the reason it's interesting to focus on power users is because the things that the power users are doing today are going to be the things that the median users are doing a year from now.
So we just learned a lot from being on the frontier and figuring out what we can do to make the models better for them. And I guess, like, what's it been like, obviously, you know, over those two years? I feel like the organization has changed a lot, both in size and the scope of things you work on. Like, what kind of feels still the same and what's really different, you know, these days? Yeah, I think the pace of shipping is the same. It's actually remarkable, like, how an organization this large can move so quickly. Yeah.
I think some things are different is you just definitely can't have context on everything going on at the company anymore. It's like...
It used to be more possible to have pretty good state on all of the cool projects going on and read all of their research updates and be intimately familiar, but now you kind of just have to tolerate it. You can't know everything cool going on anymore. Totally. Well, we always like to end our interviews with a quick fire round where we get your take on some overly broad closing questions. And so maybe to start, we'd love your take on just one thing that's overhyped and one thing that's underhyped in the general AI discourse today. So yeah, overhyped, I think...
Like I mentioned, a lot of the agentic benchmarks are saturated, or people release the absolute best numbers they get, but realistic numbers are different. And then underhyped, the corollary of that, is your own evals. Using your real usage data to figure out what's working well: underhyped.
Awesome. What's one thing you've changed your mind on in the AI world in the last year? Yeah, this is back to fine-tuning, but I actually used to be more of a fine-tuning bear because it's kind of like, you know, it's a few months of arbitrage, but is it really worth the time? But I actually do think RFT is worth the time for these specific domains where you need to push the frontier. Yeah. Was there one particular fine-tune that convinced you or was it over time having seen this? I think the cool thing now is that...
Our previous post-training stack, or the 4.1 stack, is a lot more than just SFT. We weren't shipping how we trained our models. But with RFT, it's basically a similar algorithm as our reinforcement learning. That's why I think it's a big shift where you can actually get the capabilities that we can elicit ourselves. Do you think model progress will be more the same or less than last year or this year?
I think it'll be about the same. I don't think we're slowing down. I don't think we're in a fast takeoff at the moment, but it's going to continue to be fast.
And there will be a lot of models. I realize I can't ask you to pick a favorite, but I'm curious: you mentioned this class of harder-to-solve problems, maybe beyond enterprise apps. Any consumer products or things you're most excited about outside of OpenAI, or things you use in your day-to-day life? Yeah, I use a lot of stuff that is AI-based. Recently I've been using Levels, and they
have a pretty cool AI focus there. I think Whoop has some very cool health insights as well. Yeah, I think taking AI out of just the digital world is super cool. Well, this has been a fascinating conversation. I want to make sure to leave the last word to you. Where can folks go to learn more about you, for one? Anything you want to point our listeners to? The floor is yours. Yeah, totally. Thanks.
So yeah, we put out a blog post for 4.1 if you want to read more about it. I'm also on Twitter, and I love hearing feedback from users, like developers and power users. So if something isn't working well in our models and you have a prompt that can show it, please email me. I'm firstname at openai.com, and I love getting the feedback so we can make the models better. We'll have to get you on again to talk about the weirdest email you got from this, some obscure use-case prompt. Yeah, I've already gotten some good ones. Yeah.
Yeah. Well, Michelle, thank you so much. This was a ton of fun. Yeah. Thank you so much for having me.