
Ep 59: OpenAI Product & Eng Leads Nikunj Handa and Steve Coffey on OpenAI’s New Agent Development Tools

2025/3/25

Unsupervised Learning

People
Francesc Campoy
Mark Mandel
Nikunj Handa
Seth Vargo
Steve Coffey
Topics
Nikunj Handa: I think consumer interaction with agents will become seamlessly embedded in the products people already use every day, rather than being confined to specific destinations like ChatGPT. Use of the agent APIs will be highly verticalized; developers will bring their domain expertise to bear and build applications in forms we can't predict today. The way agents get information from the web is also evolving, from simple single-shot searches toward iterative retrieval, chain-of-thought reasoning, and parallel processing, which greatly improves information processing and decision-making. Companies should first build internal multi-agent systems to solve real business problems, and only expose them to the public internet when the time is right. In 2024, agentic applications had relatively simple workflows and a limited number of tools; in 2025, the chain-of-thought pattern is becoming dominant, with models choosing and calling multiple tools on their own and adjusting strategy as they go. The key next step is removing the cap on the number of tools, letting models access and use hundreds of tools so they can reach their full potential. Reinforcement fine-tuning will let developers create custom tasks and graders to train models to solve domain-specific problems; today we provide the basic building blocks for custom graders, and the open problem is making it easier to create high-quality tasks and graders. Computer use combined with text input can handle legacy applications that lack APIs, as well as tasks that need visual information, and platform plays around computer use models, such as the services from Browserbase and Scrapybara, look very promising. Developers' current strategies with the agent APIs include trying to have the model and tools complete the task directly, prompt engineering, and treating the model and tools as one step in a workflow. Splitting a task across multiple agents improves effectiveness and simplifies debugging. Model capability is far ahead of what most AI applications actually exploit, so building the tooling and processes that help models work well is critical. In designing the Responses API we followed the principle of "APIs as ladders": easy-to-use defaults out of the box, with deeper customization available to developers who want it. The Responses API and MCP solve different problems and can complement each other. OpenAI aims to offer a one-stop shop, but standalone AI infrastructure companies still have a place, especially in building highly flexible low-level APIs. Open challenges include building the tools ecosystem, maturing the virtual machine infrastructure for computer use, and simplifying model evaluation. Future model improvements include more reliable tool use, smaller and faster models that are good at tool use, and better code generation. The recent advanced agent demos out of China show that the capability is already in the models; the key is providing easier development tools and workflows so more people can tap into it. Enterprises should explore frontier models and computer use models and try building multi-agent architectures to automate internal workflows, starting with the tasks employees like least in their day-to-day work. Agents are both overhyped and underhyped: we have been through several hype cycles, yet very few companies can actually use agents effectively to solve real problems. Over the past year I gained a new appreciation for the power of combining reasoning models with tool use, which makes truly capable agentic applications possible, and for how much fine-tuning can lift performance on specific tasks. Long term, the core differentiator for application builders is effectively bringing together tools, data, and model calls, with continuous evaluation and improvement. Scientific research is still an underexplored application of these models. Model progress over the coming year will be faster than over the past year. The application I most want to see is one that actually solves travel planning.

Steve Coffey: Use of the agent APIs will be very verticalized; developers know their domains far better than OpenAI does, so the eventual application forms are hard to predict. In 2024 agentic workflows were clearly defined with a limited number of tools; in 2025 the shift is to chain-of-thought, with models choosing and calling multiple tools on their own. The key next step is removing the cap on the number of tools so models can access and use hundreds of them. Reinforcement fine-tuning lets developers create custom tasks and graders to train models on domain-specific problems; today OpenAI provides the basic building blocks for custom graders, and the open problem is making it easier to create high-quality tasks and graders. Computer use combined with text input suits legacy applications without APIs and tasks that need visual information, and platform plays such as Browserbase and Scrapybara look promising. Splitting a task across multiple agents improves effectiveness and simplifies debugging, and model capability is far ahead of what most AI applications exploit, so building the tooling and processes that support the models is critical. The Assistants API did tool use well but fell short on ease of use. The Responses API and MCP solve different problems and can complement each other. OpenAI aims to offer a one-stop shop, but standalone AI infrastructure companies still have value, especially for highly flexible low-level APIs. Open challenges include the tools ecosystem, computer use VM infrastructure, and simpler evaluation. Future model improvements include more reliable tool use, smaller and faster tool-using models, and better code generation. Over the past year I gained a new appreciation for how powerful fine-tuning is for lifting performance on specific tasks. Long term, effectively orchestrating tools, data, and model calls, with continuous evaluation and improvement, is the core differentiator for application builders.

Mark Mandel: The Assistants API did tool use well but fell short on ease of use.

Seth Vargo: The Responses API and MCP solve different problems and can complement each other.

Francesc Campoy: Future model improvements include more reliable tool use; smaller, faster models that are better at tool use; and better code generation.


Transcript


I'm Jacob Efron, and today on Unsupervised Learning, we had a really wide-ranging discussion.

We talked about how developers should think about where these agents do and don't work, as well as computer use models and how those are being used. We talked about how enterprises should be building for this agentic future, as well as what will differentiate application builders who are building on top of these models. And we hit on AI infrastructure, what the needs still are for developers, and where there's still room for startups to compete. This was a ton of fun to do right after a really compelling release from OpenAI. I think folks will really enjoy this. Without further ado, here's our episode.

Well, thank you both so much for coming on the podcast. Really appreciate it. Yeah. Good to be here. Awesome to be here. Yeah. I mean, congratulations. Never a dull moment at OpenAI, but I feel like the last month has been even crazier than usual by your standards, the amount you've shipped. Yeah. Yeah. It's been quite a journey, hasn't it? It's been hectic. Yeah.

I can certainly imagine. Well, I feel like there's a ton of things we'll want to dig into in all the stuff you've released lately. But maybe to start just at the highest level, I'd love to hear kind of your long-term vision for how we as consumers will interact with agents in the next five, 10 years. Yeah, I mean, right now we see it all happening online.

in services like ChatGPT, you've got deep research, you've got operator, people are specifically going to this spot. I think the most exciting thing about releasing models and APIs that are underlying these agentic products

is that we're going to see them in more and more products across the web. So computer use coming to, you know, a browser that you'd like to use, or Operator automating a task that you do day to day at work, doing all the clicking and filling out forms and all the research for you.

I think it's just going to become more and more deeply embedded into products that you use today, day to day. And that's what we're most excited about, at least in the API platform, is to just disperse this thing and have it be everywhere. Yeah, I think one of the cool things about working on the API platform is you actually don't know what people are going to want to build. It's very verticalized, right? So on ChatGPT, first party, we kind of have an idea of what people will want to do. But in the API, it's just like,

People know their domains way better than we ever could. Right. And so it'll be really interesting to see how these model capabilities make their way into verticals. Yeah. Is there a particular agent that you're, like, waiting for? Like, God, I just can't wait until I have... you know, everyone always says the travel agent for some reason. Yeah. I don't know if there's one that's top of mind for you guys.

My top one is an API designing agent. The amount of time Steve and I-- We would go back and forth. Yeah, we're just going through every single parameter name that we can think of. Should it be param config or config param? Yeah.

Yeah, that would be amazing. We could have some deep research-like thing that looks at the best API design things and gets really good. Yeah, we could just be fine-tuning it on all the APIs that we really like. That's actually a really good idea. Yeah, I hope someone takes your API and then gives you a product back that obviously does that. That would be great. I would love that. That's the dream, right? Yeah, exactly. I guess a question a lot of folks are asking is, we're obviously in the very early innings of these agents, and they're kind of accessing the web and communicating with each other in ways that have been built for this previous paradigm.

There's all sorts of futuristic ways that folks think about how these agents may access the web and communicate with each other. There's even that viral YC demo where an agent realizes it's talking to another agent and they switch to something that makes it easier to exchange information. How do you guys think about how this evolves? And obviously, I'm sure the developers will take you in all sorts of directions, but any early inklings of how this might work? Yeah, for sure. I think on agents communicating or getting information from the web, we've already seen a big change. We've gone from this world where an agent would do a single turn,

decide whether it wants to search the web or not, get information from the web and synthesize a response. That was what I think 2024 was about. 2025 is already about products like deep research where the model is getting information from the web, thinking about what it got, reconsidering its stance,

getting something else from the web, opening multiple web pages in parallel to try to save time. And this whole chain of thought, tool calling or calling tools in the reasoning process is a significant shift in terms of how agents access information from the web.

And you can totally imagine some of these web page extraction details being replaced by other agents in the near future. I don't even know if this agent needs to know that it's talking to an AI agent on the other end. It's just like an endpoint that it calls and it's like, oh, it got some very useful information that it uses to make its decision or backtrack or do something completely different.

Yeah, I think it's all going to be pretty seamlessly embedded in this chain of thought process where tool calling is just happening between both the internet and your private data and your private agents. So that's where I see it going like

pretty much in the coming months. - Is this something that you think companies should be, 'cause obviously one version of the world is they can just wait for agents to start accessing their sites. Another version is they should be building actively toward this and create the agent themselves that makes it easier for a consumer agent to hit it. How should folks that are running these, running products at some of these companies be thinking about this? - I think the,

developers are already doing this. We put out the agents SDK for this very reason because people are creating these multi-agent swarms of multiple agents to solve these business problems. If you look at a customer support automation problem, you have one agent that's looking after your refunds, another that's looking after billing and shipping information, something else that makes a decision on pulling the FAQ or escalating to a human. And so

We already see this multi-agent architecture be very popular and we want to make it much easier for developers to build on it and that's why we built the agents SDK. Now, when do you start exposing these agents to the public Internet and how that becomes useful is going to be very interesting. I don't think we've seen too much of that, but it makes so much sense that that will happen at some point.
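To make the hand-off pattern described above concrete, here is a minimal sketch using the openai-agents Python package. It follows the shape of the SDK's published quickstart (Agent, handoffs, Runner), but treat the exact class names, fields, and the toy instructions as illustrative rather than canonical, and check the current docs before relying on them.

```python
# Minimal multi-agent hand-off sketch (assumes `pip install openai-agents`
# and OPENAI_API_KEY set in the environment).
from agents import Agent, Runner

refund_agent = Agent(
    name="Refund agent",
    instructions="Handle refund requests. Ask for the order ID if it is missing.",
)

billing_agent = Agent(
    name="Billing agent",
    instructions="Answer questions about invoices, billing, and shipping status.",
)

triage_agent = Agent(
    name="Triage agent",
    instructions=(
        "Decide whether the user needs refund or billing help and hand off. "
        "Escalate to a human if neither specialist can resolve the issue."
    ),
    handoffs=[refund_agent, billing_agent],
)

result = Runner.run_sync(
    triage_agent,
    "I was charged twice for order 4521, can I get my money back?",
)
print(result.final_output)
```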

And my advice to companies and products would be just build these AI agents internally to solve real problems that your company is facing today. And whenever it becomes apparent that exposing this to the internet for someone else to communicate with you makes sense, that'll just happen. And I don't think we're too far from it, but yeah, I think it'll just happen in the coming months. Yeah, totally. I think what's really interesting too is that like,

Most of the data that a model is seeing is either your own data, chat history, file search. I think what's really interesting, especially with these tools that are much more connected to the web, is that we'll see a lot more data going into the model that's actually from around the web and not just data that you're providing, which is really interesting. As developers are thinking about incorporating and using these APIs, what heuristics do you guys use for where agents do and don't work today? How would you advise folks?

Let's take a little bit of a step back. In 2024, what most agentic products looked like was a very clearly defined workflow with fewer than 10 tools.

It's about a dozen tools at most, and this very well orchestrated go from here to there to there to there. And that's how a lot of companies built a bunch of really cool coding agents, built a bunch of really cool customer support automation projects, deep research projects, et cetera. In 2025, we've gone to this model where everything is happening in this chain of thought. The model in its reasoning process is clearly smart enough to figure out

how it should call multiple tools and then also figure out that it's going down the wrong path, take a U-turn and then try something else. I think you've gone away from the whole deterministic workflow building process. OpenAI has been working on tools like reinforcement fine-tuning, etc. to make this something that developers can use themselves.

I think the next step after this is going to be, how can you get rid of that 10-15 tool constraint that you have? How could you just expose this thing to hundreds of tools, have it figure out which is the right one to call, and then make use of

those tools. I think that's really the next unlock. And then this thing becomes like, it has all the superpowers it needs. It has the compute. It has the way of reasoning about different tool trajectories. And it has access to a lot of tools. So that's what I'm really excited about in the coming months: removing the number-of-tools constraint. Mm-hmm.
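As a rough illustration of the "expose many tools and let the model choose" idea, here is a sketch of passing several function tools to the Responses API in one request. The tool names here (look_up_order, check_inventory) are hypothetical, and the request shape should be verified against the current API reference.

```python
from openai import OpenAI

client = OpenAI()

# Hypothetical function tools; in principle this list could grow to dozens or
# hundreds of entries, and the model decides which, if any, to call.
tools = [
    {
        "type": "function",
        "name": "look_up_order",
        "description": "Fetch the status of a customer order by ID.",
        "parameters": {
            "type": "object",
            "properties": {"order_id": {"type": "string"}},
            "required": ["order_id"],
        },
    },
    {
        "type": "function",
        "name": "check_inventory",
        "description": "Return current stock for a product SKU.",
        "parameters": {
            "type": "object",
            "properties": {"sku": {"type": "string"}},
            "required": ["sku"],
        },
    },
]

response = client.responses.create(
    model="gpt-4o",
    input="Is order 88-1204 delayed because the item is out of stock?",
    tools=tools,
)

# The output may include one or more function calls; a real loop would execute
# them and send the results back in a follow-up request.
for item in response.output:
    print(item.type)
```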

But yeah, it's kind of hard to make that work with today's models, but I think that's going to change. Yeah. Yeah. I think also just increasing the available runtime that these models have to go off and do what they need to do. I mean, if you're a human, you can go off and work on something for a day and use as many tools as you need to get the job done. And I think now we've seen runtimes for models, especially deep research, that are in the minutes.

But being able to get these things to go into the hours and into the days is going to yield some really powerful results. Last year, you had to put such specific guardrails and chain things so closely together because you couldn't let things go off the rails. It seems like now you're even more flexible in what you can allow. And then obviously the dream is just like, yeah, go off here. It's like the hundreds of tools that you could use across every task. Go figure it out. Yeah, exactly. Totally. I think it's not a... Let's see how the next generation of models...

generalizes to all of the use cases that developers are going to have. There's also this reinforcement fine-tuning technique where you're creating these tasks and graders. If developers can create their own tasks and graders and get the model to find the right path, the right tool-calling path, to solving a particular problem that's very unique to that developer's domain,

That would be amazing. So I'm really excited about the next series of models that are going to come out and our early results from reinforcement fine-tuning. All that comes together to make agents that are actually very useful and reliable. The really cool thing about that is you're really steering the model in its chain of thought, and you're kind of teaching it how to think about your domain, which is just a really powerful...

kind of mental model when you think about it. You're like, okay, how do I think about, like, how do you basically, like, train a model to be, like, a legal scholar, essentially, or train it to be, like, a medical doctor or anything like this? Really, like, training the way that it thinks in the same way that, you know, four years of university would train you to think in a specific way. So, like, I think the reinforcement fine-tuning thing is a great example of, like, where you're going to see, like, really interesting verticalization for these models. Yeah. And for that, I mean, how have you thought about, I feel like one of the classic problems or people talk about there is, you know, you can provide...

I'm sure folks want something off the shelf that makes it easy to do the grading and evaluation. And at the same time, some of these domains are so hyper-specific in their own problems. How have you thought about the infrastructure level, the right level of tooling to provide to folks that are doing that fine-tuning in a domain like legal or healthcare? I'd say it's still a work in progress. We're like

I think right now what we're exposing is basically giving developers a way to build their own graders. So for example, if you have an eval that you show does 50% on a medical task, right, you can build these graders. Let's say one that can cross-reference a model's chain of thought, or something else that it's outputting, against

some sort of known ground truth, like a medical textbook or something like this. Right. And so over the course of fine-tuning, you can sort of steer the model in that direction, to be able to produce better and better outputs, and just, yeah, just be able to steer it in that way. And so we're kind of providing the basic building blocks, really mostly just these really flexible graders that allow you to

take a model output and then grade it against some sort of ground truth or execute some sort of code to prove like, oh yeah, this is mathematically correct. We're not just checking that this string equals this string, right? There's actually some mathematical correctness to it. Yeah. I mean, it feels like the biggest question across the board in so many aspects of AI right now is what actually can be graded. I mean, I feel like it's the big question in test time compute and what you can scale. And obviously, I think if you take health care and law, for example,

you know, one critique of some of these evals is like, well, cool, like, you know, being a lawyer is not passing the bar. Like, being a doctor is not, like, passing these medical exams. Anything, like, you've seen folks on the ground doing that you feel like is a creative way to actually, like, best use this type of approach? Honestly, like, after having talked to the folks who are, who've built things around operator and deep research internally, like, it's,

it's pretty challenging right now to do this stuff and takes a lot of iteration. I don't think I've seen anything out there that's like productized

grading and task generation in a way that just nails it for your domain. I think this is the biggest problem to be solved this year, and if not, it might even go into next year. The technique is going to come out, but how are you going to actually build really good tasks and graders is something that's going to be pretty challenging.

Yeah, I know it's possible now. These products exist. So you know that it's possible to build something like deep research. There's been some replications of that around the internet as well. So you have enough proof over here. It's just about how do you productize it so that almost anyone can make use of it. That's going to be hard.
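To make the grading idea concrete, here is a toy, self-contained grader of the kind described above: it scores a model's answer against a known ground truth instead of doing an exact string match. It is only an illustration of the pattern, not OpenAI's reinforcement fine-tuning grader API.

```python
def grade_numeric_answer(model_output: str, ground_truth: float, tolerance: float = 1e-6) -> float:
    """Toy grader: pull the last number out of the model's answer and score it
    against a known ground-truth value. Returns a reward in [0, 1]."""
    numbers = []
    for token in model_output.replace(",", " ").split():
        try:
            numbers.append(float(token.strip(".%$")))
        except ValueError:
            continue
    if not numbers:
        return 0.0  # no numeric answer at all
    error = abs(numbers[-1] - ground_truth)
    if error <= tolerance:
        return 1.0  # full credit for a numerically exact match
    # Partial credit that decays with relative error.
    return max(0.0, 1.0 - error / (abs(ground_truth) + 1e-9))

# Example: this kind of grader could sit behind a fine-tuning run that scores
# each sampled answer to a dosage-calculation task.
print(grade_numeric_answer("The correct dose is 12.5 mg per day.", 12.5))
```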

What about computer use? How do you classify it for developers today? How should they think about using it, where it works, where it doesn't? Computer use has had a surprising number of cool use cases. Initially, we thought that a lot of computer use cases would be around legacy applications that don't have APIs,

and people have been trying to automate this thing for ages and they haven't been able to. That's definitely the case. We definitely have had a couple of customers try it out in the medical domain where you have these super manual tasks that people are just clicking through across three or four different applications to do things.

And that works really well. But we've also seen examples of companies that are using it to do, like, research on Google Maps. So I think Unify GTM is one of the companies that used it earlier, during our alpha phase.

They would have climate tech startups ask questions like, "Has this company expanded its charging network?" What the agent would do is open up Google Maps, turn on Street View, and go to places and see whether there are more chargers or not. That's really cool. I didn't know that.

And I'm like, okay, Google Maps does have an API. I actually don't know if Street View has an API, but it's probably really hard to figure out which exact location and which direction to look at, maybe. And so all of these...

You can pretty much automate anything. It's kind of cool, right? So you could start there, and then you could maybe think about an API approach after that. Totally. I mean, there are many, many domains that just don't map to JSON, right? Like, you can't serve them over the web in plain text. So these kinds of use cases where you need some sort of combination between

vision and text ingestion, I think, are really, really well suited for CUA. Yeah, that's a really interesting example. I didn't know that. Yeah, the Unify use case is fascinating. That's really cool. I was struck by, obviously, you had a bunch of alpha testers and whatnot, and so you released this, and then the next day I feel like every big company was like, this is an awesome thing we built with this API. Any particular favorites, even just in the week or so since it's been out, that you didn't expect, or kind of cool ways people have been using these? Oh, that's a good one. Post-alpha, let's think.

Well, the computer use ones are the coolest. I think you have... I really am excited about the platform players on computer use as well. Like if you think about...

the other tools that we have. So we have web search, we have file search, and we have computer use. Web search, you have a bunch of companies that provide APIs for people to be able to get data from the web, put it into the model's context. File search is pretty mature. Honestly, you have the vector database industry. And computer use, I think things are super early. The main thing people want to do or businesses want to do is take these

Docker containers or these VMs in the cloud and then put their software in it, put their authentication into it, so that they can go and automate things. And there are a couple of really cool ones. There's Browserbase that provides this service. There's a YC startup called Scrapybara that has, I think, one of the better developer experiences around making computer use models work really well with hosted virtual machines.
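For orientation, here is a rough sketch of one step of a computer-use request against a hosted browser session (the kind of VM a provider like Browserbase or Scrapybara hosts). The tool type and model name follow OpenAI's computer use guide as of this episode, but treat the field names as assumptions to verify against the current docs.

```python
from openai import OpenAI

client = OpenAI()

# One step of a computer-use loop against a hosted browser or VM session.
response = client.responses.create(
    model="computer-use-preview",
    tools=[{
        "type": "computer_use_preview",
        "display_width": 1280,
        "display_height": 800,
        "environment": "browser",
    }],
    input="Open Street View near the given address and count the visible EV chargers.",
    truncation="auto",
)

# The output contains computer_call items (click, type, scroll, screenshot, ...).
# A real agent loop executes each action in the VM, captures a fresh screenshot,
# and sends it back as a computer_call_output item until the task completes.
for item in response.output:
    print(item.type)
```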

And I'm a developer platform person, so for me, looking at those platform plays and asking, all right, what's the thing that people are going to build on top of that, is very exciting. And so, yeah, I'd say those are my top two, Browserbase and Scrapybara. I'm pretty excited to see what they do. Yeah, I thought Arc was doing some pretty cool stuff too. Like they were basically building

a tool where basically you just open a tab and give it an instruction, and then it kind of goes and does something in the background. I think it's very much an Operator-like use case, but it's really baked into the product. I mean, it's just a web browser you're using, right? It's not

necessarily baked into a tab in your web browser. It's really just part of the browser itself. But that sort of native integration was really cool. Yeah. I think they're calling it Dia or something. That was super cool. That's awesome. Is there anything you've noticed so far that maybe some of the most sophisticated users are doing with the APIs that you're like, God, I wish we could disseminate this more broadly? If only we were on a podcast and we could tell the world this is a good way to use some of these things. Anything you've noticed, patterns that some of the most sophisticated folks are using?

For the tools, it still feels pretty early. I think during the alpha phase, we definitely found folks who were... They try to get the model and the tool to do the thing that they're trying to get it to do. And if that doesn't work, they try a bunch of prompt engineering. And then if that doesn't work, they make this a step in the workflow. And I think...

By going through those steps, they typically get what they want. It's like, hey, web search, the tool is not giving me exactly what I need, but can I make it part of my workflow where this is just one of the steps that gets information from the web and then I pass it on to something else, either deterministic or another LLM step.

On net, I'd say it's pretty early right now. And we're going to discover a lot of this in the coming weeks. Yeah. I think, to invert the question a little bit, one thing that I'm really glad we were able to ship is, in the Agents SDK, this idea that we're going to split the concerns of what your job is, or what your task is, across many different agents. It's very much analogous to the single-processor computer versus the multiprocessor computer, right? You just allow each agent to focus on one task and you give it all the context, and then

your efficacy on those tasks goes way up, right? Because you're not trying to prompt engineer one agent to do a hundred different things, right? You're kind of just like spreading that across. So I was really glad to see us sort of like, I'm not sure if we invented that paradigm or not. I'm assuming we didn't, but like just to like ship that as a really first class pattern. I was really, I thought that was really cool. Yeah. No, it's so interesting because I feel like you alluded to the fact that like, hey, if it's not working, you can kind of just like add it as a step. And I feel like one interesting, you know, quandary that we have on the investing side is like, it feels like,

you know, a lot of people, whatever the current capabilities of the model are, they kind of build whatever scaffolding they need to make them work. And sometimes you're like, well, that gets you the product in the market now and gives you a product that is valuable. At the same time, if you went to a beach and waited three, six months for the models to get better, they may just be able to do it, right? With your, you know, 100 tools to one thing versus like chaining the steps together. And so, you know, I'm curious like how you think about like, you know, the kind of steps that people are building around the models. Like, does that all get obviated over time or is like some of that useful? Yeah.

I think that agent and tool orchestration is the most important thing right now because

my opinion is that the models are much further along than most AI applications are making use of. There's so much value to be extracted from these models that building things around models to make them work really well is an extremely important thing that AI startups should be doing and AI products should be doing. It's like the...

And you see it time and time again. Even in customer support automation, which has been around as a concept for a while, we had a couple of companies really crack it in late 2023 and early 2024. And the adoption has been kind of slow. You don't see that many companies move as fast as the first 10, 15, 20 companies moved.

it just shows how important it is to be good at orchestrating, to be meticulous about looking at your traces, figuring out how to prompt engineer, having an eval set so that your prompt doesn't degrade something else. This is so hard today. It's crazy how hard it is. And so I would tell people that's the exact thing to be focusing on is how to make these models work really well. Yeah, 100%. And I think, too, like,

you know, just the idea of splitting up your task among many different agents is like, just makes debugging the whole workflow way easier, right? Because if you have a really capable model and it has 100 instructions and you change a few tokens, right, it might drastically change the outcome of your eval, right? But if you just have one

you know, handoff agent, you have one triage agent, you have one this, so tweaking each one of those becomes a lot more isolated, where the blast radius is much smaller as you're sort of hill climbing on your eval. I think when you were on Latent Space, you mentioned that you, you know, over time want to add more knobs to make things more customizable for developers. What do you think that looks like over time? And how did you think about this kind of tension of providing something that's relatively easy to use out of the box versus the ultimate amount of customizability?

Yeah, totally. I mean, this sort of idea of APIs as ladders is really something that we took from first principles when we were designing the Responses API. And I think it really comes down to, you know,

a couple of things, right? Like you want to give a lot of power out of the box. You want to make doing the simple thing really easy. And then you want people to be able to get a little bit more reward for every amount of effort that they put in. And so for us, a great example of this, I think, actually is file search, where it's really easy to use just out of the box. You upload some documents. You don't even have to do it in the API. You can just do it on the website. You pop in your Vector Store ID.

And it just works. And now let's say, OK, well, this actually isn't quite working for my use case. Well, OK, now I actually have knobs to go in and tweak the chunk size. The default is 400. Maybe I want it to be 200. Maybe I want it to be 1,000. So I have those knobs. They have sensible defaults.
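As a sketch of those first two rungs of the ladder: the first call below is the minimal happy path, and the second opts into one extra knob, a vector store for file search. The model name and VECTOR_STORE_ID are placeholders, and the exact tool fields are assumptions to check against the API reference; deeper knobs like chunk size are configured when files are attached to the vector store rather than on this call.

```python
from openai import OpenAI

client = OpenAI()

# Rung one: the minimal call. One endpoint, one required concept.
simple = client.responses.create(
    model="gpt-4o",
    input="Summarize our refund policy in two sentences.",
)
print(simple.output_text)

# Rung two: opt into file search over a vector store created earlier
# (for example, by uploading documents in the dashboard).
grounded = client.responses.create(
    model="gpt-4o",
    input="Summarize our refund policy in two sentences.",
    tools=[{
        "type": "file_search",
        "vector_store_ids": ["VECTOR_STORE_ID"],  # placeholder ID
    }],
)
print(grounded.output_text)
```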

And so I can go in a little bit deeper and get a little bit more reward for everything I'm putting in. And, you know, it goes way deeper than that in the file search example, right? You have metadata filtering, you have the ability to customize the re-ranker, right? But this stuff isn't, we don't force you to set all those things right up front, right? We kind of like give those things to you and expose them. They're in the docs, you can find them. But if you're just kicking the tires of the API, you don't want to think about

You're like, what the heck's a re-ranker, right? So yeah, that's kind of how we think about it: make it as simple as possible. I think we actually spent a pretty long time trying to get the quick start for calling the API down to four lines of curl. And we were really obsessing over that. It should be this simple. But then there are also 50 more params that you can set if you want to, and they'll have reasonable defaults. Over time, what other knobs might you want to add? Hmm.

That's a good question. Oh, yeah. I mean, for tools, like for web search, you want to basically add site filtering. That's been a big ask right now. Right now you just have to search the whole internet, or you can prompt your way into it. Specific location, too, on web search. Now you can set the city, you can set the country, but actually getting down to the block, or even the exact coordinates.

Which is super important for weather- and events-type queries. Especially in SF microclimates, right? Yeah, seriously. Actually, one of the things we're really excited about doing with the Responses API is building all the features into it that we had in the Assistants API, but not forcing users into them. So we released the Assistants API, I think, in November 2023. It had this full concept of storing your conversations, storing your

model configurations in an assistant object, et cetera. And we found that the hill to climb to get started was a lot. With responses, we're taking the other approach where you're starting off with a single API call and a single endpoint and one concept you have to learn.

And then maybe you want to store your conversation with us, so you can opt into using the equivalent of the threads object. And maybe you want to store your model configuration with us, so you opt into an assistant-type object. And those things you just plug in. It's just one parameter you configure. And that's a knob you can turn to have OpenAI host the thing for you.
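Two of the knobs discussed here, sketched in a single flow: an approximate user location for the web search tool, and opting into OpenAI-hosted conversation state via store and previous_response_id. The field names reflect the docs at the time of this episode and are worth double-checking.

```python
from openai import OpenAI

client = OpenAI()

first = client.responses.create(
    model="gpt-4o",
    input="What outdoor events are happening near me this weekend?",
    tools=[{
        "type": "web_search_preview",
        "user_location": {
            "type": "approximate",
            "city": "San Francisco",
            "region": "California",
            "country": "US",
        },
    }],
    store=True,  # opt in: OpenAI keeps the conversation state for you
)

# Continue the same conversation without resending the history yourself.
follow_up = client.responses.create(
    model="gpt-4o",
    input="Which of those are free?",
    previous_response_id=first.id,
    store=True,
)
print(follow_up.output_text)
```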

So yeah, I think that's another set of knobs that, in the short term, we really want to get to. Yeah, exactly. Reflecting back on some of the previous APIs you've released, obviously these ones are meant largely to supplant those. Any learnings, or things where you were like, hey, we got that really right, or actually we kind of missed the mark on that and we've fixed it in this current iteration? Yeah.

MARK MANDEL: Totally. I mean, I think the thing that we really got right with the Assistants API especially is tool use. That's where we really figured out-- we saw a ton of usage, especially with the File Search tool. That's where the API really found market fit, right? It was people wanting to bring their own data to the API and have the models search over it.

But what we got wrong is a lot of the things that Nikunj said, really. It was just too hard to use. No way to opt out of the context storage. A lot of people didn't like the context storage. They wanted more of a chat completions interface, where they were able to provide their own context on each turn of the model.

But also the chat completions interface is quite limiting, right? The API can only output one thing, and the model does many things, right? And so you want it to be able to do a bunch of stuff in the background and then kind of give you the results of all of its thinking and all it's doing. And so, you know, we really tried to take the best parts of the Assistants API, sort of the tool use and the multiple outputs and all of that stuff, and the ease of use of chat completions, and bring those things together.

MARK MANDEL: Makes a ton of sense. How should developers think about this kind of suite of developer tools now and the MCP landscape? SETH VARGO: Yeah, I think they're probably solving different problems.

The responses API is focused on making these multi-turn interactions with models really good. We're providing a foundation for the model to be able to call itself multiple times, so have multiple model turns, and call tools multiple times, so have multiple tool turns to get to a final answer. So that's like we

we've set the building block, which is the Responses API. MCP is sort of how you use tools and bring tools to models. And I think these things are honestly pretty complementary in some sense. And we have to figure out what we do on the tools registry and the tools ecosystem side. But MCP is super cool. And that's

something we have to figure out in terms of how we bring that to our ecosystem as well. One thing I'm struck by is, obviously, I feel like in the first years post-ChatGPT, there was a lot of AI infrastructure companies that popped up that were trying to do aspects of what you've released now, like agent orchestration and vector databases. How do you think about the opportunity for standalone AI infrastructure companies and where it makes sense for those to exist now?

on top of what you guys are building and where it might not make as much sense. Yeah, I think on our side, we're working with our users and listening to what their asks are. And they want a one-stop shop for the things that they want the LLMs to do. They want it to be able to search their data and search the internet. And so we've taken a step in that direction. That being said, I feel like the AI infra companies are building...

low-level, very powerful APIs that are infinitely flexible. There's always going to be a big market for that kind of stuff. I think we just got to build the thing that our users are asking for, which are these more out-of-the-box tools.

And we're taking a different approach to this whole space. But there'll be vertical, specific AI infrastructure companies. I think there's certain companies that build VMs just for the coding AI startups out there so that they can...

test their code and spin down the VM as quickly as possible. I think they're called RunLoop or something. I've heard of them. So there's going to be verticalized AI infra, which seems like it makes a lot of sense to keep doing that. Totally, yeah. It's stuff that we're not always going to want to be in the business of doing, right? I think, too, there's a whole class of LLM ops companies that are doing some really interesting things, like helping you manage your prompts and helping you manage your billing and understanding where your usage is going. And I think that that sort of stuff is like...

It's not necessarily like low-level infrastructure, but it's still stuff that developers care about. Yeah, in a multi-model fashion, multi-provider and all of that. Exactly, yeah, like OpenRouter, things like that. Yeah. Yeah. I mean, and obviously, you know, it sounds like you guys spend probably most of your days talking to developers and getting their wishlist. I'm sure you got a lot of it into this current generation of APIs, but I'm sure there's always more to do. How do you think about, you were kind of talking about evals earlier as one problem, but how do you think about the stack rank of problems that are still left unanswered

that make working with these models painful today for developers and what some of the most important things to be solved are. Yeah, I think tools is definitely a very big thing for us to figure out. We have the foundational building block. We need to build the tools ecosystem on top of it. There's obviously great work on the MCP side over here, and that is top of mind for us to figure out what we do on that front. We also have...

the computer use VM space is pretty early and I think that's another big one. How do you get enterprises to securely and reliably deploy these virtual machines in their own infrastructure and observe them and all the things that the computer use models are doing on top of that? I feel like

These models, these computer use models are going to get so good so quickly because we're just at like the GPT-1 or 2 of that paradigm. And this thing is going to be incredibly useful. So I'm like very curious to see how the infra on that front takes off. I mean, I think that one of the things that was really interesting to me during the alpha period was like all the different environments that people wanted to try out the computer use tool in. Like we saw folks...

The model works best in a browser environment, right? It's kind of like what it was trained on, but people were trying to use it with iPhone screenshots and Android. And I was like, wow, that's so interesting. I hadn't even thought about doing that. And so I think that the sky is going to be the limit on what people...

people will want. Like, is there going to be a company that just does, sort of, iPhone VMs? You know, there was a company that used to do just testing frameworks for iOS, things like that, but now it's for AI models. Stuff like that is really interesting.

Because different flavors of Ubuntu, all of that stuff, it's really just a huge amount of fragmentation. And so it's going to be really interesting to see how the community steps up to fill the gaps there. Yeah. Yeah. We're also seeing people doing-- I think there's a startup trying to do cybersecurity work, so trying to find vulnerabilities in other sites and services using computer use. You have it poke around for 30 minutes. Yeah. Which is super interesting. Yeah.

MARK MANDEL: That's really interesting. I mean, I guess obviously one of the fun parts of your job must be you're obviously probably really tightly integrated with the research team, see the models as they come through. Like, any things that you're looking out for on the model side? Like, I'm sure you've got the next computer use model or the next models that are used for agents. Like, any milestones or capabilities that you're like, god, when we can do x-- like, any time I get the new model, I try x. And if we could do that, like, that would be so game changing for our developers. FRANCESC CAMPOY: Yeah, that's an interesting one. I actually have a bunch of prompts

that I've gotten from a bunch of YC startups. And they're always like, this thing never works. And I actually have them saved as what we call presets, or prompts, in the OpenAI dashboard. And each time something new comes out, I try three or four of them. They're all pretty much focused on agentic tool use. And there are six or seven different

tools that are pretty straightforward. And I'm just looking for these reliable executions of them from turn to turn. And I'm pretty optimistic with our next series of models, but there are certain ones that it just doesn't get right. Yeah.

I'm also really keen on finding much smaller and much faster models, faster than 4o mini for sure, that are pretty good at these tool use things. If you think about the workhorse models, or the supporting models that sit around the o1s of the world, that can do these really quick classifications and guardrailing and all, I think there's a lot of room for improvement. Yeah.

on those types of things. And yeah, just the fastest, smallest classifier would be really, really cool to work on. Totally, especially because they're so fine-tunable. Yes. Right? And you can just really tailor those things to your heart's content on a specific use case. So yeah, that would be really cool. I'd have a fleet of those. For me, it's diffs. I just want the model to be able to spit out a diff that I can apply cleanly to my code, and it'll just work, and I don't have to fight with it to get it to-- that's going to be huge. That's going to be really, really huge.

Models don't really understand line numbers that well. What was your reaction? Obviously, there was some really impressive agent work out of China recently. And I think it kind of always seemed that the most cutting-edge agents would go alongside the most cutting-edge models. And I think they're using Anthropic models and whatnot. But I feel like it might have challenged that paradigm a little bit. And so I'm curious kind of your reaction to some of those demos. Yeah.

My reaction was like, this is what we've been saying internally, which is that the capabilities are there in the models, but so few people are able to make use of them. I think it's crazy that it's still like this. We need to make it easier for developers and everyone to be able to build more powerful things with the models without being exceptional AI and ML people. And so

I just feel like it validates the fact that if you give people the right tools, give people the right models, help them put them together with things like the Agents SDK, and make these things observable, then more and more people can build things like what we saw come out of China. Yeah, that's my take on it. I think just making the flywheel spin way faster, from evals to production to fine-tuning and back again, that is such a powerful loop that we just need to make way faster.

Or simpler. Yeah. What do you think are the key things to make that simpler? That's the biggest thing to figure out, honestly. We've got to-- if only we had a good answer. I mean, the research team does it at OpenAI all the time. The model is getting better at chat. It's getting better at doing all the deep research things. The next Operator model is going to be so much more powerful at doing computer use things.

How do you productize that is the thing that we need to figure out. Obviously, with a lot of toil and really closely observing your traces and creating the right evals and graders, it works for sure. We just have to productize this and we need to figure out how to make this easy.

It needs to be about 10 times easier than it is today. It's definitely doable. You can create an eval, but it's a lot of work to create an eval. And so I think that's the biggest thing for me is just like, how do we make that process of evaling your task, your workflow a lot easier? No, I mean, it's funny. I am struck by it. It feels like we have a new model and people spend like six, nine months trying to discover the use cases. They probably discover 1% of what these models can actually do. And then it's like on to the next one.

And so it's pretty wild. I mean, obviously, you know, I think we all kind of feel like we're on the precipice of this like super large change. And, you know, it feels like, you know, we're going, you know, especially as you make these tools easier, you know, agents are going to be increasingly ubiquitous. If I'm just like a normal, you know, enterprise or consumer CEO today, and I haven't really thought about this so much, like what would you be doing in those people's shoes? If you're running a company that like, you know, probably in this agentic future has some way of interacting with these models?

I'd start exploring these frontier models, start exploring the computer use models, take a couple of workflows internally, and try to get a feel for building these multi-agent architectures to automate things end-to-end.

I feel like that's the most actionable and practical thing that you can do right now. On the tool side, figure out which of your manual workflows need

a tool interface and start doing that. I feel like the whole digital transformation and automation thing that had its thing during the cloud days is coming back right now. And so sometimes I talk to users who are like, we want to automate this whole thing. But 90% of the work to be done is to like

figure out how to get programmatic access to certain tools that you're using. And the LLM portion is just this tiny thing in the middle. And I'm like,

This is a very different problem for us. And yeah, you can solve it with computer use right now and try to get it to production. But really just finding ways to automate your applications, trying out the frontier models is probably the main thing I'd recommend. Yeah. I think it's really interesting being a developer in this era because for a long time, we have

as developers, been constantly automating away the bottom 20% of our job, whether it's through better frameworks or better programming languages or what have you. And so I think that for me, if I were running a company, I would just be asking my employees, what's your least favorite thing that you do on a day-to-day basis? And let's try to figure out ways to automate that. That's going to make everybody happy. It's going to increase productivity, of course. And so, yeah, that's how I would think about it. Have you guys done that?

Not yet, no. I love that. I mean, look, a fascinating conversation. We always like to end with kind of a quick-fire round where we stuff a bunch of overly broad questions into the last five minutes. And so maybe to start, I'd love your take on one thing that's overhyped and one thing that's underhyped in the AI world today. Yeah.

My answer is, agents are both overhyped and underhyped. We've been talking about agents for a couple of years. We've gone through two full hype cycles. Yeah, I know. At the same time, underhyped because, hey, the companies that actually figure it out and build deep research-like things or fully automate some really manual task are able to just do so much. So, yeah, that's my take on it. Yeah.

I mean, obviously you guys are so close to the cutting edge here. I'm curious, like what's one thing you've changed your mind on in the AI world in the last year?

I think for me, it's definitely the power of these reasoning models has been... We were always aware of this reasoning thing coming. And I did not appreciate how that combined with tool use is going to create things like operator and deep research. And just seeing that it's possible to move away from this workflow...

that every company was doing, to this completely agentic product that figures out what to use in its chain of thought and actually delivers really, really powerful results. That's been the biggest shift for me. And then seeing early results of our reinforcement fine-tuning alpha, those are, you know,

That's been the biggest shift for me in terms of how it's possible to do this. Yeah, for me, it's just fine-tuning broadly. I just like the power of being able to-- I thought that all the knowledge that you could put in a model is baked in when it comes off of the GPUs. But being able to really add a bunch of your own custom information and seeing how much that moves the needle for a specific task is pretty impressive.

What do you think will be the biggest differentiator for application builders long-term? It's the question in venture. Is it deep knowledge of the models and how to really build these agents? Is it just knowing a domain super well so you know what to build? What do you guys think of that? I think it's kind of a combination. And then there's this idea of, like,

If you have whatever special sauce it takes to be able to really bring the AGI out of the models that we think is in there, I don't know what that is. If it's prompt engineering or workflow orchestration or something else, I think that is going to be a huge differentiator. For me, it's being really good at orchestrating. I feel like that's going to be the biggest. What do you mean exactly by that? Bringing together your tools and data with

a bunch of model calls with a bunch of models, either in the fashion of reinforcement fine-tuning and calling these tools in the chain of thought, or in terms of chaining together multiple LLMs, and being really good at doing that quickly, evaluating it and improving it. I think that's the biggest

skill that would move people forward in the next year or two. Awesome. What do you think are some of the most underexplored applications of these models today? I haven't seen anything crazy on the scientific research side. When the o-series models started, the main hope and expectation was that there'd be a step change in how quickly scientific research happens. I think we've seen some early reports around that, but

very curious to see how that changes. I think that so much criticism about the AI industry as a whole has been that the interfaces are not quite right yet. And I think especially for...

A space like academia, where everything is kind of the same way it's been for a long time, I think finding the right interface for that is going to be really key and will really drive a lot of adoption there. Robotics too, maybe. It's probably time for something big to happen. The origins of OpenAI. Good old Rubik's Cube. Do you think model progress will be more, less, or the same this year as last year?

Oh, it's going to be more. I think it's got to be more. Yeah. Especially as like, I mean, it's a feedback loop, right? Especially as like we, the models are kind of teaching us how to, how to like improve them with better data and things like that. It's like something we do a lot on the research side. Which AI startup or like categories are you most excited about right now? Like outside of OpenAI?

I came from a travel background. I was doing a travel company right before I joined OpenAI. So I'm just really excited to see somebody really crack that. I think the travel industry is super entrenched and there's only a handful of big players. And so I'm really excited to see who builds the actual AI travel agent. Everyone's favorite demo for agents. Exactly, yeah. But there's not a product that people are using there. So I'm really excited. Yeah, why doesn't it work yet? I don't know. I'm going to go figure that out right after this. Yeah.

I use Granola a lot. Have you heard of that one? Yeah, of course. Yeah. That's my favorite AI tool these days. I'm in a very meeting-heavy role, so it helps a lot in every meeting. Yeah. Yeah. No, great product. Well, look, I think there's a ton of interesting threads for folks to pull on. Obviously, a ton of great stuff that you guys recently shipped. I want to leave the last word to you. Where can our listeners go to learn more about the APIs, about really any place you want to point them? The floor is yours.

Yep. Our docs, platform.openai.com slash docs. And also the OpenAI Devs channel on Twitter, or the account on Twitter. And the community forum is always a great place to check out. I'm blanking on the domain of that one. Is it forum.openai? Community.openai.com. Just Google OpenAI Community Forum. You'll probably find it. Or ask ChatGPT for it. Or ask ChatGPT. Awesome. Well, thank you both so much. This has been a ton of fun. Awesome. Thanks so much.
