Hi, listeners, and welcome back to No Priors. Today, I'm joined by Isa Fulford, one of the pioneering minds behind OpenAI's deep research. This is a new agentic product that OpenAI released in February of this year, which uses reasoning and tools like web browsing to complete multi-step research tasks for you. Today, they're making it free to all US users. Welcome, Isa. Isa, thank you for doing this. Thank you so much for having me. You and your team have shipped, like,
one of the most exciting AI products of late. I use it a lot, deep research. Where did the idea come from? Tell me the origin story. Yeah, so around a year ago now, we were very excited about the progress internally on this new reinforcement learning algorithm. We were seeing a lot of progress on math problems and science problems and coding problems.
And at the same time, I was working with my friend Yash, who works at OpenAI, on a few side projects.
And we were very interested in agents generally, and kind of wondered if we could apply the same algorithm to tasks that are maybe more in line with what the average user would do every day. And so the first two things we were thinking about were online browsing tasks, because I think in a lot of different professions, people do just have to do a lot of research, synthesize a lot of information, and then come back with a report.
And then we were also thinking about software engineering. We have kind of been working on both of those things; I've been focusing on browsing. So to start: with the math and coding problems that people were already training on, those datasets already exist. You know, you can have a math problem with a ground-truth answer, and you can train on those. But for browsing, it's kind of more open-ended. You don't really have
datasets like that that exist. So we really started by grounding the research in what product use cases we actually wanted the final model to be good at. So we literally would write out just a list of things like: I hope the model could find this list of products for me
and rank them by these reviews from Reddit or something like that. Or I want it to be able to write a literature review on this topic. I feel like a lot of people, when they think about browsing and agents, they land on the same two, three transactional use cases that I actually don't think are particularly inspiring, right? So it tends to be like...
order a burger on DoorDash or something like that. Or I feel like ordering flowers is also like a really common one. Why do you think you came up with like such a different set of goals for the agent? Yeah, so I think...
Before we focused on taking write actions, which those are examples of, we wanted to get really good at synthesizing information from a large number of sources, in mostly read-only tasks. That was for a number of reasons. Firstly, just a huge number of knowledge-work professions mostly do that, so it would be quite useful for those groups of people. Secondly, I think the overall goal for OpenAI is to create an AGI that can make new scientific discoveries and research.
And we kind of felt that a prerequisite to that is to be able to synthesize information. You know, if you can't write a literature review, you're not going to be able to write a new scientific paper. So it felt very in line with the company's broader goals. It's also very meta, because you have, you know, helped make an AI that makes me better at learning. And it's learning. Yeah. I hadn't thought of that. I love that. More practically, with read-only tasks, the safety question is a bit more constrained, so it was a good thing to start with as well. Yeah. It seems that in the read-only space, people were also not nearly as ambitious as you were going in, or you and Yash were going in, about, like, maybe it could understand this set of things for me. Okay, so you thought of these end evals and came up with a set of tasks that could be auto-gradable, or that fit a set of characteristics that made them better fit
the algorithms. And then what? That was actually a huge process in itself. I think we initially had built a demo to pitch people on this idea, and there was no model training involved; it was fully just prompted models with a UI, pitching the vision of what this product could look like. And so I think after that, we were at the point where we actually had to start thinking about: how are we going to do this? How are we going to create the data? How are we going to train the model? What tools do we have to create to enable the model to browse the internet
effectively. And that was a lot of iteration. I was working very closely with Edward Sun and a few other people on this. And so we also collaborated a lot with the RL team. I think it was definitely a big undertaking. And a good thing about it was we were able to work uninterrupted for quite a few months, making the numbers on our evals go up. So I think it was nice to have
not too much pressure to ship something really quickly. And we were just able to iterate and get it to a good state. Did you have a favorite, like, most important task? We had a few tasks. People would just propose different tasks. One of them was to find all of the papers that Liam Fedus and Barret Zoph had written together. I think there were 11. The model now can find most of them or all of them. We would always ask that question.
And then another one, which the model actually can't answer anymore, probably for good reason, was finding the middle name of one of our co-workers. And then personally, I think I started using it pretty early on for actually finding information for, like,
product recommendations, travel. And I think actually quite a few people internally: we had kind of a Streamlit playground that people would just use. A lot of people had found it and were using it. Sam told me he used it to buy a bunch of things. Every time it would go down, people would message us like, what happened? We need to use the model. Even with
a previous version that honestly wasn't that good. So I think that was a good initial sign. Yeah. What can you say about the actual bulk of the work, like the tool creation and the data creation? So for the data, we did a bunch of different things. We used human trainers. For some of it, we kind of had to come up with new ways, new kinds of datasets, I guess. And we had to figure out how to design datasets to exercise the kinds of skills that we wanted the model to learn.
And then you have to make a way to grade those datasets as you're training on them.
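For concreteness, here's a minimal sketch of what an auto-gradable browsing task and its grader could look like, in the spirit of the co-authored-papers question mentioned earlier. The dataclass, the recall-based grading, and the placeholder titles are all illustrative assumptions, not OpenAI's actual training setup:

```python
from dataclasses import dataclass

@dataclass
class BrowsingTask:
    """One auto-gradable browsing task: a prompt plus a ground-truth answer set."""
    prompt: str
    ground_truth: set  # e.g., the canonical list of co-authored paper titles

def grade(task: BrowsingTask, model_answers: list) -> float:
    """Grade by recall: what fraction of the known answers did the model find?"""
    found = {a.strip().lower() for a in model_answers}
    truth = {t.strip().lower() for t in task.ground_truth}
    return len(found & truth) / len(truth)

# Hypothetical instance modeled on the question above; the titles are
# placeholders, not the real list of papers.
task = BrowsingTask(
    prompt="Find all papers co-authored by Liam Fedus and Barret Zoph.",
    ground_truth={"Placeholder Paper A", "Placeholder Paper B"},
)
print(grade(task, ["placeholder paper a"]))  # 0.5
```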
And then you also have to make good tools for the model to be able to actually complete the task successfully. So right now we just have the browsing tool, which is a text-based browser, but it can see embedded images and open PDFs. And then it also has access to a Python tool, so it can do analysis and calculations and plot graphs and things like that. But you can imagine in future versions we'll just expand the tool set, and so the model will become more capable. But we'll also need to make datasets that actually make the model exercise all of those different tools and figure out how to use them and backtrack and, you know, do all these different things during training, so that it's actually able to flexibly answer new problems from users in the product.
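As a rough sketch of that tool interface, here's a toy version of a text browser plus a Python tool that an agent calls step by step. Every name here is hypothetical, and the real tools (page rendering, PDF handling, sandboxing) are certainly more involved:

```python
import contextlib
import io
import urllib.request

def browser_tool(url: str) -> str:
    """Hypothetical text-based browser: fetch a page, return truncated text.
    The real tool also handles rendering, pagination, images, and PDFs."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")[:4000]

def python_tool(code: str) -> str:
    """Hypothetical Python tool: run model-written code, capture its stdout.
    A real system would execute this in an isolated sandbox, not in-process."""
    buffer = io.StringIO()
    with contextlib.redirect_stdout(buffer):
        exec(code, {})
    return buffer.getvalue()

TOOLS = {"browser": browser_tool, "python": python_tool}

def agent_step(tool_name: str, tool_arg: str) -> str:
    """One step of the loop: the model picks a tool and an argument, we run
    it, and the observation is fed back into the model's context."""
    return TOOLS[tool_name](tool_arg)

print(agent_step("python", "print(2 + 2)"))  # '4'
```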
It is clear that reinforcement fine-tuning on very powerful base models can do very useful things now. That's super exciting. What advice would you have for startups or other companies who are thinking about doing RFT for a particular task, as to when it's worth doing versus when they can just do sort of traditional orchestration where agents are a component? So I think in general, you will always...
get a model better at a specific task if you train on that task. But we also see a lot of generalization from training on one kind of task to, you know, other domains. So you can train a reasoning model on mostly math, coding, other reasoning kind of problems, and it will be
good at writing, but if you trained it on that specific task, it would be better at it. I think if you have a very specific task that is so different to anything the model was likely trained on, and you've tried it a bunch of times yourself with a lot of different prompts and it's just really not good at it (maybe it's some genetic sequencing task, or something that's just so out of distribution for the model that it doesn't know how to figure it out), then I think
that is a good time to try reinforcement fine-tuning. Or if you have a task that is so critical to your business workflow that getting the extra 10, 15% performance is really make or break, then probably try it. But if it's something that you think, oh, the model's pretty good at, but it gets things wrong some percentage of the time, and then you see with every next model that's released, it gets a little bit better, it might not be worth the effort
if the model naturally is just going to get better at those things. So that would be my recommendation. Okay, great. Great advice. You've talked about needing to use human experts to create some of this data. I think of browsing as a somewhat universal task. I guess there are worse and better browsers. Where do you feel like you need expertise or what do you know about browsing expertise that you didn't before? Or information gathering expertise? Yeah, I guess it's one of those things where basically every single profession involves...
you know, having a question or wanting to do research in a domain and then having to find information from many different sources to synthesize an answer. And like while doing that, you have to have the expertise to reason about
is this a useful source? Is this not? Should I include this? Is this, like, completely off topic? That is kind of universal to most jobs, or most, you know, scientific domains, anything. And the cool thing with RL is that you don't
necessarily need to know the whole process of how the person would do the research. You just have to know what the task is and what the outcome should be. And the model will just learn during training how to get from the problem to a good answer. So I think we just took a pretty broad approach. I think that's one thing that if you work at a place like OpenAI, you
I think you can do what they would tell most startups not to do and just try and focus on a really broad set of users and just get experts in loads of different domains and try and see if you can get good at everything at once, which was the approach that we took. And then we also created a lot of synthetic datasets and things like that. But the human data was definitely a really key part for making this model successful. Did any of the learned planning from the model across these domains surprise you, like in terms of the path to find people
the perfect handbag or the restaurant in Japan or the set of papers that was relevant. Yeah, I guess sometimes it will use search terms that I wouldn't necessarily have used. Or, you know, we didn't teach it to plan up front, but sometimes we'll see that it does end up making a plan up front before starting its research. Sometimes
the model will do smart things and try to get around restrictions you put on it. So you have to make sure that it's not hacking, you know, and trying to use a different search engine other than the search engine that you gave it, or something like that. Like, it will do smart things that you have to make sure you're looking out for, in case you don't want to allow the model to do those things.
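A minimal sketch of the kind of guard being described: rejecting search calls that go outside the one engine the model was given. The allowlist, host name, and error handling are all illustrative assumptions, not how the real training harness works:

```python
from urllib.parse import urlparse

# The one search backend the model is supposed to use (hypothetical host).
ALLOWED_SEARCH_HOSTS = {"search.example.com"}

def check_search_call(url: str) -> None:
    """Reject search calls routed to a different engine than the one provided,
    so the policy can't hack around restrictions during training. A real
    checker would also need a broader policy for browsing result pages."""
    host = urlparse(url).netloc
    if host not in ALLOWED_SEARCH_HOSTS:
        raise PermissionError(f"search call to {host!r} is outside the allowed backend")

check_search_call("https://search.example.com/?q=deep+research")  # passes
# check_search_call("https://another-engine.example.org/?q=...")  # would raise
```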
Maybe we can actually use this as a moment to talk about some of the failure modes. How do you think about some of the classic issues with agents, like, you know, compounding error or distraction or even safety? Yeah. So I think with deep research, since it can't actually take actions, there isn't quite the same class of typical agent safety problems you would think of. But I think the fact that the
responses are much more comprehensive and take longer means that people will trust them more. So I think maybe hallucination is a bigger problem. While this model hallucinates less than any model that we've ever released, it's still possible for it to hallucinate, most times because it will infer something incorrectly from one of its sources. So that's part of the reason we have citations, because it's very important that the user is able to check where the information came from. And if it's not correct, they can hopefully figure that out.
But yeah, that's definitely one of the biggest model limitations and something that we're always actively working to improve. In terms of future agents, I think the ideal agent will be able to do research and take actions on your behalf.
And so I think that's a much harder question that we need to address. And it's kind of at that point when capabilities and safety kind of converge where an agent is not useful if you can't trust it to do a task in a way that doesn't have unintended side effects that
you don't want. Like if you ask it to do a task for you and then in the process it sends an embarrassing email or something like this, you know, that's not a successful completion of the task. So I think that is going to be a much more interesting and difficult safety area that we're
starting to tackle. You can tell me if you just don't have a projection here, but do you think people are going to want explicit guardrails? Or do you think you can learn a bunch of those characteristics in the model itself? If you've used Operator, and I'm sure you have, you have to confirm every write action. I think to start with, that makes a lot of sense. You want to build trust with users. And as the models become more capable, maybe you've seen it successfully do things a few times and you start to trust it more, and so maybe you allow it: okay, you don't have to ask me every time you send an email to these people, that's fine. But I do think that as these agents start to roll out, we will definitely want to have guardrails and confirmation, just so, you know, while they're not at the end-state capability, we still want to make sure we have a good level of oversight. But I think that they will get so good that we'll just trust them to do things on our behalf.
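In the spirit of the Operator-style confirmation flow described here, a toy sketch of a guardrail that asks before any write action, with a per-action-type "always allow" once trust builds. Entirely illustrative; this is not how any shipped product implements it:

```python
always_allowed = set()  # action types the user has opted out of confirming

def confirm_write_action(action_type: str, description: str) -> bool:
    """Gate every side-effecting (write) action behind explicit user approval,
    with an opt-in 'always allow' per action type once trust is established."""
    if action_type in always_allowed:
        return True
    answer = input(f"Agent wants to {description}. Allow? [y/N/always] ").strip().lower()
    if answer == "always":
        always_allowed.add(action_type)
        return True
    return answer == "y"

if confirm_write_action("send_email", "send a status email to the team"):
    print("...sending (placeholder for the real side effect)")
```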
What are some of the obvious ways you feel like deep research as a product is going to get better? Yeah, I mean, it's going to extend into write actions. You just implied that. Yes. I mean, I think, you know, the ideal state would be to have a unified agent that can do all of these different things. Anything that you would delegate to a coworker, it should be able to do. How are we going to make decisions about whether it's, like, Sarah, you do this, versus, agent, please do this? Yeah, I guess. Or is it always just try the agent first? Probably. I mean, I would try the agent first if it was my work. It's kind of the pattern that every time the model becomes more capable, the
level of abstraction of the human becomes higher, if that makes sense. Like, the task you're asking it to do is just higher and higher level, but you're still initiating the task. So, you know, maybe a year ago I was asking it to write a function for me, and now I'm asking it to write a whole file, and maybe next year it will, you know, make a whole PR for me or something like that. So I still think we'll be in the driving seat. As for deep research, I think obvious next steps would also be to
have access to private data, like be able to do research over, you know, any internal documentation or GitHub, whatever it is. There's a golden thread here, because when we first met, you were working on retrieval. And I was like, there cannot be only one person at this company working on retrieval. Everything, all roads lead back to retrieval. So I think that will be really cool. And then eventually taking write actions or
calling APIs. And then obviously there are just a lot of things that the model is not perfect at now that we just need to improve. But I think we have a really cool working relationship with the reinforcement learning team. So a lot of teams will contribute data sets to the big runs that they do. So we contribute data sets. And then as they train models with a ton of compute, then it just becomes a better base model for us to continue training from. So I just think the capabilities are compounding.
So this was not a low-key research preview, but a side project that turned into a very interesting, you know, internally pitched project. How do you think about, like, what is a product that OpenAI or at least you yourself want to work on independently versus, like, what belongs in the core research path? A cool thing about OpenAI is that even though the company is bigger,
I think the culture of anyone being able to have an idea and prove it out and then push it to completion has still, you know, been maintained as the company has grown. For me personally, I'm always motivated to work on things that I will use myself. With deep research, for example, I do use it a lot for, you know,
looking up various things, travel recommendations. But I think I'm probably a daily active user. It's fun when you get to dogfood. I think I'm a dog now. Oh, amazing. Yeah. I'm burning a lot of GPUs. Are there use cases where, like, you know, you're the original expert? Are there ways that you or Yash, or that you've seen the user base use it, that you'd encourage people to use deep research? I'm always interested to see
people using it in domains that I have absolutely no expertise in. For example, in medical research: I've seen a lot of different scientists posting about how they've used deep research and how it helped them do something. To me, that's the most interesting, because when we were working on it, I obviously had no way of judging whether an output was good or not. So seeing experts actually ratify deep research responses is useful. An area that I was surprised to see people using the model in was code search, for coding questions. I think, like, use the latest package or latest version of whatever repo to help me write this file, or something. Data analysis as well; that's also something the model's already pretty good at, and I think it will just continue to get better at. You know, uploading a file or something like that and having it do some analysis for you, or do some research and then create a report with numerical analysis, is pretty interesting. I actually haven't tried this, and it's not a browsing task. What makes the model particularly good at
this, or what is it capable of? Is it really, like, multi-step, and then being able to do planning and understanding of the task and produce a report that's cohesive? Yeah, I think also the base model, the model that we started fine-tuning from, O3, is just a very capable model. It's trained on
many different data sets, including a lot of coding, reasoning, and math tasks. So that inherited capability is pretty strong. And then when you add the browsing on top of that, it's still able to do that analysis. So I think those two together can be quite powerful. Before the podcast, we were just talking about...
the idea of like learning taste or like preferences from users, like OpenAI has just released a bunch of memory features. Like how do you think that deep research could, or, you know, just agents in general could evolve to take into account like how people want to learn or their information ingestion preferences? Yeah, I think agent memory will definitely be very important.
It would be very annoying if every time you ask it to do a task, you have to repeat the same information: how you want it to do the task, everything about you. Which currently, for deep research, you do have to do. And I think as the tasks get more complex (right now it will take five to 30 minutes, and you can imagine in the future it might take hours or days to complete a task that you ask the model to do), you definitely want the model's research to be compounding. You don't want it to have to start fresh every time. So I don't necessarily have a good answer, but I think it's something that will be very important.
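A toy sketch of the kind of agent memory being described: persist standing preferences once, then prepend them to each new task so the user never has to repeat them. The file name, format, and helper names are assumptions for illustration:

```python
import json
from pathlib import Path

MEMORY_FILE = Path("user_memory.json")  # hypothetical on-disk preference store

def remember(key: str, value: str) -> None:
    """Persist a standing preference so the user only states it once."""
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    memory[key] = value
    MEMORY_FILE.write_text(json.dumps(memory, indent=2))

def build_prompt(task: str) -> str:
    """Prepend remembered preferences to each new task prompt."""
    memory = json.loads(MEMORY_FILE.read_text()) if MEMORY_FILE.exists() else {}
    prefs = "\n".join(f"- {k}: {v}" for k, v in memory.items())
    return f"User preferences:\n{prefs}\n\nTask: {task}"

remember("report_style", "concise, with citations")
print(build_prompt("Compare note-taking apps for a research team."))
```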
There is a common understanding among many people at some of the leading labs that, like, the recipe to AGI is, I'd say, somewhat known, or, you know, there's confidence on this. And, you know, the return of RL is very exciting for everyone. The stance that I've heard from you and from others is both enthusiasm, like,
this seems to work, we're going to get real capability out of it, it's quite data efficient, and it's going to be a lot of work. Tell me a little bit about, like, the emotional experience of building deep research and if that changes your view at all. I agree with everything you said. I think it's so impressive to see how data efficient the algorithm is. I guess for RL, the data you train on is much higher quality and smaller, so actually curating that is an undertaking. And then making sure that the model has access to all the tools that a human would have access to to do the work that they need to do. And then making sure that you represent tasks that people will find useful or do in their jobs, in a way that you can, you know, judge whether the model did a good job or not, is also hard. And there are different challenges for pre-training, where you have so much more data and have to do all of these different things. I think it's just a different challenge, and both are compounding. Like, you need a really good base model to be able to do RL, and then for our team, we just do more RL. So yeah, it's all very compounding. But I think that everybody does kind of see a pretty clear path through RL to
this broadly capable agent. Do you think there are big blockers to progress? Like you said, maybe not exactly describing it as the next iteration of deep research, but just confidence that we're going to have these unified agent capabilities and it will feel like a coworker. What stands between us and that? There are a lot of really hard safety questions that we need to figure out. We would never ship anything that we don't have very high confidence in.
And I think the stakes are way higher when it has access to your GitHub repositories and your passwords and your private data. So I think that's a really big challenge. I guess also, if you want the model to be able to do tasks that take many, many hours, finding efficient ways to manage context, kind of similar to the memory thing. If you're doing a task for a really long time, you're going to run out of context. So what's an efficient way of dealing with that, of allowing the model to continue to do its thing? And then, yeah, just the task of making the data and making the tools. I mean, I've said this already a few times, but that's a lot of work.
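One simple version of that context-management idea: when the history of a long-running task gets too big, fold the oldest steps into a summary and keep only recent steps verbatim. The summarizer here is a deliberate stub; a real system would presumably use a model call to compress the findings:

```python
def summarize(steps):
    """Placeholder summarizer: truncate and join. A real agent would ask a
    model to compress earlier findings so the research can compound."""
    return " | ".join(s[:40] for s in steps[:3]) + " ..."

def compact_context(steps, max_steps=20, keep_recent=5):
    """When the action/observation history gets too long, replace the oldest
    steps with one summary entry and keep the most recent steps verbatim."""
    if len(steps) <= max_steps:
        return steps
    old, recent = steps[:-keep_recent], steps[-keep_recent:]
    return [f"[summary of {len(old)} earlier steps] {summarize(old)}"] + recent

history = [f"step {i}: visited source {i}" for i in range(30)]
print(len(compact_context(history)))  # 6: one summary entry plus 5 recent steps
```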
I was just looking at my history of queries; I wanted to see what things I asked of deep research versus other models in particular. It has ranged from, obviously, you know, trying to get up to speed on a market for a company I'm looking at, or on a technical topic, or travel planning. That's a big one. Also, I have looked for things that are taste related. So I'll be like, okay, I like...
you know, this set of books for these reasons. I want you to, you know, actually just give me a long-form summary of a bunch of other things you think I should read and explain why. I realize I don't have a super clear mental model of, like, when deep research should be better than O3. What instinct can you give me here? Deep research is very good when you have a very specific query or well-defined query. So maybe not a general overview of a topic, but when you're looking for some specific information. And
you think it would be supplemented by existing research online. Even if that information is also, you know, in the base model's training data, I think having live access to it is quite useful. So if I have any instinct about, like, directing it to retrieval or particular sources, that focusing is useful? I think so. And also, we trained it to have much longer outputs than I think the
you know, normal models would. So if you're looking for something very comprehensive, maybe sometimes too comprehensive for some tasks, I think deep research will be useful for those things. Connect this for me to a deep research, like fashion task.
I've used it to find new brands. So I'll say, these are the kinds of brands I like; please find new brands where I can find this specific coat that looks like this one, or something like that. And then it's very good at finding those. Versus, I think the base model or the normal model will give you some brands, but they won't necessarily fit all of the constraints that I had given. Like, I want this, you know, fake fur coat that's this length, this season or something; it's not going to be able to do that, because it just won't have the up-to-date information and also just won't necessarily be able to deal with all of the constraints in a query in one shot. O1 isn't browsing as comprehensively. I'll use it to find things where I'm looking for a very specific thing that would take me hours to find. So I'm looking for this very specific item or sweater that is probably available on RealReal or somewhere, but I can't find it. Or I'm looking for an Airbnb with very specific constraints. So I think those kinds of things deep research is good for, and then more general, high-level things you should use normal search for. Yes. Well, I will admit I have had some multi-year browsing slash shopping tasks, but I'm now making a cron job for deep research.
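For the cron-job idea, a toy version of a recurring standing query in Python. The run_deep_research function is a placeholder, not a real API call; an actual cron entry pointing at a script would accomplish the same thing:

```python
import time

def run_deep_research(query: str) -> str:
    """Placeholder for kicking off a research task; not a real API call."""
    return f"[report for: {query}]"

def run_on_schedule(query: str, runs: int = 3, interval_seconds: float = 1.0) -> None:
    """Re-run a standing query on a fixed interval; a real deployment would
    use a daily or weekly schedule instead of a short sleep."""
    for _ in range(runs):
        print(run_deep_research(query))
        time.sleep(interval_seconds)

run_on_schedule("new listings matching my saved shopping constraints")
```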
I want to ask just one more experience question, which is: was there a particular, like, win or failure that surprised you in the training of deep research? It really was one of those things where we thought that, you know, training on browsing tasks would work; we felt like we had good conviction in it. But
actually, the first time we trained a model on a new dataset using this algorithm, seeing it actually working and playing with the model was pretty incredible, even though we thought it would work. So honestly, just that it worked
so well was pretty surprising, even though we thought it would, if that makes sense. Yeah, yeah. It's the visceral experience of, like, oh, the path is paved with strawberries or whatever. Exactly. But then sometimes some of the things that it fails at are also surprising. Like, sometimes it will do such smart things and then make a mistake where I'm just thinking, why are you doing that? Like,
stop. So I think there's definitely a lot of room for improvement. But yeah, we've been impressed with the model so far. I'm used to all my technology tools being instantaneous. Deep research is not instantaneous. It's thinking and using tools. Can it be faster? Yeah, I do think there's a good middle ground in between where sometimes you don't want it to do really deep research, but you want it to do more than a search. And I think that we will release things soon that people will be
happy about and that will fill that gap. Okay. I don't know how to communicate this preference, but I want, like, a toggle at some point to be, like: as much work as... I mean, because I would say this to a human: I want you to do as good of a job as you possibly can do in the next five minutes. Yeah. See, that's something where I think it seems like bad UX to actually make the user make that decision. The model should just be better at knowing how much time to think. I think we made a decision when training the model that we're just going to go for max thinking time every time. So I'm sure I will ask it a really simple query sometimes just to test, and then get quite frustrated that it's still thinking. So I do think that's also an area for improvement: knowing how long to think for. But yeah, I suspect with deep research, we'll always be focusing on the tasks that take the maximum length of time. And then I think
like O3 or, you know, O next will have a better in-between. What is an example of a task you can imagine deep research taking a day on in the future? I mean, there's some GPUs smoking. Yeah, I think anything that would take... I mean, right now,
in five or 30 minutes, it can do what human experts say would take many hours. So I guess in an hour, it could do something that would take a human days. In a day, it could do something that would take a human weeks. Obviously, there'll be a lot of challenges to get it to scale like that. But I think you can imagine it doing a research project that would have taken...
weeks to complete, or, like, write a thesis or something like that. Okay. I'm going to make our intern compete with it over the next couple months, then. Yeah. Sounds good. If you were to project forward a year, which is a really long time in AI land, what is something that you think will surprise people that agents can do, and that will actually be released, so it takes the safety considerations into account? Yes. A general agent that could,
you know, help you do a lot of the tasks that you would do in a lot of different areas. Like, for me, I do a lot of coding. I'm hoping that there'll be an agent that is pretty proficient at coding, but that I will just trust: I'll give it a task and it will hopefully make a PR or something. But maybe I can ask the same agent to help me book a trip to Korea or something. I hope that we'll get to a more unified
experience. But I also think that the rate at which these models are improving is going to be pretty surprising to most people. Why do you think a unified experience is important? Or why do you think that makes sense? Because I think today it's, like, quite different to think about. Obviously, ChatGPT is one experience that's very encompassing.
But there are models that people use in different contexts, like, you know, next line completion type models for coding that just feel like a very different setting. I think that you'll probably want both. Like, you'll probably want an experience where you can at some point override or interrupt the model and say, oh, no, I didn't mean that. Or you can take over and start typing something. Yeah. Especially in the short term as the models are not as capable as humans in a lot of areas and are more capable in other areas. Yeah.
So I think it will be a combination of, like, you asking the model to do something, but then, maybe to go with the coding example, you're also in your VS Code or whatever it is, your Cursor, and
it's been doing something for you, but you can also, like, actually type and, you know, write some of it yourself. So I think it will be a combination of those things. But I kind of want it to be something that is just like having a coworker on Slack, or like a remote coworker: you can just ask them to do things for you, send them a Slack message, and then they'll start doing it. And then you can, like, review their work or,
you know, help at some point. But it seems like a pretty nice, general interface, and you don't have to think about which agent should I ask to do which task; like, it should just be able to figure it out. The mental model I have for this is, my general ethos is actually, I love the people I work with, but I prefer to work with fewer people, with less management overhead, all things considered, because each person has
more context and I have more understanding of them. And so like the universally useful agent is attractive. Yeah. And you only have to tell it something once and it will remember and then it will have state on everything you're working on. Things like that. Awesome. Well, this has been a great conversation, Isa. Thanks for doing this and thank you for the product release. Thank you so much for having me and thank you for using Deep Research.
Find us on Twitter at NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.