
Ep 49: OpenAI Researcher Noam Brown Unpacks the Full Release of o1 and the Path to AGI

2024/12/6

Unsupervised Learning

People
Noam Brown
Topics
I observe that scaling the capabilities of large language models, including on the pre-training side, still has potential. However, each capability gain comes at an ever-higher cost. From GPT-2 to GPT-4, the resources required grew from thousands of dollars to millions, possibly even hundreds of millions. Putting in more money, resources, and data will keep producing better models, but this scaling pattern eventually runs into economic limits: past a certain point, the cost of further improvement becomes unbearable. So I think there is a soft limit, in that economics will ultimately constrain further scaling of model size. By contrast, I'm very excited about test-time compute. I think we're in an early stage similar to the GPT-2 era, when the scaling laws were very clear and simply scaling up yielded better models. Scaling pre-training has become harder now, but test-time compute is still in its early days, with a lot of room to improve; algorithmic improvements are also easier, with plenty of low-hanging fruit. That doesn't mean pre-training is done, just that test-time compute has more headroom. I think the potential of test-time compute is enormous. Today a query costs roughly a few cents, but for certain important problems people might be willing to pay millions of dollars or more. That means cost can be scaled up by several orders of magnitude, and algorithmic improvements can further raise the efficiency of test-time compute. So I believe test-time compute has huge potential and a great deal of room to grow.


Transcript


Noam Brown is a research scientist at OpenAI, where he was a key part of their work on o1. Noam's at the forefront of reasoning in LLMs, and he has a really interesting track record from FAIR, where he worked on problems in diplomacy and poker. We hit on the biggest questions in LLMs today on Unsupervised Learning. We talked about whether these models are hitting a wall and how far test-time compute can scale. We hit on how Noam defines AGI, and what he's changed his mind on in the last few years of AI research. This was a really fun one to do right after the general release of o1, and I think folks are really going to enjoy it. Without further ado, here's Noam.

Well, Noam, thanks so much for coming on the podcast. Of course, great to be here. I've been looking forward to this one for a while, and certainly well-timed with some exciting launches going on with Shipmas. Yeah, I'm looking forward to it. We're going to be releasing o1 tomorrow, which I guess by the time this podcast is out, it's already going to be out there. I'm pretty excited for it. I think the community is going to love it, but...

But I guess we'll see. Well, I'd be remiss not to start around what I feel like has been the question of the past month: have we hit a wall with model capabilities? And there's obviously different parts of that question. So maybe to start would just be the extent to which you feel like there's still more juice to squeeze on scaling pre-training.

So my view on this, and I've been pretty public about this, is that I think there's more room to push across the board, and I think that includes pre-training. I think the right way to think about it is that every time you want to scale these models further, there's a cost to that. So you look at GPT-2, it cost between $5,000 and $50,000, depending on how you measure it. You look at GPT-4, obviously there's a lot of improvements, but

The fundamental thing, the most important thing that's changed, is the amount of resources that have gone into it. You go from spending, for frontier models, thousands to tens of thousands of dollars, to hundreds of thousands, to millions, to tens of millions, to, for some labs, possibly hundreds of millions of dollars today. The models keep getting better. I think that that

will continue to be true: if you throw more money, more resources, more data, all this stuff into it, you're going to get a better model. The problem is that, okay, well, if you want to 10x it each time, then at some point that becomes an intractable cost.

And so, okay, if you want to make it better, you want to do another 10x, now you're talking about billions of dollars. You want to do another 10x, now you're talking tens of billions of dollars. And at some point, it's no longer economically worth it to push that further. Presumably you're not going to spend trillions of dollars on a model. So there's no hard wall. It's more of a soft wall, where eventually the economics just don't work out for it. Right.
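To make that compounding concrete, here is a minimal back-of-envelope sketch; the starting cost and the strict 10x-per-generation assumption are illustrative, not figures from the conversation.

# Illustrative only: assume each successive frontier-model generation costs
# roughly 10x the previous one, starting from ~$50k (the upper end quoted for GPT-2).
cost = 50_000
budget_ceiling = 1e12  # hypothetical "nobody spends a trillion dollars" limit

generation = 0
while cost < budget_ceiling:
    print(f"generation {generation}: ~${cost:,.0f}")
    cost *= 10  # one more 10x scale-up
    generation += 1

print(f"After {generation} such jumps the next run would cost ~${cost:,.0f},")
print("which is roughly where the 'soft wall' of economics kicks in.")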

Right. And it seems like obviously there's, you know, in many ways you're able to push this forward with test time compute and like, you know, like there's lower hanging fruit there from a cost perspective to push that forward. Exactly. And so this is why I'm really excited about test time compute. And I think why, like, you know, a lot of people are excited about it is that we're still, it's kind of like we're back in the GPT-2 days, like when GPT-2 was figured out and like the scaling laws were figured out, it was pretty obvious that like, oh, you just scale this up by a thousand X and you're going to get a better model. And you could do that.

It's a little harder now to scale things up by 1,000x in pre-training, but with test-time compute, we're still pretty early. And so we have a lot of room, a lot of runway to scale that further. There's a lot more low-hanging fruit for algorithmic improvements. So I think there's just a lot of exciting stuff to be done in that direction. That's not to say that pre-training is done, but it's just that there's so much more headroom to push the test-time compute paradigm further. And I should also say, even going back to pre-training for a second, it's not like...

you know, there's two more orders of magnitude or something that you can push and then you're done. There's still going to be Moore's law. I think costs are going to continue to come down. It's just a question of how quickly you can scale it. There was this huge overhang where it was very easy to scale it very quickly, and that's becoming a little bit less true. I realize this is probably an overly broad question, but how high is the ceiling on test-time compute? How do you think about where that could go? Again, I think about it in terms of dollar value. So,

How much does a ChatGPT query cost today? Ballpark, a penny. What could you spend on a query that you care a lot about, and what would you be willing to pay? I think there are some problems out there that people would be willing to pay a lot of money for. And I'm not talking about a dollar or five dollars. I'm talking like a million dollars for some of the most important problems that society cares about.
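As a quick sketch of the arithmetic behind the comparison he draws next (the penny and the million dollars are just the ballpark figures from the conversation):

import math

cheap_query = 0.01            # roughly a penny per ChatGPT query today (ballpark)
high_value_query = 1_000_000  # what someone might pay for a truly critical problem

orders_of_magnitude = math.log10(high_value_query / cheap_query)
print(orders_of_magnitude)    # 8.0, i.e. about eight orders of magnitude of headroom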

So how many orders of magnitude is that? Like that's what, eight orders of magnitude? So I think...

there's a lot of room to push it further. And I also think there's a lot of room for algorithmic improvements. So it's not just like, oh, we're just going to dump more money into the query and then you get a better output. It's like, no, actually we can improve this paradigm further and make the scaling a lot better. You know, one thing I thought was interesting is, I guess maybe a month ago, you know, Sam Altman had tweeted, you know, we basically know what we've got to do to build AGI. And I think you tweeted like his view matches like the median view of OpenAI researchers today. Mm-hmm.

Can you say more about that? Because obviously there's so many people now talking like, oh, we've hit a wall. Like, what do you think they're missing? I feel like we've been pretty open about this, that we see things continue to progress pretty rapidly. I think that that's my opinion. I think that Sam expresses his opinion. And I think, you know, I've heard some people say that like, oh, Sam is just trying to like, you know, create hype or something. And I'm kind of surprised by that because like we're saying the same thing and, you know,

Yeah, I think it's a common opinion in the company that things are going to progress quickly. And do you think pre-training and test-time compute alone kind of get you most of the way there? Or is there also, it seems like, this algorithmic bucket as well? It's not by any means that we're done, like we cracked the code to superintelligence and now we just have to scale. It'd be pretty cool if you came on the pod and announced that you had that. But I think, okay, so the way that I think about it:

Back in 2021, like late 2021, I had coffee with Ilya Sutskever. And he was asking me about my AGI timelines. And I told him, to be honest, I think it's going to take a very long time. I'm pretty skeptical that we'll get there within the next 10 years. And the main reason that I gave him was that

we don't have a general way of scaling inference compute, of scaling test-time compute. I saw how much of a difference that made when it came to games, and it wasn't there in language models in a very general way.

Just to me, it seemed kind of silly that we were going to get to superintelligence just by scaling pre-training, because you look at these models and, yeah, they're doing pretty smart things, but also, back then they couldn't even draw a tic-tac-toe board. And yes, you get to GPT-4 and suddenly they can draw the board and make mostly legal moves, but sometimes they still make illegal moves, and they make big suboptimal decisions in tic-tac-toe.

And I have no doubt that if we scale pre-training another order of magnitude or two, it's going to start playing tic-tac-toe really well. But if that's the state of things, that we're spending tens of billions of dollars to train a model and it can barely play tic-tac-toe, that's pretty far from superintelligence.

So I told him, look, we're not going to get to superintelligence until we can figure out how to scale inference compute in a very general way. And I think that's an extremely hard research problem, and it's going to take probably at least a decade to figure out. To my surprise, by the way, he agreed with me. He agreed that scaling pre-training alone would not get to superintelligence. And, I didn't realize it at the time, but he was also thinking very carefully about this scaling-test-time-compute direction.

So I thought it would take at least a decade. It took like two or three years. And I thought that was the hardest unsolved research question at the time. I have no doubt that there are others. In fact, I know that there are other problems that aren't solved, research questions that aren't solved. But I don't think that any of them are going to be harder than the problems that we've already solved. Yeah.

And for that reason, I think that things will continue to progress. Obviously, you've had just a massive impact with this test-time compute work. And your research career had obviously been in search and planning, in games like poker and diplomacy. And from others' accounts, it seems like when you joined OpenAI, you were pretty clear that this was the direction to push in. It seems to have really paid off. I'm curious how much consensus there was around that approach when you joined,

and maybe talk about getting the research organization oriented behind that. Yeah, it's interesting. When I went on the job market and was interviewing at a bunch of places, people in general were quite receptive to the idea. I think for the most part, everybody among the frontier research labs believed that pre-training alone, the current paradigm, would not get us to superintelligence, and that there was something else that was needed. So there was a lot of reception to this idea of, okay,

yeah, maybe we need to figure out how to scale test-time compute. Some labs were more bought into it than others. And I was actually kind of surprised that OpenAI was really, really on board with this, because they're the ones that pioneered large-scale pre-training and scaled it farther than anybody else. But they were very on board with it. And I didn't know at the time, when I was talking with them, that they had also been thinking about this for a while before I joined. So when I did join,

it's interesting, because I think the motivation was different. The motivation that they had in mind was more about overcoming the data wall, not so much about needing to figure out how to scale test-time compute. But the techniques, the agendas, ended up being pretty compatible.

So it actually wasn't too hard. Look, when we started, it was still this exploratory research direction. There were some people working on it, but it wasn't like half the company was dedicated to this large-scale effort by any means. But a few months after I joined, I and various other people were trying things,

many of which didn't work. But there was one thing that one person tried that ended up getting some signs of life, and people were like, oh, that seems interesting, maybe we should try some more things. And you get more and more signs of life, and eventually I think the leadership recognized that, okay, there's actually something here that seems different and valuable, and we should really scale this up.

I was supportive of that, and I think others were too. And I think it was a testament to OpenAI and its organizational excellence that it was able to recognize that there was a lot of potential here and was willing to invest a lot to scale it up. I think it's an underappreciated point that in many ways it's really surprising that something like o1 came out of OpenAI. It's disruptive

to the very paradigm that OpenAI pioneered. And I think it's a really good sign that OpenAI isn't getting trapped in the innovator's dilemma and is willing to invest in a risky direction. And I think in this case, it's going to pay off.

Yeah, it's really interesting, because obviously if the script had continued to play out of just scaling pre-training continuously and raising more money to do that, OpenAI is in a great position to do that. And so any sort of orthogonal approach, yeah, it is different. And it's cool that it came out of the same place. Obviously, your original timeline was, hey, it's going to take 10 years to do this. You did it in two. What was the first thing you saw that made you think, okay, actually this might be way faster than I thought?

So first of all, it's not just me. It was me and a lot of other people that managed to do it in a shorter period of time than I predicted. What's the first thing that I saw? I think when I joined, we had a lot of discussions about the kinds of behavior that we would like the model to do. And that included things like

We want to be able to see it try different strategies to solve a problem if a strategy isn't working out. We want to see it take a hard problem that involves many steps and break that problem down into smaller pieces that it can tackle one by one. We want to see it recognize when it's making a mistake and correct those mistakes or avoid making them in the first place. And there were a lot of discussions around how do we get those individual things. And that kind of...

bothered me, the fact that we would even try to tackle them individually, because it just seems like, ideally, we get something that figures out all this stuff on its own. And when we got the initial signs of life, one of the things that we tried, which I was a big fan of and advocated for, was: why don't we just have it think for longer? And when we had it think for longer, it would just do these kinds of things emergently. It wasn't like

oh, suddenly we have o1, but there were indications that it was doing the things we wanted, the things we were strategizing about how to enable it to do. It was just figuring out on its own that it should be doing these things. And it was also clear that we could scale it a lot further. So I think for me, that was the big moment, where we just had it

think for longer and suddenly you see a qualitative difference. You see this qualitative behavior that we thought we would have to somehow add to the model, and it figures it out on its own. And of course the performance is better, but the performance wasn't that much better. It was really seeing the qualitative change, seeing those behaviors, that gave me the conviction that, okay, this is going to be a big deal.

I think that was probably October 2023. Wow. And it got out pretty fast after. Could have been faster. I guess, how would you contextualize for our listeners today where planning in an o1-type model is helpful, and where you should stick with GPT-4o because it's not as helpful? And how do you expect that to change going forward, given that you're obviously constantly working on improving this? I think eventually...

There is a single model. I think right now we're in this state where GPT-4o is better for many things and o1 is better for many things. Certainly o1 is more intelligent. So if you have a very hard problem, o1 is extremely good for that. I have talked with researchers at universities, like a friend who's a professor who...

loves o1, is a real power user because it just can tackle these hard research questions that normally you would need somebody with a PhD to be able to handle.

I think for some tasks, creative writing might be one of them, though actually I'm not sure I know that. For something like creative writing, 4o might be better than o1-preview; I'm not sure what the comparison is for o1. But certainly the big benefit of 4o is that you just get a faster response. So if you want a response immediately and it's not a very hard reasoning task, I think 4o is a reasonable thing to try. Yeah.

But I should say that eventually where we want to end up is a single model: you just ask it everything, and if it requires a lot of deep thinking, it can do that, and if it doesn't and it can respond immediately with a quite good response, it does that as well. What does the intersection of multimodal models and these models look like going forward?

So o1 takes images as input. Yeah. I think that's going to be pretty exciting. It's going to be exciting to see what people do with that.

Yeah, I don't see any blockers to having them be as multimodal as 4o and these other models. One of the fascinating parts of o1 is that a lot of the previous work that you had done in reasoning was built on reasoning that was kind of specific to that problem. As I understand it, Go used Monte Carlo tree search, which maybe wasn't as relevant for poker. And obviously one of the things that is so impressive about what you built is that you scaled inference compute generally. Can you talk a little bit about

what's required to do that, versus maybe some of the more specific work that had been done in the past toward specific types of problems? Well, I can't go into details about the actual technique, but I think the important thing is that it requires maybe a change in mindset. When I was a PhD student and afterwards, once I saw how much of a difference

scaling test time compute made in poker, I was like, okay, this is great, but unfortunately it only works in poker. So how do we extend this algorithm to be able to do more and more domains? And so there's a question about how do you get this technique to work for both poker and Go or poker and diplomacy or something like that. And so we developed techniques that work in Hanabi, we developed techniques that worked in diplomacy, and

One of the things that I was considering doing was just trying to get this algorithm to play as many games as possible, trying to come up with an algorithm similar to what was done in poker but able to work more broadly. And I think the diplomacy work actually convinced me that that's kind of the wrong way to think about it, that you really need to start from the end point, which is: okay, we have this extremely general domain. And language is actually a really good example of this, where you have such breadth.

And instead of trying to extend a technique that worked in one domain to do more and more domains and eventually do everything, we should instead start from everything and figure out

some way to scale test-time compute. And initially, of course, it's not going to scale very well, it's not going to be a very good technique for scaling test-time compute, but then can you make it scale better and better? I think the diplomacy work is really what convinced me to have that change in mindset, because when we tried to take the techniques that we developed for poker and Go and apply them to diplomacy, we

couldn't apply them to the actual full, general game of diplomacy. We managed to apply them to diplomacy with some constraints on what it could actually do. And there was a ceiling to how much it could achieve. We actually only got to strong human-level performance in diplomacy. And it was pretty clear that if we pushed that paradigm a lot further, we weren't going to get to superhuman performance. So to actually tackle

the full game of diplomacy and reach superhuman performance in diplomacy, it was clear that we needed something that would actually just work for pretty much anything. And so I thought, okay, you just have to jump to the end point and try to tackle everything. It's so interesting. I mean, you mentioned that you kind of expect everything to converge on a single model.

I guess in what timeframe, like in the medium term, do you think that like we have one model that rules them all or, you know, obviously there's lots of folks out there building specialized models for different use cases. Like, do you think building your own model like makes sense? I guess there's folks building legal models or healthcare models or some of these things. So it's a good question. I get asked this a lot. I don't have a great answer for this, but like one thing I have been thinking about is, you know, you can ask,

you can ask o1 to multiply two large numbers, and it can do it. It'll work through the arithmetic, figure out how to carry the digits and all that, actually multiply the two large numbers, and tell you the answer. It doesn't make any sense for it to do that. Really, what it should do is just call a calculator tool, or write a Python script that multiplies the two numbers, run the script, and then tell you the output.

That calculator tool is like one extreme end of the spectrum of like very specialized, very simple, but very fast and cheap.

And on the other end of the spectrum, you have something like o1, which is very general and very capable, but also pretty expensive. And I think it's quite possible that you'll see a lot of things that essentially act as tools in between those two extremes, tools that o1, or a model like o1, can use to save itself and save the user a lot of cost. Yeah. It's really interesting that the tools don't end up being capability enhancing. They're more just there

to avoid requiring a massive amount of compute cost to solve something that could be much more easily solved. Yeah, it's also entirely possible that some of these tools just do a flat-out better job than o1.
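As an illustration of the kind of delegation being described here, a hypothetical sketch (not OpenAI's implementation; ask_expensive_model is a made-up stand-in for a call to an o1-style reasoning model):

import re

def ask_expensive_model(prompt: str) -> str:
    # Placeholder for a call to a general, capable, but expensive reasoning model.
    raise NotImplementedError("stand-in for an o1-style model call")

def answer(prompt: str) -> str:
    # If the request is plain multiplication, delegate to a cheap, exact 'tool'
    # (here, Python itself) instead of spending reasoning tokens on arithmetic.
    m = re.fullmatch(r"\s*(\d+)\s*[*x]\s*(\d+)\s*", prompt)
    if m:
        return str(int(m.group(1)) * int(m.group(2)))
    # Otherwise fall back to the general model at the expensive end of the spectrum.
    return ask_expensive_model(prompt)

print(answer("123456789 * 987654321"))  # handled entirely by the calculator path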

So the way I think about it is kind of the same way I would think about how a human would act. You could ask a human to do something, but maybe they're better off just using a calculator or some other kind of specialized machine. Well, I guess on the o1 side, you mentioned your friend who's a professor using it. Any other kind of unexpected use cases that you've seen in the wild, or personal favorites? I think one thing I'm really excited for is to see how o1 is used for coding.

I think with o1-preview, people were pretty impressed by its coding ability, but it was good in some ways for coding and not as great for others, so it wasn't strictly dominant among models for coding. I think that o1 is going to do a lot better, and I'm pretty excited to see how, and if, that changes the field.

Yeah, I'm just really curious to see. I use o1 internally. Other people do. We've had some people play around with it and give us feedback, but I don't think we really know how it gets used until we actually deploy it in the wild. Yeah. How do you use it?

I use it for a lot of coding tasks. Frequently what I'll do is, if I have something that's pretty easy, I'll give it to 4o. But if I have something that I know is really hard, or where I need to write a lot of code, I'll just give it to o1 and have it do the whole thing on its own. And also, frequently, if I have a tough problem that for whatever reason 4o isn't getting, I'll just give it to o1 and it'll usually give me an answer. It's not doing core AI research yet.

o1 is not doing core AI research. You mentioned that on the path to o1, there were some things that you saw, milestones that were really meaningful, around the ability to reason through things. As you continue to work on this class of models, what are the milestones that are meaningful to you going forward, things that would be important if you saw them as you continue to scale up? Milestones as in benchmarks or something? It could be specific benchmarks, or even just how you think about the next set of capabilities that you'd hope an o2 would have.

I'm really excited to see these models become more agentic. I think a lot of people are. So I think one of the major challenges, one of the major barriers to actually achieving agents... People have been talking about agents for a while. Ever since ChatGPT came out, people were always talking about agents. They would come to me and ask, oh, why are you working on agents? My feeling was that the models are too brittle, that if you have

a long horizon task and there's a lot of intermediate steps. You need the reliability and you need the coherence to be able to have the model figure out that it needs to do these individual steps and then also execute on them. And yes, people tried to prompt the models to be able to do that and you could kind of do it, but it was always kind of fragile and not general enough.
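A small illustrative sketch of why that brittleness bites on long-horizon tasks (the per-step success rate is a made-up number; the compounding is the point): if each intermediate step succeeds independently with probability p, a task needing n steps succeeds with probability p to the n, which collapses quickly.

per_step_success = 0.95  # illustrative per-step reliability, not a measured figure

for n_steps in (1, 5, 10, 20, 50):
    task_success = per_step_success ** n_steps
    print(f"{n_steps:>2} steps -> {task_success:.1%} chance the whole task succeeds")
# At 95% per step, a 50-step task already succeeds less than 8% of the time.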

And the cool thing about o1 is that I think it's a real proof of concept that you can give it a really hard problem, and it can figure out the intermediate steps on its own, and it can figure out how to tackle those intermediate steps on its own. So the fact that it's able to do things that are completely outside the realm of what something like 4o can do, without really excessive prompting, I think is a good proof of concept that

it can start doing things that are agentic. So yeah, I'm excited for that direction. There's obviously a lot of folks today that are working on agents, and I think they basically take the current limitations of models and find ways around them, right? Whether they chain six model calls together to check outputs, or they find some smaller fine-tuned model that just checks whether something ties exactly back to the original data source. It feels like there's all this orchestration and scaffolding that's built to make this work. Does that feel like

something that persists, or does that eventually all just become part of the underlying model? You know, okay, so there's this great essay called The Bitter Lesson. I knew we couldn't get through this podcast without The Bitter Lesson coming up. You know, I'm surprised, because whenever I give talks at various AI events,

sometimes I'll poll people and ask how many have read The Bitter Lesson. And surprisingly few have, even people who have been in the field. I feel like if anyone's listened to a podcast with you or follows you on Twitter, they would have been exposed to The Bitter Lesson. Perfect. Great. Okay. So for those that haven't, I think it's a great essay. I highly encourage people to read it. It was written by the creator of the field of RL, Richard Sutton. And he talks about this pattern that shows up basically every time:

You look at the history of chess, for example. The way people tried to tackle chess was to code their knowledge into the systems and try to get them to do human-like things. And the techniques that ended up working really well were the ones that scaled really well with more compute and more data.

And I think the same is true now with these language models. We've reached a certain level of capability, and it's really tempting to try to push it. There are things they're just unable to do, and you'd like them to be able to do those things. So there's a big incentive to add a bunch of scaffolding and all these prompting tricks to push it a little bit further, and you encode a lot of your human knowledge in that in order to get the models to go a little bit further. But

What's ultimately going to work in the long run is a technique that scales well with more data and more compute. And

there's a question about whether those scaffolding techniques scale well with more data and more compute. I think the answer is no. I think something like o1 is something that scales really well with more data and more compute. And so I think that in the long run, a lot of those scaffolding techniques that push the frontier a little bit further are going to go away. And I think it's an interesting question for builders today, because you could solve a here-and-now problem with that and then evolve over time with what's required. Yeah, it's a tricky thing, especially for startups, because I know that

they probably face a lot of demand for some task, and there's something that's just out of reach of the models, and they think, okay, well, if I invest a lot into the scaffolding and customization to make it able to do those things, then I'll have a company that's able to do that thing that nobody else can do. But I think it's important, and this is actually one of the reasons why we're telling people, look, these models are going to progress, and they're going to progress quickly: you don't want to be in a position where

the model capabilities improve and suddenly the models can just do that thing out of the box, and now you've wasted six months building scaffolding or some specialized agentic workflow that the models can now just do out of the box. Talking about what's happening in the broader LLM space, beyond test-time compute, what are other research areas you're paying attention to? I was really excited by Sora. I think a lot of people were. I thought it was really cool. I wasn't really keeping

up to date on the state of video models, and so when I saw it, I was pretty surprised at how capable it was. You obviously cut your teeth in academia. I think there's a question now that a lot of folks are thinking about, about the role of academia in AI research today, given access to a completely different level of compute. How do you think about the role of academia today?

Yeah, it's a real tough question. I've talked to a bunch of PhD students, and they're in a tough situation where they want to help push the frontier further, and it's hard to do in a world where so much is dependent on data and compute. If you don't have those resources, then it's hard to push the frontier forward. There's a temptation, I think, among some PhD students to try to do what I said shouldn't be done, and add

their human domain knowledge, add these little tricks to try to push the frontier a little bit further. So you take a frontier model, you add some clever prompting or something, you push it a little bit further, and you get 0.1% higher than everybody else on some eval. And the problem is, I don't blame the students so much as I think academia incentivizes this. I mean,

It's prestigious to have a paper accepted to a prestigious conference, and it's much easier to get a paper accepted to a conference if you're able to show that you're at least slightly better than everybody else on some eval. So the incentive structure is set up in a way that encourages that behavior, at least in the short term. But in the long term, that ends up not really being the most impactful research.

So my suggestion is don't try to compete with the frontier industry research labs on frontier capabilities. I think that there's a lot of other research that can be done, and I've seen really impactful research that can be done. So one example is just like investigating novel architectures or novel approaches that

scale well. And if you can show the scaling trends and show that there's a promising path as you throw more data and more compute into it, then that is good research, even if it's not getting state-of-the-art performance on some eval. And people are going to pay attention to that.
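As a sketch of the kind of evidence being described (the numbers are entirely illustrative; the point is the trend line, not the values), one could fit a power law to loss versus compute and show that a new approach keeps improving as it scales:

import numpy as np

compute = np.array([1e17, 1e18, 1e19, 1e20])  # training FLOPs (illustrative)
loss = np.array([3.10, 2.65, 2.28, 1.97])     # eval loss (illustrative)

# In log-log space a power law, loss ~ a * compute**(-b), is a straight line.
slope, intercept = np.polyfit(np.log10(compute), np.log10(loss), 1)
print(f"fitted exponent b = {-slope:.3f}")

# Extrapolate one more order of magnitude to show the projected trend.
projected = 10 ** (intercept + slope * np.log10(1e21))
print(f"projected loss at 1e21 FLOPs: {projected:.2f}")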

It might not be that the people who casually pay attention to the field pick up on it. It might not make it into the news cycle or something. But

It will have an impact if it's showing promising trends. And I guarantee you that industry research labs look at those kinds of papers. And if they see something that is showing a promising trend line, they're willing to put in the resources to see if it actually pays off at large scale. What evals are still meaningful to you? When you're playing around with a new model, what are you looking at?

I think there are a lot of vibes questions that I ask, and I'm sure everybody has these. Do you have a go-to vibes question? I mean, my go-to is really tic-tac-toe. Always games, I guess. That makes sense. Yeah, it's shocking to see how challenging it is for some of these models to play tic-tac-toe. I joke that it's just because there aren't enough five-year-olds on the internet posting tic-tac-toe strategy on Reddit. Yeah, we haven't populated the world with tons of tic-tac-toe data. Yeah, and I just like to

see how these models do with the kinds of day-to-day questions that I have. And it's pretty cool to see the progress in things like going from 4o to o1-preview to o1. Yeah. I mean, you mentioned, obviously, it sounds like since 2021 you changed your mind and then showed it with what was possible with test-time compute. Anything in the last year that you've changed your mind on in the AI research world? And I should say, it wasn't like I changed my mind in 2021. I was pretty bought into this

even basically when we got the poker results in early 2017. Yeah. For language models, the shift to language models, I think I started thinking about that more like 2020, 2021. Yeah. No, sorry, I meant more that you had thought in 2021 it would take 10 years to scale this stuff, and it ended up taking two. Anything in the last year that you've kind of done a 180 on, something you thought?

I think the main thing that I've changed my perspective on is how quickly I think things would progress. So like I said, I remember I've been in the AI field for a pretty long time by today's standards. So I started grad school in 2012. I saw the deep learning revolution happen. And I saw people talking very seriously about AGI and super intelligence back in 2015, 2016, 2017.

And, you know, my view at the time was that, you know, just because AlphaGo is superhuman at Go, it doesn't mean that we're going to get to superintelligence anytime soon. And I think that was actually the correct assessment. Like, I think people didn't look at the limitations of AlphaGo enough and the fact that like, okay, it can play Go, it can even play chess and shogi, but it can't play poker. And nobody actually has a good idea about how to actually get it to be more general than that.

And two-player zero-sum games are these very ideal situations where you can do this unlimited self-play and keep hill climbing in some direction that gets you to superhuman performance. And that's not true for the real world. So...

I was on the more skeptical end, though I was probably actually more optimistic than the average AI researcher that we could progress towards very, very intelligent models that would change the world. But compared to people at OpenAI or some of these other places, I was on the more skeptical end. Then I think my perspective on that has changed quite a bit. I think seeing

the ability to scale test-time compute in a very general way, that changed my mind. And I kind of became increasingly optimistic. Actually, I think the conversation that I had with Ilya back in 2021 was the start of that. He kind of convinced me that, yes, we don't have the entire paradigm figured out, but maybe it's not as far away as 10 years. Maybe we can get there sooner. And I think...

Seeing that actually happen changed my perspective, and I think things are going to happen faster than I originally thought. I mean, obviously there's a bunch of folks out there that are trying to compete with NVIDIA. I think Amazon recently has been pretty aggressively investing in Trainium and having Anthropic use it. What do you think about some of these other hardware efforts? I'm pretty excited to see the investment in hardware. I mean, I think...

One of the cool things about o1 is that I think it really changes the way that people should be thinking about hardware. Before, people had this mindset that, okay, there are going to be these massive pre-training runs, but then the inference cost is going to be pretty cheap and very scalable. I don't think that's going to be the case. I think we're going to see a major shift towards inference compute, and if there are ways to optimize around inference compute, I think that's going to be a big win. So I think there's an opportunity for a lot of creativity on the hardware side now to

adapt to this new paradigm. Kind of hitting on some questions outside of LLMs, you know, I feel like your work with diplomacy is, like, incredibly interesting. Obviously, it's, like, this game that involves negotiation, predicting how others will act, et cetera. It's hard not to think about the implications of that for, like, you know, simulating society to test policies or, like, even having AI, like, as part of a government in some way. How have you kind of thought about this and what are your kind of intuitions as these models get better and better about, like, the role they play in those parts of society at large? Well, I think, um,

I guess two questions there, but kind of answering one of them. I think one of the directions that I'm pretty excited about for these models is using them for a lot of social science experiments and also things like neuroscience. I think you can learn a lot about humans by looking at these models that were trained on vast amounts of human data and are able to imitate humans quite well. And of course, the great thing about them is that they're much more scalable and cheaper than hiring a bunch of humans to run these experiments. So I...

I am curious to see how the social sciences use these models to do cool research in their fields. What are some ways you could imagine that happening? I mean, I'm not a social scientist, so I haven't thought about this

all that well. But I think economics, for example... You did work at the Fed before, right? I did work at the Fed. Yeah. So I guess game theory is actually a good one. When I was an undergrad, I participated in some of these experiments where they would bring in a few undergrads, pay them a small amount of money, and have them do these small game theory experiments to see, oh, how rational are they? How do they respond to incentives?

How much do they care about making money versus getting revenge on people that wronged them? A lot of these things you can do now with AI models. It's not obvious that it would translate, that it would be a match for human behavior, but

That's something that can be quantified. You could actually see, in general, do these models do things that humans would do? And then if you have a much more expensive experiment, then you could maybe extrapolate and say, this is not cost-effective to do with human subjects, but we could use this AI model. Or things that are ethical concerns as well. Maybe you can't do this experiment because it's not ethical to do with humans, but you could do it with AI models. So I guess one example is...

The ultimatum game. Are you familiar with that? No. Okay, so the ultimatum game is that you have two participants. Let's call them A and B. A has like $1,000, and they have to give some percentage of that to B. And then B can decide whether to accept that split or to just say that neither player gets anything. So if A has $1,000, they give like $200 to B. If B accepts, then B gets $200. A gets $800. If B rejects, then both of them get $0. And

you know, there's experiments showing that like if people get offered less than roughly 30%, then they'll reject. And of course, there's the question about like, okay, well, if it's a small amount of money, then that's,

pretty understandable that if it's $10 and you're only offered $3, you would just be annoyed at that person and reject it to spite them. Are you still going to do that if it's $10,000 and you're being offered $3,000? It's kind of a different question. And of course, it's super expensive to actually run that experiment. So the way they've done it historically is they would go to a very poor community in a different country

and offer them what to them would be a very large amount of money and see how they would act differently.
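A minimal sketch of what probing a model on this game might look like (ask_model is a hypothetical stand-in for a real LLM call and is stubbed out here; nothing below reflects an actual study):

def ask_model(prompt: str) -> str:
    # Placeholder: swap in a real LLM call here. The dummy reply lets the
    # script run end to end without any external dependency.
    return "accept"

def ultimatum_trial(total: int, offer: int) -> str:
    prompt = (
        f"You are player B in the ultimatum game. Player A has ${total:,} "
        f"and offers you ${offer:,}. If you accept, you get ${offer:,} and "
        f"A keeps ${total - offer:,}. If you reject, you both get $0. "
        "Answer with exactly one word: accept or reject."
    )
    return ask_model(prompt).strip().lower()

# Hold the split at 20% while varying the stakes, to ask whether the model,
# like the human subjects described above, rejects 'unfair' low-stakes offers
# but becomes more willing to accept as the absolute amount grows.
for total in (10, 1_000, 10_000, 1_000_000):
    offer = total // 5
    print(total, offer, ultimatum_trial(total, offer))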

But even then, you can only push that so far. So with AI models, maybe now you could actually get some insights into how people would react to these kinds of situations that are cost-prohibitive to study. It's interesting. And also, for neuroscience and other things, I always think a complaint of the social sciences has been that all those experiments are done on college kids that need to get credit in their intro psych class or something. And so also, you know, getting exposure to a broader universe:

the internet, at least, is probably a broader swath of society to have been trained on than most of these experiments, which are basically 19-year-olds at top institutions. Yeah, that's a great point. And I should also say that, look, if you're doing these experiments, GPT-3.5 is not going to do a great job of imitating how an actual human would act in a lot of these settings.

But this is a very quantifiable thing: you can actually just measure how closely these models match what humans would do. And I suspect, I haven't actually looked at these experiments myself, but I suspect that as the models become more capable, they do a better job of imitating how actual humans would act in these settings. And then obviously your work in diplomacy was focused on kind of an AI player among a bunch of humans. Yeah.

How, if at all, does that change? I feel like we're about to enter some world where we have AI agents interacting with other AI agents and negotiating and whatnot. How does that change things, if at all? I think one of the things that I'm really excited for about LLMs is that there was always this question in AI about how do you get AIs to even communicate with each other? So there's this whole field of AI called emergent communication, where people would try to like

teach AIs to be able to communicate with each other. And that problem is now effectively solved because you have...

a language built in that, conveniently, humans also use. And so a lot of these problems are just conveniently answered out of the box. So it's quite possible that maybe you don't need to change that much. What do you think of what's happening in the AI robotics space? Where do you think that space goes in the next few years? I think in the long term, it makes a lot of sense. I did a master's in robotics. I didn't actually work with robots that much, but I was in the program, and I had a lot of friends that were working in robotics.

And one of the main takeaways that I got is that hardware is hard and it takes longer to iterate with hardware compared to software. So I suspect that robotics is going to take a little while to progress just because iterating on actual physical robots is hard and expensive.

But I think that there's going to be progress. Obviously, you're about to release o1 into the wild, and people are going to build all sorts of things on top of it that neither of us could possibly imagine. But are there some areas generally that you feel are underexplored applications today, or places where you wish there were more builders messing around with these models? I think I'm really excited to see these models advancing scientific research.

I think we've been in kind of a weird state up until now, where the models were broadly very capable, but they weren't really surpassing expert humans in more than a handful of domains. And I think increasingly, as time goes on, that's going to stop being true. We're going to start seeing the models surpassing what expert humans can do, first in just a few narrow domains, but then in increasingly more and more domains. And

that opens up the possibility that you can actually advance the frontier of human knowledge and use these models not as replacements for researchers, but as a partner that you can use to do things that were not otherwise possible, or to do them a lot faster. So I think that's the application that I'm most excited about. It's not something that has really been possible yet, but I think we're going to start seeing it happen. You think it's possible with this current set of models?

I don't know. And this is actually one of the reasons why I'm excited to see o1 released: I'm a researcher in one domain, but I'm not a researcher in all these different domains, and I don't know if it will be able to improve the state of chemistry research or the state of biology research or theoretical mathematics. Getting the model into the hands of these people and seeing what they can do with it, I think, will help

give us some feedback on where it's at in those domains. You mentioned that it might start more narrowly first before expanding out. Any intuitions on the narrow subset of things that might be particularly well suited to it, or is that for the community to find out as they mess around with it? I think it's for the community to find out. For o1-preview, it looks like it does particularly well on math and coding. Yeah, those were very impressive results. Yeah, and I mean, it's improving things pretty broadly, but we're seeing...

quite noticeable progress on those two.

I wouldn't be surprised if that continues to be true and we see the performance improving very broadly, but because math and coding are already ahead, they will continue to progress more quickly. But I think it's going to be a broad improvement across the board. Well, Noam, it's been a fascinating conversation. We always like to end with a quick-fire round where we get your quick take on things. Maybe to start, what's one thing that's overhyped and one thing that's underhyped in the AI world today? I mean, I guess overhyped, I would say a lot of these...

kind of prompting techniques and scaffolding techniques that, like I said, I think are going to be done away with in the long term. Underhyped? I mean, I'm a huge fan of o1. I've got to say o1. I think that for people that are paying attention to the field, it has been a big update. I think that for the broader world, I don't know if people have recognized what it means yet to the extent that they should.

Yeah, I think I'll go with those two. Hopefully the release tomorrow starts getting to that. Yeah, well, we'll see. Do you think model progress in 2025 will be more, less, or about the same as in '24? I think that we will see progress accelerate. How do you define AGI? I don't. I've been trying to shift away from using that term as much as possible. I think...

I mean, I think there's going to be a lot of things that an AI will not be able to do that humans can do, for a long time. And I think that's the ideal scenario, especially things like physical tasks; I think humans will have an edge there for a very long time. And so an AI that can accelerate human productivity and make our lives easier is the more important

term than AGI. Well, Noam, I always like to leave the last word to our guests. And I feel like there's a million places you could point people to your work, what's going on at OpenAI, but the floor is yours. Anything you want to say to our listeners or things you want to call out? Yeah, I mean, I guess the main thing is that to skeptics out there, I get it. I

I've been in this field for a long time. I was very skeptical about the state of things and the hype around the progress in AI. I kind of recognized that AI was going to progress, but I thought that it would take us much longer to even reach this point. I think it's really important to recognize that where we are right now is complete science fiction compared to even five years ago, let alone ten years ago. So

The progress has been astounding. And I think that there are reasonable concerns about like, oh, are we going to hit a wall? Is progress going to stop? But I think it's important to recognize that the test time compute paradigm, in my opinion, really addresses a lot of those concerns. And so I think for people that are still skeptical of like the progress in AI, I would just recommend like

Take a look for yourself. We've been pretty transparent with the blog post and our results about where things are, where we see things going.

And I think the evidence is pretty clear. Well, Noam, this has been absolutely fascinating and a real pleasure of this job to get to sit down with you. Thanks so much for taking the time. Of course. Thanks. A huge thanks again to Noam for a just fascinating conversation. If you enjoyed that, please consider subscribing, sharing with a friend. We're always trying to get the word out about the podcast. We have a bunch of great conversations coming up with leading AI researchers and founders. 2025 is going to be an incredible guest lineup. Thanks so much for listening, and I'll see you next week.
