
Ep 51: Former Chief Research Officer of OpenAI Bob McGrew - What Comes Next for AI?

2024/12/18

Unsupervised Learning

Chapters
Bob McGrew discusses the differing perspectives on AI model capabilities between insiders and outsiders. He explains the significant compute increases needed for pre-training progress and the time it takes to build new data centers. He also highlights the role of reinforcement learning in improving model capabilities and the potential for future progress in test-time compute.
  • Significant compute increases (100x) are needed for pre-training progress from GPT-2 to GPT-3 and GPT-3 to GPT-4.
  • Algorithmic improvements can help, but fundamentally new data centers are needed.
  • Reinforcement learning is key to enable longer, coherent chains of thought and packing more compute into answers.
  • Test-time compute offers significant room for algorithmic improvements without needing new data centers.

Transcript

Bob McGrew was the Chief Research Officer at OpenAI for six and a half years. He left just a few months ago, and we at Unsupervised Learning had the privilege of being one of the first podcasts he's come on. So it was an opportunity to ask him literally everything about the future of AI. We talked about whether models have hit a wall. We talked about robotics models, video models, computer use models.

and the timeline and capabilities that Bob envisions in the future. We talked about OpenAI's unique culture and what made research so effective there, and also some of the key decision points and what it was like to live through them. We hit on why AGI might not feel that different than how things feel today. And Bob also shared why he left OpenAI and what's next. I think folks are really going to enjoy this episode. Without further ado, here's Bob.

Well, Bob, thanks so much for coming on the podcast. Thanks for having me here. This will be fun. Really excited to have you. I know we'll hit on a bunch of different topics. I figured we'd just start with a question that I feel like is top of everyone's mind right now, which is the big debate on whether we've hit a wall with model capabilities. And so we'd love your thoughts on that and the extent to which you feel like there's still more juice to squeeze on the pre-training side. Yeah, I think this is probably the place where the outside view of people who are watching and the inside view of people who are at the big labs diverge the most.

I think if you're coming at this from the outside, a lot of people first start paying attention to AI with ChatGPT. And then six months later, boom, GPT-4 comes out. And it feels like everything's accelerating very fast and progress is happening. And then GPT-4 came out a year and a half ago. Everybody knows it had been trained beforehand. And so what's happening? Why is nothing coming out, right? Yeah.

And the view from inside is very different. On the outside, people are saying we've hit a data wall, asking what's going on. But you have to remember that to make pre-training progress in particular, it involves...

increasing the amount of compute by really a huge amount. So to go from GPT-2 to GPT-3 or GPT-3 to GPT-4, that's 100x increase in effective compute. That's what that increment means. And so you get that by a combination of just adding more flops, so more chips, bigger data centers, and algorithmic improvements.

And the algorithmic improvements, you can get some out of them; 50%, 2x, 3x would be amazing. But fundamentally, you've got to wait to build a new data center. And so there is no shortage of new data centers coming up. You can just read the news. You can see Meta and X, and other frontier labs are also building new data centers, maybe without making the news. Yeah.

But, you know, fundamentally, this is a very slow multi-year process. And in fact, you know, before you see, you know, a full generation of GPT-4 to a GPT-5 lift, you're going to see something that's just a 10x lift. Because, you know, people often forget that we went from GPT-3 to GPT-3.5 to GPT-4.
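To make the arithmetic above concrete, here is a quick back-of-the-envelope sketch. The ~100x effective-compute jump per full generation and the "50%, 2x, 3x" algorithmic gains come from the conversation; everything else is an illustrative assumption, not OpenAI's numbers.

```python
# Illustrative arithmetic only: a full GPT-style generation is ~100x *effective*
# compute, and algorithmic improvements contribute a comparatively small
# multiplier, so most of the jump has to come from new hardware.

GENERATION_JUMP = 100.0  # ~100x effective compute per full generation (per the conversation)

for algo_gain in (1.5, 2.0, 3.0):  # "50%, 2x, 3x would be amazing"
    hardware_gain_needed = GENERATION_JUMP / algo_gain
    print(f"algorithmic gain {algo_gain}x -> hardware must still supply ~{hardware_gain_needed:.0f}x")

# Even a generous 3x algorithmic win leaves ~33x to come from flops,
# which is why the bottleneck is the multi-year data-center build-out.
```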

The thing that's interesting right now is that pre-training is still going on.

I think we're going to have to wait and see when this next generation of models gets released. But if you look at something like O1, where we've been able to make progress using reinforcement learning, by a variety of metrics, O1 is a 100x compute increase over GPT-4. And I think some people don't realize this because the decision was made to name it O1 instead of GPT-5. But actually, this is effectively a new generation of models.

And when that next generation, call it GPT-4.5, is trained...

The interesting question is going to be, how does this pre-training progress stack with the reinforcement learning process? I think that's something that we're going to just have to wait and see what gets announced. That begs the question, with the multi-year process going into 2025, do you think there's going to be as much progress this coming year in AI as the last year, or do you think that things will begin to get slower? Well, I think there is going to be progress. I think it's going to be different progress.

You know, one thing that happens is when you go to any next generation, you always run into problems that you didn't see at the previous generation. And so, you know, even after the data centers are out there, it takes time for people to work through the problems and

finish the models. So the reinforcement learning process that OpenAI used to train O1, what it does is it creates this longer, coherent chain of thought that effectively is like packing more compute into the answer. So, you know, if you have a model that takes a few seconds to generate an answer versus a model that takes, let's say, a few hours to generate an answer, that's 10,000 times more compute.

If you can actually leverage it, right? And so, honestly, we've been thinking about how to use test-time compute since, I don't know, let's say 2020. And, you know, finally, I think this is actually the real answer for how to do this, how to do it without wasting a lot of compute. And what's great about this is it doesn't require a new data center.
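As a rough sense of scale for the "packing compute into the answer" point, here is a toy calculation; the specific times are assumptions for illustration, not measurements of O1 or any other model.

```python
# Rough back-of-the-envelope for the test-time-compute point above.
a_few_seconds = 2.0               # a model that answers in a couple of seconds
a_few_hours = 6 * 60 * 60.0       # a model that "thinks" for a few hours

print(f"compute packed into the answer: ~{a_few_hours / a_few_seconds:,.0f}x more")
# Roughly 10,000x more compute per answer, and it's spent at inference time,
# so it doesn't have to wait on a new training data center.
```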

So here there's a lot of room for, you know, this is a new technique that's just started. There's a lot of room for algorithmic improvements. And there's no reason in theory why the same sort of fundamental principles and ideas that were used to get O1 to go from a few seconds, which is what GPT-4o can do, to O1 spending...

30 seconds or a minute or a few minutes to think. You can extend those same techniques to go to hours or even days in theory. Now, just like with going from GPT-3 to GPT-4, there were no foundational new techniques. Both were trained roughly the same way, but scaling is very hard. And so that's actually the meat and potatoes of it: can you actually do that scaling?

And I think that's the kind of progress that we're going to see that's going to be most exciting in 2025. With the focus on test-time compute and O1 being out there in the wild now, I think it's really interesting to think about how folks are actually going to use these models, right? And I think you had a recent tweet that I thought was really interesting about how you're going to kind of need these new form factors to unlock the capabilities of some of these models. And so maybe just expand on that a little bit. Like, have you seen any early form factors that you thought were interesting working with these models? Yeah.

Well, yeah, and just to sort of explain the problem, chatbots have been out for a while, and most of the interactions that people have today with chatbots, GPT-4 class models do a great job with those. If you're asking ChatGPT who the fourth Roman emperor was, or how to reheat basmati rice, or most of the bread and butter conversations that we have, you don't have any pain. It works just fine. And when we were thinking about releasing O1 Preview,

There were a lot of questions about, will people use this? Will they find things to do with it? And I think those were kind of the right questions. It's, what is it that I need to do with this model where I can really get the value out of it? Programming is a great use case for this because with programming, you have a structured problem where you're trying to make progress on something, again, over a long period of time.

And in a way that is highly leveraged for reasoning. Another example is if you're writing a policy brief, right? You're writing a long document. It needs to make sense. It needs to all tie together. You know, the truth is, there are a lot of programmers out there, but most people aren't programmers and don't have something like this that they need to do every day. But going back to the underlying breakthrough here,

It's that you have a coherent chain of thought, a coherent way of making progress on a problem. And that doesn't just have to be thinking about the problem. That can also be taking action, making a plan of action. And so the thing that I'm most excited about with models like O1, and I'm sure there will be others from other labs soon, is using them to really enable long-term thinking.

basically agents, although I think the word is so overused that it doesn't really tell you much about what you're trying to do. But I have all sorts of tasks in my life where I would like the model to

You know, book things for me, shop for me, solve problems for me in ways that involve it interacting with the rest of the world. And so I think that's the form factor that we really need to nail is what is this? How do we do this? I don't think anybody's figured it out yet. It's so interesting. I mean, it makes total sense. I feel like everyone, you know, minds run wild with what, you know, these agents can do and all the problems they can solve for people, for businesses, etc.

What are the big gaps to that happening today? Obviously, you've seen some of the early models like Anthropic released computer use, and I'm sure other labs are working on this as well. But as you think about what's stopping us from being there today, what are the hard problems still to solve? Yeah, there's a whole host of them. I think the most immediate one is reliability. So if I'm asking something, forget actions for a second, right? If I'm asking an agent to do something on my behalf, even if it's just thinking or writing some code for me,

And I need to go away for five minutes or an hour and let it work. If it goes off task and makes a mistake and I come back and it hasn't done anything, I've just wasted an hour. That's a big deal. Now add in the fact that

This agent is now going to be taking actions in the world. Maybe it's buying something for me. Maybe it's pushing a PR. Maybe it's sending a note, an email, a Slack message on my behalf. Well, if it does a bad job at those, well, now there's a consequence. I'm going to be at least embarrassed. Maybe I'm going to be out some money. And so reliability just becomes a much, much more important thing than in the past.

I think there's a rule of thumb when you think about reliability that to go from 90% to 99% reliability, that's maybe an order of magnitude increase in compute. That's one of these 10x generations. And to go from 99% to 99.9% reliability, that's another order of magnitude. So every nine requires a huge leap in model performance. That 10x, that's a really big improvement. That's a year or two years worth of work.
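A quick sketch of that rule of thumb; treat it as an illustration of the stated heuristic, not a measured scaling law.

```python
# Sketch of the rule of thumb above: each additional "nine" of reliability
# costs roughly another 10x of compute.
import math

def nines(reliability: float) -> float:
    """Count the nines in a reliability figure, e.g. 0.99 -> 2, 0.999 -> 3."""
    return -math.log10(1.0 - reliability)

base = 0.90
for target in (0.99, 0.999, 0.9999):
    extra_nines = nines(target) - nines(base)
    print(f"{base:.0%} -> {target:.2%}: ~{10 ** extra_nines:,.0f}x more compute")
```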

So I think that's really the first problem that's going to be faced. The second thing that I think is interesting is so far everything we've talked about is for consumer, right? You're not embedded in enterprise. But when you're talking about something that's working on a task, for a lot of us that will be something we do in our jobs, something that's embedded in an enterprise. And so that I think brings in a whole host of other considerations. Yeah.

It's an interesting point. We're seeing in enterprises today that a lot of consulting firms are actually doing really well because there's a lot of hand-holding to deploy to enterprises at this point. Do you think that hand-holding and the need for help on the enterprise is going to persist for a while, or do you think it's going to get more turnkey and enterprises will just be able to deploy these LLMs really easily in the future? Yeah, I think that's a really interesting question.

And, I mean, even start with: what is the problem of deploying an LLM in the enterprise, right? Well, probably the thing that something needs, if it's going to automate a task for you or do your job, is context. Because in consumer, there's not a lot of context. Okay, you like the color red. Fine. Not very interesting. Appreciate you choosing red for your example. Yeah. Yeah.

But in enterprise, well, who are your coworkers? What project are you working on? What is your code base?

What have people tried? What do people like and dislike? And all that information is out there sort of ambiently in the enterprise. It's in your Slack. It's in your docs. Maybe it's in your Figma or whatever. So how do you get access to that? Well, you have to sort of either build something one-off. I think there's definitely an approach where

People build a library of these connectors, and you're able to come in and do this. This is very similar to the work we did at Palantir, where the fundamental problem that Palantir was solving was to integrate the data in an enterprise. I think that's one reason why something like Palantir's AI platform, AIP, is so interesting.

So I think that that's the first path, where you're sort of building a library of these things, whole platforms that can be built off of this. The other path is the opportunity to do something like computer use. So, you know, now instead of having this sort of very specific and potentially bespoke set of ways to do it, you have, you know, one hammer that can go hit everything.

And so Anthropic has rolled this out. It's really funny, we were actually talking about these kinds of computer use agents before the Anthropic folks left OpenAI way back in 2020. And Google DeepMind has published papers on this. Every lab has thought about this and worked on this problem. Yeah.

The thing that is different about a computer use agent versus one of these programmatic API integrations is that now you're controlling a mouse and a keyboard.

The actions you're taking now involve a lot more steps. Maybe you need 10x or 100x the number of tokens that you would need if you were working with one of these programmatic integrations. Now, again, what are we back to? You need a model that has a very long, coherent chain of thought, the ability to make progress on a problem consistently over a long period of time. The exact kind of problem that O1 has solved. I'm sure there are other ways that you can solve this.

But I think that's going to be one of the breakthroughs that I think we're going to see used over the next year. How do you think it ends up playing out? Because I imagine on the one hand, obviously a general purpose model that could use computer in any context seems quite compelling. I imagine getting to 99.999% reliability might prove difficult. And also, there's so many steps that could go off at different points.

You know, one other vision of how this could work is, I'm sure some of these problems could be simplified if the underlying application APIs were opened up in some way, right? Or you could have specific models for using Salesforce or, I don't know, some of these specific tools. You know, do integrations end up being a big advantage if you have access to the under-the-hood experience, so you can just do things in a quick split second, versus sitting there while you watch a computer do things on the screen?

Yeah, well, I mean, I think you definitely see probably a mix of these approaches, where some things use these integrations and, for others, computer use becomes sort of this backup that you can employ if you don't have something bespoke. And then maybe you see which things people use and you come up with

the more detailed integration if it works. I think to the question of, will you see a Salesforce specific computer use agent? That doesn't make a ton of sense to me technically, because I think fundamentally what you're leveraging there is the data. Someone went out and collected a huge data set of how to use Salesforce.

And, you know, you could throw that in. It would be to Salesforce's advantage to share that data set with Anthropic and with OpenAI and with Google, and, you know, they train their own models. And, you know, I think every application provider would want that to be public and part of every foundation model.

So I don't think this, you know, to me, this doesn't seem like a reason to have sort of specialized models in that way. No, it's a really compelling point because I guess also at the point where you're in a competitive space and your competitors are making their data available and their products become easier to use, you certainly want yours to be as well. Yeah, it's a bit of a mystery to me why, you know, that kind of ecosystem hasn't happened where people are, you know, shoving their data into the LLMs. You know, it's the equivalent of Google SEO. Yeah, yeah. Effectively.

That's a really interesting point. How far away from widespread computer use do you think we are? Well, I mean, I think there's a good rule of thumb for these things, which is you see a demo and it's super compelling and doesn't quite work yet. It's too painful to actually use.

And then, you know, give it a year and it's 10x better. And again, scaling is log linear. So 10x better, you know, it's one increment of improvement. But one increment of improvement is a pretty big deal. And you start seeing it being used for limited use cases. And then give it a second year. And at that point, you know, it's surprisingly effective. But you can't rely on it every single time, right? Where we are now with chatbots, where you still have to worry that they've hallucinated. Right.

Then the question of adoption is really about what level of reliability do you require? And anything where you can tolerate mistakes –

will get automated a lot faster than places that you can't. So I guess to go back to Jordan's original question, it makes total sense, you know, basically that today you need a lot of kind of hand-holding to integrate into the right data and to kind of, you know, define bespoke guardrails and workflow. What will the layer in between, like, hey, great computer use model, an enterprise ready to sign up, like, what's the layer in between look like? Man, I think there should be startups out there defining that. You know, I think we don't quite know the answer to that yet,

I think one of the interesting things that you see on this is when you have these general tools like something like computer use, the problems it solves are sort of at fractal levels of difficulty where it can solve a lot of problems, but then you will see a problem that really matters and you can't quite solve it. And then you'll say, okay, well, now we're going to do a very specific – maybe we'll make a programmatic approach.

for something like this. So I think we're going to see a mix of approaches for a while. I'm curious, you've obviously been on the research side and responsible for really cutting-edge research. We talked a little bit about test time compute. Are there other areas that you're particularly excited about? Well, I think we've talked about pre-training, we've talked about test time compute. The other really exciting thing is multimodal.

Big day for multimodal. Big day, yeah. Today is the Sora release. And really, this is, in some ways, the culmination of a really long line of work. LLMs were invented, let's say, 2018. And it became sort of

obvious that you could apply transformers and some of the same techniques to adapt to other modalities. So you include vision, you have images out, audio in and audio out. At first, these things start off as sort of side models, like a DALL-E or a Whisper, and ultimately they get integrated into the main model. And the modality that has resisted this for so long is video.

And Sora, I think, was sort of the first to demo this. Other companies, Runway, some other models, have come out in between. And now Sora itself has been released. I think there's really two things that are interesting and different about video compared to the other modalities.

When you are creating an image, really, you probably want to just have a prompt. It creates an image. Maybe you try this a few times. If you are a professional graphic designer, you might edit some of the details in this image. But let's be honest, none of us are. A lot of the usage here is, did you need some slideware? Did you want a picture to go along with your tweet or your presentation? But it's a very straightforward process. But with video...

Wow. I mean, it's an extended sequence of events. It is not one prompt. And so now you actually need a whole user interface. You need to think about how to make this a story that unfolds over time. And so that's, I think, one of the things that we see with the Sora release. And I think Sora has spent a little more time thinking about this. The product team has spent more time thinking about this than some of the other platforms out there.

And the other thing that you always need to think about with video is just that it is very expensive. It's very expensive to train these models. It's very expensive to run these models. And so I think as interesting as

seeing, you know, Sora-quality videos is, Sora, I think, is better quality, but you have to pay attention a little bit to see that it's better quality, at least if you're looking at a short snapshot. But now Sora usage is available to anyone with a Plus account. OpenAI released the Pro account for $200 a month, and that includes unlimited slow generations of Sora. And so when you have this level of quality and this level of distribution,

that is actually, you know, two hard problems that have been solved there.

And I think that's going to be a really high bar for other competitors to face. What does the progress of video models look like over the next few years? I mean, obviously in the LLM world, we've seen just massive gains; it feels like whatever the model was last year is now 10 times cheaper and much faster. Do you expect a similar level of improvement on the video side? Actually, I think the analogies are very direct. So if I think about the differences between video models today and video models in two years...

The first thing is the quality will be better. Now, the instantaneous quality is already really good. You can see the reflections. If you share things, all the hard things that are hard to solve, you can point out, oh, look, it did a reflection there. There's some smoke. What is hard is extended, coherent generations. Yeah.

So the Sora product team has a storyboard capability that allows you to put checkpoints every five seconds or every ten seconds to act as guideposts for the generation. Fundamentally, if you want to go from a few seconds of video to an hour of video, that is a very hard problem. And that's, I think, something you'll see in next generations of models.

The other side, again, the other analogy is I actually think it will be very much like LLMs, where if you want a GPT-3 quality token, that's 100 times cheaper than it was when GPT-3 came out. And the same thing is going to be true for Sora, where you're going to be able to see these very...

beautiful, realistic looking videos and they're going to cost practically nothing. I feel like the dream is like a full length AI generated movie that, you know, wins some award or something, you know, to ask a shameless podcast question, like when do you think we'll have that? What year? If you had to make a guess? Oh man, yeah.

Honestly, winning an award is somehow too low of a bar, right? There are a lot of award shows, I guess. Really, is this a movie that you actually want to watch? Yeah. I think we will see this in two years, but it will actually be less impressive than I just said because the reason you wanted to watch it is not because of the video, but because there was a director who had a creative vision and used the video model to

to exercise their creative vision. And the reason they would do this, I think, is they could do something in the medium that they couldn't film.

None of us are directors here, but we can all imagine lots of possibilities there. Not graphic designers, not directors. That's right. We have some very specific skills here. Yeah, we're seeing a bunch of companies pop up trying to be like the Pixar for AI. And it's always a question that we ask is, when is that actually feasible? So it sounds like a lot sooner than at least we would have thought. That's my guess. That's my guess. Once things get to the stage where they can be demoed, progress is very fast after that. And before that, progress...

progress is very slow, or at least it's invisible. I guess switching gears from video to robotics, you joined OpenAI to work on a lot of robotics stuff at the beginning. We'd love to understand your perspective on the space and where we are today and where you think it's going. This is a very personal question. When I left Palantir, one of my ideas was that robotics was going to be the domain in which deep learning

became real, became more than a button on somebody's website. And so I spent a year between Palantir and OpenAI really deeply understanding robotics, writing some of the first code I did with deep learning around vision. And it was a very challenging space. And at the time, I thought, well, maybe it would be five years away. This was 2015.

And that was very wrong. But I think it's correct now. I think robotics will see, you know, widespread, if somewhat limited, adoption five years from now. And so I think this is a very good time to be starting a robotics company. And I think, yes, you know, for the fairly obvious point that foundation models are –

just a huge breakthrough in your ability to get robotics up and running quickly and for it to generalize in important ways. And there's a few different aspects to that. The sort of obvious one is the ability to use, um,

and to translate that into plans of action. A lot of that comes for free with the foundation models. The slightly less obvious, maybe more fun thing is just the fact that the whole ecosystem has developed. Now that I've left OpenAI, I've spent some time hanging out with founders, and I've talked to a couple of robotics founders. And one robotics founder was telling me about how they have literally set it up so you can talk to the robot.

And that it's really cool and much easier. You can tell the robot what to do. It sort of has the gist. It uses some specialized models to go actually do the operations. But it was just a pain and a hassle to have to type out what you wanted. And you have to sit there in front of the computer instead of watching the robot. Now you can just talk to it. I think one of the big differentiators that we still...

don't know where it will go is this question of, do you learn in simulation or do you learn in the real world? And, you know, our major contribution in those two years of robotics was showing that you can train in the simulator and have it generalize to the real world. But,

And there's a lot of reasons to use a simulator if you can. You know, for all the same reasons you would do that if you're programming, right? It's just, it's a pain to run against production systems or against the real world. You get free testing, all that stuff. But simulators are good at simulating rigid bodies. You know, if you're doing pick and place with things that are hard, right?

like hard bodies, that's great. But a lot of the world is floppy. You have to deal with cloth or, you know, thinking about warehouses, you have to deal with cardboard. And your simulators just do not do a particularly good job of dealing with that.

And so for anything that wants to be really general, the only approach we have now is to use real-world demonstrations. But I think as you can see from some of the work that's been coming out recently, that

that can actually work really well. And I guess, obviously, it's somewhat unknowable, you know, when one finds scaling laws in robotics, and then also the extent of teleoperated data one might need. But does it feel like we're close to that? Or, I mean, obviously, I guess in 2015, you thought you were five years away. How far away do you think we are from, I would cheaply call it a ChatGPT-like moment for robotics, where people are like, oh, that's something visceral that seems like it's different and works? Any kind of prediction, especially about robotics, you really have to think about the domain.

So I am pretty bearish on mass consumer adoption of robotics, because having a robot in your home is scary. Robot arms are deadly. They can kill you, or more to the point, they can kill your kids.

And, you know, you can use different kinds of robot arms that don't have these drawbacks, but then they have other drawbacks. The home is a very unconstrained place. But I do think, you know, being in various forms of retail or other forms of work environments...

I think in five years we will be seeing that. And you can even see this if you go in an Amazon warehouse. They already have robots that have solved mobility for them. They're working on pick and place. I think you will see lots of robots rolled out in warehouse settings. And then it's going to sort of be domain by domain for a while.

I won't offer a prediction on when it's in the home, but I think you're going to see it widespread. I think we'll interact with them in our daily lives in five years in a way that would feel strange today. I mean, obviously there's been these separate robotics companies. To some extent, obviously robotics leverages foundation models, the advances in LLMs.

I'm curious, like the, you know, does this all kind of converge? You know, obviously there's companies that just do video models. There's companies that are focused on bio, on material sciences. Like, as you think about where this goes long term, you know, is there one massive model that just knows all of this? At the frontier model scale,

I think you should continue to expect the companies are going to put out one model. It's going to do the best of everything on every axis for every form of data they have access to. So that's an important caveat.

What specialization really gets you is price performance. So over the last year, you've seen the Frontier Labs get much better at having very small models that have lots of intelligence, that can do chatbot-like use cases very cheaply. And if you're a company at this point...

A very common pattern is that you figure out what you want the AI to do for you, and you run that against the best frontier model that you like, and you generate a huge database, and then you fine-tune some much smaller model to do that. This is such a common... OpenAI offers this as a service. I'm sure this is a common pattern on every platform.
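For concreteness, here is a minimal sketch of that generate-then-fine-tune pattern. The `call_frontier_model` function is a hypothetical stand-in for whichever frontier API you use, and the JSONL layout follows the common chat-format fine-tuning convention; treat the details as assumptions rather than any specific vendor's interface.

```python
# Minimal sketch of the "generate with a frontier model, fine-tune a small model"
# pattern described above. `call_frontier_model` is a hypothetical stub.
import json

def call_frontier_model(prompt: str) -> str:
    """Hypothetical stub: in practice this would call your chosen frontier model."""
    return f"[frontier-model answer to: {prompt}]"

task_prompts = [
    "Summarize this support ticket in two sentences: ...",
    "Classify the sentiment of this review: ...",
]

# 1. Build a training set by running the task against the frontier model.
with open("distillation_train.jsonl", "w") as f:
    for prompt in task_prompts:
        example = {
            "messages": [
                {"role": "user", "content": prompt},
                {"role": "assistant", "content": call_frontier_model(prompt)},
            ]
        }
        f.write(json.dumps(example) + "\n")

# 2. Hand distillation_train.jsonl to a fine-tuning job for a much smaller model.
#    The result is far cheaper to run, at the cost of being more brittle when a
#    user goes "off script" relative to the distilled task.
```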

And this is just tremendously, tremendously cheaper. Now, if you've trained a chatbot like this, your customer service chatbot is trained like this, and someone goes off script, it's not going to be as good as it would be if you actually used the frontier model natively.

But that's fine. That's the price performance that people are willing to take. One thing I thought was really interesting is when we were talking before, you kind of mentioned this macro point about AI progress, about basically that in 2018, we'd said, hey, we're going to be in 2024 and we're going to have

all these model capabilities that you would have kind of thought, from first principles, like, okay, these things have just completely changed. Like the world is almost unrecognizable relative to what it was in 2018. And, you know, while certainly you guys have had a massive impact on the broader world, I wouldn't say, you know, AI adoption has yet completely changed the entire way the world works. Like, why don't you think that's been the case? Yeah, I mean,

Just to restate this a bit, I think, weird as it sounds, the right mental frame to have about AI is deeply pessimistic. Why is progress so slow? Why... We're talking about a...

Some people say that AI has led to a 0.1% increase in GDP growth, but that's not because of productivity from using AI. That's from capital expenditures of building out all the data centers to train the AI. So why is the AI not visible in the productivity statistics, right? Just like people said about the internet in the 1990s.

And I think there's a few reasons for this. And for starters, that view from 2018 that, well, once you can talk to it and it can write code, then everybody's going to be automated right away. It's the same kind of view that you have when you're an engineer and someone asks you to write a feature and you're like, oh yeah, I can totally knock this out in a couple weeks.

And then you start writing the code and you're like, oh, well, actually there's a lot more to this feature than I realized. And if it's a good engineer, offhand they think two weeks, then they sort of plan it out and estimate it, and it takes two months. And if it's a bad engineer, well, it's something that they couldn't write at all. And I think that's what happened as we really dug into what it is like for a human to do a job. Well,

Yes, you may talk to them on the phone, but that doesn't mean that the thing they were doing was talking to you on the phone. There was real work that they were doing. Fundamentally, the thing that AI can automate is a task, and a job is composed of many tasks.

And when you dig into real jobs, you find that most of them, for most jobs, there are some tasks that cannot be automated. Even if you look at programming, the boilerplate gets optimized first. And maybe the last thing is this sort of, well, what is it I'm even trying to do? Like the giving direction part. And so I think as we continue to

roll out AI. We're going to find more and more of that. So I guess with that in mind, in terms of progress, what's an area that you think is underexplored today that should be getting more attention than it is? Okay, one answer here. The kind of startup that I'm really excited about are startups where people are taking AI and they're applying it to something really boring. So imagine that you were running a company and

And you could just hire all the smart people you wanted to do something super boring. Like look at all the places where you're spending money and make sure that you're comparison shopping appropriately. Like, you know, if your procurement org was full of Elon Musk quality people who are really being careful about their spend, you know, you could probably save a lot of money. And no one does that because, you know, the kind of people that it would take to really save money,

They would get bored. They would hate this job. But AI is infinitely patient. It doesn't have to be infinitely smart. And I think anywhere where if you're running your business, you could get value out of people who are infinitely patient doing something, that's something that AI should be automating. It's interesting. I always thought...

of consultants as this arbitrage of getting smart people to work on kind of boring problems or in boring industries. And obviously with top cutting-edge AI models, it's like you can get someone with a brilliant IQ to work on a problem you'd never be able to get a smart person to work on. It's a really interesting point. Yeah. Well, I mean, the first time I heard that people had done productivity studies where they show that, you know, AI really helps, like 20% to 50% improvements,

I was like, wow, that's amazing. I'm like, oh, it's consultants. Well, you know, AI is really good at bullshit. And consultants are, you know, their job is to produce bullshit. So maybe we shouldn't be surprised that that's where the productivity gains are showing up first. Yeah, I think it was also the biggest in the bottom half of performers, right? That's right. Well, I mean, actually, I think that's kind of hopeful.

Because if you look at the bottom half of performers, the hopeful version of this is they had the skills that humans have that are hard to automate. They knew what they were trying to do, but they couldn't figure out how to write the code to get there. And then the model comes along and it's like, oh, I know how to write the code, how to get there. I don't know how to figure out what it is that I should be doing.

And so, you know, now these bottom performers can actually really improve in their jobs. So I see that as very hopeful. I guess in terms of performers, I mean, you are and have worked with some of the best researchers in the world. What do you think makes an AI researcher the best? There are many different kinds of researchers that do different things. So,

If you think about someone like Alec Radford, who invented the GPT series, but also CLIP. I mean, he basically invented LLMs and then went on a tear through all forms of multimodal. Alec is someone who does his best work with a computer alone at odd hours of the night. By comparison, other brilliant people like Ilya Sutskever or Jakub Pachocki, who are the first and second chief scientists at OpenAI,

You know, those guys have sort of big ideas and big visions and work a lot through other people, helping set out a whole roadmap for the company. But the thing that I think is really common for the very best scientists is that they have a certain amount of grit. I will always remember watching Aditya Ramesh, who invented DALL-E, just work through this problem.

Where, at the time, the original idea for DALL-E was to see if we could generate a picture that was not in the training set, to prove that neural networks were creative, that they weren't just sort of memorizing and rehashing. And so he wanted to generate a picture of a pink panda skating on ice, which he was certain was not in the training set. Yeah.

And he worked on this for 18 months, maybe two years. And I remember a year in, Ilya came by and brought me a picture. And he's like, look, this is the latest generation. It's really beginning to work. And it was just this blur. And Ilya's like, you can see up here, there's pink up here, and it's white down below. You can really see the pixels starting to come together. And I couldn't really see anything. But

But he just kept working at it. Every researcher that really succeeds at one of these foundational problems, they have to see this as their hill to die on. And they are just going to go after it and keep working on it

you know, for years if necessary in order to make it work. What have you learned about putting together like a research organization with a bunch of folks like that? Well, you know, the funny thing is the best analogy I have actually comes from Alex Karp at Palantir, which is, you know, he always used to say that engineers were artists. And it makes a lot of sense. You know, when you talk to a really good engineer...

They just want to create. They have something in their heart. And code is their way of bringing that sculpture to life.

And, you know, at Palantir it was, well, you know, you have to make them fix bugs, but every time you do, you know, the artist part of them is going to be sad. You have to have a process in order to make people, you know, work together, but the artist part of them will be sad. And the truth is that, you know, an engineer is an artist and a 10x engineer is 10x an artist and a researcher is 100x the artist of any engineer. And so, you know, to build an organization

with researchers. There's so much more. You cannot, you know, there's a version of engineering management where you say it would be great if everybody were interchangeable parts and you had a process and then you can sort of make them work together. And working with researchers, it's very high touch because you're

The most critical thing is you cannot snuff out the artistry, because it's caring about the vision they have in their head that makes them willing to

go through all of the pain that it takes to actually produce that vision and make it a reality. You're fortunate to have worked at both Palantir and OpenAI, and there's been a lot of articles going around about how Palantir's culture was really special. When you think about OpenAI, I'm sure there'll be many articles in the future about its culture. What do you think those pieces will say? Yeah, I mean, I think one piece of it is working with researchers like we talked about. The other thing that is just crazy about OpenAI is

is how many times it has pivoted, or even, I like to think of it as being refounded. So when I joined OpenAI, it was a nonprofit, and the vision of the company was that we would build AGI by writing papers. And we knew that that was wrong. It didn't quite smell right. You know, a lot of the early people, Sam, Greg, you know, me, were startup people. And, you know, this path to AGI felt wrong.

And so after a couple years, there was a transition from a nonprofit to a for-profit. And that was very controversial within the company, in part because we knew at some point we'd have to interact with products. We'd have to think about how to make money.

The partnership with Microsoft became another sort of refounding moment, another very controversial thing. I mean, maybe it's one thing to make money, but then to give it to Microsoft, I mean, to big tech. I mean, wow, that's terrible. And then, again, of equal import was the decision to say, well, not only are we partnering with Microsoft, but we're going to build our own products with the API platform.

And then finally, the move to add consumer to enterprise with ChatGPT. Any one of these pivots would be definitive for a startup. And at OpenAI, it was every 18 months, every two years, we were fundamentally changing the purpose of the company, the identity of the people who worked there. And again, we went from

writing papers being your job to building one model that everyone in the world can use. And the really crazy thing is, I think this is what we actually, if you'd asked us back in 2017,

you know, what is the right mission? It wouldn't have been getting to AGI by writing papers. It would have been, we want to build one model that everyone in the world can use. But we didn't know how to get there. And so we just had to explore and find all these things out along the way. What do you think enabled you guys to be so successful in making these big shifts? Well, I mean, necessity, for one.

None of these were sort of freely chosen, right? You know, you have a nonprofit, you run out of money, maybe you need to find a way to raise money. Maybe to raise money, you have to be a for-profit. You know, your partnership with Microsoft, maybe they aren't seeing the value in the models you're creating. So you need to build an API because that might actually work. And then you could show them that people actually want these models. Yeah.

ChatGPT, this is one I think we actually really did believe after GPT-3: that with the right sets of advances, the right form factor was not just an API where people had to go through an intermediary to talk to the model, but that the model would be something you could just converse with directly.

And so that was one that was, I think, very deliberate. But, you know, somewhat famously, the way that it happened was an accident. You know, we were working on this. We'd actually already trained GPT-4. And the...

we had wanted to release when the model was good enough that we all used it every day. And we all looked at ChatGPT, you know, in November, and we were like, oh, did it pass the bar? Not really. And John Schulman, one of the co-founders who led this team, said, look, I really just want to release it. I want to get some outside experience. I remember thinking if a thousand people used it, that would be success. You know, we had a pretty low bar for success here.

And we made the fateful decision of not putting it behind a waitlist. And then, again, the world forced our hand,

and suddenly everybody in the world wanted to use it. What were those first days like when you released it? Oh my gosh, they were very intense. At first, there was a certain amount of disbelief that this was actually going to happen. There was some anxiety. We were quickly trying to figure out how do we get the GPUs. So we temporarily repurposed a bunch of research compute over there. And then there was this question of when is it going to stop?

is this going to keep going or is it going to be a fad? Because, you know, we had almost gone through something similar with DALL-E. The DALL-E 2 model was a big internet sensation and then it died back. And so there was this concern that, in fact, ChatGPT would disappear. And this was one place where I, you know,

pretty strongly believed that it wouldn't, that this was actually going to be bigger than the API. And in retrospect, I was right. I mean, what a fascinating experience. I guess, you know, one of the cool things is you're just so close to this cutting-edge AI research. You know, I'm curious, like, what's one thing you've changed your mind on in the AI world in the last year? You know, the really funny thing is, I don't think I have changed my mind. After GPT-3,

and, you know, into 2020, 2021, if you were on the inside, a lot of the things that needed to happen over the next four or five years felt kind of obvious. You know, we were going to have these models. We were going to make the models bigger. They were going to become multimodal. Even in 2021, we were talking about how

we needed to use RL on the language models and try to figure out how to make that work. And there's a real truth that the difference between 2021 and 2024 is not, you know,

what needed to happen, but just the fact that we were able to make it happen at all. And, you know, the whole field was able to make this happen. But in some sense, where we are now also feels a little bit predestined. I guess looking forward, as you think about scaling pre-training and scaling test-time compute, like

Does it also feel like you're predestined to reach AGI on those two alone? How do you think about that? I have a hard time conceptualizing what AGI is. If anything, I think I have a deep critique of AGI, that there is no one moment that, in fact, these problems are fractal and we're going to see more and more things automated.

But somehow, I don't know, I have this belief that it's going to feel very banal, that somehow we're all going to be taking our self-driving cars to our offices where we boss around armies of AIs and we're going to be like, oh, this is kind of boring. Somehow this still feels like office space and my boss is still an idiot. And that's what our AGI future will look like. And we won't be able to wait until 5 p.m. rolls around or something. But on a more serious note,

I have felt for a while, and again, I think this was a common view within OpenAI and probably within the other frontier labs, that solving reasoning was the last fundamental challenge that was needed in order to scale to human-level intelligence. That you need to solve pre-training, you need to solve multimodal, you need to solve reasoning.

And at this point, the remaining challenge is to scale. But that is a big deal. Scaling is very hard. There's actually very few foundational ideas at all. Almost all the work is in actually being able to scale them to accept larger and larger amounts of compute. And that's a systems problem. It's a hardware problem. It's an optimization problem. It's a data problem.

it's a pre-training problem. All the problems really are just about scaling. And so, yeah, I think in some ways at this point, it is predestined. And

You know, the work here is to scale it, but that is hard. It's a lot of work. Obviously, I think people talk about the societal impacts of, you know, these models scaling their capabilities. And, you know, I think we're still early in that discourse and there's probably lots of different conversations to be had there. But any aspects of that that you, you know, are particularly, you know, interested or passionate about or things you think we should be talking about more? Yeah. The thing that I think is most interesting is that we're moving from a world where intelligence is

is probably the critical scarcity in society to a world where intelligence will be ubiquitous and free. And so what then is the scarce factor of production? And, you know, I think we don't know. My guess is agency, right?

That, you know, you can just go do things. You know, what are the right questions to ask? What are the right projects to pursue? I think these kinds of problems will be very hard for AI to solve for us. I think those are going to be really core things that humans need to figure out. And, you know, not every human is good at this.

And so, you know, that's the thing that I think we all need to think about: how do we develop that kind of agency that allows us to work with AI? Do you think that's now or how far in the future? I think it's going to feel very continuous. It's an exponential curve.

And the thing about exponential curves is that they're memoryless. It always feels like you're always moving at the same speed, at the same pace. Won't the models eventually also figure that out? I mean, if you think about figuring out what to do, or the project goals you kind of referenced a few times, you could imagine at the most fundamental level, in the future, we say to a model, hey, build a good company or create an interesting work of art or create a movie or whatever like that.

Does that agency, I guess, maybe talk a little bit about that as these models get more powerful? Yeah. I mean, I think the – can you just ask the AI to figure it all out?

Well, I think you can, and you get something out the other end. But let's use Sora as an example here. So, you know, if you're creating a video and you give it a very vague prompt, it will totally create a video for you. Maybe it will be a very cool video. Maybe it would be better than the coolest video that you could have come up with, but it might not be the video you wanted.

And so you can also interact with it where you give it a very detailed prompt and you say, I'm making these specific choices about the video that I want to see. And that I'm able to make it the video that I wanted that satisfies me or perhaps whoever my audience is. And I think that kind of tension will remain no matter how advanced the AI is.

because how you color in the blank space will determine a lot about what the final product is. How do you use the most advanced O1 model today? My go-to for understanding models and playing around with them is that I spend a lot of time with my eight-year-old son teaching him how to code.

And he loves to ask questions. And so I'm always trying to figure out how to connect the thing that he is interested in today with the lesson that I want to teach him. And so as an example, the other day he said, "Dad, what is a web scraper? How does that even work?" And that gave me the opportunity to say, "Okay, well, can I teach him how networking works with a short program that scrapes the web?"

And so I turned to O1 and, you know, played around with it, and had it try to figure out how to create, you know, a program that was short enough that it didn't introduce too many new concepts that I hadn't already taught him, but did teach him about networking, which was the core concept that I wanted to teach him, and that was accessible for an eight-year-old.
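For flavor, here is a minimal, hypothetical sketch of what such a short, teaching-oriented scraper might look like; the URL and details are illustrative assumptions, not the program O1 actually produced.

```python
# A toy example of the kind of short web scraper described above: just enough
# to show how networking works (open a connection, send a request, read a
# response). The URL is illustrative.
from urllib.request import urlopen

url = "https://example.com"                  # a page to fetch
with urlopen(url) as response:               # opens a connection and sends an HTTP GET
    html = response.read().decode("utf-8")   # the server's reply, as text

# Count how many links the page contains: a simple, visible result for a kid.
print(f"Fetched {len(html)} characters from {url}")
print("Number of links on the page:", html.count("<a "))
```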

And that, I think, is just a really... And it was able to do that, and it took some playing around with it. But playing around with it is an important part of testing it. I guess on the testing side, when you think from the research testing perspective, what are sort of the core evals that you tend to have when a new model comes out, and what do you sort of rely on from that perspective the most? Yeah, well, I mean,

The first point to make here is that it changes with every generation. As we were developing O1, the right metric to look at was GPQA, so Google-proof question answering. By the time we were ready to release, it's not a very interesting metric anymore because we'd gone from basically – when we started, you could almost do nothing –

to, you know, it being completely saturated. And the last few remaining questions were maybe ill-posed or, you know, not very fun to answer. And so, you know, the metric like very much depends on the work you're trying to do in the research. And I think this is a very common place. The thing that has been consistently useful though for the last few years has been programming.

Because programming, it's a structured task. It's something many people understand. It's something I and other researchers understand, which is very important. And, you know, it scales from finish completing this line of code to write an entire website. And, you know, we are not yet at a point where programming is a solved problem by any means. And I think we have a long way to go. I think there's several orders of magnitude left before, you know,

"do the work of a real software engineer" is actually solved. I mean, you know, one thing from your earlier career is obviously, you know, you were getting a PhD in computer science, and I think at least a partial focus on game theory. And obviously, I think there's lots of interesting implications for, you know, using these models to explore subjects in game theory.

I guess just generally, how do you think AI will change social sciences research going forward, policymaking, some of these areas? I guess if you were revisiting some of that work today with the power of some of these models, anything you'd be trying? Yeah. Well, as a first point, I'm actually pretty down on academia. I think it has a terrible set of incentives. In some ways, I designed the organization at OpenAI as the mirror image of academia. It's a place where you can collaborate well.

But a lot of the – one thing that I think is interesting is that in business, a lot of what product management is, is actually like an experimental social science. You have an idea. You want to test it out on humans. You want to see how it works. You want to use good methods while you do that. And something like A-B testing, it's actually – you're really doing a form of social science when you do that. And this is one of the things that I'm really excited about is –

if you're doing A-B testing, why not take all the interactions you have with your users right now, fine-tune a model on that, and suddenly you have a fake user that reacts in distribution to how your users do. And now you can do an A-B test without going to production. And maybe after that, you can have a deep interview with one of them where you ask what they thought. Does that work today? I don't know. I haven't tried it, but it might work tomorrow.

And just, you know, this I think is a good general principle is anywhere you find yourself wanting to ask a human to do something for you,

Can you ask an AI to do that instead? And can it do, you know, a hundred of that, where maybe you could only possibly have made a human do it once, painfully? Yeah. I ask Jacob to do a lot of tasks for me. So yeah, you should stop. You should start asking my model. Thank you for shipping that. You're saving me a lot of time. You mentioned, I guess, designing... the incentives that exist in academia and designing the OpenAI organization in contrast to that. Say a little bit more about that. Yeah. Yeah. So, I mean, again, like,

Think back to 2017, 2018, 2019. AI research labs were not a big business. They were research labs. And a lot of the people who worked at them came from academia. And if you look at how academia is structured, there's a set of incentives that I think work okay for how they were originally designed. But there's an extreme focus on credit. Who actually did this?

And in which order are people's names listed on the paper? That's a very important thing for people with an academic background. Maybe you don't want to collaborate with people because it dilutes your contribution to the result. If there are two people working on a problem, that feels like competition rather than an opportunity to get the job done twice as fast. And so I think DeepMind...

In many ways, I think DeepMind thought, well, let's build a lab that is like academia, but inside a company so I can tell people what to do and where it's all about deep learning.

And I think Brain originally said, well, let's bring a bunch of academics and have them do exploratory research, very academic style. I will not tell them what to do, but I'll station product managers on the outside and maybe catch these great ideas and they'll become products. And we said –

We were a bunch of startup people, more or less, along with a few very great researchers, people like Ilya. And we said, well, clearly the natural modality for a research lab is that it should work like a startup.

So we will have an opinion on the right place to go. We'll try to give people a lot of freedom, particularly these great researchers, some of whom we didn't know were great researchers at the time, but give them the freedom to go find that hill that they're willing to die on to produce the amazing thing they want to produce. But we'll make it very collaborative, and we will make sure that people are working together and ultimately that we want to build one thing, not two, and not just,

you know, publish a lot of papers. I love that. I guess you kind of ticked through earlier some of the most famous decisions in OpenAI's history: the switch from nonprofit, the Microsoft partnership, the API, the release of ChatGPT. Anything that sticks out as a key decision point that maybe isn't as famous, or something that you think was either a really hard decision to make or something that really tipped the direction the organization went in? I think one decision that I didn't talk about earlier is

but which was also quite controversial at the time, was the decision to double down on language modeling and make that really the central focus for OpenAI. And, you know, this was complicated for all sorts of reasons. You know, a change like this involves a reorg and a restructure and, you know, people have to change their jobs and, you know,

Again, we had started with this culture that we were going to try a lot of different things and see what worked. Our first major effort was a joint project to play the game Dota 2. Yeah.

which followed in the great AI tradition of solving harder and harder games. You go from chess to Go to Dota 2 and StarCraft, which somehow feels less cool, but actually, I can tell you, with the math, these games are genuinely harder than Go and chess.

Even if they're not as classy. And that was a big success. And it taught us a lot. Because out of that experience came the conviction that you can solve problems by increasing scale. And a set of technological tools for doing that. And so by deciding to shut down the sort of more exploratory...

projects like the robotics team, games teams, and really refocus around language models and generative modeling in general, so including the multimodal work. I think that was obviously a very critical choice.

but one that was very painful at the time. One thing I'm struck by earlier, you obviously mentioned, you know, testing these models out with your eight-year-old kid. And, you know, I imagine in the time that you've been parenting, obviously the world eight years ago looked quite different than it does now, you know, in large part due to the advances that you've, you know, helped drive at OpenAI. And I'm wondering, you know, either for your life, for the way you parent, like,

Have you changed anything based on kind of updating your beliefs on like how soon the power of these models will manifest in the world? Yeah, I think the truth is I haven't. And I think this is probably a failure on my part, right? Like, you know, who is better placed to figure out what the kids should be learning, right?

than me, and yet I think I'm pretty much trying to teach them the same things that I would have tried to teach them eight years ago. Why am I teaching my eight-year-old son to code when ChatGPT can code for him? I think it's a mystery. In some sense, the future is predestined, but the actual contours of how that works are, I think, going to be pretty mysterious, and they're going to be revealed to us over time.

And so I think the age-old truths still hold: try things that are just at the boundaries of your capabilities, you know, work on the math, work on the coding, learn to write well, learn to read a lot and widely. I think, you know, those are going to develop the skills in kids and, frankly, adults that they're going to need no matter what AI turns out to do. Because fundamentally, it's not about the coding. It's not about the math. It's about

learning to think in a structured way about problems. Well, this has all been awesome. I'm sure we could chat with you for hours more. But we like to finish things with some rapid-fire questions. And so the first one is, what's overhyped and underhyped in AI today? Yeah, wow. Okay, well, one easy answer for what's overhyped is, I would say, new architectures. You know, there are a lot of things out there. They're pretty fun to look at. They tend to fall apart at scale.

And so if there's one that doesn't fall apart at scale, that will not be overhyped. Until then, they're overhyped. Underhyped, I'm going to go with O1. I think it is very hyped. But is it appropriately hyped? No. I think it's underhyped. I know our listeners will all be curious, so I'll ask. But anything you can share around why you ended up leaving OpenAI at this time? Well, the truth is that I had been there for eight years.

And I really felt that I had accomplished a lot of what I had set out to accomplish when I showed up. And it's not a coincidence that I announced my resignation after O1 Preview had shipped. There was a particular research program that we had developed. Again, pre-training, multimodal reasoning.

And those pieces were solved. And, you know, frankly, it's a hard job.

And, you know, when I felt like I had done what I needed to do, it was a good time to hand over to a next generation of people who, you know, were excited about the job and excited about solving the problems that remain. And I think they are super exciting problems. Any idea what's next for you? When I left Palantir, I spent two years before I landed at OpenAI.

And I began the process of starting a robotics company. I tried a lot of things. I got my hands dirty actually building things. I talked to a lot of people. Frankly, I made a lot of mistakes, but none that really mattered. And in the process of doing that, I learned a lot and I developed my own theses about what was important in the world and what was important about technological progress.

And, you know, all those things, the people I met, the ideas that I came up with, you know, helped land me at OpenAI. And that turned out to be, you know, a much, much better experience than anything I could have picked the first six months after leaving Palantir. And so I'm not in any hurry. I'm going to, you know, continue meeting people, figuring things out. Just I'm really enjoying the process.

of thinking and learning new things. Now that you have a bit more time, are there any threads you're excited to pull on, or things that you always wanted to spend more time on but couldn't in your day-to-day job? Well, you know, the funny thing is I feel like I've been, you know, stuck in a box for eight years. It's a very cool box. Yeah, a very cool box to have been stuck in. But there are a lot of things that, you know, have been happening outside. And, you know, like I said, I've been talking to robotics founders. Yeah.

And seeing a lot of cool things that have been happening during the period when OpenAI wasn't working on robotics. And just generally connecting with founders, researchers, people doing interesting things has been...

just really fun and engaging. Well, this has been an absolutely fascinating conversation, and I know it has been for Jordan and me and our listeners as well. Thank you for coming on and sharing all this. I want to leave the last word to you. I guess, where can folks go to learn more about you? Anything you want to leave our listeners with? Any threads you're excited to explore that you want to put out a call for? Or whatever else; the floor is yours.

Yeah, well, if you want to follow what I'm thinking as I develop it, the best place is to follow me on Twitter at BobMcGrewAI. And I think the right last words here are just that progress in AI is going to continue. And it's going to be very exciting. And it's not going to slow down, but it's going to change. And that's fun. So I encourage you all to keep working on it. Well, Bob, thanks so much. Seriously, this is fascinating. Anytime.
