Ep 55: Head of Amazon AGI Lab David Luan on DeepSeek’s Significance, What’s Next for Agents & Lessons from OpenAI

2025/2/19

Unsupervised Learning

People
David Luan
Topics
David Luan: DeepSeek's success was no accident; it is the result of model efficiency gains advancing in step with gains in intelligence. Lowering cost does not mean less intelligence gets used; if anything, it drives even more applications of intelligence. Going forward, frontier model training will keep chasing higher intelligence, while efficiency gains will come from internal optimization and ultimately be passed on to customers at lower cost. Next-token prediction alone is not enough to reach AGI; it needs to be combined with other ML paradigms such as reinforcement learning and search to discover and exploit new knowledge. Today's models generalize better than people think; performance may vary slightly on certain specific tasks, but those are just blips along the way. Building reliable AI models requires building a reliable model-producing factory, not just focusing on the algorithms themselves. Cases like AlphaGo have already shown that models are capable of original thinking, and the limitations of LLMs are overstated. Today's AI agent models have enormous potential but still have a long way to go on reliability and usefulness; the key to building useful agents is reliability, not flashy demos, and the biggest challenge for current agent models is insufficient end-to-end reliability, which requires heavy human intervention. Turning a base multimodal model into a large action model requires solving two problems: an engineering problem (how to expose the model's capabilities to it in a model-legible way) and a research problem (how to teach the model to plan, reason, replan, and follow user instructions). The ways AI models currently interact with browsers and programs lack creativity; more inventive interfaces will be needed to raise efficiency. The key milestone for the agent field is being able to hand an agent any task at training time and have it reach 100% completion a few days later. AGI is defined as a model that can do any useful task a human does on a computer and learn it about as fast as a human can. AGI's diffusion may be limited by social factors, such as how readily people accept and adapt to new technology. Specialized models will emerge in the future, not for technical reasons but for policy reasons such as data security and privacy. Simple scaling alone will not solve everything; other key technical challenges remain. High-quality data labeling is still critical for model training, but its role will gradually be taken over by reinforcement learning. Over the past year I have come to appreciate the importance of building team culture even more deeply. I have changed my mind about long-term technical differentiation in AI: breakthroughs in different areas do not necessarily compound. Solving reliability for digital agents can provide lessons and experience for developing physical agents. World modeling addresses how to train AI models when there is no explicit verifier or simulator.


Chapters
This chapter analyzes the market's reaction to DeepSeek, highlighting the initial panic and subsequent recovery. It discusses the model's efficiency and its implications for the future of AI development, including the increased consumption of intelligence despite cost reduction. The discussion touches upon the commoditization of previous levels of intelligence as newer, more complex models emerge.
  • Initial market reaction to DeepSeek involved panic and a stock market crash.
  • The market initially misunderstood the implications of increased efficiency in AI models.
  • Increased efficiency leads to increased consumption of intelligence, not decreased consumption.
  • AI use cases are categorized in concentric circles of complexity, with each circle requiring increasingly smarter models but commoditizing previous levels of intelligence.

Transcript


David Luan is the head of the AGI lab at Amazon. He was previously the co-founder and CEO of Adept, a company that raised over $400 million to build AI agents, and he was the VP of engineering at OpenAI during a lot of their critical breakthroughs. I'm Jacob Effron, and today on Unsupervised Learning, David and I hit a bunch of different interesting topics,

including his reaction to DeepSeek and his predictions for the future progress of models. We talked about the state of agents today, what's required to make them reliable, and when they'll be ubiquitous. And he shared some really interesting stories from the early days of OpenAI and what made the culture there so special. This was a really fun one, as David and I have been friends for a long time. I think folks will really enjoy it. Without further ado, here's David.

David, thanks so much for coming on the podcast. Yeah, thanks for having me. This is going to be a lot of fun because we've known each other for, what, more than 10 years now. You know, I remember when you originally joined OpenAI and I was like, that seems interesting, but I wondered, you know, whether it was a cool career move. And then obviously you were, as always, prescient long before everyone else. I got really lucky, you know, just that I always was into robotics and the biggest constraint for robotics was how smart the underlying algorithms were. So I started working in AI and it's...

just been so cool to see that this stuff is working in our lifetimes. Well, there's a bunch of things I want to hit with you today. I thought I'd just start with something topical. You know, obviously there was this huge reaction to DeepSeek over the last few weeks. You know, the NVIDIA stock crash, people were saying it was bad for OpenAI and Anthropic. I feel like now there's been kind of a coming back to, you know,

less freaking out. But I'm curious what people got right about the implications of this and maybe what they got wrong in the broader discourse. Yeah. So I still remember the morning that everybody started waking up to the DeepSeek news. I woke up, I looked at my phone, I had like five missed calls. I was like, what is going on? And the last time something like that happened was when SVB collapsed, because all my investors were calling me to get our funds out from SVB and First Republic and all that stuff.

And so I was like, something really bad must be happening. And I checked the news and it's like stocks are crashing because DeepSeek R1 is out and all that stuff. I instantly was like, wow, people have really missed the memo on what actually happened here.

DeepSeek, I think, was incredibly good work. I have so many thoughts on team culture and composition and all that other stuff we can get to later. But it was just really incredible work, and it's part of this broader story arc where we first figure out how to make new ML systems smarter and then we figure out how to make them more efficient. And so this was really like the tock to the tick of o1. Yeah.

And what everybody got wrong was that just because you can make more intelligence happen at a lower price doesn't mean that you stop consuming more intelligence. If anything, you consume even more of it. So once the market woke up to that, now we're back to sanity. Given that obviously at least some of the base model there seems to have been trained on outputs of OpenAI, and, you know, you can get the base DeepSeek model to say that it's ChatGPT in various ways,

do you think going forward, given what's happening with distillation, that OpenAI and Anthropic maybe stop releasing some of these models as publicly? I think what's going to happen is that people want to build the smartest models possible, but that's not always inference-efficient. So I think what we're going to start seeing more and more of, whether people talk explicitly about this or not, is that people are going to train these humongous teacher models on as much compute as they can get their hands on, and then they're going to try to figure out internally in their own labs

how to render it down to something that runs really fast and is efficient for customers. The biggest thing I'm seeing right now is that I kind of think about AI use cases in concentric circles of complexity. So maybe at the very inner circle of complexity will be something like just having a good old chat conversation with a base LLM.

We were able to do that back in GPT-2 pretty competently. And every incremental circle of intelligence, maybe it's being able to do mental math or coding or later on agents or later on drug discovery or whatever, requires smarter and smarter models. But every previous ring of intelligence becomes almost so cheap as to be commoditized.
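As a rough illustration of the teacher-student pattern David describes, here is a minimal distillation sketch in the style of a PyTorch training step. The function and variable names are hypothetical, and this is only the textbook form of the idea, not any lab's actual pipeline.

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Push the student's next-token distribution toward the teacher's
    temperature-smoothed distribution (classic soft-label distillation)."""
    t = temperature
    teacher_probs = F.softmax(teacher_logits / t, dim=-1)
    student_log_probs = F.log_softmax(student_logits / t, dim=-1)
    # KL(teacher || student); the t^2 factor keeps gradient scale comparable.
    return F.kl_div(student_log_probs, teacher_probs, reduction="batchmean") * (t * t)

def distill_step(student, teacher, input_ids, optimizer):
    """One update: the giant teacher is frozen, only the small student learns."""
    with torch.no_grad():
        teacher_logits = teacher(input_ids)   # assumes models return logits directly
    student_logits = student(input_ids)
    loss = distillation_loss(student_logits, teacher_logits)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In practice labs typically mix this with the ordinary next-token loss on real data, but the shape of the move is the same: the huge teacher is a training-time artifact, and the rendered-down student is what customers actually get served.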

That kind of goes into, obviously, this massive test-time compute wave. It seems like a really exciting path forward for coding, for math, for these easily verifiable domains. Yeah. How far does this kind of paradigm get us? So there's actually an interesting paper trail and podcast trail of me talking about the recipe for how to build AGI that's years old at this point. Let's contribute to that trail. Yeah, so now we get to, this is a proof that we talked about this conversation at this point in time. But

back in, even back in 2020, right? At the time, GPT-2 had come out, GPT-3 I think was cooking or maybe done by then, and we were starting to think about four. We were living in this world where, you know,

people were not sure whether or not all you needed was next-token prediction to solve all of AGI. My view, and the view of a couple of people around me, was that the answer would be no. The reason why is that an LLM trained to do next-token prediction is, by definition, penalized for discovering new knowledge, because new knowledge was not part of the training set.

And as a result, what we needed to do was we needed to go look at what are the other ML paradigms we know can actually discover new knowledge. And we know RL and search can do that, right? There's such a long trail, but even AlphaGo was maybe the first time it went to public consciousness that we could discover new knowledge using RL. And the question always was when we were going to combine LLMs with RL to get systems that had all of the knowledge of what humanity already knew and the ability to go build upon it.

Like, the reason why the initial DeepMind path of just doing RL alone didn't work was because it was initialized randomly. Those models that were playing, you know, Atari, an incredible result or whatever, the amount of time it would take something that knew nothing about the world to rediscover human language, rediscover how to coordinate, and learn details like how to file your taxes would be forever if done purely in an RL setting.

And so now I think that philosophy has really been borne out by seeing how successful these like models that combine both these paradigms are. And do you think, I mean, like for domains that aren't as easily verifiable, let's take, you know, healthcare or law, like do these, you know, does following this kind of test time compute paradigm get us to models that can do that? Or like, are we going to get exceptionally good at coding and math, but still not be able to like, you know, tell a joke or something?

This is a great topic of debate. I have a very strong view. What's the strong view? The answer is that these models are better at generalizing than you think.

Everybody's like, ah, you know, I played with o1 and it seems like it's a little better at math from the way it thinks, but maybe it's a little bit worse at chat or whatever. I think those are just blips on the way to glory for how these things are built. Today, we already have signs that getting better on

problems where you have an explicit ability to test whether or not the model correctly solved it, which is what we've seen from DeepSeek, does lead to transfer on some slightly fuzzier problems that seem in a similar domain. And I think the field is working hard, my team and others, they're just working so hard to figure out how to learn

human preferences around these much more complicated tasks and then just do RL to go satisfy those. Right. And do you always have to be able to build like a model to essentially verify like the, you know, hey, that output is like good law or that output is like, you know, a good healthcare diagnosis. Like, obviously a much harder problem than like, you know,

verifying a math proof or did that code run? The fundamental thing I view us as arbitraging is this gap between how good these models, the same set of neural network weights, are at determining whether they did a good job versus generating the right answer. We've always seen that these models are better at determining whether they've done a good job than they are at generating the answer. And to some extent, what we're doing with the RL stuff is exploiting that to force the model to try over and over and over again to satisfy its own sense of whether or not it did a good job.
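A minimal sketch of the simplest way to exploit that generation-verification gap: sample several candidate answers and let the same model grade its own work. The `generate_fn` and `score_fn` callables are stand-ins for a real model's sampling and self-judging calls; the RL approach David describes uses this kind of signal as a training reward rather than an inference-time filter.

```python
from typing import Callable, List, Tuple

def best_of_n(
    prompt: str,
    generate_fn: Callable[[str], str],       # samples one candidate answer
    score_fn: Callable[[str, str], float],   # same weights, asked to judge the answer
    n: int = 8,
) -> Tuple[str, float]:
    """Sample n answers, have the model verify each one, keep the best."""
    scored: List[Tuple[str, float]] = []
    for _ in range(n):
        answer = generate_fn(prompt)
        scored.append((answer, score_fn(prompt, answer)))
    return max(scored, key=lambda pair: pair[1])
```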

Talk to me a little bit about the research problems that needed to be solved to actually get a model like this out. There are just so many. Where do we even start? In no particular order, and I'm probably only going to itemize three of the problems that we've had to solve, I think one of the first ones is just figuring out

how you can even build an organization and a process to be able to reliably turn out models. Like one of the things that I always say to my team and people that I collaborate with is that today, if you run a modern AI lab, your job isn't to build models. Your job is to build a factory that reliably turns out models. And if you think about like...

how you think about that totally changes what you invest in. Right. And so until there was repeatability here, there was not, in my view, a lot of forward momentum. Like, we've just gone through this arc over the last couple of years of going from alchemy to industrialization in terms of how these things have been built. And without that, there was no substrate for these things to work.

I think the next part is all these corollaries, right? Again, this is a space where you have to go slow to go fast. So I think that was one of the first parts. I continually believe that the thing people always gravitate to, because they think it's cool and sexy, is the algorithms. But

If we look at what actually drove a lot of this stuff, it's solving the engineering. How do you do giant, massive clusters that you can reliably keep up for long enough? And if a node goes down, you don't end up wasting a bunch of time in your job to be able to push the frontiers of scale. Now with this whole RL thing, we're going to be quickly moving to a world where there will be lots and lots of data centers, each of which are going to be doing actually a lot of inference on the base model and maybe testing it on new environments, maybe that

a customer has brought to bear, to learn how to improve the model, and then sending that new knowledge back to a centralized place where the model can learn how to be smarter. There are actually a lot of really tough engineering problems as well. Yeah. There have been folks like Yann LeCun who have had some interesting recent and recurring criticisms of the limitations of LLMs. I wonder if you could summarize that critique for our listeners, and then, you know, what are your thoughts on folks that say, look, it's going to be hard for these

LLMs to ever have real original thinking? I think we just have counterexamples. I think AlphaGo was original thinking. If you go back even to the old OpenAI work where we were using RL to play Flash games, right? If you're of a certain age, you probably remember Miniclip and stuff like that. Great, great sinks of time in middle school. But it's so funny to watch that become the substrate of AI.

We were working on just using our algorithms to try to solve many of those games at once. And you learn that they learn really quickly how to discover, oftentimes, speedrun techniques by glitching through walls and stuff to solve platformer levels. That's stuff that humans have never done before. And on the verification side, is it mostly just, you know, it's obviously just finding clever ways, I guess, to figure out verification for some of these different domains. Yeah.

I think you use the model. I love it. I guess, you know, I'd love to shift to kind of the world of agents. And obviously, you know, you worked on computer use models at Adept. How would you kind of characterize where we are today with these models? Okay, so I'm super excited about, I remain super excited about agents. I still go back to, you know, 2020, 2021, when the first wave of like truly powerful LLMs like GPT-4 were coming out.

You go play with them and you're like, wow, there's so much promise. It's made me a great rap song, it does great roasts, sometimes it does three-digit addition acceptably. And you're like, please order me a pizza, and it just cosplays being a Domino's pizza rep. It just can't do it. It's obviously a major gap, right, in the utility of these systems.

So even since then, I was pretty sure that we had to go solve the agent problem. And so we started working when I was at Google on problems that actually still are called tool use, right? Like how do you expose affordances to the LLM to decide when it should go do something?

And back then, I think, well, the RL literature had always called them agents, but I think the general public didn't yet have a word for it. So we tried to come up with a new term, a large action model instead of a large language model. And that had a little bit of traction. And then the world decided that it was going to be called an agent. And now everything's an agent and it doesn't mean anything anymore. Yeah,

which is very sad. But it was really cool to be the first modern agent company. And when we started Adept, the best open source LLMs were not good. So we were like, we have to train our own model because there was also no multimodal LLMs. Image input LLMs like GPT-4V came so much later. And so we had to do everything end-to-end from scratch.

It's kind of like starting an internet company in 2000 and having to go call TSMC to build your own chip. It's just insane. And so along the way, what we learned really early on was that LLMs out of the box, without any of the new RL stuff we're doing today,

They're behavioral cloners, right? They kind of do what they've seen in the training data. And that means that they're really liable to go off the rails because the moment they end up in a situation they've never seen before, the generalization tends to be bad and it does something unpredictable. And so at Adept, we were always focused on useful intelligence. And so what did utility mean? It wasn't ship a cool demo that went viral on Twitter. It was put this in the hands of someone so they don't have to do the

like shuffle things around in your computer grunt work that most knowledge workers have to do.

And so those knowledge workers care about reliability. So one of our early use cases was, can we go do invoice processing for people, right? Everyone loves invoice processing for these agentic models. It feels like a natural place to start. It's a great hello world. And at the time, you know, nobody had really done these things before, so we chose a few obvious hello worlds. That was one of them. We did Excel, some other ones. But, you know, if this thing...

ends up deleting a third of your QuickBooks entries one in seven times, you'll never use it again. And reliability remains an issue. Even today, Operator is super impressive, right? And seems to be a significant cut above Claude computer use so far. But you look at both of those things that are already out, and the biggest challenge is that they've both focused on end-to-end

task performance, right? Like, you go in there and you type, hey, I'd like for you to find me five vacation sites that I could go to this weekend, right? And it'll go do an approximation of that. But the end-to-end reliability of that is

super, super low and requires lots of interventions, right? We're still not at the point where businesses can really trust it in a fire-and-forget way, and that's where the real value of this is. That's what we've got to solve. Maybe explain for our listeners, what do you actually have to do if you take a base multimodal model that's out there to turn it into a large action model? What is actually the work that's happening behind the scenes to make that happen?

So I can talk at a high level about it, but basically there are two things you have to do. One is an engineering problem. The engineering problem is how do you expose to the model,

in a model-legible way, what it can do. So here are the APIs you can call, here are the UI elements you can call, let's go teach you a little bit about how Expedia.com works or how SAP works. It's a little bit of research engineering. That's step one: giving it a sense of what it can and can't do and basic abilities to go do stuff.
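As a toy illustration of that first, engineering half, the sketch below renders a set of affordances into model-legible text that can sit in the agent's context. The tool names (`search_flights`, `click_element`, and so on) are invented for illustration; production systems usually use structured function-calling schemas rather than plain prompt text.

```python
from dataclasses import dataclass, field
from typing import List

@dataclass
class Tool:
    name: str
    description: str
    parameters: List[str] = field(default_factory=list)

def render_affordances(tools: List[Tool]) -> str:
    """Describe what the agent can do in a form the model can condition on."""
    lines = ["You can take the following actions:"]
    for tool in tools:
        params = ", ".join(tool.parameters) if tool.parameters else "no parameters"
        lines.append(f"- {tool.name}({params}): {tool.description}")
    return "\n".join(lines)

# Hypothetical affordances, not a real Expedia or SAP integration.
tools = [
    Tool("search_flights", "Search a travel site for flights", ["origin", "destination", "date"]),
    Tool("click_element", "Click a UI element by its accessibility label", ["label"]),
    Tool("create_invoice", "Create an invoice record in the ERP system", ["vendor", "amount"]),
]
print(render_affordances(tools))
```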

The interesting part happens in the second component, which is how do you teach it to plan and reason and replan and follow user instructions, and then

later on actually even be able to infer what the user actually meant and go do that for them, right? That's a huge, huge research problem and it differs a lot from regular old LLM work, because regular LLM work is like, let's go generate a piece of text. Even the reasoning work we're seeing today with math problems, right? There's an answer at the end. So it's like a single step. Even if it's thinking for a bajillion chains of thought,

It's really taking one step for you, which is like, hey, I've given you the answer. With this, it's this whole multi-step decision-making process that involves backtracking and involves trying to predict the consequences of an action you take for the future and realizing, hey, the delete button is probably dangerous. You have to do all the work to teach the model that in a basic setting. Right.
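A toy version of what that multi-step loop looks like, to contrast with single-shot text generation. This is not Adept's or anyone's actual architecture; `propose_action`, `is_risky`, and `execute` are placeholders for the model call, a consequence check (the "delete button is probably dangerous" instinct), and the environment.

```python
from typing import Callable, List

def run_agent(
    goal: str,
    propose_action: Callable[[str, List[str]], str],  # model picks the next step from goal + history
    is_risky: Callable[[str], bool],                  # flags irreversible steps like deletes
    execute: Callable[[str], bool],                   # returns True if the step succeeded
    max_steps: int = 20,
) -> List[str]:
    """Plan a step, check its consequences, execute, and record failures
    so the next proposal can backtrack instead of charging ahead."""
    history: List[str] = []
    for _ in range(max_steps):
        action = propose_action(goal, history)
        if action == "DONE":
            break
        if is_risky(action):
            history.append(f"HELD FOR CONFIRMATION: {action}")
            continue
        outcome = "OK" if execute(action) else "FAILED"
        history.append(f"{outcome}: {action}")
    return history
```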

And then you set it loose in sandboxes to learn on its own terms. Right. Yeah. The best analogy for this, by the way, that I'll jam in, I think it was Andrej Karpathy who wrote this on the internet or something, is that modern AI training is kind of like how a textbook is organized. Right. So first you have all of the exposition. I'm just cribbing this from him, but all of the exposition

of some physical process. And then you've got some sample problems. So the first part is pre-training, the sample problems are supervised fine-tuning, and then the RL step happens when you have the open-ended problems at the back that maybe have an answer in the back of the textbook. It's like we're just following that process. I guess you've obviously thought a lot about how these agents will actually get brought into the world. So I guess a few questions around that. The first is, you mentioned obviously part of this is kind of this engineering challenge of just

letting the models know what they have access to. How do you think over time models will interact with browsers and programs? Is it going to be similar to how humans do? Is it just going to be via code? Are there other approaches that you've seen? If I were to ding the field right now on one thing, it is that there has been a massive lack of creativity on how people interface with these increasingly smart LLMs and agents.

Like, do you remember when the iPhone came out and the App Store came out, people started making all of these apps, like, hit this button to make the burp noise, and here's a beer that you can pour into your mouth by tilting the phone.

Like our interfaces today feel like that. And that is so sad because chat is a super limiting, like low bandwidth way to go get things done. In certain ways it's easy, but in many other ways, right? Like I don't want to have a seven turn conversation to decide what toppings I'm going to have on my pizza. Right. Like that. And I think that the lack of creativity there has been really bugging me. And I think part of the reason why is that the,

amazing product designers that could be helping us figure this stuff out, a lot of them don't yet deeply understand the limitations of the models people are working with. This is changing quickly, right?

And then conversely, so far the people who have been able to advance the technology have always just thought about it as like, I'm here to go deliver a black box and not I'm here to go deliver an end-to-end experience. So when that changes, I'm excited to see things like systems where when you interact with the agent, it is actually itself synthesizing this multimodal user interface for it to best elicit what it needs from you, right? And to have shared context

between the human and the AI. Like, instead of the current paradigm where you're chatting to each other, it's like you are doing something together on your computer and looking at the screen more in parallel rather than perpendicular. I guess, you mentioned Operator kind of works, sometimes doesn't, you know. When do you think we actually get reliable agents? I mean, I think Operator is super impressive, by the way. It's just that right now the whole field is missing that last chunk, right?

It's like self-driving, right? I forget how many; it must have been over a decade, maybe even 15 years ago, that we had amazing demo videos. Well, having done a self-driving podcast yesterday, I think it was '95 when they did the ride where they drove across the country, like 99% autonomously. Yeah, yeah, yeah. So are we going to have to wait 30 years for it? No, no, I don't think so, because I think we actually have the right tools in our toolbox now. And I think that...

Yeah, I think this recipe for how to build a AGI-level agent will work pretty well. I guess what milestones are meaningful to you in the agent space? What do you think the next thing you're looking out for? Okay, the main milestone I'm looking for in the agent space is I can give... During training time, right? I work at one of these main labs and...

The milestone I'm looking for is that I have a recipe where I can hand this agent, in training, any task and come back a few days later and it's a hundred percent at it.

Yeah.

Do you think if someone was starting an adept company today, could a startup be successful here? Or is it going to be the foundation model companies and hyperscalers that ultimately move the ball forward here? So I actually have a huge amount of uncertainty on this one. But my current view is that...

I personally think AGI is really not super far away. And when you say AGI, how do you define it? A model that can do anything useful that a human does on a computer. That's one part of the definition. The other part that I also like is that it's a model that can learn how to do that thing as fast as a human can. Like a generalist human can. And I think...

I think either of those is really not that far away. And I think it's going to be deeply transformational, but I don't think it's going to diffuse through society really quickly, because,

as we know through Amdahl's law, once you really speed up one particular thing, something else becomes the bottleneck and your overall speedup is less than you think. So I think what's going to happen is that we'll have the tech, but there will be this massive, a lot of my colleagues call it a capability overhang, right? A massive capability overhang where society's ability to actually use these things productively will lag quite a while.
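For concreteness, that point maps directly onto Amdahl's law: if a fraction p of an end-to-end workflow gets sped up by a factor s, the overall speedup is

S = 1 / ((1 - p) + p / s)

With assumed, purely illustrative numbers: if AI accelerates 80% of a knowledge worker's task tenfold (p = 0.8, s = 10), the overall speedup is 1 / (0.2 + 0.08), roughly 3.6x, and even an infinitely fast model caps out at 1 / 0.2 = 5x until the remaining 20% (approvals, handoffs, people making their peace with it) changes too.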

Do you have any early thoughts on what the gating factor might be once we do have these capabilities? I think it's people. I think it's people and processes. It's figuring out how to co-design the interface with the decisions that startups are making on how to use the models. It's going to be social acceptance. Imagine you have this model that pops out tomorrow and says, hey,

I have invented a brand new way of doing X. Everybody should go use this thing. Humans have to go make their peace with it and decide, hey, is this actually a better solution? That's going to not be as fast as we think. Right. And I guess to your point, there might actually be an opportunity, even if the labs are the first place to get to these models that can do this, there may be an opportunity for startups to actually be the ones to bridge the gap between these model capabilities and something that the end users actually want to interact with. I'm actually pretty sure that's what's going to happen.

I'm obviously biased. I want that to happen. That's a good point. Well, I think that's a good bet, because at the end of the day, I actually still really believe that in a world with AGI, human relationships really matter. And knowing and owning your customer and being more in tune with them about what they need is going to be more important than simply controlling this artifact that actually

many other labs will have. What do you think it's going to look like, how am I going to use my computer in 10 years, when all these models are, you know, we've gotten to your definition of AGI? Am I ever going to sit down at the computer? Or what is your vision for the way humanity interacts with these things? I think that we'll just get new arrows in the quiver, or

rather tools in the toolbox for how we interface with computers. I think today, right, I mean, we've got, people still use the command line, right? Like people, that's a really important part of people's productivity. People still use the GUI. In the future, people will still use voice interfaces. But then also, I think also people will use more ambient computing as well.

And also, they'll have this generative UI thing we were talking about earlier. But I think the metric we should be looking at is just what is the amount of leverage per unit energy a human spends with computing? And I think that is going to continue to go up and to the right with these systems. Maybe talk through a little bit like this future world of models and whether we end up with anything domain-specific. Let's take the hypothetical legal specialist.

you probably want the hypothetical legal specialist to know some basic facts about the world. - Yeah, so we make people go do a general college degree before law school. - Exactly, exactly. So I think like, I do think there will be specialized models,

But I don't want to bury the lead by just saying there will be specialized models. I think there'll be specialized models not for technical reasons, but for policy reasons. That's juicy. What's that mean? Oh, yeah. It's just like, you know, you have a couple of companies who just really don't ever want their data commingled with each other. Or you've got like some, like, you know, imagine you're a big bank, right? And you've got your sales and trading division. You've got your investment banking division, right?

the AI employees or LLMs that power this stuff, just like how those employees today can't share information, should not be able to share information even remotely through their weights, right? As you think about the key problems that still need to be solved in models, I mean, it seems like you have a lot of confidence that if we kind of just scale up compute in these approaches, we're going to get pretty close to solving what we need to solve.

But are there any big technical challenges you see ahead in continuing to scale model IQ? So,

I actually don't believe that we take what we have today exactly and then we just pull ahead the cluster from two years from now and everything will magically work. I do think scale is going to be a major factor, but my confidence actually comes from looking at what are the main remaining open problems and trying to have an estimate for how hard they are.

And if it required a super hard thing, like we need to go replace gradient descent, or we can only do AGI with a quantum computer or something like that,

I don't think that's in the cards. What do you do when new models come out? Like, how do you, you know, do you look at the evals? Do you vibe check them with a few like go-to questions? Like, how do you get a sense of how good these new models are? So there's two things that I do. One of them is that

What I've learned is that, and this is what's so cool about this field. Sometimes you just look at a result and you look at, especially if there is a methodology that gets published with it, which is rare now, you just look at how they did it. And you're like, wow, this is actually simpler than how we used to do this. And the results are better. When that happens, it almost always becomes part of the deep learning canon. And then you just have this moment where you're like, this is actually really beautiful. Yeah.

I think that's the main one. Then the other one, like, you know, benchmarks: part of the hype of the field has been that a lot of benchmarks that are good, but really not that aligned with what people need from these models, have become so important in people's development processes. So they're all kind of gamed. I actually think that, like,

evaluations are so hard. Measurement is so hard. Way more prestige and attention should go to that than, in fact, many other things that we're doing right now. Yeah. And it seems like everyone kind of has their own internal evals that they don't release publicly and that they trust way more. And it's like, you could see something like an OpenAI model perform better on a lot of the coding benchmarks, but everyone uses the Anthropic models anyway because they know they're better. And so it's interesting to see that landscape evolve.

- Well, I'm curious to the extent you can talk about it. Like I'd love to hear what you're up to at Amazon these days, how you think about Amazon's role in the broader ecosystem. - Yeah, Amazon's a super interesting place actually. I felt like I've learned so much in a short amount of time there. Amazon is super serious about building general intelligence systems, especially general intelligence agents.

And what I think is really cool about it is that I think everybody at Amazon understands that computing itself is changing from the primitives that we all know and love to a call to a large model or a large agent being probably the most important compute primitive in the future.

And so people really care, which is awesome. And I think what's interesting about it is that at Amazon now I cover agents. And what's been really cool is you get to see just the breadth of everything that agents touches in a company as big as Amazon. Okay.

And what's also awesome is that Pieter Abbeel and I have started this new San Francisco-based research lab for Amazon. And a lot of that was because folks in the highest levels of Amazon, I think, really believe that we have to make new research breakthroughs to solve those remaining problems we were talking about earlier on the path to AGI. Yeah.

Do you pay attention to any of these alternative architectures that folks are trying? Or what other areas of maybe more out there research do you kind of keep your eyes on? Let's see. So I always pay attention to things that are

that look like they might help us better map model learning to compute. Can we use more compute more efficiently, right? It just gives us a huge multiplier over what we can do. But I honestly actually spend more of my time looking at data centers and chips because I just find it so fascinating. It's so cool. So cool. And there's some interesting plays being made there now. Yeah.

It seems like a big part of what drove this last progress in models was data labeling. And obviously, all the labs were spending tons of money on that. Is that still relevant in this test-time compute paradigm? How do you think about that? I think there are two different jobs that have to be solved with data labeling. There may be more, but the two that come to mind for me first are

teaching the model the very basics of how to do a task by cloning human behavior. And if you have super high quality data, then you can use that to better elicit something you already see loosely during pre-training.

And then I think the second job is to teach the model what good and bad looks like for tasks that are fuzzy. And I think both of those will remain really important. But I think this middle chunk of just spamming human data labels to marginally improve things the model can already kind of do, that's going to be the job of RL. You've obviously been on the frontier of this space for decades.

What's one thing you've changed your mind on in the AI world in the last year? The one that I actually keep on coming back to is the importance of building

the team culture in the right way. Like, I think we've always kind of known it, but I've become even more convinced that hiring really smart, energetic, intrinsically motivated people earlier in their careers is actually one of the best engines for a product. I feel like in this field...

Oh, actually, yeah. Every couple of years, the optimal playbook changes. And so if people are too overfit to the previous optimal playbook, they actually slow you down. And so I think it's a lot better to bet on new folks coming in than I had previously thought.

But the other one actually that I've changed my mind on, I used to think that building AI would actually have real long-term technical differentiation that you can compound on.

I used to think, if you get really good at text modeling, you should obviously just become the winner in multimodal. If you're good at multimodal, you should obviously become the winner in reasoning and in agents. These things should compound. In practice, I've seen so little compounding. I think people are all trying...

relatively similar ideas. And I guess implicit in what you're saying is, just because you were the first to breakthrough A doesn't actually mean that that then puts you in such an advantaged position to get to breakthrough B. Basically, if you're ahead in LLMs and then we're talking about the reasoning side, I mean, it was OpenAI that happened to be ahead in both, but it's almost like that reasoning breakthrough could have come out of any of the labs. And just because they were first to, you know, kind of a GPT-4 level model didn't necessarily mean they would inevitably be the ones to have the

I mean, it's definitely correlated, but it's not like deterministically true that you will then obviously win the next change. Well, I want to hit on your, you know, you obviously got into this space originally through your love of robotics. And so I am curious, like, what do you think of where we are in the AI robotics space today? Similarly to my belief about digital agents, yeah.

I think that we have a lot of the raw materials and I think interestingly enough digital agents gives us an opportunity to de-risk some of the hard problems in physical agents before you have to do all the costly stuff of like real world items. Say more about that. So like basically solving the reliability problem on the digital agent side, well, you know, how does that actually end up bleeding into the physical agent side? Simple example, a toy example. Let's say you have a warehouse you're trying to rearrange.

and you have a physical agent and you're asking it, hey, like figure out the optimal plan for rearranging this warehouse, right? If you were doing that by learning in the physical world or even learning in robotics sim for that, that's kind of hard. But if you could do that in the digital space already and you have all of the training recipes and the know-how, you've tuned the algorithms to be able to learn from simulated data on that, it's just, you've done the training wheels version of this already.

And so it's funny. I feel like there's these polar extremes when people think about robotics. Some people look at it and they're like, oh, the same kind of scaling laws we found in LLMs we'll find on the robotics side, and we're on the precipice of this massive change. You hear Jensen talking about it a lot. And then there's other folks that are like, we're where we were in 95 with self-driving cars where it's a great demo, but it's going to take quite some time to actually work

Where do you fall on that spectrum? I just go back to the thing that would give me the most confidence is our ability to build training recipes that let us 100% tasks.

If we can do that in the digital space, I think that it will be a challenge, but it will transfer over ultimately to the physical space too. What's your timeline for when we have, you know, robots in our house? Oh gosh. Well, I think that actually goes back to the thing I was saying earlier. I think for some problems, actually a lot of problems, the bottleneck is not the modeling. It is the diffusion of the modeling. What about video models?

Obviously, there's been a bunch of folks going into that space. It seems like the next frontier around that is really kind of like a world model and understanding of physics to allow for more open-ended exploration there. Maybe just comment a little bit about what you've seen there and your thoughts on that space. Yeah, I'm really excited about it. I think it solves one of the major problems

remaining problems, which is, you know, we talked earlier about how today we're able to make RL work on problems where you have a verifier, right? Like theorem proving or something like that. And then we talked about how to generalize that to the digital agent space where you have problems where you don't have a verifier, but you might have a solid simulator because I can go boot up my staging environment for insert app here and teach the agent how to try to use that.

But I think one of the major problems left is what happens when you don't have an explicit verifier or an explicit simulator. I think world modeling is how we answer that question. Awesome. I want to shift gears a little bit to OpenAI and your time there. Obviously, you were part of this very special time in the company and

played a kind of seminal role in a lot of the advances there. You know, I feel like at some point we're going to get this deluge of thought pieces about what made OpenAI culture so special in this era that developed, you know, GPT-1 through 4. What do you think?

What do you think those pieces will say? Like, what made the organization work? Oh, I mean, I'm not sure those pieces will get it right, because I'm already seeing all sorts of bad hot takes about why OpenAI succeeded during that period. I think what it is is, you know, the research community was really small back when I joined in 2017. I think OpenAI was a little over a year old.

And I knew a bunch of people on the founding team and some of the early employees, and they were looking for someone to... And one of the things I love about OpenAI that they got right from the very beginning is blurring the lines between research and engineering. And they were looking for someone to run that.

And so it was super fortunate. I joined when it was 35 people, incredible folks on the team like John Schulman and Chris Berner, who did a lot of our supercomputing stuff, and Wojciech. And there are so many people that I could name who were just incredible folks around the table back then. Okay.

And, you know, interestingly, at the beginning, it was helping OpenAI just build a lot of the infrastructure of how to scale out beyond like a tiny team that all fit one room, right? So a lot of basic engineering management stuff. But then it started morphing into like, how do we define a differentiated research strategy that would enable us basically to take the right bets for this period of ML? And...

I think what that really boiled down to was I think we realized earlier than everybody else that the

previous way research worked, where you and your three best friends write a research paper that changes the world, that era was over, and that we really needed to be thinking about this new era where we thought about major scientific goals and tried to solve them with bigger teams of combined researchers and engineers, regardless of whether or not the solution was quote-unquote novel as defined by academia. And we would take the flak for that sometimes. Like, when GPT-2 first came out, people said, well, this looks like a transformer. Like, yes, it is a transformer.

And that was something to be proud of. What did you think you were signing up for when you joined OpenAI? Oh, I mean, I was just so stoked because I wanted to be at the frontier of research. At the time, it was OpenAI or DeepMind. Or I guess also Google Brain, but I think I wanted to do something a little more speculative. Same lesson as I said earlier about betting on really intrinsically motivated people

folks who could be earlier in their career was just such a winning recipe, right? Incredible people like Alec Radford, like Aditya Ramesh, who invented DALL-E. Again, a long list of incredible folks I can name who did field-defining things during that period who did not have PhDs, did not have a bajillion years of experience.

What common traits have you noticed and what makes these people so good? I mean, you're one of the great AI researchers. You've worked with a lot of the great AI researchers. What traits make these individuals so good? And then what have you learned about bringing them together into teams to accomplish what they're able to? So much of it is intrinsic motivation and intellectual flexibility.

I'll leave this person unnamed, but this person was so motivated and excited about the research they were doing on my team that about a month and a half in, I remember having a one-on-one with him, and he just let drop that he had never bothered setting up Wi-Fi or electricity for his apartment. He had just moved to the Bay to join us. And it...

And I was like, how is this completely okay? And it turned out he just spent all of his time at the office running experiments and it didn't matter. That's quite the level of passion. I mean, I've heard you talk before about how it's kind of somewhat shocking that Google didn't have the GPT breakthrough just given the transformer was invented there. How obvious was it

at the time, you know, how game-changing the technology was? And, you know, I think you talked basically about how it was hard for Google to coalesce as a full organization around this versus other research. Maybe just comment a bit about that. I mean, credit to Ilya. I remember Ilya was like, we got to go...

So Ilya was our scientific leader, especially for the basic research part, which ended up spawning GPT, CLIP, and DALL-E. And I just remember him going to work and being like, dude, I think this paper is really important, and poking people to try experiments they were running with other architectures with the Transformer. I mean, do you think there's a risk that, obviously now the foundation model companies are doing so many different things at the same time, it almost feels like it's ripe for another recipe maybe at some point?

I think losing focus is really dangerous. You might be the biggest fan of NVIDIA and Jensen of most people I have in my life. And so I'm curious, there's obviously so much love in the ecosystem now for everything that Jensen and the team have accomplished. What are some of the things that NVIDIA has done that you feel aren't talked about a ton, but are actually a huge part of what makes that company so impressive?

I love Jensen. What a complete legend. Yeah, I feel like he's just made a bunch of calls here correctly over such a long period of time. I think people know this now actually, but, especially for the last couple of years, I think it's really paid off for them: bringing interconnect in-house and

choosing to orient their business around systems was a really good move, I think. Well, we always like to end our interviews with just a quick fire round. And so get your thoughts on some questions. I feel like I know how you're going to answer this one, but do you think model progress this year will be more or less or the same as last year? I think visibly it'll look probably about the same, but I think actually it'll be more. What's one thing you think is overhyped and one thing you think is underhyped in the AI world today? Overhyped is...

scale is dead. We're totally screwed. Let's not buy any more chips. Underhyped, I think...

I think underhyped is how do we actually solve extremely large-scale simulation for these models to learn from? David, this has been a fascinating conversation. I'm sure folks will want to learn more about you and some of the exciting work you're doing at Amazon. Where can folks go to learn more about that? Yeah, so for Amazon, I would look up the Amazon AGI SF lab. And I actually don't use a lot of Twitter, but I plan to get back on it, so you can follow me at JLuan.

Thank you.

Thank you for listening and see you next episode.