
Misha Laskin, Reflection.ai — From Physics to SuperIntelligence

2025/3/13

Manifold

People
Misha Laskin
Topics
Misha Laskin: I have a background in theoretical physics, having studied at Yale and then the University of Chicago, before moving into AI research and founding my own AI company. I worked on reinforcement learning at Berkeley, at Google DeepMind, and on Google's Gemini project, and I have followed closely how the field of AI, particularly reinforcement learning and large language models, has changed. I witnessed the success of AlphaGo and the rapid rise of large language models, which made me realize that large language models had solved the generalization problem that had plagued reinforcement learning. The generality of large language models is surprising: they can answer all kinds of questions. Through pre-training on internet data, they achieve fast learning and few-shot learning, building a knowledge model of the world. However, current reinforcement learning methods are relatively weak, and large language models still struggle to generate new knowledge. I believe future AI systems should be able to help people do their work autonomously, and even help people discover new knowledge. The improvements in today's reasoning models do not come from strengthening the pre-trained model itself, but from changing how the model behaves, making it more powerful and useful in that new mode of behavior. Pre-training and instruction tuning can be viewed as imitation learning, with reinforcement learning then providing self-improvement on top of that. Reasoning models learn new knowledge in the reinforcement learning phase that the pre-trained model did not know. We are in the early days of reinforcement learning starting to work on top of large language models and have not yet reached the "AlphaGo moment" for large language models. I believe that within the next few years we may see superintelligent language models in certain areas of knowledge work. Having models take actions on a computer (for example, using a browser) and doing reinforcement learning there is a very promising direction. The success of reinforcement learning depends largely on access to verifiable, high-quality datasets. Reinforcement learning algorithms are coupled to the environments they are trained in, which is both a limitation and a strength. Large language models are good at finding and summarizing information but are limited when it comes to solving real research problems. My physics background gives me a distinctive perspective: I tend to start from first principles and look for simple solutions. Unlike the common practice in AI of adding complexity to improve performance, I prefer to simplify problems and look for core principles. Some of the influential work by physicists in AI, such as the research on scaling laws, reflects this approach of simplifying problems and finding core principles.

Steve Hsu: As a professor of theoretical physics and of computational mathematics, science and engineering, I have been following developments in AI closely. Misha Laskin and I had an in-depth discussion about the challenges and future of AI, focusing in particular on progress in reinforcement learning, large language models, and reasoning models. We discussed the limitations of large language models and how reinforcement learning can improve their reasoning ability. We also explored how large language models can be used to solve practical problems and how to meet the challenges that AI brings. Misha Laskin's physics background gives him a unique perspective that helps him better understand and solve problems in AI.


Transcript


AlphaGo never stopped improving. It became super intelligent and you could have sunk 10x or 100x more resources into it and get an even more super intelligent AlphaGo. And so in principle, these systems never stop learning. It's just a matter of how many resources you want to sink into them. Now with language models in RL, it's still early days. So I don't think that we've discovered the sort of

massively scalable blueprint, but there's a foothold. Welcome to Manifold. My guest today is Misha Laskin. Misha has a background in theoretical physics. He has since transitioned to AI research and to founding his own AI companies.

I think this interview will be especially interesting for people with a physics or academic science background, but also interesting for people who want to understand the current state of AI. Misha, welcome to the podcast. Thanks for having me, Stephen. I understand that you were an undergrad at Yale in physics and that you actually finished a PhD at the University of Chicago. Yes.

And maybe just tell us what the 20-something, early 20s Misha thought he was going to do with his life, why you were attracted to physics. Just give us a slice of life for you at that stage. Yeah, I think that when I moved to the States, I'm Russian-Israeli, and when I moved to the States...

I got really interested as a teenager in two things, and it was physics and literature, basically. Literature is basically because I didn't read very well. And so I kind of, you know, it was sort of like eating your vegetables until you like them. And so at first it was very painful, but then I actually started liking it. And physics, we had this, you know, group of like the Feynman lectures in my parents' library. And I had a lot of time on my hands and just got excited.

There's just something aesthetically, I'd say, really beautiful about it of understanding pretty interesting implications that are non-obvious about how the world works from some set of first principles, ability to explain things very clearly. I just really enjoyed reading it. And so that's kind of, you know, my initial inclination into physics began there in high school. And then I basically wanted to do things like that. I wanted to

Work on impactful science. I think that was the thing. Work on impactful science. And I thought that physics would be the place for me. And so I got very interested in theoretical physics, did my undergrad in physics and double major in literature, but then professionally really wanted to go into the scientific realm and did my PhD at UChicago in theoretical physics.

many-body physics, many-body quantum physics. And it was actually a really wonderful time; I look back on it really fondly. It had this sort of, uh,

Almost the feel of physics, I imagine, 100 years ago, because theoretical physics, at least then (I don't know what it's like now), you could still do a lot of it on a chalkboard. And so it was a lot of sessions with my advisor or some colleagues where we were drawing something interesting on the chalkboard. And yeah, it was a really fun time in my life.

And I think right after you finished your PhD, did you go into Y Combinator and start an AI company? Do I have the details correct?

Yeah, so towards the end of my PhD, I kind of sort of had, I would say, a change of heart, not because the stuff that I was learning wasn't very interesting, but because I felt that I had become an expert in this very narrow sliver of science. And while it was aesthetically pretty beautiful and very interesting,

It was hard for me to see, to imagine the kind of impact that I would have, even if successful, decades from now. I think that, yes, some people go into physics and are able to see, maybe at a young age or older, depending on how patient they are, the great amount of impact. There are scientists like that around me. But maybe I was just impatient. It was hard for me to imagine waiting decades to know whether the things I was working on

would bear fruit or not. And so I did this kind of, I would say... I think that my personal confidence went down then, because it was sort of a bet that I took. I had a lot of conviction and put almost a decade of my life into it, and then didn't really know what it is that I should do. But I wanted to try kind of doing sort of

almost the most practical thing I could imagine. Somehow, entering the job force or the workforce didn't seem that appealing. And frankly, I think to a physicist who's just trained very theoretically, some of that is also a bit intimidating. You're kind of

You know, I hadn't studied, you know, CS, you know, formally. And so it was all very foreign to me. So I taught myself to code. This was happening, I guess, towards the end of my PhD. Anyways, I had to pick up coding for some of the projects I was working on. And I decided to do almost what I'd call a random walk through startup ideas without having like...

any, I would say, internal conviction at the time around what it is that I should build. It was kind of just a random walk through: is this useful or not to someone else? And I ended up converging on a company that was actually building inventory prediction systems for retailers. So, you know, how many items of clothing, you know, should you make for your next season, or something like this? And

I learned a lot about startups at that point in time, mostly about... I learned a lot about what I didn't want to do during that process. But it was also interesting building useful stuff. It's just that there are some, I think, fundamental principles around startups. Maybe I should have read some of the Paul Graham essays before, but I was reading them as I was building the startup. And I definitely think some of them are true. One of them is the simple notion of

having deep empathy for your customer and sort of loving your customer. And the reality of it is that I didn't have very deep empathy for, you know, for the people who I was trying to help, right, on the retail side. I didn't understand them that well. And so it kind of converged on this consulting business that was generating revenue, but didn't really have a product around it. And I wasn't, you know, particularly fulfilled working on it either.

But at the same time, I saw deep learning taking off. And in particular, I remember seeing AlphaGo. And that kind of changed something in me in terms of, I kind of think of that part of my life as wandering through the desert kind of part and trying to kind of put the pieces back together and find what is deeply internal and motivating to me. And when I saw that come out, that to me seemed like

This is the impactful science of the time that I live in that I really want to work in. And so I basically dropped everything I was doing and went into a cave where I learned deep learning and reinforcement learning fundamentals. And that was kind of my first foray into AI.

And what made you... so at that stage, you might have jumped immediately maybe to a company and started working on AI, but you actually went to Berkeley for an academic postdoc. So what was the thought process there? I think at that time, with AI as a useful piece of technology, it was still not obvious that that was the case. It was still very much, this was

a number of years before language models took off. And it was still, I would say, there was both a lot of industry research happening, but it was more academic in nature, I would say. And I was considering two things. I was in the process, met OpenAI. They had this fellows program for people who are coming over from a different field to sort of ramp up into AI. And at the same time, got introduced to

uh, Pieter Abbeel, who is an AI researcher and a professor at Berkeley, who's done a lot of foundational work over the last decade. And my decision at that point in time was where do I think I'll learn most in the shortest period of time? And I thought that as a postdoc, I'd basically be able to iterate through a lot of different ideas and, uh,

have a lot more learning events, which in retrospect, I mean, I don't think that that's necessarily... I think both options were great. But at the time, it was just not clear that AI as an industry, like as a commercial industry, was the place to have the most impact. And it kind of seemed like the place where I'd learn the most at the time was still in the academic setting. Now, at this point in time, were Transformers a big deal already or just starting to be?

They were... you know, some of the foundational transformer papers had come out, but I remember this was right before GPT-2. So it's not obvious. You know, when I joined the lab, I guess it was inspired by AlphaGo. I really wanted to do reinforcement learning, decision making, kind of solving the problem we're talking about.

There we were just using RNNs and LSTMs for anything that required memory, and oftentimes just MLPs or convnets. So it was very rare to use a transformer at all. And they were taking off in NLP, but at that time, NLP was not, you know... language understanding was one of the

several like sub areas of AI. It wasn't, you know, there was computer vision, there was, you know, language models, there was reinforcement learning. And I would say that the thing that was probably most top of mind at the time was reinforcement learning because it was just coming from, you know, AlphaGo breakthroughs. And it wasn't, when AlphaGo happened, it wasn't just AlphaGo, it was a series of papers of increasing, I would say, both

they got progressively more beautiful in their simplicity, in how many assumptions they removed, and in their power. And so these were AlphaGo; AlphaGo Zero, which then learned the game of Go without any human demonstrations; then AlphaZero, which generalized beyond the game of Go to other games; and finally MuZero, which was

an algorithm that sort of learned the rules of the game as well, instead of being given them. And so that was the thing that was top of mind to, I think, the AI community at large. I think reinforcement learning was very top of mind. And even though transformers were definitely taking off in NLP, it was kind of one of the few things that was happening; it was not clear that that would be the big thing.

Do you feel, this is jumping ahead a little bit, do you feel a little bit like the field has come full circle now where maybe RL is, because of the recent reasoning models, or the ability of RL to condition these reasoning models, it's sort of now back at center stage a little bit? I think it's getting there. It's been a pretty interesting turnaround because after, I would say, GPT-3 and

Certainly after ChatGPT, I think the whole field of reinforcement learning took a bit of a backseat. And I wouldn't say it became irrelevant, since the workhorse algorithm that powers alignment of these models is RLHF. But it was really... RLHF is a pretty weak form of reinforcement learning. And a lot of people questioned to what extent it was even necessary as opposed to just really high-quality curation of instruction tuning data. So RL definitely took...

I'd say a backseat. And so after I worked on Gemini, which was Google's large language model, I kind of realized that I thought that the ingredients might be on the table where you have these really general objects that are language models. And there's nothing fundamentally wrong with reinforcement learning. It's not like there's something that we learned that was wrong. It's just that you need a good...

reward signal to optimize against. That's basically the thing you need: you need a good task distribution to learn from, and you need a good reward, a way to verify that those things are being solved. And so I kind of thought that, you know, after we launched Gemini 1.5, it was probably the time to start looking into scaling up reinforcement learning on top of language models. I think that's now come full circle in terms of reasoning models,

Which is also rather, I think, like one of those non-obvious things. I think a year ago, maybe, yeah, let's say a year and a half ago, reasoning models were one of the many things that an AI lab pursues. It was not clear that they would be as powerful as they're starting to be today. So I think we're kind of seeing a sort of... I mean, this is kind of maybe a normal part of these AI waves, is that the work

It starts, you know, obviously kind of earlier before it's clear to other people, but when it's happening, it's actually not clear that this is going to be the thing that wins, except for, you know, for a small subset of people who have a long conviction and are seeing something that other people aren't.

Right. So coming back to your bio, because I skipped ahead. And so the audience doesn't really know what happened to you. So you did your postdoc for a couple of years at Berkeley, and then you transitioned to Google DeepMind, if I'm not mistaken. And that's where you worked on Gemini. That's right. Yeah, I joined DeepMind to...

continue basically scaling up research in reinforcement learning. And again, it was not clear to me that... you know, it was not really to work on language models at the time. I joined a team that was called the general agents team. So, you know, it was really to solve the problem of agency and autonomy with reinforcement learning. This is a team led by Vlad Mnih, who was the first author of deep Q-networks, the paper that, um,

started basically the deep reinforcement learning era in 2013, I think. But then what happened is, I just remember this very vividly, I was at NeurIPS in New Orleans and ChatGPT came out, and that afternoon I was giving a talk and I had some kind of dissociative moment of, you know, why am I saying these words? Like, the thing that matters is clearly, you know...

I don't know, somehow clicked with me that it's so obvious what the thing that matters is now. And so

So why am I at the conference talking about something that is not this thing? So as soon as I got back from New Orleans, I basically dropped everything I was doing, started working on language models and joined a project. At the time, it was a small group of people. It became the reinforcement learning and RLHF team for Gemini.

So just to clarify for the audience, so you're giving your talk at NeurIPS on some research you've really poured your heart and soul into. But in the back of your mind, are you thinking scaling transformers as language models is the thing that really I should be focused on? And is that explicitly what you were thinking? It was something similar. I was thinking the following. So the problem with reinforcement learning

before language models was that we had developed these extremely powerful algorithms that worked in very narrow domains. So you had a super intelligent Go player that was hard to... It didn't really generalize to anything. If it did generalize, it did so in the sense that you had to retrain the entire model for a different domain. And

The problem is that most domains of interest, it's just impractical to collect the amount of data and have the verification signal that you need in order to get something useful working there. And so there was this big existential, I would say, generalization problem of: we have really powerful systems, we have no idea how to make them general. And when ChatGPT came out, I played with it that day, and it was

very clear. The system is very general. It might not be very capable yet. It's not autonomous. At that time, it was a pretty weak chatbot, but it was very general. You could ask it about almost anything, right? And it'll answer it, and sometimes very capably. I remember at the time it shipped with a feature that formatted basically code as almost like blog posts about code. So you could ask it something about code and then it writes you a blog post. That was pretty magical. So what I realized is that

We were, or at least I was, let's say, spinning my wheels trying to solve the generality problem when it had already been, like, it was solved for us already, right? That these language models are great, you know. And so it was just a different way to approach, like, the problem. And that's what I thought was really interesting.

You know, when I was very young, before I actually learned any physics, I read books like Gödel, Escher, Bach. And so I was actually quite interested in AI before I knew any physics. And I always wondered about this problem of how would you instantiate knowledge about the broad world?

in your AI. And at the time, there was some very huge project at MIT, I think, where they were just literally typing true sentences into a database in the hopes that eventually it would reach some critical threshold and know about the world. And so the big thing they accomplished, which surprised me, was like, okay, transformer trained by next token prediction on trillions of tokens. Wow, you get a world model

that actually knows a lot about the human world, you know, as observed by human writing. And yeah, that was a zero to one moment that was just shocking. And I think people are still probably underappreciating that zero to one moment. Like historians will look back and go, yeah, that was a discontinuity in this whole thing. Yeah, it's very non-obvious. Still, I mean, I think it's still not obvious to me why

It works. It's magical that it does. But yeah, because the internet is such a big, messy data set. And so when you pre-train a model on it... I mean, most of us don't actually get to, most consumers certainly never play around with, the actual pre-trained model, because it's, um,

It's very user unfriendly. But I remember getting access to some pre-trained checkpoints and playing around with them. And if you kind of poke at them the right way, you can elicit some very interesting answers. And the fact that they have these powerful world models that are then, I would say, very steerable, like...

doing instruction tuning and reinforcement learning, you don't need to do it for that many steps before you go from your pre-trained checkpoint to something that is usable to people. And that was really interesting as well. And so it solved... there was this whole field in AI before then, and maybe, I mean, still is now, but I think that this kind of was an answer to that field of meta-learning, which is the notion of

How do I learn very quickly from a very small number of examples? And there are all sorts of sophisticated algorithms for how to do this. And it turned out that the best meta-learner, the best meta-learning algorithm, was just next token prediction on the internet. And from a meta-learning perspective, few-shot prompting is basically learning very quickly from a small number of examples. And that was really surprising.
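A minimal sketch of what that looks like in practice, with a hypothetical complete() stub standing in for any pretrained language model; the "learning" happens entirely in the prompt, with no parameter updates.

```python
# Few-shot prompting as in-context "meta-learning": the model is never fine-tuned;
# the task examples live entirely in the prompt.

def complete(prompt: str) -> str:
    """Stand-in for a pretrained language model's completion call."""
    return "negative"  # placeholder output

examples = [
    ("The food was wonderful and the staff were kind.", "positive"),
    ("Waited an hour and the order was wrong.", "negative"),
    ("Great value, would come back again.", "positive"),
]
query = "The room was noisy and the bed was uncomfortable."

# A handful of input -> label pairs, followed by the new input to be labeled.
prompt = "\n".join(f"Review: {x}\nSentiment: {y}" for x, y in examples)
prompt += f"\nReview: {query}\nSentiment:"

print(complete(prompt))  # a strong pretrained model tends to continue with "negative"
```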

Yeah, I think there's just some magic in the idea that a big neural net, which is general enough, and you force it to do next token, or get good at next token prediction, builds structures within itself through that automated process that reflect things about the world. And I think a priori, I had no idea that was going to happen, but somehow they stumbled on the right way, or a way, to do it. Definitely. Yeah, I think that it's...

It is really magical and something that was in the back of my mind, and I think there is a way in which the magic fades in some sense, in that I thought there's only so much information you can extract from the internet. There's only so much that you can compress from it, because it's sort of a fixed body of knowledge that's very noisy. And at some point, I thought we'd hit

a point where you're getting diminishing returns. You can imagine, you know, an infinitely large brain that's soaking up everything on the internet; that's sort of the max of how well you can do. And it was just not clear at what point we would get there, at what point we would basically get brains, these neural networks, that have sufficient capacity where you've kind of extracted almost everything there is to extract from the internet.

Yeah, it's interesting because you don't want to go to the overfitting limit where you've literally memorized everything.

the 15 trillion tokens or something. You want some intermediate where it memorized some stuff, compressed versions of some stuff, but it built some structures also that reflect relations between the information that it's seen. So it just seems very non-trivial to me. Like I think one of the things I think people with a more like kind of theoretical physics bent in the future when they have lots of these models to experiment with will probably understand that dynamics better than we understand it now.

Definitely. I think another thing that was surprising about this whole pre-training era is that typically in machine learning, you think about, you have this notion of epochs, like you train over your data set multiple times and you see where your training and tests for validation curves diverge. And that's when you know that you're overfitting. But with pre-training, you do less than one epoch.

Basically, you scan less than the total amount of data on the internet. So there is overfitting that happens, because sometimes data is duplicated. Sometimes it appears twice on the internet, like an article might be syndicated or something like this, or things might be quite similar. But generally, it's sort of less than an epoch.

Do you have a sense of whether we've hit... So in the scaling relationships that were in various papers, like the Chinchilla paper, it looks like if you want to increase model size by an order of magnitude, or the amount of useful compute by an order of magnitude, you need substantially more, maybe the square root more, data. And at least according to those relationships, it looked to me like...

mid last year that we would run out of data before we ran out of compute or potential model parameter size. Is that correct? Like, is that what people within, say, Google would say? I think that's roughly correct. That it's been harder probably to extract significant gains from the pre-training corpus than some people would have predicted. I think that there was a sort of sense of, this can, you know, just keep...

Keep scaling your model size and you'll just get progressively better and better models trained on the same pre-training corpus. And I think that we had already started seeing diminishing returns, and by we I mean as a field, so across multiple of these labs.
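For a rough sense of the scaling relationship Steve references above, here is a back-of-the-envelope sketch using the Chinchilla-style approximation that training compute is about 6 x parameters x tokens and that the compute-optimal recipe keeps tokens roughly proportional to parameters; the constants are illustrative, not the paper's exact fits.

```python
# Rough Chinchilla-style arithmetic (illustrative constants, not the paper's exact fits).
# Approximation: train_flops ~= 6 * params * tokens, with the compute-optimal recipe
# keeping roughly ~20 training tokens per parameter.

TOKENS_PER_PARAM = 20  # approximate compute-optimal ratio

def optimal_allocation(train_flops: float):
    """Split a compute budget into compute-optimal parameter and token counts."""
    params = (train_flops / (6 * TOKENS_PER_PARAM)) ** 0.5
    tokens = TOKENS_PER_PARAM * params
    return params, tokens

for flops in (1e23, 1e24):  # a 10x jump in training compute
    p, t = optimal_allocation(flops)
    print(f"{flops:.0e} FLOPs -> ~{p:.1e} params, ~{t:.1e} tokens")

# 10x more compute calls for only ~sqrt(10), about 3.2x, more tokens (and 3.2x more params),
# so data demand grows slower than compute, yet still outruns a fixed pre-training corpus.
```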

And there was a moment where... well, at first, I think that there's clearly a lot of practical value to be derived from these models. And there's a lot of stuff that could still be done, even if you exhaust the pre-training corpus, around instruction tuning, RLHF, and overall data curation, and just optimizing the architecture. So probably, even if nothing else had changed, there would probably still be

quite a bit of progress. But I think there was this fear of maybe this is as far as this idea goes. You can make it more efficient, but how do you get substantially more intelligent systems? I think the North Star is still systems that

help you do things autonomously, help you do the work that you want to do autonomously. So as a scientist, it might be, well, there's a bunch of the rote work of coding and setting up experiments and these kinds of things that you might want systems to help with, but aspirationally, you want them to also help you discover net new knowledge and be sort of a patient collaborator. And it was less clear how just doing pre-training, or pre-training with some alignment, gets us there.

Right. So were you surprised by... I guess maybe you were still at Google when the reasoning work started there? Were you involved in that at all?

So I was at Google when it started. I wasn't personally involved in the reasoning effort that was there, but I had some colleagues who were. And of course, I was working on the infrastructure and methods for RLHF and reward model training. And of course, the thing that makes these reasoning models work is the fact that it's basically the case that

Learning to reason is a reinforcement learning problem. So there's definitely collaboration going on there, but I wasn't on the team that was working on reasoning. Right. So the way I explain the reasoning, you know, the advances in reasoning in the last, say, six months or something is...

You know, you get the model to instead of giving you a quick answer, it sort of talks to itself and it learns the behavior of reasoning and it can do more if it behaves in the reasoning way rather than in the sort of just immediate response mode.

But my mental model of this, which I'd love to hear your thoughts on, is that the pre-trained model is not getting stronger, but you're getting it to behave in a different way. It's more powerful or more useful in that new mode of behavior, but you haven't really improved the underlying model particularly. Do you think that's fair? I think it's fair if you're talking about generality in that...

I think with the pre-training paradigm, the diversity of data on the internet is just very hard to recreate synthetically. But I think you are improving the capability of the model and its ability to think depth-wise. So for the distribution of data that you're training it on, be it math or coding or other verifiable data, it does kind of achieve a new capability. And

I think about it very similar to how, if we forget about language models and look at how large-scale reinforcement learning systems were trained, they typically had an imitation learning component where you learned from some human data and then a reinforcement learning component where you left off from where the human data ends and

had the model self-improve until it became super intelligent. And that was the blueprint for AlphaGo, AlphaStar, OpenAI's Dota project: imitation learning, followed by reinforcement learning. And I think the same thing is playing out now, where you can think about pre-training and instruction tuning as imitation learning, right, all this data was generated by humans. We see that now it's also generated by AIs with synthetic data, but it's primarily human-generated data.

And that gives you a starting point where the model has non-trivial reasoning behavior out of the box. It's not that it had no reasoning behavior before and now it does. It had non-trivial reasoning behavior, which is this whole line of work around chain of thought prompting that preceded reasoning. And then when you put this into an online reinforcement learning loop with a way of verifying the output that is...

that you can trust your verification. That is to say, if you can't trust your verification, then it can get hacked and your model won't reason in the right way. But assuming you figure out how to solve this type of reward hacking problem, then you're sort of reinforcing the good reasoning behavior that's already in the model. But at some point, you actually go beyond the distribution of what the model previously knew, and it's just learning net new things. And I think that's what's happened with these reasoning models is that

They've learned net new things that the pre-trained model did not know. And the net new things are actually learned in the RL phase. So if I give it some kind of math problems and it's sort of adjusting its parameters in such a way that it can succeed on these math problems, that reinforces maybe its command of change of variables or some trig identity. Is that a fair way to think about it?

Yeah, I think a fair way to think about it is exactly what you said. And I think, again, kind of the AlphaGo analogy holds here, where that system, AlphaGo and AlphaZero, learned many things, just many strategies that were not yet in the corpus of things that humans knew, this famous move 37 from AlphaGo. And

I think something similar is happening here, but maybe less, it's less drastic yet. I don't think we've seen anything close to move 37 for language models of this obvious creation of net new knowledge. But it is, you know, I guess one way to think about reinforcement learning is that it's,

a way of generating synthetic data, and by having a way of verifying which traces in the synthetic data are good or bad, you're kind of amplifying the good ones and downweighting the bad ones. And so once your agent accidentally stumbles on a strategy that worked, that thing gets reinforced and internalized. And that's where sort of the net new knowledge comes from. At first, it's an accident, but then it gets internalized into an actual strategy. I think that's

loosely how I think about these things. And we're in an interesting phase now, I think, where reinforcement learning is starting to work again on top of language models. But we're not yet at the AlphaGo moment for language models. There has not been this powerful new knowledge creation yet.
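Here is a minimal sketch of the amplify-the-good-traces loop described above, with stub functions standing in for the policy and a programmatic answer verifier; in a real trainer the advantages would weight a policy-gradient update rather than being printed.

```python
import random
import re

# Toy problems with verifiable answers: reward is 1 if the final number matches.
problems = [("What is 12 * 7?", 84), ("What is 45 + 19?", 64)]

def sample_solution(question: str) -> str:
    """Stand-in for sampling a chain of thought plus answer from the current policy."""
    guess = random.choice([84, 64, 83, 63])  # placeholder; a real model generates text
    return f"Let me think step by step... the answer is {guess}."

def verify(solution: str, target: int) -> float:
    """Programmatic verifier: extract the last integer and compare to the target."""
    numbers = re.findall(r"-?\d+", solution)
    return 1.0 if numbers and int(numbers[-1]) == target else 0.0

for question, target in problems:
    group = [sample_solution(question) for _ in range(8)]  # sample several traces
    rewards = [verify(s, target) for s in group]
    baseline = sum(rewards) / len(rewards)                 # group-mean baseline
    advantages = [r - baseline for r in rewards]           # correct traces > 0, others < 0
    # In a real system, each trace's log-probability is scaled by its advantage in a
    # policy-gradient step, which is what "amplifies" good reasoning and damps bad.
    print(question, [round(a, 2) for a in advantages])
```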

So in the DeepSeek-R1 paper, you know, they're very open about what they did. So I like reading their paper, because with Google, with, you know, Gemini, or OpenAI, I have to always guess what they're doing, but at least with DeepSeek, they're just pretty explicit. So in that paper, the vertical axis is performance on some AIME math problems, and I think the horizontal axis is maybe RL steps or something.

And it looks like the curve is bending over. Or at least the rate of increase with training is at first more dramatic and then smaller. And in fact, at the end, you could guess that it's sort of just fluctuating a little bit; if it is increasing, it's increasing very slowly. And so one interpretation of that graph might be, okay, without improving the base model in some other way that they haven't tried yet,

Even more continued RL along that direction wouldn't necessarily qualitatively improve the math ability of this model. Do you think that's plausible, or do you think maybe that's the wrong interpretation of that graph? Well, I think, first of all, the thing that's pretty universal when you look at reinforcement learning curves, with language models or before them as well, is that they tend to be log-linear. And so they...

If they ran the experiment for 10x longer, I think, well, we may or may not see something different, but let's put it this way. If the verification was good, if the way of detecting that this thing was solved correctly or not was good, and the exploration of the model was decent, that is, it was trying reasonable strategies, then you would get this laudable behavior where it basically never stops learning. Mm-hmm.

Now, it is... Reinforcement learning algorithms in practice do stop learning at some point, but it usually... Yeah, there are usually ways to overcome it. And so when you...

I mean, to give an example, going back to something like AlphaGo, is that AlphaGo never stopped improving. It became super intelligent and you could have sunk 10x or 100x more resources into it and get an even more super intelligent AlphaGo. And so in principle, these systems never stop learning. It's just a matter of how many resources you want to sink into them.

Now, with language models in RL, it's still early days. So I don't think that we've discovered the sort of massively scalable blueprint, but there's a foothold. So to give you a sense, at least the way we see it, is that even when we look at DeepSeek-R1, that is...

a more powerful algorithm than a normal RLHF algorithm, but it's still actually a fairly weak form of RL. It's what we call single-step reinforcement learning, where you have

you know, once you have... you think for a long time and then you just generate a solution, and that's basically one step. But I think the natural evolution of these systems, especially ones that act on your computer, are systems that are going to be thinking and acting over multiple steps. So they think and act, and think and act, and so forth, and there's this outer loop of, um, credit assignment across the steps. So I think we're just very early on in, um,

In the story of how reinforcement learning plays out on top of language models. Before, it was similar: let's say, when deep reinforcement learning started, DQN came out in 2013, and the arc through, you know, AlphaZero and MuZero didn't end until the late 2010s. So there were at least five years where a lot of progress was being made.
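A bare-bones sketch of the multi-step think-and-act loop with credit assignment across steps that Misha describes, using stub functions for the policy, the environment, and the end-of-task verifier; a real agentic RL setup would put a language model behind think_and_act and a browser or code sandbox behind env_step.

```python
# Toy multi-step episode: the agent thinks and acts several times, the reward arrives
# only at the end, and credit is spread back across steps via discounted returns.

GAMMA = 0.98  # discount factor for assigning end-of-episode credit to earlier steps

def think_and_act(observation: str) -> tuple[str, str]:
    """Stand-in for a policy that emits a thought and an action (e.g. a tool call)."""
    return "reason about the current page", "click_next"

def env_step(action: str) -> tuple[str, bool]:
    """Stand-in for an environment such as a browser or a code sandbox."""
    return "next observation", False

def verify_outcome(trajectory) -> float:
    """Stand-in for an end-of-task verifier: did the agent actually finish the task?"""
    return 1.0

observation, trajectory = "initial task description", []
for step in range(5):                      # think -> act -> observe, repeated
    thought, action = think_and_act(observation)
    observation, done = env_step(action)
    trajectory.append((thought, action, observation))
    if done:
        break

final_reward = verify_outcome(trajectory)
# Discounted credit assignment: later steps receive more of the final reward.
returns = [final_reward * GAMMA ** (len(trajectory) - 1 - t) for t in range(len(trajectory))]
print([round(r, 3) for r in returns])      # these returns would weight per-step updates
```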

I expect something similar to happen here, but on a compressed timeline. The amount of resources going into these things is just much larger. And I think we just move faster now. There's a lot more infrastructure. And so I suspect that instead of five years, it'll probably be a matter of a couple of years before we see something like a super intelligent language model in some meaningful areas of knowledge work.

So I think that five-year arc is not going to apply here. Got it. I think I heard you say on another podcast that, you know, we're about three years from AGI. And I think maybe what you just said is reiterating that point. So, you know, one of the things that I'm expecting, the one shoe that I'm expecting to drop, by somebody releasing a model or this type of paper, is that

You know, doing RL where the model is taking actions on someone's computer or using their browser or something. That's got to be a very fruitful thing because, as you said, it thinks, it takes an action; it thinks, it takes an action. Maybe it's trying to, like, buy something for you on the Internet or something. And it's going to get feedback on each of those steps. So you could imagine that's like a super fruitful trajectory through

the RL space. And yeah, I'm expecting someone to release a model that's just like extremely good at like doing things on, you know, Amazon and eBay and a bunch of, you know, commercial websites and things like that. So yeah, probably happen sooner rather than later. It's, it's possible. I think that this is kind of an interesting era that we're entering because

A lot of it depends on whether you can operationalize a good enough data distribution that's verifiable for these tasks. So I think that that's kind of a big question, because for some browser-based things, it's just hard to collect good enough data. There's no repository of browser-based,

say, tasks and rewards at large scale that is diverse. So when we see these reasoning models work, they work because there are diverse data pools of questions and answers for math and coding, like textbook coding. And so we know that these systems work when you have that kind of data structure. But in more practical scenarios where it's harder to access those data pools, I think you kind of have to get creative about, how do you...

If they don't exist in some easy-to-access format, is there some strategy you can invoke that will basically get you the data that you need in a clever way? So I think a lot of it depends on basically how you operationalize data collection, which is definitely a hard thing to do. And one thing that is probably even more obscure than model training or things like that: when you read the DeepSeek paper, that is the one thing where they don't really tell you anything at all, right? Yes.

And I think the other thing that's really interesting is that reinforcement learning sort of, let's say, did...

it couples to the environment that you train in. So as soon as you have, I mean, right now we have these reasoning models, but as soon as you have environments with tools, let's say for code editing or browsers or, you know, other ways to interact with a computer and you run a reinforcement learning algorithm through that, it gets coupled to those tools. And so it actually loses generality, right? Because it,

It might learn, unless you train it in some way to, if you train it in this very general reasoning sort of way, it might learn to kind of generalize to some new tools. But the system that's trained coupled to the environment is likely going to achieve much more depth-wise in that environment. So, right, if you are coupled to

let's say, you know, a coding environment and a browser and, uh, some tools for doing science. And you have some way of verifying whether you're answering, say, scientific questions, the kinds, you know, scientists care to answer, correctly. So you have, you've solved the data distribution problem. Then you train a reinforcement learning algorithm against this environment, and it will sort of really master the tools that you gave it in the environment. And, uh,

but not generalize to, let's say, tools in other environments. So there's an interesting way in which RL methods are coupled to the environments that you trained in. And this is going back to what happened with reinforcement learning before language models, in that those systems were coupled to their environments. So AlphaStar is coupled to the StarCraft environment, AlphaGo is coupled to the Go board. And I think now we'll see products where the reinforcement learning algorithms are trained for

some task set where the kind of neural network that's powering it is coupled to the environment that it was trained against. And an example of that is, you know, now I don't know if this is what's actually happening under the hood here, but when I look at a product like OpenAI's deep research that is powered by O3, it makes me think that most likely what's happening there is that the...

tools for deep research, like the web browser and indices that they use for a language model to interact with. Probably, you know, o3 or whatever reasoning model was taken, and then further trained with reinforcement learning against those tools to get something like that. So I think that's kind of maybe the next... that's a limitation, but also maybe a benefit of these systems.

So I, myself, and some other theoretical physicists I know have been experimenting with these reasoning models to just see how useful are they really for our kind of research. And one of the things I discovered is it's quite good at finding stuff and summarizing it.

But if I ask it to, you know, maybe solve an actual research level problem or think about some research level thing, it'll often come back with something that seems more reflective of the consensus in the literature, which, you know, could be wrong if it's a frontier level question I'm asking. And then the frustrating part is that if it were a grad student I was talking to at the whiteboard, I could course correct the grad student, and the grad student would immediately update their neural connections

based on what I tell them. And then they would reason correctly subsequently, incorporating that little nudge that I gave them, that update that I gave them. But what's frustrating about the models is that I might discover some faulty reasoning or even contradictory reasoning in what it gives back to me, and I point it out to the model, but it can't really update on that. It just continues to give me the same line.

And so that sort of test time learning or test time memorization, maybe you saw this Titans paper. To me, that's super interesting. It's like, what's the right way for it to be able to actually update itself at test time? Have you thought at all about that kind of thing? Yeah, I think it's an interesting problem. And in some sense, these reasoning models kind of inherit, I think,

the priors of the pre-trained model they were trained on. And again, we have to remember that the reinforcement learning methods they're trained with today are actually a pretty weak, entry-level form of reinforcement learning. And so this to me is kind of, again, the sign that we're far from the move 37 moment, or maybe not that far, you know, because again, it's a matter of two years. It's a matter of perspective. But when I say far, that's kind of what I mean.

And it very much goes back to the data distribution problem. Like, how often did the model, when it was being trained with RL or otherwise, see corrections and what the appropriate response is to that correction, and get that reinforced? And I think that there's a little bit of that. Like, these models went from

not knowing how to backtrack very well to backtracking. So now when you look at the R1's chain of thought, you'll see it oftentimes says, wait, maybe I should rethink this and go back. There are these...

pivot words, like wait or oh wait or hold on. Part of that was probably mixed explicitly into the data distribution: you can run a reward verifier, especially if you have something that's per step, see where it messes up, and then inject an oh wait in there and continue training on that. So I think a lot of that comes down to how you sort of curate the data distribution that you train on.
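A small sketch of that kind of data curation step, assuming a hypothetical per-step verifier and a hypothetical continuation call into the model; the idea is just to locate the first bad step in a chain of thought, splice in the pivot word, and keep the repaired trace as training data.

```python
PIVOT = "Wait, let me rethink that step."

def verify_step(step: str) -> bool:
    """Stand-in for a per-step verifier (e.g. checking intermediate arithmetic)."""
    return "12 * 7 = 85" not in step  # toy rule: flag the one wrong line

def continue_from(prefix: list) -> list:
    """Stand-in for re-sampling the rest of the chain of thought from the model."""
    return ["Actually, 12 * 7 = 84.", "So the total is 84."]

chain_of_thought = [
    "We need 12 * 7.",
    "12 * 7 = 85.",       # the faulty step
    "So the total is 85.",
]

# Keep everything before the first step the verifier rejects, inject the pivot word,
# and let the model continue; the repaired trace becomes new training data that
# teaches the backtracking behavior.
repaired = chain_of_thought
for i, step in enumerate(chain_of_thought):
    if not verify_step(step):
        repaired = chain_of_thought[:i] + [PIVOT] + continue_from(chain_of_thought[:i])
        break

print("\n".join(repaired))
```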

But fundamentally, I think these systems have been trained with pretty weak RL. And so for that reason, they're still... They've learned some things that were not in the distribution of the pre-trained model initially. But in terms of generating net new knowledge, it's very hard. And I actually had a very similar experience to you. I was wondering if it could reproduce my PhD thesis. That was basically what I was wondering. And my PhD thesis, even though it was a lot of work...

working on it, I can actually summarize the thing that was done fairly quickly. And if you pose the question the right way... it's a somewhat lengthy derivation, but there are only a couple of key parts that are really tricky. And effectively, my PhD was on studying the various characteristics of the fractional quantum Hall effect and doing...

Basically, perturbation theory kind of approximation of the electron density for various fractional Hall states. And basically, when you expand this thing, the first two moments are really easy to find. The first moment is basically undergrad physics. Second moment is a graduate course in statistical mechanics. And the third moment, which has this very interesting...

physical constant on it, it's kind of a geometric characteristic of a fractional quantum Hall state, was something that I had discovered during my PhD. And that one is non-trivial. That one, you need a piece of geometry to solve it. And no matter how I prompted it, it just couldn't get it. It only got the first two things that are in textbooks, basically. And it was not able to generate that new knowledge.

Yeah, I think so. I think that's the current feeling of people that it is still useful. Like if I don't know an area and I'm just trying to get a summary of what's already known in literature, it can succinctly, you know, deliver that. But pushing forward is just extremely hard for anything that's really not present in a strong way in the existing literature. Yeah, yeah, exactly. Let's come back to two years from now and see

see what happens. I think especially in physics and math, in theoretical mathematics, we might see the changes there faster than in other fields. Yep. So I know you have a hard stop, and we're about five minutes out until your next meeting. So let me just end with one last question.

What's special about your perspective on AI coming from a background of theoretical physics? Is there anything unique about the perspective that you bring? I think I asked John Shulman about this in another interview, but I'm curious what you think. It's a good question. Well, I think physics teaches you, first, physics is really hard. So when you get into AI, it's actually, AI is a lot easier than physics. At least that was my perspective, that picking it up was fine.

a lot faster. So you're kind of unfazed by the mathematics. And the thing that you have to learn, you have to learn how to code, and that's challenging, and becoming a really good engineer is very hard. But, you know, once you go through the physics grinder, I think the willpower you have to kind of learn things is sufficient. So it's all possible. I think something that's special is that

This thing, it might be obvious to physicists, of trying to understand things from some kind of simple set of first principles and deriving things from there and looking for simple solutions is fascinating.

It's not that obvious or maybe even common in AI. A common way to write an AI paper is, or was, and I'm sure it's still somewhat true in academia now, to take an architecture and make it more complex. Take your algorithm, make it more complex. Add complexity to get some performance gains and then write a paper about that.

And I think that that's actually probably the most common way of operating as a researcher in AI: you take something that exists and you push it forward by making it more complex and getting some performance gain. But that's very short-lived. I actually... I'm not even sure if I've read any impactful papers like that. That's the template for writing a paper for a conference, but I don't know if any impactful papers actually had that template. And

This perspective of coming in and actually trying to simplify things and do even the simplest thing, and also coming in with a blank slate and having kind of no preconceptions, is very helpful. An example of this was, I came into reinforcement learning with basically zero preconceptions around what works and what doesn't. And this was when people were studying reinforcement learning from pixels, when you're training for robots or video games, and there are all these questions of

RL is great, but it's not data efficient. Sometimes it doesn't work in these pixel-based environments. And one of the papers, my first paper, is a very simple thing. It just kind of tried, well, what if we just basically jittered the images, just randomly cropped them? Because maybe these systems are just always seeing the same perspective, and so they're kind of memorizing it. And

I took a good implementation by a different colleague of a reinforcement learning algorithm called Soft Actor-Critic, implemented this random cropping, basically this jittering of the camera. And lo and behold, that simple thing outperformed, at the time, basically all the state-of-the-art algorithms that had this additional level of complexity.
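A minimal sketch of that random-crop jittering on image observations, assuming NumPy arrays; in the work described it was this kind of augmentation applied to the frames fed into an off-the-shelf actor-critic learner, not a new algorithm.

```python
import numpy as np

def random_crop(images: np.ndarray, out_size: int = 84) -> np.ndarray:
    """Randomly crop a batch of (N, H, W, C) observations to out_size x out_size.

    Each frame gets its own random offset, so the agent sees slightly jittered
    viewpoints of the same scene instead of one fixed perspective it can memorize.
    """
    n, h, w, c = images.shape
    cropped = np.empty((n, out_size, out_size, c), dtype=images.dtype)
    for i in range(n):
        top = np.random.randint(0, h - out_size + 1)
        left = np.random.randint(0, w - out_size + 1)
        cropped[i] = images[i, top:top + out_size, left:left + out_size, :]
    return cropped

# Example: a batch of 8 padded 100x100 RGB frames cropped back down to 84x84,
# which would be fed to the actor and critic in place of the raw frames.
batch = np.random.randint(0, 256, size=(8, 100, 100, 3), dtype=np.uint8)
print(random_crop(batch).shape)  # (8, 84, 84, 3)
```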

So I'm not saying that that was a particularly beautiful or impactful idea, but it was just surprising to me that no one had tried this very carefully before. And so I think that's an interesting perspective that physicists bring, sort of trying to simplify the problem as much as possible into its core principles. And in some cases, I would say some of the most impactful work that's come out of physicists coming into AI has been the work on scaling laws.

Like this was taking a perspective of like scaling laws that occur at critical temperatures, you know, in theoretical physics or around critical like phase transitions. And noticing that, you know, there are these like scaling laws that have sort of, you know, universal physical characteristics attached to them and...

That perspective that something like that might be happening when you're training these deep learning models was not obvious to people at the time. And so the folks who led that work at OpenAI on scaling laws were not even former physicists. They were either very recent former physicists or current physicists at the time. I think Jared still has a job at Hopkins, I'm not sure. But yeah, who can tell, he's maybe technically still a physicist. But hey, I don't want to make you late for your next meeting, so...

Really enjoyed this conversation. Maybe have you back in two years when we have AGI in our pocket. And thanks so much. Yeah, of course. Thanks for having me, Stephen.