
Prof. Jakob Foerster - ImageNet Moment for Reinforcement Learning?

2025/2/18

Machine Learning Street Talk (MLST)

People
Chris Lu
Topics
Jakob Foerster: I think deep reinforcement learning has failed to live up to its potential over the past decade mainly because it never gained an advantage on hardware. Deep learning is perfectly suited to GPUs and can churn through data efficiently, whereas the traditional way of doing deep RL runs the environments on the CPU and the agent on the GPU. That split created complexity in both algorithm design and hardware requirements, made experimentation very slow, and left the algorithms fundamentally brittle. We are now in the middle of a revolution that, for the first time, lets us run environments and agents jointly on the GPU, which should finally make deep RL a winner and let it have real-world impact. I believe that by accelerating the experimental loop we can develop more robust methods and use meta-learning to discover algorithms that generalize. Chris Lu: To automate and scale up the discovery of machine learning algorithms, we need far more compute. Because of hardware constraints we could not run normal reinforcement learning algorithms at the required scale, so we had to look for simple environments that could run quickly on the GPU. Putting the environment on the GPU turned out to be very effective, but at the time very few environments could be implemented that way, and implementing them was hard. JAX is a library developed by Google, similar to PyTorch; it has the same interface as NumPy plus extra features such as jit and vmap, which enable faster execution and vectorized mapping of functions.


Shownotes Transcript


The ARC challenge as a target for the community is a terrible idea. That's not what it's supposed to be. It's not supposed to be something where we then design methods to solve the ARC challenge. Fundamentally, AI is trained on the collective outputs of humanity. This technology belongs to everyone, including people we don't like. It centralizes resources in the interest of the common good, in the interest of the public.

and not in the interest of maximum profit. Because to me, the biggest alignment challenge is not between AI and humans. The biggest alignment challenge is between those people who hold the keys to power, who control these systems, and the rest of the population. Jakob, welcome to MLST. Thank you for having me. Great to be here. It's amazing to have you here. Tell us a little bit about your background. I run FLAIR, the Foerster Lab for AI Research, at the University of Oxford,

which is nowadays about 30 people working on anything that is cutting edge, interesting, not supervised learning, thinking beyond the current state of the art. I do this 50% of my time, and the other half I spend at the fundamental AI research group at Meta.

Oh, amazing. Well, I watched your talk at ICML earlier and it was really good because you were sketching out a potential ImageNet moment, you know, like an Alex Krizhevsky moment for reinforcement learning. What was the elevator pitch? So I think reinforcement learning in the last decade or so really hasn't lived up to its potential, which raises the question, why is it that deep learning has been this revolutionary success story, while deep reinforcement learning, which held that great promise,

really hasn't quite delivered in terms of real-world impact. And we have a hypothesis at FLAIR, which is that deep reinforcement learning had lost the hardware lottery, because deep learning is perfectly suited for the GPU.

We can really keep all the cores busy, churning through data efficiently. In contrast, the way that deep reinforcement learning has been done in the field is running reinforcement learning environments on the CPU, but running the agents on the GPU. And this has meant all sorts of complications and complexity, both in algorithm design and in terms of what hardware we need and how we're going to develop algorithms for this,

which has really slowed down the field and has made experimentation very slow and difficult. And we're currently in this revolution, really, which is for the first time allowing us to run environments and agents together jointly on the GPU, making deep reinforcement learning finally a winner

in the hardware lottery. And then hopefully this will be the step to really making reinforcement learning work in the real world. And I suppose it's not just the hardware lottery, it's also the bitter lesson.

which is this idea that when we scale up compute, we get dramatic performance improvements, but the thing holding back deep reinforcement learning to date has been this bottleneck that so much of it is running on the CPU. It does remind me, though, of that article by Alex Irpan in about 2018, where he was saying deep reinforcement learning doesn't work yet.

And he was also pointing out some of the other problems, which is that, you know, if you change anything or you do anything wrong, the whole thing's broken. So why is reinforcement learning so sensitive to the architecture and the parameters? I think what has happened is because experimentation has been so slow, we've only been able to train our algorithms on very specific environments. And we've only been able to hill climb on that very small set of environments, which means the things are fundamentally brittle.

And now for the first time, the experimental loop has been sped up by orders of magnitude, which means that now we can really start to develop methods that are robust. Remember, as scientists, we ourselves are doing meta-learning. We're trying to discover methods

that can generalize. MLST is sponsored by Sentinel, which is the compute platform specifically optimized for AI workloads. They support all of the latest open source language models out of the box, like Llama, for example. You can pay on consumption, essentially, or you can have a model which is always working or it can be freeze-dried when you're not using it.

All of the models that they deploy support the OpenAI API specification out of the box, which means it's just a one-line change in your application to switch over to Sentinel and start saving money and make your application go faster. But to do so, we have to be able to actually get a lot of samples and get a lot of experience for how our methods actually work. And in deep reinforcement learning, any single run has taken a lot of cost, a lot of compute, and a lot of time.

And that means the signal we were getting as researchers has been costly, has been sparse, and has been noisy. And that meant our methods are bad. And the premise, from the bitter lesson, is that if we can churn through more data, we can get a better signal, get better gradient updates for either the researchers or our meta-learning methods

to then optimize our algorithms to be more robust, more sample efficient in the real world. Can you give me an example? So certainly when we need physical experience, the bottleneck is we need to have physically embodied agents gathering, you know, stuff. And we can use simulators for that. But if I want to learn Quake or Dota or something like that, don't I need to actually have the game running on my machine? So I think this is a general problem, which is what happens when real world experience is expensive.

And obviously what we've been really good at as a field is using real-world data when it's given to us. But we're going to run out of real-world data at this point, right? So we've been in this regime where we could lazily scale up compute and data because data sets were large enough to be accommodating for more amounts of compute. But now we're hitting the data wall. And the question then is, if I can't get more data from the real world, how can I use simulation

in environments that are not perfect and not exactly the same as the real world, but are blazing fast and allow me to get a lot more data in a way that's fully synthetic. And that's really, I think, one of the key questions right now for the field, not just for reinforcement learning. How can I develop algorithms, discover algorithms in approximate versions, for example, of DOTA,

that run 10,000 times faster in a way that then the algorithms we discover will generalize to the real Dota environment, which is slow, which is expensive to run, and so on. And I put this under the umbrella of compute-only scaling. If you just give me tons and tons of compute, what kind of methods can I develop

that will allow me to make progress on these simulated situations, environments, scenarios, so that the learning progress transfers to downstream tasks in the real world

that are slow and expensive to run. So Tufa Labs is a new AI research lab I'm starting in Zurich. It is funded from Paz Ventures, involving AI as well. And so we are a Swiss version of DeepSeek: a small group of people, very, very motivated, very hardworking. And we try to do some AI research starting with LLMs and o1-style models. What we're looking for now is a chief scientist and also research engineers. You can check out positions at tufalabs.ai.

There's two things we can go into, I think. In a little while, we'll go into things like unsupervised environment design and curriculum learning and ways of generating all of this data. But before we get there, though, there's this matter of how did you make this run all on the GPU? I think at this point, I think it's best just to, you know,

refer to the absolute expert on all of this, the mastermind behind the technical innovation that is powering this revolution of the hyperscale. Hi, I'm Chris. I'm a PhD student that was with Jakob. I technically have not graduated yet, so I'm still a PhD student, Jakob. And so I mostly study things like automating machine learning algorithm discovery. So basically this is, can we find ways to discover new machine learning algorithms and machine learning insights automatically?

And to do this, we need to use way more compute, because you can imagine even your average AI scientist is using tons of compute. To then automate this and scale it up, you need that exponentially. And so obviously in an academic setting, we don't have so much compute. In fact, when we first started at FLAIR,

I don't know if we had any compute really set up. I think we might have had some Colabs, essentially, like Google Colab, the free tier of compute. And we wanted to run some basic reinforcement learning experiments for a paper called Model-Free Opponent Shaping. This is a paper where you're trying to learn across the entire learning trajectory of another agent in order to influence the way it learns. And so this requires many training iterations of a single agent.

And so given just the hardware restrictions we had, we couldn't even run normal reinforcement learning algorithms at that scale. So we had to look into just using really simple environments that we could write in PyTorch and just implement this in the Colabs that can run quickly on the GPU.

And the paper had really cool results. And we were really surprised by basically how effectively putting the environment on the GPU was at the time. But at the time, there were very few environments that could be implemented in this way. And it was just really hard to actually implement them. PyTorch is designed for neural networks and things like that. And so to kind of use arbitrary environment code in PyTorch was hard.

So that's kind of where JAX came in. So JAX is a library by Google that's similar to PyTorch. And one of the neat features about JAX is that it has the same interface as NumPy. So if you know how to code in NumPy in Python, then you can code in JAX. So these are both libraries that are designed to allow you to use Python to run things on the GPU.

This is how we train all of our neural networks. This is how we train any language model, things like that. The thing about JAX is it's developed by Google, but also basically it has a separate interface from PyTorch. PyTorch is the thing that most of you are probably more familiar with, whereas JAX has a few extra features on top of it. The key ones are: one is called jit. This allows you to compile your programs for the GPU. Basically, at a high level, this just means that it'll usually run faster than PyTorch in a lot of settings. And the second one is called vmap.

So, vmap stands for, I believe, vectorized map. And the idea here is that if I write a function to run, let's say, addition, right? You can imagine I just need one core to add two numbers together. When I vmap this addition function, it turns into a vector add, right? So I can add two vectors together. And if I vmap again, it turns into a matrix add. And it's this idea of, I can just write this one simple function for one instance of my environment. So I can just write one instance of CartPole in normal NumPy,

and I just call vmap on this function, and now I can run millions at the same time. And so now any environment that you can think of is really easy to implement in JAX. And so for our full paper scaling up this approach, we used JAX to just implement everything entirely on the GPU. Of course, back in the MATLAB days, vectorization was always what we wanted: to take one operation and spread it out and parallelize it so it runs on many of those little cores, in this case in the GPU.
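A minimal sketch of the pattern Chris is describing, assuming a toy CartPole-style step function (the dynamics, constants, and batch sizes below are illustrative, not FLAIR's actual environment code): write the physics for one environment instance in NumPy-style code, then vmap it over a batch and jit the result so it runs as a single fused GPU program.

```python
import jax
import jax.numpy as jnp

def step(state, action):
    """One CartPole-like step for a single environment instance.
    state = [x, x_dot, theta, theta_dot]; dynamics and constants are toy values."""
    x, x_dot, theta, theta_dot = state
    force = jnp.where(action == 1, 10.0, -10.0)
    theta_acc = 9.8 * jnp.sin(theta) + jnp.cos(theta) * force   # not real cart-pole physics
    x_acc = force + jnp.sin(theta) * theta_acc
    dt = 0.02
    next_state = jnp.array([x + dt * x_dot,
                            x_dot + dt * x_acc,
                            theta + dt * theta_dot,
                            theta_dot + dt * theta_acc])
    reward = 1.0 - jnp.abs(theta)          # reward for staying upright
    return next_state, reward

# vmap turns the single-instance function into a batched one over the leading axis;
# jit compiles the whole thing into one fused program on the accelerator.
batched_step = jax.jit(jax.vmap(step))

states = jnp.zeros((1_000_000, 4))                     # a million environments in parallel
actions = jnp.ones((1_000_000,), dtype=jnp.int32)
next_states, rewards = batched_step(states, actions)
```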

But something like CartPole might be an interesting one. So we want to capture environment dynamics in the GPU. Is there any limitation? Because obviously we could write an environment in Python, and Python's got lots of cool things: it's a very rich language and you can do iterative conditional logic and stuff like that. How is it different in JAX? Right, I mean, it's very similar, right? Because NumPy is, I think, a lot of what people use Python for, right? So if it's in NumPy, you can more or less do it in JAX.

There are some scenarios where JAX is worse. So this would be cases where you have a lot of if statements and branches. But I think some of the recent work from our group has shown some really crazy things you can do with just JAX. So for example, a recent paper by Mikey and Michael from our group is called Kinetix. This is basically a general physics simulator on top of a general renderer that is written entirely in JAX.
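On the branching caveat Chris mentions above: under jit and vmap, Python `if` statements on traced values don't work, so condition-dependent logic is usually written with `jnp.where` or `jax.lax.cond`. A hedged sketch with made-up reward logic:

```python
import jax
import jax.numpy as jnp

def step_reward(position, velocity):
    # A Python `if position > 1.0:` here would fail under jit/vmap, because
    # `position` is a traced array. jnp.where computes both sides and selects.
    return jnp.where(position > 1.0, 10.0, -0.1 * velocity ** 2)

def goal_bonus(position):
    # lax.cond is the alternative when the branches are genuinely different functions.
    return jax.lax.cond(position > 1.0,
                        lambda p: 10.0 - 0.0 * p,   # constant bonus (written via p so both branches share a dtype)
                        lambda p: -0.01 * p ** 2,   # shaping penalty otherwise
                        position)

rewards = jax.jit(jax.vmap(step_reward))(jnp.linspace(-2.0, 2.0, 8), jnp.ones(8))
bonuses = jax.jit(jax.vmap(goal_bonus))(jnp.linspace(-2.0, 2.0, 8))
```

Under vmap, `lax.cond` gets lowered to a select, so both branches are evaluated for every element; that is part of why heavily branchy environments benefit less from this style.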

So with Kinetix, you can imagine any reasonable environment you can make, you could make using that simulator and renderer. Very cool. Are there any examples where, 'cause you know we were saying before that we need to have a sketch. So we don't have the source code for Dota, but we want to capture as much of the dynamics as possible so that we can build agents that learn.

How can we do that? Yeah, I mean, one interesting way is you can just try to build a model of Dota, right? Like learn a model. You've seen recent works where people can learn like models of video games like Minecraft or the recent works in Genie, right? You can generate these video games. And so once you're able to do that, you can sample from them much faster because your neural network will run on the GPU. Yeah, that's actually really cool. You know,

There's a new version of Genie as well out now, and it hadn't really occurred to me that just having a dynamics model for the purpose of training is enough, you know, because we can chain together all of these things much more easily if we don't have access to the original source code. So what kind of performance speed-ups are we talking about here? Around 4,000 times, I think, is our basic speed-up. Yeah, and I think there's also room for a lot more speed-up. We're just using JAX naively, but

there's tons of room for optimization if you go into lower-level writing of certain kernels. Yeah, because when I watched your talk, Jakob, you were saying that in the olden days there were ways of distributing and parallelizing and so on, but it just created so much complexity. And with this new method being so much faster, even as a small lab you can do the kind of experiments that the big boys would have been doing before. And it's not just being able to do experiments that the big boys could do but

we couldn't do, it's also being able to simplify algorithms, right? It's being able to take out a lot of the complexity that was built in, that makes it difficult to understand what these algorithms are doing, and being able to say, let's actually just write these very clean algorithms. So we have this paper from our lab, PQN, Parallelised Q-Network. It's just extraordinarily simple. It's basically just a Q-learner where a lot of different agents across different cores step through the environment and learn on every transition.
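A rough sketch of that idea, not the PQN paper's actual algorithm: a single Q-network is updated on every batch of fresh transitions coming from many vectorized environments, bootstrapping from the current network itself. The network, optimizer, and hyperparameters below are illustrative assumptions.

```python
import jax
import jax.numpy as jnp
import optax                      # assumed available for the optimizer
import flax.linen as nn           # assumed available for the network

class QNet(nn.Module):
    num_actions: int
    @nn.compact
    def __call__(self, obs):
        x = nn.relu(nn.Dense(128)(obs))
        return nn.Dense(self.num_actions)(x)

q_net = QNet(num_actions=2)
params = q_net.init(jax.random.PRNGKey(0), jnp.zeros((1, 4)))
optimizer = optax.adam(3e-4)
opt_state = optimizer.init(params)

def td_loss(params, obs, actions, rewards, next_obs, dones, gamma=0.99):
    q = q_net.apply(params, obs)                                    # (batch, num_actions)
    q_taken = jnp.take_along_axis(q, actions[:, None], 1).squeeze(-1)
    # Bootstrap from the *current* network: no separate target network.
    next_q = jax.lax.stop_gradient(q_net.apply(params, next_obs).max(-1))
    target = rewards + gamma * (1.0 - dones) * next_q               # dones are float 0/1 flags
    return jnp.mean((q_taken - target) ** 2)

@jax.jit
def update(params, opt_state, batch):
    loss, grads = jax.value_and_grad(td_loss)(params, *batch)
    updates, opt_state = optimizer.update(grads, opt_state)
    return optax.apply_updates(params, updates), opt_state, loss

# Each iteration: step thousands of vectorized environments (e.g. the batched_step
# above), then learn from that batch of fresh transitions immediately, with no replay buffer.
```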

No more target networks, no more replay buffers. All of this is gone and I think that just hopefully will allow the field to come up with much more beautiful and understandable and therefore also robust algorithms in the future. Talk to me about drift functions and objective optimisation.

Yeah, so this is one of the first things we did at FLAIR. We had a paper that was called "Mirror Learning" that basically provided a theoretical framework that, to me at least, provided the first intuitive understanding of why things like PPO actually work. And what this framework said is, as long as we have a penalty term that penalizes the difference between the policy that collected the data and our current updated policy,

that obeys certain properties, then we will, in the limit of doing many policy updates, converge to an optimal policy. That theory framework was really nice, but what it also allowed us to do is to say, why don't we learn a drift function? Because PPO, that clip, is just one of many possible algorithms that can be expressed in the mirror learning space. And we thought, well, maybe this isn't the optimal one. There must be better ones. So what then we set out to do is to parameterize this drift function as neural networks.
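A hedged sketch of what "parameterize the drift function" could look like: PPO's clipped surrogate is one hand-designed choice, and the learned variant replaces it with a small MLP over (importance ratio, advantage) whose weights become the meta-parameters. The architecture and the way the mirror-learning constraints are imposed here are illustrative assumptions, not the paper's construction.

```python
import jax
import jax.numpy as jnp

def ppo_surrogate(ratio, adv, eps=0.2):
    # The hand-designed baseline: clip the ratio and take the pessimistic value.
    return jnp.minimum(ratio * adv, jnp.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)

def mlp_drift(drift_params, ratio, adv):
    # Learned drift penalty f(ratio, advantage), a tiny two-layer MLP.
    x = jnp.stack([ratio - 1.0, adv], axis=-1)
    h = jnp.tanh(x @ drift_params["w1"] + drift_params["b1"])
    out = (h @ drift_params["w2"] + drift_params["b2"]).squeeze(-1)
    # The drift should be non-negative and vanish (with zero slope) when the
    # policies coincide at ratio = 1; this is one crude way to bake that in.
    return jax.nn.relu(out) * (ratio - 1.0) ** 2

def learned_surrogate(drift_params, ratio, adv):
    # Vanilla policy-gradient term minus the learned drift penalty.
    return ratio * adv - mlp_drift(drift_params, ratio, adv)

# Hypothetical meta-parameters; in practice these are what the outer loop optimizes.
k1, k2 = jax.random.split(jax.random.PRNGKey(0))
drift_params = {
    "w1": 0.1 * jax.random.normal(k1, (2, 32)), "b1": jnp.zeros(32),
    "w2": 0.1 * jax.random.normal(k2, (32, 1)), "b2": jnp.zeros(1),
}
```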

But obviously you can now imagine there's a question: how do you actually optimize through that entire reinforcement learning loop? And there's been a lot of work in the field on meta-gradient estimation and so on, which was always trying to estimate this sort of derivative by unrolling the computation graph and differentiating through it. And actually I'd done some of this work in my PhD, coming out of the multi-agent shaping work again. And we pursued this path because that's what everyone thought was going to win.

But then we also pursued evolution strategies, which does not do any of the sophisticated maths, but instead swallows the bitter lesson and doubles down on it by just doing black box optimization, trying to estimate those higher derivatives from samples. And that turned out to be extremely well suited for the new paradigm of RL at the hyperscale.
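A minimal sketch of that black-box route, in the style of OpenAI-ES with antithetic sampling (their actual setup will differ): perturb the meta-parameters, run a full inner training loop for each perturbation, and use the resulting returns as fitness. `train_and_evaluate` is a hypothetical stand-in for a jit/vmap-compiled inner RL run, and `meta_params` is assumed to be a flat vector for brevity.

```python
import jax
import jax.numpy as jnp

def es_step(key, meta_params, train_and_evaluate, pop_size=64, sigma=0.02, lr=0.01):
    """One evolution-strategies update on the meta-parameters.

    train_and_evaluate(perturbed_params) -> scalar fitness, e.g. the mean return
    of agents trained with that drift function. It must be JAX-traceable to vmap.
    """
    keys = jax.random.split(key, pop_size)
    noise = jax.vmap(lambda k: jax.random.normal(k, meta_params.shape))(keys)
    # Antithetic pairs (+eps, -eps) reduce the variance of the gradient estimate.
    fitness_pos = jax.vmap(lambda n: train_and_evaluate(meta_params + sigma * n))(noise)
    fitness_neg = jax.vmap(lambda n: train_and_evaluate(meta_params - sigma * n))(noise)
    advantage = fitness_pos - fitness_neg                          # (pop_size,)
    gradient_estimate = (advantage[:, None] * noise).mean(0) / (2.0 * sigma)
    return meta_params + lr * gradient_estimate
```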

So there was this beautiful figure in your talk at ICML, right? So you are visualizing the gradients of the objective function, I think, first of all, for PPO. And it has this kind of step, right? So it's pulling it back to the clip if it drifts too much from the source data and so on. And the interesting thing about that is it's human designed. And this is what Rich said we shouldn't do, right? So we're a bunch of experts in reinforcement learning. And we externalize our intuition into this function.

Now, what you guys did is you meta-learned this function and then you visualized it. What did you see? So the interesting thing is it recovered some of the features that we had seen in PPO. So there is this clip-like behavior, but there are a few aspects that were novel. So, for example, the clip, counterintuitively maybe, has this sort of like too-good-to-be-true type of behavior where you're okay to update things

to have a positive gradient if you're beyond the clip region as long as your advantage isn't too high. So you would actually think intuitively, if I have a large advantage, if things turned out much better than I thought, I should be more optimistic and go away further from the reference policy. But instead, this clip function learned the opposite, where if the advantage is high and you've moved away further, then you have to stop there. But if the advantage is small,

then you can move further away from the reference policy. It's almost like this sort of, oh, is it too good to be true? Then you should just stay here. Yes. But if there's a small advantage, it's okay for you to keep moving. Yes, cautious optimism. Exactly. And then the other thing that this process discovered, that we then realized had actually been discovered by humans before, is rollback. So while the PPO objective says if you have gone...

too far away from the reference policy and you have a negative advantage, you're not going to get a gradient. You're going to stay there in that lower left quadrant. Instead, DPO discovered you should actually go back. We're going to push you back to that reference policy.
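An illustrative toy comparison of the two behaviours being described, not the exact drift function discovered in the DPO paper: in the lower-left quadrant (ratio below 1 - eps with a negative advantage) PPO's clipped surrogate is flat, while a rollback-style objective keeps a gradient that pushes the ratio back toward the reference policy.

```python
import jax
import jax.numpy as jnp

def ppo_objective(ratio, adv, eps=0.2):
    return jnp.minimum(ratio * adv, jnp.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)

def rollback_objective(ratio, adv, eps=0.2, alpha=0.3):
    ppo = ppo_objective(ratio, adv, eps)
    # In the lower-left quadrant PPO is flat; add a term that rewards moving the
    # ratio back up toward the trust region, so the gradient "rolls the policy back".
    in_lower_left = (ratio < 1.0 - eps) & (adv < 0.0)
    pullback = alpha * jnp.abs(adv) * (ratio - (1.0 - eps))
    return jnp.where(in_lower_left, ppo + pullback, ppo)

# In that quadrant the PPO gradient w.r.t. the ratio is zero; the rollback one is positive.
g_ppo = jax.grad(ppo_objective)(0.5, -1.0)        # -> 0.0
g_rb  = jax.grad(rollback_objective)(0.5, -1.0)   # -> positive, pushes the ratio back toward 1
```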

Yeah, that's absolutely fascinating. And also there were secondary features which it found which, you know, no human had even designed or thought of before. And I think also this is where sort of our interpretability effort comes in. Chris and I spent a lot of time, along with Alistair, one of the authors of the paper, trying to make sense of these features and

We don't know what they are. I think this is an open problem. It would be fascinating to figure out what is going on there, and whether this is actually doing something real. Maybe one day there's going to be another paper that explains what those higher-order features in the middle, around zero advantage and zero deviation from the policy, actually are, because we just don't understand them. That makes sense. That makes sense. Okay, so we've meta-learned this new optimization set of gradients.

And it's a little bit slow. So I think what you guys did was you now say, okay, well, can we represent this in a closed form solution? But also, can we do new theory based on this? So there's this virtuous cycle of discovery. So I think the hope here is that ultimately, at that point, we hadn't yet transitioned to let's just do science end-to-end with AI agents. So having the human in the loop who can interpret this

and get back a symbolic representation of the drift function was really important to us. It has a second advantage, which is then suddenly you can break out of the JAX box. Remember, the environments we're going to use in JAX will not be the real world. We can't implement every problem. But I think what we can do is we can have a representative set of types of challenges that make the learning algorithms that we discover transfer to the real world.

transfer to settings in other simulators and also to learn world models and so on. And having a symbolic representation, where we can just write down that drift function in one line of Python, is a really nice way to make sure it transfers to different downstream tasks and to other code environments. What else did you notice about this policy that surprised you? Didn't you say that it explored much more than before?

Yes, it had an implicit entropy regularization, and this is something that we've also doubled down on ever since. What we've done in follow-up work is we've said, fine, so humans can design these clip functions, but something that humans certainly can't do is design a clip function that is time dependent. And time being how far into the optimization process the learning algorithm is.

And this sort of temporally aware version of DPO turned out to be very explicit about trading off exploration early on, but then becoming more conservative later in the optimization process.
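A hedged sketch of the "temporally aware" idea: make the trust-region width a function of training progress t in [0, 1], loose early for exploration and tight late for stability. The linear schedule below is illustrative; the meta-learned schedule is what the actual work discovers.

```python
import jax.numpy as jnp

def eps_schedule(t, eps_start=0.4, eps_end=0.1):
    # t = current_step / total_steps; a simple linear anneal as a placeholder.
    return eps_start + (eps_end - eps_start) * t

def time_aware_surrogate(ratio, adv, t):
    eps = eps_schedule(t)
    return jnp.minimum(ratio * adv, jnp.clip(ratio, 1.0 - eps, 1.0 + eps) * adv)
```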

And obviously this is a huge design space, because now we're designing a clip function for every point in time. And this is again where meta-optimization shines. The only thing we haven't been able to do in that case is to go back and say, can we find a parametric version of this kind of clip function manifold, or mirror function manifold, which has one mirror function for each time step in the optimization process? This sounds very, very complicated, but the good news is that now with LLMs,

we could actually use LLMs to try and fit that black box drift function with symbolic code. One other thing is, first of all, you've open sourced all of this code.

So, first off, I know you're a huge fan of open source, Jakob, and we're going to be talking about that in a minute. And yeah, so using these new methods that you're talking about, increasingly we can use LLMs as engines of creativity. And we can actually have an additional meta stage where we can have some kind of engine which can create the meta-optimizing RL system. Yes, I think this is something we're starting to explore. We've done a few papers now at FLAIR, and a few works in progress,

which is rather than using ES and black box function approximators, let's explore in the space of programs and again use JAX at the hyperscale to get relatively fast feedback on the different members and then use LLMs as the mutation operator to explore in the space of programs using the fitness, which is the performance on the downstream tasks

of the reinforcement learning algorithms that we're exploring as the mutation signal. I think this opens up an entire new space of automated reinforcement learning, as we discussed before, which then obviously makes the question of how do we prevent this from overfitting even more important. Right, so there's Goodhart's law, which says when a measure becomes a target, it ceases to be a good measure. And this has already happened when we do science with, like, graduate student descent, using lots of exploration and trial and error to optimize our benchmarks,

but now imagine if we can scale this up by automated research. So this is something else we've been talking about a lot at FLAIR, which is how do we make sure that the algorithmic progress we find in the meta loop actually transfers downstream. What's the right framework to think about meta-train and meta-test? Along what axis should we be generalizing? How do we know that this is real? On the subject of creativity in LLMs,

I mean, I just use the ARC challenge as an example. I mean, you know, many people try to solve it just in a very formal way, you know, just doing discrete exponential searches over DSLs and stuff like that. And the way humans go about the problem is it's very heuristic, very creative, very serendipitous, as Kenneth Stanley would say. We're all huge fans of Kenneth Stanley here. So why is it the case that LLMs are so good at capturing our instinct? So...

I think there's two answers on the ARC challenge very, very quickly. I think it's actually a word of caution on ARC, because remember, when a measure becomes the target, it ceases to be a good measure. And I think the ARC challenge as a measure of progress is brilliant. It shows that our systems are lacking fundamental capabilities. But the ARC challenge as a target for the community

is a terrible idea. Because that's not what it's supposed to be. It's not supposed to be something where we then design methods to solve the ARC challenge. And this is where open-endedness comes in. We'd like to have methods that can solve a broad range of diverse tasks. And the ARC challenge is one example in that space, which means we have to target the entire space

of human-solvable problems. Exactly. And we can use LLMs, I think, to help span that space. Yes, I mean, there's the question of whether LLMs can span the convex hull of creativity. And we can talk about creativity as well. I love discussing combinatorial creativity versus inventive creativity. But let's say, for argument's sake, that the convex hull thing is good enough. Then we've got the question of, Chollet wanted developer-aware generalization. So he doesn't want a solution just for ARC.

And most of the solutions are not in the spirit of ARC. They use test-time active fine-tuning, test-time training and so on, you know, with various different methods. And they are great methodologies for rapidly, in a human-supervised way, solving a task like ARC, but it won't generalize from its initial instantiation to another task. How can we cross that bridge? There's two answers. One is, as a community, be much more clear about measures and targets.

Because we use the term benchmark, but a benchmark is supposed to be a measure, not a target. And what this means in practice is we have to be much more careful about addressing broad problem spaces, and then just having the benchmarks be one instance in that entire open space of problems. And never use the benchmark during the development process of our methodologies. What this would mean is I'm working not on ARC,

but I'm working on human-level reasoning capabilities. And in my entire pipeline of methods, of design, of training, I never use ARC. I only use it once a year to measure my progress. And I'm not just using ARC, I'm using other examples like ARC where LLMs struggle, where humans can make progress. That's one option. And the other option is, rather than having unique benchmarks where we can confuse the measure and the target,

We generate, we have methods for benchmark design, we make that a first-class citizen of our scientific progress, which means we're trying to generate benchmarks that span the entire space of problems, which means then if you hill-climb this entire space of problems, you are hill-climbing all of human capabilities. And I think we haven't made that much progress on the latter. So I think for now, just being very clear about the measure and the target in the community is super, super important.

What's the relationship between creativity and reasoning? That's a good question. I think creativity allows at least me to come up with new reasoning challenges. If I think about how I go about my day, it's commonly using creativity to try and create new problems for myself and frankly for the lab and for the research community, and then also explore the space of solutions.

And one of the skills they need in solving these problems is then reasoning. So it's basically allowing me, or researchers in general, I think, to explore the space of interesting and relevant problems that can then be used to train our reasoning capabilities, much like curriculum design. So I think creativity is a great driver for figuring out what problems are interesting

And I think this is one of the key challenges right now for open-endedness, which is what actually constitutes an interesting problem. Because obviously, if we just rely on LLMs, we're going to make the measure the target. We will start Goodharting the judgment of LLMs about what is interesting, and at some point just find examples that exploit the inaccuracies

of these LLM judges. But to what extent is reasoning itself a creative process? I mean, even something trivial like deduction. So we're searching the deductive closure, we're doing, you know, we're traversing all of these different things and we find a trajectory. So we've essentially composed together a new piece of knowledge, we evaluate it, it works really well. You might just say, oh, that's doing deduction. I think it's a creative process. I probably would argue that it depends on how structured your search space is. So for example, to me,

Playing chess the way that a human does has a strong flavor of creativity, because you can't simulate trillions of time steps. You have to solve this problem differently. You have to actually try and find an intuitive approach to it that finds unusual new pathways and patterns. And that sounds creative. If I look at the way that this was solved in the Goodharted way of doing AI for games in the past, that's quite brute force. That's effectively...

just number crunching the game. And that doesn't seem very creative to me. Does that make sense? So I think it's less about what are we doing, but how are we doing it, which goes back to the measure and the target. Because if I use chess as a measure, then I only get a human compatible number of samples. I can't just brute force or number crunch the game.

I have to be creative, I have to explore, I have to play, I have to use imagination. But if I'm allowed to use this game as a target, like DeepMind did it, and of course it was great work at the time, but the methods haven't really transferred to other domains, then suddenly I can turn these beautiful imagination problems into number crunching. Yes. And do you think this meta layer is the way to get that generalization? That's my hope. I think my hope is that if we use the fact that we can number crunch, but we're not going to number crunch

specific policies for specific problems, but we're going to use number crunching for finding, sharpening our intuition about algorithms and about sample efficient methods, about methods that can use imagination, that could plan on new domains, that can explore, then we will have the best of both worlds. We will use the compute, we will use the efficient samples that we can get, but we will not use them to overfit on specific problems, but to sharpen our intuitions and to automate scientific discovery and accelerate the development

exploration of extremely sample-efficient algorithms that then hopefully can have human-like capabilities. Because, so my mental model is, the reason that we're so sample-efficient is because we're the result of an extraordinarily sample-inefficient process called evolution. We've been meta-optimized on this evolutionary timescale with vast sample inefficiency, which has now allowed us to have this final product, which is a meta-learned agent

that can deal with new situations, come to a recording studio and be underslept and still make sense. Tell me about agents in the general sense. I have a deeply held conviction that agents give you something above and beyond building a monolithic system. Yeah, so I think there's been a long-term hypothesis I've had as a scientist, which is intelligence is an emergent phenomenon

of multi-agent interactions. That the reason we have our capabilities of abstraction, of language, of reasoning, of communication, is because we interact in extraordinarily complicated environments where the most complex parts are not doors and bananas and apples and lions, but other agents like us that force us to reason over others, to use theory of mind,

learn from each other, teach each other, coordinate, communicate, cooperate. So when you say emergent, you mean that things like language, I mean, we have mimetic cultural transmission and tool use and all sorts of stuff. You're saying that that's not baked into the very lowest level. When we have these rich dynamics of agents sharing information and so on, they appear higher up the scale. Yeah, so I think of it, broadly speaking, as sort of a sequence of platforms.

where we had originally DNA with evolution, bacteria, single cells, and that became the platform for multicell organisms. Multicell organisms became a platform for reinforcement learning, animals that could learn at test time within their lifetime. That became the platform for groups of agents to interact, and in those groups of agents, we could then develop all of the reasoning skills, the cognitive skills that really make the human species unique.

So, at least in terms of where we have come from, this has been a path of gradually bigger and bigger scales of coordination. I think ultimately our society right now is grappling with trying to figure out what does coordination mean? What's the next step in that evolutionary process? How do we coordinate better? How do we go from, again, sort of like single cells fighting each other, individual humans being in conflict, individual nations being in conflict, to greater coordination and cooperation at that bigger scale?

And I think there's hints of this that we're seeing, but we really haven't figured this out as humanity, really. I was speaking to Bengio last night, and he was sketching out, we have these agents, and they can hack their own reward function, right? You know, because of the way we've wired them up. So they can change their own goals, and they can start to do all sorts of things that might become misaligned and so on. But

To me though, do you see a fundamental distinction between the types of agents that we're building in AI and the way agents work in the real world? I think the way that we're building these agents is very, very different. But having said this, reward hacking is not something unique to AI agents. Humans hack reward functions all the time. In fact, in my mental model, every reward is reward shaping.

and comes with reward hacking, right? I mean, think about p-hacking in the scientific community. That is nothing but reward hacking the signal, which is recognition for having papers accepted. To get a paper accepted, you have to have a p-value of less than 0.05. So we don't call it reward hacking, but this is what happens everywhere. This is not something new. I think there's obviously differences in the design process and in different properties. So for example, we've really designed LLM agents

to obfuscate levels of agency, to be saying, "Oh, I'm just an AI agent. I don't have consciousness. I don't have these properties. I don't have intentions." But that's a design choice. So in many ways, we've played the role as AI scientists designing those agents to have certain properties and pursue certain goals.

much like the evolutionary process has shaped us. I love this idea, by the way, that even in the natural world, Goodharting could be a completely natural property. But something like intentionality in humans, what's the difference between our intentionality and an agent which behaves as if it has an intention? Well, I mean, our intentionality, in my mental model, again, is a side product of

having to pursue goals to survive. So it's an evolutionary feature. Currently, we don't yet train AI agents from the ground up to be goal pursuing, but to be imitating. The current paradigm, first and foremost, is imitation based. Now, we've also seen that this paradigm leads to systems which are not very good at being agentic. So I think quite a natural step down the line is to do agentic pre-training,

where we also train these agents to actually pursue goals. And at that point, I think we're much, much closer in terms of intentionality to what we do as humans, which is goal pursuit. So AI agents are basically automata, right? They're just a mapping from an input to an output. So I wondered, perhaps we couldn't say for a very small, simplistic automaton that it has autonomy.

But with this rich multi-agent dynamics, information sharing and so on, perhaps you would think that at some level of complexity, we could say that the system as a whole has a form of autonomy. I think it's difficult to imagine that we will have strong agentic systems that will be simple enough so they don't look like they have autonomy. Everything I can imagine around what we need to get there

will be agents that can set their own goals, own sub-goals, that can self-improve their own learning processes, that can work together in teams of students and teachers. And it's almost sort of like a contradiction per se to have strong AI, strong agentic AI, and to have things that don't look like autonomy. Because again, this is a...

it's going to be difficult to write down an explicit learning rule, an explicit data set that will get us there. Everything we do around self-improvement, around emergent properties, of multi-agent teams, of large networks of agents, of cultural transmissions, of computational self-improvement, through discovering new concepts,

requires these agents to have essentially autonomy. And is it fair to say that if we're building AGI, it's more likely to have autonomy if it's a multi-agent distributed complex system rather than just a single thing that we program? That's a great question. I think there's two answers to this because AGI can be in principle a single entity, but I find that vision quite dystopian.

that AGI is a monolithic system, and instead I much prefer the swarm intelligence view on intelligence. Because humanity, obviously, we do things that no single human could do. We have this decentralized compute network of agents going about their lives and figuring out all sorts of structures and rewiring themselves into new computation graphs, and getting tens of thousands of people to fly to conferences to do collective computing and imagination.

And to me, it's like the intelligence is in that system. And I hope that we'll find approaches of having that same level of distributed, decentralized compute structure, but now augmented with agentic AI systems. There's something about this distributed swarm type approach, which seems magical to me, right? I mean, just look at the biological world.

Why does it work? Well, we have adaptability, we have reuse, we have autonomy, we have all of these properties like self-repair, for example, self-preservation. There's something really important about that setup which I think we need to reproduce in AI. Yeah, and I don't think anyone has quite managed to put their finger on it. So this is the funny thing. Multi-agent learning has been the future forever. But like many areas, things that have been the future forever suddenly have become a reality.

Self-driving cars were always in the future. Quantum computers were always useless. And suddenly, the future is happening. And I think multi-agent learning and the multi-agent intelligence is that next frontier of things that were always there. It's like this will be the future at some point, and now it's happening.

And it gives you not just what you said, decentralization, robustness, but it also gives you this ability of effectively deploying vast amounts of test time compute. Because suddenly you can also use test time compute to rewire yourself, to re-explore new solutions, to divide and conquer.

And I think that's going to be extraordinarily powerful now that we have solved a lot of the first requirements to make this work. We now have agents that are good enough at basic reasoning. I think we'll get to agents that can do basic agentic behaviour.

And then the multi-agentic behavior, I think, will be the next emergent property or the next platform for real innovation in this space. Love it. Love it. Jakob, you wrote a paper called Risks and Opportunities of Open Source Generative AI. Can you sketch that out for me? This paper goes back to a conversation I had with Phil Torr at lunch in Oxford about a year or so ago. And at the time, really, there hadn't been that much work in the space of open source generative AI.

And we're very concerned about the accumulation of power behind the large players in the closed source AI space. Because again, to me, this decentralization of intelligence is not just a path to having smart systems that are robust, but it's also something that gives agency to the parts

of this network. And it's to me a foundation of Western thought, Western democracy and balance of power that maintains our social structures and prevents dictatorial takeovers. And at the time there was a lot of discussion about the risks of open source AGI, but very few people were speaking out about the benefits of open source AGI and the risks of closed source AI. So what we decided is to gather a group of people and

it came out of a workshop in London on open innovation, to write a paper that would try and tell the other side of the story, which we thought was lacking in the discourse. Obviously, writing this paper took time. And the great news is that while we're even in the process of writing this paper, more and more open source papers, open source systems were coming out for LLMs, which means that now I think a lot of that paper is no longer as sort of urgent. But I still think being able to tell the story

These are the risks of closed-source AGI that are commonly ignored. And these are the benefits of decentralization, of democratization, of giving everyone access to these tools and being able to deploy them across the economy, being able to deploy them across the planet, and giving these to everyone who wants to innovate. That's just what we wanted to tell. And I think the paper does a decent job at this. So we're from the UK.

And we have a mix of kind of centralization and decentralization. So we have the National Health Service. If we were doing this interview about five years ago, I probably would have been arguing it was a good thing. Not so much anymore. But the government controls things like the water, the railways, and so on. And then we have private enterprise as well. So we have a bit of a mix. Some might say that AGI is such an important thing that we need to have an economy of scale. We need to have the best people, the best experts involved

It needs to be centralised. What say you to that? Centralisation is one aspect. But a different question is if it's centralised, who holds the keys? What we're doing right now is we're having the Manhattan Project developed by private enterprise, funded by people from across the world, from all sorts of backgrounds and interests, that if they were funding the Manhattan Project, that would have been absurd. So I'm on board with saying we need to have centralised resources, but let's centralise resources in the interest of the common good

in the interest of the public and not in the interest of maximum profit. Does that make sense? And we often confound those two. I think if we had something like a CERN-style effort of putting resources from across Europe, from across the globe, of trying to build models for the common good that are transparent, where the data that we use is public, is accessible, is curated by the public, where the alignment methods that we use are again

democratically vetted, come out of a sort of decentralized process like Wikipedia, where a lot of people can have input in public, in the open, transparently. I'm on board with this. I'm on board with centralization, as long as it's controlled by democratic forces. The goal is the common good, because to me the biggest alignment challenge is not between AI and humans. The biggest alignment challenge is between those people who hold the keys to power.

who control these systems, and the rest of the population. So in principle, I agree with you, because when I look at a lot of the AI elites at the moment, you know, it's folks in the valley and it's a bit of a monoculture, and having an open system would make it more interdisciplinary, for example, and many eyes make shallow holes. But there are folks who say that even a tiny increase in the risk by opening this technology up could have catastrophic consequences. What would you say to that?

I think the question is, what do we call catastrophic? And this is where the scale, the scale of what counts as catastrophic matters. So for example, I think having open source systems

that can be dual use is probably a good thing because it will give us early signals of where things can be exploited and there can be malicious use. But if you have open source access, at some point you get the same balance of power which underlies the stability of our world. Most actors are good and being able to use the same methods for defense

will also help us develop defenses early against the abuse of this technology. These kinds of abuses by bad actors will not be the end of humanity. But the catastrophic abuse of a runaway paperclip-maximizer-slash-profit-maximizer could actually be the end of the human species and could certainly be the end of our Western democracies.

I think we have to be very careful when we talk about catastrophic on that scale of what that term can actually be for different people.

Another thing is we live in a, you know, like a globalized world. We have a very differential kind of regulatory landscape. And some of the other players out there, they might have fewer regulations to deal with and they might use this technology that we're giving away for free and use it for bad purposes. How do you guard against that? I think it's on the international scale the same that applies on the national scale, which is the balance of power. If you equalize access to tools,

And that balance of power between different countries needs to be maintained. And giving fair access to AI is just part of the equation. Fundamentally, AI is trained on the collective outputs of humanity. This technology belongs to everyone, including people we don't like, right? I think it's actually quite wrong to say only a small fraction of Western elites should have access to this. Because this is trained on the cultural evolution of the output of humanity.

all of humanity. So let's use it for the benefit of everyone. And the only way that I can ensure this use for the benefit is to give equal access to people. And don't get me wrong, personally, I would prefer to go beyond just open sourcing. In our paper, we had a section on the question of open source AGI. And the argument we make in a nutshell is open source is better than closed source from a risk perspective because it prevents the catastrophic accumulation of power under a misaligned entity. It prevents this.

But even better would be systems which are holistically aligned. And what I mean by that is, imagine if you had swarm intelligence, where every person had their personal AI representative that is trained for them to augment them. And the only way that we could do superintelligence is by having these teams of people and their assistants interact in a large network. And we had processes of making the mechanisms in the large network fundamentally democratic.

Which means the only way we can get to superintelligence is through this hybrid approach of humans and their agents, their systems. Which means this sort of distributed computing platform could never be used against the interest of the humans in it. And I remember the discussions we had. We had long discussions about this term of holistic alignment.

And the other authors also thought this was too crazy. But I said, you know what, this is it. We have to pursue it. Because fundamentally, the reason we're in this weird position now, where everyone is racing to build something that nobody actually believes is a good thing for humanity, is a coordination failure. So why don't we use AI to help us coordinate better, to build systems that are fundamentally democratic in their design, that fundamentally allow us to have a technology which cannot be abused against the humans in it,

rather than to say we're going to drive coordination failure to the maximum. What about this matter that doing AI development at the frontier, it costs billions of dollars and in the open source community right now, frankly, what we're doing is we're fine-tuning models that Meta have given away for free. Right. It's a very expensive endeavor. Do you think that that's still the case?

Do you think that we can actually do real work in the open source communities and academia without all of that money? So there's two answers. In the short term, absolutely, we rely on the large industry players like Meta who are pursuing open source. And that's one of the reasons why I'm at Meta 50%, because I want to strengthen that effort. I want to help open source AI leapfrog closed source. So this is in the short term. I think in the long term...

We need to put resources in the common interest, a CERN-like effort. Why could we build CERN, with thousands of authors, and yet we're not capable of putting the resources that go into academia into one collective effort of doing a moonshot-like project of building the best models? If you think about the collective intelligence in academia, it dwarfs anything in any of the large labs. Of course, you can have thousands of research scientists at DeepMind, but you can't have

tens of thousands of brilliant young minds that we have in academia. And the interest should be such that there's enough players who don't want a monolithic future. And to me, this is a coordination challenge. And I think at one point we'll look back and say, why did it take us so long to realize that there's a huge opportunity here to bring together resources from a diverse set of players and try and make sure that

every single PhD student, every single postdoc, every single PI can be as efficient as possible to drive forward the vision of open source AGI. - Yeah, I love it. I'm projecting Kenneth Stanley here, but he said that serendipity plays an outsized role in our lives. And serendipity comes from having loads and loads of developers with diverse interests hacking around with things. In the paper, you said something along the lines of developers should not be held liable

for the things that they create. What did you mean by that? So what we mean by this is the developers of the tools, the models, I open source a model, I should not be liable for what happens with that tool. Imagine a world where if you're building a hammer, you get put in jail because somebody, a bad actor, takes that hammer,

and intentionally causes damage. Obviously you couldn't produce hammers anymore, right? Instead, the only way that you could get a nail into the wall is by hiring the hammering service, which holds all the hammers in their coffers and controls them. And then they will ask you, what kind of nail are you putting into the wall? Sorry, what picture are you hanging up? Oh, we don't like that picture. We're not going to put that nail into the wall because we don't like it.

They say, "Well, it's my wall, it's my apartment." "No, no, no, sorry. Hammer Company says no." Does that make sense? That would be absurd. And yet when we get to AI models, we've gotten quite accustomed to the fact that we're handing over agency. I as a user have intentions. I am liable for my actions. And suddenly it's a computer-says-no situation, like, "Sorry, you're not being nice." What's nice, and to whom? And who says that I need to be nice to people? If I want to be annoying, I can be annoying. Ask my students.

Does that make sense? So to me, fundamentally, we've handed over agency. And I think it's going to be one of the big absurdities, the fact that we collectively have given away the keys to our collective intelligence infrastructure. It started with Google Search. We used to have libraries, there was a public index, there was fair and equal access to information. And Google Search gave away our collective hippocampus, the indexing structure into collective memory.

to a for-profit entity. And now we're doing the same with AI access. It's like having a typewriter where as you're writing, once in a while it says, so and so, you can't say that. Obviously, I can type whatever I want. And I don't see a way out of this beyond open source and maybe, in the long run, these holistic alignment systems that are fundamentally built up to be democratic. Jakob, it's been an honor having you on the show. Thank you so much for joining us today. Tim, thank you for having me. This has been brilliant. Great to talk to you.