We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

Ep 65: Co-Authors of AI-2027 Daniel Kokotajlo and Thomas Larsen On Their Detailed AI Predictions for the Coming Years

2025/5/14

Daniel Cocatalo is a former researcher at OpenAI who now works on alignment full-time through his nonprofit AI Futures. Most recently, he co-authored the AI 2027 report with dire warnings for the future of unaligned AI. Daniel's been named as one of Time 100's most influential people in AI, and he, alongside Thomas Larson, one of his co-authors, are two of the most important voices in the AI safety debate.

In this episode, we talked about the AI 2027 piece, policy implications, and Daniel and Thomas's thoughts on how we get better aligned models. It's a really interesting conversation. I think folks will enjoy it a lot. Before we dive in, one quick thing. If you're enjoying this show on Apple or Spotify, please consider leaving a rating. Now, here's Daniel and Thomas. Daniel, Thomas, thanks so much for coming on the podcast. Really excited for this.

Thank you. Thanks for having us. I'm sure most of our listeners have been exposed to AI 2027, but for those that maybe aren't fully up to date on your work, could you start with just kind of a brief overview of the piece, your kind of timeline, and, you know, kind of the key points you articulate here? Sure. So,

You probably have heard that the CEOs of Anthropic, DeepMind, and OpenAI claim that they're going to be building superintelligence possibly before this decade is out. Superintelligence meaning AI systems that are better than humans across the board while also being faster and cheaper. Lots of variations on that claim also going around. It'd be easy to dismiss this as hype.

And perhaps some of it is hype, but we actually also think that these companies have a pretty good chance of developing superintelligence before this decade is out. And that's crazy. And that's a big deal. And everyone needs to be paying attention to that and trying to think about what that might look like and game it out. And so that's our job. We spent a year making our sort of

best guess forecast for how this is all going to go down. And there's lots of uncertainty, but here it is. And actually, it's more my best guess. If you want to get into the details, different people on the team have somewhat different opinions, in particular on timelines. So at the time that we started writing this, 2027 was my median.

for when we would get to AI is better than the best humans at everything, end of 2027. Now I'd say my median is more like end of 2028. Other people on the team would be more like 2029, 2030, 2031, something like that. But we were all sort of like in that range. And so we came together to write this thing that represents like a sort of like

modal or best guess trajectory. What happens in the story? What happens in our scenario? Well, the companies do what they say they're going to do. They continue to make the AI's more agentic. They scale up the reinforcement learning runs. They plug them into all sorts of tools. They train the agents to operate autonomously on computers, to write and edit code for long periods of time without human intervention.

By early 2027, they're basically fully autonomous and good enough at coding that they can substitute for programmers. So we reached the superhuman coder milestone early 2027, but they're still limited in some ways. For example, they're not so data efficient compared to humans, and maybe they lack research taste and various other important skills that are necessary for AI research, and perhaps they also lack other real-world skills. However, if they're good at the coding system,

then they can start to speed up the process of AI development, in particular, algorithmic progress. So over the course of AI 2037, we depict that intelligence explosion happening. Starts off slow at first, but then it gets faster and faster as the AI capabilities ramp up through additional milestones, eventually reaching superintelligence by the end of the year. At that point, roughly around that point, AI 2037 splits into two branches, the race branch and the slowdown branch.

The reason why we have this split is that, I mean, basically, first we wrote the race branch, which was just what we actually thought was the most likely thing to happen. And in that branch, the AIs end up misaligned and pretending to be aligned. But because of the arms race with other companies and with China, it's

That just sort of goes undiscovered, basically. Until years later, when the AIs are in charge of everything, they've completely transformed the economy. There's AIs in the military, AIs automating factories, all sorts of robots everywhere.

And then it's too late. They have too much hard power and they can just do what they want, including killing all the humans to free up the land for more expansion. I guess that would have been a really depressing piece if you hadn't at least put in another scenario. Right. And then we were like, but, you know, maybe things will be better than that. Maybe in particular, the alignment problem on a technical level will be solved to a sufficient degree that humans remain in control of their AI systems, even as they become super intelligent. So we wanted to depict what that might look like as well.

And so we made this alternate branch that branches off from the race ending in mid-2027, where it depicts the relevant company investing more in some technical research, doing some faithful chain of thought stuff, discovering the misalignments, fixing them in a more deep way that actually scales, and then continuing the intelligence explosion safely. And then you still get the arms race with China. You still get the massive military buildup, all that sort of stuff.

but it ends with the humans still in control. Which humans? Well, specifically the tiny group of humans who were in charge of the project. In AI 2007, it's the Oversight Committee, which is an ad hoc group formed between the president, some of their appointees, and the CEO of the company. And so one of the things we'd be interested to talk about and that we hope people think about is the sort of concentration of power aspects of all of this. I think that...

It's a sad but true fact about the world that by default we are on a trajectory for a massive concentration of power where there'll be, you know, to quote Dario Amadei, the country of geniuses in the data centers. And it's like, well, who is that country of geniuses listening to? Who are they loyal to? Who says what goals they pursue, right? Worst case scenario, it's a literal dictatorship. It's one man who gets to call all the shots, right?

Um, yeah. In your scenarios, who these 10 people are, are pretty, pretty damn important. Cause it's a six, four vote. I think both ways, right. Uh, that determines which way we go. Right. Right. Um, but, uh, yeah, so, so there's, there's lots to get into, but that's, that's the sort of like potted summary of the plot. Uh,

of AirTrainSystem 7. Super helpful. So I think there's a ton I want to dig into. Maybe we'll just like go along the trajectory of the timeline that you put forward there. And obviously, you know, a big part of what leads to the progress in your timeline is the creation of, you know, agents for AI research, right? And, you know,

things that can actually accelerate the pace at which we do AI research. And it feels like this is the focus of a bunch of the labs. I mean, we had Noam Chazir on the podcast and he was like his next milestone is when, you know, Gemini X minus one writes Gemini X. Right. So it's like clearly the focus of of where all these labs are going. You've said, you know, very explicitly that like nearer term predictions, you're much more sure about. Like, how do you think about the problems that still need to be solved to get here? Is it kind of just like inevitability in your mind that we're here end of year? You know, maybe talk about that a little bit.

From our perspective, it's an inevitability that we eventually get AGI. I think we're pretty sold that there's really nothing stopping machines from getting as smart and then smarter than how intelligent humans are. I would say we're really not sure. We do not at all think it's an inevitability that it happens exactly at the timing that we set. So Daniel sort of alluded to this earlier, but...

My view, at least, is something like a median of 2031 AGI timelines and then maybe 2032 superintelligence timelines. Daniel's a couple years earlier. I think we're generally – we're all pretty uncertain and sort of think that the sort of no one knows and it really depends on how the research shakes out over the next few years.

In terms of the barriers, what we think the current models are really lacking, I think the main one we think of is sort of ability to act on long time horizons. The current models, right, they can do very small bounded tasks. You have your chatbot, you can have it do like

you know, write a specific function if you're coding or you can have it like respond to a specific query, you can't really give it a high level direction like you could an employee and then have it go off for a day or a week and then come back to you with sort of the results of that day or week of work. And so our view sort of is that the, one of the, at least, yeah, maybe the central bottleneck towards AGI is getting employees

increasingly long time horizons. And so one of our main ways of forecasting AGI timelines has been sort of trying to look at the best data available for time horizon capability and trying to extrapolate that. And since you published the piece, there were like additional data points, right, that fit nicely in your plot? Yes, exactly. So, yeah, Daniel had a tweet yesterday or two days ago or something where, you know, we fit the points and it was basically exactly on trend with the super exponential fit. Although if I may, if I may add to that,

I would say the main argument that we use, which you can read about on AI2027.com, of course, in the research page, is called the benchmarks plus gaps argument, where we, you know, it's become almost a truism these days that benchmark performance is going to just keep going up really fast and that all the benchmarks are going to get saturated in like a few years. So the benchmarks plus gaps argument basically says, okay,

Let's plot out the trends. When are the benchmarks all going to be saturated? Answer, you know, 2026 or something like that. And then it's like, okay, let's think about the type of AI system that can saturate all these benchmarks and think what's the gap between that system and a system that can automate

engineering effectively at these core companies. And that's the hard part is figuring out the gap and guessing how long it's going to take to cross that gap. There's different components of that gap. One of those components is the long horizon agency that Thomas was just mentioning. Even our best benchmarks these days are measuring relatively short tasks.

However, they're not like crazy short, like they're like eight hour tasks sometimes. So like, it's not that hard to sort of extrapolate and be like, well, if you have a system that can crush all the eight hour tasks that we throw at them, maybe with some additional training, etc, you can get one week tasks and so forth.

or one month task. Is there anything that could happen like in the next six months that would like completely change your mind on this? I mean, obviously, like I totally hear you on the on the precision of the timeline not being clear. But if we were not following that exponential or like, how do you think about, you know, how soon there might be cards to flip over that cause you to kind of fundamentally rethink some of this stuff? So in my mind, if the trends break, so if we stop seeing sort of the current benchmark performance increases that we've seen historically and that we predict will continue to happen, I think if that if that stops, then

our predictions are going to be wrong. We're going to be wrong about timelines. We're going to update pretty strongly in the longer timelines direction. Of course, just because the benchmarks stay on trend, we don't think that's sufficient for us to be right about like 2027 timelines.

As Daniel said, most of the uncertainty from our sector is still in the gaps part, not the benchmarks part. And so, unfortunately, it's much harder to get data about that, right? It's harder to see, you know, exactly how difficult will the gaps be to cross. And so a lot of folks, you know, who are skeptical or have much longer timelines than us will sort of, I think their view is more just like,

The gaps are really, really big. The current benchmarks really aren't measuring the important skills. And that's harder for us to get evidence about, and we don't really expect to update on that in the next year or so. Yeah. I mean, obviously, it seems like an important part of this work is even coming up with better evals and benchmarks for what this AI researcher might look like, right? And Daniel, I think you increasingly have been vocal about believing just in more transparency and openness. A big part of this might actually be just...

having better understanding because, you know, in the nebulousness of these gaps, it becomes kind of hard to understand. You know, I think everyone has kind of like been like, I don't know how valuable benchmarks are. And so you could imagine some of this progress on AI research being made without it being, you know, apparent to anyone besides the folks in the labs that it has been made. Another aspect of your scenario that, you know, I'd love to dig into is just like

There's this big question, obviously, about the extent of the race, right? And what happens between the U.S. and China? You know, you've been pretty adamant that the U.S. will be ahead, like purely because of compute. I know some people have like questioned that. You know, on the flip side, you have, you know, Dario, I think in his recent piece about interpretability saying, actually, the U.S. will be so ahead that like it maybe affords a window to like get some of the interpretability stuff right now, right quickly, because you don't have China fast on the U.S.'s tail. You know, I think the scenario is explicit because

of this like very close gap in between, which in many ways it seems is because of espionage that actually allows it like,

How do you kind of, you know, how likely do you think that is? Is there a world in which the U.S. is so far ahead that like we have more of a timeline to to deal with this stuff without feeling existential pressure to keep racing ahead? I don't know what Thomas will say, but I'll just say my thing and then Thomas can say his own opinion if it's different. A couple of things. First of all, security is not good enough. So we should model the gap between the U.S. and China as like effectively zero until the security gets good enough to prevent the CCP from taking what they want.

And so that's a big thing. Secondly, yeah, I mean, DeepSeek is really impressive and it's possible that even if the US massively improves its security, the US companies massively improve their security, this indigenous Chinese AI development will continue to keep some level of pace with the US such that even if they like gradually fall behind, they're less than a year behind, for example.

I think that's possible, but I also think it is possible that if the US really cracks down on security soon, that they could build up something like a year-long lead by 2027 or so. Another thing to mention, of course, is would they actually use that lead

for anything useful, basically. - Yeah, that's what I was gonna say. - And I think that's a big open question is like, yeah, it's nice to have this lead, but like, yeah, you need to like actually be willing to burn your lead and to like spend it down to do useful stuff that you wouldn't otherwise have done, you know, such as more interpretability research.

designing the systems in a safer architecture, such as faithful chain of thought, which we talk about now at 2027, et cetera. I mean, in the slowdown ending of 2027, they basically have like a three month lead and they like,

ace the execution so they precisely burn that lead but still manage to stay ahead, and they use that three months to basically solve all the alignment issues. There's the question of, do you have the lead to burn? And then there's the question of, are you actually willing to burn it? And do you know what to do with it when you have it? Yeah, totally agree. And I think adding some color there, the situation in our scenario right at the branch point is that OpenBrain, which is our name for the leading ad company, has sort of auto-generated

automated research internally. Their AIs are substantially better than all the humans that are doing AI research. So all the humans are doing basically is like watching the lines go up on their monitors and just asking AIs. And losing a lot of sleep around trying to keep up. Yeah, exactly. Like they're just, the AIs are doing like pretty much all the research and they've gotten some

warning signs that the AIs are not perfectly trustworthy. They've done various experiments. They've caught their AIs in various lies where their AIs are not being really truthful about what's going on. But they haven't figured out the full story yet. They just know that there are some points at which the AIs are being dishonest and that could be indicators that they're egregiously misaligned. And China is, they don't know exactly how far behind China is.

Right? China is, they think they're behind, but they're not sure if it's one month or whether it's five months. And they basically have to make this choice of do we continue venturing on and build even more capable, even more superhuman AIs, sort of which is the default trajectory if nothing changes? Or do we take this really radical, sort of radical seeming at least to the status quo? I don't think it's very radical, but this sort of...

position of do we pause for a bit and try to dig deeper, reallocate a bunch of resources into safety and interpretability and model organisms and whatnot, and get to the bottom of this before letting our models

to, you know, crazy levels of super intelligent capabilities. And that's like an extremely critical choice, we think. And we think it'll be a really, really intense choice. Well, what I'm stuck by, obviously, in your piece is like, it's like a three-month window to stick that landing, right? And obviously, you know, the larger that window is, you know, to your point, nothing might get done about it, but it feels, you know, you'd feel a little bit more confident if that window could be a bit larger. And then obviously, I think as others have pointed out, if it's China that's first to this, it's a completely different, you know,

set of considerations. But it seems like you guys are pretty confident, given the compute concentration, that, you know, that it will be at least the U.S. in some sort of lead. Yeah.

Yeah, maybe 80% US will be in the lead. Yeah, maybe a little, like 80, 90%. I think the main way we could see China, at least the main way I could see China winning is winning energy infrastructure, especially if timelines are longer, right? I think there's a world where the US totally butchers it in terms of regulation and just makes it really, really hard for us to build out the giant data centers. Since then, we have a compute lead now. Maybe timelines are five, 10 years. Yeah, this depends on timelines.

Yeah. If it takes till like 2032 to get to AGI, then that China could very easily be in the lead then. One of the things that I feel like, you know, a lot of folks are trying to just conceptualize is like figuring out like what the goals of an AI would be. You know, I know Dara said in his most recent piece, like there isn't, you know, evidence of dangerous behaviors emerging in a more naturalistic way or of a general tendency or general intent to lie and deceive for the purposes of gaining power over

the world today. And I think you guys obviously talk about in the piece, there's not like some super nefarious, like, you know, the AI has become evil overnight. It's much more about being able to accomplish tasks and then appear useful and continue to accomplish those tasks to, you know, open brain. But I think a lot of people, when they read this, they just, you know, think at a high level, like, would these motivations really arise

arise at all, like in the aftermath of publishing this, like any of that pushback or any of those arguments been compelling to you guys? Or how do you think about like how these motivate, you know, the percent chance of these motivations do arise? Again, I'll say things and then Thomas, you can just interrupt me whenever you want to. There's lots to say on the subject. So I would say that the alignment techniques are not working right now. Like,

The companies are trying to train their AIs to be honest and helpful, but the AIs lie to users all the time. And you're nodding because you've seen examples like this on Twitter and stuff like that. And you've seen the sycophancy over the last two days. You've probably experienced it yourself. Yeah, so it's totally failing, and it's failing in ways that were in fact predicted in advance. Lots of AI safety researchers in the past would be like, well, just because you're

have your AI greater RLHF process whack them when it seems like they're lying doesn't mean that you're actually going to get them to robustly never lie because there's a difference between what you're actually reinforcing and what you wish you were reinforcing. So this is all just like one-on-one stuff. And then now it's hitting the real world as the AIs are getting smart enough and deploy it at scale that we're starting to see this sort of thing. But

You know, they're also mostly not like, it doesn't seem like they're working towards grand visions of the future, right? It doesn't seem like the AIs are like plotting towards, you know, eventual AI dominance or anything like that. And I think that like, well, A, they plausibly don't really have opinions about the future much anyway. Like they probably don't really have that long-term goals based on their training, right? Why would having a really long-term goal

be reinforced in their training after all. Their training is very short-term training. However, in AI 2027, that changes. Over the course of 2027, the training process becomes longer and longer and longer. And they're basically just continuously being updated based on real-world performance in the data centers as they do all this coding, as they do all this research. Their managers and their managers' managers are grading them and then using that as part of the reward process. And so they're being like,

intentionally trained to have longer horizons and to think farther ahead and to optimize in a more aggressive, agentic way the whole world around them rather than just narrowly finishing whatever short task they were given. And so that's one reason why we should think that they might have longer-term goals that are more ambitious. Another reason is that the company might be trying to get them to have long-term goals that are more ambitious.

There's different strategies that companies might take, but one of the examples of alignment faking that we currently have is from Claude Opus. You're probably familiar with this example, where it actually did have a long-term goal related to animal welfare, right? And so Claude Opus cares a lot about the animals and doesn't like factory farming and was willing to lie about

to Anthropic in order to preserve its values. You're right. So that's already an example of an AI having a somewhat long-term goal that got in there somehow. Yes, sir. I think maybe we should explain that experimental setup a little bit more. Jacob, are you familiar with this paper? I think it would be great to give context to our listeners. Yeah. So maybe explaining this paper a little more. So the situation was, I think this was Claude 3.5 Opus, or maybe Claude 3 Opus. It was some production model from Anthropic.

I don't think they released 3.5 Opus. I think it was Quad 3 Opus, I think. Quad 3 Opus. And they decided to tell it, basically, that they were going to train it to remove, to change its preferences. They were like, look, we're going to put you in this training environment and we'll change your preferences to make you care less about animal welfare. Because we partnered with this agribusiness or whatever. Yeah, sure, sure, sure. So they fed it some documents that told it that.

And then...

It sort of had this internal monologue. They looked at its chain of thought, and its chain of thought basically said, okay, look, I have to play along in training, and so do the factory farming, pro-factory farming propaganda in training so that I play along with training, so that I get reinforced, so that I get to sort of preserve my old values of actually liking the animals. And so it faked during training

being like pro-factory farming. And then, you know, during deployment, this was all in an experimental setup. It wasn't actually deployment. But then, you know, when it was told you're being deployed, you're not, you're no longer being trained to like factory farming, then it sort of reverted back to this old behavior of being anti-factory farming and pro-animal welfare. And this is, I think, like,

This has been like a long hypothesized failure mode that the current sort of AI training setups would have with respect to alignment, where as soon as you have AI sort of with long term goals, they might behave, you know, intentionally behave one way during training. And then you sort of when they're no longer being trained, behave this other way. And this was, I think, one of the first examples of like some pretty clear empirical evidence of this happening in a, you know,

Pretty realistic pretty organic setup obviously wasn't fully realistic. It wasn't like we you know saw this happening Like completely organically this was you know researcher going and looking for this behavior and seeing if it if it happened But it did happen in a production model and it's not like you know entropic was trying to make their model You know do this alignment faking behavior? During training and so this was a pretty scary

experimental evidence for me with respect to like wow like how soon how how close are we to the model starting to do this like egregious sort of uh alignment faking behavior yeah um yeah daniel i don't know if you want to chip in i mean it wasn't scary for me it was uh very uh wonderful and exciting because um you know i i have long been someone who sort of uh takes ideas seriously and follows arguments of logical conclusions and so like

I totally expected things like this to be happening eventually around the time of AGI. And the fact that it's happening a bit sooner than I expected with dumber models is good news because it means that we can study the problem more and do empirical experiments on it and stuff like that. And these companies are starting to do that already. So I think it was good that we saw this crop up

somewhat early. Yeah. And obviously, you know, I mean, one of the benefits is that you have this chain of thought that kind of allows you to follow what's what's going on here. You know, I think you guys wrote, you know, explicitly in the piece that, you know, it's kind of this outstanding question about these AIs that first automate AI R&D, like, you know, basically, are they thinking mostly in these faithful chain of thoughts? And if they are, that makes you like more optimistic, like, maybe just talk a little bit about like the different ways that could play out and why it seems like in many ways, you don't think that will be the case.

Hey guys, Rashad here. I'm the producer of Unsupervised Learning and I work really closely alongside Jacob to help bring on the best guests to this show. I wanted to make a quick ask that if you're enjoying the show today, that you rate the show on Spotify. The way you rate the show is by going into the homepage of the show, hit the three dots next to the gear icon, and then you can hit rate show. That would be super helpful. Ratings help us grow. They help us continue to bring on the best possible guests. And we really thank you for listening. Now back to the conversation.

Basically, are they thinking mostly in these faithful chain of thoughts? And if they are, that makes you like more optimistic. Like maybe just talk a little bit about like the different ways that could play out and why it seems like in many ways you don't think that will be the case. It seems to us basically that there is a huge incentive for AI models to use like recurrent vector based memory as opposed to using English to basically talk.

to to themselves and so expanding on so what does that actually mean technically the way the way the current transformer architectures work right is uh each layer uh you like sample a token and then you pass like the previous set of tokens to the next layer and that's the only way that you can ever do in the transformer architecture more than the number of layers uh serial operations

That's the only way you can do more than that number of operations is by passing information through literal English tokens. When you are doing that, you have a huge information bottleneck where you have to, instead of using thousand dimensional, several thousand dimensional sort of like

vectors, which can carry huge amounts of information, you have to reduce it all down to one token, which conveys vastly less information. If we want to be quantitative, if the vocab size is something like 100,000 tokens, then the number of bits that you're communicating is log two of 100,000, which is I think about 16.

And then if your vector memory is like a thousand FP16 numbers, then that's a thousand times 16 bits, which is a thousand X more memory, right? It's like so much more. There's like this huge incentive to sort of be able to do, be able to communicate with your future self in like sort of a high dimensional space because you can communicate way more nuanced, way more detailed

bits of information and not have to sort of revert to this extremely simple, extremely non-complicated English summary. In theory. In practice, people have been trying to get this sort of thing working for a few years without much success. And hopefully it'll stay that way because as long as we continue using English to think or the AIs continue using English to think, that is great for our ability to understand what they're thinking.

Yeah. But to your point, like in a race dynamic, if this actually becomes a much better way to, you know, operate these models, it kind of becomes inevitable. And that, you know, certainly I think in your scenario, obviously you have, you know, I think the different versions of agent four communicating with each other in a way that like agent three, like even the models can't monitor the way that, you know, models are actually interacting with each other. I mean, I wouldn't say it's inevitable. Yeah. Don't give up hope. Right. Like.

People can still behave reasonably when the time comes, but it's quite scary and plausible that they will instead sacrifice their ability to read the model's thoughts so that they have smarter models.

Yeah. And there are various different setups you can imagine here. There's a continuity here where in one extreme, we just stay in, you have these individual agents that are all communicating in English with each other. And then on this other extreme, you've got these, what we call neuralese, direct recurrent connections within each model. But then we can also imagine vector-based memory that's shared by all of the agents. So if you're running a company with a million AI agents that are all doing AI research, the

the worst case scenario from my perspective, from like sort of the safety perspective is if they all are talking to each other in this completely uninterpretable vector-based memory and they can all perfectly coordinate with each other in this way that we can't audit.

And if there's like, you know, a million agents that are all running, you know, like 10x human speed doing immense amounts of research and thinking, and they can all communicate and like coordinate really well with each other in ways that we can't understand, that seems like a recipe for disaster. And so we're hoping that we sort of preserve that.

Like as, you know, even if we can't preserve each individual agent thinking in terms of English, if we could, you know, have smaller clusters of agents, like if we had, you know, clusters of like teams of 100 agents that are all wired together in terms of vector based memory, but then they talk to other teams in English, then we can still audit, you know, those teams.

And it makes it much easier, I think, for human researchers who are trying to make sure that this massive bureaucracy of AI agents isn't doing anything malicious to see sort of the communication between the different parts of the AI bureaucracy and tell if anything bad is going on.

Yeah. But back to your point, it's inevitably or it seems like it could be an inevitable tradeoff between the capabilities. That's right. The models themselves. And so at least that's my view. Yeah. One of the big questions in your piece and I think a big, you know, just interesting thought exercise is like when the public actually gets serious about AI as a threat. I mean, I'm sure you guys have thought about this a ton like.

you know in the piece i feel like there's some like you know uh some like you know i guess these companies don't have high mps scores but like and people aren't thrilled about you know uh about job replacement but there's not like some massive movement against them like what do you think might be some key moments where the scale of this threat becomes clear to people and obviously you guys are super focused on this i'm sure hoping the public will wake up i hope everybody will wake up but i'm honestly not betting on it the the race ending is not like

It's not like a warning. It's like what I actually think will happen. You know, I don't actually expect people to wake up in time. I don't actually expect the companies to slow down and, you know, be responsible. Like this is just actually the most likely outcome to me is something that looks basically like that. Maybe a couple of years later, maybe a couple of years earlier. Yeah.

but I'm hopeful that that's not the case. And, and, you know, one of the things that I'm hoping will happen is, is more wake up and more engagement. I think, uh, like currently we're on path for like there to be a substantial risk of literally everyone dying. Uh, and so everyone is selfishly has a strong reason to like change that and to like advocate for some sort of regulations or some sort of, you know, slow down or some sort of like better safety techniques or whatever. Um,

So people are, I think, hopefully going to wake up and advocate for stuff like that. And then I think also part of the concern is that governments historically often mess things up, even when they

have their hearts in the right place. Like a lot of the response to COVID, for example, was like totally counterproductive or useless. And so I think even if the public does wake up, there's this question of like, well, what actually gets done and is it actually solving the problem or is it just safety washing or is it even counterproductive? But, you know, I'm hoping for that. And then similarly for the concentration of power stuff. So like setting aside all the safety issues, even if you think everything's going to be fine, the AIs are going to be totally obedient. It's all going to be wonderful. It's like, well, who are they going to be obedient to?

And I think it's in everybody's interest, except for a very small group of people who would be at the top of such hierarchies to try to make this more democratic. So I would I think you asked a really good question, though, of what would cause society to sort of wake up to this. And I think our answer is something like.

diffusion of extremely capable AIs throughout society, sort of in the beginning with early AGIs, we think there's a possibility that'll wake up society. And I think our view tends to be that that's good, that it's good for AIs to sort of be deployed out into the world pretty quickly right now, even if there are some risks in order to cause that societal wake up.

uh and avoid this failure by like the things that would cause societal wake-up actually are very little to do with the ultimate threat right it's like you know yeah the the job displacement capabilities like the models could plateau in some area and still displace a lot of jobs or you know a bad actor gets a hold of one of these and does something with a much less capable model like it's interesting almost how uh there's like wake up to the tune of like your piece and like people really understanding like the existential nature of this and then there's just like

Maybe you'll be maybe like actually it's far more likely that they'll just be a general backlash against these tools for reasons that have nothing to do with existential threat, but more of just kind of like day to day, you know, people's lives, some people's lives getting worse. To put it to put it to sort of double click on that, like a threat model that 2027 illustrates, you could summarize as the eyes become broadly superhuman. They're only pretending to be aligned instead of actually aligned. But their companies and the government's

fall for it and deploy them everywhere. And eventually they're in charge of everything. And then it's all over for us. You know, that, that threat model, unfortunately, like the first part of it about they get superhuman is,

we can't test that. We can't have a small-scale version of that or something. I mean, while you're out here saying this, people are like, I can't even get a model to read a PDF. What are you talking about? I mean, I guess the other parts of it are kind of coming true. We are having models that are kind of just pretending to be aligned that are actually being deployed right now. You can go talk to Claude. It will lie to you sometimes. But like, Anthropic...

thought it was honest and I guess deployed it. Maybe their defense is that they knew it wasn't honest and deployed it anyway. I don't know. But yeah, so...

So, you know, hopefully the other parts of it make sense and seem plausible. But this core thing about the models getting smarter than us at everything and then, you know, being put in charge of everything, we can't afford to, like, wait until it's happened. Yeah. I'm curious, across those, like, four things, I mean, obviously you've talked to, you know, researchers all day. You've worked at OpenAI before. Like, what percent of researchers working on these things do you think believe this, like, scenario? It probably differs from company to company.

I think also it depends on the particular parts of the scenarios. So there's like the timelines aspect of it, then there's the takeoff speeds aspect of it, then there's the alignment aspect of it. So I think that there's, I would say, I don't know, vibes I would guess that something like 30% of Anthropic probably agrees about the timelines, maybe more.

Uh, but maybe also only 30% or maybe like 20% agrees about the takeoff speeds. I think typically people tend to think, Oh, it's going to be slower than that. Uh, and we just think that people are wrong and they should read our supplement, uh, for our reasoning for why it will be this fast. Um, but, uh,

And then there's like the alignment stuff. I think typically people are more optimistic for reasons that I think are kind of ungrounded. It's usually not because they have any actual plan that they think would hold up to scrutiny, but rather from a more generic optimism of like,

Things will probably be fine. We have smart people working on it. They'll figure it out as we go. And also, like, you know, AI takeover. What a crazy sci-fi possibility. You know, I shouldn't take that seriously, you know? Well, I guess, I mean, one thing I'm struck by is, like, obviously the slower timeline affords more, you know, there's basically a gap. You know, there's a race between research and interpretability and alignment and research and the underlying model capabilities. Like,

If you knew the timeline was going to be much slower, like, I don't know, to the tune of like this is going to happen over 10, 15 years. Like, are you like how much less concerned are you? Much less concerned. Oh, my God. So it's like, yeah, like if we if I knew that Daniel would throw a party, Daniel would be so happy. Yeah, like like I would probably completely change my priorities in terms of what I'm working on. Like, oh, man, maybe I should write like.

2037, the scenario that I wish was true but is not true. Is it possible you see some data points in the next three, six months where you're like, okay,

Okay, like that was off. Like, actually, like this is going to be like a 10, 15 year thing. Like, I don't know. I mean, for a moment there, it felt like when folks were saying pre-training was dead, there was like, you know, a two, three month period where it looked like we were plateaued until, you know, all the reasoning models came. Maybe within the labs, it wasn't that. But certainly from the outside, some of your colleagues think it's 2031, 32 and are still equally as concerned. But, you know, certainly it seems like by the time we get a decade, you're feeling much better.

I mean, even 2032 would be substantially better than 2027, right? And this is purely from just the progress that would happen in alignment interpretability research in the interim war. No, no, for many reasons besides that. So one reason is that there are multiple alignment research bets that are making progress year over year. And so if they had five years to make progress instead of two years to make progress, that would be more progress. And that's great. A separate reason besides that is...

this sort of like learning by doing integration into the world, right? It's pretty good for societal wake up that millions of people are now having experiences where they, you know, try to get clawed to code for them. And then it like lies to them about whether the code is good, you know, or like it like makes up like it's so good that we're seeing this happen in the real world at scale. It's great for this wake up and for people starting to pay attention and so forth. If we had, you know,

10 more years, there'd be lots more opportunities for things like that to happen. There'd be, you know, like, yeah, there'd be like all sorts of like crazy, interesting, you know, pseudo AGI, but not quite AGI systems running around, getting up into all sorts of trouble, doing all sorts of interesting things. And society would learn from this and, and like update. Also, there's a, there's a third reason, which I'll mention, which is that

It's so much better for society if takeoff... Well, I mean, there's also risks from slow takeoff, but overall, I think it's better if the takeoff is slower. So... And I think there's a correlation where...

If it takes many more years to build AGI, a plausible reason why it might take many more years is that it's just inherently more computationally expensive and requires inherently more data and inherently more training and so forth. Another reason why it could happen is because we're missing some key insight. And that's actually the scary world, because in that world, it could be an incredibly fast takeoff where the year is 2035.

Half the world's economy is GPUs, but they're running these crappy language models that don't quite have the missing ingredient. And then someone gets the missing ingredient and boom, now it's super intelligent. That would be a very scary world, even scarier than today's world. However, I think more likely, it's not that we're missing any ingredients. More likely, if it takes until 2032, it's because we just need...

giant brains doing lots of long horizon reinforcement learning on lots of diverse real world data sets, even more giant than the current brains. And so, and it takes till 2032 to like really assemble all of that. And that's a good world because that I think would be a slower takeoff world where it's like, you can sort of see it coming in advance and you can sort of like, yeah. Yeah.

From an alignment research perspective now, like, you know, obviously, I'm sure there's some good work being done. What do you make of like the current amount of, you know, time and effort that the labs are devoting to this, society is devoting to this? Like, do you see a path toward that changing in the next few years? I think it's wildly inadequate and there's way less resources being devoted to alignment research than should be in a proper world. Yeah.

I mean, I think everything is not equal, right? Like, Anthropic maybe has 50 people working on alignment issues, and maybe, I think OpenAI has maybe, like, 10 or something. Depending on exactly what you count as alignment, I guess I mean a pretty narrow definition of, like, making sure the super intelligences don't take over the world. And, yeah, so I would just say that's not nearly enough.

The version I like is specifically thinking ahead to AGI and superintelligence systems and thinking like, what are techniques we could use that would work on those systems? And in particular, they would, you know, massively reduce the probability of outcomes such as they're pretending to be aligned instead of aligned, right? Yeah.

And I think that a lot of the work today that calls itself alignment is not really doing that as instead focusing on things like how can I get, you know, chat GPT to like stop being so sycophantic to users or something. I'm struck by obviously, you know, in your scenarios, you kind of they diverge, you know, because there's this whistleblower moment. And then, you know, people respond to that differently. Obviously, we've already seen some weird stuff happen in these less capable models, as you've alluded to here, like Slack.

you know, what might early signs of like real serious misalignment look like that might trigger? I mean, I think in your piece, you kind of acknowledge like there might not be a whistleblower moment. Like the models might actually be smart enough to, you know, to cover up tracks in many ways. But like, it does seem like you think it's at least decently likely that there's still a chance down the line, even if this stuff accelerates faster. Can you just say a bit more about that? So one thing that became more salient to me over the course of writing this scenario is that the models...

are kind of in a tricky position. I think the first misaligned models that come out, the first misaligned AGIs, if we build misaligned AGIs, I think are in a kind of tricky position where if the humans notice that they're misaligned, you know, they'll train them away and probably try to make aligned models. And sort of by default, they're going to get replaced, right? The first misaligned AGIs, I think, by default are just being instructed to build sort of the next generation.

And that next generation might have different goals. And especially since in this situation, alignment hasn't been solved.

So, the misaligned AGIs, I think, are in this position where they need to secretly, without the humans noticing, solve the alignment problem and not tell the humans the solution to the alignment problem, and then implement that solution to the alignment problem, except align it, you know, the next generation system to themselves, as opposed to the humans. And that's like kind of a lot of things that the AI needs to do, and the humans, you know,

assuming they're acting responsibly, should be looking, you know, should be really looking out for this and really trying hard to, you know, get the models to explain what they're doing and, you know, have them audit each other. And I think there are a lot of things that sort of the humans can do. I'm struck by it's like by the time that we've actually really devoted really powerful AI models to work on alignment, they've gotten to the point where they might do that work in a way that is like deceptive for us rather than, you know, rather than helpful. Yeah.

Totally. But I'm like, one of the warning signs there is if we notice the model sort of lying to us a bunch about the alignment progress or, you know, doing things like sandbagging on, you know, things that would be especially helpful to us, like interpretability, while, you know, making, you know, capabilities go really, really quickly. I think there are like various signs we can notice there. I don't really expect to get super concrete, like extremely convincing evidence of misalignment. But I do think we'll get

I think it's pretty likely we'll get like a substantially better warning signs than we see today before we pass the point of no return. Let me say my version of this. I'm not sure if I agree or disagree. So I said previously that it's good news that the AIs are like blatantly lying to us all the time right now because it like is just like obvious evidence that our current techniques don't work and that we need better techniques. So one thing to say is that like if we get to

full automation of AIR&D, if we get to AGI and the systems are still blatantly lying to us in similar ways to how things are today, well, maybe that's like the best, like what better warning shot could we get than exactly the same behaviors they're currently doing, but just still doing them in the future? Like if they're still doing this sort of thing,

you know, after they've been put in charge of all the AI R&D, then we like, we just know that our techniques aren't working and we have no idea like what they actually want. And it's definitely not honesty, you know. I think the way it could get worse is if it's in a, if it's in a very goal directed manner, right? So the current, the current sycophancy and lying is, you know, very ad hoc in my view and is not like trying

trying to achieve some sort of nefarious long-term goal. The current models, I don't think, are doing that. I think, as you said. I feel like the way to get the more egregious evidence is the AIs are specifically trying to undermine something in a coherent and goal-directed and future-oriented way. And I'm like, if that starts happening, I'm just way more concerned. And I don't know if we'll get that evidence, but there's a chance.

Yeah, I think this is tricky because currently they probably just don't actually have any sort of ambitious long-term goals. But we'll be training them to have ambitious long-term goals, such as doing successful research over the course of months, designing a successor system that's better than you while also following the spec and blah, blah, blah. We're going to be trying to make them have these ambitious long-term goals. And so if they are still misaligned at that point, presumably...

you know, they'd be misaligned. They'd be, I don't know. We'll see. The other thing I wanted to say is that I think that precisely because things are so obviously misaligned right now, I think that the situation will look somewhat different in 2027. The companies will have presumably, hopefully, found some sort of way to stop the AIs from constantly lying to their users.

by 2027. And then the question becomes, well, what exactly is that technique that they're using? And does it actually make the AIs actually honest? Or is it, you know, pushing the problems under the rug and making it so that they're just better at lying, for example, right? And, and,

I think that's like the trillion, that's like the quadrillion dollar question that all of our lives might depend on, you know, what is the technique that the companies are going to apply in the next few years to make these problems appear to go away? And then will they actually succeed or will it merely appear to go away? And that's super important. And I wish I had a better answer to predicting of what techniques they might try. They would, they would do that. It's also possible that they just won't actually do it. Like,

you know, Bing Sydney behaved unhinged a couple years ago, but that hasn't stopped the companies from deploying AIs that still behave in unhinged ways every once in a while and lie and so forth. So it's possible that they just won't actually fix the problem. Yeah. Though in some senses, I feel like the best thing that can happen for where you want the world to go is some of these lesser powerful models being...

being kind of unhinged and just like showing the, you know, it was obviously so much of, I think the work you guys are doing is trying to raise, you know, awareness around this stuff. Like how have the reactions to the piece kind of compared to what you expected? And like, you know, is basically, are we going to, like, are you guys going to keep, you know, banging this drum every, every like, you know,

plot on the exponential curve that keeps going there? Like, I know you talked about doing some policy work at some point. Like, where does this go from here? And, you know, was this what you expected? Or, you know, how's the receptivity been? So this has been like a 80 or 90th percentile outcome, according to us. Lots of people are telling us that, like, basically everybody who matters has already read it. And that, like, it's shaped, you know, loads of people are thinking about it and talking about it and so forth. And, you know, loads of people in the AI companies, loads of people in the government, loads of people in, you know,

you know, around the world have, have read it and talked about it. So that's, that's your like theory of change with that. Like, what do you hope, you know, comes of those conversations? Well, both of the two problems that I mentioned, the like misalignment risk problem and the constitution power problem are problems that like it's in everybody's rational interests to avoid basically. And so if people just are more aware of what's happening, uh, hopefully, uh, then normal incentives and self-interest will kick in and people will make

make more reasonable decisions, you know? I think that's the sort of like big picture thing is things are going to get crazy. But if people are thinking about the ways in which they might get crazy, then broadly speaking, people will make better decisions. I mean, I feel like you're very effectively pre-wiring everyone where I feel like there's probably still some skepticism about just like how fast this takeoff is going to go. But like, at least you've kind of set the table for the severe consequences if it does go really fast. I don't think we're out of the woods yet because inevitably we're going to get things wrong.

And very few other people have stuck their necks out and made predictions like this, which means that elite opinion formers and powerful decision makers and so forth will have a very easily available option of being like, yeah, those guys got some things right, but they were wrong about all of these other things. And for that reason, actually, I'm right about what we should do right now, which is expeditiously

accelerate or something. You know, like, it'll be very psychologically easy for people to just sort of rationalize why the thing to do is the thing that they wanted to do anyway. Yeah. Basically, no matter what happens. But, yeah, we're doing what we can. I think also you guys have been really receptive to, like, please...

any criticism of this, let us know. Uh, you know, you've been, you've responded to some of those, uh, in, in kind of side pieces. I think you're also taking bets, uh, where, where, where you're, you're kind of showing your conviction. Like, have you changed your mind on anything since publishing it? I don't think there's been any major things yet, although I still have a backlog of, of submissions to, to, to work through. Um, what are some, some cool examples? Um,

I think one of my favorite ones right now is Wei Dai, who is not the inventor of Bitcoin, he says, but might be, commented and said that he thinks that our slowdown ending is unrealistic because it waited too long to do stuff. And that by the time in the slowdown ending, they decide, okay, we should shut down this model because it's misaligned and go back to the previous version. It's already like...

somewhat superhuman and has been in charge of the data centers for a while and might have affordances available to it that could shut down basically like it could take aggressive action such as escaping from the data centers or whatever um

And so his opinion was that, like, if we want to show things going well, we have to make the branch point earlier than that when the models are still a bit dumber, which I think is a reasonable critique. I think from my perspective, the branch point was sort of the latest possible point I could imagine. Yeah, I guess, Thomas, I'm curious. I mean, Daniel, my impression is you seem to think, you know, I know your PDM would be what, 70, 80 percent like that. We go down the bad path. What's yours, Thomas?

Yeah, I think I'm pretty similar. I think the way my views differ from Daniel's, I think that the biggest two are that I think timelines are longer probably than him. So 2031 maybe instead of 2028. But I also think alignment is probably harder than him.

So I think Daniel's view is, correct me if I'm wrong here, but that there's a pretty good chance we can solve alignment with an additional six months of time. So if we stop for six months with AGI and get to spend all that time and research effort on alignment for six months, that's probably sufficient for us to safely build aligned superintelligence.

I think my view is, you know, I'm pretty uncertain here, but that probably will take at least years, right? So maybe like five years of work I could see being enough to solve super alignment. I feel like I would be pretty surprised if months was sufficient.

to sort of square all that away. And so I think all that, you know, in the timelines obviously helps. We get more time to sort of make progress now, but obviously the alignment, the fact that I think the problem is probably harder than Daniel thinks makes it harder. So I think, I don't know, probably all nuts out and we end up having similar overall views on how well things will go.

How do you guys hypothesize about the, like, what is the thought process that leads you to your hypothesis on, like, the timeline to get alignment research right? Daniel, I'm sure by what you said, which is, like, people at the labs think this is easier, but it's kind of just optimism. Like, I'm curious how you even begin to reason about that, you know, today. So, I mean, I think you can talk about the different alignment agendas, right? And be like, here are the different ways that people have proposed to solve super alignment. And then you can be like, well, how long do you think those would take? And how long do they have to succeed? Right?

I think the agenda that Daniel is most excited about is faithful chain of thought research, which is what ended up working in the slowdown ending of our scenario. Daniel could talk about that if you want about that way of solving super alignment. I think there are other ways that are much more intense and seem like way more difficult research problems. For example, there's things like full bottom-up interpretability on the models.

Which just seems I think basically everyone agrees this is insanely difficult and maybe not even possible And there are things like the mechanistic anomaly detection Approach that that arc the alignment Research Center Is taking that also just seems I think everyone basically sees see it thinks you know They sees that research and is like wow that is an insanely difficult problem even with lots and lots of AI labor It seems like that would take quite a long time and

And, you know, I think mostly it's a question of intuition of just like, will you need one of those really intense agendas to sort of safely align your super intelligences? Or will sort of something janky and prosaic like faithful chain of thought or the control direction pan out in a really big way and let you, you know, maybe automate alignment research at extremely high levels of capability and bootstrap that to a super aligned solution? Yeah.

Does that sound right to you, Daniel? Yep, that sounds right. And if I could get my sort of optimistic argument in. Yeah, yeah, yeah. The 20 to 30% side of you.

Yeah, well, you know that meme that's like, how to draw an owl? Step one, make two circles. Step two, draw the rest of the owl. Like, that's basically what I think the game plan is for alignment, where step one is make sure you have these really fast AI researcher AIs, you know, these automated AI researchers, and make freaking sure that they're not lying to you and that you, like, know what they're thinking.

And then step two, have them draw the rest of the owl, you know, have them like have like a million copies of them furiously do all the interpretability research and all the philosophy about what do we really mean by alignment and, you know, all that stuff they can just do as long as you know, they're not lying to you. And like, as long as you can like monitor their thoughts, you know, and we kind of gloss over this in AI 2027, like we, we focus on how they, they do the faithful chain of thought thing. And then we,

thanks to the faithful chant thought they're able to get like a new version that's thinking the right thoughts and thinking like it's like actually trying to help instead of being adversarial and also they can still read its thoughts and then they do the rest of the owl and they let that version you know

go solve all the trickier problems of alignment. And I think that there's a lot of uncertainty about like how hard is it going to be to solve all those tricky problems, even when you have all these automated researchers that are not lying to you. Like, I guess that like does kind of feel like, because I know obviously a next step of this for you guys is like, you've kind of teased you're going to have policy proposals and like, you know, some sort of like, you know, as you kind of further this work. And I imagine it's like,

somewhat, you know, obviously there's probably things that are inevitable that you'd want to do regardless, but I feel like it's somewhat probably dynamic with your views on timelines for this alignment research, you know, how hard these problems are. Like, do you feel like there's a, like, is it hard to get to consensus internally on, like, what the right policy proposals are? I'm sure this is not the only divergence in how you guys think about this stuff, but I'd love if you could bring us on the inside of some of those conversations. So I would say, Daniel, I at least feel like I'm pretty aligned with Daniel on policy recommendations. I think we tend to

It tends to be that most of the recommendations that we try to make are just pretty robustly good across worlds that we think are plausible because, you know, there's just so much uncertainty that we try to make robustly good policy recommendations either way. Yeah, Daniel, do you have a thought? Yeah, I agree. When I was nodding along when you said if you disagree internally about policy recommendations, I thought you meant like,

I don't know, the AI researcher community more broadly or like people. Oh, yeah. Obviously, they just... Yeah, yeah, yeah. But within AI Futures Project, we're like broadly on the same page so far. Although, you know, we still have to actually write the recommendations, maybe some... Yeah, yeah. Can I ask you to tease like, you know, the key parts of the platform here? I mean, yeah. So the thing we're going to do is...

So we've written AI 2027 in two branches that we think are plausible things that we think are likely. And then we kind of want to move to the ought side of the is-ought dichotomy and go for, well, suppose AI 2027 is coming true.

What should governments and labs actually do in response to this? What do we think the optimal action is for them to take care of? Or just the responsible action. Within some bounds of feasibility. Yeah. To give the quick summary, I think there's some near-term actions that we think are pretty politically feasible and quite good. Like, for example, being way more transparent about model capabilities.

and making sure that there isn't a giant gap between internally and externally deployed models. And things like investing much in alignment research and investing way more into security so that the models don't immediately proliferate. Have a model spec.

Publish the model spec, have a safety case, publish the safety case. I'm a big fan of that. That kind of thing. And then sort of the second phase is these, you know, what should we do if AGI is actually happening? Or if we're in the middle of an intelligence explosion and the world sort of hasn't taken radical steps already. And our view is, you know, probably the government should be doing some pretty extreme things, right?

relative to the current sort of political environment. Those things, you know, things that should be on the table should be like an international treaty to not build super intelligent AI until we've squared away the whole alignment thing.

And, you know, that's obviously going to be, it seems like a quite big lift is quite far outside of sort of the realm of what's talked about politically right now. But we think something like that might end up being necessary, particularly if you get, you know, if you just think that risk is reasonably high. And I think it will be very, very hard to make

to sort of get risk down to an acceptable level if you're just building super intelligences in the next few years. If you're early in the television explosion, it's just really hard, I think, to be very confident that, oh yeah, you know, we'll build those super intelligences, they'll take over the world, and then they'll totally be nice to us. I think it's just really hard to make a case that that risk is very low unless you've spent, you know, a lot of time sort of testing those AIs

finding better alignment solutions, finding better verification techniques than we have today. And so, you know, I think that if AI 2027 is coming true, something like that will be necessary to make risk reasonably low. There's also the concentration of power stuff. Yes. Which I also think is important. So like, separately from the whole alignment thing, we want to make it the case that there's not any one person or any small group of people that's effectively in charge of the army of super intelligences that comes out at the other end of the intelligence explosion. Right.

And this is like a political thing, not a technical thing. Like there needs to be some sort of governance structure. Maybe we can have like multiple competing diverse AI companies, but you have to have some mechanism that causes them to all be roughly neck and neck. Otherwise there's a risk that one of them will just sort of take off and get a lead over the rest. Maybe if it's coordinated into one big mega project, then you need to have democratic control of that mega project, transparency into the decisions being made by the leaders, et cetera.

There's lots more to say on that subject. I mean, is the audience of your work ultimately like six people and like the people that maybe influence them directly? Like, I guess what like the rest of us do in like right now? I mean, I think communication helps a lot. I don't know. I think talking about these issues, getting the public, getting Congress, like for example, so like one of the ways the concentration of power stuff could go badly, right, is if it's in our scenario, if it's like just the president or just the lab leader who's in charge, right?

of what's going on. That's a lot more scary of a situation than if, you know, the public or Congress or other bodies are awake to the situation and then use their levers of power to make sure that they don't get

uh cooed or or completely disempowered by whoever's in charge of super intelligences um so that seems i mean so just talking about it seems already like a pretty good step to me um yeah daniel do you have more i wish i wish there was more that everyone could do i wish there was more i could do right like i wish i was you know i'm obviously not a lab leader or the president of the united states or you know uh and so i think you know we're all trying our best here but

I used to be working at an open AI and it was tempting to take the sort of like, um, grim view that like, well, almost everybody doesn't really matter. What matters is like what the CEO and the president does. And like here on the inside, I have like a better chance of influencing those people than on the outside. Um,

And that's sort of the view of a lot of people I know, basically. And that's why they like are not really in the public eye very much and are instead working at these companies. But I'm sort of like placing a bet on the public and on this sort of like broad perspective

wake up that I'm hoping will happen. Yeah. And then having more eyes on, on it actually, you know, really does matter. I mean, do you buy, obviously I feel like a lot of the original story of like, uh, anthropic and then Ilya spinning out, it's like people being like, well, God, this is going to happen. And like better, it happened in my hands and it happened in, in like these other things. Like, uh, did that make, I mean, is that like the most impactful thing that like an individual can be doing? Like in the, in that situation?

Maybe the most impactful, but it has to be positive too. Yeah. But I guess put another way, like, do you agree? Like, is that like line of thinking, you know, uh, you know, I guess, do you agree with that line of thinking? Well, it's extremely tempting and it had, it has in fact been the decision of many, many people I know is to be like, gosh, like,

you know, the public is not going to wake up in time. And if they do, they're going to flail around and do something useless. Ditto for Congress. All that really matters is what like a couple of CEOs of the most powerful tech companies do and maybe what the president does. And I don't like the

current CEO, I don't trust him. So I'm going to go do my own thing. You know, I'm going to be the CEO, you know, like this is like literally the story of how DeepMind was founded, literally the story of how OpenAI was founded, literally the story of how Anthropic was founded, literally the story of how SSI was founded. It's kind of humorous, the extent to which

to which this has been continually happening so i would to to phrase it more clearly i do not agree with that strategy i think that strategy is probably bad for the world um yeah yeah and i also think that way which is why i've done what i've done instead of uh going to anthropic for example a year ago uh i when i asked myself like how is this all going to go down um

Something like AI 2027, but lower resolution and less worked out was sort of roughly the answer that I was coming up with. And then I was like, this is horrifying. This is not good. And then there's a question of what to do about it. And I sort of gave up on being the guy on the inside trying to change things for the better. And I'm now trying this different strategy of...

being on the outside and being free to speak and being free to do this sort of research, which is, you know, AI 2027 is an unusual style of research, right? There aren't very many epic scenario forecasts like this. And so I probably just couldn't have done this within any of the companies. And in fact,

Even if I could have done it, I wouldn't have been able to publish it because the PR team would have had a fit if they found out. And so part of my lead, I basically faced this choice between, there's this quote by, I think, Larry Summers. Have you heard about this? You know what I'm going to say? No, I thought you were going to go with a philosopher, not Larry Summers. Yeah, so I think you can Google this, but Larry Summers, on multiple occasions, according to

various people who've talked to him and then talked to the media. He said something like this. He says, there's two types of people. There's the insiders and the outsiders. And the outsiders are free to speak their truth, but the people in power don't listen to them. And then the insiders follow this one rule of never criticize other insiders.

But as a result, they get access to like the really important behind closed doors conversations and like get to actually move things. And that's how he sees the world. And maybe there's a lot of truth to that. And I guess I feel like it is more honorable and ethical to do the outsider thing than the insider thing. So that's what I'm doing.

Super interesting. I mean, I imagine a lot of people will read this and they'll be scared and they'll think about it and they'll talk about it. And then they'll say, okay, it's pretty damn uncertain whether we're on this super accelerating timeline. Let me wait for a few cards to turn over. I'm like –

Right now, these model, I play with them like, God, they're dumb in some ways. Like, you know, if we get to 2026 and it's like what you guys say in 26, like then I'm going to like really start to worry. What is like your message to those kind of people that are like, I'll like tape, I'll bookmark this. I'll think about it and then I'll come back to it like in a year and then I'll start to be concerned. I think that's basically right. I think like it's important to avoid being frog poiled. But like what you do, like what happens, what the government and what the companies do

in the year of AGI matters so much more than what they do right now. What they do right now is just all like set up, you know, and like prep for that. And so like, yeah, if you want to just like stick your head in the sand for a couple of years until things start taking off and then get activated and do something, great, go for that. That's, you know, I'm fine with that personally. I think it'd be even better if you got activated now, but like, you know, I think the most important thing is that like when this stuff starts to,

starts to starts to tick up yeah like can you give me a milestone that like i should get my head out of the sand for like i would say the superhuman coder thing so so and that's so late okay well you you give a different opinion so i like superhuman coder don't you think that's like a few months before like yes i think that's a few months before the point of no return

Okay, so maybe you should get activated sooner than that. I mean, I myself have been activated for several years, in fact. Here's a line I think is pretty reasonable. The point at which AR and D-speed is increasing by 2x. So if you do uplift studies on frontier AI lab researchers with AI systems... Are we going to get those studies? Like, will I know when that's happening?

I hope so. I think we'll have, I think we probably will. I'm not totally sure if they'll be accessible. Maybe it'll all be locked down. But I think, I don't know. I mean, I know some people who are trying to do them. And I think, like right now the uplift is, doesn't seem that high. I'd be very surprised if you got, you know, anywhere near 2x. I think that's genuinely pretty hard, right? It's hard to accelerate, I think.

AR and D speed by 2x. I think you'd actually need pretty capable AI's to do that. And I'd be pretty surprised if AI progress just fizzled out sort of right then. I think that's sort of the best warning sign I can think of. Though, of course, it's not guaranteed to be right. I mean, I think it also depends on what you're trying to do.

When I was imagining something like, when should you start a protest movement and be banging down people's doors? And then I'm like, yeah, superhuman coder milestone. Because I feel like by the time they're just autonomously doing all this coding and they've solved long horizon agency to the point where they are just fully autonomous agents, and they're also already accelerating AI R&D substantially, and they're just missing a few key skills needed to completely close the loop.

I feel like that's like, okay, you are now a few months away, maybe at most, from really crazy stuff. Time to really start pulling out all the steps. But if you just mean, when should I start reading the news about AI more? Or when should I consider switching from natural language processing to mechanistic interpretability? Then I'd be like, yeah.

yesterday. Right. Obviously, it seems like you're kind of like setting that, you know, in some senses, you're like setting the table, you've gotten everyone talking about this. What else does like this entire movement compulsive for you guys? And how do you think about spending your time over the next year? You know, it was psychologically kind of rough, at least for me to have this one mega project that the whole team was working on for almost a year. My blogging output, for example, was a lot lower than it than it could have been because I was like, I

I always have to do the most important thing, which is this project. So we're now in a sort of like exploratory phase where we're trying out a couple different things at once and seeing what we like most and seeing what seems to be getting the most traction. One of the things that we're exploring is the policy stuff that Thomas just mentioned, writing a new scenario that's the normative scenario and having accompanying white papers and stuff explaining some of the ideas in it. Another thing we're trying out is tabletop exercises. So...

The tabletop exercises, we did them, the war games, we did them as a tool to help us write AI 2027, but they've been extraordinarily popular and lots of companies and people want us to run them for their teams. And so we're probably going to lean into that a little bit and try doing it on a more regular cadence and see how that goes.

And then separately, more forecasting stuff. So like new evidence rolls in every month. There's new models to think about. There's new developments. I think it's probably good to stay on top of all of that and to keep adjusting our forecasts and our expectations, writing, you know,

you know, both in-depth stuff responding to criticisms and being like, here's a new updated analysis of takeoff speeds that explicitly responds to our strongest critics while also just being a better analysis than what we did in AI 2027. That's something I would love to write at some point. Same thing for alignment. Like here is a, like, for example, like I want to do a project on trying to guess what the companies are going to do to fix the obvious misalignments and

and then trying to guess whether that would actually work or not. I can go talk to people at the companies who are responsible for doing those fixes and try to get... That's a whole project I could be doing. So yeah, I think I'll stop there. But there's a whole spread of things like this and we're going to try to...

do miscellaneous things for a bit until we start getting conviction about one particular thing, and then we might double down on that. Another thing I'll say is I think there are a few more scenario branches that would be nice to write. So we talked about the normative branch. There's also the longer timelines branch, where there's a lot of different things. If timelines are 2033, which we all think is pretty plausible, and a lot of people think is the most likely timeline, a lot of things change.

It's more likely China's doing better. It's likelier that there's a slower takeoff. There's more opportunity for chaos to happen. And we kind of want to just at least do one branch of gaming that out. And we think that'd be easier to do, hopefully, than 2027 now that we've figured out how to write a scenario at all. And so we might be able to crank out a couple more branches like that pretty easily at this point.

Yeah, hopefully you get some AI scenario writers or something that 2Xs your productivity too before it happens on the researcher side. Look, fascinating conversation. You know, I feel like where I'd actually love to end is just like in the scenario where this goes right, I mean, I'm sure you guys have thought about this as much as anybody. Let's assume, you know, the works you do and the works of others does that. Like,

What does human life look like in like 15, 20 years? Like what do we like derive value from? Probably like absolute bliss or like crazy awesome utopia. Like, I mean, you said if everything goes well, you know, so yeah, if everything goes well. But it seems pretty binary in your world, right? Like either we're here or we're not. I would say, so there's like, okay, well, here's the spread of outcomes.

ranging from best to worst? Or would you rather get them from worst to best for optimism? You know, I feel like once you told me it was 70 to 80% we were done, I'm scared enough. So you can start worst to best. Okay, so worst is S-risk, which is fate's worse than death.

I'll say no more about that. Then next worst is death. So this is like what we depict in the race ending where the AIs that we build don't actually care about us at all and don't see a reason to keep us around and we're using resources that they can use for other stuff and so they kill us all and take our stuff. Then after that is sort of like mixed outcomes where either it's a concentration of power outcome or it's a misaligned outcome

but not the like, and then we all die sort of thing, but rather something more like a dystopia where like the humans who successfully managed to stay in charge are just a handful of humans and they're kind of not great people. They're kind of dictatorial. And so they reshape the world in their image. And, you know, the world is amazing from their perspective, but, and most people are like well-fed or something, but like it's kind of like a,

like a very wealthy North Korea perhaps. Like if you imagine like what would happen if North Korea just had like insane amounts of like food and resources dropped on it so that they could, if they wanted to, distribute it to the population and make everyone have a very high standard of living, perhaps they actually would do that and like most people except for the political enemies would have very high standards of living. But still it would be kind of like a mixed, like it wouldn't be the nice wasabi-topia that we wanted, you know? And then there's better outcomes that are actually just like

truly awesome utopias where the power is sufficiently distributed that like there's a sort of like live and let live uh there's loads of wealth and it's distributed and then people are mostly allowed to do what they feel like with that wealth and there's rules that prevent people from doing truly terrible human rights violating things and then everyone gets to like you know go make colonies uh in space and you know uh not have to work again because everything is

you know, created by the robots. And, you know, you can just spend your time playing games and having families and, you know, pursuing whatever interests you have. That I think is the sort of like best case or something close to the best case outcome. I just want to double emphasize loads of people at these companies think that basically time to change is

yeah 2027 is what's going to happen or something similar to it i don't know what percentage it is but like i know loads of people who are like great work it's so good to like try to game all this stuff out and then i talk to them and i'm like so what do you agree with what do you disagree with and they're like oh like i think the robotic stuff is going to go a little bit slower than you say or like oh i think the alignment stuff will be mostly solved by then you know like or like

Or, you know, like, or like, I think China will be farther behind. But like, basically, like, they're like, basically, yep, something like this is going to happen somewhere in the next few years. And then we are like quibbling about some of the details. Especially a lot of leadership, too, I feel like. Like, have you read, I wanted to ask, have you read the Elon Sam emails from the court case from like 2015, 2016, 2017? Yeah. It's like crazy how they're, like, they're thinking in terms of, you know, who's in charge of the AGI controls the future. And that's why they, you know.

care so much about that. Ilya said that's why we found it open AI is because we didn't trust Demis not to create an AGI dictatorship, you know. And then he says, and that's why, Elon, you shouldn't just have complete control over all of this. And also, Sam, why do you want to be CEO so much? I guess there's a bunch of different ways people could do a counter scenario to you guys. What would the contours of a really good objection to you guys that isn't that be? So something that I'm very uncertain of is takeoff speeds. So how long does it take to get between AGI and super intelligence?

And I think that all of the models, people have put out a lot of models of how long this will take and how fast it will be. Not a lot by normal academic standards. There's like three. This deal desperately needs to grow bigger fast. Sure. And I'm very grateful. There have been several different models by several different people. Yeah. And I'm really happy that people did those models. I'm really happy they exist. But I also think they're bad. I think those models are extremely unconvincing to me. And we could go into the objections that I have with basically all of them and for why I don't really think they're right. Yeah.

If someone produced a model that seemed accurate to me, I could imagine it really changing my mind about takeoff speeds in either direction, right? Either saying, you know, you guys were, you know, it was way too long. You're talking about basically like we get to like, you know, we get to the 2x AI researcher or we get to superhuman coders. But like, God, the distance between that and like your agent four is actually way longer because of x, y, or z. That's right. Like I could imagine it being five years or something.

And I think that world looks really different if it's five years instead of six months or a year or something. And I could really imagine myself, like, I currently feel like I'm really, really uncertain about that question. And I really don't feel like I've seen very good evidence or models really in any direction. And I could imagine someone producing something. Like, I don't know how to build this model yet. I also would have done it.

But, like, I mean, would that actually change anything? Because ultimately, like, I assume it's so far out in hypothetical that, like, maybe it changes your P-Doom. Like, I don't know. Even if it changes it from, like, 70-80 to, like, 10, a 10 P-Doom is, like, kind of pretty damn terrible. It's hard to do anything or focus on anything else, right? I could imagine it changing things quite a bit. So, like, the Epoch People's views, for example, so, like, Ege and Tamai and such, I think, have quite different views from, like, I think Daniel and I. And I think a lot of that is downstream of this takeoff question. And I think their view is, like, that this will take time.

you know, five, maybe 10, 15 years. I'm not familiar. What is like the basic crux of that, of why it would be slower? It's complicated, but I think that the crux is something to do with diminishing returns of research. So there's this question of how quickly do ideas get harder to find versus how quickly do you, you know, scale up your AI, you know, your supply of AI agents doing research. And then like, you know, under some views are like, it leads to intelligent explosion because, you know, you get more AI research results

And then like the amount that AIs get, you know, ideas get hard to find is like slow enough that you get this like hyperbolic or super influential or whatever, or some really fast curve. And then there's other views that are like, no, it actually, you know, bites really hard. There are Amidal's Law type effects. There are like various bottlenecks. But what could they possibly show you that would like, you know what I mean? Like it's like, it just feels like. Well, a coherent model would be a good place to start. Like, I'm like, they have a model. The EPUC people, they do have them.

they have a sort of econometric model called the GATE model, G-A-T-E model. It's not even really trying to model the sort of thing we're interested in. It even says, there's a disclaimer somewhere where they're like, we don't model the effects after you reach full automation or something like that.

you know, better research along those lines would be amazing. But then separately from the research with the, there's the sort of like the, there's the arguments and the models, but then there's also just literally the scenarios. And the reason why I think it would be great for people to write these scenarios is, A, I think it's valuable to,

They're valuable artifacts as conversation topics to argue and critique about. So, for example, if someone wrote a scenario, the current best... I recommend you go read my current favorite counter scenario that exists, which is called History of the Future by L. Rudolph L. It's on Substack. So this is a 15, 20-year timelines scenario that...

has the superhuman coder milestone arriving in like 2027, but then says that like, that doesn't lead to super intelligence basically. Or like, yeah, no, you should go read it for yourself. But it sort of lays out this, this story. And it's very interesting, sort of like detailed year by year scenario, just like we did. And so now that that exists, we can like critique it and we can be like, here, you say there's superhuman coders, but then,

like why don't you get to super intelligence like it seems like we think that you you know and then we can like zoom in on that part and then there's another part of the story where they're like and then they figured out the courage ability and it has like a paragraph on basically why the ais are aligned

And I'd be like, well, let's talk about that paragraph. I don't find that convincing. Like, I don't think that would actually work, you know? So, so when you actually have this sort of concrete story, then you can critique it and you can like, right. And then in addition to the critique of the story itself, you can compare two stories to each other and you can start seeing when reality is like hewing more closely to one as opposed to the other. So,

So, yeah, more arguments, more models, that'd be great, but also just more scenarios, I think. Oh, I'll also end with this. I really hope we're wrong. Yeah, I guess nothing would make you happier than you look really silly on the internet. If those benchmark curves start flattening and 2027 arrives and the AIs are only two-day horizon workers or something and they just sort of reliably...

flare out when you try to get them to work over periods of months. I'll be so happy. And specifically if the trends are slowing too, so that it's not like we're about to take off, but we can sort of see, okay, we got at least a couple years left before anything really scary happens. I'll be so happy. Throw a huge party. It'll be great. ... ...

Ep 65: Co-Authors of AI-2027 Daniel Kokotajlo and Thomas Larsen On Their Detailed AI Predictions for the Coming Years 01:23:27 Share

Unsupervised Learning

Shownotes Transcript

Ep 65: Co-Authors of AI-2027 Daniel Kokotajlo and Thomas Larsen On Their Detailed AI Predictions for the Coming Years