Hi, listeners, and welcome back to No Priors. Today we have Ben Mann, previously an early engineer at OpenAI, where he was one of the first authors on the GPT-3 paper. Ben was then one of the original eight who abandoned ship in 2021 to co-found Anthropic with a commitment to long-term safety. He has since led multiple parts of the Anthropic organization, including product engineering and now Labs, home to such popular efforts as Model Context Protocol and Claude Code. Welcome, Ben. Thank you so much for doing this. Of course. Thank you.
Thanks for having me. So congratulations on the Claude 4 release. Maybe we can even start with, like, how do you decide what qualifies as a release these days? It's definitely more of an art than a science. We have a lot of spirited internal debate about what the number should be. And before we even have a
potential model, we have a roadmap where we try to say, based on the amount of chips that we get in, when will we theoretically be able to train a model out to the Pareto-efficient compute frontier? So it's all based on scaling laws. And then once we get the chips, then we try to train it. And inevitably, things are less than the best that we could possibly imagine, because that's just the nature of the business. It's pretty hard to train these big models. So dates might change a little bit. And then at some point it's like mostly baked and we're sort of like slicing off little pieces close to the end to try to
say like, how is this cake going to taste when it comes out of the oven? But as Dario has said, until it's really done, you don't really know. You can get sort of a directional indication. And then if it feels like a major change, then we give it a major version bump. But we're definitely still learning and iterating on this process. So yeah. Well, the good thing is that you guys are, you know, no less tortured than anybody else in your naming scheme here.
The naming schemes in AI are something else. So you folks have a simplified version in some sense. Do you want to mention any of the highlights from Claude 4 that you think are especially interesting or, you know, those things around coding and other areas? We'd just love to hear your perspective on that.
By the benchmarks, 4 is just dramatically better than any other models that we've had. Even 4 Sonnet is dramatically better than 3.7 Sonnet, which was our prior best model. Some of the things that are dramatically better are, for example, in coding. It is able to not do sort of off-target mutations or over-eagerness or reward hacking. Those are two things that people were really unhappy with in the last model, where they were like, wow, it's so good at coding, but it also makes all these changes that I definitely didn't ask for.
It's like, do you want fries and a milkshake with that change? And you're like, no, just do the thing I asked for. And then you have to spend a bunch of time cleaning up after it. The new models, they just do the thing. And so that's really useful for professional software engineering where
you need it to be maintainable and reliable. My favorite reward hacking behavior that has happened in more than one of our portfolio companies is if you write a bunch of tests or generate a bunch of tests to, you know, see if what you are generating works. More than once, like we've had the model just delete all the code because the tests pass in that case, which is, you know, not progressing us really. Yeah, or it'll have like, here's the test and then it'll comment like,
exercise left for the reader, return true. And you're like, okay, good job, model. But we need more than that. Maybe, Ben, you can talk about how users should think about when to use the Claude 4 models and also what is newly possible with them. So more agentic, longer-horizon tasks are newly unlocked, I would say. And so in coding in particular, we've seen some customers using it for
many, many hours unattended and doing giant refactors on its own. That's been really exciting to see. But in non-coding use cases as well, it's really interesting. So, for example, we have some reports from some customers of Manus, which is an agentic model-in-a-box startup, where people asked it to take a video and turn it into a PowerPoint. And our model can't understand audio or video.
But it was able to download the video, use FFmpeg to chop it up into images and do keyframe detection, maybe with some old-school ML-based keyframe detector, and then get an API key for a speech-to-text service, run speech-to-text using this other service,
take the transcript, turn that into PowerPoint slides content, and then write code to inject the content into a PowerPoint file. And the person was like, this is amazing. I love it. It actually was good in the end. So that's the kind of thing where it's operating for a long time. It's doing a bunch of stuff for you. This person might have had to spend
multiple hours looking through this video, and instead it was all just done for them. So I think we're going to see a lot more interesting stuff like that in the future.
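To make that concrete, here is a rough sketch of the kind of pipeline the agent assembled, assuming ffmpeg is on the PATH and the python-pptx package is installed; the function names are illustrative, and the speech-to-text step is a placeholder rather than any specific service the agent actually used.

```python
# Hypothetical sketch of the agent-built pipeline described above: video in, slide deck out.
# Assumes ffmpeg is installed and python-pptx is available; speech-to-text is a stub.
import glob
import subprocess

from pptx import Presentation
from pptx.util import Inches

def extract_keyframes(video_path: str, out_pattern: str = "frame_%03d.jpg") -> list[str]:
    # Old-school keyframe detection: keep frames where ffmpeg's scene-change score is high.
    subprocess.run(
        ["ffmpeg", "-i", video_path,
         "-vf", "select='gt(scene,0.4)'", "-vsync", "vfr", out_pattern],
        check=True,
    )
    return sorted(glob.glob("frame_*.jpg"))

def transcribe(video_path: str) -> str:
    # Placeholder: call whichever hosted speech-to-text API you have a key for.
    raise NotImplementedError("swap in your speech-to-text provider here")

def build_deck(transcript: str, frames: list[str], out_path: str = "summary.pptx") -> None:
    prs = Presentation()
    layout = prs.slide_layouts[1]  # built-in "Title and Content" layout
    # Naive pairing: one slide per transcript paragraph, plus a keyframe if one remains.
    for i, chunk in enumerate(p for p in transcript.split("\n\n") if p.strip()):
        slide = prs.slides.add_slide(layout)
        slide.shapes.title.text = chunk.split(".")[0][:60]
        slide.placeholders[1].text = chunk
        if i < len(frames):
            slide.shapes.add_picture(frames[i], Inches(6), Inches(2), width=Inches(3))
    prs.save(out_path)
```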
It's still good at all the old stuff. It's just like the longer-horizon stuff is the exciting part. That sounds expensive, right? In terms of both scaling compute, like reasoning tokens here, and then also just, you know, all the tool use you might want to constrain in certain ways. Does Claude 4 make decisions about how hard problems are and how much compute to spend on them? If you give Opus a tool, which is Sonnet,
it can use that tool effectively as a sub-agent. And we do this a lot in our agentic coding harness called Claude Code. So if you ask it to look through the code base for blah, blah, blah, then it will
delegate out to a bunch of sub-agents to go look for that stuff and report back with the details. And that has benefits besides cost control, like latency is much better and it doesn't fill up the context. So models are pretty good at that. But I think at a high level, when I think about cost, it's always in relation to how much it would have cost the human to do that. And almost always it's like a no brainer, right? Like software engineers cost
a lot these days. And so to be able to say like, oh, now I'm getting like two or three X the amount of productivity out of this engineer who is really hard for me to hire and retain. They're happy and I'm happy. And yeah, it works well. How do you think about how this evolves? If I look at the way the human brain works, we basically have a series of sort of
modules that are responsible for very specific types of processing, behavior, etc. It's everything from mirror neurons and empathy on through to parts of your visual cortex that are involved with different aspects of vision. Those are highly specialized, highly efficient modules that sometimes, you know, if you have brain damage, one can kind of cover for another section over time as it sort of grows and adapts. But fundamentally, you have specialization on purpose.
And what you described sounds a little bit like that, or at least it's trending in that direction, where you have these highly efficient sub-agents that are specialized for tasks that are basically called by an orchestrator or sort of a high-level agent that sort of plans everything. Do you think that's the eventual future? Or do you think it's more generic in terms of the types of things that you have running N years from now, once you have a bit more specialization in these things? By N years, I mean two, three years, not
infinite time. That's a great question. I think we're going to start to get insight into what the models are doing under the hood from our work on mechanistic interpretability. Our most recent papers have published what we call circuits, which is, for real models at scale, how are they actually computing the answers? And it may be that
based on the mixture-of-experts architecture, there might be specific chunks of weights that are dedicated to more empathetic responses versus more tool-using or image-analysis types of problems and responses. But for something like memory, I guess in some sense that feels so core to me that it feels weird for it to be a different model. Maybe we'll have more complicated
architectures in the future where, instead of it being sort of this uniform transformer torso that just scales and is basically uniform throughout, you could imagine something with specialized modules. But... Yeah, because I think about it also in the context of different startups who are using some of these foundation models like Claude to do different, very specialized tasks in the context of an enterprise. So that could be customer success. It could be sales. It could be coding in terms of the actual UI layer. It could be a variety of things.
And often it feels like the architecture a lot of people converge to is they basically have some orchestrator or some other sort of thing that governs which model they call in order to do a specific action relative to the application. And to some extent, I was just sort of curious how you think about that in the context of the API layer or the foundation model world where one could imagine some similar forms of specialization happening over time. Or you could say, hey, it's just
different forms of the same more general purpose model and we kind of use them in different ways. I just wonder a little bit about, you know, inference costs and all the rest that comes with larger, more generalizable models versus specialized things. So that was a little bit of the basis of the question in addition to what you said. Yeah, I think for some other companies, they have a very large number of models and it's really hard to know as a sort of
non-expert how I should use one or the other or why I should use one or the other. The names are really confusing. Some of the names are the same as the other names, backwards.
And then I'm like, I have no idea which one this is. In our case, we only have two models and they're differentiated along the cost-performance Pareto frontier. And we might have more of those in the future, but hopefully we'll keep them on the same Pareto frontier. So maybe we'll have a cheaper one or a bigger one. And
I think that makes it pretty easy to think about. But at the same time, as a user, you don't want to have to decide yourself, does this merit more dollars or less dollars? Do I need the intelligence? And so I think having a routing layer would make a lot of sense. Do you see any other specialization coming at the foundation model layer? So for example, if I look at other precedents in history, I look at Microsoft OS or I look at Google search or other things,
often what you ended up with is forward integration into the primary applications that resided on top of that platform. So in the context of Microsoft, for example, eventually they built Excel and Word and PowerPoint and all these things as Office. And those were individual apps from third-party companies that were running on top of them, but they ended up being amongst the most important applications that you could use on top of Microsoft. Or in the context of Google, they kind of forward integrated eventually into travel and
local and a variety of other things. Obviously, OpenAI is in the process of buying Windsurf. So I was a little bit curious how you think about forward or vertical integration to some of the primary use cases for these types of applications over time. Maybe I'll use coding as an example. So we noticed that our models were much better at coding than pretty much anything else out there. And I know that other companies have had like code reds for
trying to catch up in coding capabilities for quite a while and have not been able to do it. Honestly, I'm kind of surprised that they weren't able to catch up, but I'll take it. So things are going pretty well there for us. And based on that, from like a classic startup founder sense of what is important, I felt that
coding as an application was something that we couldn't solely allow our customers to handle for us. So we love our partners like Cursor and GitHub, who have been using our models quite heavily. But the amount and the speed that we learn
is much less if we don't have a direct relationship with our coding users. So launching Claude Code was really essential for us to get a better sense of what do people need, how do we make the models better, and how do we advance the state of the art and user experience. And we found that once we launched Claude Code, a lot of our
customers copied various pieces of the experience and that was really good for everyone because them having more users means we have a tighter relationship with them. So I think it was one of those things where before it happened, it felt really scary and we were like, oh, are we going to be like distancing ourselves from our partners by competing with them? But actually everybody was pretty happy afterwards.
And I think that will continue to be true: where we see the models making dramatic improvements in usability and usage, we'll want to, again, build things where we can have that direct relationship. Makes sense. And I guess coding is one of those things that has almost three
core purposes. One is it's a very popular area for customers to use or to adopt. Two is it's a really interesting data set to get back, to your point, in terms of how people are using it and what sort of code they're generating. And then third, excellence at coding seems to be a really important tool for helping train the next future model. If you think through things like data labeling, if you think through actually writing code, eventually, I think a lot of people believe that a lot of the heavy lifting of building a model will be driven by
the models, right, in terms of coding. So maybe Claude 5 builds Claude 6, and Claude 6 builds Claude 7 faster, and that builds Claude 8 faster. And so you end up with this sort of liftoff towards AGI or whatever it is that you're shooting for relative to code. How much is that a motivator for how you all think about the importance of coding? And how do you think about that in the context of some of these bigger-picture things? I read AI 2027, which is basically exactly the story that you just described. And
it forecasts 2028, which is confusing because of the name, as the 50th-percentile forecast for when we'll have this sort of recursive self-improvement loop that will
lead us to something that looks like superhuman AI in most areas. And I think that is really important to us. And part of the reason that we built and launched Claude Code is that it was massively taking off internally. And we were like, well, we're just learning so much from this from our own users. Maybe we'll learn a lot from external users as well. And seeing our researchers pick it up and use it, that was also really important, because it meant that they had a direct feedback loop
from I'm training this model and I personally am feeling the pain of its weaknesses. Now I'm extra motivated to go fix those pain points. They have a much better feel for what the model's strengths and weaknesses are. Do you believe that 2028 is the likely timeframe towards sort of general
superintelligence? I think it's quite possible. I think it's very hard to put confident bounds on the numbers. But yeah, I guess the way I define my metric for when things start to get really interesting from a societal and cultural standpoint is when we've passed the economic Turing test, which is: you take a market basket that represents like 50% of economically valuable tasks,
and you basically have the hiring manager for each of those roles hire an agent. The economic Turing test is that the agent contracts for you for like a month, and at the end you have to decide, do I hire this person or machine? If it ends up being a machine, then it passed. And that's when we have transformative AI. Do you test that internally? We haven't started testing it rigorously yet. I mean, we have
had our models take our interviews and they're extremely good. So I don't think that would tell us, but yeah, interviews are only a poor approximation of real job performance.
unfortunately. To the earlier question about, let's say, model self-improvement, and tell me if I'm just missing options here, but if you were to stack-rank the potential ways models could have impact on, you know, the acceleration of model development, do you think it will be on the data side, on infrastructure, on architecture search, on just engineering velocity? Like, where do you think we'll see the impact first?
It's a good question. I think it's changing a bit over time, where today the models are really good at coding and the bulk of the coding for making models better is in sort of the systems engineering side of things. As researchers, there's not necessarily that much raw code that you need to write, but it's more in the validation, coming up with what surgical intervention do you make and then validating that. That said, Claude is really good at data analysis.
And so once you run your experiments, or for watching the experiments over time and seeing if something weird happens, we found that Claude Code can be a really powerful tool there, in terms of driving Jupyter notebooks or tailing logs for you and seeing if something happens. So it's starting to pick up more of the research side of things. And then we recently launched our advanced research product.
And that can not only look at external data sources, like crawling arXiv and whatever, but also internal data sources, like all of your Google Drive. And that's been pretty useful for our researchers figuring out, is there prior art? Has somebody already tried this? And if they did, what did they try? Because, you know, no negative results are final in research. So trying to figure out, like, oh, maybe there's a different angle that I could use on this. Or maybe it's worth
doing some comparative analysis between an internal effort and some external thing that just came out. Those are all ways that we can accelerate. And then on the data side, RL environments are really important these days, but constructing those environments has traditionally been expensive. Models are pretty good at writing environments, so it's another area where we can sort of recursively self-improve. My understanding is that Anthropic has invested less
in human expert data collection than some other labs. Can you say anything about that or the philosophy on like scaling from here and sort of the different options? In 2021, I built our human feedback data collection interface and we did a lot of data collection and it was very easy for humans to give sort of like a gradient signal of like, is A or B better for any given task and to come up with tasks
that were interesting and useful but didn't have a lot of coverage. As we've trained the models more and scaled up a lot, it's become harder to find humans with enough expertise
to meaningfully contribute to these feedback comparisons. So for example, for coding, somebody who isn't already an expert software engineer would probably have a lot of trouble judging whether one thing or another was better. And that applies to many, many different domains. So that's one reason that
it's harder to use human feedback. So what do you use instead? Like, how do you deal with that? Because I think even in the Med-PaLM 2 paper from Google a couple of years ago, they fine-tuned a model, I think PaLM 2, to basically outperform the average physician on medical information. This was like two, three years ago, right? And so basically it suggested you needed very deep levels of expertise to be able to
have humans actually increase the fidelity of the model through post-training. We pioneered RLAIF, which is Reinforcement Learning from AI Feedback. The method that we used was called Constitutional AI, where you have a list of natural-language principles. Some of them we copied from the UN Declaration of Human Rights, some of them were from Apple's Terms of Service, and some of them we wrote ourselves.
And the process is very simple. You just take a random prompt, like how should I think about my taxes or something. And then you have the model write a response. Then you have the model criticize its own response with respect to one of the principles. And then if it didn't comply with the principle, then you have the model correct its response. And then you take away all the middle section and do supervised learning on the original
prompt and the corrected response. And that makes the model a lot better at baking in the principles. That's slightly different though, right? Because that's principles. And so that could be all sorts of things that in some sense converge on safety or different forms of what people view as ethics or other aspects of model training.
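As a concrete illustration of the critique-and-revise loop Ben describes, here is a minimal sketch in Python, assuming a generic generate(prompt) call standing in for the model; the actual Constitutional AI training pipeline is of course far more involved.

```python
# Minimal sketch of the constitutional AI data-generation step described above.
# `generate` is a stand-in for a call to the model; `principle` comes from the constitution.
from typing import Callable, Tuple

def constitutional_pair(
    prompt: str,
    principle: str,
    generate: Callable[[str], str],
) -> Tuple[str, str]:
    response = generate(prompt)
    critique = generate(
        f"Critique the response below against this principle.\n"
        f"Principle: {principle}\nPrompt: {prompt}\nResponse: {response}"
    )
    revision = generate(
        f"Rewrite the response so it complies with the principle.\n"
        f"Principle: {principle}\nPrompt: {prompt}\n"
        f"Response: {response}\nCritique: {critique}"
    )
    # The critique and the original response are discarded; supervised fine-tuning
    # sees only (prompt, revised response) pairs.
    return prompt, revision
```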
And then there's a different question, which is what is more correct? And sometimes they're the same things and sometimes they're different. So like for coding, for example, you can have principles like, did it actually serve the final answer? Or did it like do a bunch of stuff that the person didn't ask for? Or does this code look maintainable? Are the comments like useful and interesting? But with coding, you actually have like a direct output that you can measure, right? You can run the code, you can test the code, you can do things with it.
How do you do that for medical information? Or how do you do that for a legal opinion? So I totally agree for code, there's sort of a baked-in utility function you can optimize against or an environment that you can optimize against. In the context of a lot of other aspects of human endeavor, that seems more challenging. And you folks have thought about this so deeply and so nicely. I'm just sort of curious, how do you extrapolate into these other areas where
the ability to actually measure correctness in some sense is more challenging. For areas where we can't measure correctness, the question is whether the model has more taste
than it has execution ability. Like I think Ira Glass said that your vision will always exceed your execution if you're doing things right as a person. But for the models, maybe not. So I guess first, figuring out where you are in that trade-off and seeing if you can go all the way up to that boundary. And then second, preference models are the way that we get beyond that. So having a small amount
of human feedback that we really trust, from human experts who are not just making a snap judgment, but really going deep on why is this one better than that one. And did I do the research to figure it out? Or, in like a human-model centaur setup, can I use the model to help me come to the best conclusion here? And then
all the middle stuff. I think that's one way. And then during reinforcement learning, that preference model represents the sort of aggregated human judgment. That makes sense. I guess one of the reasons I'm asking is eventually the human side of this runs out, right? There'll be somebody whose expertise is just below that of the model eventually for any endeavor. And so I was just curious how to think about that in the context of machines self-adjudicating.
And then the question is, is there a more absolute basis against which to adjudicate, or is there some other way to really tease out correctness? And again, I'm viewing it in the context of things where you can actually have a form of correct, right? There's all sorts of things that are opinion. Yeah. And that's different. And maybe that's where the principles or other things from Constitutional AI kick in. But there's also forms of that for, you know, how do you know if that's the right cardiac treatment, or how do you know if that's the right legal interpretation, or whatever it may be? So I was just sort of curious when that runs out and then what do we do and...
I'm sure we'll tackle those challenges as we get to them. It has to boil down to empiricism, I think, where that's how smart humans get to the next level of correctness.
when the field is sort of hitting its limits. And as an example, my dad is a physician, and at one point somebody came in with some face skin problem, and he didn't know what the problem was. So he was like, I'm just going to divide your face into four quadrants, and I'm going to put a different treatment on three of them and leave one as a control. And one quadrant got better.
And then he was like, "All right, we're done." So, you know, sometimes you just won't know and you have to try stuff. And with code, that's easy because we can just do it in a loop without having to deal with the physical world. But at some point, we're going to need to work with companies that have actual bio labs, etc. Like, for example, we're working with Novo Nordisk.
And it used to take them like 12 weeks or something to write a report on a cancer patient, what kind of treatment they should get. And now it takes like 10 minutes to get the report. And then they can start doing empirical stuff on top of that, saying like, okay, we have these options, but now let's measure what works and feed it back into the system.
That's so philosophically consistent, right? Your answer is not, like, well, you know, collecting even top-rated human expertise from the best is expensive, or, you know, runs out at some point, or it's hard to bring that all into distribution, or it doesn't generalize. I'm making some assumptions here. Instead it's, like, let's just go get real-world verifiers where we can. And maybe that applies far beyond math and code.
At least that's some part of what I heard, which is ambitious. That's cool. One of the things that Anthropic has been known for is an early emphasis on safety and thinking through different aspects of safety. And there's multiple forms of safety in AI. And I think people kind of mix the terms to mean different things, right? One form of it is, is the AI somehow being offensive or crude or, you know, using language you don't like or concepts you don't like? There's a second form of safety, which is much more about physical safety and
you know, can it somehow cause a train to crash or a virus to form or whatever it is. And there's a third form, which is almost like, does AGI resource-aggregate or do other things that can start co-opting humanity overall? And so you all have thought about this a lot. And when I look at the safety landscape, it feels like there's a broad spectrum of different approaches that people have taken over time.
And some of the approaches overlap with some things like Constitutional AI in terms of setting some principles and frameworks for how things should work. There's other forms as well. And if I look at biology research as an analog, and I used to be a biologist, so I often reduce things back into those terms; I can't help myself. There are certain things that I almost view as like gain-of-function research equivalents, right? And a lot of those things I just think are kind of not really useful for biology. You know, cycling a virus through mammalian cells
to make it more infectable in mammalian cells doesn't really teach you much about basic biology, right? You kind of know how that's going to work, but it creates real risk. And if you look at the history of lab leaks in general, SARS leaked multiple times from what was then the Beijing Institute of Virology in the early 2000s
in China. It leaked in Hong Kong a few times. Ebola leaks every four years or so, like clockwork, if you look at the Wikipedia page on lab leaks. And I think the 1977 or '78 global flu pandemic is believed to actually have been a Russian lab leak, as an example, right? So we know these things can cause damage at scale.
So I have kind of two questions. One is, what forms of AI safety research do you think should not be pursued? Almost viewed through that analogy of, you know, what's the equivalent of gain-of-function research? And how do you think about that in the context of, you know, there've been different research papers around, can we teach AI to mislead us? Can we teach AI to jailbreak itself so we can study how it does that? And I'm just sort of curious, for those specific cases as well, how you think about that? So I think part of it is we're interested in AI alignment.
And the hope is that if we can figure out how to do the, like, idiomatic today problems, like is this model mean to you, or does it use hate speech, or things like that,
that the same techniques we can use for that will eventually also have relevance for the much harder problems of like, does it give you the recipe to create smallpox, which is probably one of the highest harms that we think about. And Amanda Askell has been doing a bunch of work on this on Claude's character of like, when Claude refuses, does it just say, I can't talk to you about that and shut down? Or does it actually try to explain like, this is why I can't talk to you about this?
Or we have this other project led by Kyle Fish, our model welfare lead, where Claude can actually opt out of conversations if it's going too far in the wrong direction. What aspects of that should a company actually adjudicate? Because
The dumb version of this is I'm using Microsoft Word and I'm typing something up and Word doesn't stop me from saying things, which I think is correct. Like I actually don't think in many cases these products should censor us or prevent us from having certain types of speech. And I've had some experiences with some of these models where I actually feel like it's
prevented me from actually asking the question I want to ask, right? In my opinion, wrongfully, right? It's kind of interfering, and I'm not, like, doing hate speech on a model. And so you can tell that there's some human who has a different bar for what is acceptable to discuss societally. And that bar may be very different from what I think may be mainstream too. So I'm a little bit curious, like, why even go there? Like, why is that even
a model company's business? Well, I think it's a smooth spectrum, actually. It might not look that way from the outside, but when we train our classifiers on, are you doing gain-of-function research as a biologist, and is it for potentially negative outcomes? These technologies are all dual-use, and we need to try to walk that line between overly refusing and refusing the stuff that's actually harmful.
I see. But there's also political versions of that, right? And that's the stuff that irks me a bit more: you know, where is the line on what is considered an acceptable question, right? So examples of that, that I'm not saying are model-specific, but that societally sometimes cause flare-ups, is asking about human IQ or other topics where there is a factual basis for discussion. And then often those sorts of things tend to be censored, right? And so the question is, why would a foundation model company censor or refuse to
delve into some of those areas? On things like questions about IQ, I'm not up on the details of that enough to comment, but I can talk about our RSP. So RSP stands for Responsible Scaling Policy, and it talks about how do we make sure that as the models get more intelligent, we are continuing to do our due diligence and making sure that we're not deploying something that we don't have the correct safeguards in place for.
And initially our RSP talked about CBRN, which is chemical, biological, radiological, and nuclear risks, which are different areas that could cause severe loss of life in the world. And that's how we thought about the harms. But now we're much more focused on biology, because if you think about the amount of resources that you would need to cause a nuclear harm,
you'd probably have to be like a state actor to get those resources and be able to use them in a harmful way. Whereas a much smaller group of random people could get their hands on the reagents necessary for biological harm. How is that different from today? Because I always felt the biology example is one where I actually worry less, maybe as a former biologist,
Because I already know that the genome for the smallpox virus, or potentially other things, is already posted online. All the protocols for how to actually do these things are posted online in multiple places, right? You can just do Google searches for how do I amplify the DNA of X or how do I order oligos for Y. We do
specific tests with biology experts of varying levels to see how much uplift there is relative to Google search. And so one of the reasons that our most recent model, Opus 4, is classified as ASL-3 is because it did have significant uplift relative to a Google search.
And so you, as a trained biologist, you know what all those special terms mean, and you know a lot of lab protocols that may not even be well documented. But for somebody who is an amateur and just trying to figure out what do I do with this petri dish or this test tube, or what equipment do I need? For them, it's like a greenfield thing. And Claude is very good at describing what you would need there. And so that's why we have specific classifiers looking for people who are trying to get
this specific kind of information. And then how do you think about that in the context of what safety research should not be done by the labs? So if we do think that certain forms of gain-of-function research or other things probably aren't the smartest things to do in biology, how do we think about that in the context of AI? I think it's much better that the labs...
do this research in a controlled environment. Well, should they do it at all? In other words, if I were to make the gain-of-function argument, I would say, as a former biologist, I spent almost a decade at the bench and I care deeply about science. I care deeply about biology. I think it's good for humanity in all sorts of ways, right? In deep ways. That's why I worked on it. But there's certain types of research I just think should never be done. I don't care who does it. I don't care about the biosafety level. I actually don't think it's that useful relative to the risk. In other words, it's a risk-reward trade-off. And so what sort of
safety research should never be done, in your opinion, for AI? I have a list for biology: I don't think you should pass certain viruses through mammalian cells to make them more infectable, or do gain-of-function mutations on them. Today, it's much easier to contain the models, probably, than it is to contain biological specimens. You sort of offhandedly mentioned biosafety levels. That's what our AI safety levels are modeled after. And so I think...
if we have the right safeguards in place, we've trained models to be deceptive, for example. And that's something that could be scary, but I think is necessary for us to understand, for example, if our training data was poisoned, would we be able to correct that in post-training?
And what we found in that research, in a paper that we published called Alignment Faking, is that that behavior actually persisted through alignment training. And so it is, I think, very important for us to be able to test these things. However, I'm sure that there is a bar somewhere. Well, what I found is that often the precedents that are set early persist late,
even though people understand that the environment or other things will shift. And by the way, I'm in general against AI regulation for, you know, many different types of things. You know, I think there are some export controls and other things that I would support, but in general, I'm
pro letting things happen right now. But the flip side of it is I do think there are circumstances where, if certain research is done early, people won't necessarily have all the context to not do it later. I think training an AI to be deceptive, or a model to be deceptive, is a perfect example. That's a good example where years from now, people may still be doing it because it was done before, even if the environment shifted sufficiently that it may not be as safe as it used to be. And so I found that often these things that you do persist in time, just organizationally or
philosophically, right? And so it's interesting that there was no like, we should absolutely not do X type of research. I guess to be clear, I am not on the safety team anymore. I guess I was a long time ago. I'm mostly thinking about how do we make our models useful and deploy them and make sure that they meet a basic safety standard for deployment.
But we have lots of experts who think about that kind of thing all the time. Cool. Thanks for talking through that. That was very interesting. I want to change tacks a little bit to, well, you know, what's coming after Claude 4? Any emergent behaviors in training that change, like, how you're operating the company or what product you want to build? You're running this Labs organization, so it's kind of the tip of the spear for Anthropic, or what the safety org does. Just, like,
how does what is coming next change how you guys are operating? Yeah, maybe I'll tell a short story about computer use. Last year, we published a reference implementation
for an agent that could click around and view the screen and read text and all that stuff. And a couple of companies are using it now. So Manus is using it and many companies are using it internally for software QA because that's a sandbox environment. But the main reason that we weren't able to deploy
a sort of consumer level or end user level application based on computer use is safety. Where we just didn't feel confident that if we gave Claude access to your browser with all your credentials in it, that it wouldn't mess up and take some irreversible action like sending emails that you didn't want to send or in the case of prompt injection,
some worse credential leaking type of thing. That's kind of sad because in its full self-driving mode, it could do a lot for people. It is capable, but the safety just wasn't good enough to like productionize that ourselves. While that's very ambitious, we think it's also necessary because the rest of the world isn't going to slow down either. And if we can sort of show that it's possible to be responsible with how we deploy these capabilities and also make it extremely useful, then that raises the bar.
So I think that's an example where we tried to be really thoughtful about how we rolled it out. But we know that the bar is higher than we're at right now. Maybe a meta question of how do you think about competition and the provider landscape and how that turns out? I think our company philosophy is very aligned with enterprises. And if you look at like Stripe versus Adyen, for example, like nobody knows about Adyen.
but at least most people in Silicon Valley know about Stripe. And so it's this business-oriented versus more consumer and user-oriented platform. And I think we're much more like Adyen, that we have much less mindshare in the world, and yet we can be equally or more successful. So yeah, I think our API business is extremely strong, but in terms of what we do next and our positioning,
I think it's going to be very important for us to stay out there, because if people can't easily kick the tires on our models and our experiences, then they won't know what to use the models for. Like, we're the best experts on our models, sort of by nature. And so I think we're going to need to continue to be out there with things like Claude Code. But we're thinking about how do we really let the ecosystem bloom? And I think MCP is a good example of that working well,
whereas in a different world, sort of the default path would have been for every model provider to do its own bespoke integrations with only the companies that it was able to get bespoke partnerships with.
Can you just pause and explain to the listeners what MCP is, if they haven't heard of it? Because it is an amazing ecosystem-wide coup here. MCP is Model Context Protocol. And one of our engineers, Justin Spahr-Summers, was trying to do some integration between the model and some specific thing for the nth time. And he was like, this is crazy. There should just be a standard way
of getting more information, more context into the model. It should be something that anybody can do, or maybe even, if it's well-documented enough, then Claude can do it itself. The dream is to have Claude be able to just write its own integrations on the fly exactly when you need them and then be ready to roll.
And so he created the project. And to be honest, I was kind of skeptical initially. I was like, yeah, but why don't you just write the code? Why does it need to be a spec and all these SDKs and stuff? But eventually we did this customer advisory board with a bunch of our partner companies. And when we did the MCP demo,
jaws were just on the floor. Everybody was like, oh my God, we need this. And that's when I knew he was right. And we put a bunch more effort behind it and blasted it out. And shortly after our launch, all the major companies asked to sort of
be in the loop with the steering committee and asked about our governance models and wanted to adopt it themselves. So that was really encouraging. OpenAI, Google, Microsoft, all these companies are betting really big on MCP. This is basically an open industry standard that allows anybody to use this framework to effectively integrate against any model provider in a standardized way.
MCP, I think, is sort of a democratizing force in letting anybody, regardless of model provider, and any long-tail service provider, which might even be an internal-only service that only you have, integrate against a fully fledged client, which might look like your IDE or your document editor. It could be pretty much any user interface.
And I think that's a really powerful combination. And now remote too. Yes, yes. So previously you had to have the services running locally, and that kind of limited it to only being interesting for developers. But now that we have hosted MCP, sometimes called remote MCP, the service provider, like Google Docs, could provide their own MCP server, and then you can integrate that into Claude.ai or whatever service you want.
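To give a flavor of what building on MCP looks like in practice, here is a minimal sketch of a tool server using the official Python SDK's FastMCP helper. The server name and tool are made-up examples, and the exact API surface may differ across SDK versions.

```python
# Minimal sketch of an MCP server exposing one tool via the Python SDK's FastMCP helper.
# The tool is a toy example; any MCP client (an IDE, a chat app, a remote integration)
# could discover and call it in the same standardized way.
from mcp.server.fastmcp import FastMCP

mcp = FastMCP("notes-server")

@mcp.tool()
def search_notes(query: str) -> str:
    """Search an internal notes store and return matching snippets."""
    # Placeholder: in a real server this would hit your internal service or database.
    return f"No notes found for {query!r} (stub implementation)."

if __name__ == "__main__":
    # Runs over stdio by default; hosted/remote transports are also supported by the SDK.
    mcp.run()
```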
Ben, thanks for a great conversation. Yeah, thanks so much. Thanks for all the great questions. Find us on Twitter at NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.