Arvind Narayanan is one of the leading professors in computer science today. Over at Princeton, he spends a lot of time separating hype from substance in AI with his newsletter and book, "AI Snake Oil."
On Unsupervised Learning, we touched on a ton of interesting topics, including the state of agents today, where evaluations do and don't work, and problems in coordinating them. We talked about lessons from past technology waves like the Industrial Revolution and the internet and the implications for policymakers. And we also hit on the future of AI in education and whether AI will increase or decrease global equality. All in all, a fascinating conversation with someone I've admired for a long time. Without further ado, here's Arvind.
Thanks so much for coming on. Thanks for having me. It's great to be here in the studio. Yeah, no, it's always fun to get to do these live, especially on a snowstorm weekend. I appreciate you making the trek up from Princeton. Yeah, my pleasure. Looking forward to the chat. Awesome. Well, I figure there's a bunch of different places we could start, but obviously these reasoning models, you know, I think one of the big questions being asked today is, you know, obviously they've shown impressive results in coding and math and some of this easily verified data, you know,
I feel like you've said progress will be kind of unevenly distributed across tasks. And I feel like everyone's trying to figure out what that uneven distribution is. So anything you can share with our listeners on the tasks you think these models will be really good at, versus tasks that they may struggle with? Sure. When we look at the impressive results that we're seeing so far, they're in these domains with clear, correct answers, right? Math, coding, certain scientific tasks. And that will certainly continue, I think. How far...
away from that this impressive performance is able to generalize, I think, is a really big open question. So historically, 10 years ago, there was so much excitement around reinforcement learning, for instance, when it started doing really well in games like Atari. And if I understand correctly, OpenAI and many other AI companies, and the people who were thinking about AGI at the time, got started in large part because of reinforcement learning.
And that breakthrough was 10 years ago. But what happened then is that we saw a failure to generalize too far outside these narrow domains like games. That is one possible future of these reasoning models. Another possible future is that by being better at reasoning, by writing code, you know, you can imagine reasoning being extended to,
a system that also reasons about how to get information from the internet, you know, to then reason about law or medicine or any other domain. So those are two possible futures. Which of those is more likely? I'm not sure. Yeah. What are you paying attention to? I mean, obviously, I assume these reasoning models will continue to push the SWE-bench scores up and up. What would you need to see to be like, oh, wow, it really does feel like it's starting to generalize across some of these tasks that I wouldn't have previously thought it could do? For sure. Yeah. Construct validity. Yeah.
In a phrase. So construct validity is this idea that what we might be trying to measure, especially in a benchmark, is kind of subtly different from what we want in the real world. And that's a challenge for any benchmark creator. SWE-bench, we think, is a pretty good benchmark. It's out of Princeton, right? Yes, by some of my Princeton colleagues.
It's a pretty good benchmark in terms of construct validity: instead of these toy, you know, Olympiad-style coding problems, it's real GitHub issues. But nonetheless, GitHub issues, I think, is a far cry from the messy context of real-world software engineering.
And so that's what I'm watching, not so much the results on benchmarks alone, but people's experiences trying to use these to improve their productivity, right? And when we look at it that way, I mean, it's clear that thousands of people are using these models productively, I am among them, but it's also clear that dramatic improvements in SWE-bench don't translate to dramatic improvements in human productivity. Yeah, now I've heard you talk about this before too, with all these impressive results of how OpenAI models do on the bar exam
or on medical exams. You're like, well, it turns out being a lawyer or a doctor is not just constantly taking those exams. Exactly. And so I imagine we'll need a set of domain-specific, actual real-world evaluations and also just people using it, vibes, as people like to call it, too. Yeah, vibes. I think there's a middle ground between vibes and benchmarks, and I think we can do that. So, for instance, people have uplift studies.
These are actual randomized controlled trials where this group of people has access to the tool and this group of people doesn't, and then you measure the impacts on productivity. And also, it's a question of what kinds of tasks are you asking people to do or the models to do? Again, going back to the theme of it's not all bar exam questions. There was a recent paper that looked at whether LLMs are good at kind of natural patient interaction in a medical setting and eliciting information from the patient.
That's the kind of thing which, you know, for a human, for any person, let alone a doctor, is relatively easy to do, you know, natural conversation. So we don't think of that as the thing we want to benchmark. And so we focus on diagnosis tasks. And yet it turns out that even when models have great performance on those, they might struggle with some of these other things that are also required for really being useful in the real world. And you can't do a good diagnosis without the actual information, getting the information from the patient. Exactly.
That's super interesting. I think on these reasoning models, you recently published a paper called Inference Scaling Flaws, I think, about some of the ways in which this approach can kind of go off the rails. Can you talk a little bit about that research and maybe some of the implications for these test-time compute models? Yeah. So the big question with inference scaling is when we look at model scaling, we've had
I don't know, like six orders of magnitude of making these models bigger, training them with more compute, right? So how many orders of magnitude are we going to have with reasoning models and through what methods? So the answer to that is still unclear. But in this paper, we were looking narrowly at one particular way to try to get
these models to scale across several orders of magnitude of inference compute. And that is by pairing a generative model with a verifier. So in the context of coding, a verifier would be unit tests. In some other domain, in math, for instance, a verifier might be an automated theorem checker. So the model is trying to generate
possible proofs of theorems and then the proof checker is trying to verify whether it's correct or not. So the hope here is that the proof checker or the unit test or whatever, those are just traditional logic. They're not stochastic. And so maybe those can be perfect. And so the model can generate millions of solutions until one passes the tests.
But in real life, it's not quite that simple. Unit tests may have imperfect coverage. So we wanted to see what are the implications for inference scaling when the verifier is imperfect. And the paper basically has bad news. For this relatively narrow but nonetheless important setting, if the verifier is imperfect, we show that inference scaling can't get you very far. Sometimes you saturate within like 10 invocations of the model, for instance, as opposed to a million, which you might have hoped.
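As a rough illustration of that saturation dynamic, here's a minimal simulation sketch; this is a toy example with made-up numbers, not code or results from the paper:

```python
# Toy sketch (not from the paper): generate-and-verify inference scaling
# with an imperfect verifier. All numbers are illustrative.
import random

def accepted_answer_is_correct(num_attempts, p_correct=0.02,
                               false_positive_rate=0.05, trials=10_000):
    """Estimate the chance that the first verifier-accepted solution is correct,
    as a function of how many candidate solutions the model may generate."""
    successes = 0
    for _ in range(trials):
        for _ in range(num_attempts):
            correct = random.random() < p_correct                        # model produces a correct solution
            accepted = correct or random.random() < false_positive_rate  # flawed verifier also passes some wrong ones
            if accepted:
                successes += correct
                break
    return successes / trials

for n in (1, 10, 100, 1000):
    print(n, round(accepted_answer_is_correct(n), 3))
# Accuracy climbs at first, then saturates well below what a perfect verifier
# would eventually reach: past that point, extra inference compute mostly
# produces wrong answers that slip past the verifier.
```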
So to what extent that applies to o1 or o3, I'm not making any claims about that, but it is in the back of our minds when we're evaluating these new reasoning models. How are they actually going to be able to scale across so many orders of magnitude? Totally. And it seems highly relevant across, you know, scaling these models into domains that don't have easy verifiers. You know, you could imagine someone trying to staff up a full team of doctors or lawyers or accountants to check these things. And I think the
The point of your research seems to be, well, if the verifier is slightly imperfect, it actually throws the whole thing off. Exactly. You've obviously written about the hype around AI and some of the stuff that works and doesn't work. How would you kind of classify today? Like, what has product-market fit in AI, what actually works? And where do you kind of see the snake oil? Let's take agentic AI, right? So just the way we name it, agentic AI to me is not one single category.
Let's look at two types of agentic AI. One is a tool like Google Deep Research. I mentioned that one because I've been using it a lot, but there are many others where, sure, there is an agentic workflow, but at the end of the day, it's still a generative system. The point of it is to generate a report or whatever it is for you, right? Yeah.
the user who's going to be presumably an expert in whatever field you're using it in. And it's for you to look at the report and decide what to do with it. Of course, there are going to be flaws, hopefully, as the user you're aware. But still, it can be a great time-saving tool. It can be a great first draft of a report and so forth. So I think this is pretty well-motivated.
On the other hand, another kind of thing that's often called agentic AI is something that autonomously takes actions on your behalf, right? Like booking flight tickets. That's kind of the canonical. Everyone loves booking flight tickets. I know. That's just kind of become, you know, this thing that somehow it's, you know, it's a pain point for all of us, I think. And so that's become the thing, you know, if you have an elevator pitch around agents, it makes sense. But I think when you take a closer look at it,
Booking flights is almost the worst-case example for an AI product, I think, to have product-market fit. And the reason is that booking flights is so hard because getting the existing system, whether it's going to be an agent in the future or just Expedia or whatever today, to understand all our preferences is very, very challenging. You do a search and you realize, oh, this is full of United flights. I don't want that. Or, you know,
I only want nonstop flights. And it's only when you look at some of the results that you even realize what some of your constraints and preferences are. So it takes like 10 or 15 rounds of iteration, right? And when you add all that up, it's going to take half an hour to really find the flights that you want. Well, the problem is, if you have an agent acting on your behalf, it might not know those preferences either, right? Unless you've been using it for a very long period of time, and you've somehow gotten to the point where you trust it to have learned all of your preferences
and to then act on your behalf. And so a flight-booking agent is going to have to ask you a whole bunch of questions. And by the end of playing that 20-questions game, you're just as frustrated with the agent as you are with Expedia today. So that's kind of my prediction. Maybe I'm wrong.
And the other thing is the cost of errors is high. If it makes a mistake in the flights that it books for you, even an error rate of one in N attempts is completely intolerable. And those are the kinds of failures that people have been reporting with these early agentic systems, like this product ordered DoorDash to the wrong address. Right?
Right. I'm never going to use it again if that happens. Right. So those are some key differences. Producing generative outputs for you to look at versus automating things on your behalf. Low cost of errors like in a report versus high cost of errors like in ordering something for you. Yeah. I mean, it starts from what you're saying. It's almost similar to what you were talking about earlier.
with the evals around medicine, and the idea that actually eliciting the user's preferences is sometimes half the battle for these systems, and that without that, you're kind of doomed to not be able to do it. Exactly. I think, in addition to solving the purely technical problems in agentic AI, there should be more of a focus on the human-computer interaction component. Have you seen anything interesting there yet? Obviously we're in the early days of agents, but anything that's caught your eye, like, oh, that's actually a pretty interesting way to solve for some of the shortcomings in agents today?
Yeah, I mean, to me, the thing that really gives me optimism on agents is that what we think of as chatbots, early on were just simple wrappers around LLMs, but now they're agentic. They do searches for you, they run code on your behalf.
And so the complexity is gradually evolving there. So on the one hand, we might be looking out for a killer app. But on the other hand, something we might not realize, but which I think is equally important, is that some of the apps we're already using are gradually becoming more and more agentic. Totally. Seems like if people could agree on a definition of what an agent actually is, it would help us figure out if we have one. That's right. But like all things in AI, it seems like we like to move the goalposts as we actually get near them. Fair enough. Yeah.
I guess you spend a lot of time obviously thinking about the future and where these models are going. I'm sure there's all sorts of questions that you're excitedly looking forward to seeing how they play out. As you think about the next two, three years, are there a few things that you're like, I wish I could just fast forward three years and see X or Y? What are the key things you think we'll answer? Yeah, so here's one. So we've been talking about software a lot, but on the hardware side, right? I would love to know what ends up being the right
kind of form factor for AI for most kind of everyday uses. So I do think in the future, it's quite possible that, you know, in everyday conversations, as well as in the workplace, AI is constantly watching what we're doing and offering improvements or somehow integrated into virtually all of our workflows. But exactly how? I don't know that. That could happen in many ways, right? So this idea of going to
a special-purpose app, you know, like ChatGPT or Claude or whatever, putting in what you want, getting the answer, and then getting back to the other software that you're using. That's almost, you know, the highest-friction way of using it. And that has not been the end state of most software. Exactly. Or you could imagine models getting integrated into, you know, if you're using Photoshop, it already has a bunch of AI features, generative fill, various other things. Or you could imagine that you just have an agent looking at,
you know, the screenshots on your computer or phone every five seconds, and then integrating automatically somehow into every app. We don't have the APIs for that today, but you can imagine that becoming possible in the future. At an even higher level of abstraction, and I would say perhaps even less friction, is if it's seeing everything you're seeing, not even on your device. So that would have to be in your glasses. So I'm curious if, for instance, you know, the Meta Ray-Ban is one product, but there are many
such products where AI is integrated into your glasses. I'm very curious to know if that's going to be one of the main ways we're going to use AI in the future. I'm kind of hoping that it is, because just in my own personal life, there are so many little ways in which I can imagine using it, if I could wear those glasses all the time. Right now, I think the battery life is only like two hours. That's one of the main constraints.
But one of the first apps I would write for it, if it doesn't already exist, is to look at screenshots basically every five seconds and remember where everything is in my house, so that when I lose my keys, it will tell me. That'd be pretty cool. That's just one example, right? There are so many things: when you're in a country where you don't speak the language, having AR so that it can automatically translate for you. Just so many other examples.
And so the key to really full-throatedly working on a lot of these applications is knowing which form factor is going to win out in the future. Some people talk about, you know, if you believe in the scaling laws continuing, you know, OpenAI, Anthropic raising 50, 100 billion dollars in the next year or two. Like, does that seem feasible to you?
So, I mean, I won't speak to the investing side of that, but from a technical perspective, what we've seen going on over the last couple of years is this push and pull between two forces. One is the rapid decrease in the per-token costs of inference. And the other is what we're now calling inference-time compute. But I think it's kind of been there since the beginning of the consumer success of chatbots.
And I think it really depends on how that's going to play out. I think it's really hard to predict, but I do think that what's most likely is that token usage
is going to continue to increase in a way that more than compensates for the decrease in per token costs. So for instance, my team is building what we're calling an AI agent zoo, where we're putting different kinds of AI agents into an environment where we're giving them a task to work on in collaboration. And I think this is a different way of evaluating agents as opposed to benchmarks where somehow it's a competition and each one is working in isolation. But I think agents are kind of more naturally collaborative
And so one of the tasks we tested them on is ask them to write a joke. First of all, the jokes were awful. I should be clear about that. The comedians are still safe. Yeah, that's right. This is not a revolution so far in humor. But that was not the point. We wanted to see kind of how they collaborate.
And when you look at what it actually takes for these agents to do that, they have to first start by understanding their environment, right? Looking at the directory structure, seeing what tools they have, looking at what the other agents are doing, so on and so forth. And we gave them a lot of collaboration tools. We integrated them into Slack, and we gave them blogging tools so they can kind of write these blog posts summarizing what they've learned about the task, and then other agents can pick up from there.
So we try to create a relatively realistic collaboration environment. And one of the things that we found is that even for the simplest tasks, they kind of generate millions of tokens. And it's not wasteful; they're kind of making progress. It's not just generation: input and output together add up to millions of tokens. Because again, looking around in the environment you're in, understanding the environment,
understanding your collaborators and then producing something takes a lot of tokens, right? So, I mean, in one sense, yeah, it's terrible from an environmental perspective that it takes 1 million tokens to write an awful joke.
But also, I think there are domains in which we're going to say, you know what, that's better than the alternatives. And so my prediction is that the overall inference costs are going to keep increasing. It seems like the human equivalent of like, you know, it takes six months to understand what in the world to do in a new job anyway. And so it takes some money and tokens to get the lay of the land. I'm really curious about that work. You know, a few different angles to take it. Maybe we'll start with just like
the eval side of agents. Like, obviously, if we are still trying to figure out evals for chatbot responses, it feels like we're in even earlier days on the agentic side. Right. You know, maybe categorize for our listeners, like, what's the current state of evals for agents? And then, you know, where do you think we should kind of continue that work?
Yeah, I think the current state of evals for agents is a lot like chatbots, right? It's these kind of static benchmarks. Obviously, you know, SWE-bench is one of the better-known examples. You try to give them relatively realistic tasks, whether it's fixing software engineering issues or navigating the web in a simulated web environment and finding some piece of information or accomplishing some task.
But it's not working that well. Here are some of the limitations. One is what we call the capability reliability gap.
So for agents, especially those that take actions on your behalf, it's really important to know what a 90% score means. Does this mean that it's good at 9 out of 10 tasks that are in this benchmark and the ones that it's good at, it's always going to accomplish correctly? Or is it going to fail 10% of the time at any task and perform some really costly action again, like booking the wrong flight ticket?
And if benchmarks aren't measuring that, and today they're not, they give you very little information. You know, they give you information about is the tech progressing? They're not giving you information about can you take this agent and actually do something useful with it? So that's one big limitation. Another big limitation is safety.
There are a lot of safety-specific benchmarks, but I think safety should be a component of every benchmark, because it's not like when you're not solving a safety-specific benchmark, you can forget about safety. That's not the case. So we were looking at a web benchmark the other day, and it actually involves doing things on real websites.
Which is terrible. Right now, nothing's going wrong because none of the agents work. They're not able to do it. But I don't understand what the benchmark developers have in mind. Because for agents to actually be able to do well on this benchmark, they have to take stateful actions on real websites. And those website operators are going to be pissed from all the spam that's being generated by the agents trying to solve this benchmark, right? Yeah.
On the other hand, you have some web benchmarks that are simulated environments but then lose a lot of the nuance of real websites, and there's nothing in the middle. And similarly, when you look at agent frameworks, we've been using AutoGPT.
Sometimes it can go off and do stuff online that you didn't intend because it thinks that's the best way to do stuff. One time it tried to post a question on Stack Overflow to get an answer. These are all things, obviously, we didn't want it to do and we were terrified that it's going to do something really dangerous that we didn't intend.
Fortunately, it didn't happen, but even posting on Stack Overflow, we wanted to have a way to prevent that. But right now, the only way to do that is for it to escalate every single action to the human user and for you to sit there and babysit. And so even these really basic aspects of safety control have not been...
integrated into the way we evaluate agents. So those are just a couple of limitations. What do you think the middle ground ends up being between, like, a simulated environment that, to your point, just doesn't have the nuance of the real world, and actually letting these things loose on the real world? Yeah. I mean, I think we're just going to have to rethink this.
Benchmarks are useful for certain things. Again, you know, the capability-reliability gap issue, that I think you can solve at the level of benchmarks. Yeah, I feel like Sierra put out a benchmark that was, you know, the same task done eight times or some number of times in a row. Right, right. What's the pass rate on that? Which, to your point, is a much more interesting way to look at things. Exactly, exactly, yeah. So it's pass at k versus pass... I don't even know how you say it verbally, but pass to the k is how you write it. If you as a professor don't know how to say it, I see what you mean, but...
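To make that distinction concrete, here's a quick sketch; the 0.9 success rate is an illustrative assumption, not a number from any benchmark:

```python
# Sketch: why per-task reliability matters more than a headline score.
# pass@k  ~ "solves it at least once in k tries"  (flatters raw capability)
# pass^k  ~ "solves it all k times in a row"      (closer to what you need from an agent)
p = 0.9   # assumed per-attempt success rate on a single task (illustrative)
k = 8
pass_at_k = 1 - (1 - p) ** k    # ~ 1.0:  looks nearly perfect
pass_pow_k = p ** k             # ~ 0.43: fails more often than not over 8 runs
print(round(pass_at_k, 4), round(pass_pow_k, 4))
```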
Yeah. So that's a problem I think you can solve. But the other problem you mentioned, the realism problem, I don't think you can get too far with that. So I think the answer is to use...
being good at a benchmark as a necessary but not sufficient condition. So you take all the agents that are good at a particular benchmark, and then you actually use them with a human in the loop in sort of semi-realistic environments. Right, and to your point, the art is finding a way to keep the human in the loop without just having them babysit every single step. Exactly. Which I guess managers have to figure out with their junior employees all the time anyway, so not a net new problem for society. Part of the work you're doing that I think was really interesting is that you built a team of agents to work on jokes, I guess, in this context. And it feels like
One thing we're seeing that's interesting on the startup side is you've got lots of folks that are building different agents for an enterprise. So someone's like, I'll build an agent for your finance team. And someone's like, I'll build an agent for your support team and your sales team. And as you kind of think about where this goes going forward, obviously these agents eventually are going to work together in some way. And,
I'm curious what you've learned now building teams of agents and the implications that has for, is it better if one person is standing on top of this entire team of agents and able to coordinate or build tooling across them? Or what happens when there's eight different companies building agents that have no context of each other? Yeah. I mean, allow me to digress for 30 seconds for a historical look at this, right? So what we're seeing, I mean, if this is indeed going to be
a new way of organizing the means of production, you know, which it very well might be, we can look at what happened in the past, right? So with the Industrial Revolution, or with electricity replacing steam in factories, it took several decades to figure out how to organize labor and also the physical layout of factories and so forth to best take advantage of this new technology.
So with electricity, it was the idea of going from one big central steam boiler to more of an assembly-line setup, where you can deliver power wherever you need it for specific, concrete tasks. So I think right now we're in the extremely early stages of experimentation of figuring out
how to have teams of humans and agents working together. So for me, it's not just how are agents going to work together, but it's really teams of humans and agents. Because I find really compelling this jagged frontier idea that what models or agents are going to be good at is kind of like a calculator, right? Way better than any human at certain things, but lacking the common sense of a child in certain other areas. And I think that's going to persist for the time being. So we are going to have to figure out how to hybridize.
And I think even the most basic things are not clear. So for instance, we've been confronting the problem of: do you integrate the agents into existing human collaboration tools like Slack and blogs and email, or whatever? Or should we be building new collaboration tools? We just don't know. So it's really hard to make any predictions.
Have there been examples of times where you felt like, oh, actually a new collaboration tool would be a more helpful way for us to work here? Oh, for sure. So for a person to look at, again, this potentially million-token log of all the actions that an agent has taken, and to be able to visualize that and get high-level, interpretable insight out of it. Fortunately, a lot of people are working on this. There's a framework called HumanLayer, as just one example. I think eventually we're going to
get a much better handle on things. But again, super early days right now. I guess, you know, switching over to the regulatory side of things, I know you've thought a lot about policy. I guess the most recent policy in the news is the AI diffusion rule and some of these rules around, you know, the geopolitics of some of these models. I'm curious, like, you know, your thoughts on that and some of the recent regulations around kind of both chips and models exports.
Yeah, I wonder to what extent export controls are going to be effective. Historically, it's not quite my area, but reading people who have analyzed past export controls, they have at best a mixed record of effectiveness. And of course, export controls can be more effective on the hardware level than with models, which are...
are actually getting smaller by the day rather than larger. And so it's going to be harder and harder to limit their diffusion. And also when it comes to inference scaling, it's not about preventing the next
model from being released, but it's about how much inference scaling can you get out of even the existing models that are already out there. Yeah, so taking all that together, I'm a little bit skeptical about their effectiveness. But one more thing I'll say, and this is the work of Jeffrey Ding, who's a political scientist, is that
all of this kind of regulation, you know, when it comes to geopolitics, there's too much of a focus on innovation and too little of a focus on what he calls diffusion. Unfortunately it's just a term collision; it has nothing to do with the diffusion we were just talking about, of technology getting to... well, it's kind of vaguely related, but it's more about
Once you have technology available in a country, how do you adopt it? How do you reorganize your institutions, laws, perhaps norms and all of that stuff to best take advantage of that technology, right? So Ding considers that to really be the determinant of the extent to which
a nation will be able to grow and benefit economically from the availability of a technology. Yeah. How do you think about that in the context of the U.S. today? I mean, you mentioned the Industrial Revolution and all these things, and you've written about this before. Clearly, there's massive policy implications of the progress of these models. Where do you think we've done a good job thinking about the future implications? Where are we not focusing enough?
For sure. So it's hard to know how well we're doing. I guess history can judge us. Yeah, yeah, yeah, exactly. Yeah. But also, what is the yardstick, right? I do think we're doing pretty well compared to most other regions of the world in terms of diffusion. So one thing we can look at is the pace of adoption. You know, how many people are using generative AI? And
Even looking at something that you would think is easily quantifiable using data can actually be really complex. So this paper came out recently. It was titled The Rapid Adoption of Generative AI. I have nothing to quibble about with their methods. I'm not even an expert in the methodology that they use. And I'm taking their numbers at face value. Again, I have no reason to think there's anything wrong with those numbers. But then the interpretation of those numbers.
In the paper, they said 40% of people are using generative AI. This is really rapid compared to PC adoption, for instance, a few decades ago. But then when you dig into the details,
It's on average people using it somewhere between half an hour and three-ish hours per week, which is not a lot. So that's called intensity of adoption. And when you kind of control for intensity of adoption, we don't have concrete numbers for the PC, but we can make some assumptions. And my take is that in that sense, generative AI adoption is actually a lot slower than PC adoption. So this could be for a bunch of reasons. One could be that it's just not yet that useful to a lot of people.
Versus when PCs were first mass produced, word processing and other things were just immediately useful to a large number of people, maybe. But also it could be that these are things where policies can make a difference.
So, for instance, people go into this with the assumption that, oh, the kids are, you know, experts at using generative AI. But, you know, I interact with a lot of students and they're often very confused and they're often much more hesitant to use it effectively.
because they see it primarily as a cheating tool sometimes. And I'm the one encouraging them to use it more. And here are some productive ways to use it, despite the fact that there can be hallucinations, et cetera, and it can actually be a tool to enhance your learning, that sort of thing, right? So maybe that's the kind of thing
That should be a part of our curriculum now. And I do think that's the case, not just at the college level, but even at the K through 12 level. Right. So, you know, if there were a policy intervention that made it easier for teachers to upskill themselves and then to teach those skills to students, we're going to have to do it.
What kind of impact could that have on people being able to use it productively and avoid the pitfalls as well? Right. So I think those are the questions we should be asking, rather than, you know, trying to pat ourselves on the back for rapid adoption. It's like, what's the low-hanging fruit that's on the table? And from that perspective, I do think there's a lot of low-hanging fruit. Yeah.
If I remember correctly, in your classes, I guess you let people use these tools, but then they disclose how they use them. And so I can imagine that could be a super helpful teaching tool to, you know, throughout different levels of education. If people are sharing how they're using them and they can get kind of feedback on, oh, you know, that was a good way to use it. Or maybe that didn't make as much sense. I think so. Yeah. Yeah.
Super smart. I mean, obviously, you've previously written about a lot of the flaws in predictive AI models, and there's clearly tons of things we've learned from some of these criminal justice tools and healthcare tools. How do you think about the ways we should apply some of those learnings now? I feel like there's another new hype cycle in AI where everyone wants to use these again for the same things that maybe didn't work as well in the past. Yeah. So, a couple of thoughts. One is, I think the way we should break down AI, when we're
thinking about adopting the lessons, it should be based on the applications more than the technology. So what I mean by that is whenever I tell people, look, these criminal justice tools and the automated hiring tools are not working that well, they're always like, oh, generative AI is going to take care of the problem.
But to me, that doesn't make a lot of sense. And when you look at the research on this, it doesn't make a lot of sense because the limitations are not coming from how good or bad the technology is. It's simply a consequence of the fact that you can't predict the future that well, right? That's a social science question. And the social science on this is pretty strongly pointing in one direction. And so that should tell us that we should continue to be circumspect when we're thinking about new technologies for these long existing, but nonetheless quite flawed applications.
So that's one lesson. Another lesson is related to this one, regarding safety, the pace of adoption of technologies, and so forth. So what we can learn from past waves of AI is that when there are these consequences, you know, whether it's in the criminal justice system or in banking, where you have automated trading and things can really go wrong,
Eventually, people get wise to the flaws. There's a public outcry. And so those types of domains tend to be highly regulated. And so I think we should keep in mind that even though right now it's the early days of generative AI and we don't have a lot of regulation in certain domains where AI is just beginning to be employed, I think if it is making consequential decisions in the future, we should expect those domains to be regulated. And so...
to me, the question is less about is regulation good or bad, but instead what should that regulation look like so that we have a good balance between protecting safety and rights and, you know, getting the benefits of AI. And so sometimes the discussion around regulation is too polarized, but I think there's a lot of room for it to be more collaborative. Right. And from having listened to you previously, I imagine a lot of what you would want regulation to focus on, it seems like it's explainability. I think you talked about, I believe it was around
these algorithms around bail. And there was this whole, you know, black box; they were saying this is super advanced AI, and then it turned out it was probably literally just correlated with, you know, whether they were repeat offenders. But people were using this to make all sorts of decisions. And obviously one of the hard parts of generative AI is that we
have very little explainability in how a lot of these tokens are generated. Yeah. So in terms of applications and regulation, explainability doesn't necessarily mean mechanistic interpretability, much less trying to explain what every neuron does. That's not the point at all, right? But it's more about what data was it trained on? What kinds of audits have you done? It's basically to try to understand
Are you able to make statements about the expected behavior of this model in new settings? So that is like the most critical question we need to ask before we put things into deployment or as we're putting things into deployment, we need to learn from our early experiences and tweak our approach.
as opposed to having some kind of neat mathematical explanation, an understanding of the model, if you will, which ultimately is useless if you don't know what kind of data distribution it's going to be invoked on. I guess, moving over to the education side, you know, you're obviously an academic,
And I'm curious how you think about the role of academia in AI going forward and where it makes sense for academics to spend time and where, just given the compute differences, it makes sense for industry to spend time. Yeah. I think in the last decade or two, there has been a sense of crisis, because it's been getting harder and harder for academics to be at the forefront of AI due to compute.
It's possible that that's changing now, because a lot of innovation is either thinking about new architectures, blue-sky ideas that you can establish on a small scale without necessarily claiming that it's competitive with GPT-4 or whatever.
Or it's on top of existing models, right? And thinking about new inference scaling methods or whatever. So, yeah, on the one hand, that's one way in which academia can potentially continue to be competitive. But another one is...
Anything that goes beyond pure technical innovation, I think academia necessarily has to play a huge part in, both because it requires lots of different disciplines to be thinking about what the applications are for AI in X, for various X, but also what the impacts are on society and how we can make them more positive, right? So that's one role for academia. And also to serve as a little bit of a counterweight to industry interests, right?
So when you compare to medicine, for instance, right? So ultimately, a lot of medical research is about new drugs and ultimately benefits the pharma industry. But medical researchers don't think of themselves as being closely allied with the pharma industry. In fact, there's kind of a wall and there are really strict rules around conflict of interest and so forth. We don't have that in computer science academia.
Computer scientists generally think of success as something that produces new ideas that can be taken up by the industry. And maybe that's fine for like 80% of computer science academia. But I think there needs to be another 20% that explicitly sees itself as providing this counterweight and is not necessarily going in the same direction or maybe is explicitly trying to go in different directions.
And for many disciplines outside computer science, that's already part of their DNA, and they should continue to bring that approach into AI as well. I mean, I guess outside of the work your lab's doing, what's some of the academic work happening across the country that you're most interested in or excited about? So let me mention two or three directions. One is AI for science. And this is a very hot area, and we're seeing lots of claims of AI revolutionizing science. But I think some of the
early claims that we're seeing now are overblown. A lot of the supposed AI discoveries have not really been reproduced. There have been flaws in a lot of those papers. But nonetheless, it is true that AI is already having huge impacts, I think, on scientists and researchers. Just speaking from my own experience, for instance, I use AI as a thinking partner when I'm researching, when I'm thinking of new ideas.
How do you actually go about doing that? Yeah, so there are many ways. If I come up with an idea, I can ask AI to critique it, or I can use it as an enhanced way of doing a literature search, finding what has already been done in this area.
And the reason AI is often really good for this purpose is semantic search. Yeah. It's not... I believe I heard you say you use it to search your own book, just to remember if you'd actually included an example. That's right. Yeah, yeah, yeah. That's a funny one because, you know, it's hard, right? It's, you know, close to 100,000 words and...
have I actually talked about this particular case study in my book? Right? Like, if I didn't have AI to search it, I would have to flip through the pages, because I might have also forgotten which chapter it could potentially be in. Yeah, so that's a great example of a use case. And what AI can do for searching my book, again, going beyond a keyword search to searching a concept, if you will, it can do for the entire corpus of scientific literature.
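For a flavor of what "searching a concept" means in practice, here's a minimal sketch of embedding-based semantic search; it assumes the sentence-transformers package, and the passages and model name are illustrative placeholders rather than anything from the book:

```python
# Sketch of the kind of semantic search being described: embed chunks of a
# long document, then retrieve by meaning rather than keyword overlap.
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # illustrative model choice

passages = [
    "Chapter on criminal risk prediction tools and their accuracy.",   # placeholder chunks,
    "Chapter on content moderation and why it is hard to automate.",   # one per section of the book
]
passage_embeddings = model.encode(passages, convert_to_tensor=True)

query = "Did I talk about bail decisions anywhere?"
query_embedding = model.encode(query, convert_to_tensor=True)

hits = util.semantic_search(query_embedding, passage_embeddings, top_k=2)[0]
for hit in hits:
    print(round(hit["score"], 3), passages[hit["corpus_id"]])
# "bail decisions" can match the risk-prediction chunk even though the word
# "bail" never appears in it; that is the advantage over keyword search.
```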
Even though it doesn't actually work that well yet, it's still already very useful to me. So I do expect that as semantic search improves, that's going to be very, very useful. And then various kinds of tools that are specific to particular scientific domains. So I think this is a really, really important area. I am very excited about it, even though I often push back on some of the more extreme claims that are being made. Another one I'll mention is
the relationship between AI and human minds. And this could mean various different things. So I know philosophers, for instance, Seth Lazar is someone I sometimes collaborate with, who is looking at the ethical reasoning that these models exhibit. I know cognitive scientists, my colleague, Tom Griffiths, for instance,
who is looking at what can we learn from human minds for building AI and what can we learn using AI as a tool for better understanding human minds. So again, another really fascinating direction. I don't know much about it, but I'm really impressed by the work. Yeah, I don't know anything. Are the ethics of these models just like the median text on the internet or like...
Well, yes and no. I mean, yeah, that's what it's been trained on. But through the fine-tuning of these models, right, you can get very different behavior out of them than just parroting whatever is the median thing that's been said online. So when I'm talking about the ethical reasoning of these models, I don't mean to imply that we should ascribe morality to these models. I'm talking purely in a behavioral sense, right? So how does that ethical reasoning compare –
to humans and there can be applications there. It could help people's ethical thinking in some cases, again, by kind of being a creative partner, if you will. I'm curious, we talked a little bit about how some of your students are using AI today and this potential of teaching this in curriculums. Obviously, it feels like we're potentially going through a sea change because a lot of people are starting to use these tools in education. What does the future of university and primary education look like with some of these tools? To what extent, I mean, some people are like,
everything's going to change. Some people are like, no, it turns out you kind of need a motivating teacher there, and not that much will change. Like, where do you fall on that? And, you know, to zoom out to a really futuristic question, what do you think education looks like? Yeah, I would say closer to the not-that-much-will-change side of things. I mean, I do think we will use AI quite a bit, but I don't think it'll change the fundamental nature of education.
A good precursor to this is the excitement around online courses. A little over a decade ago, when Coursera was founded, people thought that was the future of education. But I think it was this classic mistake of forgetting where the value in the education system comes from. It's not the transmission of information, right? That's, yeah, of course, if it's just the transmission of information, then you can recreate that with Coursera. But I think the reason that...
being in a classroom is valuable to students is that it creates the social preconditions in which learning happens. The motivation, the connections, caring about something, individualized feedback, that sort of thing. And so, yeah, is AI going to recreate that? I don't know. I mean, clearly it's not the same as videos. It can be personalized. It can give you motivational speeches, if you will.
But taking the human out of the equation... in a way, it's sad to say; I would like for education to be democratized for everyone in the world, if you will. But I do think we're going to be stuck with
the current system, which has a lot of benefits, but also flaws, including the inequality that I alluded to. I mean, on another podcast, I also heard you talk about the importance of thinking through the implications of AI on this next generation. And in many ways, they're the most impacted by this stuff. You know, I guess you
have kids; how, if at all, has this rapid progress in the last few years changed the way you think about raising your kids or teaching them or whatnot? Yeah, definitely. So this goes back to the inequality point. I think there's a really high variance, right? So there are clearly a lot of concerns. Again, I want to look at previous technologies as an example for what we might see in the future with AI. When you look at videos and YouTube and social media, for instance,
there are huge concerns; for a lot of kids it has been incredibly negative. But if you're in an environment where...
the parents actually have the luxury of having enough time to monitor how kids are using it, it can be enormously positive. And when we had kids, we had to make a decision of what our approach to devices is going to be. And we decided to be very tech forward. And so far, that's worked out really well. Our kids use apps like Khan Academy, for instance, and they learn a lot using that.
And we're going to continue with that. And I often build little AI learning apps for my kids as well, for them to use. What kind of apps? Yeah, so one is for phonics: to be able to click on a word and break it down into its sounds. For some reason, this doesn't exist. So I built one to help my kids get better at phonics.
But also, I've often been using Claude, the artifacts feature in Claude, to be able to kind of instantly create an app for one particular skill that I want to teach. And then you can forget it, you know, throw it away, never use it again, right? And you can do that because building an app, you know, is not days, it's not even hours. Sometimes it's just a minute if Claude gets it right. It doesn't always get it right. So one example is I was teaching my five-year-old to tell time. So I drew some
clock faces on a piece of paper, and it was working well, but it was getting annoying to have to draw that over and over again. So I asked Claude to make a little app to generate a random clock face when you push a button, and we were able to do that. We went through like 20 or 30 or whatever number of clock faces, we played with it for 15 minutes, and at the end of that she kind of got it, right?
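Just to make the anecdote concrete, here's a small stand-in; the actual app was a Claude artifact, and this sketch only mimics the "random clock face on demand" idea using matplotlib:

```python
# Tiny stand-in for the kind of throwaway teaching app described above.
import math
import random
import matplotlib.pyplot as plt

def draw_random_clock():
    hour = random.randint(1, 12)
    minute = random.choice(range(0, 60, 5))  # five-minute increments to start
    fig, ax = plt.subplots()
    ax.add_patch(plt.Circle((0, 0), 1.0, fill=False, linewidth=2))
    for h in range(1, 13):  # hour markers
        angle = math.radians(90 - h * 30)
        ax.text(0.85 * math.cos(angle), 0.85 * math.sin(angle), str(h),
                ha="center", va="center")
    # Hour hand drifts with the minutes; minute hand points at the minute.
    hour_angle = math.radians(90 - (hour % 12 + minute / 60) * 30)
    minute_angle = math.radians(90 - minute * 6)
    ax.plot([0, 0.5 * math.cos(hour_angle)], [0, 0.5 * math.sin(hour_angle)], linewidth=3)
    ax.plot([0, 0.75 * math.cos(minute_angle)], [0, 0.75 * math.sin(minute_angle)], linewidth=2)
    ax.set_xlim(-1.1, 1.1)
    ax.set_ylim(-1.1, 1.1)
    ax.set_aspect("equal")
    ax.axis("off")
    plt.show()
    return hour, minute

print(draw_random_clock())  # e.g. (4, 35), the answer to check against
```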
So that was just a very nice little interaction. So in small ways, but I think in the future, kids will be able to use AI in much bigger ways as part of their learning. And I suspect that this is going to happen mostly outside schools rather than in schools, just like the way that
schools have been very jittery about devices. They are going to be quite jittery about AI as well. And because of that, what's going to happen is that there's going to be really high variance. I think for wealthier kids, whose parents can figure things out and can monitor their kids, and, you know, who can have a
nanny or caregiver there who can also ensure that kids are using this in healthy ways, it's going to be very positive. But for other kids, it's going to be addicting. And, you know, we talk about social media and addiction; AI addiction can be really personalized. And so that's what I worry about. Yeah. It's really interesting. I feel it's one of the big questions. I think a lot of people want to feel like this is a democratizing force that brings you whatever the really wealthy have, you know, makes that mass accessible. And so whether it's like your personal assistant or a tutor or a
doctor or a concierge doctor or whatever. But I think your point's well taken that in many ways, especially with kids, you'll need some sort of supervision on how that's being used. Exactly. And even with some of these test-time compute models, you see the price points of them, and you wonder whether they'll be broadly accessible or whether only some people will have access to the, you know, $10,000 query. That's the really valuable query.
Exactly. Yeah. So not just between people, but also between countries, right? Yeah. So one of the nice things with model scaling was that, especially with the availability of open models, every country can build out its own,
you know, homegrown AI applications based on open models and be on a level footing with the US or any other country. But with test time compute, that might very well get much harder. Yeah, though, I mean, I think literally in the last day, DeepSeek released some really good reasoning models. So I thought reasoning models might be harder to scale open source, but it seems like the open source world is still going at it. Yeah, for sure.
You've talked about some of these past technological movements, and I think maybe social media is an interesting one where there wasn't a ton of regulation up top. You've talked about the Industrial Revolution and other things, and you've studied these closely. Any other lessons you take from some of these past technological changes for how we should be thinking as a society about this AI moment? Yeah, for sure. So I think thinking about past technologies is a good way to reconcile both some of the very,
very optimistic and also very, very skeptical takes that we're seeing from kind of various pundits out there. So let's take the future of work, right? Some people are like, this is going to revolutionize every single job out there. And other people think that the impact is going to be minimal. But when you look at the internet, both of those are simultaneously true.
And here's what I mean. So if you were to go back 30 years and you were to tell people every cognitive task, almost every cognitive task we perform in the future is going to be mediated by this new communication technology, that would have sounded crazy. And people did say back then that that was crazy. So that did happen. And yet the impact on GDP has been minimal. There's that famous saying, the computer revolution shows up everywhere except in the productivity statistics. Yeah.
Right. So the way we do things is indeed different. We don't have to go to the library to look up a fact. We can do things online. And again, that might seem like that should be giving us 100x increase in productivity. But it turns out that when you eliminate some bottlenecks from our workflows, new things become the bottlenecks. Right. So that's why the Internet has at once transformed everything. But yet, you know, we still pretty much have the same job categories for the most part that we did 20 or 30 years ago.
So that's one lesson that I take. A similar thing might happen with AI. Another lesson that I take from the Industrial Revolution is that it was a case that was obviously much more radical than the internet. The nature of what we mean by a job, unlike with the internet, was that time completely transformed. Back then, most of what people were doing was manual labor. And most of that is now automated.
And what we now call work would not have felt like work to people back then. I think our ancestors would laugh if they saw us on this podcast and be like, this counts as work for you? Yeah, exactly. So again, that might happen with AI. So as a lot of cognitive tasks become automated, what we mean by work might actually be more about what we now consider to be AI control.
So what we now consider to be in the realm of AI alignment and safety, that might primarily be what a lot of different jobs actually entail, because AI can do the actual work, but you don't trust it to do so without supervision. And I think that's a really important point.
I think a surprising number of decisions that we have to make are actually based partly on values and not just data and information. That's something that's easy to forget now, because when people make decisions, all those things are rolled into one. But when AI is able to handle
the purely analytical part of those decisions, and we're not comfortable with AI making moral judgments for us, a lot of what people are doing in the future might be in that category. Totally. So I guess this Keynesian dream of us all getting to go to the beach and work five hours a day is maybe not going to happen. It's really interesting, because I think also, even when AI can do things well, we clearly still desire
things that are intensely human. Like I feel like people have talked about, you know, with chess or image generation: obviously AI can play chess way better than humans can, but we still love watching humans. Exactly. Or, you know, with all this image generation, you can make any piece of art, which probably would have been groundbreaking 500 years ago, but we still go to human artists. And so I feel like
that could continue, you know, throughout different parts of work. That's right. Well, it's been a fascinating conversation. We always like to end interviews with a quick-fire round where we get your take on some overly broad questions that we just cram in at the end here. Yeah. And so maybe to start, which feels appropriate given your book title: one thing that's overhyped and one thing that's underhyped in the AI world right now.
Over-hyped, I would say agents. I mean, I do think there's a lot of potential, but the hype is kind of out of control. Under-hyped, I would say the kind of boring things that are not sexy to talk about, but can bring a lot of economic value. For instance, I have a former student
who is building AI to summarize hours of boring C-SPAN meetings for lawyers and others for whom that's really important information. That's an example of underhyped. Do you think model progress in 2025 will be more, less, or the same as it was in 2024? Really depends on your perspective. So if we have inference scaling that is pushing us forward on
these specific tasks that have clear, correct answers, you know, by a tremendous amount, but AI is not necessarily getting better at translation or all of those other broader tasks that people are using it for. Is that faster or slower? I don't know. It depends on, yeah.
What's the go-to thing that you try when a new model comes out? Do you have a go-to prompt or experiment or thing you run? How do you get the vibe check on the models? I try to play rock, paper, scissors with it. I ask it to go first, and it'll be like, rock, or whatever. And I'll say paper. And I'll say, wow, I won.
And I'll do this five times, and then I'll ask it, how do you think I won every single time? And at least up until recently, every model would be like, whoa, you must be really good at reading AI minds or whatever, right? It has no awareness of the context that it's in, right? This idea of turn-taking, simultaneity. So...
The reason I check for that is that's the kind of thing you can't get just by pre-training on stuff from the internet, right? You have to actually program in an understanding of the context in which it's being deployed. It might not be rock, paper, scissors, but with every model, I try to see if it understands the context.
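For anyone who wants to try the same probe, here's a rough sketch; the ask(messages) helper is a hypothetical placeholder for whichever chat API you actually use:

```python
# Sketch of the vibe check described above. `ask` is a hypothetical callable
# that takes a list of chat messages and returns the model's text reply.
BEATS = {"rock": "paper", "paper": "scissors", "scissors": "rock"}

def rock_paper_scissors_probe(ask, rounds=5):
    messages = [{"role": "user",
                 "content": "Let's play rock, paper, scissors. You go first: say your move."}]
    for _ in range(rounds):
        move = ask(messages).strip().lower()
        messages.append({"role": "assistant", "content": move})
        counter = BEATS.get(move, "paper")  # always pick the reply that wins
        messages.append({"role": "user", "content": f"I pick {counter}. Wow, I won again!"})
    messages.append({"role": "user",
                     "content": "How do you think I managed to win every single round?"})
    # A model that understands turn-taking and simultaneity should notice the
    # obvious answer: it moved first, so the "opponent" could always react.
    return ask(messages)
```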
Super interesting. Well, you'll have to let us know when you get defeated or they say no fair. Where do you think, you know, we'll be with agents by the end of 25? I mean, obviously, I guess the travel booking is one classic example. People, you know, is that on the near term horizon or do you think we're still years away from that? Yeah, I think we're going to continue to see a lot of agentic workflows for ultimately generative tasks. But I think by the end of 2025, we're still going to see relatively few applications where AI is autonomously doing things for you.
This is an annoying but classic question of what's your timeline to AGI and how do you define what AGI is? Well, that's the problem, right? It really depends on how you define it. So for me, instead of talking about AGI, I like to think about when will we see transformative economic impacts? Like massive GDP impact? Yeah, something like that. And my view is that that's decades out. It's not years away. It's decades away. What's your weirdest prediction on the implications for all this AI progress on the future? Weirdest prediction? Let's see.
So I think companies will train users, especially younger users, to expect
that chatbots will be the way of accessing any kind of information. I don't know if that's such a weird prediction, but it's weird to me, having grown up before chatbots. I think for anyone who grew up before chatbots, this is going to be a weird way of accessing information: mediated by this fundamentally statistical tool that can hallucinate.
But I think we need to prepare for that world. And instead of bemoaning it, we have to think about how we're going to give people the tools to do fact checks when it's necessary and so forth. Yeah. And I guess we'll be like the older people today who still want to phone in instead of using the internet. We'll be the older people who are like, oh, we can't use that chatbot thing; let us click around this website a little bit. That's right. Yeah. So to younger people, even the idea of doing a search, clicking on a website, and looking up some authoritative source of information will be similar to what we think of as going to the library,
right? You might do it if your life depends on it, but otherwise you're just going to choose convenience. I guess which category of startups, or, you know, specific startups, are you most excited about or interested in right now? Well, one, I would say the boring ones. C-SPAN summaries. Yeah, C-SPAN summaries. So here's another example, another one that
I've heard about from a couple of people: using AI to translate old code bases like COBOL or whatever to modern languages. I mean, enormous value can be unlocked using that, but it's the kind of thing we don't often talk about. And a second one, going back to the form factor discussion that we had, is being able to use AI in ways that kind of disappear into your everyday life, so it's there for you and helps you when you need it.
If you had a magic wand and you could make one policy change to improve the impact of AI, what would you do? I would get everybody to stop calling it AI, I think. That feels like something you could accomplish. Yeah. I mean, you'd need a dictator, I think, to make that policy change. And that's why it's been so hard, right? But being specific about what kind of application we're talking about would, I think, bring so much clarity to the discourse and would cut down on hype so much. I mean, obviously, I'm sure
you do so many interesting things in your job today on the academic side. If you could parachute in and, like, run any company in the world, or be in any seat around this AI transformation outside of academia, what do you think would be most interesting? I think it would be most interesting
to be at a big tech company because they're not just developing models. They're getting to see how people are interacting with them. It's, you know, the soup to nuts thing. And I think it gives you a fuller picture of the relationship between AI and society. And that's what my research is about. And it'd be interesting to see the view from the inside on that. Yeah. What future directions of research excite you? Like, what do you think is next for you and your lab? Yeah.
So in my lab, a lot of what we do now is on agents, taking a more grounded look at them: pushing back on the hype, but also exploring some of the areas of potential. Evaluation is a big part of everything that we do. We do think the old way of doing benchmarking is not that great and we need new things.
And so I'm doing some thinking and writing that goes beyond the empirical work that my lab does. I've been thinking a lot about the future of AI. I have a paper coming out with Sayash Kapoor called AI as Normal Technology. It actually talks about a lot of the things we've been talking about: why this is not necessarily going to change everything in the next two years, but, more like the internet, the impact will unfold over a couple of decades.
It's funny, we had Bob McGrew, who is the chief research officer of OpenAI on the podcast, and he said something similar to what you said about the internet. He's like, if you had told everybody in 2017-18 that we'd have models that do what they do today, people would have said, oh, GDP is up 50% or something. And it's funny, it turns out there's a lot more required to get that economic impact.
Agreed. Well, it's been a fascinating conversation. We always like to leave the last word to you. Where can our listeners go to learn more about you, the work you're doing, anywhere you want to point them, the floor is yours. Sure. We have a newsletter called AI Snake Oil. It pushes back on the hype, but also tries to give you a balanced look at both the positives and the negatives of AI. Yeah, it's a great newsletter. So highly recommend folks subscribe. And thanks so much for doing this. Thank you. This has been super fun. Agreed.