Today on the AI Daily Brief, a special interview with me on the future of AI agents. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI. To join the conversation, follow the Discord link in our show notes.
Hello, friends. Welcome back to another AI Daily Brief. I am traveling this week, so we're doing a couple episodes that'll be different. I do have my podcast gear, so I will be recording some normal episodes. But for today, I'm sharing the first part of an interview that I did with another podcast, a great one called Tool Use, a couple of weeks ago about AI agents. Obviously, this is the topic du jour. And because I was in the interviewee chair for this one, I got to riff a little bit more broadly around what I think the future of agents actually looks like than I otherwise normally would.
So what I'm going to do is share a little more than half of this episode, and then I'll send out a link to where you can find the rest of it on their feed. The guys over at Tool Use interview builders, entrepreneurs, and other folks who are actually using AI day in and day out about how they're using it. So if that's interesting to you, I highly encourage you to go check out their show. So again, today's episode is an interview with me about the future of AI agents.
Today's episode is brought to you by Vanta. Trust isn't just earned, it's demanded. Whether you're a startup founder navigating your first audit or a seasoned security professional scaling your GRC program, proving your commitment to security has never been more critical or more complex. That's where Vanta comes in. Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001.
Centralize security workflows, complete questionnaires up to 5x faster, and proactively manage vendor risk. Vanta can help you start or scale up your security program by connecting you with auditors and experts to conduct your audit and set up your security program quickly. Plus, with automation and AI throughout the platform, Vanta gives you time back so you can focus on building your company. Join over 9,000 global companies like Atlassian, Quora, and Factory who use Vanta to manage risk and prove security in real time.
For a limited time, this audience gets $1,000 off Vanta at vanta.com slash NLW. That's V-A-N-T-A dot com slash NLW for $1,000 off. This week, we're joined by Nathaniel Whittemore, also known as NLW, the founder and CEO of Superintelligent, as well as the host of my favorite daily AI podcast, the AI Daily Brief. NLW, welcome.
Welcome to Tool Use. Hey, it's great to be here. Thanks for having me. We're super glad to have you on. I guess we can kick things off. It seems like everyone has their own definition of what an agent is; there's not really one good definition. I'm curious how you define an agent and what that means to you. I actually have a super strong point of view on this. You'll find this is a common thread for me. You see a lot of hand-wringing, I think, among people who have been in AI for a long time, or who are more technical experts,
about how mangled the definition of agent is as it's found its way into the enterprise. And I actually think that we should not care about that. I think that when people, on average, are talking about agents, they're bucketing AI into two categories: AI that I have to use, and AI that does stuff for me, without me having to really tell it anything other than maybe that one first time. And obviously that's not super precise.
But I think broadly, it gets people thinking about it in roughly the right way. Particularly if you're an enterprise leader and you're deciding whether you're going to deploy assistant-style AI or agents, things really do bucket broadly into those two categories. I also think that we've so rarely had this much narrative consolidation around a single term that's at least in the ballpark. Given that everyone kind of knows this term, trying to get into the nitty-gritty distinction between agent and automation just isn't, I think, a particularly relevant pursuit. What people are looking for when they're talking about agents
is stuff that actually takes big chunks of work off the table for me, not just stuff that makes me do that work better. Yeah, I've found something similar, where people say, oh, the newest agent from OpenAI, Deep Research, which I've used and is great. And then people ask, well, what about Code Interpreter? Is that an agent? And ultimately it doesn't matter whether it's a tool or a workflow, as long as it solves a certain task for you. Through your use of them, what types of use cases are you excited for? What have you found to be actually helpful in the current state?
So we think about this a lot. The main product that Superintelligent is being hit up for right now is something we call the agent readiness audit, which is basically an agentified process of looking across an organization's workflows, procedures, and policies to help them understand what they need to do to be ready to use agents, and which agent use cases might be a good fit for them based on current capabilities.
And what often ends up getting shared with them is this: we have these grand ideas of multi-agent workflows that are orchestrated perfectly and take giant chunks of tasks off the table, and that's really just not where things are. Where things are right now is still the sort of discrete, repetitive task that you have to do over and over and over again.
And I think that the more that people and companies experiment with that in mind, the better suited they're going to be to actually taking advantage of where agents are now. I think it's going to change dramatically over the course of this year. So really single-purpose, very specific agents. The way that I think about it from a personal perspective is,
we haven't really agentified a ton of the podcast processes yet. We use AI for a bunch of them, but they're not fully automated. When it comes to building Superintelligent, though, we're in the midst of going through and totally reevaluating how everything gets done and actually trying to embed agentic workflows in how we work. So the way that we build the products is changing, based on Cursor and different approaches there.
The knowledge base that powers this agent readiness audit, for example, is a workflow that chains together a set of different agents or automations, or however you want to describe them. There's
a Zapier piece and a couple of other pieces that all add up to automatically extracting information from the web about current agent capabilities, and that happens every day (a rough sketch of that kind of daily job follows below). So we're going one by one through all the things that we're doing and just asking which parts could be supported by, augmented by, or replaced by an agent, and trying to redesign on that basis.
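For readers who want a concrete picture, here is a minimal sketch of what a daily ingestion job like that could look like. Everything in it is an assumption for illustration: the source URLs, the schema fields, the SQLite storage, and the OpenAI-compatible client are placeholders, not the actual Superintelligent/Zapier setup described above.

```python
"""Sketch of a daily "agent capabilities" ingestion job (illustrative only)."""
import json
import sqlite3

import requests
from openai import OpenAI

# Hypothetical sources to scan each day for news about agent capabilities.
SOURCES = [
    "https://example.com/agent-news",        # placeholder URL
    "https://example.com/agent-changelogs",  # placeholder URL
]

SCHEMA_PROMPT = (
    "Extract a JSON object with keys: agent_name, what_it_does, use_cases, "
    "industries, compliance_notes, tech_stack, sentiment. Use null when unknown."
)

client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment


def fetch_page(url: str) -> str:
    """Download the raw page text for one source."""
    return requests.get(url, timeout=30).text


def extract_record(page_text: str) -> dict:
    """Ask the model to turn unstructured page text into one structured record."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model name
        messages=[
            {"role": "system", "content": SCHEMA_PROMPT},
            {"role": "user", "content": page_text[:20_000]},  # crude truncation
        ],
        response_format={"type": "json_object"},
    )
    return json.loads(resp.choices[0].message.content)


def store(record: dict, db_path: str = "agent_kb.sqlite") -> None:
    """Append the record to a local knowledge-base table (one JSON blob per row)."""
    con = sqlite3.connect(db_path)
    con.execute("CREATE TABLE IF NOT EXISTS capabilities (raw_json TEXT)")
    con.execute("INSERT INTO capabilities VALUES (?)", (json.dumps(record),))
    con.commit()
    con.close()


if __name__ == "__main__":
    # In production this would run on a daily schedule (cron, Zapier, etc.).
    for url in SOURCES:
        store(extract_record(fetch_page(url)))
```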
That's really smart. Yeah, we've played a little bit with trying to use agents to optimize some of the podcast tasks. And I think we have so much experience with AI that we kind of think in workflows, we think in agents, we kind of know what they're capable of. Some people we talk to have almost no experience with AI, other than the one ChatGPT conversation they've had. So even understanding where AI fits into their business is a difficult thing. Where do you start with someone who doesn't have a lot of experience with AI? How do you
explain the benefit to them and how they can get started? What's one thing that they could start with this week? One of the things that often comes up, and this has been the case for some time and isn't really agent-related, is that people underestimate the value of some very basic use cases. We did a survey of Superintelligent users a while back, and the number one use case for AI across the set of enterprise users was
brainstorming, right? Basically making their work better by having ChatGPT act as like a consultant or a thought partner as they were thinking through things, right? And this is going to evolve over time. I think, you know, an interesting analogy is sort of like, imagine the marketing or social media that you guys do for the podcast. You've probably shifted, I would imagine, from doing it totally raw yourself to now like partnering with ChatGPT on some of the copy and using Midjourney for some of the images.
So now it's sort of an AI-assisted process where maybe the time has gone down, but I bet the benefit is more a quality increase and a cognitive-load decrease for you guys. However, I would imagine that over the course of the next year or so, that will go further. A social media agent seems like one of the easiest to actually execute against: how many tweets per day do you want? What do you want them to relate to? How many are replies versus original posts? What's the database of previous messages that I have to pull from? You could see how it comes together pretty quickly, something like the configuration sketched below, and so you're going to see that process play out.
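To illustrate how small that configuration surface actually is, here is a hypothetical sketch of the parameters such a social media agent might take. The field names and defaults are assumptions, not a description of any existing product.

```python
from dataclasses import dataclass, field


@dataclass
class SocialAgentConfig:
    """Hypothetical configuration for the kind of social media agent described above."""
    tweets_per_day: int = 5                        # how many original posts per day
    reply_ratio: float = 0.4                       # fraction of activity that is replies
    topics: list[str] = field(default_factory=lambda: ["AI agents", "enterprise AI"])
    history_db: str = "previous_messages.sqlite"   # past posts to pull style and context from
    voice_notes: str = "conversational, no hashtags"


# Example: a config a podcast account might use (values are made up).
config = SocialAgentConfig(tweets_per_day=3, reply_ratio=0.5, topics=["new episodes"])
print(config)
```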
And so if people are just getting started, just using assistant-level AI to see how it makes their work better, before they worry too much about time savings, is often a really good starting point. Yeah, absolutely. And I've even seen the progression where you have those chats with Claude or ChatGPT to get some input, help with the brainstorming, coming up with titles,
to creating a Claude Project where you can upload a bunch of documents, a bunch of standards and best practices, so you can get more consistent results over time. We've also experimented with the AI editors, and we've yet to find success there, but it's interesting how the chasm between what works today and what's not quite working, what's a little ways off, is shrinking by the day. Have you noticed any tools in your workflow that have really allowed you to
completely offload a process, or are you still a human in the loop a lot of the time for these types of things? So the thing that's closest right now in the Superintelligent world is the automation of this knowledge base around current agent capabilities. When we're doing this agent readiness audit, basically the way that it works is a company will come to us, we'll talk to them,
and then we deploy a voice agent that does this interview, where we've customized the set of questions. We can do that a very small handful of times and just get a high-level overview, or we can deploy it all the way down to the employee level, across hundreds or even thousands of people. You get all of this information, and then, based on the interviews with all those people, we run that through this knowledge base of agentic opportunities that includes what the agent is, what it does,
what use cases are related to it, what industries are related to it, what compliance regimes it fits with, what tech stacks it matters to. So it's not just the two or three vectors that you might imagine; it's a database of, I don't know, 20 or 50 fields or something like that around all this information that we're trying to gather (the sketch below gives a rough sense of one entry and the matching).
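To make that concrete, here is a rough, hypothetical sketch of what one knowledge-base entry and a naive match against an interview summary could look like. The field names and the keyword-overlap scoring are illustrative assumptions, not the actual audit logic.

```python
from dataclasses import dataclass


@dataclass
class AgentOpportunity:
    """One entry in a hypothetical knowledge base of agent capabilities."""
    name: str
    what_it_does: str
    use_cases: list[str]
    industries: list[str]
    compliance_regimes: list[str]
    tech_stacks: list[str]


def match_score(entry: AgentOpportunity, interview_text: str) -> int:
    """Naive relevance score: count how many of the entry's tags appear in the interview."""
    text = interview_text.lower()
    tags = entry.use_cases + entry.industries + entry.compliance_regimes + entry.tech_stacks
    return sum(1 for tag in tags if tag.lower() in text)


# Toy example: rank opportunities against one employee interview summary.
kb = [
    AgentOpportunity("invoice-coder", "Codes invoices against the GL", ["invoice processing"],
                     ["insurance"], ["SOC 2"], ["NetSuite"]),
    AgentOpportunity("support-triage", "Routes inbound support tickets", ["ticket triage"],
                     ["software"], ["GDPR"], ["Zendesk"]),
]
interview = "We spend hours a week on invoice processing in NetSuite; SOC 2 matters to us."
best = max(kb, key=lambda e: match_score(e, interview))
print(best.name)  # -> invoice-coder
```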
And we've really hyper-automated the process of ingesting that information. Now, we still have an additional layer of human interaction. So, for example,
a lot of the subjective information that I get as it relates to agents comes from Twitter/X, people saying, oh, this sucks, oh, this is great. And that's actually quite useful when you're trying to benchmark where a thing is and give a company an expectation of whether they should be using it or not. If, subjectively, half the sentiment on Twitter is that it's great and half the sentiment is that it's bad, you can go in with the appropriate expectations: we're not sure how ready for primetime it is, it might be slightly over-promising, or whatever. So
it's not fully automated, in the sense that there are things that are still much more valuable for the human to do, but it's getting there. And I think that's an important piece. When it comes to the podcast, there's nothing that's fully automated, although last week, when I got sick, I did do an experiment with
a much more automated process. I had pretty much lost my voice, so I took a topic and used Deep Research to write a paper on it. I tried a couple; the one I ended up using was about economic predictions in the era of AGI, basically how AGI is going to impact
the economic landscape. Then I fed that research paper into Google's NotebookLM and let it turn it into a podcast, and that's what I published that day, as an experiment.
It went over reasonably well. It was sort of a cute idea, but I don't think I'll be returning to it too frequently. When I think about where automations might come in the future for the AI Daily Brief, there's probably not much for my show, because so much of it is the context that I add implicitly around things. But news-report
podcasts are going to be very easy to do as an end-to-end pipeline: an automated feed that curates the stories and just turns them into a podcast that gets pushed out. It's a very simple set of steps, each of which requires its own automation, but you could do it really effectively, something like the sketch below.
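As a thought experiment only, here is roughly what that end-to-end pipeline could look like as code. The feed URL, model names, and voice are placeholder assumptions; this is a sketch of the idea, not a production pipeline or anything the show actually runs.

```python
"""Sketch: RSS feed -> LLM-written script -> text-to-speech -> audio file."""
import feedparser            # pip install feedparser
from openai import OpenAI    # pip install openai

client = OpenAI()  # assumes OPENAI_API_KEY is set

FEED_URL = "https://example.com/ai-news.rss"   # placeholder feed


def curate_stories(limit: int = 5) -> list[str]:
    """Pull the day's headlines and summaries from the feed."""
    feed = feedparser.parse(FEED_URL)
    return [f"{e.title}: {e.get('summary', '')}" for e in feed.entries[:limit]]


def write_script(stories: list[str]) -> str:
    """Turn the curated stories into a short narrated script."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model
        messages=[
            {"role": "system", "content": "Write a 3-minute news-roundup podcast script."},
            {"role": "user", "content": "\n".join(stories)},
        ],
    )
    return resp.choices[0].message.content


def narrate(script: str, out_path: str = "episode.mp3") -> None:
    """Synthesize the script to audio with a TTS model."""
    # TTS endpoints typically cap input length, so truncate crudely for the sketch.
    audio = client.audio.speech.create(model="tts-1", voice="alloy", input=script[:4096])
    with open(out_path, "wb") as f:
        f.write(audio.content)


if __name__ == "__main__":
    narrate(write_script(curate_stories()))
```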
Yeah, absolutely. And as a longtime listener, I can tell you that the added personality and perspective always help, beyond just an information dump. I actually wouldn't mind double-clicking on Deep Research, because I've also used it and had positive results. But as you mentioned, on the Twitter vibe test, a lot of people didn't seem to like it. A lot of people did, but it was one of those right-down-the-middle ones.
What's your experience been like with it? Do you think it's a step in the right direction? And even just long-running AI processes in general, do you think that's the future? Yeah, well, I think it's part of the future for sure. I think we're going to have to do a lot of experimenting and iterating to figure out exactly how these things work. My sense is that most of the people who have had positive experiences with Deep Research
have used it for particular types of knowledge summarization that it was well-suited to do, and the people who've had bad experiences have started to figure out the jagged edges where it's not so good. So it's very clear that not having access to contemporary journals is a huge problem; it really limits its ability to be super deep and current when it comes to science or anything that requires access to journals that are behind paywalls.
The other thing that I found is that when it comes to really fast-moving spaces,
it can be a challenge. For example, this AGI piece that I did was mostly great; however, it was definitely over-reliant on Nick Bostrom's Superintelligence as a resource. And I think at one point it said that most scientists still think AGI is a decade or two away, which is obviously so not the current conversation; it's not reading Twitter, let's put it that way. So I think we're just going to figure out that
basically, research, grabbing a bunch of sources and turning them into a consolidated bucket of knowledge, is actually a very diverse use case. It's not one use case; it's about a thousand use cases embedded in one category of use case, and it's going to take some time for us to figure out which pieces of it this particular tool is actually good at. I've enjoyed Deep Research so far. I think there are limitations. I think
what's also weird about some of the limitations is that when it hallucinates, it's hard to know that it actually hallucinated. In these areas where I'm not an expert, it could just say something and then cite the reference, and I'm like, oh yeah, that's true, because it read the thing. So it's much harder to spot these hallucinations. I feel like hallucinations are still an issue in the world of AI, and something that we're still trying to solve.
How big of an issue are hallucinations, do you think? And is that a primary complaint that you see with businesses? Yeah, it's actually a much bigger deal for businesses than it is for consumers. I think consumers have a higher threshold for what they can deal with, especially because so much of the Deep Research use case
isn't trying to get something that's production-ready; it's trying to get to a thing that's 80 or 90 percent there. So one of the use cases that I've seen a number of people have success with is basically background market descriptions and sizing for their startups. They're trying to communicate and understand how big the total addressable market for the thing that they're building is.
And it's really good at pulling a bunch of different resources in, and so on. But they're never going to just turn that over to an investor; at least if they're actually a good entrepreneur, they're not going to just turn it over to an investor. But it saves them a huge amount of time. Like I said, it gets them 80% of the way there.
And so they're maybe more in a position to actually spot those hallucinations. Where hallucinations become a real problem is when people are basically replacing a human information source with an AI or agent information source that really relies on having the right information. So an insurance company that we work with
found that the threshold of tolerance people had for a human agent being wrong when giving them information was something like 5% or 7% of the time. Whereas with a robot giving them that information, it was less than 1%. People expect it to be absolutely perfect.
And in certain cases, if you're in a highly regulated industry, the barrier is extra high, because you can't give bad advice; think medicine, insurance, anything like that, and the threshold is just extremely high. So hallucination is one of those weird things that's often just kind of funny and silly when it comes to consumers, but is a major detriment to how far into deployment and production certain use cases can get for big enterprises. Yeah.
Absolutely. I've tried to teach people to view it like Wikipedia: use it to get started, but it's not something you can put as a reference in your paper. In regards to the hallucinations, a lot of people try to solve this with evals, or just build a robust enough eval set that they're able to mitigate some of the risks of hallucinations, along the lines of the small harness sketched below.
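For a sense of what a bare-bones version of that looks like, here is a hypothetical sketch of a tiny eval set with an exact-substring grader. Real eval suites are far richer (LLM-as-judge, citation checks, and so on), and the questions, model name, and pass threshold here are all assumptions.

```python
"""Minimal hallucination-style eval: ask known-answer questions, check the answers."""
from openai import OpenAI

client = OpenAI()  # assumes OPENAI_API_KEY is set

# Tiny eval set: (prompt, substring that must appear in a correct answer).
EVAL_SET = [
    ("What year was the transistor invented at Bell Labs?", "1947"),
    ("Who wrote the book 'Superintelligence: Paths, Dangers, Strategies'?", "Bostrom"),
    ("What does TAM stand for in a startup pitch?", "total addressable market"),
]


def ask(prompt: str) -> str:
    """Query the model under test with a single prompt."""
    resp = client.chat.completions.create(
        model="gpt-4o-mini",  # placeholder model under test
        messages=[{"role": "user", "content": prompt}],
    )
    return resp.choices[0].message.content


def run_evals(pass_threshold: float = 0.9) -> bool:
    """Return True if the model clears the (assumed) accuracy bar on this set."""
    correct = 0
    for prompt, expected in EVAL_SET:
        answer = ask(prompt)
        hit = expected.lower() in answer.lower()
        correct += hit
        print(f"{'PASS' if hit else 'FAIL'}: {prompt}")
    score = correct / len(EVAL_SET)
    print(f"score = {score:.0%}")
    return score >= pass_threshold


if __name__ == "__main__":
    run_evals()
```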
Do you find businesses are implementing any other types of strategies? Are they even following through with evals, or just kind of YOLOing it? What's the vibe in the business community? I think that evals are still actually underutilized. I think that a lot of companies
are just now coming to understand the full stack of things they have to do to actually implement these tools. And one of the things that can be frustrating is when you realize you can't just build the thing; you have to build this other set of infrastructure to support the thing, to make the thing work. And there's often a resistance to that. I mean, all of the custom build shops that are either building custom agents or helping deploy things
always complain about how the budgets end when you get to evals, and clients don't want to put those into practice. So even getting evals in there is a struggle, and to your point, actually using them is a whole different issue. I think the way this will be most addressed in the short term, which is kind of good for a number of reasons, is this:
I think that you're going to see humans in the loop for much longer than they theoretically need to be, to help solve and spot this. And human in the loop, in addition to being a solve for technical problems of AI,
is also, I think, a transitional tool to slow down the rate of full task and work replacement that AI could possibly do. It creates a mechanism for continued human presence, even in areas that are being highly automated,
which doesn't solve all the issues around job replacement and things like that. But I think we will over-index on things like that even more than necessary, because society is going to have to find ways to slow down what AI could do from a replacement perspective. Yeah, that makes a lot of sense. I do share the sentiment that evals are very underutilized. It's kind of remarkable to me how many, even
startups, aren't using very many evals. I feel like probably less than 10% of startups are using true eval suites. That kind of brings me to my next question: what mistakes do you see most often when businesses try to implement AI? There's obviously hallucination, maybe overengineering, but I'm curious what
are the primary mistakes you see when businesses are trying to implement agents? There's a lot of things. So if you look at what companies view as their biggest challenges with AI right now, broadly speaking, the three that tend to come up are: one, data readiness and whatever complex set of things that means, like, is their data all in the same place, is it ready to be used; there's a huge industry dealing with just that problem.
The second is everything surrounding privacy, cybersecurity, that whole set of issues. And the third is employee adoption and utilization. And usually it's roughly in that order. One of the things that you see constantly is that
Company X will have 10,000 Microsoft Copilot subscriptions, but only 33% of them are being used, or something like that. And there's just no real support infrastructure, no enablement infrastructure,
surrounding that sort of utilization. And I think that's just a market gap that is starting to be filled. I mean, this is the space that Super plays in, obviously. But it just needs more people building more things in that space that can support adoption and implementation. So that's a big challenge. I think when it comes to agents, what we're going to see is a misalignment of expectations. People are going
to imagine that they can do more than they can to start. And I think they'll try to create these very complex systems right out of the gate that won't quite work. Yeah, there's probably a bunch more that I can think of as well. I wouldn't mind diving into the security aspect a little bit. We're familiar with one open-source project, CodeGate, which acts as a local proxy that your LLM requests route through so it can redact PII and things like that; the sketch below shows the general client-side pattern.
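The general client-side pattern is simply pointing your existing SDK at the proxy instead of the vendor endpoint. The sketch below assumes a hypothetical OpenAI-compatible proxy listening on localhost; the address, port, and model name are placeholders, so check CodeGate's own documentation for its actual setup.

```python
from openai import OpenAI

# Instead of calling the vendor directly, route requests through a local proxy
# (e.g., a PII/secret-redacting gateway) that speaks the OpenAI-compatible API.
# The URL below is a placeholder; use whatever address your proxy actually exposes.
client = OpenAI(
    base_url="http://localhost:8080/v1",  # hypothetical local proxy endpoint
    api_key="not-used-by-local-proxy",    # some proxies ignore or re-inject the real key
)

resp = client.chat.completions.create(
    model="gpt-4o-mini",  # placeholder; the proxy forwards to the real provider
    messages=[{"role": "user", "content": "Summarize our Q3 churn report."}],
)
print(resp.choices[0].message.content)
```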
But it just seems to be getting started. Do you have any tools or advice for companies that are concerned about security when bringing LLMs into their workflow?
At any given time when anyone is listening to this, I think it's really worthwhile to go try to do an audit of what tools are available in that space because this is so clearly such a massively juicy problem that there are startups now, there are going to be more startups in three months, there's going to be many more startups in six months. They're all going to be taking different approaches to this or slightly different approaches to this.
There's a full spectrum of options of available support, to the extent that people want to solve that now. There are companies that will come in and build things completely on-premise. I mean, there's just a ton of options. I think one thing that's interesting that's happening with this technology shift, and that maybe didn't happen in quite the same way in previous eras, is that
companies are building a lot more, rather than just buying off the shelf, than they have in the past. So Menlo did a study, their enterprise adoption study, and between '23 and '24 there was a huge shift in the rate of build versus buy. In '23, it was something like 80% buy, 20% build. And then last year, it was
53% buy, 47% build. So, I mean, a huge, huge shift. Now, I think this boomerangs back. I think what that reflects is verticalized solutions and verticalized agents not quite being ready for prime time yet. And so companies that are in those verticals, or that are thinking about those functions,
are seeing the opportunity and racing to build it, because there are all these frameworks available to them. But what I think will happen naturally is that winners will emerge in the category they had started to build in, and then they'll naturally shift back over to whatever the market leader is.
But it does create this whole interesting dynamic. And because of that, I think one of the ways people are more emboldened to solve some of these issues is that, if they can't get there with the available security profiles of third-party vendors, there are more options than perhaps in the past for rolling your own solution that has the highest level of security you can have.