People
Accelerate Harder
Cocktail Peanut
Professor Ethan Mollick
Podcast Host
Podcast host, focused on English learning and discussion of financial topics; organized an English learning camp and explored the relationship between Bitcoin and the US dollar in depth.
Mark Zuckerberg
Topics
  • Mark Zuckerberg: Meta's Llama models have passed 1 billion downloads, showing the enormous potential of AI models.
  • Professor Ethan Mollick: Questions the Llama download figure, noting that you don't need to re-download a model every time you use it. Also points out that while AI agents are making rapid progress on long tasks, their reliability still needs improvement; a 50% success rate is not enough for enterprise use.
  • Cocktail Peanut: Believes he contributed roughly one million of those Llama downloads.
  • Accelerate Harder: Suggests the 1 billion figure may be an artifact of how Hugging Face counts downloads.
  • Podcast host: Walks through the research on exponential growth in AI agent capability and analyzes its implications for enterprise strategy, emphasizing the concept of exponential growth and the importance of companies setting an AI agent strategy soon. Also discusses applying Moore's Law to AI and some limitations of the research, such as the 50% success threshold and uncertainty about future trends.
  • Meryl Lutzky: Graphite's coding tool focuses on making code suggestions based on developer comments, compiling code summaries, and generating fixes for code failures; revenue grew 20x in 2024.
  • Amy Deng: Believes in exponential progress in AI capability and expects a full day's work to be automatable by the end of 2027.
  • Lawrence Chan: Choosing the right success-rate threshold is critical to accurately assessing AI agent capability.
  • Joshua Gans: Questions whether the trend of AI capability improvement will hold over the long term.
  • Robin Hanson: If the trend continues, AI will be able to complete year-long projects within eight years.

Deep Dive

Chapters
Meta's Llama models surpass 1 billion downloads, sparking debate and anticipation for the upcoming LlamaCon and Llama 4. The sheer number of downloads raises questions about the methodology, but it highlights significant interest in Llama models.
  • Meta's Llama models reach 1 billion downloads.
  • Debate about download count methodology.
  • Anticipation for LlamaCon and Llama 4 release.

Shownotes Transcript


Welcome back to the AI Daily Brief Headlines Edition, all the daily AI news you need in around five minutes.

Mark Zuckerberg says that Meta's Llama models have hit a billion downloads. That is a big increase from last December when the company claimed 650 million downloads. Now, for some sense of comparison, TikTok has been downloaded over 5 billion times and Roblox has over a billion downloads for mobile users. But again, those are both popular consumer apps rather than open source AI models. Still, to some, something doesn't quite add up. Professor Ethan Mollick writes,

sort of confused on how Llama could have been downloaded a billion times. You know you can just keep using the model and don't have to download it each time if you want to use it. Cocktail Peanut reposted that and said, I think I contributed to like one million of those. Accelerate Harder, meanwhile, writes, this has got to be by counting the Hugging Face download numbers, which are notoriously insane, right? What other possible explanation could there be for a billion Llama downloads?

One copy each for one out of eight humans? How many GPUs even exist on Earth? I think from my perspective, we can just give Zuck the win here and move on. There is clearly a lot of interest and a lot of deserved interest in the Llama family of models. And if we zoom that forward, there seem to be some exciting things coming. Meta is currently gearing up for their inaugural LlamaCon taking place at the end of April. It's been widely rumored that the event will feature the release of the Llama 4 model family, which will be natively multimodal and optimized to power agents.

Google is giving Gemini an upgrade, making it more feature-complete with a Canvas interface and a version of Notebook LM audio overviews. The new interface option is similar to ChatGPT's identically named Canvas tool and to Anthropic's Artifacts. As a total aside, we're starting to see some feature naming consistency. Google, OpenAI, and Perplexity all have a deep research feature, with Grok calling it deep search. And I kind of think that that's better for users than everyone trying to pretend they're somehow fundamentally different.

So maybe these are all supposed to just be called Canvas. My point being that multiple people calling it Canvas might not be a lack of creativity, it might actually be a user-friendly move. Anyways, what this offers is a new interactive space for collaboration with Gemini, allowing users to go back and forth with the AI on revisions to writing and coding projects. And indeed, whatever they call it, this interface style is starting to become a default feature for AI chatbots.

If you've used both the before and after, it makes a huge difference to eliminate much of the copying and pasting and switching windows and manual updating. The interface also allows native execution of code for quick testing. Now, porting over the audio overviews feature of Notebook LM is an interesting choice from Google and one that makes a lot of sense. The tool went viral last year as users experimented with the ability to generate a podcast on any topic.

It felt like a natural fit within Notebook LM's research focus, but it also has a much wider set of possibilities, and it's likely that those other types of use cases might more come to the fore now that it's embedded inside Gemini. It also means that you can now use Gemini to generate a deep research report and immediately spin up a podcast to digest it. Google is definitely surging to bring all of these experiences natively into Gemini in a big way.

Lastly today, an update from the white-hot coding assistance sector. AI startup Graphite has announced $52 million in Series B funding and a doubling down on their coding tool. Graphite was founded back in 2020, a million trillion years ago, and started life as a mobile development tool company. They pivoted to code review shortly afterwards and have since built out AI tooling, largely based on their solution to internal pain points.

Co-founder Meryl Lutzky said,

So how does this differ from more general assistants like Cursor? Well, Graphite is basically a little bit more focused. It can make code suggestions based on developer comments, compile code summaries, and generate fixes for code failures. Their new tool called Diamond will be focused on automating bug hunting and will be offered as a standalone product. Graphite's platform also allows customers to define their own codebase-specific patterns and filter sensitive information. Whatever they're doing, it seems to be working, as Lutzky said that revenue grew 20x in 2024.

So no signs of slowing down in this particular sector. And we haven't even gotten into the big $1 million no-code hackathon that's coming down the pipeline now. However, that is going to have to wait for another episode. For now, that is going to do it for today's headlines. Let's shift over to what is some unbelievably interesting research about the speed at which agents are getting better in the main episode.

Today's episode is brought to you by Super Intelligent and more specifically, Super's Agent Readiness Audits. If you've been listening for a while, you have probably heard me talk about this. But basically, the idea of the Agent Readiness Audit is that this is a system that we've created to help you benchmark and map opportunities in your organizations where agents could

specifically help you solve your problems, create new opportunities in a way that, again, is completely customized to you. When you do one of these audits, what you're going to do is a voice-based agent interview where we work with some number of your leadership and employees to map what's going on inside the organization and to figure out where you are in your agent journey.

That's going to produce an agent readiness score that comes with a deep set of explanations, strengths, weaknesses, key findings, and of course, a set of very specific recommendations that then we have the ability to help you go find the right partners to actually fulfill. So if you are looking for a way to jumpstart your agent strategy, send us an email at agent at besuper.ai, and let's get you plugged into the agentic era.

Today's episode is brought to you by Vanta. Trust isn't just earned, it's demanded.

Whether you're a startup founder navigating your first audit or a seasoned security professional scaling your GRC program, proving your commitment to security has never been more critical or more complex. That's where Vanta comes in. Businesses use Vanta to establish trust by automating compliance needs across over 35 frameworks like SOC 2 and ISO 27001, centralize security workflows, complete questionnaires up to 5x faster, and proactively manage vendor risk.

Vanta can help you start or scale up your security program by connecting you with auditors and experts to conduct your audit and set up your security program quickly. Plus, with automation and AI throughout the platform, Vanta gives you time back so you can focus on building your company. Join over 9,000 global companies like Atlassian, Quora, and Factory who use Vanta to manage risk and prove security in real time.

For a limited time, this audience gets $1,000 off Vanta at vanta.com slash NLW. That's V-A-N-T-A dot com slash NLW for $1,000 off. Today we have a super interesting conversation.

We're talking about this research that just came out that has generated a lot of chatter and that's basically arguing for a Moore's Law for AI agents, basically a way to think about how fast the capabilities of agents are improving. And the people behind the research not only have some interesting results,

but also just a very interesting framing for the entire problem. Now, of course, why this matters is that right now, we are in the midst of this agentic transformation, one which I believe will basically lead to a huge portion of today's knowledge work tasks being done by agents eventually. And what everyone is trying to figure out, especially the companies that are out there trying to buy and pilot their first agents,

is just how capable are they? What specific types of things can they do? And based on that, how to integrate them into today's existing workflows. But lurking behind all of that is this knowledge that they're improving at such a fast rate that everything that we do today to design new systems around them may be nullified in just a few months when they are more capable.

And so not only are enterprises and companies trying to adapt to the agent capabilities of right now, they're also trying to plan for a future which is on the one hand unknowable and at the same time totally inevitable. So that's the setup and the context for this. But before we talk about Moore's Law for AI agents, let's talk about Moore's Law. I asked Grok to explain it in a fun, easy-to-understand way, and its response was unbelievably, unfathomably cringe.

It tried to compare it to a video game where your character's strength keeps leveling up without you, quote, grinding for extra coins. It compared it to a magical candy store where every 18 months, the shopkeeper doubles the amount of candy you can get for the same price. But basically what this actually refers to is that Intel co-founder Gordon Moore noticed way, way back in the 60s that the number of transistors on a computer chip was roughly doubling at a pretty consistent pace. Basically every couple of years, the capabilities were doubling while the price was staying the same.

And so now anytime that there is a consistent or seemingly consistent pace of change in technology, we of course have to compare it to Moore's Law.

Anyways, let's talk about this specific paper. It comes from METR, a nonprofit organization based in Berkeley that published a paper called Measuring AI Ability to Complete Long Tasks. They created a set of 170 real-world tasks, including coding, cybersecurity, general reasoning, and machine learning, and from there established a human baseline by determining how long it would take an expert programmer to complete each task.

They called this the, quote, task completion time horizon, and the logic was essentially that the time taken to complete a task by a human expert is a good proxy for how difficult the task is. A selection of models were given control of a coding agent and put through their paces on the task list. The idea was to test where each model would fall below a 50% success rate.
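
To make the method concrete, here is a minimal sketch of one way a 50% time horizon could be estimated: fit a logistic curve of agent success against the log of the human completion time, then read off where predicted success crosses 50%. The task data and the specific fitting choices below are illustrative assumptions on my part, not METR's actual code or data.

import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical results for one model: (human minutes to complete, agent succeeded)
tasks = [(0.1, 1), (0.5, 1), (1, 1), (2, 1), (5, 1), (10, 0),
         (15, 1), (30, 0), (60, 0), (120, 0), (240, 0), (480, 0)]

X = np.log2([t for t, _ in tasks]).reshape(-1, 1)  # log of task length
y = np.array([s for _, s in tasks])                # 1 = success, 0 = failure

clf = LogisticRegression().fit(X, y)

# P(success) = sigmoid(w * log2(t) + b), which equals 0.5 where w * log2(t) + b = 0
w, b = clf.coef_[0][0], clf.intercept_[0]
horizon_50 = 2 ** (-b / w)
print(f"Estimated 50% time horizon: {horizon_50:.1f} human-minutes")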

Researchers tested models dating back to OpenAI's GPT-2 up to Anthropic's Claude 3.7 Sonnet, so very contemporary. Their results show a remarkably consistent pace of advancement. And this is where the comparison comes from. They write, we find a kind of Moore's Law for AI agents. The length of tasks that AI can do is doubling about every seven months.

To put some numbers around it, GPT-2, which was released in 2019, could complete a task that would take an expert programmer around two seconds, but start failing at anything more complicated. By the time you get up to GPT-4, released in 2023, AI could nail tasks that a human programmer would spend four minutes on. Zooming ahead, researchers found that Claude 3.7 Sonnet could complete tasks that take around an hour with 50% accuracy.
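
As a quick back-of-the-envelope check on those figures (my own arithmetic, not the paper's), a seven-month doubling time starting from a roughly two-second horizon in early 2019 lands close to the numbers quoted above:

def projected_horizon_seconds(months_elapsed, start_seconds=2, doubling_months=7):
    # Exponential growth: the horizon doubles once every doubling_months
    return start_seconds * 2 ** (months_elapsed / doubling_months)

print(projected_horizon_seconds(48) / 60)    # early 2019 -> 2023 (GPT-4 era): ~4 minutes
print(projected_horizon_seconds(72) / 3600)  # early 2019 -> early 2025 (Claude 3.7): ~0.7 hours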

Now, if you're watching this video, you'll note that this exponential curve is plotted as a straight line with a logarithmic scale: 1 second, 4 seconds, 15 seconds, 1 minute. But if you look on a linear scale, you can see just how much more dramatic and exponential the growth curve is. The researchers actually also tested OpenAI's o3-mini and DeepSeek R1, but found that they were less performant than Sonnet 3.7, and so decided to drop them from the data.

To verify the trend, the researchers ran a similar test using questions from the standard coding benchmark SWE-bench. They found consistent results dating back to the release of GPT-4, with a doubling in capability every 70 days.

The uncertainty level associated with these tasks is pretty large, but the researchers commented, even if the absolute measurements are off by a factor of 10, the trend predicts that in under a decade, we will see AI agents that can independently complete a large fraction of software tasks that currently take humans days or weeks. Separating out just the more recent models, the researchers also found that the pace of improvement has increased. For models created since last year, the doublings in capability are occurring every three months.

In a post summarizing their conclusions, the researchers wrote, We are fairly confident of the rough trend of one to four doublings in horizon length per year. That's fast. Measures like these help make the notion of degrees of autonomy more concrete and let us quantify when AI abilities may rise above specific useful or dangerous thresholds. So as I said, this generated a ton of chatter. It's been seen 4 million times and has about a thousand people who have reposted it or commented on it. For many, this was the concrete data they needed to start feeling the AGI.

Researcher Amy Deng wrote, I didn't believe in exponential AI progress before working on this paper, but I believed in statistics, our methodology, and a straight line on a log scale graph. Now I live and breathe the fact that day-long work will be automatable by end of 2027, and AGI is coming.

Professor Ethan Mollick quibbled with the methodology but acknowledged the result is very significant, posting, A new paper shows that AI agents are improving rapidly at long tasks, but they aren't reliable yet. That being said, this feels significant. More than 80% of success runs cost less than 10% of what it would cost for a human level 4 software engineer to perform the same task. Ethan's specific gripe is that the threshold for success was only a 50% completion rate, which is not going to stand up to enterprise use cases.

The researchers actually addressed this in the paper, choosing a 50% success rate because it was the most robust to small variations in the data.
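
To make that concrete, here is a hypothetical illustration of how the choice of threshold moves the horizon. Given a logistic fit of success probability against log2 of task length, the horizon at any reliability level p is simply where the fitted curve crosses p instead of 0.5. The coefficients below are made up for the example, not taken from the paper.

from math import log

def horizon_at(p, w=-0.8, b=2.7):
    # Success probability = sigmoid(w * log2(t) + b); solve w * log2(t) + b = logit(p)
    logit_p = log(p / (1 - p))
    return 2 ** ((logit_p - b) / w)

print(f"50% horizon: {horizon_at(0.5):.1f} human-minutes")
print(f"80% horizon: {horizon_at(0.8):.1f} human-minutes")  # demanding more reliability shrinks the horizon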

Co-author Lawrence Chan commented, If you pick very low or very high thresholds, removing or adding a single successful or single failed task respectively changes your estimates a lot. In further testing, the researchers found that increasing the reliability threshold from 50% to 80% reduces the average time horizon by a factor of 5, but the pace of doubling in the trend remained very similar. Point being that the paper, ultimately, isn't really trying to pinpoint how good agents are at the moment.

Instead, it's trying to measure the trend of improvement. And that's immediately what stood out to me. I don't think the specific finding of the time that agents can work is all that useful. I think what's useful here, especially from a very practical standpoint for companies that are trying to figure out what their agent strategy is going to be, is that we're seeing a doubling of that capability at the longest every seven months. And now it seems more like every three months.

That means that by the time you next report quarterly results, the capabilities of the agents that you are not yet working with will have doubled. Two quarters from now, the agents that you haven't hired yet will be four times more capable, and so on and so forth, if this, of course, holds up.

Now, what about the concern that traditional coding benchmarks are basically saturated and useless for measuring further improvement from the current state of the art? The researchers actually commented that they, quote, think these results help resolve the apparent contradiction between superhuman performance on many benchmarks and the common empirical observations that models do not seem to be robustly helpful in automating parts of people's day-to-day work.

The best current models, such as Claude 3.7 Sonnet, are capable of some tasks that may take even expert humans hours, but can only reliably complete tasks of up to a few minutes long. Joshua Gans, a management professor at the University of Toronto who has written about the economics of AI, questioned whether it's correct to assume this trend will hold. He commented, "Extrapolations are tempting to do, but there is still so much we don't know about how AI will actually be used for these to be meaningful." The researchers themselves questioned how long the trend is likely to hold.

Moore's Law held for a doubling of the number of transistors on a leading computer chip for over four decades from the 1970s. However, the trend slowed in the early 2010s as chip designers ran up against physical limitations having to do with atomic structure. This was coupled with the chipmaking industry focusing on power efficiency over raw power.

The researchers made a comparison to the constraints on AI, namely the limits to compute. Basically, the point being that the researchers here are at pains to simply present the data they found, not to over-extrapolate what it might mean or how long it might continue. They, like us, are unsure about how this is going to play out. Then again, they also point out that advances in multi-agent systems...

improvements in agentic training, and more efficient training algorithms could all help bolster the trend.

And while the normal temptation when we get new research like this, as you can see in all the people that Nature asked to comment on the piece, is to try to poke holes in it and caution about why it might be overly optimistic, it is also worth, I think at this point, zooming out and thinking on the other side. What if the trend holds? Scientist Robin Hanson wrote, so around eight years till they can do year-long projects? The implied point, of course, is that even if we only get a fraction of that, that is a civilization-changing trend.

Next up, the researchers are going to explore how pairing an AI agent with a human worker compares to a human worker alone, which should be really interesting as well. For now, though, if you take nothing else away from this, if you disbelieve in the long-term trend, if you question the efficacy of agents right now, it still appears pretty clear that the capabilities of which you are skeptical are improving at an extraordinary rate.

Humans are historically unbelievably bad at thinking in terms of exponentials. It is just very hard for us to actually mentally get ourselves to a place where we can zoom out and understand that pace of change. We live and grow and learn in linear timelines. We are not wired for the exponential. And yet it appears that exponential is what we have here. Not for nothing, if you have not started to figure out your AI agent strategy yet, well, friends, the best time was yesterday, but the second best time is today.

For now, that's going to do it for today's AI Daily Brief. Appreciate you listening or watching as always. And until next time, peace.