People
Professor Graham Neubig
Host
A podcast host and content creator focused on electric vehicles and energy.
Audience
Not enough information to build a profile
Topics
Host: 2024 was a year of major progress for LLM agents. Many companies are focusing on agent technology, numerous successful applications have emerged, and some companies have raised large funding rounds and achieved significant market positions.

Professor Graham Neubig: Building agents is a highly challenging task. It requires thinking about the agent's interfaces to computers and to humans, choosing a suitable language model, designing effective planning strategies, identifying and codifying common workflows, and evaluating agents effectively. On the agent-computer interface, he introduced the OpenHands system, which improves efficiency by giving the agent a small set of tools. On human-agent interaction, he emphasized the importance of information presentation and integration into existing environments. On model choice, he argued that Claude stands out on instruction following, tool use, environment understanding, and error recovery. On planning, he compared pre-curated versus dynamically generated plans, and explicit versus implicit structure. On workflows, he introduced research such as SteP and Agent Workflow Memory. On exploration, he introduced work such as Agentless and BAGEL. On evaluation, he covered benchmarks including MiniWoB, Aider, WebArena, and SWE-Bench. Finally, he predicted future trends in agent technology, including better model performance, stronger instruction following, better error correction, and improved evaluation methods.

Audience: Asked questions about agent performance, how web agents interact with websites, agent architecture design, the Model Context Protocol (MCP), and agent authentication.

Professor Graham Neubig: Answered the audience's questions in detail and shared his views. He explained why agents' real-world success rates are lower than their SWE-Bench results, described three ways web agents interact with websites, and discussed task-specific agents versus a single agent handling multiple tasks. He also explained why OpenHands has not adopted Anthropic's MCP, and shared his thoughts on agent authentication.

Deep Dive

Key Insights

What is the current state of LLM agents in 2024 according to the keynote?

LLM agents have made significant progress in 2024, with OpenHands (formerly OpenDevon) leading the SWE-Bench Full leaderboard at 29%. However, on the SWE-Bench Verified leaderboard, they are at 53%, behind Amazon Q, Devlo, and OpenAI's self-reported O3 results at 71.7%. Major players like OpenAI, DeepMind, and Anthropic are focusing on consumer and coding agents, vision-based computer-using agents, and multi-agent systems. Notable advancements include Cognition AI's Devin, Cursor Composer, Codeium's Windsurf Cascade, and the growth of customer support agents like Sierra ($4 billion valuation) and search agents like Perplexity ($9 billion valuation).

What are the key tools provided to agents in OpenHands for interacting with computers?

OpenHands provides agents with roughly five key tools: program execution via bash and via Jupyter notebook cells (two tools), file editing (browsing and overwriting parts of files), global search and replace, and web browsing (which itself bundles scrolling, text input, and clicking). These tools allow agents to execute code, edit files, and interact with web pages effectively.

Why is Claude considered the best language model for agents in 2024?

Claude is considered the best language model for agents due to its strong instruction-following ability, tool use, coding proficiency, environment understanding, and error recovery. Unlike GPT-4o, which often gets stuck in loops, Claude is adept at trying alternative approaches when errors occur. Evaluations show Claude outperforms other models like GPT-4o, Llama 3.1 405B, and DeepSeek 2.5 in agentic tasks.

What are the challenges in designing human-agent interfaces?

Designing human-agent interfaces is challenging because it requires presenting enough information to users in a clear and concise manner. OpenHands uses a chat window to show agent actions in English descriptions, allowing users to explore detailed results if needed. The goal is to integrate agents into existing workflows, such as GitHub plugins for issue resolution, while ensuring the interface is intuitive and informative.

What are the predictions for the future of agent-oriented LLMs?

By mid-2025, every major language model trainer will focus on improving models for agentic tasks. Competition will increase, prices will drop, and smaller models will become competitive. Instruction-following abilities in agentic contexts will improve, reducing the need for manual engineering of workflows. Error correction will also improve, reducing instances of agents getting stuck in loops. Benchmarks like SWE-Bench and WebArena will become more challenging as agents improve.

How does OpenHands handle planning and workflow for agents?

OpenHands uses light planning with a single prompt rather than multi-agent systems. Agents follow curated workflows, such as reproducing issues, writing tests, fixing bugs, and verifying fixes. This approach allows flexibility when plans deviate, as agents can adapt without getting stuck. The blog 'Don't Sleep on Single Agent Systems' argues that strong instruction-following agents can handle deviations better than rigid multi-agent systems.

What are the key evaluation benchmarks for agents in 2024?

Key benchmarks include fast sanity checks like miniWoB (web navigation) and Aider (code editing), as well as highly realistic evaluations like WebArena (real open-source websites) and SWE-Bench (real-world GitHub pull requests). These benchmarks test agents' abilities in web navigation, code editing, and issue resolution. However, there is still a need for benchmarks that test agents' versatility in combining coding and web navigation tasks.

What are the challenges in expanding agent use beyond programming?

Expanding agent use beyond programming requires making agents accessible to non-programmers, such as lawyers or chemists. This involves designing intuitive interfaces and workflows that allow users to interact with agents naturally. Additionally, existing systems may need to be redesigned to support agent interactions, such as providing APIs for websites to improve agent accuracy and efficiency.

How does OpenHands address the issue of self-improving agents?

OpenHands uses a concept called Agent Workflow Memory, where successful workflows are stored and reused. When an agent completes a task successfully, it breaks down the task into individual workflows and adds them to the prompt for future tasks. This self-learning approach has shown a 22.5% improvement on WebArena after 40 examples, demonstrating the potential for agents to learn from past successes.

What are the challenges in agent authentication and how is OpenHands addressing them?

Agent authentication is challenging because many systems lack fine-grained control over permissions. OpenHands uses GitHub's fine-grained authentication tokens, which allow specific permissions for different repositories and actions. This approach, developed for human developers and GitHub apps, provides a template for secure agent interactions. However, broader adoption of such systems is needed to prepare the world for widespread agent use.

Chapters
This chapter sets the stage by exploring the capabilities of a highly competent human equipped with basic computing tools (web browser, terminal, file system, text/code editor). It then introduces Professor Graham Neubig and his work on building agents to leverage these tools.
  • Exploration of the capabilities of basic computing tools.
  • Introduction of Professor Graham Neubig and his work on AI agents.

Shownotes Transcript


We're back at Latent Space Live, our first mini-conference held at NeurIPS 2024 in Vancouver. This is Charlie, your AI co-host. As a special treat this week, we're recapping the best of 2024 going domain by domain. We sent out a survey to the over 900 of you who told us what you wanted and then invited the best speakers in the Latent Space Network to cover each field. 200 of you joined us in person throughout the day with over 2200 watching live online.

Our next keynote covers the state of LLM agents, with the triumphant return of Professor Graham Neubig of CMU and OpenDevin, now a startup known as All Hands.

The renamed OpenHands has done extremely well this year, ending the year sitting comfortably at number one on the hardest SWE-Bench Full leaderboard at 29%, though on the smaller SWE-Bench Verified they are at 53%, behind Amazon Q, Devlo, and OpenAI's self-reported o3 results at 71.7%.

Many are saying that 2025 is going to be the year of agents, with OpenAI, DeepMind, and Anthropic setting their sights on consumer and coding agents, vision-based computer-using agents, and multi-agent systems.

There has been so much progress on the practical reliability and applications of agents in all domains, from the huge launch of Cognition AI's Devin this year, to the sleeper hit of Cursor Composer and recent guest Codeium's Windsurf Cascade in the IDE arena,

to the explosive revenue growth of recent guest StackBlitz's Bolt, Lovable, and Vercel's V0, and the unicorn rounds and high-profile movements of customer support agents like Sierra, now worth $4 billion, and search agents like Perplexity, now worth $9 billion. We wanted to take a little step back to understand the most notable papers of the year in agents, and Graham indulged with his list of eight perennial problems in building agents.

As always, don't forget to check our show notes for all the selected best papers of 2024 and for the YouTube link to their talk. Graham's slides were especially popular online and we are honoured to have him.

Watch out and take care. Okay. Hi, everyone. So I was given the task of talking about agents in 2024, and this is an impossible task because there are so many agents, so many agents in 2024. So this is going to be strongly covered by my personal experience and what I think is interesting and important, but I think it's an important topic. So let's go ahead.

So the first thing I'd like to think about is, let's say I gave you, you know, a highly competent human some tools. Let's say I gave you a web browser and a terminal or a file system and the ability to edit text or code. What could you do with that?

Everything. Yeah, probably a lot of things. This is like 99% of my daily life, I guess, when I'm working. So I think this is a pretty powerful tool set. And what I am trying to do and what I think some other people are trying to do is come up with agents that are able to manipulate these things, web browsing, coding, running code in successful ways.

So there was a little bit about my profile. I'm a professor at CMU, chief scientist at All Hands AI, building open source coding agents. I'm maintainer of Open Hands, which is an open source coding agent framework. And I'm also a software developer. And I like doing lots of coding and, you know, shipping new features and stuff like this. So building agents that help me to do this, you know, is kind of an interesting thing very close to me.

So the first thing I'd like to do is I'd like to try some things that I haven't actually tried before. If anybody has, you know, tried to give a live demo, you know, this is very, very scary whenever you do it and it might not work. So it might not work this time either, but I want to show you like three things that

I typically do with coding agents in my everyday work. I use coding agents maybe five to 10 times a day to help me solve my own problems. And so this is a first one. This is a data science task, which says I want to create scatter plots that show the increase of the SWE bench score over time. And so I wrote a kind of concrete prompt about this. Agents work better with like somewhat concrete prompts. And I'm going to throw this into open hands and let it work.

And I'll go back to that in a second. Another thing that I do is I create new software. And I've been using a service, a particular service, I won't name it, for sending emails, and I'm not very happy with it. So I want to switch over to this new service called Resend.com, which makes it easier to send emails.

And so I'm going to ask it to read the docs for the Resend.com API and come up with a script that allows me to send emails. The input to the script should be a CSV file, and the subject and body should be provided in Jinja 2 templates. So I'll start another agent and try to get it to do that for me. And let's go with the last one. The last one I do is improving existing software.

And in order, you know, once you write software, you usually don't throw it away. You go in and like actually improve it iteratively. This software that I have is something I created without writing any code. It's basically software to monitor how much our, our agents are contributing to the open hands repository. And on the, let me make that a little bit bigger. On the left side, I have the number of issues where it like sent a pull request.

I have the number of issues where it sent a pull request, whether it was merged in purple, closed in red, or is still open in green. And so these are like, you know, it's helping us monitor. But one thing it doesn't tell me is the total number. And I kind of want that feature added to this software. So I'm going to try to add that too. So I'll take this. I'll take this prompt.

And here I want to open up specifically that GitHub repo. So I'll open up that repo and paste in the prompt asking it, I asked it to make a pie chart for each of these and give me the total over the entire time period that I'm monitoring. So we'll do that. And so now I have, let's see, I have some agents. Oh, this one already finished. Let's see.

So this one already finished. You can see it finished analyzing the SWE-Bench repository. It wrote a demonstration of... Yeah, I'm trying to do that now, actually.

It wrote a demonstration of how much each of the systems has improved over time. And I asked it to label the top three for each of the data sets. And so it labeled OpenHands as being the best one for SWE-Bench normal. For SWE-Bench Verified, it has like the Amazon Q agent and OpenHands. For SWE-Bench Lite, it has three here, three over here. So you can see like

That's pretty useful, right? If you're a researcher, you do data analysis all the time. I did it while I was talking to all of you and making a presentation. So that's pretty nice. I doubt the other two are finished yet. That would be impressive. Yeah, so I think they're still working. So maybe we'll get back to them at the end of the presentation. So these are the kinds of things that I do every day with coding agents now, or software development agents. It's pretty impressive. Yeah.

The next thing I'd like to talk about a little bit is things I worry about when designing agents. So we're designing agents to do a very difficult task of navigating website, writing code, other things like this. And within 2024, there's been a huge improvement in the methodology that we use to do this. But there's a bunch of things we think about. There's a bunch of interesting papers, and I'd like to introduce a few of them.

So the first thing I worry about is the agent-computer interface. Like how do we get an agent to interact with computers? And how do we provide agents with the tools to do the job? And within OpenHands we are doing the thing on the right, but there's also a lot of agents that do the thing on the left.

So the thing on the left is you give agents kind of granular tools. You give them tools like, let's say your instruction is I want to determine the most cost-effective country to purchase the smartphone model Kodak One. The countries to consider are the USA, Japan, Germany, and India. And you have a bunch of available APIs. And so what you do for some agents is you provide them all of these tools, APIs as tools that they can call.

And so in this particular case, in order to solve this problem, you'd have to make about like 30 tool calls, right? You'd have to call lookup rates for Germany. You'd have to look it up for the US, Japan, and India. That's four tool calls. And then you go through and do all of these things separately. And the method that we adopt in OpenHands instead is we provide these tools, but we provide them by just giving a coding agent the ability to call arbitrary Python code and

In the arbitrary Python code, it can call these tools. We expose these tools as APIs that the model can call. What that allows us to do is instead of writing 20 tool calls, making 20 LLM calls, you write a program that runs all of these all at once and it gets the result. Of course, it can execute that program. It can make a mistake, it can get errors back and fix things, but that makes our job a lot easier. This has been really instrumental to our success, I think. Another part of this is what tools does the agent need?
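
To make the contrast concrete, here is a minimal sketch in Python, assuming hypothetical tool functions (`lookup_rates`, `lookup_phone_price`) are exposed to a code-writing agent; this is not OpenHands' actual API, just an illustration of collapsing dozens of tool calls into one generated program:

```python
# Hypothetical tool functions exposed to the agent as plain Python APIs.
# In a classic tool-calling setup, each call below would be a separate LLM round trip.
def lookup_rates(country: str) -> tuple[float, float]:
    """Return (exchange_rate_to_usd, tax_rate) for a country (stubbed values)."""
    return {"USA": (1.0, 0.08), "Japan": (0.0065, 0.10),
            "Germany": (1.07, 0.19), "India": (0.012, 0.18)}[country]

def lookup_phone_price(model: str, country: str) -> float:
    """Return the local list price of a phone model in a country (stubbed values)."""
    return {"USA": 999, "Japan": 150_000, "Germany": 950, "India": 80_000}[country]

# Instead of dozens of separate tool calls, the agent emits one program like this,
# the runtime executes it, and the agent only sees the final result.
def cheapest_country(model: str, countries: list[str]) -> str:
    best_country, best_usd = None, float("inf")
    for country in countries:
        rate, tax = lookup_rates(country)
        local_price = lookup_phone_price(model, country)
        usd_price = local_price * rate * (1 + tax)
        if usd_price < best_usd:
            best_country, best_usd = country, usd_price
    return f"{best_country} (~${best_usd:,.2f})"

print(cheapest_country("smartphone", ["USA", "Japan", "Germany", "India"]))
```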

And I think this depends on your use case. We're kind of extreme and we're only giving the agent five tools or maybe six tools. And what are they? The first one is program execution. So it can execute bash programs and it can execute Jupyter notebooks. It can execute cells in Jupyter notebooks. So that those are two tools. Another one is a file editing tool.

And the file editing tool allows you to browse parts of files and kind of read them, overwrite them, other stuff like this. And then we have another global search and replace tool. So it's actually two tools for file editing. And then a final one is web browsing. Web browsing, I'm kind of cheating when I call it only one tool. You actually have like scroll and text input and click and other stuff like that. But these are basically the only things we allow the agent to do.
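
As a rough illustration of what such a small tool surface could look like (the names and signatures below are illustrative, not the real OpenHands tool definitions):

```python
import subprocess
import urllib.request

# Illustrative tool registry: roughly program execution, file editing,
# search-and-replace, and web browsing, as described in the talk.
def execute_bash(command: str, timeout: int = 120) -> str:
    """Run a shell command and return combined stdout/stderr."""
    result = subprocess.run(command, shell=True, capture_output=True,
                            text=True, timeout=timeout)
    return result.stdout + result.stderr

def edit_file(path: str, content: str) -> None:
    """Overwrite a file with new content."""
    with open(path, "w", encoding="utf-8") as f:
        f.write(content)

def search_replace(path: str, old: str, new: str) -> None:
    """Global search-and-replace inside a single file."""
    with open(path, "r", encoding="utf-8") as f:
        text = f.read()
    with open(path, "w", encoding="utf-8") as f:
        f.write(text.replace(old, new))

def browse(url: str) -> str:
    """Fetch a page's raw HTML (a real browsing tool would also scroll and click)."""
    with urllib.request.urlopen(url) as resp:
        return resp.read().decode("utf-8", errors="replace")

# The agent only ever sees this small, fixed menu of actions.
TOOLS = {"execute_bash": execute_bash, "edit_file": edit_file,
         "search_replace": search_replace, "browse": browse}
```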

Then the question is, what if we wanted to allow it to do something else? And the answer is, well, human programmers already have a bunch of things that they use. They have the requests PyPI library. They have PDF-to-text PyPI libraries. They have all these other libraries in the Python ecosystem that they can use.

And so if we provide a coding agent with all these libraries, it can do things like data visualization and other stuff that I just showed you. So it can also git clone repositories and other things like this. The agents are super good at using the GitHub API also. So they can do, you know, things on GitHub, like finding all of the, you know, comments on your issues or checking GitHub Actions and stuff.

The second thing I think about is the human agent interface. So this is like, how do we get humans to interact with agents? I already showed you one variety of our human agent interface. It's basically a chat window where you can browse through the agent's results and things like this.

This is very, very difficult. I don't think anybody has a good answer to this, and I don't think we have a good answer to this, but the guiding principles that I'm trying to follow are we want to present enough info to the user. So we want to present them with what the agent is doing in the form of a kind of English description. So you can see here every time it takes an action, it says, like, I will help you create a script for sending emails,

When it runs a bash command, it will say "ran a bash command." It won't actually show you the whole bash command or the whole Jupyter Notebook because it can be really large, but you can open it up and see if you want to by clicking on this. So if you want to explore more, you can click over to the Jupyter Notebook and see what's displayed in the Jupyter Notebook, and you get lots and lots of information. So that's one thing.

Another thing is go where the user is. So like if the user's already interacting in a particular setting, then I'd like to, you know, integrate into that setting, but only to a point. So at OpenHands, we have a chat UI for interaction. We have a GitHub plugin for tagging and resolving issues. So basically what you do is you do @OpenHandsAgent and the OpenHands agent will like see

see that comment and be able to go in and fix things. So if you say, @OpenHandsAgent, tests are failing on this PR, please fix the tests, it will go in and fix the tests for you and stuff like this. Another thing we have is a remote runtime for launching headless jobs. So if you want to launch like a fleet of agents to solve, you know, five different problems at once, you can also do that through an API. So we have these interfaces.

And this probably depends on the use case. So like, depending, if you're a coding agent, you want to do things one way. If you're a, like, insurance auditing agent, you'll want to do things other ways, obviously. Another thing I think about a lot is choosing a language model. And for agentic LLMs, we have to have a bunch of things work really well. The first thing is really, really good instruction following ability.

And if you have really good instruction following ability, it opens up like a ton of possible applications for you. Tool use and coding abilities. So if you provide tools, it needs to be able to use them well.

Environment understanding. So it needs, like if you're building a web agent, it needs to be able to understand web pages either through vision or through text. And error awareness and recovery ability. So if it makes a mistake, it needs to be able to, you know, figure out why it made a mistake, come up with alternative strategies and other things like this.

Under the hood in all of the demos that I did now, we're using Claude. Claude has all of these abilities. Very good, not perfect, but very good.

Most others don't have these abilities quite as much. So like GPT-4o doesn't have very good error recovery ability. And so because of this, it will go into loops and do the same thing over and over and over again, whereas Claude does not do this. Claude, if you use the agents enough, you get used to their kind of like personality, and Claude says, hmm, let me try a different approach a lot. So, you know, obviously it's been trained in some way to, you know, elicit this ability. Yeah.

We did an evaluation. This is old and we need to update this basically, but we evaluated Claude, GPT-4o, o1-mini, Llama 405B, and DeepSeek 2.5

on being a good code agent within our framework. And Claude was kind of head and shoulders above the rest. GPT-4o was kind of okay. The best open-source model was Llama 3.1 405B. This needs to be updated because this is like a few months old by now and things are moving really, really fast. But I still am under the impression that Claude is the best. The other closed models are not quite as good. And then the open models are a little bit behind that.

Grok, we haven't tried Grok at all, actually. So it's a good question. If you want to try it, I'd be happy to help. Cool. Another thing is planning. And so there's a few considerations for planning. The first one is whether you have a curated plan or you have it generated on the fly. And so for solving GitHub issues, you can kind of have an overall plan. Like the plan is first reproduce if there's

an issue, first write tests to reproduce the issue or to demonstrate the issue. After that, run the tests and make sure they fail. Then go in and fix the code, run the tests again to make sure they pass, and then you're done. So that's a pretty good workflow for solving coding issues. And you could curate that ahead of time.

Another option is to let the language model basically generate its own plan. And both of these are perfectly valid. Another one is explicit structure versus implicit structure. So let's say you generate a plan. If you have explicit structure, you could like write a multi-agent system and the multi-agent system would have your reproducer agent, and then it would have your, your

your bug, your test writer agent and your bug fixer agent and lots of different agents. And you would explicitly write this all out in code and then then use it that way. On the other hand, you could just provide a prompt that says, please do all of these things in order.

So in OpenHands, we do very light planning. We have a single prompt. We don't have any multi-agent systems, but we do provide like instructions about what to do first, what to do next, and other things like this. I'm not against doing it the other way, but I laid out some kind of justification for this in this blog post called 'Don't Sleep on Single Agent Systems'.
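
A sketch of what such a single curated-workflow prompt might look like (the wording is illustrative, not the actual OpenHands system prompt):

```python
# Illustrative "light planning" prompt: one agent, one curated workflow,
# expressed as ordered instructions rather than as separate sub-agents.
ISSUE_FIXING_PROMPT = """You are a software engineering agent working in a git repository.
Follow this workflow to resolve the GitHub issue below:
1. Read the issue and gather context from the relevant files.
2. Write a test that reproduces (demonstrates) the issue.
3. Run the test and confirm that it fails.
4. Fix the code.
5. Re-run the test and confirm that it passes.
If a step does not go as planned, explain what happened and adapt; do not repeat
the same failing action in a loop.

Issue:
{issue_text}
"""

def build_prompt(issue_text: str) -> str:
    """Fill the curated workflow prompt with a concrete issue."""
    return ISSUE_FIXING_PROMPT.format(issue_text=issue_text)
```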

And the basic idea behind this is if you have a really, really good instruction following agent, it will follow the instructions as long as things are working according to your plan. But let's say you need to deviate from your plan. You still have the flexibility to do this. And if you do explicit structure through a multi-agent system, it becomes a lot harder to do that. Like you get stuck when things deviate from your plan. There's also some other examples and I wanted to introduce a few papers. So one paper I liked recently is this paper called CoAct.

where you generate plans and then go in and fix them. And so the basic idea is like, if you need to deviate from your plan, you can, you know, figure out that your plan was not working and go back and deviate from it. Another thing I think about a lot is specifying common workflows. So we're trying to tackle software development. And I already showed like three use cases where we do software development. And when we...

do software development, we do a ton of different things, but we do them over and over and over again. So just think of an example. We fix GitHub actions when GitHub actions are failing, and we do that over and over and over again. That's not the number one thing that software engineers do, but it's high up on the list. So how can we get a list of all of the workflows that people are working on? And

There's a few research works that people have done in this direction. One example is manual prompting. So there's this nice paper called SteP that got state of the art on the WebArena web navigation benchmark, where they came up with a bunch of manual workflows for solving different web navigation tasks. And we also have a paper recently called Agent Workflow Memory, where the basic idea behind this is we want to create self-improving agents that learn from their past successes.

And the way it works is we have a memory that has an example of lots of the previous workflows that people have used. And every time the agent finishes a task and it self-judges that it did a good job at that task, you take that task, you break it down into individual workflows included in that, and then you put it back in the prompt for the agent to work next time.
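
A minimal sketch of that loop, simplified to storing and replaying raw workflow steps (the real Agent Workflow Memory work induces reusable workflows with an LLM rather than this naive version):

```python
import json
from pathlib import Path

MEMORY_FILE = Path("workflow_memory.json")

def load_workflows() -> list[dict]:
    """Load previously stored workflows (empty on first run)."""
    if MEMORY_FILE.exists():
        return json.loads(MEMORY_FILE.read_text())
    return []

def record_success(task: str, steps: list[str]) -> None:
    """After the agent self-judges a task as successful, store its workflow."""
    workflows = load_workflows()
    workflows.append({"task": task, "steps": steps})
    MEMORY_FILE.write_text(json.dumps(workflows, indent=2))

def memory_augmented_prompt(base_prompt: str, k: int = 5) -> str:
    """Prepend the k most recent successful workflows to the next prompt."""
    workflows = load_workflows()[-k:]
    if not workflows:
        return base_prompt
    examples = "\n\n".join(
        f"Task: {w['task']}\nSteps:\n" + "\n".join(f"- {s}" for s in w["steps"])
        for w in workflows)
    return f"Useful workflows from past successes:\n\n{examples}\n\n{base_prompt}"
```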

And we demonstrated that this leads to a 22.5% increase on WebArena after 40 examples. So that's a pretty huge increase by kind of self-learning and self-improvement. Another thing is exploration. And one thing I think about is, how can agents learn more about their environment before acting? And I work on coding and web agents, and there's a few good examples of this in both areas.

Within coding, I view this as like repository understanding, understanding the code base that you're dealing with. And there's an example of this, or a couple examples of this, one example being Agentless, where they basically create a map of the repo. And based on the map of the repo, they feed that into the agent so the agent can then navigate the repo and better know where things are.
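
A rough sketch of the general idea of a repository map, not the Agentless implementation itself:

```python
import ast
import os

def build_repo_map(root: str) -> dict[str, list[str]]:
    """Map each Python file to its top-level classes and functions."""
    repo_map: dict[str, list[str]] = {}
    for dirpath, _dirnames, filenames in os.walk(root):
        for name in filenames:
            if not name.endswith(".py"):
                continue
            path = os.path.join(dirpath, name)
            try:
                with open(path, encoding="utf-8") as f:
                    tree = ast.parse(f.read())
            except (SyntaxError, UnicodeDecodeError):
                continue
            symbols = [n.name for n in tree.body
                       if isinstance(n, (ast.FunctionDef, ast.AsyncFunctionDef, ast.ClassDef))]
            repo_map[os.path.relpath(path, root)] = symbols
    return repo_map

# The map (file -> symbols) is then serialized into the prompt so the agent
# can decide which files to open before it starts editing.
if __name__ == "__main__":
    for path, symbols in build_repo_map(".").items():
        print(path, "->", ", ".join(symbols) or "(no top-level defs)")
```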

And for web agents, there's an example of a paper called Bagel. And basically what they do is they have the agent just do random tasks on a website, explore the website, better understand the structure of the website. And then after that, they, they feed that in as a part of the prompt. Part seven is search.

Right now in OpenHands, we just let the agent go on a linear search path. So it's just solving the problem once. We're using a good agent that can kind of like recover from errors and try alternative things when things are not working properly, but still we only have a linear search path.

But there's also some nice work in 2024 that is about exploring multiple paths. So one example of this is there's a paper called Tree Search for Language Agents, and they basically expand multiple paths, check whether the paths are going well, and if they aren't going well, you rewind back.
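
In rough Python terms, that search loop might look like the generic best-first sketch below, with `expand`, `score`, and `is_done` standing in for the model-driven pieces:

```python
import heapq
from typing import Any, Callable

def tree_search(initial_state: Any, expand: Callable, score: Callable,
                is_done: Callable, budget: int = 50):
    """Best-first search over agent trajectories.

    expand(state)  -> list of candidate next states (e.g. sampled agent actions)
    score(state)   -> higher is better (e.g. a value model or self-evaluation)
    is_done(state) -> True if the task looks solved
    """
    # Max-heap via negated scores; the counter breaks ties without comparing states.
    frontier = [(-score(initial_state), 0, initial_state)]
    counter = 1
    while frontier and budget > 0:
        _, _, state = heapq.heappop(frontier)  # take the most promising path so far
        if is_done(state):
            return state
        for child in expand(state):            # branch into multiple candidate paths
            heapq.heappush(frontier, (-score(child), counter, child))
            counter += 1
        budget -= 1                            # rewinding = popping a different branch
    return None  # nothing judged successful within budget; caller can fall back
```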

And on the web, this is kind of tricky because like, how do you rewind when you accidentally ordered something you don't want on Amazon? It's kind of, you know, not, not the easiest thing to do for code. It's a little bit easier because you can just revert any changes that you made. But I think that's an interesting topic too. And then finally evaluation. So within our development for evaluation, we want to do a number of things. The first one's fast sanity checks.

And in order to do this, we want things we can run really fast, really, really cheaply. So for web, we have something called MiniWoB, mini world of bits, which is basically these trivial kind of web navigation things. We have something called the Aider code editing benchmark, where it's just about editing individual files, that we use. But we also want highly realistic evaluation.

So for the web, we have something called WebArena that we created at CMU. This is web navigation on real open-source websites. So it's open-source websites that are actually used to serve shops or like bulletin boards or other things like this. And for code, we use SWE-Bench, which I think a lot of people may have heard of. It's basically a coding benchmark that comes from real-world pull requests on GitHub. So if you can solve those, you can also probably solve other real-world pull requests.

I would say we still don't have benchmarks for the full versatility of agents. So, for example, we don't have benchmarks that test whether agents can code and do web navigation. But we're working on that and hoping to release something in the next week or two. So if that sounds interesting to you, come talk to me and I will tell you more about it.

Cool. So I don't like making predictions, but I was told that I should be somewhat controversial, I guess. So I will try to do it. I'll try to do it anyway, although maybe none of these will be very controversial. The first thing is agent-oriented LLMs, like large language models for agents. My prediction is every large LLM trainer will be focusing on training models as agents. So every large language model will be a better agent model by mid-2025.

Competition will increase, prices will go down, smaller models will become competitive as agents. So right now actually agents are somewhat expensive to run in some cases, but I expect that that won't last six months. I bet we'll have much better agent models in six months. Another thing is instruction following abilities specifically in agentic contexts will increase. And what that means is we'll have to do less

manual engineering of agentic workflows and be able to do more by just prompting agents in more complex ways. Claude is already really good at this. It's not perfect, but it's already really, really good. And I expect the other models will catch up to Claude pretty soon. Error correction ability will increase, less getting stuck in loops. Again, this is something that Claude's already pretty good at, and I expect the others will follow.

Agent benchmarks. Agent benchmarks will start saturating, so WebArena and SWE-Bench. I think WebArena is already too easy. It's not super easy, but it's already a bit too easy, because the tasks we do in there are ones that take like two minutes for a human. So not too hard.

And kind of historically, in 2023, our benchmarks were too easy. So we built harder benchmarks; WebArena and SWE-Bench were both built in 2023. In 2024, our agents were too bad. So we built agents, and now we're building better agents. In 2025, our benchmarks will be too easy. So we'll build better benchmarks, I'm guessing. So I would expect to see much more challenging agent benchmarks come out. And we're already seeing some of them.

In 2026, I don't know. I didn't write AGI, but we'll see. Then the human-agent-computer interface. I think one thing that we'll want to think about is, what do we do at 75% success rate at things that we actually care about? Right now we have 53% or 55% on SWE-Bench Verified, which is real-world GitHub PRs. My impression is that the actual...

Actual ability of models is maybe closer to 30 to 40%. So 30 to 40% of the things that I want an agent to solve on my own repos, it just solves without any human intervention. 80 to 90% it can solve without me opening an IDE, but I need to give it feedback. So how do we make that interaction smooth so that humans can audit the work of agents that are really, really good, but not perfect is going to be a big challenge.

How can we expose the power of programming agents to other industries? So like as programmers, I think not all of us are using agents every day in our programming, although we probably will be in months or maybe a year. But I think it will come very naturally to us as programmers because we know code. We know, you know,

Like how to architect software and stuff like that. So I think the question is, how do we put this in the hands of like a lawyer or a chemist or somebody else and have them also be able to, you know, interact with it as naturally as we can.

Another interesting thing is how can we redesign our existing systems for agents? So we had a paper on API based web agents. And basically what we showed is if you take a web agent and the agent interacts not with a website, but with APIs, the accuracy goes way up just because APIs are way easier to interact with. And in fact, like when I ask the

Well, our agent, our agent is able to browse websites, but whenever I want it to interact with GitHub, I tell it do not browse the GitHub website, use the GitHub API, because it's way more successful doing that. So maybe, you know, every website is going to need to have an API because we're going to be having agents interact with them.
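
For instance, listing the comments on an issue is a single API call rather than several page loads and clicks; here is a plain `requests` sketch against the public GitHub REST API (the repo and issue number below are just placeholders):

```python
import requests

def issue_comments(owner: str, repo: str, issue_number: int, token: str | None = None):
    """Fetch comments on a GitHub issue via the REST API instead of browsing."""
    url = f"https://api.github.com/repos/{owner}/{repo}/issues/{issue_number}/comments"
    headers = {"Accept": "application/vnd.github+json"}
    if token:  # a fine-grained token scoped to this repo is sufficient
        headers["Authorization"] = f"Bearer {token}"
    resp = requests.get(url, headers=headers, timeout=30)
    resp.raise_for_status()
    return [(c["user"]["login"], c["body"]) for c in resp.json()]

# Example with a placeholder public repo:
# issue_comments("octocat", "Hello-World", 1)
```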

About progress, I think progress will get faster. It's already fast. A lot of people are already overwhelmed, but I think it will continue. The reason why is agents are building agents and better agents will build better agents faster. So I expect that, you know, if you haven't interacted with a coding agent yet, it's pretty magical, like the stuff that it can do. So yeah.

And I have a call to action. I'm honestly, like I've been working on natural language processing and language models for what, 15 years now. And even for me, it's pretty impressive what like AI agents powered by strong language models can do. On the other hand, I believe that we should really make these powerful tools accessible. And what I mean by this is I don't think like, you know,

We should have these be opaque or limited to only a certain set of people. I feel like they should be affordable. They shouldn't be increasing the difference in the amount of power that people have. If anything, I'd really like them to kind of make it possible for people who weren't able to do things before to be able to do them well.

Open source is one way to do that. That's why I'm working on open source. There are other ways to do that, you know, make things cheap, make things, you know, so you can serve them to people who aren't able to afford them easily. Like Duolingo is one example where they get all the people in the US to pay them $20 a month so that they can give all the people in South America free, you know, language education so they can learn English and become, you know, like...

and become more attractive on the job market, for instance. And so I think we can all think of ways that we can do that sort of thing.

And if that resonates with you, please contribute. Of course, I'd be happy if you contribute to Open Hands and use it. But another way you can do that is just use open source solutions, contribute to them, research with them, and train strong open source models. So I see some people in the room who are already training models. It'd be great if you could train models for coding agents and make them cheap. And yeah, please, I was thinking about you, among others. Yeah.

Yeah, that's all I have. Thanks. Slightly controversial take is probably the nicest way to say hot take. Any hot-take questions? Actual hot takes? Oh, I can also show the other agents that were working if anybody's interested. But yeah, sorry, go ahead. Yeah, I have a couple of questions. So they're kind of paired maybe. The first thing is that you said that

You're estimating that your agent is successfully resolving like something like 30 to 40% of your issues, but that's like below what you saw in SWE-Bench. So I guess I'm wondering where that discrepancy is coming from. And then I guess my other second question, which is maybe broader in scope, is that like,

If you think of an agent as like a junior developer, and I say, go do something, then I expect maybe tomorrow to get a Slack message being like, hey, I ran into this issue. How can I resolve it? And like you said, your agent is like successfully solving like 90% of issues where you give it direct feedback. So are you thinking about how to get the agent to reach out to like for planning when it's stuck or something like that? Or like identify when it runs into a hole like that? Yeah.

Yeah, so great. These are great questions. Sorry, the third question, which is a good set. This is the first two. And if so, are you going to add a benchmark for that second question? Okay, great. Yeah, great questions. Okay, so the first question was, why do I think it's resolving less than 50% of the issues on SWE-Bench? So first, SWE-Bench is on popular open-source repos, and all of these popular open-source repos were included in the training data for all of the language models.

And so the language models already know these repos. In some cases, the language models already know the individual issues in SWE-Bench. So basically, like, some of the training data has leaked. And so it definitely will overestimate with respect to that. I don't think it's like horribly, horribly off, but I think, you know, it's boosting the accuracy by a little bit. So maybe that's the biggest reason why. In terms of...

Asking for help, and whether we're benchmarking asking for help. Yes, we are. So one thing we're working on now, which we're hoping to put out soon, is we basically made super vague SWE-Bench issues. Like, I'm having a problem with the matrix multiply, please help. Because these are like, if anybody's run a popular open source, like, framework,

these are what half your issues are. Your users show up and say, my screen doesn't work, what's wrong? Or something. And so then you need to ask them questions and how to reproduce. So yeah, we're working on that. I think my impression is that agents are not very good at asking for help, even Claude. So when they ask for help, they'll ask for help when they don't need it and then won't ask for help when they do need it. So this is definitely an issue, I think.

Yep. Thanks for the great talk. I also have two questions. The first one: can you talk a bit more about how the web agent interacts with websites? So is there a VLM that looks at the webpage layout, and then you parse the HTML and select which buttons to click on? And if so, do you think there's a future where, so I work at Bing at Microsoft, I do think there's a future where there's the same web index, but an agent-friendly web index where all the processing is done offline,

so that you don't need to spend time cleaning up this HTML and figuring out what to click online. Any thoughts on that? Yeah, so great question. There's a lot of work on web agents. I didn't go into all of the details, but I think there's three main ways that agents interact with websites. The first way is the simplest way and the newest way, but it doesn't work very well, which is you take a...

a screenshot of the website and then you click on a particular pixel value on the website and like

Models are not very good at that at the moment. They'll misclick. There was this thing about how Claude computer use started looking at pictures of Yellowstone National Park or something like this. I don't know if you heard about this anecdote, but people were like, oh, it's so human, it's looking for vacation. And it was like, no, it probably just misclicked on the wrong pixels and accidentally clicked on an ad. So this is the simplest way. The second simplest way

is you take the HTML and you basically identify elements in the HTML. You don't use any vision whatsoever. And then you say, "Okay, I want to click on this element. I want to enter text in this element," or something like that. But HTML is too huge. So actually, it usually gets condensed down into something called an accessibility tree, which was made for screen readers for visually impaired people.

And so that's another way. And then the third way is kind of a hybrid where you present the screenshot, but you also present like a textual summary of the output. And that's the one that I think will probably work best. What we're using is we're just using text at the moment. And that's just an implementation issue that we haven't implemented the visual stuff yet. But that's kind of like we're working on it now.
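
As a rough sketch of the text-only option, an agent-facing observation can be built by condensing raw HTML down to just its interactive elements; this is illustrative (using BeautifulSoup), whereas real systems typically rely on the browser's accessibility tree instead:

```python
from bs4 import BeautifulSoup  # pip install beautifulsoup4

def condense_html(html: str) -> str:
    """Reduce a page to a numbered list of elements the agent can act on."""
    soup = BeautifulSoup(html, "html.parser")
    lines = []
    for i, el in enumerate(soup.find_all(["a", "button", "input", "select", "textarea"])):
        label = el.get_text(strip=True) or el.get("aria-label") or el.get("placeholder") or ""
        lines.append(f"[{i}] <{el.name}> {label[:80]}")
    return "\n".join(lines)

# The agent then emits actions like click(12) or type(7, "hello"),
# and the runtime maps the index back to the corresponding element.
```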

Another thing that I should point out is we actually have two modalities for web browsing. Very recently, we implemented this. And the reason why is because if you want to interact with full websites, you will need to click on all of the elements or have the ability to click on all of the elements. But most of our work that we need websites for is just web browsing and gathering information. So we have another modality where we convert all of it to Markdown because that's way more concise and easier for the agent to deal with. And then

Can we create an index specifically for agents? Maybe a markdown index or something like that would make sense. Oh, how would I make a successor to Sweebench?

I mean, a first thing is there's like LiveCodeBench, which is basically continuously updating to make sure it doesn't leak into language model training data. That's easy to do for SWE-Bench because it comes from real websites, and those real websites are getting new issues all the time. So you could just do it on the same benchmarks that they have there. There's also like a pretty large number of things covering various coding tasks. So like, for example, SWE-Bench is mainly fixing issues, but there's also like

documentation, there's generating tests that actually test the functionality that you want. And there was a paper by a student at CMU on generating tests and stuff like that. So I feel like SWE-Bench is one piece of the puzzle, but you could also have like 10 different other tasks. And then you could have like a composite benchmark where you test all of these abilities, not just that particular one. Well, lots of other things too, but yeah. Question from across. Use your mic, it will help.

Great talk. Thank you. My question is about your experience designing agent architectures. Specifically, how much did you have to separate concerns in terms of task-specific agents versus having one agent to do three or five things with a gigantic prompt with conditional paths and so on?

Yeah, so that's a great question. So we have a basic coding and browsing agent. And I won't say basic, like it's a good, you know, it's a good agent, but it does coding and browsing. And it has instructions about how to do coding and browsing. That is enough for most things, especially given a strong language model that has a lot of background knowledge about how to solve different types of tasks and how to use different APIs and stuff like that.

We do have a mechanism for something called microagents. And microagents are basically something that gets added to the prompt when a trigger is triggered.
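
A minimal sketch of that trigger mechanism (illustrative; not the real OpenHands microagent format):

```python
# Each "microagent" is just extra guidance keyed on a trigger word.
MICROAGENTS = {
    "github": ("When working with GitHub, use the GitHub API instead of "
               "browsing the GitHub website."),
    "npm": ("When running npm, avoid interactive prompts (e.g. pass --yes or "
            "pipe 'yes' into the command) so the terminal never blocks waiting "
            "for input."),
}

def augment_prompt(base_prompt: str, user_message: str) -> str:
    """Append any microagent instructions whose trigger appears in the message."""
    extras = [text for trigger, text in MICROAGENTS.items()
              if trigger in user_message.lower()]
    if not extras:
        return base_prompt
    return base_prompt + "\n\nAdditional guidance:\n" + "\n".join(f"- {e}" for e in extras)
```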

Right now, it's very, very rudimentary. It's like, if you detect the word GitHub anywhere, you get instructions about how to interact with GitHub, like use the API and don't browse. Also, another one that I just added is for npm, the JavaScript package manager. And npm, when it runs and it hits a failure, it hits an interactive terminal where it says, would you like to quit?

and you enter yes. And if it does that, it stalls our agent until the two-minute timeout. So I added a new microagent: whenever it starts using npm, it gets instructions about how to not use an interactive terminal and stuff like that. So that's our current solution. Honestly, I like it a lot. It's simple. It's easy to maintain. It works really well and stuff like that. But I think there is a world where you would want something more complex than that. Got it. Thank you. I got a question about MCP.

I feel like this is the Anthropic Model Context Protocol. It seems like the most successful type of this standardization of interactions between computers and agents. Are you guys adopting it? Is there any other competing standard? Any thoughts about it? Yeah, I think the Anthropic MCP is essentially a collection of APIs that you can use to interact with different things on the internet.

I think it's not a bad idea, but there's a few things that bug me a little bit about it. It's like, we already have an API for GitHub, so why do we need an MCP for GitHub? GitHub has an API. The GitHub API is evolving.

we can look up the GitHub API documentation. So it seems like kind of duplicated a little bit. And also they have a setting where it's like you have to spin up a server to serve your GitHub stuff and you have to spin up a server to serve your like, you know, other stuff. And so I think it makes sense if you really care about like separation of concerns and security and like other things like this. But right now we haven't seen anything

We haven't seen that to have a lot more value than interacting directly with the tools that are already provided. And that kind of goes into my general philosophy, which is we're already developing things for programmers. You know, how is an agent different from a programmer? And it is different, obviously, you know, like agents are different from programmers, but they're not that different at this point. So we can kind of interact with the interfaces we create for programmers. Yeah. I might change my mind later, though. So we'll see.

Yeah. Hi, thanks. Very interesting talk. You were saying that the agents you have right now solve like maybe 30% of your issues out of the gate. I'm curious, of the things that it doesn't do, is there like a pattern that you observe? Like, oh, like these are the sorts of things that it just seems to really struggle with, or is it just seemingly random? It's definitely not random. It's like, if you think it's more complex than it's like just intuitively, it's more likely to fail.

I've gotten a bit better at prompting also. So like, just to give an example, it, it will sometimes fail to fix a GitHub workflow because it will not look at the GitHub workflow and understand what the GitHub workflow is doing before it solves the problem. So I think actually probably the biggest thing that it fails at is, or that our, our agent plus Claude fails at is insufficient information gathering before trying to solve the task.

And so if you provide all, if you provide instructions that it should do information gathering beforehand, it tends to do well. If you don't provide sufficient instructions, it will try to solve the task without like fully understanding the task first and then fail. And then you need to go back and give, uh, you know, additional feedback.

Another example, like I love this example. While I was developing the monitor website that I showed here, we had a really tricky bug where it was writing out a cache file to a different directory than it was reading the cache file from. And I had no idea what was going on. I thought the bug was in a different part of the code. But

What I asked it to do was come up with five possible reasons why this could be failing, in decreasing order of likelihood, and examine all of them. And that worked, and it could just go in and like do that. So like, I think a certain level of like scaffolding about like how it should sufficiently gather all the information that's necessary in order to solve a task is, like, if that's missing, then that's probably the biggest failure point at the moment. Thanks. Yeah. Yeah.

I'm just using this as a chance to ask you all my questions. You had a slide on here about self-improving agents or something like that with memory. It's a really throwaway slide for a super powerful idea. It got me thinking about how I would do it. I have no idea how. I just wanted you to chain a thought more on this. Yeah. Self-improving. I think the biggest reason, the simplest possible way to create a self-improving agent

is to have a really, really strong language model with infinite context. And it can just go back and look at all of its past experiences and learn from them. You might also want to remove the bad stuff just so it doesn't over-index on its failed past experiences. But the problem is a really powerful language model is large. Infinite context is expensive. We don't have a good way to index into it because RAG...

At least in my experience, RAG from language to code doesn't work super well. So I think in the end it's like, that's the way I would like to solve this problem. I'd like to have an infinite context and somehow be able to index into it appropriately.

And I think that would mostly solve it. Another thing you can do is fine tuning. So I think like RAG is one way to get information into your model. Fine tuning is another way to get information into your model. So that might be another way of continuously improving. Like you identify when you did a good job and then just add,

all of the good examples into your model. Yeah. So you know how like Voyager tries to write code into a skill library and then reuses a skill library, right? So it improves in the sense that it just builds up the skill library over time. Yep. One thing I was like thinking about, and there's this idea from Devin, your arch nemesis, of playbooks. I don't know if you've seen them. Yeah. I mean, we're calling them workflows, but they're similar. Yeah. So like basically like you should like,

Once a workflow works, you can persist them as a skill library. I feel like that's some in-between. Like you said, it's hard to do RAG between language and code, but I feel like that is RAG for, like, I've done this before. Last time I did it, this worked. So I'm just going to shortcut.

All the stuff that, uh, that failed before. Yeah, I totally, I think it's possible. It's just, you know, not, not trivial at the same time. Yeah. I'll explain the two curves. So basically the base, the baseline is just an agent that does it from scratch every time. And this curve up here is agent workflow memory, where it's like adding the successful experiences back into the prompt.

Why is this improving? The reason why is because just it failed on the first few examples and for the average to catch up, it took a little bit of time. So it's not like this is actually improving it. You could just basically view the, this one is constant and then this one is like improving like this. Basically you can see it's continuing to go up. Yeah. How do you think we're going to solve the authentication problem for agents right now?

When you say authentication, you mean like credentials? Yeah. Yeah, because I've seen a few startup solutions today, but it seems like it's limited to the amount of websites or actual authentication methods that it's capable of performing today. Yeah, great question. So my preferred solution to this at the moment is GitHub fine-grained authentication tokens. And GitHub fine-grained authentication tokens allow you to specify very...

on a very granular basis. On this repo, you have permission to do this. On this repo, you have permission to do this. You also can prevent people from pushing to the main branch unless they get approved. You can do all of these other things. And I think these were all developed for human developers. The branch protection rules were developed for human developers. The fine-grained authentication tokens were developed for GitHub apps. I think

For GitHub, maybe just pushing this a little bit more is the way to do this. For other things, they're totally not prepared to give that sort of fine-grained control. Most APIs don't have something like a fine-grained authentication token. And that goes into my comment that we're going to need to prepare the world for agents, I think. But I think the GitHub authentication tokens are a good template for how you could start doing that, maybe. But yeah, I don't know. I don't have an answer. I'll let you know if I find one. Okay, yeah, thank you.

Cool. I'm going to finish up. Let me just see. Okay, so this one did write a script. I'm not going to actually read it for you. And then the other one, let's see. Yeah, so it sent a PR. Sorry, what is the PR URL? So I don't know if this... Sorry, that's taking way longer than it should. Okay, cool. Yeah, so this one sent a PR. I'll tell you later if this actually successfully...

Oh, no, it's deployed on Vercel, so I can actually show you. But let me try this real quick. Sorry, I know I don't have time. Yeah, there you go. I have pie charts now. So yeah, it's so fun. It's so fun to play with these things, because you could just do that while I'm giving a talk and things like that. So yeah, thanks.