
Benchmarking AI Agents on Full-Stack Coding

2025/3/28

AI + a16z

People
Martin Casado
General Partner, focused on AI investing and on advancing the industry.
Sujay Jayakar
Topics
Martin Casado: I think trajectory management in many AI code-generation tools still needs work; writing difficult code is like playing a game, and you need good heuristics to guide the process. Today's AI agents still struggle to build complete full-stack applications, which calls for better trajectory management and heuristics. I recently ran into trouble with Claude 3.7: it was too "clever," and its code edits were hard to unwind. That shows how much model-version choice matters in practice; you have to weigh a model's cleverness against the maintainability of the code. Benchmarks help you understand the strengths and weaknesses of different platforms, but evals are more practical for the independent developer, and writing them takes some skill. For developers building applications on top of AI models, both benchmarks and evals matter; evals help you understand and improve your product, and they are chronically underappreciated. A good eval suite forces you to spell out your product's goals, what solutions look like, and how they are graded, and lets you verify improvements through testing. Model updates can force evals to be rewritten, which is a real challenge for software engineering; the lack of transparency in how models are developed and post-trained is partly to blame, and the directions in which large models improve are not always the core problems. As models and the surrounding tooling mature, the impact of model updates on evals may shrink. I use AI for coding constantly, and it makes me significantly more productive.

Sujay Jayakar: Building a complete full-stack app is still not a slam dunk for today's autonomous AI agents, and several factors affect their performance: strong guardrails, the model's raw code-writing ability, and good library choices and abstractions. Strong guardrails, such as fast feedback and clear boundaries between correct and incorrect, significantly improve autonomous coding performance. Models are good at writing code but weaker at things like evaluating RLS rules or explaining why a SQL query works. Choosing the right libraries and abstractions is critical: be explicit about what the model needs to do, and just as explicit about what it should not do. The Full Stack Bench benchmark tests whether an AI agent can build a complete full-stack app from the front end through the back end, including the database, APIs, and subscriptions; we created it because existing benchmarks don't adequately capture agent performance on real full-stack development. On complex problems, agents can become inconsistent because of context-management issues; guardrails like type safety reduce that inconsistency and keep the agent from drifting as it explores a solution, and runtime guardrails, such as languages that are easy to test, also help manage an agent's trajectory. Models are weaker at debugging and reasoning than at writing code, especially around React Hook rules or SQL's RLS rules. Knowledge cutoffs and pre-training data limit their ability to work with new abstractions, though in-context learning helps, and both knowledge and steerability vary across models. Cheaper models underperform more expensive ones; fine-tuning can close some of the gap but may be too involved for hobbyists, and Gemini stands out on price-performance. Full Stack Bench focuses on large multi-component systems that integrate the front end, the API layer, and the database. Even with the same model there is a lot of variance in generated code, and guardrails like type safety reduce it. High-quality evals are essential to AI application development, but few are public because many companies treat them as trade secrets; openly sharing good eval sets would move the field forward. When using AI tools on complex tasks, pay attention to trajectory planning, progress tracking, and avoiding loops; improving results means thinking about task prompts, tool choice, and framework choice. Coding with AI is like working with a human engineer: break the task into steps and make sure each step reaches a committable state so you can roll back when something goes wrong.


Chapters
The episode starts by discussing the challenges of trajectory management in AI coding, drawing parallels to AlphaGo and heuristic development in game playing. It highlights the difficulty of finding efficient paths to solutions and the need for robust heuristics.
  • Trajectory management in AI coding is underdeveloped.
  • Coding is like playing a game with a starting and ending position, but few clear paths between them.
  • Good heuristics are crucial but hard to develop for AI agents.

Transcript


You know, I'm even thinking about this from the RL way back with AlphaGo and all that. It feels to me like trajectory management is still pretty underdeveloped for a lot of these things. I feel like coding a difficult problem is actually like playing a game, right? You have the starting position, you have the ending position, and there's probably very few bright lines to go between them. Having a good heuristic is actually very hard, right? It's something we teach humans all the time, right? I'm like, how do you know that you should commit

and have this as a commanding position to make further progress. And I think the combination of that where it feels like the heuristic landscape is that there's these bright lines, a little bit of wiggle room around them, but not very much. And then once you fall off that, you're totally... Thanks for listening to the a16z AI podcast. This episode features a great discussion between a16z general partner Martin Casado and Convex co-founder and chief scientist, Sujay Jayakar, about just what the title suggests.

Benchmarking AI agents on full stack coding tasks. Sujay talks through why this is important, as well as a benchmark his team developed to do it, and the two also get into their experiences with AI-generated code overall. You'll hear all of that, as well as Martin's glowing introduction to Sujay after these disclosures.

As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any a16z fund. For more details, please see a16z.com slash disclosures. Sujay, I really appreciate you joining us on the podcast.

For those that don't know, Sujay is considered by me and many others as the top systems thinker in the world. I say that a little lightly, but a little not. So let me just kind of go through his background a little bit. So Sujay was on the Magic Pocket team at Dropbox. They implemented S3 all the way down to the hardware. He is a co-founder of Convex, and he spent a lot of time thinking about the implications of

AI-generated code. So this is what we're going to be talking about: using AI to code, the implications on systems, and so forth. So welcome to the podcast, Sujay.

Thanks. Thanks for that intro. For sure. Only a little bit of hyperbole. By the way, I want to be very clear. Many people do consider you the top, one of the top systems thinkers in the world. So not everybody's going to be familiar with Convex. It's trying to do something people have been talking about in database land for decades. It's almost kind of a white whale. So maybe just give a quick background on what you're working on at Convex, and then we'll kind of move over to the AI stuff.

Yeah, sure. So Convex is a reactive database, and it's a database that's built from the ground up to make application development as easy as possible. So there are a bunch of implications for that. But I think the starting point is we just casually use databases that are over 30 years old without thinking about it. And we don't do the same for programming languages. We don't do the same for our libraries.

And Convex is, just like you're saying, we're trying to go after that white whale of: could you make application development an order of magnitude plus more efficient if you rethought some of those things from first principles? So everything, for example, is reactive by default.

You don't have to handle state management at all. Everything is type safe end to end. And yeah, I mean, it kind of has all the pieces that you need to make a modern application just integrated and configured entirely in code.

So practically what this means to, you know, us lay developers. So I use Convex on a lot of projects. So let's say I'm writing some web app in JavaScript. Practically what it means is I just take my JavaScript and it gets run in Convex. And then I get transactionality, I get reactivity, I get the ability to query things. And I don't have to like resort to, you know, SQL and all the foibles with SQL to do that. So you actually just basically end up getting a transactional backend while still just writing JavaScript. Yeah.
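
To make that concrete, here is a minimal sketch of what that looks like in practice, assuming Convex's standard TypeScript query/mutation syntax; the "messages" table and its fields are illustrative, not something discussed in the episode:

```typescript
// convex/messages.ts -- a minimal sketch of a Convex backend for a chat-style app.
import { query, mutation } from "./_generated/server";
import { v } from "convex/values";

// Reactive read: any UI subscribed to this query re-renders when the data changes.
export const list = query({
  args: { channel: v.string() },
  handler: async (ctx, { channel }) => {
    return await ctx.db
      .query("messages")
      .filter((q) => q.eq(q.field("channel"), channel))
      .collect();
  },
});

// Transactional write: the mutation runs atomically inside the database.
export const send = mutation({
  args: { channel: v.string(), body: v.string() },
  handler: async (ctx, { channel, body }) => {
    await ctx.db.insert("messages", { channel, body, sentAt: Date.now() });
  },
});
```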

Is that fair? Yeah, exactly. I mean, JavaScript is one of the most popular languages in the world and it's so intuitive. And why can't everything be in it? Sujay also was, you know, one of the primary developers of AI Town. So actually AI Town was started by Yoko, who works on our team. And then I kind of, you know, pitched in for a little bit. But then you and Ian did a lot of the work on that. So for all of you that use AI Town, Sujay is kind of the unsung hero of the backend for that. Yeah.

Now, you wrote a blog post recently on benchmarking AI-generated code. And by the way, for those listening, if you haven't read it, I strongly recommend you read it. It's given me actually kind of the most insightful intuitions I've had since the beginning of this whole AI code generation discourse. But I think, you know, if you're open to it, it'd be great to kind of go through your key insights while going through it. And then we'll try and dig into the details. Yeah, totally. I think the kind of core insight is that

Being able to fully build full stack apps, it's still not a slam dunk for autonomous AI agents today. It's sometimes within their capabilities, sometimes it's not. And what's really interesting is that since it's on the edge of what's possible, there are things that we can do to make it work and things that we can do that make it definitely work.

So, what we observed is that there's a few factors that really impact autonomous coding performance. One of them is just having really strong guardrails. We saw that systems that can give really quick feedback and have very strong boundaries on what's correct versus not

help guide these models towards being able to achieve a lot more. We also observed that models are really good at writing code. You know, you mentioned SQL before. They're not so good at doing stuff like

evaluating RLS rules or reasoning about why a SQL query does or doesn't work. Code is the thing that models are surprisingly good at, right? And then the other thing we, and we'll dive into this, I think having really good library choice and choosing good abstractions is a really important task for getting good performance on AI coding. You have the set of things that you want the model to do, and that's how we describe our tasks.

But it's equally as important to pick what you don't want the model to do. And so if there's some really hard problem or something that just exists already as a good library, don't let the model reinvent it from first principles because it'll likely mess it up. You know, as part of the blog post, you actually introduced a new benchmark. So it'd be great to hear your maybe spicy take on why a new benchmark is needed because there are so many benchmarks already that are in use today. Yeah.

So we called it Full Stack Bench. And so the benchmark is...

If an AI agent has a front-end app that implements something like we start with a chat app, to-do app, like a file manager app, it's kind of like common full-stack app patterns. If it starts with just the front-end and then as part of the test, we choose which back-end it wants to use, can the model fill out the rest of the picture? Can it take all the things that were done on the front-end and map them into how it sets up the database, how it sets up APIs, how it sets up subscriptions? And

And making these types of benchmarks and running them is a lot of work. So to your point, why create a new one? And frankly, it's because we saw a gap, right? Like when we started getting really interested in this, when for Convex, we had a bunch of our users write into us and say that Convex is like a really good target for AI coding tools. So we have some customers that are themselves like AI codegen startups, and then they make apps for their users that use Convex for the backend.

They said like some things work super well on Convex, some things like really don't work well, and some things are kind of in the middle. So to be able to understand that a little bit more rigorously, that's when we wanted to have this like experimental framework. So...

We took a look around and we tried to see for these tasks that people are doing, are there good benchmarks out there? And the answer was no. And I think this is the type of thing that's surprising. There's publicly available ones like SWE-bench. Recently, there's like SWE-Lancer from OpenAI.

And these are kind of much more narrowly scoped to things that are easy to create data sets for, like just take a public GitHub repository, look at all of the commits, look at the GitHub issue that led to it, assemble those without a lot of supervision. And then there's kind of

dinky, very narrowly focused problems, right? Like your Codeforces, like competition coding. But for the tasks that people are actually doing, like if someone has Cursor and they're using agent mode in there to build something, there aren't really great publicly available benchmarks. And this is an interesting theme in itself is that I think we've seen that a lot of companies consider

high-quality evals to be their secret sauce and aren't putting them out publicly. But for us, we saw that and then decided we would make our own. And then we also made it public. I'd love your thoughts on the utility of these benchmarks for just an average schmo like me. I'll just give you a quick anecdote. So

You know, Claude 3.7 dropped, which is amazing. And I do a lot of just lay development at home just for fun, right? I mean, I'm a super casual developer. It's not my day job. It's something I kind of do to relax. And so I've been using Cursor and Claude 3.5, 3.7 dropped. I used 3.7 and I found I got into a lot of trouble because it was actually maybe...

too clever by half, like it would just kind of do all of these edits. And then, you know, I ended up having to unwind them. And so I went back to 3.5 for now. And, you know, I'm sure I'll figure out 3.7 over time. But so when you think about these benchmarks, do you think that their value extends to the independent developer? Is it more for systems builders or is it more for the end? Like, how do you think of the utility?

broadly? I think if you are building a product where that product then uses these models, so maybe not so much as if you're a developer just using Cursor or using ChatGPT, your tasks are like the things that you directly care about, right? So it's, you know, the need to experimentally run many conditions is not there as much. But I think, you know, so many people are now building apps where their apps are calling out to these models,

and then they want to provide a good user experience, right, for their end users. There's a really good blog post, "Your AI app needs evals."

And I think this is one of the things that's just chronically underappreciated is that I think people just assume you can slap together some random stuff, and then it's going to turn into a good product. But as a web development noob back in the day, you know, like learning CSS, that it's like the thing where you like tweak one thing and then something else totally breaks. And I think going through the rigor of writing, like specifically, what problems do you want the AI agent to solve? What do solutions look like?

how are those solutions graded and compiling a suite of those is one just a really good elicitation of what you're actually trying to do in your product. But then say, for example, if 3.7 comes out, you can then automatically see how does 3.7 perform on let's say 100 different things that your product needs it to do, see that it improved on some, regressed on others,

And then if you are tweaking your prompts to make it work better on 3.7, you can then also very easily verify that you're not regressing 3.5, right?
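
As a rough illustration of that workflow, here is a hedged sketch of comparing one eval suite across two model versions; `runEval`, the task IDs, and the model names are placeholder assumptions rather than anything from the benchmark itself:

```typescript
// Hypothetical sketch: run the same eval suite against two model versions and
// report which tasks improved or regressed. The caller supplies `runEval`,
// which grades one task against one model and returns pass/fail.
type RunEval = (taskId: string, model: string) => Promise<boolean>;

export async function compareModels(
  taskIds: string[],
  oldModel: string,
  newModel: string,
  runEval: RunEval,
) {
  const report = { improved: [] as string[], regressed: [] as string[] };
  for (const task of taskIds) {
    const [before, after] = await Promise.all([
      runEval(task, oldModel),
      runEval(task, newModel),
    ]);
    if (before && !after) report.regressed.push(task);
    if (!before && after) report.improved.push(task);
  }
  return report;
}

// e.g. compareModels(taskIds, "claude-3.5-sonnet", "claude-3.7-sonnet", runEval)
```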

Just to see that I understand here. So the benchmark gives you general kind of understanding of what different platforms do better or worse, but like it doesn't help you on your independent job. For that, evals are better and probably underappreciated, but you need to get pretty comfortable with how you do evals on these things, which has definitely been my experience. Just for the listeners on this podcast, I mean, I think everybody understands at a high level what evals are, but not really how they're

applied to this kind of modern AI. So maybe it's worth just talking very quickly about like how you view evals, like what they are, how they fit in this process. Yeah, totally. So it's, you know, and I actually found it very fun. This is my first time kind of writing a big set of evals. And, you know, in this case, like first kind of coming up with what is the task that we're trying to do, right? So in this case, say writing Convex code, it could be that given a kind of like a quiz question, given a prompt,

Say, write the backend for a full stack app that needs to be able to list messages in a channel or post a new message to a channel. So the first part of the eval is these like task descriptions. What are you feeding into the model?

And then given that, the model then will give you some output. And some of the eval design is making sure that you tell the model to format its output in a particular way or whatever. And then in the framework, we parse that output and we compare it to a reference solution and then we grade it. That sounds really hard. Like, how do you...
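
Structurally, a single eval run in a setup like this reduces to a small pipeline. The sketch below is hypothetical (the model call, parser, and grader are assumed hooks, not real APIs), but it shows the shape: task description in, output parsed, then graded against a reference:

```typescript
// Hypothetical shape of a single eval run.
type EvalTask = { id: string; prompt: string; referenceSolution: string };

type Harness = {
  callModel: (prompt: string) => Promise<string>;                   // LLM call
  parseOutput: (raw: string) => string | null;                      // e.g. extract the formatted code block
  grade: (candidate: string, reference: string) => Promise<number>; // score in [0, 1]
};

export async function runOneEval(task: EvalTask, harness: Harness): Promise<number> {
  const raw = await harness.callModel(task.prompt);
  const candidate = harness.parseOutput(raw);
  if (candidate === null) return 0; // the model ignored the requested output format
  return harness.grade(candidate, task.referenceSolution);
}
```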

Is it like syntax? Is it semantics? Is it correct? How do you even think about comparing these two things? Yeah. And so sometimes also there isn't a gold standard solution, right? Right. Yeah. So if you just get this random output, how do you... Yeah. How do you compare it with... Yeah. Right. Yeah. And I think this is a big part of the experimental design, right? Because it could be that like, you know, say, for example, if we have a task that's just like implement the backend for a chat app.

then the output for that could look like anything, right? And it could legitimately look like anything. I've seen this kind of general trade-off between generality of the task description

and gradability. So, you know, instead of just saying, write me the backend for a chat app, if you say, write the backend for a chat app where it has a GET request at slash list messages, that takes in this, you know, JSON object and returns this in this sorted order, it might be more descriptive than you'd ideally like if you're just prompting some random agent, but it makes it gradable. You can grade it now, yeah. Yeah. A different eval set for us is that we

have the reference solution, and we have the model-generated solution. And then we actually spin up Convex backends for both, push the code. Oh, interesting. And then we have some human written unit tests that compare the results across both. But then we also have some like backend introspection APIs to say like, what are the schemas for both? Those should 100% match. What is the API description with all the types? And like those should 100% match.
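
A hedged sketch of that grading step follows; the deploy, introspection, and test helpers here are assumed stand-ins, not Convex APIs:

```typescript
// Hypothetical grading step: deploy both solutions, require the schemas and
// typed API descriptions to match exactly, and run the shared human-written
// tests against both backends.
type Backend = { url: string };

type Grader = {
  deploy: (code: string) => Promise<Backend>;
  fetchSchema: (b: Backend) => Promise<string>;   // serialized table schemas
  fetchApiSpec: (b: Backend) => Promise<string>;  // function names plus argument/return types
  runUnitTests: (b: Backend) => Promise<boolean>; // same test suite against either backend
};

export async function gradeFullStackSolution(
  referenceCode: string,
  generatedCode: string,
  g: Grader,
) {
  const [ref, gen] = await Promise.all([g.deploy(referenceCode), g.deploy(generatedCode)]);
  const schemasMatch = (await g.fetchSchema(ref)) === (await g.fetchSchema(gen));
  const apisMatch = (await g.fetchApiSpec(ref)) === (await g.fetchApiSpec(gen));
  const testsPass = (await g.runUnitTests(ref)) && (await g.runUnitTests(gen));
  return { schemasMatch, apisMatch, testsPass, passed: schemasMatch && apisMatch && testsPass };
}
```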

So being able to kind of do that more automatically and be more gradable is another piece of our kind of experimental design. Are there things that the agents aren't super good at that, you know, you had those specific tests or is it, you know, just most gradable things are the same? Yeah, there's a kind of very interesting, at least in our experience, relationship there between different...

difficulty and variance, you know, like we've seen that when... And by variance, you mean variance between models or within the same model? Even within the same model. Like consistency in its ability to come up with a result. Given the same input. Sorry, I don't mean to interrupt, but given the same input. Really? Yeah, totally. Because when we have these coding agents that are spending an hour in one of these... Yeah, it's just a lot of context and state and...

Yeah. So for our most difficult task, we have it implement this files app where there's different project namespaces. There's this hierarchical authorization model where groups can be nested in each other, but not in cycles. And groups can be attached to folders. All this stuff is designed to be pretty complicated and complex.

In addition to that, there's also a lot of code, right? It starts with 4,000 lines of code for the front end and then adds a few thousand lines as it implements the back end. I've seen just a lot of variance on sometimes it'll manage to keep the right things in context as the Cursor agent kind of navigates through implementing it. I don't think it's public what their context management looks like, but I think they'll just evict a very important piece of the input

prompt or some file they pulled in along the way. And then it might not ever find that again until a human has to get involved saying, hey, you're just totally stuck. So you find for the more complex problems, the variance goes up.

And then, so how do you think about reining in that variance when it comes to, you know, let's say I'm a developer and like I actually care about these. I mean, is there a solution? Yeah. And this was like one of the, I think, big takeaways from the Full Stack Bench work is that having very strong guardrails and guardrails that can be applied quickly. What do you mean by guardrails, by the way? I mean, I guess we all intuitively understand what you mean by guardrails, but specifically, is it like type safety? Is it good semantics?

Yeah, I think type safety is just like one of the best examples, right? Because it's, you know, especially with TypeScript, like there's so many invariants. Unlike JavaScript, where you have none at all. Yeah. I mean, it's also funny too. I remember when TypeScript first came out, like all the PL nerds were hating on it, right? Myself.

Right, it's not sound, it's just like trying to embed every single pattern and variant from the JavaScript ecosystem into types. And then after a while, I feel like there's a sense of admiration: if you just give up on your principles, if you just give up decidability entirely, you go through the five stages of grief and you kind of end up at acceptance. Yeah. Taking away anything from this?

I think changing one's task and changing the tools that an agent uses to pick things that are more type safe, where type safety can be, like, Cursor's agent will, as soon as a model generates code, use the type information from the language server to feed it right back into its context. So it'll just immediately fix stuff without any real iteration loops. So I think that's one of the big takeaways of that. If you want to decrease that variance as it's exploring, kind of

having type safety can keep it on the straight and narrow, you know? What about like runtime guardrails, like languages that are mostly referentially transparent or, you know, like things beyond type safety? Do these matter or...

Yeah, it's a good question. I mean, I think the ease of writing tests, I mean, this is another thing that models seem to be very good at. So if the kind of semantics of the language generally encourage things, patterns that are more testable, that's another way of managing the model's trajectory, right? As it solves or the agent's trajectory. Yeah.
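
As a rough sketch of what that kind of check-and-feed-back loop can look like, independent of any particular tool; `askModelToFix` and the two commands are assumptions, chosen only to illustrate quick, strong guardrails:

```typescript
// Hypothetical guardrail loop: after each round of generation, run the type
// checker and the tests, and feed any failure output straight back to the model.
import { execSync } from "node:child_process";

function check(cmd: string): string | null {
  try {
    execSync(cmd, { stdio: "pipe" });
    return null; // guardrail passed
  } catch (err) {
    const e = err as { stdout?: Buffer; message: string };
    return e.stdout?.toString() ?? e.message; // compiler or test output as feedback
  }
}

export async function guardrailLoop(
  askModelToFix: (feedback: string) => Promise<void>, // assumed: edits files given feedback
  maxIters = 5,
): Promise<boolean> {
  for (let i = 0; i < maxIters; i++) {
    const feedback = check("npx tsc --noEmit") ?? check("npm test --silent");
    if (feedback === null) return true; // types and tests are green: a good place to commit
    await askModelToFix(feedback);      // otherwise, put the errors back in context
  }
  return false; // still failing after maxIters: hand it back to a human
}
```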

And I mean, I think my initial reaction there, though, is I think that shows up more for reasoning than the actual authoring of the code. Like, you know, I think this is one of the things we observed with the full stack bench experiments was that a lot of the time would be that, you know, the model would think it's done and then we would start to grade it. And then we'd say like, hey, this isn't working. Here's a screenshot of the error in the developer tools.

And then being able to go from that error and debug it and narrow it down to what the problem was relies on a lot of reasoning, right? And, you know, we saw the model just pretty consistently get stuck on things like React hook rules, right? Yeah, sure. Yeah.

hard to reason about. You know, stuff like RLS rules in SQL, where there's not a very clear procedural execution semantics. Those are the things that I think show up for reasoning and places where it's easy for the model to just blow off the happy path and get totally stuck. So maybe, are you comfortable talking about your experience with the different types of models?

We've seen because we have another set of benchmarks which test model knowledge. I think this is one of those really interesting problems for dev tools right now. And, you know, a bunch of people have talked about it where like with knowledge cut off, but also just like the amount of data and pre-training that is anchored on just existing systems. It's very hard to build new abstractions, right? Even in context?

And so that's the thing. I think that's what ends up saving us. But without any prompt engineering, right, the default for what these models are good at is not going to evolve over time. And it's going to be close to what feels like kind of a uniform sample of what's out there on the internet. And one of the first things we did was try to understand this phenomenon for different models.

And then, like you're saying, patch it up in context. And that's, I think, the saving grace here is that these models do also seem to be very steerable. Their ability to learn stuff from just, you know, very small numbers of examples and guidelines and context has been very cool. You know, I mean, this is a thing where we see with Claude 3.5 versus 3.7, we notice that when it comes to at least knowledge about...

Convex, it mostly stayed the same. It was a little bit worse. And that's probably just because our prompts were tuned for 3.5. But we didn't see a significant improvement. Well, this includes knowledge, but also any sort of like in context learning is the same. The artifact of this test is what we tell people to put in their cursor rules.

Oh, I see. So a new model comes out, we see what it knows and what it doesn't know. And then we fold that insight back into the cursor rules and then see if it improves it. And yeah, it's been very interesting, you know, for example, with using Gemini, like the 2.0 Flash versus...

I think we run it in CI with o3-mini. We've done it with o1, just don't want to pay for it. Yeah. Right. And talking about not paying for it, tried it with GPT-4.5 recently. I was like, I'm just going to do one run, never touch it again. How much was it? How much did it cost to do it with 4.5? Well, we like, I mean, it's, I think 600,000 input tokens. And I think roughly...

I have to go double check. But I think, you know, same rough order of magnitude for output tokens. Well, it's not that much. No, but still, like, what was it, like $50, I think? Yeah, okay. Yeah, yeah. Okay, that's fine. Yeah, yeah.

Yeah. Could have done a lot with that $50, Martin. Yeah. No, I've been actually looking at, you know, like the pricing of these models is pretty remarkable, right? If you do like whatever GPT-4o, it's $2.50. And if you do mini, it's $0.47 or something like this. And so, you know, and do you find there's a big trade-off between the more expensive models and the cheaper models? If I use the cheaper models, to me, they're actually materially worse than the more expensive models. And so I think...

And I haven't found out how to bridge that gap, unfortunately. And as a hobbyist, maybe this is just like my problem. I don't want to spend a lot of money on my stupid little programs I work on. So I always want to use the smaller models, but I just don't find them to be as good. And so maybe a specific question, A, do you find this dichotomy is real? And the second one, is there anything that I can do to use the cheaper models or am I screwed?

This is like for 4o versus 4o mini. It's exactly that. Yeah, we've had very poor results with that. With the small models. Yeah. I know it's interesting to look at OpenAI's docs, and we haven't tried this, but OpenAI's docs for their regular fine-tuning API...

are actually pretty discouraging of using it for improving model performance. If you're using 4o and 4o isn't cutting it, they like, you know, like reading between the lines a bit, they like don't want to set up expectations that it'll get a lot better if you provide a lot of examples for fine-tuning. What they do say is that if you have a very specific task and 4o is good enough,

you can then fine-tune 4o mini with your labeled data and then get it up to that level of performance at a much cheaper price. So, but I think it's like the hobbyist. Yeah.
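
For reference, the flow those docs describe looks roughly like this with the OpenAI Node SDK; this is a sketch assuming an OPENAI_API_KEY in the environment and a prepared JSONL of labeled chat examples, and the snapshot name below is only an example:

```typescript
// Sketch of distilling labeled 4o outputs into a fine-tuned 4o mini.
import fs from "node:fs";
import OpenAI from "openai";

const client = new OpenAI();

async function fineTuneMini() {
  // 1. Upload the labeled training examples (one {"messages": [...]} object per line).
  const file = await client.files.create({
    file: fs.createReadStream("train.jsonl"),
    purpose: "fine-tune",
  });

  // 2. Start a fine-tuning job against the cheaper base model.
  const job = await client.fineTuning.jobs.create({
    training_file: file.id,
    model: "gpt-4o-mini-2024-07-18", // example snapshot name
  });

  console.log("Fine-tuning job started:", job.id);
}

fineTuneMini().catch(console.error);
```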

That sounds like too much of a pain. The Gemini stuff is pretty good, right? And that's what we see in our benchmarks and it's cheap. For price performance, I found Gemini to be the best actually. But a lot of it is like the actual price component, which I mean, I'm guessing from Google standpoint, this is one area they can flex because they can vertically integrate, right? I mean, all the way down to the hardware. So that's probably what they're doing is just they can just offer it cheaper because of that. Yeah.

Okay, so I just actually want to go through kind of some of the high level talking points because this has been great. So the first one is, you know, there's a gap in benchmarks when it comes to, you know, like this kind of big back end. I mean, how would you characterize it? Back end or full stack, like full stack type workloads? Stateful? Is it stateful and transactional? Or is that the primary thing?

Yeah, I think it's, you know, integrating all of these different systems. You know, I think there's already a ton of work out there for front end stuff and being able to work across multiple systems. So going from the front end to the API layer to the database, being able to make intelligent decisions across all of them. Those are all necessary components to actually build a real app. And it's...

I mean, even looking at stuff like SWE-bench, right? It's often just very focused on something kind of narrow. So big multi-component systems is what this benchmark is good at. And it does deal with heavily stateful stuff, heavily transactional stuff, like real stuff. Yeah. Two, you found quite a bit of variance, even within the same model. And then the way to reduce that variance was guardrails. One that we've called out particularly was just type safety is actually pretty meaningful. Yeah.

Yep. Three, from the perspective of an independent developer, benchmarks are great, but really you have to rely heavily on evals. And that's kind of where the secret is. I mean, it's very interesting to me because there's so few evals that are public right now. So, I mean, and I don't think most people know how to do a good eval. So, I mean, do you have any thoughts on the state of evals? Being in the position where we were trying to get started, right? And figuring it out. I think it is like, you know, I think folks have,

probably correctly identified that evals are really valuable. And that, you know, I remember the Vercel CTO talked about this with V0, where, you know, they, I think, started by trying to protect their system prompt really carefully. And then eventually were like, that's actually not that valuable. We will let people get our system prompt. It's the evals that actually let you build a real app.

And I think that gap between an organization that has a great eval set for their domain versus one that doesn't is just really big. And then folks have identified that and then not shared them, right? And I kind of hope for more of an open-source type of moment, right? Where even if an app is open source in the code or the system prompt,

That's not the hard part for development in a lot of ways. And I think it'd be very cool for more than just like benchmarking models against each other. But if there are, yeah, sets of evals for particular applications getting built that can be put out there and kind of worked on together. Yeah. So I have to ask this because you're such a strong systems thinker and it's relevant to evals, which is there any hope for incremental evolution in

In this space, my experience is you have a new model, everything changes, right? And I was like, I mean, I haven't found the media all that useful for that. You have to end up rewriting a lot of your evals. So much of even evals is just dealing with the same

codebase and the same model. And even if I increment the minor version number of a model, everything changes. And so do you think that we've reined in this problem of like, how do you do real software engineering with a model that's generating code? The change to Claude 3.7 is such a perfect example, that just changes so much stuff. I guess the question is, do you think this is manageable or have you found any tricks or is it still Wild West? This is...

Just my speculation. Unprompted hot take. Yeah, please. Yeah. I think some of this is an artifact of the way that models are currently developed, right? And they're trained and that we have these huge companies that are, you know, fighting over being premier foundation models and what goes into their pre-training. But then I think more importantly, what happens during post-training is just a complete black box to everyone, right? I see. Yeah.

There's a lot of benchmark hacking for sure. One thing I noticed about Claude 3.7 to this point, it's really good at Lovable, Bolt-type stuff. So I'm like, make a pop-up box in HTML. It's like magic and amazing. And so the areas that they're focusing on are clearly very specific. They just...

you know, it feels almost vanity-ish as opposed to like solving the core problem. So if that's any sort of, you know, anecdotal evidence to your theory. Totally. The problem you were talking about specifically, right, where if you kind of have some coding workflows and you have patterns that are built up on 3.5, I think, you know, who knows if this will be in five years or 10 years at some point down the line, I can imagine that that type of technology

would be codified into some type of benchmark. And then, you know, easy for them to run in their reinforcement learning loops and easy for them to train against. And then that could be something that, because, you know, we've seen how steerable all of these models are with all the different types of

tuning and prompting. And I, you know, I do believe that if things are a little bit more open, that there could be some more consistency in this way. And that we wouldn't all be scrambling when a new model comes out to run it on our own benchmarks and be like, Oh, this worked, or this didn't work, or here's what I have to change in my prompts. I don't know if you've put up a leaderboard for the models themselves with, you know, the Full Stack Bench

that you're working on. But I think that would be great just because in my experience, it's the most useful benchmark if you're building a real system. Like it's just not a vanity thing. Hey, have you done that? I should know this. And then if not, do you plan on doing that just so that the model creators consider this benchmark?

Yeah, not yet. We've mostly focused on just Cursor agent with Claude. And so that's been with Claude 3.5, Claude 3.7. I tried it with the OpenAI models using Cursor agent, but I, you know, I get this like feeling that they're not really optimized for the OpenAI models.

And so I've been trying to think of, you know, maybe using Aider and then using that with all the different models is kind of like a more fair comparison. But, you know, it's one of those things where you spend all this time designing something that's trying to be as fair as possible. But it might be interesting just to try Cursor with them and then have that be on the leaderboard. All right. So listen, great work on Full Stack Bench. I really enjoyed this conversation. Any predictions on where this stuff goes or anything else that you want to comment?

I think for users of these tools, I think like the takeaway for me is that these tools, if we're trying to solve very complicated tasks, they're, you know, it's still kind of on the boundary of what's possible and not to have them plot out these trajectories, these plans of how do they do complicated coding tasks step by step and how do they know they're making progress and how they make sure they don't get into loops.

So I think maybe the takeaway for me in a very broad sense here is we can work a lot on writing better task prompts. You know, we can work better when we open up a new chat view in Cursor, like what do I type in there and how do I get better at that? But I think there's a whole other space of options

for saying, how do we change the tools we use? How do we change what libraries we use? What frameworks we use to have some of these properties of having better type safety and guardrails? And that is, you know, kind of secretly a part of the prompt. And I think spending some time optimizing that and thinking about how the model will chart its way through the course of getting from the starting point to a solution can lead to a very large amount of improvements. Amazing. Okay, I have to ask. So do you use...

AI for coding yourself? I mean, I think you're probably the best developer I know. I know a lot of developers. So do you use AI for coding? 100%. Does it make you faster? And if so, how much?

I think it's hard to quantify, right? Because I think it makes me, you know, maybe two times faster. The kind of idea of there being just so much pent-up demand for coding in the world. And, you know, I do all types of things I would never do otherwise. I write little custom debuggers for when I'm working on a tool and visualize data. And it's honestly delightful. It's been really fun. You've spent a lot of time now working with these models and how they generate code.

And, you know, what sort of advice or mental model would you give to somebody that's using them on like how to approach maybe the most effectively or safely prompting or kind of what sort of workflow or how to think about it when using these models for code? Yeah. I mean, I think there's like a lot of analogies here to even just like working with humans, right? I mean, I think there's like,

when it comes to breaking things up into steps and, you know, like working with some junior engineers, right? It's, first, work on getting all of your interfaces written out and getting it so everything type checks and then git commit. Make sure that once you get to a commanding position, you will never be pushed off of it. Right, right, right. Get it working first, get it the basics, get it committed so like if it

screws it up, you can kind of go back to that one. And then from there... Totally. And then, you know, if you have a small little extension, right, you have a place you've committed, you know it works, try something small, design it to be evaluatable. Can you know whether that works or not? If it doesn't work, right?

revert. Those are all types of things that I think as humans, we have to learn, right, as programmers, but I think we're pretty intuitively good at. And I don't think models are amazing at it yet. And maybe they will in the future, but we haven't seen it. Awesome. Thank you so much, Sujay. Thanks, Martin.

There you have it. Another episode in the books. We hope you enjoyed it. And if you did, please do rate and review the podcast and share it far and wide with your friends and colleagues. And as always, keep listening for more exciting episodes.