
Benchmarking AI Agents on Full-Stack Coding

2025/3/28

AI + a16z


People

Martin Casado

General Partner, focused on AI investing and on advancing the industry.

Sujay Jayakar

Topics

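Martin Casado: I think trajectory management in many AI code-generation tools still needs work. Writing hard code is like playing a game, and it takes good heuristics to guide the process. Today's AI agents still struggle to build complete full-stack applications, which calls for better trajectory management and better heuristics. I recently ran into problems with Claude 3.7: it was too "clever," which made its code changes hard to roll back. In practice, that makes choosing the right model version important, and you have to weigh a model's "cleverness" against the maintainability of the code it produces.

Benchmarks help in understanding the strengths and weaknesses of different platforms, but evals are more practical for independent developers, and writing them takes some skill. For developers building applications on top of AI models, both matter. The need to evaluate AI applications is often underestimated: a good eval helps developers understand and improve their product. Writing an eval forces a developer to pin down the product goal, the solution, and the evaluation criteria, and then to verify improvements by testing.

When AI models are used for code generation, a model update can force evals to be rewritten, which is a challenge for software engineering. To some extent this follows from how models are built, since model development and training lack transparency. The direction in which large language models improve may not always target the core problems. As models improve and the technology matures, the impact of model updates on evals may gradually shrink. I write code with AI often, and it significantly improves my productivity.

Sujay Jayakar: Building a complete full-stack application is still not easy for today's autonomous AI agents. Several factors affect performance: strong guardrails, the model's coding ability, and a good choice of libraries and abstractions. Strong guardrails, such as fast feedback and a clear boundary between right and wrong, can significantly improve autonomous coding performance. AI models are good at writing code but not at evaluating RLS rules or explaining how a SQL query runs. Choosing the right libraries and abstractions is critical: be explicit about what the model needs to do and what it does not.

Fullstack-Bench evaluates whether an AI agent can complete full-stack application tasks from frontend to backend, including the database, APIs, and subscriptions. We created it because existing benchmarks do not adequately measure agent performance on real full-stack development. It focuses on large, multi-component systems that integrate a frontend, an API layer, and a database.

On complex problems, AI agents can become inconsistent because of context-management issues. Guardrails such as type safety are an effective way to reduce that inconsistency and to keep an agent from drifting while it explores solutions. Runtime guardrails, such as a language that is easy to test, also help manage an agent's trajectory. Even a single model shows large variance from run to run, and guardrails like type safety can bring that variance down.

AI models are weaker at debugging and reasoning than at writing code, especially with the rules of React Hooks or RLS rules in SQL. A model's knowledge cutoff and pretraining data limit its ability to build on new abstractions. In-context learning can compensate, but both knowledge and in-context learning ability vary between models. Cheaper models perform worse than more expensive ones. Fine-tuning can improve a cheaper model, but it may be too complex for hobbyists. Gemini models stand out on price-performance.

High-quality evals are critical for AI application development, yet few are public, and many companies treat them as trade secrets. Sharing high-quality eval sets publicly would promote collaboration and progress across the field. When using AI tools on complex tasks, watch the model's trajectory planning and progress management, and keep it from getting stuck in loops. Improving AI code-generation performance means paying attention to task prompts, tool choice, and framework choice. Coding with AI is like working with a human engineer: break the task into steps, and make sure each step ends in a committable state so you can roll back when something goes wrong.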
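The points above about evals and run-to-run variance can be made concrete with a small harness. What follows is only a sketch under stated assumptions, not how Fullstack-Bench is implemented: `EvalCase`, `EvalResult`, `runEval`, and the injected `generate` function are hypothetical names, and `generate` is synchronous here for brevity, whereas a real model or agent call would be asynchronous.

```typescript
// Hypothetical types: none of these names come from Fullstack-Bench itself.
type Grader = (output: string) => boolean;

interface EvalCase {
  name: string;
  prompt: string;
  grade: Grader; // returns true when the output is acceptable
}

interface EvalResult {
  name: string;
  runs: number;
  passRate: number; // fraction of runs that passed
  variance: number; // Bernoulli variance p * (1 - p) of the pass/fail outcome
}

// `generate` stands in for a model or agent call. It is injected so the
// harness can be exercised with a deterministic stub.
function runEval(
  cases: EvalCase[],
  generate: (prompt: string, run: number) => string,
  runsPerCase: number,
): EvalResult[] {
  return cases.map((c) => {
    let passes = 0;
    for (let run = 0; run < runsPerCase; run++) {
      if (c.grade(generate(c.prompt, run))) passes++;
    }
    const passRate = passes / runsPerCase;
    return {
      name: c.name,
      runs: runsPerCase,
      passRate,
      variance: passRate * (1 - passRate),
    };
  });
}
```

Repeating each case several times is the point of the design: a single run hides the variance that the summary describes, while a pass rate per case makes it visible and comparable across model versions.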

Deep Dive

Chapters

The episode starts by discussing the challenges of trajectory management in AI coding, drawing parallels to AlphaGo and heuristic development in game playing. It highlights the difficulty of finding efficient paths to solutions and the need for robust heuristics.

Trajectory management in AI coding is underdeveloped.
Coding is like playing a game with a starting and ending position, but few clear paths between them.
Good heuristics are crucial but hard to develop for AI agents.
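The game analogy in this chapter can be illustrated with a toy search problem. The sketch below is an assumption-laden illustration, not anything an AI coding agent actually runs: the state space (integers), the moves (`n + 1` and `n * 2`), the heuristic (distance to the goal), and the name `bestFirstPath` are all invented for the example. It shows greedy best-first search, where a heuristic decides which partial path to extend next.

```typescript
// Greedy best-first search over a toy state space (integers).
// Moves: n + 1 or n * 2. Heuristic: absolute distance to the goal.
function bestFirstPath(start: number, goal: number, maxSteps = 10_000): number[] | null {
  const heuristic = (n: number): number => Math.abs(goal - n);
  const frontier: { state: number; path: number[] }[] = [{ state: start, path: [start] }];
  const visited = new Set<number>([start]);

  for (let step = 0; step < maxSteps && frontier.length > 0; step++) {
    // Expand the most promising state first (lowest heuristic value).
    frontier.sort((a, b) => heuristic(a.state) - heuristic(b.state));
    const current = frontier.shift()!;
    if (current.state === goal) return current.path;

    for (const next of [current.state + 1, current.state * 2]) {
      // Skip revisits (this avoids loops) and states that overshoot the goal.
      if (visited.has(next) || next > goal) continue;
      visited.add(next);
      frontier.push({ state: next, path: [...current.path, next] });
    }
  }
  return null; // step budget exhausted or no reachable path
}

console.log(bestFirstPath(1, 10)); // [ 1, 2, 4, 8, 9, 10 ]
```

The visited set and the step budget play the role of the loop avoidance and progress tracking mentioned in the discussion. The quality of the path depends entirely on the heuristic, which is the hard part to design for real coding tasks.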

Shownotes

In this episode, a16z General Partner Martin Casado sits down with Sujay Jayakar, co-founder and Chief Scientist at Convex, to talk about his team's latest work benchmarking AI agents on full-stack coding tasks. From designing Fullstack-Bench to the quirks of agent behavior, the two dig into what's actually hard about autonomous software development, and why robust evals, and guardrails like type safety, matter more than ever. They also get tactical: which models perform best for real-world app building? How should developers think about trajectory management and variance across runs? And what changes when you treat your toolchain like part of the prompt? Whether you're a hobbyist developer or building the next generation of AI-powered devtools, Sujay's systems-level insights are not to be missed.

Drawing from Sujay's work developing Fullstack-Bench, they cover:

Why full-stack coding is still a frontier task for autonomous agents
How type safety and other “guardrails” can significantly reduce variance and failure
What makes a good eval—and why evals might matter more than clever prompts
How different models perform on real-world app-building tasks (and what to watch out for)
Why your toolchain might be the most underrated part of the prompt
And what all of this means for devs—from hobbyists to infra teams building with AI in the loop
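To make the type-safety guardrail from the list above concrete, here is a minimal sketch. The request shapes and the names `ApiRequest`, `assertNever`, and `describeRequest` are hypothetical and are not taken from the benchmark. The idea is that a discriminated union plus an exhaustiveness check turns a forgotten case into a compile error, which is the kind of fast, unambiguous feedback the episode credits with reducing failures.

```typescript
// Hypothetical request shapes for a small chat app; not taken from the benchmark.
type ApiRequest =
  | { kind: "listMessages"; channelId: string }
  | { kind: "sendMessage"; channelId: string; body: string };

// Accepts only values the compiler has proven impossible.
function assertNever(value: never): never {
  throw new Error(`Unhandled request: ${JSON.stringify(value)}`);
}

function describeRequest(req: ApiRequest): string {
  switch (req.kind) {
    case "listMessages":
      return `list ${req.channelId}`;
    case "sendMessage":
      return `send "${req.body}" to ${req.channelId}`;
    default:
      // If a new variant is added to ApiRequest and not handled above,
      // this line fails to compile instead of failing at runtime.
      return assertNever(req);
  }
}
```

For example, `describeRequest({ kind: "sendMessage", channelId: "general", body: "hi" })` returns `send "hi" to general`, while passing an object with an unknown `kind` is rejected by the type checker before any code runs.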

Learn More:

Introducing Fullstack-Bench

Follow everyone on X:

Sujay Jayakar

Martin Casado

Check out everything a16z is doing with artificial intelligence here, including articles, projects, and more podcasts.

Duration: 33:28
