
Code Smarter, Not Harder

2024/5/22

Greymatter

People
Corinne Riley
Topics
Corinne Riley: I believe AI has enormous potential in code generation and engineering workflows; it can augment or even fully automate engineering work, opening up a market larger than anything we've seen before. Today we're seeing three main approaches to AI coding tools: AI co-pilots, AI agents, and code-specific models. Each approach has its strengths, but each also faces technical challenges, such as how to extract relevant context, how to get AI agents to complete end-to-end coding tasks more effectively, and whether owning a code-specific model delivers long-term differentiation. I believe solving these challenges will unlock reliable code generation tools with low latency and a good user experience, bringing developers greater productivity and innovation.


Chapters
This chapter explores the current landscape of AI-powered code generation tools, focusing on three approaches: AI co-pilots enhancing existing workflows, AI agents replacing workflows entirely, and code-specific models. It highlights the challenges and opportunities within each approach, particularly the need for relevant context within a company's codebase.
  • AI augmentation of engineering tasks is feasible due to the nature of coding work, availability of training data, and testability of results.
  • Three approaches in the startup ecosystem: AI co-pilots, AI agents, and code-specific models.
  • Challenges include accessing relevant context, end-to-end task completion for AI agents, and the long-term viability of code-specific models compared to base model improvements.

Transcript


Welcome to Greymatter, the podcast from Greylock. I'm Corinne Riley, a partner here at Greylock. Today's episode is an audio version of an essay I wrote entitled Code Smarter, Not Harder: Solving the Unknowns to Developing AI Engineers. You can read this essay on Greylock.com, and it's also linked in the show notes. Unlocking high-fidelity, reliable AI for code generation and engineering workflows is a massive opportunity.

Engineering tasks are well suited for AI augmentation or replacement for a few reasons. Coding inherently requires engineers to break problems down into smaller, more manageable tasks. There's plenty of existing training data. Tasks require a mixture of judgment and rules-based work. Solutions leverage composable modules like open source libraries. And in some cases, the result of one's work can be empirically tested for correctness.

This means reliably accurate AI coding tools can deliver quantifiable value. Given these factors, attempts at AI coding tools have exploded in just the past year. But there are still many open questions about the technical unlocks that must be solved to make coding tools that work as well as or better than human engineers in a production setting.

Here, I frame the three approaches we at Greylock are seeing in the startup ecosystem and the three open questions facing these tools. First, how do you extract relevant context from these code bases? Second, how do you get AI agents to work better for end-to-end coding tasks? And lastly, does owning a code-specific model lead to long-term differentiation? The current state of the market. In the past year, we've seen startups take three approaches.

First, AI co-pilots to enhance engineering workflows by sitting alongside engineers and tools where they work. Second, AI agents that can replace engineering workflows by performing engineering tasks end-to-end. And third, code-specific models. These are bespoke models trained with code-specific data and are vertically integrated with user-facing applications.

Even before the overarching questions are answered, we believe each of the approaches above can deliver meaningful impact in the near term. Let's take a closer look at each one. One, enhancing existing workflows.

Today, the vast majority of AI code startups are taking the shape of in-IDE co-pilots or chat interfaces to enhance engineering workflows. While companies like Tabnine have been working on code assistance for many years, the big moment for coding AI tools came with the release of GitHub Copilot in 2021. Since then, we've seen a flurry of startups going after the various jobs to be done by engineers. Startups finding traction are going after workflows centered around code generation or code testing.

This is because they are core parts of an engineer's job. They can require relatively low context to be sufficiently useful. In most cases, they can be bundled within a single platform. And lastly, in a world where reliability is scarce, putting outputs in front of the user, i.e. in an IDE, allows them to take ownership of any corrections required.

The elephant in the room is the challenge of going after GitHub Copilot, which already has considerable distribution and mindshare. Startups are thus working around this by looking for pockets of differentiation in which to ground their wedge. For example, Codeium, that's C-O-D-E-I-U-M, is taking an enterprise-first approach.

And the closely named Codium, that's C-O-D-I-U-M, is starting with code testing and reviewing and then expanding from there. We also believe there's a strong opportunity for tools going after tasks like code refactoring, code review, and software architecting.

These can be more complex as they not only require a larger surface area of understanding within the code, but also an understanding of a graph of knowledge between different files, knowledge of external libraries, the business context, the end usage pattern of the software, and complex selection of tools. Regardless of the wedge, one of the recurring challenges we're seeing at this layer is to access relevant context to solve wider reaching tasks within a company's code base.

Exactly how that's done is an open question, which I'll get into later. Two, AI coding agents. If augmenting engineering workflows is valuable, an even larger opportunity is figuring out what workflows can be completely replaced. AI coding products that can perform engineering tasks end-to-end and work in the background while the human engineer does something else would create an entirely new level of productivity and innovation.

A giant leap beyond AI co-pilots, AI coding agents could take us from a realm of selling tooling to selling labor. In a world where coding agents get very good, you could have a single human supervising multiple AI engineers in parallel. The fundamental capability of an AI agent isn't just about predicting the next word in a line of code.

It needs to couple that with the ability to carry out complex tasks with upwards of dozens of steps, and, like an engineer, to think about the product from the user's perspective.

For example, if prompted to fix a bug, it needs to know its location, the nature of the problem, how it affects the product, any downstream changes that might result from fixing the bug, and much more before it can even take the first action. The context must come from something like ingesting JIRA tickets, larger chunks of the code base, and other sources of information.
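
To make that concrete, here is a minimal sketch in Python of the kind of context-assembly step described above. It's an illustration under simple assumptions, not any product's actual pipeline; the BugContext fields and the keyword-matching heuristic are hypothetical.

```python
from dataclasses import dataclass, field
from pathlib import Path


@dataclass
class BugContext:
    """What an agent might gather before attempting a bug fix."""
    ticket_id: str
    description: str
    suspect_files: list[str] = field(default_factory=list)
    snippets: dict[str, str] = field(default_factory=dict)


def gather_bug_context(ticket_id: str, description: str, repo_root: str) -> BugContext:
    """Collect candidate context: the ticket text plus any source files whose
    names or contents mention longer words from the bug description."""
    ctx = BugContext(ticket_id=ticket_id, description=description)
    keywords = {w.lower() for w in description.split() if len(w) > 4}
    for path in Path(repo_root).rglob("*.py"):
        text = path.read_text(errors="ignore")
        if any(k in path.name.lower() or k in text.lower() for k in keywords):
            ctx.suspect_files.append(str(path))
            ctx.snippets[str(path)] = text[:2000]  # truncate to fit a prompt
    return ctx
```

A real agent would pull from the issue tracker's API, the dependency graph, and recent commits rather than keyword matching, but the shape is the same: collect and rank context before taking the first action.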

Being able to write detailed code specs and do accurate code planning will become central to adopting AI engineers. Companies and projects we have seen in the space include Devin, Factory, Codegen, SWE-agent, OpenDevin, AutoCodeRover, Trunk, and more. The question then is, what needs to be done for agents to be able to complete a larger portion of tasks end-to-end?

We'll get to that in my open questions section. Three, code-specific foundation model companies. A few founders believe that in order to build long-term differentiation at the code app layer, you need to own a code-specific model that powers it. It's not an unreasonable suggestion, but it seems there are a few open questions that have steered other startups away from this capital-intensive approach.

Primarily, it's unclear whether a code-specific model will be leapfrogged by improvements at the base model layer. First, let's recall that most foundational LLMs are not trained exclusively on code, and many existing code-specific models like CodeLlama and AlphaCode are created by taking an LLM base model, giving it millions of data points of publicly available code, and fine-tuning it to programming needs.

Today, startups like Magic, Poolside, and Augment are trying to take this a step further by training their own code-specific models, by generating their own code data, and using human feedback on the coding examples. Poolside calls this reinforcement learning from code execution feedback.

The thesis is that doing this will lead to better output, reduce reliance on GPT-4 or other LLMs, and ultimately create the most durable moat possible. The core question here is whether a new team can outpace frontier model improvements.

The base model space is moving so fast that if you do try to go deep on a code-specific model, you are at risk of a better base model coming into existence and leapfrogging you before your new model is done training. Given how capital-intensive model training is, there's a lot of time and money at risk if you get this question wrong.

I know some teams are going after the very appealing approach of doing code-specific fine-tuning for specific tasks on top of base models, allowing them to benefit from the progress of base models while improving performance on code tasks. I'll explain more in the next section. Open questions.

Regardless of the approach one takes, there are a few technical challenges that need to be solved to unlock reliable code generation tools with low latency and good UX. First, how do we create more powerful context awareness? Second, how do we get AI agents to work better for end-to-end coding tasks? And third, does owning the model and model infra lead to a long-term differentiated product?


Let's examine the first question of how to create more powerful context. The crux of the context issue is the fact that certain coding tasks require pieces of information and context that live outside of the open file an engineer is working in and can't be accessed by simply increasing the context window size.

Retrieving those pieces of information from different parts of the code base and even external to it is not only challenging, but can increase latency, which is lethal in an instant autocomplete world. This poses a great opportunity for startups who are able to accurately and securely find and ingest the context necessary for a coding task.

Currently, there are two approaches to doing this: continuous fine-tuning and context-aware RAG. I'll talk about both. Continuous fine-tuning. I've heard customers tell me, "I wish a company would fine-tune their models securely on my code base."

While tuning a model to your own codebase might make sense in theory, in reality there is a catch. Once you tune the model, it becomes static unless you are doing continuous pre-training, which is costly and could have the effect of perpetuating existing biases. Without that, it might do well for a limited time, but it's not actually learning as the codebase evolves.

That said, fine-tuning is getting easier, so it's possible fine-tuning a model on your code base at a regular cadence could be viable. For example, Codeium, the one with an E, states that they do in fact offer customer-specific fine-tuning, but they clearly say it should be used sparingly, as the best approach is context-aware RAG. Context-aware RAG.

RAG is perhaps the best available method today to improve context by retrieving relevant snippets of the codebase. The challenge here is that the ranking problem in retrieval in very large codebases is non-trivial. Concepts like agentic RAG and RAG fine-tuning are gaining popularity and could be strong approaches to better utilize context.

Codeium, the one with an E, for example, shared in a blog post how they use textbook RAG augmented with more complex retrieval logic, crawling imports and directory structures and taking user intent, like past files you've opened, as context. Being able to use this granular detail in retrieval can be a significant moat for startups. Open question two, how do we get AI agents to work better for end-to-end coding tasks?
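
As a rough illustration of what that kind of retrieval logic can look like (a sketch under simplifying assumptions, not Codeium's actual system), the ranker below scores files by lexical overlap with the query, then boosts files imported by the currently open file and files the user recently had open.

```python
import re
from pathlib import Path


def _tokens(text: str) -> set[str]:
    return set(re.findall(r"[A-Za-z_]\w+", text.lower()))


def _imported_modules(source: str) -> set[str]:
    """Very rough import crawl: module names appearing in import statements."""
    return set(re.findall(r"^\s*(?:from|import)\s+([\w.]+)", source, re.MULTILINE))


def rank_context(query: str, open_file: str, recent_files: list[str],
                 repo_root: str, top_k: int = 5) -> list[str]:
    """Rank repo files as candidate context for `query`, boosting files the open
    file imports (import-graph signal) and recently opened files (user intent)."""
    query_toks = _tokens(query)
    imported = _imported_modules(Path(open_file).read_text(errors="ignore"))
    scored = []
    for path in Path(repo_root).rglob("*.py"):
        source = path.read_text(errors="ignore")
        score = len(query_toks & _tokens(source))                  # lexical overlap
        if any(m.split(".")[-1] == path.stem for m in imported):
            score += 10                                            # import boost
        if str(path) in recent_files:
            score += 5                                             # user-intent boost
        scored.append((score, str(path)))
    return [p for _, p in sorted(scored, reverse=True)[:top_k]]
```

A production system would use embeddings and a learned ranker rather than token overlap, but the structural point stands: retrieval signals beyond raw similarity, computed cheaply enough to keep latency low.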

While we still have a way to go to fully functioning AI engineers, a handful of companies and projects like Cognition, Factory, Codegen, SWE-agent, OpenDevin, AutoCodeRover, and Trunk are making meaningful progress.

SWE-bench evaluations have revealed that most base models can only fix up to 4% of issues. SWE-agent can achieve 12%, Cognition reportedly 14%, and OpenDevin up to 21%. An interesting idea, reiterated by Andrej Karpathy, is the concept of flow engineering, which goes past single-prompt or chain-of-thought prompting and focuses on iterative generation and testing of code.
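
Here is a minimal sketch of what a flow-engineering-style loop could look like; the generate callable stands in for whatever model call a team uses, and running pytest as the test harness is an assumption.

```python
import subprocess
from pathlib import Path
from typing import Callable


def flow_engineering_loop(task: str, target_file: str,
                          generate: Callable[[str], str],
                          max_iters: int = 5) -> bool:
    """Iteratively generate code, run the test suite, and feed failures back
    into the next prompt, rather than trusting a single completion."""
    feedback = ""
    for _ in range(max_iters):
        prompt = f"Task: {task}\n\nPrevious test output (if any):\n{feedback}"
        Path(target_file).write_text(generate(prompt))
        result = subprocess.run(["python", "-m", "pytest", "-q"],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return True                               # tests pass: accept the change
        feedback = result.stdout + result.stderr      # iterate on the failures
    return False                                      # give up after max_iters
```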

It's true that prompting can be a great way to increase performance without needing to train a model, although it's unclear to me how much of a moat that can be for a company in the long run. Note that there are some limitations to the SWE-bench form of measurement.

For context, SWE-bench consists of GitHub pairings of issues and pull requests. So when a model is tested on it, it's given a small subset of the code repo, a sort of hint that also introduces bias, rather than being given the whole repo and told to figure it out. Still, I believe SWE-bench is a good benchmark for starting to understand these agents at this point in time.
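
To make the structure of those pairings concrete, here is an illustrative sketch of what one benchmark instance roughly contains; the field names are my own shorthand, not SWE-bench's exact schema.

```python
from dataclasses import dataclass


@dataclass
class SWEBenchStyleInstance:
    """Approximate shape of an issue/pull-request pairing as described above."""
    repo: str               # e.g. "owner/project" pinned at a specific commit
    problem_statement: str  # the GitHub issue text handed to the model
    hint_files: list[str]   # the narrowed-down slice of the repo (the "hint")
    gold_patch: str         # the diff from the real pull request that fixed it
    test_patch: str         # tests that must pass for a proposed fix to count
```

Evaluation then comes down to whether the model's proposed patch makes those tests pass.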

Code planning is going to take a central role in AI agents, and I would be excited to see more companies focus on generating code specs that can help an agent build an objective, plan the feature, and define its implementation and architecture. Multi-step agentic reasoning is still broadly unsolved and is rumored to be a strong area of focus for OpenAI's next model.

In fact, some would argue that the actual moat in AI coding agents doesn't stem from the wrapper at all, but from the LLM itself and its ability to solve real-world software engineering problems with human-level tool access: searching Stack Overflow, reading documentation, self-reflecting, self-correcting, and carrying out long-term consistent plans.
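
A compressed sketch of that kind of tool loop is below; the decide callable and the tool registry are illustrative placeholders, not any particular vendor's API.

```python
from typing import Callable


def agent_loop(goal: str,
               decide: Callable[[str], tuple[str, str]],
               tools: dict[str, Callable[[str], str]],
               max_steps: int = 20) -> str:
    """Generic tool-use loop: at each step the model picks a tool and an
    argument, observes the result, and stops by returning a 'finish' action."""
    transcript = f"Goal: {goal}\n"
    for _ in range(max_steps):
        tool_name, argument = decide(transcript)    # e.g. ("search_docs", "pandas groupby")
        if tool_name == "finish":
            return argument                         # the agent's final answer or patch
        observation = tools[tool_name](argument)    # run the chosen tool
        transcript += f"\n[{tool_name}] {argument} -> {observation}"
        # self-reflection and self-correction fall out of the loop: the model
        # sees every earlier observation in `transcript` before deciding again
    return "no solution within step budget"
```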

This brings us to our last and possibly largest open question. Does owning the model and model infra lead to a long-term differentiated product? The billion-dollar question is whether a startup should rely on existing models, whether that be directly calling a GPT or Claude model or fine-tuning a base model, or go through the capital-intensive process of building their own code-specific model.

That would mean pre-training a model specifically for code with high-quality coding data. We empirically do not know whether a code-specific model will have better outcomes than the next suite of large language models. This question comes down to a few basic unknowns. Can a smaller code model outperform a much larger base model? To what degree does a model need to be pre-trained on code data to see meaningful improvement?

Is there enough available high-quality code data to train on? And does large-scale reasoning of the base model just trump all? The hypothesis from Poolside, Magic, and Augment is that owning the underlying model and training it on code can significantly improve code generation quality. This potential advantage makes sense considering the competition.

From my understanding, GitHub Copilot doesn't have a model trained fully from scratch, but instead runs a smaller, heavily code-fine-tuned GPT model. My guess is these companies aren't going to try to build a foundation-sized model, but a smaller and more specialized one. Based on conversations I've had with people working in this emerging area, my takeaway is that we still just don't know what scale of improvement this approach will have until results are released.

A counter-argument to the code model approach comes from the fact that existing successful coding copilots like Cursor and Devin are known to be built on top of GPT models, not code-specific models. And DBRX Instruct reportedly outperformed the code-specific trained CodeLlama.

If training with coding data helps with reasoning, then the frontier models will surely include code execution feedback in future models, thus making them more apt for code generation. And in parallel, large models trained primarily on language could feasibly have enough contextual information that their reasoning ability trumps the need for code data. After all, that's how humans work.

The key question here is whether the rate of improvement of the base model is larger than the performance increase from a code-specific model.

I think it's possible that most co-pilot companies will start taking frontier models and fine-tuning them on their own data. For example, take a Llama 3 8-billion-parameter model and do reinforcement learning from code execution feedback on top of it. This allows a company to benefit from the development in base models while biasing the model towards code performance.
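
As a toy illustration of where that execution-feedback signal could come from (a sketch of the reward computation only, not Poolside's or anyone else's training stack), the function below runs a candidate completion against unit tests in a temporary directory and turns the outcome into a scalar reward; it assumes pytest is available.

```python
import subprocess
import tempfile
from pathlib import Path


def execution_reward(candidate_code: str, test_code: str, timeout: int = 10) -> float:
    """Reward for RL from code execution feedback: 1.0 if the tests pass,
    0.0 if they fail, and a small negative reward if the run hangs."""
    with tempfile.TemporaryDirectory() as tmp:
        Path(tmp, "solution.py").write_text(candidate_code)
        Path(tmp, "test_solution.py").write_text(test_code)
        try:
            result = subprocess.run(
                ["python", "-m", "pytest", "-q", "test_solution.py"],
                cwd=tmp, capture_output=True, text=True, timeout=timeout,
            )
        except subprocess.TimeoutExpired:
            return -0.1
        return 1.0 if result.returncode == 0 else 0.0
```

A training loop would then favor completions that earn a reward of 1.0, biasing the base model toward code that actually runs.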

To recap, building AI tools for code generation and engineering workflows is one of the most exciting and worthy undertakings we see today. The ability to enhance and eventually fully automate engineering work unlocks a market much larger than what we've seen historically in developer tooling. While there are technical obstacles that need to be surmounted, the upside in this market is uncapped.

At Greylock, we are actively looking to partner with founders experimenting with all three approaches discussed in this post. We think the field is large enough to allow for many companies to develop specialist approaches to agents, co-pilots, and models. If you're a founder working on any of these concepts or even just thinking about it, please get in touch with me at corinne at greylock.com.