
CodeRabbit and RAG for Code Review with Harjot Gill

2025/6/24

Cloud Engineering Archives - Software Engineering Daily

People

Harjot Gill
Kevin Ball
Topics
Harjot Gill: As co-founder and CEO of CodeRabbit, I see AI playing a crucial role in code review. As AI code-generation tools spread, the need for AI code review is growing with them. CodeRabbit uses generative AI to review code, focusing on code quality and security, and gives developers a powerful assistive tool. We use a multi-model LLM strategy, build a code graph, and combine that with static analysis to keep reviews accurate and efficient. CodeRabbit is trained by chatting with users, so it keeps learning and improving to deliver a better experience. We also create sandbox environments so AI agents can autonomously navigate the codebase and look for potential issues.

Kevin Ball: As a VP of Engineering, I'm very interested in CodeRabbit's architecture and how it works. Providing the right context is critical, because naively applying LLMs can lead to mistakes. How does CodeRabbit build a code graph to determine which parts might be affected? What does the pipeline of steps look like, including static analysis, cleanup with cheaper models, and the expensive reasoning models? I'm particularly interested in how CodeRabbit keeps track through the process so it can surface the right relevant context, and how it ensures the AI's decisions are high quality.




One of the most immediate and high-impact applications of LLMs has been in software development. The models can significantly accelerate code writing, but with that increased velocity comes a greater need for thoughtful, scalable approaches to code review.

Integrating AI into the development workflow requires rethinking how to ensure quality, security, and maintainability at scale. CodeRabbit is a startup that brings generative AI into the code review process. It evaluates code quality and security directly within tools like GitHub and VS Code, acting as an AI reviewer that complements existing CI/CD pipelines.

Harjot Gill is the co-founder and CEO of CodeRabbit. He joins the podcast with Kevin Ball to discuss CodeRabbit's architecture, its multi-model LLM strategy, how it tracks the reasoning trail of agents, managing context windows, lessons from bootstrapping the company, and much more.

Kevin Ball, or K-Ball, is the Vice President of Engineering at Mento and an independent coach for engineers and engineering leaders. He co-founded and served as CTO for two companies, founded the San Diego JavaScript Meetup, and organizes the AI in Action discussion group through Latent Space. Check out the show notes to follow K-Ball on Twitter or LinkedIn, or visit his website, kball.llc.

Harjot, welcome to the show. Thanks, Kevin. Yeah, I'm excited to dig in with you. I'm really excited about what you guys are doing, but let's maybe start with that. So can you give our audience a little bit of background on you and on CodeRabbit? Yeah, that's great. So I'm Harjot, and I'm co-founder and CEO of CodeRabbit, which is a startup using generative AI for code reviews, essentially code quality and code security, for users on popular Git platforms like GitHub and GitLab.

So the company is roughly a couple of years old, but it has grown tremendously, nonlinearly pretty much. In the last couple of years, we have reached 100,000 developers who are using this platform

on a daily basis. And it's a pretty popular product loved by the developers across all the industry segments and so on. Awesome. So let's first look at this from a user standpoint. What does this look like? And then I will be excited to dive under the covers and dig into how CodeRabbit works. But for me as a developer, if I want to use CodeRabbit, what do I do and what does it look like?

Right, so CodeRabbit is a tool that is a nice complement to a lot of the code generation tools out there on the market. As you know, a lot of developers are now familiar with Cursor, GitHub Copilot, Windsurf, and so on. They're now using AI to generate a lot of the code, and we know that AI-generated code has a lot of deficiencies in terms of maintainability, and sometimes there are just sloppy errors that AI makes.

So now you've got to bring in AI to review AI, because review is becoming a bottleneck, right? To consume CodeRabbit, there are a couple of ways. The product primarily works inside your pull request model. Essentially, once you are done with your feature branch, you open a pull request before it gets merged into the mainline and shipped out to the end customers. That's typically where all the code reviews happen: the human reviews, a lot of the static analysis tools that you're running, like linters, and unit tests. Essentially, your CI/CD pipeline runs over there.

CodeRabbit sits alongside those tools and uses AI to perform code reviews.

And very recently, around a couple of weeks back, we also released a VS Code extension that works with forks of VS Code like Cursor and Windsurf, so that developers can review the code before they even push it to the remote Git branch. Okay, cool. So then let's look at what that looks like on the implementation side, because I think one of the things that I've certainly run into with Gen AI is,

naive application of the models. These models are very powerful. They can do a lot of cool stuff, but as you highlight, they get a lot of things wrong. And so figuring out how you feed them the right context and put all those things in place is very important. So can you maybe walk us through, I guess, first, what is the architecture for CodeRabbit behind the scenes? I will start by contrasting how different code generation is from code review, and then we'll probably go deeper into how CodeRabbit makes it all work.

And if you look at code generation, it all started with a lot of these tab-completion-style use cases, autocomplete. Typically you will see usage of small, low-latency models. So as you type, you have these suggestions show up in ghost text that you can press tab to complete, right?

And more sophisticated approaches will use some sort of a vector database to index your code so that you get more relevant suggestions based on your data structures or the coding patterns that you're using, right? On the other hand, code review is a problem that requires very, very deep reasoning. The workflow that CodeRabbit is sitting on is latency insensitive, because you're running it in the CI/CD pipeline and that workflow can typically take several minutes to complete.

So a tool like CodeRabbit has to be a lot more thorough in terms of its analysis in order to make it actually work. So CodeRabbit, believe it or not, is actually one of the biggest consumers of the reasoning models in the world right now, one of the big users of o3, o4-mini, and so on, right? And that's part of the magic that makes it work. Then, of course, it's the entire workflow around it, on how we bring in the relevant context, right?

And the context comes from, so the workflow basically triggers as soon as you open a pull request. So the context naturally comes from what's the payload of that pull request, what the diff looks like, right? Then you're also bringing the context from the remaining code base, the code graph. So we understand the impact that code would have on the dependencies that you're using in the code, like other functions, which are not even changed, but now are depending on the code that you're changing, right?

So building the code graph is also pretty critical in terms of context, right?
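How that code graph gets built is proprietary, but the general idea, deriving the changed symbols from the diff and walking a reverse dependency map to find callers that might be affected, can be sketched roughly as below. This is a minimal, single-file Python approximation for illustration only; as described later in the conversation, the real implementation spans many languages and sits somewhere between a full LSP and a lightweight parser.

```python
import ast
from collections import defaultdict

def changed_functions(diff_text: str, source: str) -> set[str]:
    """Very rough: collect functions whose lines fall inside the diff's hunks."""
    changed_lines = set()
    for line in diff_text.splitlines():
        if line.startswith("@@"):
            # Hunk header like: @@ -10,4 +12,6 @@  (context after @@ is ignored here)
            new_range = line.split("+")[1].split("@@")[0].strip()
            start, _, length = new_range.partition(",")
            start, length = int(start), int(length or 1)
            changed_lines.update(range(start, start + length))
    hits = set()
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            if any(node.lineno <= ln <= node.end_lineno for ln in changed_lines):
                hits.add(node.name)
    return hits

def reverse_call_graph(source: str) -> dict[str, set[str]]:
    """Map callee name -> functions that call it (single-file approximation)."""
    callers = defaultdict(set)
    for fn in ast.walk(ast.parse(source)):
        if isinstance(fn, ast.FunctionDef):
            for node in ast.walk(fn):
                if isinstance(node, ast.Call) and isinstance(node.func, ast.Name):
                    callers[node.func.id].add(fn.name)
    return callers

def impacted(diff_text: str, source: str) -> set[str]:
    """Changed functions plus their transitive callers: the extra review context."""
    graph = reverse_call_graph(source)
    frontier, seen = set(changed_functions(diff_text, source)), set()
    while frontier:
        fn = frontier.pop()
        if fn in seen:
            continue
        seen.add(fn)
        frontier |= graph.get(fn, set())
    return seen
```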

The other context comes from the Jira or Linear issues that you are trying to solve through that pull request. So usually there's some product knowledge, or some knowledge about the bug that you're trying to solve, coming from the issue systems, right? A lot of context is also coming from past learnings, because CodeRabbit is a very collaborative product. It's a product that people consume at a team level. So yeah.

And the way you train CodeRabbit is by chatting with it. So the more you talk to CodeRabbit, the better it gets over time. Those learnings from user interactions in previous reviews also get pulled in. And these are some of the examples, like 10 to 15 different data points that we pull in as context, right? But it's not sufficient, actually. That's the thing, right? I mean, as you know, these models have very, very limited context windows, right?

And even though we are seeing these context windows expand to a million tokens or so, it's still not sufficient, because you basically lose quality of inference as you try to stuff in more context. It's great for summarization, but when you're talking about deep reasoning, you can't really use all that context, right? So what we try to do is give CodeRabbit's agent enough hints so that it can get a basic bearing on what's happening in the pull request, what the trajectory of these changes is directionally, where they're going, and so on.

Then what we are doing, which is a cool thing and so differentiated right now, is we create all these sandbox environments in the cloud. We actually do create sandboxes, we clone the repository, and then we let the AI run an agentic loop to navigate that code base. We let the AI run CLI commands like shell scripts. It can run keyword searches. It can go and read additional files and bring additional data points into the context.
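A stripped-down sketch of that agentic loop, under stated assumptions: the model proposes a shell command, the host runs it inside the cloned checkout (the real containerized sandboxing is omitted), and the output is fed back until the model is ready to conclude. The `ask_model` helper and the `RUN:`/`DONE:` convention are placeholders, not CodeRabbit's actual interface; the ast-grep and web-search steps mentioned next would just be more commands issued through the same loop.

```python
import subprocess

MAX_STEPS = 10  # cap the agentic loop so it cannot wander forever

def ask_model(messages: list[dict]) -> str:
    """Placeholder for an LLM call that returns either a shell command
    prefixed with 'RUN:' or a final answer prefixed with 'DONE:'."""
    raise NotImplementedError

def review_in_sandbox(repo_dir: str, task: str) -> str:
    messages = [{"role": "user", "content": task}]
    for _ in range(MAX_STEPS):
        reply = ask_model(messages)
        if reply.startswith("DONE:"):
            return reply.removeprefix("DONE:").strip()
        if reply.startswith("RUN:"):
            cmd = reply.removeprefix("RUN:").strip()
            # Execute inside the cloned repository; the surrounding sandbox
            # (container, restricted user, etc.) is assumed, not shown.
            result = subprocess.run(
                cmd, shell=True, cwd=repo_dir,
                capture_output=True, text=True, timeout=60,
            )
            observation = (result.stdout + result.stderr)[-4000:]  # keep context small
            messages += [{"role": "assistant", "content": reply},
                         {"role": "user", "content": f"Command output:\n{observation}"}]
    return "No conclusion reached within the step budget."
```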

It can even run ast-grep queries, abstract syntax tree queries, to read entire functions and bring them into its context, and then continue with its analysis in order to validate a bug. One of the stages of the reasoning process is: okay, it looks like there might be an issue if you're going to change this, but can I go and validate whether it's really an issue? So it's a combination of preloading some context and then giving the agent enough agency

to go and find missing information. It even runs web queries. Like sometimes you have knowledge cutoff issues, right? I mean, these models have been trained in the past. Like, I mean, sometimes you have 2023 cutoff, 2022 cutoff,

which is kind of bad for the coding use cases because a lot of these libraries and frameworks are constantly evolving. So in a lot of these cases, we try to bring in context from doing internet searches. Sometimes we'll say, okay, this is a new syntax that we're looking at. Is this syntax something that's really out there, or is it incorrect? So you will sometimes see CodeRabbit do a web query to confirm the latest documentation. So that is fascinating. And I'd love to dive into some of those pieces. So first off, you said you kind of start with

The diff and building the code graph from there. Is that something that you are doing through an LLM, or do you have a sort of static analysis that you're doing? Or how do you build that code graph for what's likely to be impacted? It's a combination of both, actually. So that's a nice thing. There's a lot of this abstract syntax tree analysis and understanding the relationships. I mean, you're familiar with language server protocols, LSPs. It's kind of similar to what we are doing there, but it's our own proprietary implementation. So not exactly like LSPs, but somewhere in the middle in terms of

the memory footprint and everything we need to build that code graph. And it's all being done on demand. It's not being pre-indexed like Sourcegraph or something. We just create this live as we're doing the analysis. And the other part is that the large language models are able to then further understand the relevance of that code graph. A lot of things can be references and dependencies, but which ones are really relevant for understanding that diff for code review? So there's a lot of cleanup on the context happening as well before we

trigger some of the more expensive reasoning models. That's interesting. So could you walk me through maybe like what is the pipeline of steps that you go through? So it sounds like there's some amount of static analysis, there's some amount of cleanup with cheaper models, there's some amount of then these expensive reasoning models, like maybe not in full detail, no secrets here, but like

What are the different types of steps involved and how do you think about sequencing them? Yeah, I mean, we have written about it as well. When we started the company, there were a couple of initial blog posts on how CodeRabbit works and what makes it both cheap and good at the same time, which is hard to engineer in the world of AI. So one of the things that we do really well is understanding the context, right? So it's not like...

tools like Cursor where you're picking a model and then you're running with that model for your entire flow. CodeRabbit is an ensemble of models. We don't even expose what models we are using to the end customers. Sometimes people ask which models you're using, can we choose the models? We don't let them because they'll most likely make a mistake in picking the right model for the use case we have. So our team does a lot of work behind the scenes to pick up the right kind of a model for the different parts of our pipeline and the workload we have.

So we use like seven or eight models, depending on which one is a good fit for which part of the workflow. And a lot of the context preparation is where we use cheaper, faster models like GPT-4.1 Nano or GPT-4.1 Mini. Those are kind of the big workhorses.
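The exact split of models is internal to CodeRabbit, but the general routing pattern described here, cheap models compressing context so the expensive reasoning model only sees pre-digested input, might look something like this sketch. The model names and the `complete` helper are placeholders, not the real configuration.

```python
# Hypothetical tiering: route each pipeline stage to a model class so that
# the expensive reasoning model never sees raw, uncompressed context.
MODEL_FOR_STAGE = {
    "summarize_file": "cheap-mini-model",   # high volume, low cost
    "summarize_issue": "cheap-mini-model",
    "deep_review": "reasoning-model",       # low volume, high cost
}

def complete(model: str, prompt: str) -> str:
    """Placeholder for whatever LLM client is actually in use."""
    raise NotImplementedError

def prepare_and_review(diff: str, files: dict[str, str], issue_text: str) -> str:
    # Cheap passes: compress every large input into a short summary.
    file_summaries = {
        path: complete(MODEL_FOR_STAGE["summarize_file"], f"Summarize this file:\n{src}")
        for path, src in files.items()
    }
    issue_summary = complete(MODEL_FOR_STAGE["summarize_issue"],
                             f"Summarize this ticket:\n{issue_text}")
    # Expensive pass: the reasoning model only sees the diff plus summaries.
    prompt = (f"Diff:\n{diff}\n\nRelated files:\n{file_summaries}\n\n"
              f"Ticket:\n{issue_summary}\n\nReview the change.")
    return complete(MODEL_FOR_STAGE["deep_review"], prompt)
```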

Those models are dirt cheap, but we still spend a significant amount of money on them, given how much volume we run through them. And they do all sorts of tasks, from summarizing large context, like entire files and previous issues, and so on. So there's a lot of summarization that goes on before we even get into the actual code review workflow. So there are multiple steps. There's a whole setup process where we're creating a sandbox.

We are running a lot of the static analysis tools in them. So there's a lot of context being pulled in from your existing tooling. Basically, we go and identify what kind of tooling you have set up on your repository. Let's say you are using ESLint, we will go and detect that. We will use the existing configuration that your DevOps team might have set up, right? Sometimes people use golangci-lint. So we pick up all these tools, right? And we run them for you. So basically that's one of the contexts we bring in.
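Detecting and reusing the repository's own tooling could be as simple as mapping config files to linter invocations, roughly as in the sketch below. The mapping shown is illustrative; the transcript only names ESLint and golangci-lint as examples.

```python
import shutil
import subprocess
from pathlib import Path

# Illustrative mapping from a config file found in the repo to the linter to run.
LINTER_BY_CONFIG = {
    ".eslintrc.json": ["npx", "eslint", ".", "--format", "json"],
    ".golangci.yml": ["golangci-lint", "run", "--out-format", "json"],
    "ruff.toml": ["ruff", "check", ".", "--output-format", "json"],
}

def run_configured_linters(repo_dir: str) -> dict[str, str]:
    """Run only the linters the repository itself is configured for,
    returning raw output keyed by config file, to fold into review context."""
    results = {}
    for config, cmd in LINTER_BY_CONFIG.items():
        if not (Path(repo_dir) / config).exists():
            continue  # the team never set this tool up; skip it
        if shutil.which(cmd[0]) is None:
            continue  # tool not installed in the sandbox image
        proc = subprocess.run(cmd, cwd=repo_dir, capture_output=True, text=True)
        results[config] = proc.stdout or proc.stderr
    return results
```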

Then there's a lot of context we bring in from your CI/CD failures. That's another place where we use large language models to understand your failure logs. So if you have a build failure or a unit test case failure, we understand exactly what happened there. And that context is also used during code review so that we can provide remediation, one-click fixes for those steps, right?

So yeah, as I said, there are like seven or eight models for different use cases. For chat, there's a different model; for some of the agentic verification flows that we run, those are different reasoning models and so on.

Cool. And you mentioned a lot of this is being done on demand, but you also said, hey, you can train CodeRabbit. It will incorporate past learnings based on conversations or things that you've done in there. So it sounds to me like there's some sort of kind of summarization or indexing that you're doing of previous PRs that gets fed in at some layer. What does that piece look like?

That's right. I mean, this is where we have like a very different indexing system, similar in some ways, but different in many ways of the entire code base. So we do look at the entire code base and based on what got merged over the last 1000-2000 commits, right? I mean, that's how the system works.

And over there, we are indexing not just the code snippets. We understand that, okay, these are the relevant code snippets. That's how everyone's been doing it. They use abstract syntax trees, tree-sitter grammar rules, to extract out the relevant snippets and index them. But one of the unique things we also do on top of that is we also convert those snippets into doc strings, like natural language documentation, because a lot of the user queries are in natural language. So when you're doing code completion,

your similarity search happens on the code snippets themselves, so you have a good match in the vector DB. But when you're going into a lot more agentic use cases like CodeRabbit is, the input is a natural language query, right? So you have a better match when you're converting code into a natural language representation or summary of it.
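A hedged sketch of that indexing idea: extract each function (plain Python ast here; the transcript mentions tree-sitter grammars for broader language coverage), have a model describe it in natural language, embed the descriptions, and answer natural-language queries by similarity over those descriptions rather than over raw code. The `summarize` and `embed` helpers stand in for whatever model calls are actually used.

```python
import ast
import math

def summarize(code: str) -> str:
    """Placeholder: LLM call that turns a code snippet into a short description."""
    raise NotImplementedError

def embed(text: str) -> list[float]:
    """Placeholder: embedding-model call."""
    raise NotImplementedError

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)))

def build_index(source: str) -> list[dict]:
    """Index natural-language descriptions of functions, not the raw snippets."""
    index = []
    for node in ast.walk(ast.parse(source)):
        if isinstance(node, ast.FunctionDef):
            snippet = ast.get_source_segment(source, node)
            description = summarize(snippet)
            index.append({"name": node.name, "code": snippet,
                          "description": description, "vector": embed(description)})
    return index

def search(index: list[dict], query: str, k: int = 5) -> list[dict]:
    """A natural-language query matches against natural-language descriptions."""
    qv = embed(query)
    return sorted(index, key=lambda e: cosine(qv, e["vector"]), reverse=True)[:k]
```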

So we do a lot of that at scale. That makes a lot of sense. And then I presume you expose it to your agent. Here's a query framework. You do a natural language query and load up whatever might be relevant. That's right. We're bringing in knowledge from the code graph. We're bringing in knowledge from the code base index we have created. It's a very different kind of an index than what people have been doing in the space.

And a lot of that context is also shown to the user, so that people can also trust the AI, because the AI is known to hallucinate, right? So one of the ways you build trust is to also show the context and how that insight was bubbled up, like what led to that review comment or the conclusion, right?

So all that helps in making a great user experience. Let's maybe talk about that exposing piece, because I think that is key for any of these LLM driven applications, giving you the paper trail of like, how did this get here? Why is this here? So I can, as a human, validate it and detect those hallucinations and things. So when you have this long pipeline of context that you're loading in, you mentioned a bunch of different steps from a bunch of different sources. How do you keep track

through the process to be able to bubble up the right sets of relevant context.

Right, so it's all in the UX. So when you are posting these review comments, sometimes we will show what kind of additional context was used to bring up that insight. Sometimes it's just pure LLM logic. There's no additional context. It's just an issue that was detected on a surface level. But sometimes it's deep inspection of the code base. Sometimes the agent will go and read additional files in the repository. So you can see an analysis chain in CodeRabbit's comments. If you open that chain, you will see the whole thought process.

And the paper trail, as you said, what kind of commands were executed to come up with a certain insight. And then it will pinpoint the files and locations, even files that did not change in the pull request. I mean, it will also bring up insights from the rest of your code base. But you can go back and follow the paper trail. And if it ever went off track, you know exactly why it went off track. And then you can chat with CodeRabbit and explain why its analysis is correct or incorrect. And if it is incorrect, it will remember that for next time.
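Mechanically, that amounts to keeping a structured trace of every reasoning step, command, and observation behind a finding and rendering it into the review comment. A minimal sketch of that bookkeeping, with invented field names rather than CodeRabbit's actual schema:

```python
from dataclasses import dataclass, field

@dataclass
class TraceStep:
    kind: str     # "reasoning" | "command" | "observation"
    content: str

@dataclass
class Finding:
    file: str
    line: int
    message: str
    trace: list[TraceStep] = field(default_factory=list)

    def record(self, kind: str, content: str) -> None:
        """Append one step of the paper trail as the agent works."""
        self.trace.append(TraceStep(kind, content))

    def to_review_comment(self) -> str:
        """Render the finding plus its full analysis chain for the PR comment."""
        body = [f"{self.file}:{self.line} - {self.message}", "", "Analysis chain:"]
        for step in self.trace:
            body.append(f"- ({step.kind}) {step.content}")
        return "\n".join(body)
```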

That's super cool. So essentially, just to make sure I'm understanding, your agent is outputting its logs of what it's doing, which includes both LLM reasoning and tool calls, often different places, and then the results of what those tool calls are. And you keep that track and just bubble that up straight to the UI for someone to be able to explore.

That's right. We think it helps a lot. And actually, we were one of the first companies to pioneer this whole sandbox-and-CLI approach. Now we see this becoming commonplace, Codex came out and all, but CodeRabbit has been doing this for the last two years. Since the days of GPT-4, we were the first ones to actually find out that a lot of the

codebase navigation is a great way of finding issues versus doing pure RAG. So everyone was prioritizing a lot of codebase indexing. I know codebase indexing helps, but a lot of what makes CodeRabbit unique comes from this codebase navigation that happens ad hoc,

using shell scripts. Yeah, that's super interesting. And I think it's something that we've started to see in a lot of more recent agents: hey, let's just expose programming tools essentially to agents and let them figure out the right way to apply them.

Right. So even today, CodeRabbit doesn't use tool calls. Some people think we use tool calls; we actually don't. The entire system is based on CLI commands. So instead of doing tool calls, we have a sandbox and a CLI. That's all you need. That's the only tool you need, actually. You don't even need MCPs, even to open GitHub issues. We use a CLI command, the GitHub CLI, to open GitHub issues.
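In other words, the "tool" surface is just the shell. Below is a hedged illustration of executing an agent-emitted command; `gh issue create --title --body` is real GitHub CLI syntax, while the allow-list is an invented safeguard, not something the transcript describes.

```python
import shlex
import subprocess

# Invented safeguard for the sketch: only let the agent touch known binaries.
ALLOWED_BINARIES = {"gh", "rg", "cat", "sed", "ast-grep"}

def run_agent_command(command: str, repo_dir: str) -> str:
    """Execute a shell command the agent emitted, for example:
       gh issue create --title "Flaky test found during review" --body "Details..."
    No MCP servers or per-tool schemas: the CLI itself is the tool."""
    binary = shlex.split(command)[0]
    if binary not in ALLOWED_BINARIES:
        return f"refused: {binary} is not on the allow-list"
    result = subprocess.run(command, shell=True, cwd=repo_dir,
                            capture_output=True, text=True, timeout=120)
    return result.stdout or result.stderr
```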

We don't actually use MCPs because we don't have to. All the tools are available over CLI. That is fascinating. Let's dive into that a little bit more. So in terms of, I think one of the concerns about giving the LLM full access to any sort of code is how do you sandbox it properly? How do you decide what is in and what is out? So how do you think about that sandbox, especially if you're giving it web access and access to GitHub systems and things like that?

Yeah, I mean, there are standard techniques for sandboxing. People have been doing it in the past for many use cases. People have been doing dev environments, preview environments. So CodeRabbit in a lot of ways is standing on the shoulders of giants. I mean, there's some proprietary stuff we have done to make it fast and cheap.

Right. I mean, so we are kind of running these sandboxes at scale while also being very cost effective in doing so. But yeah, I wouldn't say there's any big secret sauce in how containerization, cgroups, and all those things work. Those are standard systems techniques, right? But the main thing is, how do you further block off access? In our case, we don't block off internet access, because that's something we feel the agent should have. Sometimes it will also make curl commands.

Sometimes it will use the GitHub CLI to read other PRs and so on. So we don't restrict internet access, but at the same time, we do want to make sure our cloud services are protected, that it doesn't have access to our internal systems and so on. That makes sense. Do you list out for it, for example, what sets of CLI tools or what permissions and access it has? Like the GitHub CLI, presumably you have to give it a token to be able to access the appropriate place and things like that.

Yeah, GitHub CLI has a nice way to authenticate. So the token is like, we just provided the token once and the CLI works like that. And then that token is in a secure vault inside GitHub CLI.

The main thing is we don't actually have to give the AI a lot of information on these tools, because it's already in the training data. When you think about the sed command, the cat command, ripgrep, those tools are well known, well understood by the AI. So there's not a lot of handholding in making it understand the schema of these tools, because it's just shell scripts. It's trained on that. We do...

explain the scenarios in which certain tools might be handy. So we try to influence the behavior in some ways on when it can make certain commands. I mean, for example, if you see, if you're doing a package.json update, let's say, go and read the vulnerability database GitHub has,

to see whether these packages are out of date or vulnerable. And it does that. It's pretty effective, actually. So each time there's a package.json change, you will see the agent making a call to GitHub's open vulnerability database to detect whether these packages, Python packages or Ruby packages, have any vulnerabilities, right?
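As a concrete illustration of that kind of scenario rule, the check below flags newly added or updated dependencies in package.json against the OSV.dev API, which aggregates the GitHub Advisory Database. The transcript doesn't specify which endpoint CodeRabbit actually calls, so treat this as one plausible way to implement the behavior.

```python
import json
import urllib.request

def osv_vulnerabilities(name: str, version: str, ecosystem: str = "npm") -> list[str]:
    """Query OSV.dev (which mirrors GitHub Security Advisories) for known
    vulnerabilities affecting one package version."""
    payload = json.dumps({
        "package": {"name": name, "ecosystem": ecosystem},
        "version": version,
    }).encode()
    req = urllib.request.Request(
        "https://api.osv.dev/v1/query",
        data=payload,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req, timeout=30) as resp:
        data = json.load(resp)
    return [vuln["id"] for vuln in data.get("vulns", [])]

def check_new_dependencies(old_pkg: dict, new_pkg: dict) -> dict[str, list[str]]:
    """Compare two parsed package.json files and flag added/updated deps with advisories."""
    old_deps = old_pkg.get("dependencies", {})
    new_deps = new_pkg.get("dependencies", {})
    findings = {}
    for name, version in new_deps.items():
        if old_deps.get(name) != version:
            # Strip common semver prefixes; real range resolution is out of scope here.
            ids = osv_vulnerabilities(name, version.lstrip("^~"))
            if ids:
                findings[name] = ids
    return findings
```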

Yeah, I like that a lot. So in terms of scenarios, and I'm going to explore this because I think as you highlight, you are one of the most successful examples of these agents in the wild, but it is a technique a lot of people are trying to figure out and explore. So can you give us a like ballpark? Are we talking tens of scenarios? Are we talking hundreds? Like, what does this look like?

Yeah, they are on the order of more than 10 for sure, from what I recall. This is all coming from tribal knowledge: if you are an engineering leader or a good engineer yourself, you're kind of taking what you know best and then programming that as a prompt, taking your own knowledge in many cases, right? And a lot of the time we are just learning from the sheer number of customers we have. And one of the reasons why CodeRabbit improved a lot is because we have a lot of open source usage.

And that is a great feedback loop. So we have every few seconds, we review some pull request in open source. And a lot of people interact with CodeRabbit. We kind of observe what they're doing in those pull requests. And some of that behavior goes back into training our agent. Yeah, that makes a ton of sense.

And we talked a little earlier about the challenges of prompt stuffing when you've got these big context windows and too much is in there. Is the number of scenarios still small enough that they all go into the base agent prompt? Or do you do some sort of dynamic loading or figuring out of what the likely relevant scenarios are at any particular time? Yeah, that's right. I mean, it's the latter. First of all, we're using multiple models, as I said; there's not a single base agent prompt. It's not like the

agentic loop that everyone else has. I mean, it's a pipeline in a way, and a lot of the work goes into preparing the context, actually. A lot of the money is actually spent there, because one of the things with the reasoning models is that these models get thrown off track very, very quickly if you're doing RAG and just stuffing in the context without cleaning it up first or re-ranking it.

These models tend to go completely off track and haywire, right? As opposed to non-reasoning models, they overthink. And that is one of the reasons why some companies struggled when Sonnet 3.7 came out.

Sonnet 3.5 was working really well for a lot of the coding companies, but when 3.7 came out, they had no clue what hit them. We were prepared. One of the good things is that we were built with reasoning models in mind from day one. In fact, even before reasoning models came out, we had a lot more internal reasoning process; there were a lot of stages which were just doing internal monologues and reasoning. We always benefit each time a new reasoning model comes out. So there are no big changes to our system, but some companies had to fundamentally rethink how they were doing their prompting with the reasoning models.

Yeah, that makes a lot of sense. So let's maybe break down a little bit the agentic loop, because I think a lot of people building agents right now, it's essentially, yeah, one big system prompt and tool calls and a loop around it. So you said for yours, it's more of a pipeline, you have this more dynamic set of things. So how do you think about the design of your agent?

Yeah, I mean, we work on large, complex code bases, so a single loop doesn't work for us. We have to figure out how to have a main agent that figures out what kind of things it has to do, and then the delegation happens. There's a lot of complexity there as well in how we break up the work. There's a whole task tracking system where you have a main root task breaking up into subtasks. That's how we do it. We essentially divide and conquer the problem with agents, right? And the results bubble up

and the visibility bubbles up. And that's how it works effectively on large code bases. A lot of that is proprietary. It's not like we're using any framework, or something like that. It's all in-house. And going back, it's a loop, but the trick with these systems is

also making sure that the AI or the large language models saw the right context. Sometimes you have shell scripts where you know the quality of the output won't be high enough for you to make a good judgment. So sometimes there's a lot of suppression happening: even though the AI would say, okay, it looks like there's a bug, you know that it didn't see the relevant context, so this might not be a high quality inference, and I will just hide it rather than

bubble up a lot of noise. So we do a lot of cleanup. Even on this agentic loop, it's not like a pass-through to the user. There's a lot more understanding of

what kind of quality context is going into the pipeline in our system, so that we know the decision or the inference we are getting at the end of the day is going to be high quality, or that we can even trust it. Yeah, that makes a lot of sense. For example, one example is that lack of output doesn't mean there's a bug, right? Sometimes you will run a find search for a file and you won't find that file, which probably means you're looking in the wrong place rather than that file not existing. So those kinds of scenarios you have to account for. There are many such scenarios, by the way.

Yeah, so let me make sure once again that I'm understanding. So essentially you have a top level and it breaks things down into a task graph. It says like, essentially, here's the set of things that I think we need to do

to dig into this, and then delegates those tasks to sub-agents in some form, which go and do work, and then as they complete, it kind of bubbles up through the graph to the top-level agent? That's right. And this task graph is dynamic, as you can guess. I mean, it's figured out by the AI. Yeah, so there's a system that figures out what the tasks should be. Now, thinking about those tasks,

Are they fully dynamic? Do you predefine classes of tasks? Does that connect to how you decide what's going to be relevant context and how high the quality is likely to be? Or is it completely driven by the LLM? It's a hybrid system. We do know the nature of these tasks, because we let the AI choose what kind of task is running, and then we know what these tasks should look like when we run them.
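A rough shape of that divide-and-conquer: a planner decomposes the root review task, sub-tasks recurse until a depth cap (like the one mentioned a bit later), and results bubble back up. `plan` and `work` are placeholder model calls, not CodeRabbit's internals.

```python
MAX_DEPTH = 5  # the "how deep into the rabbit hole" cap discussed below

def plan(task: str) -> list[str]:
    """Placeholder: planner model returns sub-tasks, or [] if the task is atomic."""
    raise NotImplementedError

def work(task: str, child_results: list[str]) -> str:
    """Placeholder: worker model produces a result, given its sub-tasks' results."""
    raise NotImplementedError

def run_task(task: str, depth: int = 0) -> str:
    """Divide and conquer: decompose, recurse, then bubble results up."""
    subtasks = plan(task) if depth < MAX_DEPTH else []
    child_results = [run_task(sub, depth + 1) for sub in subtasks]
    return work(task, child_results)

# e.g. run_task("Review this pull request: auth refactor touching many files")
```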

But the graph itself is dynamic to a large extent. I mean, it's a hybrid architecture. There are some pipeline stages which are always hard-coded in the system; those steps have to happen.

But then we give enough freedom to this agent to go and find stuff as well and plan around it. What we found is that planning is a big part of the quality. The more you plan, the more you give it agency to first go and navigate the code, and that usually yields high quality outcomes at the end of the day, rather than just rushing into doing or concluding something. You want to let the AI follow multiple chains of thought. Some of them could lead to a dead end, but that's fine.

And like maybe four out of five doors were closed, but one of the doors leads to some interesting insight. Yeah, this is all connecting for me because as you build out those tasks, they have classifications. That's going to help with what we talked about in terms of picking what are the relevant scenarios to load into the context for that sub-agent to decide what it

might check or do. The filtering that you talked about, is that also done kind of agentically by the LLM, where it's judging quality, or do you have some sort of static analysis in there in some form as well? Yeah, it's mostly LLM driven, I would say. There is some static stuff, as I said; we know exactly, okay, that these models did not see the relevant context. So it's sometimes very easy to figure that out from the quality of the commands it's running and the outputs, right?

But in many cases, the validation is done by another kind of a judge LLM, which is running online and which is also able to decide whether the result so far has been accurate or not. That makes sense. That makes sense.
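That validation stage might look something like the sketch below: each candidate finding, together with the evidence the agent actually gathered, goes to a separate judge model, and anything below a confidence bar is suppressed rather than posted. The scoring scale and threshold are invented for illustration.

```python
def judge(finding: str, evidence: str) -> float:
    """Placeholder: a separate 'judge' LLM that returns a confidence score in [0, 1]
    for whether the finding is supported by the context the agent actually saw."""
    raise NotImplementedError

def filter_findings(candidates: list[dict], threshold: float = 0.7) -> list[dict]:
    """Suppress findings the agent reached without seeing enough relevant context,
    rather than bubbling noise up to the pull request."""
    kept = []
    for cand in candidates:
        if not cand["evidence"].strip():
            continue  # e.g. an empty search result is not proof of a bug
        if judge(cand["message"], cand["evidence"]) >= threshold:
            kept.append(cand)
    return kept
```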

And then in terms of what you mentioned, in terms of adapting to inputs, then as things come back, I assume different layers have the ability to say, oh, that was a dead end. Go try this. Let's replan. Let's restructure this as you go. How do you limit the extents of that or decide when you're done?

It's an arbitrary number, like 10 levels deep. I mean, when it's done, it will just say it's done, but sometimes we have to have a cap. It's like the stack depth problem, the maximum stack depth you want to allow. And it's a cost thing. I don't remember what the constant is right now, maybe it was five or 10, something like that. We picked a number and said, okay, this is the deepest we want to go into the rabbit hole. That makes sense. These things tend to loop around, especially the earlier models. There was a lot of this looping behavior where it would go and check the same thing again and again, right? Well, and-

Cost does bring up an interesting question, right? Like, I have a coworker who is way down in agent land and exploring all sorts of different agents and trying different things, but they tick up in cost pretty quickly if you just let them run. So you mentioned you've done a lot to try to control costs and keep this contained. How do you approach that?

It's multiple things, right? One is that the reason we use a lot of the cheaper models is cost. Yes, you could use an expensive model for everything, even summarization, but that doesn't make sense; it's orders of magnitude more expensive, right? For example, o3 is like five times as expensive as Sonnet, and Sonnet is orders of magnitude more expensive than GPT-4o mini or something, right? So it's about being smart about mapping the workload to the right model so that you get the best price-to-performance ratio for the workload that you have in mind.

The other factor is being smart about the incremental thing, especially. One of the things that people love about CodeRabbit is that it's an incremental reviewer. It will remember where we left the review last time, and the next time it resumes, it will first see whether it really has to re-review something or not, whether it's a trivial change it can skip. So we have a lot of prompts that are actually just

figuring out whether we even need to do a deeper analysis or just approve it. There's a short circuit, basically. Yeah. There is a short circuit. And so far, no one has noticed or complained, because sometimes we do skip, and the quality on that has been really high. At least the decisions we have been making there have been very high quality. Right. And the other part has been rate limits. You would sometimes see on Twitter people complain that CodeRabbit has rate limits, but that's one of the ways we kind of control the reviews

so that it's kind of fair to... So unlike a lot of the AI companies which are now going to consumption pricing: you see agent-y companies now, like Cursor, for example, which has a Max mode that is, I was reading the documentation, a 20% markup over the API cost. So you're passing the Sonnet costs, the Gemini costs, on to the end user. CodeRabbit, on the other hand, has per-seat pricing. It's all you can eat. But the way we sustain as a business at scale is through a lot of these techniques on the LLM side and rate limits, right? I mean...

For our open source plan, we are able to have much stricter rate limits, versus more relaxed rate limits for our paid users and different plans. Yeah, that makes sense. What would you say some of the most challenging technical areas of building out CodeRabbit have been, and

how have you addressed them? It's been fun. It's been a different kind of project. It's my third startup now, so a very different flavor from the previous two that I did. The earlier ones were in observability and infrastructure, cloud infrastructure, reliability management. This has been a very different kind of product, where we had to unlearn a lot of the way you build software. It's not deterministic. There are a lot of

deficiencies in the large language models themselves, but they're amazing in so many ways. The trick has been: how do you hide those deficiencies from the end user? They tend to be noisy, they tend to be slow, they create a lot of slop otherwise, right? And how do you build a product that people love? So it's a combination of the reliable execution of these agents and also a great UX that becomes part of your daily workflow. For example, CodeRabbit sits inside the pull request model. And we're one of the very few companies

which have been able to successfully bring a product into an existing workflow. A lot of people hate AI, if you ask me. People are trying to bring AI to every workflow you might have, and people hate that. But CodeRabbit has been one of the very few exceptions where it's actually being loved and being pulled in very rapidly by the developers themselves. You highlight a couple of really important things, and I want to go deeper on there. So one is that these models come with fundamental trade-offs. They have strengths and they have deficiencies. And

If you want to use them effectively in a product, you need to build around those. You can't just treat them like software. And then, as you also mentioned, many companies are failing to see that and just kind of trying to bolt them onto things without even thinking about, is this a useful use case for this? What are the strengths? What are the trade-offs? How do I do that? So I'm curious, through building CodeRabbit, if you've kind of developed any

I guess almost like principles for how you think about what is going to be a good use case for LLMs and not, or how you build a product around

a large language model? That's a great question, actually. One of the things that people love about CodeRabbit has been how surprisingly reliable or accurate it is, given the bad experience or bad taste in the mouth that every other product leaves. And that is a bar we try to keep up with the new features, which also means tracking where these models are technically, in terms of both price and performance, right?

So there are a lot of use cases we want to do, but we deliberately don't go and build them, because we know that the capabilities are not there yet. We don't want to

lower the bar on CodeRabbit. For example, a lot of companies are now doing issue-to-PR, but if you give an open-ended prompt, 80% of the time you're still going to end up with the wrong implementation, right? So these are still, I would say, experimental systems, not ready for large-scale mainstream use cases. CodeRabbit is mainstream. We are being used even in traditional companies, not just Silicon Valley startups, on PHP and Java applications; even older applications are using us very successfully, right?

So those are some of the principles. Yes, we could do a lot with AI, especially with tool calling, it doesn't require a lot of code. I mean, if you look at agentic systems, they are very simple systems. They're just a bunch of tools cobbled together and it's usually like Sonnet doing all the magic for you. But those are not products yet.

right? You need a person who is a real expert in prompting and able to drive the outcomes. Twitter is a different bubble. When people say they're successful with AI, they're prompting geniuses. They know exactly where these models will fail, and they don't even try those use cases, but the rest of the world is not ready for a lot of this prompting and these models, right? So these are some of the guiding principles. UX is another one. We do

try to make sure that we really understand the user's existing workflow, so that we can seamlessly bring AI into their daily life versus something they have to remember to use. One of the big differences between the CodeRabbit experience and other tools is that it's not a chat product.

Every other product requires prompting and chat. We are one of the very few products that have zero activation energy. There is no activation required from the user: you open a PR, and it gives you insights. I think there's something really powerful there, because ChatGPT was so successful that it has kind of made everyone have this mental model of LLM equals chat. And to your point, you are not a chat product. That is true.

not at all what you are doing. What is your mental model for what makes a good LLM problem? If LLM does not equal chat, what is it providing for you? Like if someone else was trying to go through the learning process that you have of how am I going to apply this in a useful way to create real value? What's the like...

picture that you have of what capabilities this LLM provides? No, that's a great question. You have to first understand where the data is coming from, what the training data looks like. We know that these models are trained on software. They've been very successful because that data has been very easy to obtain. Things like shell scripts: you know intuitively that, hey, we have

thousands of repositories, so these LLMs are trained on what good shell scripts look like, right? Those are the strengths, and you have to play to the strengths. Whereas if you suddenly come up with a use case where you know there's been very scarce training data, even the reasoning models cannot solve every problem. They're good at things that they've seen in the past, or where the data has been there even for the reinforcement learning, right? One of the things we have seen is that these large language models don't really make someone who's already 10x become 100x effective, right?

They really make an average, let's say 1x, person become 10x, because they're bringing a lot of the training data, which is trained on best practices and good use cases, to a more average developer. And that's also what makes them effective at automating repetitive work, the toil.

Some of these code review comments are actually toil. Most of the time it's the same thing. Best practices around security, best practices around some null pointer checks. It's again and again the same thing. Or it's sometimes unit test case generation, doc strings, those kind of things, it's very effective. And those are the use cases we typically go after where there's a lot of toil, repetitive work, and we know that people just don't want to do these things.

Those are things you go and automate. If you ask me, can an LLM create something brand new, or make someone who's already a really good programmer become 100x? That I don't know yet, but we have seen a lot of people become 10x thanks to large language models. One of the things you said a little earlier was around essentially not wanting to build features where the technology isn't there

yet. What would you say is kind of the edge right now of the types of things you would do with an LLM, where you think it might get to in the next few months, versus, ah, that's not going to happen anytime soon? Yeah, we constantly track the envelope. That's the whole idea with the evals. One of the other secret sauces these good AI app companies have is evals.

We are able to track not just the efficacy of the current system or the new models that come out, but also the limits of these models. And we have some test cases that we know even advanced models like o3 are not yet able to solve for us. So it's very critical that we track the progress. And we have seen our own benchmarks and our own evals getting beaten progressively, from GPT-4o to o1

to o3 and so on, right? And that gives us a good idea. And the second is price, right, and how effectively we can offer it, because even these providers don't have enough quota; we have to fight with the providers sometimes to get rate limits. So even if, let's say, we have a use case in mind and people are willing to pay for it, we just don't have the capacity for it to be delivered at scale. So there are multiple factors which kind of hold us back on some of these frontier use cases that we have in mind.

It's a complicated thing, I would say, on where to place big bets. Overall, in this space, there's massive appetite in the market to bring in AI to, as I said, automate the toil and the mundane work, right?

But at the same time, there are the practical limitations on how much capacity you can get and on the capabilities of the AI itself. Yeah, I think it's the first time I've seen in quite a long time where it feels like the whole industry is capacity limited. We just can't ship enough GPUs. That's right. And it gets expensive as well. I mean, we do see that there's going to be an orders-of-magnitude reduction.

But then again, some of these other use cases will start opening up, right? And that aspect is challenging. Overall, the models are in a way designed, especially with RL, such that, yes, you can make them competent on a lot of use cases provided you have the right kind of data. It's about recording the usage, not just what's available on the internet, but observing how people do things.

I think that's how the RL thing works, right? And sometimes it's just synthetic. But the thing is that you have to have the ability to record that data somehow.

And that's how other use cases will open up. But for now, it seems like coding is something where you can easily obtain that data, either through code editors, through open source, or by hiring humans. I know these companies are also hiring a lot of contractors to go and solve programming puzzles, right? So that data is relatively easy to obtain. And that's why we have seen a lot more success in coding use cases initially with AI. But that doesn't mean other use cases are out of reach forever. It's just a matter of time; people will figure out how to obtain quality data to make those use cases reliable.

You mentioned evals, and that's another place it might be worth digging in for a little bit, because this is something where I feel like there's a lot of chatter, but I haven't seen big standards coming out yet in terms of how to eval. It feels very company-specific oftentimes. So how are you thinking about and managing evals? It is indeed company-specific, and we have been burned in the past by looking at public evals and trusting them. And that's what happened back in June, July, where

we were burnt when even GPT-4o came out; it was actually worse than Turbo for at least our use case. We didn't have good evals back then. And we saw a lot more. The main eval is like, hey, are we seeing the same number of conversions? Are people still buying the product at the same rate? From sign-up to paid, are we seeing a big churn rate?

So those kinds of things are the real data points, the business outcomes. As long as you release these models and your outcomes improve or stay the same, that means something is working. So a lot of it is vibe checks as well, right? At the same time, you still want to do as much as you can on your end, because if you're rolling out these new models, you don't want them to backfire. We have like 100,000 developers,

and the last thing we want is to disrupt their daily flows, right? So we try to be careful. We try to curate some of the examples we see in the wild where we think they'll make a good eval. We're taking more of a pets-versus-cattle approach: we don't have millions of examples like other companies. We try to curate a golden data set of as few examples as possible, which allows us to track

where the AI is today, and where we can compare these models more effectively, very quickly. What granularity do you apply that at? Because we talked about how you have this complex and variable task graph and pipeline of things going on. Is the eval at the level of the whole pipeline on a particular code change, or are there more granular things that you are testing?

It's both. We are taking the end-to-end approach as well, where we are running the end-to-end flow, but a lot of the time we are also running it as a unit-test kind of thing, assuming the context provided from the other stages of the pipeline is perfect: how is a certain stage going to perform? Because it's a complex pipeline, especially an agentic one, and your errors compound the deeper you go. That's the hard part. If you have a 5% error rate per stage, it becomes 20% downstream at the end of the day, right?

So the idea is like, how do we decompose this pipeline and test each stage independently as much as possible by keeping a lot of the other factors the same. So yeah, so it's kind of a balance. Yes, there are end-to-end tests as well. And at the same time, it's very granular. I wouldn't say we have 100% coverage because some of the prompts are simple. We don't feel like writing a lot of evals for them, but some of the more complex prompts

where a lot of the classification happens, a lot of the reasoning happens, those kinds of prompts we have extensive tests for now. Are you using any particular framework for that, or is it homegrown? It's mostly homegrown. We do have some visibility in tools like LangSmith, especially from the open source side. We don't trace our paid customers' private repositories, but that's where we have a lot of the open source data coming in that provides us live visibility into how the system is performing. That makes sense.
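A homegrown, stage-level eval harness of the kind described might look roughly like the sketch below: a small golden set of cases per pipeline stage, each run against a candidate model and scored by a grader (a plain callable here; in practice often another LLM). The second helper just shows why per-stage errors compound: four stages at 95% accuracy are only about 0.95^4, roughly 81%, end to end, which matches the "5% becomes 20%" point above.

```python
from dataclasses import dataclass
from typing import Callable

@dataclass
class EvalCase:
    stage: str      # e.g. "classify_change", "deep_review" (hypothetical names)
    prompt: str     # stage input, with upstream context assumed perfect
    expected: str   # curated ("golden") expected outcome

def run_stage_evals(cases: list[EvalCase],
                    model: Callable[[str, str], str],
                    grade: Callable[[str, str], bool]) -> dict[str, float]:
    """Score a candidate model per pipeline stage against a small golden dataset."""
    totals: dict[str, list[int]] = {}
    for case in cases:
        output = model(case.stage, case.prompt)
        totals.setdefault(case.stage, []).append(int(grade(output, case.expected)))
    return {stage: sum(scores) / len(scores) for stage, scores in totals.items()}

def end_to_end_accuracy(per_stage: float, stages: int) -> float:
    """Errors compound downstream: per-stage accuracy p over n stages gives p**n."""
    return per_stage ** stages   # 0.95 ** 4 is about 0.81
```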

In a slightly different direction: you said this is your third company. And I think I saw CodeRabbit's completely bootstrapped; you didn't go the venture capital route or anything like that. I know that's something a lot of developers dream about doing, taking a project and bringing it to be something sustainable. What did that take? How does that look? And were you able to get to something that could sustain you very quickly? Or kind of what was that timeline like? That's an interesting question. Like, I mean,

Yes, I mean, we had success in the past. My first startup was a good exit; the second, not so much. That was in the reliability management space, but CodeRabbit was kind of an internal tool that started out there and then flourished independently. In this startup, one of the unique things has been just the compressed timeframes, how fast things are moving. So it's not like we didn't take venture capital money. We are funded by CRV, which is one of the big investors in product-led growth companies.

So overall, we raised like around 26 million. So it's not like it's completely bootstrapped at this point. Like there is significant VC money which has been raised in this company. But yeah, I mean, it did get to Series A without the seed funding round.

So we were already at a million dollars in annual recurring revenue last year when we did that round. That was completely on a bootstrap budget. But we could do that given that, yes, there was some prior success, so we could invest. We were at a stage in life where we could take that kind of risk. That makes sense. How did you...

get your initial set of customers? I think this zero-to-one phase is one of the most challenging, particularly for developers finding the market. And you're targeting developers, where a lot of us, when we think about, oh, I could do something, start with an itch that we want to scratch for ourselves. So how did you kind of get to that 1 million

out of the gate, with no budget except what you could fund yourself? A lot of that is thanks to my co-founder, Gur, who did things I would not have otherwise done, first of all. The first two startups were all enterprise sales, very content-marketing driven, a very different go-to-market. I'm not saying that was ineffective, but that's what those products needed. On the other hand, the developer market is a very consumer-style market. It's a massive market compared to selling cloud infrastructure, for example. And

The strategies that work here are very different. Like even things like ads work very effectively in this space. So it was a combination of multiple things like influencers, organic tweets, like our users talk about the products. A lot of it is not even us pushing it. Like it's the flywheel effect of the users that talk about it.

So a lot of our customers who come in inbound are primarily coming because of word of mouth. They're not being acquired by marketing in any way. Our customer acquisition cost is very, very low for the industry, because it's just a flywheel effect. The key things we did: we made the product accessible to as many people as we could. We made the product free for open source users so they could try it out. We made the product free for all individual users on VS Code.

So the idea is, we knew that this AI thing is so new, it needs a massive habit change, right? The main battle is not building a product or raising money. The main thing is, are people going to form this new habit or not? That was our biggest worry two years back; we saw it coming. Everyone was trying to bring AI products to the market, and we knew 90% of them would fail, because people are not going to change their habits.

So we saw that early on, and in order to quickly iterate on the product and make sure that we built a habit-forming product, we had to make it accessible. There was no other way. And we innovated a lot on that. That's what led to a lot of user love, because we could iterate and hammer it to the point where it has very good product-market fit and gets universal love. Yeah, great lessons there. I guess we're getting closer to the end. Is there anything on the horizon? What's the next

big release coming from CodeRabbit? We're doing very interesting stuff now, actually. Code review has been a very interesting starting point, getting us through the door in pretty much most companies now. One of the things we are seeing now is vibe coding taking off. We are seeing even more acceleration in our growth. We have been going crazy, but the last three weeks have been, I would say, crazier. We have never seen that kind of growth.

because OpenAI Codex came out, the background agents Cursor is doing, and Claude Code is there. There are so many vibe coding tools out there. And what we're seeing is this huge opportunity in being a tool that can make vibe-coded systems production ready. There is still that last 20% of polishing, or what we call finishing touches. Those are the areas we are focusing on, so that all those deficiencies can be eliminated in the PR. For example, if you're missing documentation,

and you as a company care about it, can we add doc strings? Can we add missing unit test coverage? Because those kinds of things you're going to discover when you actually open a PR. You're not going to discover that in Cursor or your code editor. You're going to discover that in the CI/CD. And that last 20% of polishing is what we are focusing on as a company. That's super cool. Especially because I feel like

One of the things I've seen with people exploring vibe coding is the better your code practices are, the better the AI is able to generate things in it. If you keep things modular and well-named and all these things that get caught in a code review,

then you're going to be able to sustain this longer as well. That's right. I mean, there's so many things you're talking about. Maintainability, you're talking about, can we fix some of the CICD failures? Like there's just so much downstream of a PR as well that needs to happen. And we are pretty excited. Like, I mean, the massive appetite and a lot of these form factors haven't been thought of in the past. And we were so excited to bring all these new ideas to the market. That's awesome. Well, anything else that you would like to leave our audience with before we wrap?

I mean, the only thing I would say is definitely try out CodeRabbit if you haven't tried it already. I know that a lot of people have heard about it, but it's a tool that will surprise you once you actually try it, because it's that good. So I recommend everyone at least try it once. Awesome. I think that's a great wrap-up. Thanks, Kevin.