Willem Pienaar, CTO of Cleric. We're building an AI SRE. We're based in San Francisco. Black coffee is the way to go. And if you want to join a team of veterans in AI and infrastructure working on a really tough problem, yeah, come and chat to us.
Boom. Welcome back to the MLOps Community Podcast. I'm your host, Demetrios. Today we are talking with my good friend, Willem. Some of you may know him as the CTO of Cleric AI, doing some pretty novel stuff with...
the AI SRE, which we dive into very deep in this next hour. We talk all about how he's using knowledge graphs to triage and root-cause issues with their AI agent solution. And others of you may know Willem because he is also the same guy that built the open source feature store Feast.
That's where I got to know him back four or five years ago. And since then, I've been following what he is doing very closely. And it's safe to say this guy never disappoints. Let's get into the conversation right now.
Let's start by prefacing this conversation with: we are recording two days before Christmas. So when it comes out, this sweater that I'm wearing is not going to be okay. But today it is totally in bounds for me to be able to wear it. Unfortunately, I don't have a cool sweater like you. And I'm in sunny San Francisco. But I guess it's... Got the fog. Yeah, it's Christmas vibes. Dude, I found out three...
four days ago that if you take this magic pill with caffeine, it, like, minimizes the jitters. So I have taken that as an experiment. L-theanine, or whatever it's called. Yeah, you've heard of it? Yeah, yeah. Dude, I've just been abusing my caffeine intake and pounding these pills with it. It's amazing. I am so much more productive. So that's my 2025 secret for everyone.
Paired with a bit of magnesium for a bit of sleep, or actual sleep. All right, man. Enough of that. You've been building Cleric. You've been coming on occasionally to the different...
conferences that we've had and sharing your learnings. But recently you put out a blog post, and I want to go super deep on this blog post on what an AI SRE is, just because it feels like SREs are very close to the MLOps world. And AI agents are very much what we've been talking about a lot as we were presenting at the Agents in Production conference.
The first thing that we should start with is just what a hard problem this is. And why is it hard? We can dive into those areas, and I think we're going to get into that in this conversation. Maybe just to set the stage, everyone is building agents, like agents are all the hype right now. But every use case is different, right? You've got agents in law, you've got agents for writing blog posts, you've got agents for social media. One of the tricky things about our space is really
If you consider the two main things that an engineer does: they create software, and then they deploy it into a production environment where it runs and operates. It actually has to have an impact on the real world. That second world, the operational environment, is quite different from the development environment. The development environment has tests, it has an IDE, it has tight feedback cycles. Often it has ground truth, right? So you can make a change and see whether your tests pass. There are permissionless data sets that are out there. So you can go to GitHub and you can find all kinds
of issues that people are creating, the PRs that are like the solutions to those issues. Yeah. But consider...
like the production environment of an enterprise company. Where do you find the data sets that represent all the problems that they've had and all the solutions? It's not just lying out there, right? You can get some root causes and things that people have posted as blog posts, but this is an unsupervised problem for the most part. It's a very complicated problem. I guess we can get into those details in a bit, but that's really what makes this challenging for us. It's complex, sprawling, dynamic systems.
Yeah, the complexity of the systems does not help. And I also think with the rise of the coding co-pilots, does that not also make things more complex? Because you're running stuff in a production environment that maybe you know how it got created, maybe you don't. Massively. And I think even on our scale, a small startup, it's become a topic internally.
how much do we delegate to AI? Because we're also outsourcing and delegating to our own agents internally that produce code. So I think all the teams are trying to get to the boundaries of understanding and confidence. So you're building these modular components like Lego blocks where the internals you're unsure about, but you're shipping them to production and seeing how that succeeds and fails because it gives you so much velocity. So the ROI is there, but the understanding is one of the things you lose over time. And I think at scale,
where the incentives aren't aligned, where you have many different teams and they're all being pressured to ship more. Belts are being tightened, so there's not a lot of headcount and they have to do more. The production environment is really... People are, like, putting their fingers in the dam wall, but eventually it's going to break. It's unstable at a lot of companies. Yeah, so...
coding is going to make, or AI-generated coding is really going to make this a much more complex system to deal with. So the dynamics between these components that interrelate, where there's much less understanding, is going to explode. Yeah, we're already seeing that.
Dude, there's so many different pieces on the complex systems that I want to dive into. But the first one that stood out to me and has continued to replay in my mind is this knowledge graph that you presented at the conference and then subsequently in your blog post. And you made the point of saying, this is a knowledge graph that we created on a production environment, but it's not like it's a gigantic Kubernetes cluster.
It was a fairly small Kubernetes cluster and all of the different relations from that and all the Slack messages and all the GitHub issues and everything that is involved in that Kubernetes cluster you've mapped out. And that's just for one Kubernetes cluster. So I can't imagine across a whole entire organization, like an enterprise size, how complex this gets. Yeah.
So if you consider that specific cluster or graph I showed you, it was the OpenTelemetry demo reference architecture. It's like a demo stack. It's like an e-commerce store. It's got about 12, 13 services. Yeah, roughly in that range. I've only shown you literally like 10% of the relations, maybe even less. And it's only at the infrastructure layer, right? So it's not even talking about, like, buckets and cloud infrastructure, nothing about nodes, nothing about application internals, right? So if you consider one cloud project, like a GCP project or AWS project,
There's a whole tree. There are the networks, the regions, down to the Kubernetes clusters. Within a cluster, there are the nodes. On the nodes, there are the pods, and within the pods there are potentially multiple containers. Within each of those, many processes, and each process has code
with variables, and each of these creates this tree structure. But then between those nodes in the tree you can also have interrelations, right? Like a piece of code here could be referencing an IP address, but that IP address is provisioned by some cloud service somewhere, and it's also connected to some other systems. And you can't not use that information, right? Because if a problem arises and you're, you know, looking through the lens of your app, then you have to causally walk that graph to go upstream to find the root cause.
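To make that tree-plus-cross-links picture concrete, here's a minimal sketch (not Cleric's implementation) using networkx: hierarchical "contains" edges, a couple of cross-reference edges, and an upstream walk from a failing leaf to collect candidate root causes. All resource names are made up.

```python
# A minimal sketch of the resource-graph idea: hierarchical "contains" edges
# plus cross-cutting "references" edges, and an upstream walk from a failing
# leaf. Not Cleric's implementation, just an illustration with networkx.
import networkx as nx

g = nx.DiGraph()

# Hierarchy: project -> cluster -> node -> pod -> container -> process
contains = [
    ("gcp-project", "gke-cluster"),
    ("gke-cluster", "node-1"),
    ("node-1", "pod-checkout"),
    ("pod-checkout", "container-app"),
    ("container-app", "process-main"),
]
g.add_edges_from(contains, kind="contains")

# Cross-links: a piece of code references an IP that a cloud load balancer provisioned.
g.add_edge("process-main", "ip-10.0.0.7", kind="references")
g.add_edge("cloud-lb", "ip-10.0.0.7", kind="provisions")

def upstream_candidates(graph: nx.DiGraph, leaf: str) -> list[str]:
    """Walk against edge direction from a failing leaf to collect upstream suspects."""
    # reverse() flips edges so ancestors become reachable via a forward traversal
    return sorted(nx.descendants(graph.reverse(copy=True), leaf))

if __name__ == "__main__":
    print(upstream_candidates(g, "process-main"))
    # ['container-app', 'gcp-project', 'gke-cluster', 'node-1', 'pod-checkout']
```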
In the security space, this is a pretty well-studied problem. And there are traditional techniques that people have been using to extract this from cloud environments. But LLMs really unlock a new level of understanding there. So they're extremely good at extracting these relationships, taking really unstructured data. So it can be conversations that you and I have. It can be Kubernetes objects. It can be all of these, like the whole spectrum from unstructured to structured. You can extract structured information. So you can build these graphs.
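A hedged sketch of what that extraction step could look like. The complete() function is a stand-in for whichever LLM client you use, and the prompt and example output are purely illustrative.

```python
# Sketch: asking an LLM to turn unstructured snippets (a config map, a Slack
# message) into structured (subject, relation, object) triples for the graph.
# `complete()` is a stand-in for your LLM client of choice.
import json

EXTRACTION_PROMPT = """Extract infrastructure relationships from the text below.
Return a JSON list of [subject, relation, object] triples and nothing else.

Text:
{snippet}
"""

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def extract_triples(snippet: str) -> list[tuple[str, str, str]]:
    raw = complete(EXTRACTION_PROMPT.format(snippet=snippet))
    return [tuple(t) for t in json.loads(raw)]

# Example input: a Kubernetes ConfigMap fragment mentioning another service.
snippet = """
apiVersion: v1
kind: ConfigMap
metadata: {name: checkout-config}
data:
  PAYMENTS_URL: http://payments.default.svc:8080
"""
# extract_triples(snippet) might yield:
#   [("checkout-config", "references", "payments.default.svc")]
```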
The challenge really is twofold. So you know you need to use this graph to get to a root cause, but it's fuzzy, right? As soon as you extract that information, you build that graph, it's out of date almost instantly because systems change so quickly, right? So somebody's deploying something, an IP address gets rolled, pod names change. And so you need it to be able to make efficient decisions with your agent, right? So just to anchor this,
Our agent is essentially a diagnostic agent right now. So it helps teams quickly root-cause a problem. So if you've got an alert that fires, or if an engineer presents an issue to the agent, it quickly navigates this graph and its awareness of your production environment to find the root cause. If it didn't have the graph, it could still do it through first principles, right? It could still say...
Looking at everything that's available, I'll try this, I'll try that. But the graph allows it to very efficiently get to the root cause. And so that fuzziness is one of the challenges, the fact that it's out of date so quickly, but it's so important to still have it regardless. There's a few things that you mentioned about...
how with the vision or the understanding of the graph, you can escalate up issues that may have been looked at in isolation as not that big of a deal. And so can you explain how that works a little bit?
So the graph is essentially, there's two, if you draw a box around the production environment, right, there are two kinds of issues, right? There's what you have alerts for and your awareness of. So you tell us like, okay, my alert fired, here's a problem, go and look at it. Another is we scan the environment and we identify problems. The graph is built in two ways. One is a background job.
where it's just, like, looking through your infrastructure and finding new things and updating itself continuously. And the other is when the agent's doing an investigation and it sees new information, and it just throws that back into the graph, because it's got the information it has just gathered and uses that to update the graph. But in this background scanning process, it might uncover things that it didn't previously realize were a problem, but then it sees that this is actually a problem. For example, it could...
process your metrics or it could look at your configuration of your objects in Kubernetes or maybe it finds a bucket and it's trying to create that node, the updated state of the bucket, and it sees it exposed publicly. So then it could surface this to an engineer and say,
your data is being exposed publicly, or you've misconfigured this pod and the memory is growing in this application and in about an hour or two it's going to crash. Yeah, so there's a massive opportunity for LLMs to be used as reasoning engines, where they can infer and predict an imminent failure and you can prevent that. So you get to a proactive state of alerting.
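As a simple illustration of that kind of proactive check (a deterministic stand-in, not the LLM-driven inference Willem is describing), you could extrapolate a pod's memory growth against its limit:

```python
# Illustrative sketch of a proactive check: fit a linear trend to recent memory
# samples and estimate when the pod would hit its limit. The numbers and the
# linear model are made up for illustration.

def minutes_until_oom(samples_mb: list[float], interval_min: float, limit_mb: float) -> float | None:
    """Return estimated minutes until the memory limit is hit, or None if flat/shrinking."""
    if len(samples_mb) < 2:
        return None
    slope = (samples_mb[-1] - samples_mb[0]) / (interval_min * (len(samples_mb) - 1))
    if slope <= 0:
        return None
    return (limit_mb - samples_mb[-1]) / slope

# Pod memory grew 300 MB over the last hour against a 2 GiB limit:
eta = minutes_until_oom([1200, 1300, 1400, 1500], interval_min=20, limit_mb=2048)
print(f"estimated OOM in ~{eta:.0f} minutes" if eta else "no growth trend")
```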
That is, of course, quite inefficient today if you just slap an LLM or a vision model onto a metrics graph or onto your objects in your cloud infrastructure. But there's massive low-hanging fruit there, where you distill a lot of those inferencing capabilities into more fine-tuned or more purposeful models for each one of these tasks. But how does the scanning work? Because I know that you also mentioned
The agents will go until they run out of credit or something, or until they hit their, like, spend limit when they're trying to do root cause analysis on some kind of a problem. But I can imagine that you're not just continuously scanning. Or are you kicking off scans every X amount of seconds or minutes or days?
Yeah, so there are different parts to this. For the background scanning and graph building, we try and use more efficient models, because of the volume of data. You don't use the expensive models that are used for, like,
you know, very accurate reasoning. Yeah. And so the costs are lower. And so you set it like a daily budget on that and then you run up to the budget. This is not something that's constantly running and processing large amounts of information. Think about it as like a human, right? You wouldn't process all logs and all information in your cloud infrastructure. You just get like a lay of the land. Like what are the most recent deployments? What are the most recent conversations people are having in Slack? Just get like a play-by-play.
so that when an issue comes up, you can quickly jump into action. You have fast thinking. You can make the right decisions quickly. But in an investigation, we set a cap. We say, per investigation, let's say, make it 10 cents or make it a dollar or whatever. And then we tell the agent, this is how much you've been assigned. Use it as best you can. Go find information that you can use through your tools. And then allow the human to say, okay,
go a bit further, or stop here, I'll take over. Wow. And so we bring the human in the loop as soon as the agent has something valuable to present to them. So if the agent goes off on a quest and it finds almost nothing, it can present that to the human and say nothing, or say, okay, couldn't find anything, or just remain quiet. It depends on how you've configured it. But it'll always stop at that budget limit.
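A rough sketch of how such a per-investigation cap could be wired up, assuming a hypothetical step() callable that performs one tool call and reports its cost; the thresholds and shapes are illustrative, not Cleric's.

```python
# Rough sketch of a per-investigation spend cap: the agent keeps taking tool
# steps until it is confident enough to report or the budget assigned to this
# investigation runs out. All names here are hypothetical.
from dataclasses import dataclass

@dataclass
class Finding:
    summary: str
    confidence: float  # 0..1

def investigate(issue: str, budget_usd: float, step, confidence_threshold: float = 0.7) -> Finding | None:
    spent = 0.0
    best: Finding | None = None
    while spent < budget_usd:
        finding, cost = step(issue, best)   # one tool call / reasoning step
        spent += cost
        if best is None or finding.confidence > best.confidence:
            best = finding
        if best.confidence >= confidence_threshold:
            break                           # good enough: hand off to the human
    # Below threshold: stay quiet, or ask the engineer whether to go further.
    return best if best and best.confidence >= confidence_threshold else None
```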
Yeah, the benefit of it not finding anything also is that it will narrow down where the human has to go and search. So now the human doesn't have to go and look through all this crap that the AI agent just looked through, because ideally,
If the agent didn't catch anything, it's hopefully not there. And so the human can go and look in other places first. And if they exhaust all their options, they can go back and try and see where the agent was looking and see if that's where the problem is. I think this comes back to the fundamental problem here. And maybe we glossed over some of those tools to solve the problem of operations and on-call operations.
No amount of Datadogs or dashboards or kubectl commands will free your senior engineers up from getting into the production environment. So what we're trying to get to is end-to-end resolution. When we find a problem, can the agent go all the way, multiple steps, which today requires an engineer's reasoning and judgment, looking at different tools, understanding tribal knowledge, understanding why systems have been deployed.
We want to get the agents there, but you can't start there, because this is an unsupervised problem. You can't just start changing things in production. Nobody would do that. Right now, if you scale that back from resolution, meaning change, like code-level change, Terraform, things that sit in your repos. If you walk it back from that, it's understanding what the problem is. And if you walk it back further from that, it's search space reduction, triangulating the problem into a specific area. Maybe not saying the line of code, but saying here's the service or here's the cluster.
And that's already very compelling to a human. Or you can say, it's not these 400 other cloud clusters or providers or services. It's probably in this one. And that is extremely useful to an engineer today. So search space reduction is one of the things that we are very reliable at and where we've started. And...
We start in a kind of collaborative mode. So we quickly reduce the search space. We tell you what we checked and what we didn't. And then as an engineer, you can say, okay, here's some more context, go a bit further and try this piece of information. And in that steering and collaboration, we learn from engineers, and they teach us, and we get better and better over time on this, like, road to resolution. Yeah, I know you mentioned memory and I want to get into that in a sec, but keeping on the theme of money and cost,
and the agents having more or less a budget that they can go expend to try and find what they're looking for. Do you see the agents get stuck in recursive loops and then use their whole budget and not really get much of anything? Or is that something that was fairly common six or 10 months ago, but now you've found ways to
counterbalance that problem? This problem space is one where small little additions or improvements to your product make a big difference over time because they compound. We've learned a lot from the coding agents like SWE-agent and others. So one of the things they found was that when the agent succeeds, it succeeds very quickly, and when it fails, it fails very slowly. So typically you can even use it as a proxy: if the agent ran for three, four, five, six, seven minutes,
it's probably wrong, even if you don't score it at all. And if it came to a conclusion quickly, like at 30 seconds, it's probably going to be right. Our agents sometimes do chase their tails, so we have a confidence score and we have a critic at the end that assesses the agent, so we try and not spam the human.
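One way to combine that runtime heuristic with a critic score into a notify-or-stay-quiet gate might look like this; the weights and cutoffs are invented for illustration.

```python
# Sketch of the "fast runs tend to be right, slow runs tend to be wrong" proxy,
# combined with a critic score, as a gate before messaging the human.
# The weights and thresholds are made up for illustration.

def should_notify(elapsed_seconds: float, critic_score: float) -> bool:
    """critic_score in [0, 1] comes from a separate LLM pass over the agent's findings."""
    # Long investigations are treated as a negative signal on their own.
    runtime_penalty = min(elapsed_seconds / 300.0, 1.0)   # saturate at 5 minutes
    combined = 0.7 * critic_score + 0.3 * (1.0 - runtime_penalty)
    return combined >= 0.6

print(should_notify(30, 0.8))    # quick run + confident critic -> True
print(should_notify(420, 0.55))  # slow run + middling critique -> False
```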
Ultimately, it's about attention and saving them time. So if you keep throwing bad findings and bad information, they'll just rip you out of their production environment because it's going to be noisy, right? That's the last thing they want. So yes, depending on the use case, the agent can go in a recursive loop or it can go in a direction that it shouldn't. So for us, a really effective mechanism to manage that is understanding where we're good and where we're bad.
So for each issue or event that comes in, we do an enrichment and then we build the full context of that issue. And then we look at, have we seen this in the past? Similar issues, how have we solved this in the past and have we had positive feedback? And so if we check the right historical context, we get a good idea of our confidence on something before presenting that information to a human, like the ultimate set of findings. But yeah, sometimes it does go awry. I'm trying to think, is the knowledge graph something that you are...
creating once, getting an idea of the lay of the land, and then stuff doesn't really get updated until there's an incident and you go and you explore more? And what kind of knowledge graphs are you using? Are you using many different knowledge graphs? Is it just one big one? How does that even look in practice?
We originally started with one big knowledge graph. The thing with these knowledge graphs is that they're often... The fastest way to build them is deterministic methods, so you can run kubectl and you can just walk the cluster with traditional techniques. There's no AI or LLM involved. But then you want to layer on top of that the fuzzy relationships, where you see this container has this reference to something over there, or this config map mentions something that...
I've seen somewhere else. And so what we've gone towards is a more layered approach. So we have, like, multiple graph layers, where some of them have a higher confidence and durability and can be updated quickly, or perhaps using different techniques. And then you layer the more fuzzy layers on top of that, or different layers. So you could use an LLM to kind of canvass the landscape between clusters, or from a Kubernetes cluster to maybe the application layer or to the layers below.
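A sketch of that layering idea: a high-confidence deterministic layer from walking the cluster, with lower-confidence LLM-extracted edges merged on top. kubectl_walk and extract_triples are hypothetical helpers, and the confidence values are arbitrary.

```python
# Sketch of the layered-graph idea: a deterministic layer built from kubectl
# output (high confidence), and a fuzzy LLM-extracted layer merged on top with
# lower confidence. `kubectl_walk` and `extract_triples` are hypothetical helpers.
import networkx as nx

def build_layered_graph(kubectl_walk, extract_triples, raw_snippets) -> nx.MultiDiGraph:
    g = nx.MultiDiGraph()
    # Layer 1: deterministic relations (owner references, service selectors, ...)
    for src, rel, dst in kubectl_walk():
        g.add_edge(src, dst, relation=rel, layer="deterministic", confidence=1.0)
    # Layer 2: fuzzy relations inferred by an LLM from configs, Slack threads, etc.
    for snippet in raw_snippets:
        for src, rel, dst in extract_triples(snippet):
            g.add_edge(src, dst, relation=rel, layer="llm", confidence=0.6)
    return g
```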
But using smaller micrographs has been easier for us from like a data management perspective. What are other data points that you're then mapping out for the knowledge graph that can be helpful later on when the AI SRE is trying to triage different problems? In most teams, there's an 80-20 gap.
like a Pareto distribution of value. So some of the key facts are often found in the same system. I think it was Meta that had some internal survey where they found out that 50 or 60% of their production issues were just due to config or code changes, anything that disrupted their prod environment.
So if you're just looking at what people are deploying, like you're following the humans, you're going to probably find a lot of the problems. So monitoring Slack, monitoring deployments is one of the most effective things to do. Looking at like releases or changes that people are scheduling and understanding those events. So having an assessment of that. And then in the resolution path, there's also the way to build the resolution. Looking at runbooks, looking at how people have solved problems in the past, like
Often what happens is like a Slack thread is created, right? So the Slack thread is like a contextual container for how do you go from a problem, which somebody creates a thread for, to a solution. And summarizing these Slack threads is extremely useful. So you can basically say like,
this engineer came into this problem, this was the discussion, and this was the final conclusion. And there's often, like, a PR attached to that. So you can condense that down to almost like a guidance or, like, a runbook. And attaching that to novel scenarios is useful because it shows you how this team does things. And they often contain tribal knowledge, right? So this is how we solve problems at our company. We connect to our VPNs like this. This is how we get
access to a system. These are the key systems, right? The most important systems in your production environment will be referenced by engineers constantly, often through shorthand notations. And if you speak to engineers at most companies, those will be the two biggest problems, right? One is you don't understand our systems and our processes and our context.
And the second one is that you don't know how to integrate with or access these, because they're custom and bespoke and homegrown. And so those are the two challenges that we face as agents. Basically, we're like a new engineer on the team, and you need to be taught by this engineering team. If you're not taught, then you're never going to succeed. I hope that answers your question. Yeah. And how do you overcome that challenge?
You just are creating some kind of a glossary with the shorthand things that are fairly common within the organization or what?
Yeah, so there's multiple layers to this. And I think this is quite an evolving space. Thankfully, LLMs are pretty adaptive and forgiving in this regard. So we can experiment with different ways to summarize, different levels of granularity. So we've looked at, okay, can you just take, like, a massive amount of information and just shove that into the context window, give it in a relatively raw form. And that works, but it's quite expensive. And then you show it a more condensed form and you say,
This is just the tip of the iceberg. For any one of these topics, you can query using this tool and get more information. And it's not always easy to know which one is the best because it's dependent on the issue at hand, right? Because sometimes a key factor, a needle in a haystack is buried one level deeper and the agent can't see it because it has to call a tool to get to it. So we typically err on the side of spending more money and...
just having the agents see it and then optimizing cost and latency over time. For us, it's really about
being valuable out of the gate. Engineers should find us valuable and in that value, the collaboration starts. And then it creates a virtuous cycle where they feed us more information, they give us more information, they get more value because we take more grunt work off their plate. And it's like training a new person on your team. If you see that, oh, this person is taking on more and more tasks, yeah, I'll just give them more information, I'll give them more scope. Yeah, I want to go into a little bit of the
ideas that you're talking about there, like how you can interact with the agent. But I feel like the gravitational pull towards asking you about memory, and how you're doing that, is too strong. So we got to go down that route first. And specifically,
are you just caching these answers? Are you caching, like, successful runs? How do you go about knowing that something was successful, and then where do you store it? How do you give the agents access to that, so they know that, oh, we've seen this before, yeah, cool, boom? It feels like that is quite complex.
In theory, you would be like, yeah, of course, we're just going to store these successful runs. But then when you break it down and you say, all right, what does success mean?
And where are we going to store it? And who's going to have access to that? And how are we going to label that as successful? Like I was thinking, how do you even go about labeling this kind of shit? Because is it you sitting there clicking and human annotating stuff? Or is it you're throwing it to another LLM to say, yay, success. What does it look like? Break that whole thing down for me because memory feels quite complex in that when you really look at it.
It is. A big part of this is also the UX challenge because people don't want to just sit there and label. I think people are just like, especially engineers are really tired of slop code and they're just being thrown this like slop and then they have to review. They want to create and I think that's what we're trying to do is free them up from support. But in doing so, you don't want to get them to like constantly review your work with no benefit.
So that's the key thing. There has to be interaction where there's implicit feedback and they get value out of that. And so I'm getting to your point about memory. So effectively, there's three types of memory. There's the knowledge graph, which captures the system state and the relations between things. Then there's episodic and procedural memory. So the procedural memory is like how to ride a bicycle. You've got your brakes here, you've got your pedals here. It's like the guide. It's almost like the run book.
But the runbook doesn't describe for this specific issue that we had on this date.
What did we do? The instance of that is the episode or the episodic memory. And both of those need to be captured, right? So when we start, we're indexing your environment, getting all these like relations and things. And then we also look at, okay, are there things that we can extract from this world where we've got procedures? And then finally, as we experience things or as we understand the experiences of others within this environment, we can store those as well.
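As plain data structures, those three kinds of memory might look roughly like this (illustrative shapes, not Cleric's schema), with an episodic entry of the kind you could distill from a summarized Slack thread:

```python
# Sketch of the three memory types mentioned above, as plain data structures:
# the knowledge graph (system state), procedural memory (the "how to ride a
# bicycle" runbook), and episodic memory (what happened on a specific date).
from dataclasses import dataclass, field

@dataclass
class Procedure:                      # the generic runbook
    name: str
    steps: list[str]

@dataclass
class Episode:                        # one concrete instance, e.g. a summarized Slack thread
    date: str
    problem: str
    actions: list[str]
    outcome: str
    feedback: str | None = None       # approved / rejected by an engineer

@dataclass
class Memory:
    graph: object                     # e.g. the resource graph sketched earlier
    procedures: list[Procedure] = field(default_factory=list)
    episodes: list[Episode] = field(default_factory=list)

memory = Memory(graph=None)
memory.procedures.append(Procedure("scale-under-load", ["check HPA limits", "raise replica count"]))
memory.episodes.append(Episode(
    date="2024-11-29",
    problem="checkout pods OOM during Black Friday traffic spike",
    actions=["scaled checkout deployment from 3 to 8 replicas"],
    outcome="error rate recovered within 10 minutes",
))
```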
We have really spent a lot of time, and most companies care about this a lot, securing data. So we are deployed in your production environment, and we only have read-only access. So our agent cannot make changes. We can only make suggestions. So all your data stays. You want to change that, right? That later we'll talk about how you want to eventually get to a different state, but continue. Yeah, yeah. We want to get you close to a resolution, but that's a longer path.
So we're storing all of these memories mostly as, I think the valuable ones are the episodes, right? Those are the instances, like if this happened or this happened and I solved it in this way. We had a Black Friday sale, the cluster fell over, we scaled it up, and then later we saw it was working, but oh, it's done. And we did that two or three times and we think that's a good pattern, like scaling is effective.
But that's all captured in the environment of the customer. Our primary means of feedback is monitoring system health post change. Oh, nice. We can look at the system and see that this change has been effective and we can look at the code of the environment, whether it's the application code or the infrastructural code. Basically, as like a masking problem, do we see...
Can we predict the change the human will make in order to solve this problem? And if they do then make that change, especially if it's our recommendation, then we see that they've actually greenlit what we've done, right? They've actually approved our suggestion. Yeah. That is not a super rich data source, because the change that they make may be slightly different, or we may not have access to those systems. A more effective way is...
So if we present findings and say, here's five findings and here's our diagnosis, and you say, this is dumb, try something else, then we know that was bad. So we get a lot of negative examples, right? So this is bad. And so it's a little bit lopsided. But then when you eventually say, oh, okay, I'm going to approve this and I'm going to blast this out to the engineering team, or I'm going to update my PagerDuty notes, or I want you to generate a pull request from this information,
Then suddenly we've got like positive feedback on that. In the user experience, it's really an implicit source of information, that interaction with the engineer. And that gets attached to these memories. But ultimately at the end of the day, it's still a very sparse dataset. So these memories, you may not have true labels. And so for us, a massive investment has been our evaluation bench, which is external from customers.
where we train our agents and we do a lot of really handcrafted labeling, where even a smaller data set gets the agent to a much, much higher degree of accuracy. So you want a bit of both, right? You want the real production use cases with engineering feedback, which does present good information. But the eval bench is ultimately the firm foundation that gives you that coverage at the moment.
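A minimal sketch of what such an eval bench could look like: hand-labeled scenarios with a known root cause and a crude scoring loop. run_agent is a hypothetical hook into the agent under test; in practice you would probably use a rubric or an LLM judge rather than substring matching.

```python
# Sketch of an eval bench: hand-labeled scenarios with a known root cause, an
# agent run per scenario, and a simple accuracy score. `run_agent` is a
# hypothetical hook into the agent under test.
from dataclasses import dataclass

@dataclass
class Scenario:
    description: str          # the injected fault / chaos
    labeled_root_cause: str   # hand-crafted ground truth

def evaluate(scenarios: list[Scenario], run_agent) -> float:
    hits = 0
    for s in scenarios:
        diagnosis = run_agent(s.description)
        if s.labeled_root_cause.lower() in diagnosis.lower():
            hits += 1
    return hits / len(scenarios)

bench = [
    Scenario("payments latency alert after config rollout", "bad connection-pool config in payments"),
    Scenario("checkout 5xx spike", "OOMKilled checkout pods due to memory limit"),
]
# evaluate(bench, run_agent) -> fraction of scenarios where the labeled cause appears
```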
But it feels like the evals have to be specific to customers, don't they? And it also feels like each deployment of each agent has to be a bit bespoke and custom per agent. Or am I mistaken in that one? The patterns are very... So the agents are pretty generalized. The agents get contextual information per customer, so it gets injected...
localized customer-specific procedures and memories and all those things. But those are layered on the base which is developed inside of our product, right? Like in the mothership or actually it's called the temple of Cleric. Nice. So we distribute like new versions of Cleric and our prompts, our logic, our reasoning, generalized memories or approaches to solving problems are imbued
in a divine way into the Cleric agents. It's a layering challenge, right? Because you do want to have cross-cutting benefits to all customers and accuracy driven by the eval bench, but also customization on their processes and customer-specific approaches. All right, so there's a few other things that are fascinating to me when it comes to the UI and the UX of how you're doing things. Specifically,
how you are very keen on not giving engineers more alerts unless it absolutely needs to happen. And I think that's something that I've been hearing since 2018, right?
And it was all on alert fatigue and how when you have complex systems and you set up all of this monitoring and observability, you inevitably are just getting pinged continuously because something is out of whack. And so the ways that you made sure to do this, and I thought this was fascinating, is A, have a confidence score. So be able to say, look,
We think that this is like this, and we're giving it 75% confidence that this is going to happen, or this could be bad, or whatever it may be. And then B, if it is under a certain percent confidence score, you just don't even tell anyone, and you try and figure out, is it actually a problem? And I'm guessing you continue working or you just forget about it. Explain that whole
user experience and how you came about that. Yeah, we realized, because this is a trust-building exercise, we can't just respond with whatever we find. And the agents, sometimes, especially during the onboarding phase, they don't have the necessary access and they don't have the context, right? And so at least at the start, when you're training the agent, you don't want it to just spam you with its raw ideas. And so the confidence score was one that
I think a lot of teams are actually trying to build into their products as agent builders. It's extremely hard in this case because it's such an unsupervised problem.
I'm trying to not get into the raw details, because there's a lot of effort we've put into that. Building this confidence score is a big part of our IP: how do we measure our own success? Come up with a divine name for the IP or something. It's not your IP. It's your... what was it when Moses was up on the hill and he got the revelation? This is not your IP. These are your revelations that you've had. Yeah.
But the high level is basically that it's really driven by this data flywheel. It's really driven by experience. And that's also how an engineer does things. But those can be, again, like two layers: the base layers of the product, but also experiences in this company. So we do use an LLM for self-assessment, but it's also driven and grounded by existing experiences.
So we inject a lot of those experiences, and whether those are positive or negative outcomes. And as an engineer, you can set the threshold. So you can say only extremely high-relevance findings or diagnoses should be shown. And you can set the conciseness and specificity. So you can say, I just want one sentence, or just give me a word, or give me all the raw information.
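A hedged sketch of grounding a confidence score in past experiences: retrieve the most similar past issues, weight their engineer feedback, blend that with the LLM's self-assessment, and only respond above the configured threshold. It assumes issues and past episodes have already been embedded as vectors; the weighting is invented for illustration.

```python
# Sketch: blend an LLM self-assessment with feedback from similar past issues,
# then compare against an engineer-configured threshold to decide whether to
# respond at all. Shapes and weights are illustrative only.
import math

def cosine(a: list[float], b: list[float]) -> float:
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b)) + 1e-9)

def grounded_confidence(issue_vec: list[float], self_assessment: float, past: list[dict]) -> float:
    """past items look like {"vec": [...], "feedback": +1 or -1}."""
    similar = sorted(past, key=lambda p: cosine(issue_vec, p["vec"]), reverse=True)[:5]
    if not similar:
        return 0.5 * self_assessment            # no history: discount heavily
    history = sum(p["feedback"] * cosine(issue_vec, p["vec"]) for p in similar) / len(similar)
    return max(0.0, min(1.0, 0.5 * self_assessment + 0.5 * (0.5 + history / 2)))

def should_respond(confidence: float, threshold: float = 0.75) -> bool:
    return confidence >= threshold              # otherwise stay quiet
```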
So what we do today is we're very asynchronous. So an alert fires, we'll go off on a quest, we'll find whatever information we can and come back. If we're confident, we'll respond. If not, we'll just be quiet. But then you can engage with us in an asynchronous way. So it starts async, and then you can kick the ball back and forth in an asynchronous way. And in asynchronous mode, sorry, in the synchronous mode,
It's very interactive and lower latency. We will almost always respond. If you ask us a question, we'll respond. So then the confidence score is less important because then it's like the user is refining that answer, saying, go back, try this, go back, try this. But for us, the key thing is we have to come back with good initial findings. And that's why the confidence score is so important. But again, it's really driven by experiences. Just to reiterate, why this is such a complex problem to solve
You can't just take a production environment and say, okay, I'm going to spin this up in a Docker container and reproduce it at a specific point in time. At many companies, you can't even do a load test across services. It's so complex. All the different teams are all interrelated. You can do this as a small startup with one application running on Heroku or Vercel. But doing this at scale is virtually impossible at most companies. So...
You don't have that ground truth. You can't say with 100% certainty whether you're right or wrong. And that's just the state we're in right now. Despite that, the confidence score has been a very powerful technique to at least eliminate most false positives. Or, when we know that we don't have anything of substance, just being quiet. But how do you know if you got enough information already
when you were doing the scan or you were doing the search, to go back to the human and give that information? And also, how do you know that you are fully understanding what the human is asking for when you're doing that back and forth? Honestly, this is one of the key parts that's very challenging. A human will say, the checkout service is down.
And you need to know that they are probably, maybe based on who the engineer is, talking about production. Or if they've been talking about developing a new feature, they're probably talking about the dev environment. And if you go down the wrong path, then you can spend some money and, like, a lot of time investigating something that's useless.
So what we do is, even at the initial message that comes in, we will ask a clarifying question if we are not sure about what you're asking, if you've not been specific enough. And most agent builders, even Cognition with Devin, they do this. Initially they'll say, okay, do you mean X, Y, and Z? Okay, this is my plan. Okay, I'm going to go do it now. So there is a sense of confidence built into these products from a UX layer. And that's where we are right now. It's
With ChatGPT or with Claude, you can sometimes say something very inaccurate or vague, and it can probably guess the right answer, because the cost is not multi-step, right? It's very cheap. You can just quickly fix your text.
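A minimal sketch of that clarify-before-investigating gate, assuming a hypothetical complete() LLM call and an illustrative prompt:

```python
# Sketch of a clarify-before-investigating gate: before spending the budget,
# ask an LLM whether the request is specific enough, and if not, return a
# clarifying question instead of launching the investigation. `complete()` is a
# hypothetical LLM call; the prompt is illustrative.
import json

GATE_PROMPT = """An engineer asked: "{request}"
Known environments: {environments}
If the request is ambiguous (e.g. which environment or service), reply with
{{"clear": false, "question": "..."}}; otherwise {{"clear": true}}. JSON only."""

def complete(prompt: str) -> str:
    raise NotImplementedError("plug in your LLM client here")

def triage(request: str, environments: list[str]) -> str:
    verdict = json.loads(complete(GATE_PROMPT.format(request=request, environments=environments)))
    if verdict.get("clear"):
        return "investigate"
    return verdict.get("question", "Which environment do you mean?")

# triage("the checkout service is down", ["prod", "dev"])
#   -> might return: "Do you mean checkout in prod or in dev?"
```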
But for us, we have to short circuit that and make sure that you're specific enough in your initial instructions. And then over time, loosen that a bit. As we understand a bit more what your teams are doing, what are you up to, you can be more vague. But for now, it requires a bit more specificity and guidance. Speaking of the multi-turns and spending money for things or trying to not waste money and going down the wrong tree branch or rabbit hole,
How do you think about pricing for agents? Is it all consumption-based? Are you looking at what the price of an SRE would be and saying, oh, we'll price at a percentage of that because we're saving you time? Like, what in your mind is the right way to base the pricing?
We're trying to build a product that engineers love to use. And so we want it to be a toothbrush. We want it to be something that you reach for instead of your observability platform, instead of going into the console. So for us, usage is very important. So we don't want to have procurement stand in the way necessarily. But the reality is there are costs and this is a business and we want to add value. And money is how you show us that we're valuable. So the original idea with agents was that there would be this
augmentation of engineering teams, and that you could charge some order of magnitude less, at a fraction of engineering or employee headcount, by augmenting teams. I think the jury's still out on that. I think most agent builders today are pricing to get into production environments or into these systems that they need to use to solve problems,
to get close to their persona. And if you look at what Devin did, I think they also started at 10k per year or some pricing, and I think it's now like 500 a month, but it's mostly a consumption-based model. So you get some committed amount of compute hours, and that is effectively giving you time to use the product.
For us, we're also orienting around that model. So because we're not GA, our pricing is a little bit in flux, and we're working with our initial customers to figure out, like, what do they think is reasonable? What do they think is fair? But I think we're going to land on something that's mostly similar to the Devin model, where it's usage-based. We don't want engineers to think about, okay, if there's an investigation, it's going to cost me X. They should just be able to run it and see, is it valuable or not, and increase usage.
But it will be something like a tiered amount of compute that you can use. So maybe you get 5,000 investigations a month, or something in that order. Okay, nice. Yeah, because that's what instantly came to my mind: you want folks to just reach for this and use it as much as possible. But if you are on usage-based pricing, then inevitably...
you're going to hit that friction where it's, yeah, I want to use it, but it's going to cost me. Yeah. Yeah. So you do want to have a committed amount set aside up front. And we're also exploring having a free tier or, like, a free band. Maybe for the first X you can just kick the tires and try it out. And as you get to higher limits, then you can say, okay, let's turn on the taps.
So we haven't even talked about tool usage, but that's another piece that feels like it is so complex, because you're using an array of different tools. And how do you tap into each of these tools, right? Because are you looking at logs, or are you
syncing directly with the Datadogs of the world? How do you see tool usage for this, and what have been some specifically hard challenges to overcome in that arena? Again, this kind of goes back to why this is so challenging, and especially one of the key things that we've seen is agents solve problems very differently from humans.
But they need a lot of the things humans need. They need the same tools. If you're storing all of your data in Datadog, we may not be able to find all the information we need to solve a problem by just looking at your actual application running on your cloud infra. So we need to go to Datadog. So we need access there. And so engineering teams give us that access.
If you then constructed a bunch of dashboards and metrics, and that's how you've laid out your runbooks and your processes to debug issues, we need to do things like look at multiple charts or graphs and infer across those in the time ranges that an issue happened, what are the anomalies that happen across multiple services, so if two of them are
spiking in CPU and are interrelated, so we should look at the relations between them. But these are extremely hard problems for LLMs to solve, even vision models. They're not purpose-built for that. So when it comes to tool usage, LLMs, or foundation models, are good at certain types of information, especially semantic ones: code, config, logs.
They're slightly less good at traces, but also pretty decent. But they really suck at metrics. They really suck at time series. So it's really dependent on your observability stack, how useful it's going to be. Because for a human, we just sit back and look at a bunch of dashboards and we can see like pattern matching instantly. You can see that these are spikes. But for an LLM, they see something different.
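Because raw time series are a weak spot, one common workaround (not necessarily what Cleric does) is to pre-digest a metric window into a short textual summary before it ever reaches the model, for example:

```python
# Illustrative helper: turn a raw metric window into a compact text summary
# that an LLM can reason over, instead of handing it the raw time series.
import statistics

def summarize_series(name: str, values: list[float], interval_s: int) -> str:
    mean = statistics.fmean(values)
    peak = max(values)
    spike = peak > 1.5 * mean                       # crude spike heuristic
    trend = "rising" if values[-1] > values[0] * 1.2 else "flat/declining"
    return (f"{name}: mean {mean:.1f}, peak {peak:.1f} over the last "
            f"{len(values) * interval_s // 60} min; trend {trend}; "
            f"{'spike detected' if spike else 'no obvious spike'}")

print(summarize_series("checkout_cpu_pct", [22, 25, 24, 71, 88, 90], interval_s=60))
# -> "checkout_cpu_pct: mean 53.3, peak 90.0 over the last 6 min; trend rising; spike detected"
```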
So what we'll find is over time, these observability tools at least will probably become less and less human-centric and may even become redundant. You may see completely different means of diagnosing problems. And I think the honeycomb approach, the trace-based approach with these high cardinality events is probably the thing that I put my money on as the dominant pattern that I could see winning.
Can you explain that real fast? I don't know what that is. So basically what they do, or what Charity Majors and some of these others have been promoting for years, is logging out traces, but with rich events attached to them. So you basically can follow, like, a request through your whole application stack and, um,
you can log out, like, a complete object payload at multiple steps along the way and store that in a system where you can query all the information. So you've got the point in time, you've got the whole, like, tree of the trace as well. And then at each point, you can see the individual attributes and fields. And so you get a lot more detail in that, versus if you're looking at a time series, you're basically seeing, okay, CPU is going up, CPU goes down. And what can you glean from that? You basically have to do
like, witchcraft, trying to find the root cause, right? But the Datadogs of the world have been making a lot of money selling consumption and selling that witchcraft to engineers for years. And so there's a real incentive to keep this status quo going. But I think as agents become more dominant, we'll see them gravitate to the most valuable sources of information, and then...
If you give your agent more and more scope, you'll see Datadog is rarely involved in these root causings. So why are we still paying for them? So I'm not sure what it's going to look like in the next two or three years, but it's going to be interesting how things play out as agents become the go-to for diagnosing and solving problems. Yeah, I hadn't even thought about that, how for human usage, it's like maybe Datadog is set up
wonderfully, because we look at it and it gives us everything we need and we can root cause it very quickly by pattern matching. But if that turns out to be one of the harder things for agents to do, instead of making an agent better at understanding metrics, maybe you just give it different data so that it can root cause it without those metrics, and it will shift away from
reading the information from those services. If you look at chess and the AIs, the Stockfishes of the world, that's just one AI that plays against grandmasters. Even the top players have learned from the AI. So they know that a pawn push on the side has been extremely powerful, or a rook lift has been very powerful. So now the
top players in the world adopt these techniques; they learn from the AIs. But that's also because there's always a human in the loop. We still want to see people playing people. But if you just leave it up to the AIs, the way they play the game is completely different. They see things that we don't. And I know I didn't address your question at the start fully, but these tools are grounding actions for us. So the observability stack is one of them. But ultimately, we build a complete abstraction to the production environment. So the agent...
uses these tools and learns how to use these tools and knows which tools are the most effective but
We also built a transferability layer. So you can shift the agent from the real production environment into the eval stack, and it doesn't even know that it's running in an eval stack. It's now suddenly just looking at, like, fake services, fake Kubernetes clusters, fake Datadogs, fake scenarios, a fake world. So these tools are an incredibly important abstraction. It's one of the key abstractions that the agent needs. And honestly,
Memory management and tools are the two big things that agent teams should be focusing on, I'd say, right now. Wait, why do you switch it to this fake world? Because that's where you've got full control. That's where you can introduce your own scenarios, your own chaos, and stretch your agent. But if you do so in a way where the tools are different, the worlds are different, the experiences are different, there's less transferability, when you then take it into the production environment, then suddenly it's going to fall flat. So you want the
like a real facsimile of the production environment in your tooling or your eval bench. And are you doing any type of chaos engineering to just see how the agents perform? Yes, that's pretty much what our eval stack is. It's chaos. We produce a world in which we produce chaos, and then we say, given this problem, what's up? What's the underlying cause? And we see how close we can get to the actual root cause.
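A sketch of that tool-abstraction idea: the agent codes against a Tool interface, and the same agent logic can be pointed at real production backends or at scripted fakes inside the chaos/eval bench. The shapes below are hypothetical, not Cleric's actual layer.

```python
# Sketch of a tool abstraction: the agent talks to a LogTool interface, and the
# same agent code can run against a real backend or a fake one in the eval
# bench without knowing the difference. Hypothetical shapes for illustration.
from typing import Protocol

class LogTool(Protocol):
    def query(self, service: str, since_minutes: int) -> list[str]: ...

class DatadogLogs:
    """Real backend: would call the Datadog API with read-only credentials."""
    def query(self, service: str, since_minutes: int) -> list[str]:
        raise NotImplementedError("wire up the real API client here")

class FakeLogs:
    """Eval backend: replays scripted logs for an injected chaos scenario."""
    def __init__(self, scripted: dict[str, list[str]]):
        self.scripted = scripted
    def query(self, service: str, since_minutes: int) -> list[str]:
        return self.scripted.get(service, [])

def run_agent(logs: LogTool) -> str:
    lines = logs.query("checkout", since_minutes=30)
    return "suspect OOM" if any("OOMKilled" in line for line in lines) else "no finding"

# In the eval bench the agent sees a fake world:
print(run_agent(FakeLogs({"checkout": ["pod checkout-7d9 OOMKilled"]})))  # -> suspect OOM
```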
Perfect opportunity for an incredible name, like Lucifer. This is the seventh layer of hell. I don't know, something along those lines. Yeah, we've got some ideas. The blog post will have some more layers on this idea, so TBD. I think one thing to note is that
This is a very deep space. So if you look at self-driving cars, lives are on the line, and so people care a lot, and you have to hit a much higher bar than a human driving a car. It's very similar in this space, right? These production environments are sacred. They are important to these companies. If they go down, or if there's a data breach or anything, their businesses are on the line. CTOs really care. The bar that we have to hit is very high. And so we take security very seriously. But
The whole product that we're building requires a lot of care and there's a lot of complexity that goes into that. So I think it's extremely compelling as an engineer to work in this space because there's so many compelling problems to solve, like the knowledge graph building, the confidence scoring, how do you do evaluation, like how do you learn from these environments and build them into your core product, the tooling layers, the chaos benches, all these things. And how do you do that in a reliable, repeatable way? I think that's the other big challenge is
If you're on AWS or GCP, or using this stack or a different stack, if you're going from e-commerce to gaming to social media, how generalized is your agent? Can you just stamp it out, or can you only solve one class of problem? And so that's one of the things that we're really leaning into right now: the repeatability of the product and scaling this out to more and more enterprises. But yeah, I'd say it's an extremely complex problem to solve. And even though we're valuable today,
true resolution, end-to-end resolution, is maybe multiple years out, just like with self-driving cars. It took years to get to a point where we've got Waymos on the roads. Yeah, that's what I wanted to ask you about, the true resolution, and how that just scares me to think about, first of all. And I don't have anything running in production, let alone a multimillion-dollar system. So I can only imagine that you would
encounter a lot of resistance when you bring that up to engineers? Surprisingly, no. There's definitely hesitation, but the hesitation is mostly based on uncertainty. Like, what exactly can you do? And if you show them, like, we literally can't change things. We don't have the access. Like, the API keys are read-only or we're constrained to these environments. And if you introduce change through the processes that they have already, so pull requests, and there's guardrails in place,
then they're very open to those ideas. I think a big part of this is really that engineers really hate and burn out on support. So they yearn for something that can help free them from that. But it's a progressive trust-building exercise. We've spoken to quite a lot of enterprises, and almost all of them have different classes of sensitivity. You have your big fish customers, for example, that
you don't want to touch their critical systems. But then you've got your internal Airflow deployments, your CI/CD, your GitLab deployment. If that thing falls over, we can scale it up, or we could try and make a change; there's zero customer impact. And so those are the areas where we're really helping teams today: the lower-severity or low-risk places where we can make changes. And if you're crushing those changes over time, an engineer will introduce you to the more
high value places. But yes, right now we're steering clear of the critical systems because we don't want to make a change that is dangerous. Yeah. And it just feels like it's too loaded. So even if you are doing everything right, because it is so high maintenance,
You don't want to stick yourself in there just yet. Let the engineers bring you in when they're ready, and when you feel like it's ready. I can see that for sure. Also, behaviorally, engineers won't change the tools they reach for, the processes, in a wartime scenario. When it's, like, a relaxed environment, they're willing to try AI and experiment with that and adopt that.
But if it's a critical situation, they don't want to introduce an AI and add more chaos into the mix, right? So they want something that reduces the uncertainty. Yeah, that reminds me about one of my major things that I notice whenever I'm working with agents or building systems that involve AI. The prompts can be the biggest hangups. And the prompts for me sometimes feel like
I just need to do it. Obviously, I'm not building a product that relies on agents most of the time, so I don't have the drive to see it through. But a lot of times I will fiddle with prompts for so long that I get angry because I feel like I should just do the thing that I am trying to do.
and not get AI to do it? I don't really have an answer for you. That's just the nature of the beast. Yes, exactly. I do want to just double-click and say everybody has that problem. Everybody struggles with that. You don't know if you're, like, one prompt change away or 20. And they're very good at making it seem like you're getting closer and closer, but you may not be. We found success in building frameworks to do evaluations, so that we can at least understand,
either from production environments or evals. There are samples, the ground truth, that make us know, or give us confidence, that we're getting to the answer. Otherwise, you can just go forever, right? Just tweaking things and never getting there. That's it. And that's frustrating, because sometimes you take one step forward and two steps back and you're like, oh my God. It's quite hard with content creation. I think it's harder in your space. I have all but
stopped using it for content creation, that's for sure. Maybe to help me fill up a blank page and get directionally correct, but for the most part, yeah, I don't like the way it writes. Even if I prompt it to the maximum, it doesn't feel like it gives me deep insights. Yeah, stopped that. But you're still on GPT-3.5, right? So...