
Everything Hard About Building AI Agents Today

2025/6/13

MLOps.community

People
Demetrios
Shreya
Willem
Topics
Demetrios: Diagnosing failures of AI systems in production is very hard and requires root cause analysis. We are focused on building an AI agent that diagnoses alerts by analyzing the production systems and the observability stack and distilling everything into a root cause or a set of findings. Willem: Production environments lack reliable ground truth, and verification is also a problem. What we have learned is to take the human out of the loop as much as possible and to build a fast learning loop that feeds production failures back into the eval system. I try to minimize dependence on chained agent steps so that a low-level error does not poison the overall result; even when the agent's trajectory is wrong, it sometimes stumbles onto useful findings. Shreya: Building data processing pipelines has all of the same challenges as building AI agents. Data processing is an interesting testbed for understanding LLMs and how humans and systems interact, because you cannot know all of your data in advance. Verification is very hard in data processing: you have to verify not only that the transformations are correct, but also that the extracted information actually exists in the data and that nothing was missed. I am teaching a course on AI evals about how to build trustworthy AI applications and the evaluations around them. Users prefer to think of their workflows as data flow pipelines rather than going from natural language to a pipeline. AI has a long tail of failure modes, and those have to be synthesized into concrete instructions. LLMs are great with detailed prompts and bad with ambiguous ones; you need to provide examples and detailed instructions to get the result you want.


Spend another minute, yeah, just in case, make sure we exhaust every last brain cell of mine, and that shouldn't be hard. So we should probably kick it off with: when you go to production, something fails, and

You're trying to figure out why it's failing, especially in your AI system. It's not the easiest thing. It's not trivial. And so I know both of you have seen some of this. You have thoughts. Let's kick it off with that. So we're focused on building an agent that root causes alerts in production. So an alert fires in your PagerDuty or Slack. We have an agent that can diagnose that by looking at your production systems, your observability stack,

planning, executing tasks, calling APIs, and then reasoning about that until it distills down all the information into a root cause or at least a set of findings. And so I can tee up a few challenges that we've run into. So one of them is

ground truth, like it's a lack of ground truth in the production environment. Unlike code or writing, there's not just this corpus of web information that you could download and then train on. So you need to figure out whether the agent even successfully solved the problem. Like, how do you know that? So you need to find ways, either user feedback, but sometimes the users don't know. Like if you go to an engineer and say, you know, is this the root cause? Oftentimes they'll say, this looks good, but I'm not sure if it's real.

And so verification is also then a secondary problem. Effectively, the thing that we've learned is you need to, as much as possible, get the human out of the loop, not just from doing the work, but also the review process and the labeling and feedback process. Because otherwise you succeed or fail, but you're still dependent on them, blocked on them to improve your agent or your AI system.

Ultimately, what you want to get to is a loop, like a learning loop of failure in production and incorporating that back into your eval system. And you want this to be very fast. This can't be in order of days or weeks or months. It needs to be ideally hours or minutes if you can.
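To make that learning loop concrete, here is a minimal sketch of capturing a production failure as an eval case the moment it is observed. Everything here (the EvalCase fields, the file layout, the function name) is an illustrative assumption, not Cleric's actual pipeline.

```python
# Hypothetical sketch of the failure -> eval loop described above. The EvalCase
# fields, file path, and helper names are assumptions for illustration.
import json
import time
from dataclasses import dataclass, asdict
from pathlib import Path

EVAL_SET = Path("evals/production_failures.jsonl")

@dataclass
class EvalCase:
    alert_id: str
    alert_payload: dict      # the alert exactly as it fired in production
    observed_output: str     # the root cause / findings the agent produced
    verdict: str             # e.g. "wrong_root_cause", "missed_signal"
    reviewer_notes: str      # whatever partial ground truth a human could give
    captured_at: float

def capture_failure(alert_id: str, alert_payload: dict, observed_output: str,
                    verdict: str, reviewer_notes: str = "") -> None:
    """Append a production failure to the eval set within minutes, not weeks."""
    case = EvalCase(alert_id, alert_payload, observed_output,
                    verdict, reviewer_notes, time.time())
    EVAL_SET.parent.mkdir(parents=True, exist_ok=True)
    with EVAL_SET.open("a") as f:
        f.write(json.dumps(asdict(case)) + "\n")
```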

- Yeah, so we can dive into some of those areas. - I'm Shreya, so I do two areas of research. I'm a PhD student, by the way. One is on data processing with LLMs. So if you have lots and lots of unstructured data like PDFs, transcripts, text data, how can you extract semantic insights and aggregate them and make sense of them? Turns out that building pipelines to do this

kind of has all of the same challenges as building AI agents. It's just AI agents for data analysis. So happy to chat more about that: what does it mean to build evals in this? What does it mean to incorporate tool use? How do you interplay methods like retrieval with having LLMs directly look at the data itself?

And of course, you know, what I think makes data processing a very interesting kind of petri dish for understanding LLMs and how humans and systems kind of all interact is that when you're doing data processing, you don't know all the data that you're trying to make sense of.

The AI is kind of telling you what's in the data. So verification is so hard here. You're not only verifying the transformation, like the insights that are extracted, but also that they were extracted correctly, that they exist in the data, that you didn't miss anything. But how do you know if the LLM missed extraction if you don't know what's in your data? So there are a lot of really interesting challenges there.

And then separately, because I feel like we have this pretty rich lens on how all of these problems work, I'm teaching a course with Hamel, Hamel Husain, on AI evals in general. How do you build AI applications that you feel confident in deploying? How do you build evals around them?

when you mentioned your question, what do you do when something fails in production? That's a horror story if you haven't even thought that out before you deploy it to production. You need to have some sort of blocking in place. Some metrics you're measuring. It can't just be like, oh, somebody said it failed, and then you're like, hands up in the air, no way to start. You shouldn't have gotten there in the first place. So really, how do you go from zero to being able to debug? It feels like there's a bit of...

overlap here with the fuzziness. You don't understand what's in the data, and so you can't really tell if what is coming out is correct. And then also the same on your side: you're like, we don't really understand if the root cause has been fixed or not. And so that fuzziness, in a way, like, how do you bridge that gap? How do you do it besides just, well, I think it looks good?

Yeah, one really reductionist way to look at it is that if we're in the production environment, like in a cloud infrastructure observability, it's like a lot of information. Time series, there's logs, there's code, there's chatter in Slack. And what you're really trying to do is really, first step is information retrieval, it's search. And you're taking very sparse information that's spread out everywhere and creating dense gems, like findings out of that, that is contextual to the problem at hand.

And I think what we've tried to do as much as possible, and I'm kind of curious to hear Shreya's take if this is possible in her use case. I actually want to see the difference between the use cases, is we flatten the dependency on the agentic parts of the workflow, essentially. If you build agentic steps on top of agentic steps, and if a base layer is wrong, then everything above that is wrong.

In our cases, sometimes it still works out because the agent can go in a trajectory that is wrong and still stumble upon a finding that is good and bring that back.

But I think I'm curious in her case, in your case, does that work or is it just then it's a catastrophic failure to the user? Most of the use cases we're focused on are more batch ETL style queries. So we're building the system called DocETL, where you can think of it as writing MapReduce pipelines over your data, but the LLM executes the map. So the map is not a code function. It's a prompt that...

that is executed per document. And the LLM also executes a reduce operation. So we do a group by and send all the documents to an LLM and it does some sort of aggregation there. In this sense, there's not really retrieval that immediately comes to mind. Retrieval can be thought of as an optimization if you can...

understand what your MapReduce pipeline, your problem expressed as MapReduce is to map out insights that are relevant and then aggregate and summarize how they are relevant to the query.

you could imagine some of these maps can be retrieval-like execution. You can use embeddings to filter out before you use an LLM to extract the insights from each document. And in that sense, I would think of

retrieval as an optimization, putting my database hat on, of course. So just curious, at what layer does the LLM come into play? Is it planning the MapReduce operation or is it also within the operation itself? It is within the operation. There is a layer on top that goes from completely natural language query to a pipeline of MapReduce filter

Think like Spark, but every operation is LLM. Okay, and that is the natural language that comes from the user. Yes, and in a sense, I find that we've done a few user studies on this. Nobody really wants to go from NL to pipeline because you want to think about your workflow as kind of a...

A data flow pipeline. And anyways, you're writing prompts for map, reduce, filter, cluster, like whatever it is, data operations that you want. So it is low code, no code. Where people really struggle is, you know, they've implemented, so say you implement your application, your pipeline in DocETL for your use case, which you can do, it'll just be really slow. Yeah.

It's how do you go from initial outputs to improving the pipeline.
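To make the shape of such a pipeline concrete, here is a minimal sketch of the LLM-executed map and reduce idea described above. This is not the actual DocETL API; call_llm stands in for whatever model client you use.

```python
# Minimal sketch of "LLM-executed MapReduce" over documents (not the DocETL API).
from typing import Callable, Iterable

def llm_map(docs: Iterable[str], map_prompt: str,
            call_llm: Callable[[str], str]) -> list[str]:
    """Run the map prompt once per document; each output is an extracted insight."""
    return [call_llm(f"{map_prompt}\n\nDocument:\n{doc}") for doc in docs]

def llm_reduce(insights: list[str], reduce_prompt: str,
               call_llm: Callable[[str], str]) -> str:
    """Group the per-document insights and ask the LLM to aggregate them."""
    joined = "\n---\n".join(insights)
    return call_llm(f"{reduce_prompt}\n\nInsights:\n{joined}")

# Usage sketch:
# insights = llm_map(transcripts, "Extract the main complaint in one sentence.", call_llm)
# summary  = llm_reduce(insights, "Summarize the recurring complaints.", call_llm)
```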

It's how do you even know that the agent has failed because of a downstream issue? And then when you know something has failed, how do you encode that observation back into the prompt? Right, like you kind of have vibes that are like, oh, this is not quite right, but to specify that, what we call bridging the gulf of specification, that is the hardest part that we see every user do. Just trying to come into play non-technical here: so if I'm understanding that correctly, it's basically like

I see something's wrong, but I don't exactly know how to fix it. And I'm trying to fix it by tweaking some prompts or I'm changing out a model or I'm doing a little bit of everything that I think could make a difference. Yep. And...

It may or may not work. It may or may not. There's that, and there's, especially in data processing, and I think this is probably also true for your case, there's just such a long tail of failure modes when it comes to AI. They're failing in all these bespoke ways, and keeping track of all of that and synthesizing them into, okay, what are the three concrete instructions I need to give the LLMs? LLMs are great if you have very detailed prompts.

They are horrible with ambiguous prompts. Great for 90% with an ambiguous prompt, but to get that last 10%, you're out here giving examples, giving very detailed instructions. Maybe that's just me. I don't know. I mean, edge cases is a really good one to get into, because I think this is where a lot of teams think, well, it's easy. I'll just build an agent, just fork this open source thing, and it'll just work. And you get to some level of performance, but getting to very high performance is really, really hard depending on the use case.

I'm interested in the HCI perspective as well, because I'm curious, what are the control surfaces that your users have? What do you get? Because if you think about Midjourney, sometimes the most frustrating thing is it just takes it and you just get back a random thing every single time. So it's like at the casino, just like the slot machine, right? Yeah.

And it can be so frustrating. But if you have inpainting, then suddenly you can take an image and improve it. Or even with the new ChatGPT image gen, you can specify using a spec what you want. And that's a different level of HCI. I mean, chat interfaces are just inferior, I guess, to something more structured.

So I'm curious, what are those interfaces today and where do you see that going over time? Yeah, okay, so now I have to go into a history lesson if you ask the HCI question. This is the perfect room for that. That is great. So Don Norman, who's written a lot about design, really is the first one who came up with these gulfs. The gulf of execution and the gulf of evaluation are the ones he came up with. And any product, not just an AI product, any interface requires users to kind of figure out

how to use it. They have some latent...

intent in their head. Like say it's something as simple as booking a flight. You booked a flight from Germany to here. You need to figure that out in your head, look at the Google flights interface and kind of map that out and execute it and then see what happened and then make sense of it. And this kind of loop was very hard to do in the 80s and 90s. We solved it. But what AI brings that's very new

is now this gulf of, I'm going to call it specification because it's just easier for me to understand in that way. It's broken down to two things. One is how do you communicate to the AI? Then how do you generalize from the AI to your actual data or tasks? Because you can have a really great specification. You have a great Midjourney prompt and that gets sent to the AI, but there's the gap between the intent that's well-specified,

and the AI, and it gets it wrong, and then you're like, "Oh my gosh, did I do something wrong?" I think the first thing that we have to do as a community, and this is something that we're talking about in research circles too, is just recognizing that these are two gulfs. Both need bridging, but the tools with which we have to do that need to be different. For the gulf of specification,

At least in the DocETL stack, we're building tools to help people do prompt engineering, to mark examples of bad outputs and automatically suggest prompt improvements. The goal is just to get a very complete spec.

And we're going to make no promises that that spec is going to run as intended because we all know LLMs make mistakes. Our idea is separate tooling like task decomposition, agentic workflows, better models are going to bridge that generalization gap. The gulf that you're talking about and how

We've had so many years of humans interacting with computers that we figured out a way. It wasn't that hard for me to book the flights. It would have been in the 80s. Exactly. The interface has been polished. But now when you throw natural language into it,

it is so much harder because A, we're not used to it and B, it is very fuzzy. Again, going back to that fuzziness of when I say a word, it may mean one thing, you may interpret it as another thing and the LLM can just, it's that going to the casino and you're playing slot machines. And I think this is partly not just the natural language,

element of it, but it is having a model of what the system is doing. And if you can't figure that out, then you're just stuck at, like, the starting line. So if you take, like, Cursor and you're doing some code, um, you know, software development, if you understand that it's just doing RAG over your code, then you have a better mental model, and you know, okay, I can introduce these files, I can index that. Or if you know that the tabs that you have open are, you know, weighted higher. So having that model, uh,

It affects how you prompt the agent. - Absolutely, and the history too. Like I know that I have something in my clipboard or I just did some edit. Yeah, I don't know why it feels like you have to know those things in order to steer. It's very difficult. - And I think that's a mistake a lot of people make is they, it's not that they want to infantilize, but they want to abstract this from users and just think, "Okay, this is just like a black box. Don't worry about it." But it makes it much harder to use their products. And so another thing that we were experimenting with is,

Sometimes giving the user a little bit less and giving them more UX affordances that allow us to get more feedback from them. But these are orthogonal, so it may make the product slightly harder to use. So for example, we might give you... These are the key findings.

But in, like, a Midjourney style, we will give you buttons that say expand for more information or search further using this finding. And so if we give you something that's very useful, if you click on that and expand, we know, okay, this is actually good. If you keep ignoring something that we give you, then we know this is bad. And so there's some implicit feedback that we get back. And I'm not sure if you're incorporating anything like that or... Yeah, we're very big on like hierarchies of feedback. So...

the easiest thing to do is to get binary, click or don't click, yes or no. And then to be able to kind of drill down on that with open-ended feedback. One of the things that we did that was quite successful to help people write prompts for DocETL pipelines was always have an open-ended feedback

box anywhere where you can kind of highlight the document or output that you think is bad and just stream of thought why it's like a little bit off. Oh, nice. And you can color code that or tag that. That lives in a database. Wow. And any time you invoke the AI assistant or you... We also have a prompt improvement feature

which can read pretty much all of your feedback and suggest targeted improvements. So the prompts are visible to the user? Yes, the prompts are visible. I don't think we're at the state yet where there's anything better than writing prompts for steering, especially for data processing.
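As a rough illustration of the feedback store and prompt-improvement flow just described (the table layout, column names, and prompt wording are my assumptions, not the actual implementation):

```python
# Hedged sketch: highlight a span of a bad output, attach a free-form note and a
# tag, persist it, and later bundle all feedback into a prompt-improvement request.
import sqlite3

conn = sqlite3.connect("feedback.db")
conn.execute("""CREATE TABLE IF NOT EXISTS feedback (
    output_id TEXT, highlighted_span TEXT, note TEXT, tag TEXT,
    created_at TIMESTAMP DEFAULT CURRENT_TIMESTAMP)""")

def record_feedback(output_id: str, span: str, note: str, tag: str = "") -> None:
    """Store one piece of highlighted, stream-of-thought feedback."""
    conn.execute(
        "INSERT INTO feedback (output_id, highlighted_span, note, tag) VALUES (?, ?, ?, ?)",
        (output_id, span, note, tag))
    conn.commit()

def build_improvement_request(current_prompt: str) -> str:
    """Bundle all stored feedback into one request for suggested prompt diffs."""
    rows = conn.execute("SELECT highlighted_span, note, tag FROM feedback").fetchall()
    notes = "\n".join(f"- [{tag}] '{span}': {note}" for span, note, tag in rows)
    return (f"Current prompt:\n{current_prompt}\n\n"
            f"User feedback on bad outputs:\n{notes}\n\n"
            "Suggest targeted edits to the prompt as a diff.")
```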

I think if you have a better scoped task, it's possible that they don't have to write the prompt. Like in your task, you're very specific. You're searching people's logs, helping them do root cause analysis. But say you were using DocETL to write those pipelines, you absolutely have to write the prompt. I guess there's always just a trade-off of how much control you give the user. So we give them like a control surface, like text input where they could...

inject some guidance, either globally or contextually. So on a specific category or class of alert, we can inject some guidance. So maybe there's like an SLO burn rate alert. Then you can attach something contextual that they would say, always check data dog metrics, always check the latest live conversations, always check the system because it's always complicit or involved in some way to the failure.
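A toy sketch of what that kind of contextual guidance surface could look like; the alert classes and guidance strings are invented for the example, with the SLO burn rate entry just echoing the guidance mentioned above.

```python
# Illustrative sketch: guidance keyed by alert class is injected into the agent's
# instructions alongside any global guidance. Names and strings are made up.
GLOBAL_GUIDANCE = "Prefer primary data sources over Slack chatter."

GUIDANCE_BY_ALERT_CLASS = {
    "slo_burn_rate": (
        "Always check Datadog metrics, the latest live conversations, "
        "and the upstream system that is usually involved in this failure."
    ),
    "pod_crash_loop": "Pull the last 200 container log lines before anything else.",
}

def build_agent_instructions(base_prompt: str, alert_class: str) -> str:
    """Compose the agent prompt from the base prompt plus global and contextual guidance."""
    contextual = GUIDANCE_BY_ALERT_CLASS.get(alert_class, "")
    return "\n\n".join(p for p in (base_prompt, GLOBAL_GUIDANCE, contextual) if p)
```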

And sometimes you need the users to give you that guidance. There's so much context that's just latent in their heads that they need to somehow encode in your product. What it reminds me of is when you download a new app on your phone and you have these moments going back to there's that gap of, I don't know how this app works. And so you kind of swipe around and you press buttons and you figure out, oh, okay, cool. I think I know what's going on here. And so...

but now we're working with text and we're working with prompts. And so being able to really figure out what's the interface that's going to best

interact with the human and the text. And so that expand button is huge. I really like that, because then it gives a signal. It kind of just gives you a little bit, and then you get to know, all right... But the thing is, you can't give them all the information. Then you give them like a summary, just enough to, like, get them to kind of click more. Yeah, it's a teaser.

Yeah. And many times it's actually not even useful. People don't click on it, but that's very good signal anyway. Yeah. Yeah. That's helpful. And same with the like highlighting the text and being able to just stream whatever you want. Yeah. And what I'm thinking about is when you have that, like highlighting the text and then streaming of consciousness, how are you then incorporating that back into the system so that it learns and gets better from it?

And that is why I think you need AI assistance there, because users cannot remember all of the long tail of failure modes. That is why we have a database of feedbacks. And when users need to improve prompts or want to do something new, we can suggest a prompt to them, because we already know things that they care about, because we read their feedback. And we always provide a suggestion.

or diffs to their prompts, and then they can click accept or they can click reject. So it's reincorporating it into the prompt and also the eval set? Yes. Yeah. That's interesting. Eval sets right now, we don't have a good workflow for it. We're still...

playing around with what we think is best. Like generating synthetic data or having users, currently users would have to bring their own data right now. - It's fascinating to be like, all right, cool. There's insights. There's thousands of insights here because of all this toying around with the output that I've had.

My precious human time has been taken to leave insights here. Now, what do we do with them to make sure that they are incorporated into the next version of whatever I build? And so having an assistant and being able to suggest, well, you might not want to do that, because remember, you said you cared about this, and whatever prompt that didn't work out well. And we find that, I think,

It's not just data processing that has this problem. It's also code generation. It's also, like, Midjourney or image generation. But when you're starting out with a session, you almost always do some exploratory analysis. There's a term for this in HCI called epistemic artifacts.

And it comes from how artists use tools. Like if they are given new paints or a new medium, they're going to play around with that before they paint their thing. And all of the interfaces that I think we build in this new Gen AI interface

like arena, for lack of a better term, need to have the ability to quickly create epistemic artifacts. Like when you're in Cursor, you want to try something out and you want to be, if it doesn't look good or if it doesn't work, you want to be able to toss it and keep moving forward. I think that's one of the big failures today. Yes, it's one of the biggest failures. Some of the costs are just too high to experiment and people just kind of back out of the playground or...

There's often just not a playground available. It really makes me think that this is a UI/UX problem. And it's very much in the product of how do we make it as easy as possible for people to not have these, oh, I know a little black magic on Cursor because I understand it's RAG and it's the tabs. And if I copied and pasted something, then it does better. That is something that should not be, like,

gate-kept, right? Right. So how do you, in your product, design something that is very much

keeping that away from, like, oh, you have to know, and if you know, great? It's very difficult. What really muddles these kinds of IDEs or interfaces is that there are so many entry points into bridging the gulfs, right? There's the gulf of specification, where you need to externalize your intent as fully as possible, and there's the gulf of generalization, which is

You need to make sure your prompt works, like regardless of RAG, regardless of whatever hyperparameters that were selected. And right now, humans are relied on to give those hints for both the gulfs. Like you have to know how RAG works so you can give the appropriate hints to bridge the generalization gulf. Like that's crazy. Yeah. Well, one of the things that we intentionally made a

decision to fork off from what we originally started with, which was, like, a Slack, uh,

like an agent that's in your Slack, a teammate essentially. And we actually started on the help desk, and we were fielding questions from engineers. So one engineer would be, like, a platform team supporting another engineer coming in with a question, and getting in between engineers was very hard because there's a lot of chit chat, there's a lot of back and forth. Often the questions are, they need immediate answers. They're something that somebody spent a whole day on. And this was a very synchronous engagement.

Where with the alert flow, it's a lot more asynchronous. It's a system-generated alert. You can investigate that on your own time. And so if you take the Cursors and the Devins of the world, it's kind of similar, right? With Cursor, you're in the loop. It's the most important thing of your day that you're trying to solve with Cursor. With Devin, it's different because you're saying, code this thing for me, but it's like a side task you give to an intern, basically. And I think in our case, we're also trying to take the grunt work away from engineers that they're not...

immediately trying to solve. So it's more like an ambient background agent that's just doing all this work for you. And if you check in, you're like, well, okay, it solved like 20 alerts for me. I don't have to go and look at those. How do you think about this like dichotomy or like the difference between these two worlds? Because I think these use cases are actually different. They're very different and you can't rely on the human like hints anymore.

to work because you're not the premier IDE for human attention. Exactly. And also going back to this, like, how have you thought about, I don't want to have it so that people need to know this secret sauce for Cleric to work, right? Or like some people have a better experience because they know these things and they have an understanding of how AI works or just RAG systems or whatever. And

You want to avoid that at all if you can, right? So engineers often come to us and say, "Wait, so you're going to be better at solving these problems than me?" And they spent years at these companies, and we don't claim that. We just see there's so much low-hanging fruit in terms of automation that we could automate away for you with these agents, and then just lay up or tee up all the key things you need to make the decision. So we want to lean into their domain expertise. They are the experts, and we just want to make it easier for them.

So what they should be assessing is the findings and metrics and the logs and the dependency graphs and all those things that they already know well. We don't want them to have to understand the internals of our product. I think that's a failure. But also, because we're not a synchronous flow, it's basically like the AI is leading itself to an answer, and it'll bail if it can't find something and continue up to a certain point if it is on the right path.

But for the most part, you're just producing artifacts that they can understand and intuit already. The stuff that takes a long time is just maybe switching from one...

tool to the next tool, gathering the data, trying to put two and two together. And then once you have a picture, you can start to really use your expertise. But all of that before you get to the place where you have that picture, that's where you're saying we can automate the shit. Exactly. Often engineers are just, like,

dropping into a console or a terminal and kubectl-ing. And it's the same thing every single time. And there are, of course, black swan events and really tough problems that maybe even an AI can't solve for you. But there's so much gunk and base mechanical rote work that engineers have to do. And remember, they have a full-time job in a lot of cases to write software to actually make the business successful. It's not just debugging and routine investigations in the background, right? Yeah.

I really appreciate like these. I keep forgetting the word that you're using. I'm using like the valley, but you're using... Gulf. Gulf. You know, just knowing that there's a gap. Doesn't matter what the term is. Recognizing that is helpful. Yeah, there is such a gap. And now I'm going to start thinking about like all the places that there's gaps. And so maybe there's...

Other places that you've been thinking about, because you said there was two gaps and one we went over, I think, heavily. What was the other one again? Well, I think now there's three gaps. Before AI, two. Three being specifying, then generalizing from your specification to your actual task or data. Then the third is comprehension, understanding

what the hell happened? Like, how do you tame it all? How do you even look at the long tail of AI outputs? How do you look at your data? Like, did it do it right? Like, all validation falls in that comprehension, and, like,

We can go down a deep, dark rabbit hole. But it's a really big cycle, right? Like after you've comprehended, then you need to specify again. And then that specification needs to generalize. And just bridging these three gulfs is so, I think every IDE is going to have this problem. Every product that does something moderately complex is going to have the problem.

I'd love to also get into the edge cases if we can. One of the things, we were speaking to Adam Jacobs and he was also talking about this problem in DevOps and in a lot of spaces we know that there's like a model collapse effect. Model, quote unquote, whatever your AI system is. There's no guarantee that you can just keep adding evals and scenarios and improve your system to get to 100%. At a certain point you may like...

like solve one problem and then, you know, another problem rears its head. It's whack-a-mole. It's whack-a-mole. And so I don't know if you've got any techniques or, you know,

from your products that you've been building? Yeah, I think it's about saturation. It is not 100%. It's about building up the minimal set of evals to the point where you're adding more, trying new things, and nothing changes. Like, at that point, you're done. You can't do any better. Like, you've got to wait for a new GPT model.
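One possible way to encode that saturation heuristic; this is an assumption about how you might track it, not a description of either speaker's system.

```python
# Keep the overall eval pass rate after each change; if several changes in a row
# stop moving the number, treat the agent as saturated for this model generation.
def is_saturated(pass_rates: list[float], window: int = 5, epsilon: float = 0.01) -> bool:
    """pass_rates: overall eval pass rate after each change, oldest first."""
    if len(pass_rates) < window + 1:
        return False
    recent = pass_rates[-(window + 1):]
    return max(recent) - min(recent) < epsilon

# e.g. is_saturated([0.61, 0.70, 0.748, 0.750, 0.752, 0.749, 0.751, 0.750]) -> True
```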

I ask this question at every single HCI talk that I go to because I think there's so much work in trying to steer, trying to make models or agents better, but there's a ceiling. And I don't think we've figured out yet what defines having hit the ceiling. I'm really curious if you have heuristics for your use case on that. We don't really know where the ceiling is. We know that if you can set the right expectation with the user, so what we do is we...

We have a lot of data on which types of alerts we can attack and actually solve. That's good. And so we start with something like,

We ask them to, or our customers, to export all their alerts over the last two weeks. And then we try and identify ones that we've solved either in our evals or in other companies or for other teams where we are very confident. So there's almost three buckets. The first bucket is the ones you're very confident you can solve if you deploy us. And the second bucket is the ones that you need to learn in production. And you're pretty confident you can learn those.

But you don't know at what point it gets to the third bucket, which is like you can never solve these. And it's very customer dependent. But from a go-to-market standpoint, the first bucket is the only one that really matters. If that's big enough, then they're like, okay, it's valuable to have you in prod. And that's what gives you the right to stay in their environment.

And then the second one is the one you want to expand and really, like, prove your worth. And try and find where the third one, where's that line where you cross over to the third. But I think you're unique in thinking about it this way, because many people don't even know what the first bucket is or, like, have a characterization of it. Yeah.

It's like anything goes with AI, right? You could ask it anything. It'll give you an answer. It's honestly the worst thing that you can say because a customer will come to us and say, okay, if I deploy you, what can you do for me? And if you say, well, we'll figure it out, then they're like, okay, that's not good enough. I need to know exactly what you can solve. Just put us into production for a few weeks and I'll tell you exactly what we can do for you. Yeah, that's awesome.

Exactly. So that's what we focus heavily on, really nailing that first class of alerts in our evals and then getting the flywheel going, this learning loop, to get the second bucket, the learned set of alerts, really high. I like this framing. I think I'm going to parrot it to people and say Will is doing this.

But I have seen the problem is just like not knowing, not even having a reasonable idea of what the ceiling is and not knowing where you are right now. And it really is not about numbers. It is about vibes. I'm very pro vibes, but about having like some confidence band around like numbers per vibe.

And not like overall we've hit 97% or whatever. Every time I read some case study or like some... I also do a little bit of consulting or some client might say like, oh, we're at 94% accuracy. And I'm like, what does it even mean? Yeah, prove that. Yeah, like what I want to know is what are the three to five vibes that you like really are trying to nail? Or like if you have well-defined accuracy metrics, like time to closing or like... And then...

give me your confidence bands on a sample. And then just make sure that we're kind of in those. I really like this idea of, hey, there's that third band that we're trying to figure out where the ceiling is. We don't know. We don't know exactly where it is.

Have you found it's very logarithmic on the amount of time and you get to this point of diminishing returns and you start to be like, you know what? I might just have to give it up on this one. Yeah, all the time. But it's fine. Like AI can still be impactful if it's not 100% reading your mind all the time. What makes us good at using AI is knowing when we can use it. Yeah. As you said, like as long as that class is big enough, like...

Yeah, there's plenty of work to automate. Yeah, as long as you're not wasting engineers' time. If this is a productivity-focused product and you can be quiet if you are unsure about something, then it's okay. Because then if you do prompt them, it just needs to be valuable. It's like having infinite amount of interns, but they won't come to you if they don't have anything good to say or ask. Yeah, that goes back to what I think about a ton is just like how...

disrespectful LLMs are of my time. It's like I don't need a five-page report when only one of those sentences was actually valuable to me.

And I think with a lot of products there, it's a very in-your-face AI. So it's like, AI, there's stars and glitz everywhere. And that's very, like, ubiquitous. But I think what people would really want is just, like, work being done for them in the background, right? Your to-do list is getting cleared, whether it's Jira or Linear or whatever. But are you trying to help folks with their prompting too, or do you abstract all that away and you just give them the alerts?

No, we give them the findings on an alert, or a root cause. So we don't give them access directly to tune the prompts. But we do give them control surfaces so they can inject some guidance, and it can be contextual as well. But they don't have full control over the agent. And is that because of, going back to this, it's like what you were saying earlier where

They don't need to know the black magic of how to work with AI. They just need to see, because it's a totally different persona. You don't want them having to dig through the prompts now to figure out if that's the correct way to go about it or if there's a better way. We actually did do that at the start. And we realized that what happens is every customer would then...

like, add idiosyncratic instructions that don't necessarily generalize. And so that's one of the... You kind of want to build muscles that benefit everyone, and it's kind of like a compounding effect. And what we realized is that

if we have that control on our side, it's harder at the start because the users have a poorer mental model, but it's better for us over time. And so to this point of model collapse, we found that you can get to a good plateau or baseline that everyone benefits from, and the general models, like the GPTs, the Claudes and the Sonnets, they help a little bit, but

But because these environments are so idiosyncratic and there's no public dataset, I can't just ask you to export your whole company's, like, infra, right? It's not a thing. So we contextualize what the agent can do, but we do it centrally. So based on performance metrics, we will say, oh, your agent has a certain set of skills that's different from another agent, but we can subselect those.

So maybe if you're running Datadog and Prometheus and you're running on Kubernetes and it's all Golang, we will not have an agent's skill set that's for Python or these skill packs that are specific to technologies that don't apply to you. So we'll try and delete or garbage collect memories or instructions as much as we can to simplify what the agent can do. So we've got a bunch of these techniques that...

If you look at our base agent, our base product is the same, but we do contextually modify that slightly, so the performance numbers for each customer can be higher. But that's also risky, because then the measurements aren't always apples to apples, right? The other thing about exposing prompts, if you're building an application, is often when something is a little bit off,

the first thing that people do is like go and like do some mucking around in the prompt. And it makes it worse. And it's like, you've already gone through years of doing that, right? Like they just saw the prompt for the first day and then like, it doesn't make sense, right? It's like exposing, in a way, I really think prompts are like code. So it's like exposing your code base to the user and it's like,

No, you don't need to see how the sausage is made. I think it's, like, not a question of, like, proprietary secret sauce or whatever, but it's just, like, don't invite them to do something that's, like, bad for them. And also, if you give them that surface, then you can't really take it away. Yeah, they'll feel like, well, I've just spent this time doing this thing. Exactly. Yeah, and I can see a world where they spend so much time and it ends up being like

wow, this actually was a waste of time. And this tool that was supposed to save me time is now taking me more time because I'm tuning these prompts and I'm trying to make the system work the best that it can. And so they're going under the hood, wasting time in that regard. That's true. But sometimes users feel higher sense of attachment and affinity to products that they've customized. If you change the colors, you put in dark mode, all those things before you know it, you'd like this product because it's yours, right?

So there's a balance, but I wouldn't expose the prompts necessarily. Maybe just some color sliders or something. It's so much easier. And actually, when you were talking about the different agent attributes, it reminds me of when you're playing any kind of sports game.

And you have the sports player and it has that circle and what they're good at and what they're bad at. That's what I want to see with the different clerics that, all right, you have this agent and it's good at Golang, but Python, nope, the skill isn't Python, but it doesn't need it. That's the good part. Yeah.

Yeah, you should see our latest marketing. We have some things coming out soon that is in that vein. Yeah? That's so cool. We should get you on the brand team. I'd love it. But anyway, what else were we going to talk about? I remember there was more items on the docket. Well, I was also kind of curious from Shreya's point of view, if you see failures in production from, let's say, users, something they're stuck with, do they give you a data dump or how does that work? And what's that end-to-end cycle for you? How quickly can you go from failure back into prod with a new version? Yeah.

We don't have big AI pipelines. We're building scaffolding for people to write AI pipelines. So it's very software engineering-esque. People will say that there is a bug, like there's an infinite loop here. And I'm like, okay, I'll fix it. And it's like TypeScript. Yeah.

I don't have anything great there. It's on them to figure out that whole loop. I have some research projects. It's an open source research project. Everything comes through Discord or GitHub issues or whatnot. For clients that I work with, with consulting, because they're actually companies, I find that they're actually stuck in just even being able to

detect whether something is wrong. It's not even a question of their users complaining. They're so early stage where they're just like, help, did we get this right? Should I deploy that? But maybe that's just the people who sign up for consulting or people who just don't have the comfort to even get there.

But I think every single person that I talk to, you got to have some metrics. Whether or not they correlate very strongly with what users think and say, that's fine. But having something there to look at is a first step. And the other thing that it indirectly gives you is if you've already instrumented your code, it's much easier to add new evals.

But people think about like, oh, like having to add evals to there. It's a huge thing because yeah, it is like adding observability and instrumentation is so hard. I think there's like, there's like production failures and then there's two parts. There's one is how do you assess what happened? And so we started with traces. So individual run, you can see what the agent did, like which tools it calls, what the reasoning was, the prompts and all those things.

And then the next step for us is how do you convert that into like a new scenario during a test? The problem that we had is that it's not, it's so manual to do the trace reviews. And so you drop into these traces, it's super low level. And so we built like a summarization or like a post-processing process that's completely, well, it's like part AI, part like deterministic, but it collapses and condenses and clusters all of these things together along many dimensions. So we'll see,

for the major groups of, and we focus on, the tasks where most failures happen. Let's say one task is analyze the logs, you know, in Datadog, or the next one is look at the conversations in the alert channel over the last couple of hours or days. There's specific tasks that frequently reoccur.

And then we try and cluster those. And then we look at metrics or, you know, did the agent successfully call these APIs? I mean, even from like an engineering standpoint, are there like API failures? Did it go into like loops? Did it get distracted? Was it efficient in finding or solving these tasks? And like many of these metrics, you don't need humans for feedback. It's completely like you can just parse the information.

And then from that, we built these heat maps. And then the heat maps, the rows are essentially tasks and what the agent did, and the columns are metrics. And then you can see it just lights up. The agent really just sucks at this one thing. It sucks at querying metrics in Datadog. You're not the only person who's doing the heat maps. And when you see it, you're like, well, how did we not do this before? You're the third person who's told me that.
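A small sketch of how such a heat map could be assembled from parsed traces; the task names, metric names, and record shape here are assumptions for illustration.

```python
# Parse per-run trace records into (task, metric, passed) rows, then pivot so
# rows are tasks and columns are metrics; low values light up the weak spots.
import pandas as pd

records = [
    # In practice these rows would come from parsed traces, not be hand-written.
    {"task": "analyze_datadog_logs",  "metric": "api_call_succeeded", "passed": 1},
    {"task": "analyze_datadog_logs",  "metric": "no_tool_loop",       "passed": 0},
    {"task": "read_alert_channel",    "metric": "api_call_succeeded", "passed": 1},
    {"task": "query_datadog_metrics", "metric": "api_call_succeeded", "passed": 0},
]

df = pd.DataFrame(records)
heatmap = df.pivot_table(index="task", columns="metric", values="passed", aggfunc="mean")
print(heatmap)  # e.g. the agent consistently failing at querying metrics in Datadog
```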

Yeah, and then it's easy to write an eval, because you're like, okay, I just need an eval that can, you know, help us. And then the next problem we ran into was creating these evals sometimes took, like, a whole week, because you need live prod infra. And so then we built, uh,

This is a deeper topic, but like a simulation environment that modeled the production environment. So I was asking if you can get data dumps from your users, because in our case, we couldn't. And so we had to innovate on the eval layer, the simulation layer. That's very interesting. People do send their pipelines and some data to us. So it's very easy to debug for us. But I think the simulation idea is super interesting. Yeah, I really like that. But it also feels very bespoke.

You mean for the use case? Yes. And not even just like data dog simulation. Data dog log retrieval simulation is different from like, I don't know, like whatever other agents that you guys have. Like you probably have to build like an environment per agent to some extent. Or build some spec per agent. Sure. So what we do is a little bit more akin to what like the SWE agent does with a SWE kit.

And so you can have like, there's a lot of similarity in the observability layer. So there's metrics, but there's like 20 different metric systems. But the idea of like, a line graph or seeing metrics is not different to time series, right? Yeah. So you want your agent to operate with like the layer above that system.

Of course, there's idiosyncrasies of the technology, but if you abstract that away, then there's transferability between those. So if you're good at, like, Datadog logs, you're probably good at OpenSearch logs. You don't have to change much. You don't have to change much. That's nice. And so often you can get dropped into, you can just plug in an MCP for a new logging system. As long as the integration works, it'll have good performance. It's not going to be 10 out of 10. There may be some unique things. Having a simulation also makes it better. A lot of

I tell a lot of people, like, if you want to test the reliability of something, just run it on a bunch of, like, different logs. But, like, slightly varying terminology. I don't know. Whatever it is. And just, like, make sure that your answer is the same. And a lot of times that's not true.
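A rough sketch of that kind of consistency check; run_agent and perturb_terminology are hypothetical stand-ins for your agent and for whatever variation you apply to the logs.

```python
# Run the agent on slightly varied copies of the same logs and measure how often
# it lands on the same answer; low agreement means the answer is not reliable.
from collections import Counter
from typing import Callable

def consistency_check(logs: str, run_agent: Callable[[str], str],
                      perturb_terminology: Callable[[str], str],
                      n_variants: int = 5) -> float:
    """Return the fraction of runs that agree with the most common answer."""
    answers = [run_agent(perturb_terminology(logs)) for _ in range(n_variants)]
    answers.append(run_agent(logs))
    most_common_count = Counter(answers).most_common(1)[0][1]
    return most_common_count / len(answers)
```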

And you can do some sort of anomaly detection on those outputs to figure out, okay, what are the common failures that it gives you? And now if you have this environment, that becomes very easy. So what we did originally was we'd spin up actual environments, like actual GCP projects, actual clusters. That's so much. Almost everybody is doing this at the moment that's kind of in the space. I'm not sure if anybody's doing the simulation approach, but...

what's not surprising, but it's obvious in hindsight, is these systems are good at repairing themselves. Kubernetes wants to bring applications up. All the development in the software is to make sure that it works, not that it's broken. And keeping a system broken in a state...

that is consistent, because tests need to be deterministic. Otherwise you run your agent once, run it again, and it fails. But if the whole world has changed in that last five minutes, because the time series are different, the logs are different. - You're screwed. Yeah, it's worthless. - Is it this environment that's changed or is it my agent? And then you're, like, just doubting, and then you need a system to monitor the environment as well.

So that was very hard. And sometimes you need to backfill these environments with data and you accidentally load the wrong data. And then now you need to delete your accounts and reprovision them. It's just so slow. And so we also then went a different route of APIs that are mocked by LLMs. But then this also creates non-determinism. So that's also a very, very challenging direction to go. And so that's where we landed on the simulation approach. We created these fakes. The downside with the simulation approach is that it's not a perfect...

simulacrum or replica of this like Datadog, right? We can't have every API because then we're like building Datadog effectively, right? But you just need to get it to like a, there's like an 80-20 rule where it's good enough that the agent gets fooled by it. Some of these latest models, the agent actually realizes in the simulation that

There's a screenshot on LinkedIn that my co-founder shared where it figured out it's in a simulation and it's like, it bailed out of an investigation because it's like, this looks like a simulation environment or I can't remember if it's the right word, but something like that or like a testing environment. Wow. Then you have to change the

the pod names, and you have to change the logs to make it more realistic. But that's pretty cool, though. What if you, like, tell the agent up front in the system prompt, like, this is a simulation, but you're... something? It won't work? That's what we're doing. That's what we're doing, but I'm not sure if that's going to backfire. But we are doing that right now. It'll be like, yeah, I'm good, there's no production system, this is

You basically have to tell Neo, you're in the dojo, but you're going to go into the streets. You're training now, but just train properly as if you're going to go into the streets. That's really trippy. And it's cool how quick the iteration loop then becomes. And so you're able to figure things out and...

When you learn it once, does it then get replicated throughout all of the different agents? Yeah. This is part of the problem. The other part is how do you improve the agent to actually fix the problem, right? So then you have to think about, if you need causality that's spanning multiple services, do you add a knowledge graph or a service graph, or do you need a learning system? There's other components to the agent that you have to expand or introduce. Yeah.
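Coming back to the simulation environment discussed a moment ago: here is a tiny illustration of the kind of deterministic fake you might put behind a Datadog-like logs interface so a broken-world scenario replays identically on every eval run. This is entirely hypothetical, not Cleric's code.

```python
# A deterministic fake logs API: serves canned, frozen log lines for one scenario,
# so the agent sees the same broken world every time the eval runs.
import json
from pathlib import Path

class FakeLogsAPI:
    """Serves canned log lines for one scenario directory of JSONL files."""
    def __init__(self, scenario_dir: str):
        self.scenario_dir = Path(scenario_dir)

    def query(self, service: str, query: str = "*") -> list[dict]:
        path = self.scenario_dir / f"{service}.jsonl"
        if not path.exists():
            return []  # unknown service: empty, but stable across runs
        lines = [json.loads(line) for line in path.read_text().splitlines()]
        return [l for l in lines if query == "*" or query in l.get("message", "")]
```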

So I don't want to get into all of those things, but that's where the work really is. I don't know if you can see that dog right there, the dog lamp? Yeah. It's legit. There's like two other little dogs there. Cuties, yeah. They are happy dogs. Is English your first language?

English is my first language. Does it not seem like it? Alright, now we're cutting this shit.