
857: How to Ensure AI Agents Are Accurate and Reliable, with Brooke Hopkins

2025/1/28

Super Data Science: ML & AI Podcast with Jon Krohn

People
Brooke Hopkins
Topics
Brooke Hopkins: I'm the founder and CEO of Coval. We've built a simulation, evaluation, and monitoring platform for voice and chat agents, with the eventual goal of supporting any autonomous agent. Drawing on lessons from Waymo's self-driving car development, we help companies strike a balance between running large numbers of expensive tests and achieving high test coverage, and we address problems like running complex simulations at scale on distributed systems, streamlining the process, and measuring and interpreting results. By simulating multi-step agent workflows, we help customers automate reliable simulation and evaluation, solving the problem that manual testing is time-consuming and makes context and state hard to manage. Our platform is designed to distill complex things into something simple so AI engineers can focus on other problems. Coval's user flow starts with simple single-prompt tests, gradually adds complexity, and iteratively improves the agent through simulated tests, metric creation, and production monitoring. Our strategies for handling cascading errors in AI agents include building self-healing agents (for example, a background "overthinker"), redundant systems, and graceful failure handling. We use layered metrics, combining automated metrics with human review, and focus on trends rather than absolute values when evaluating AI agent performance. Coval provides a range of metrics for evaluating AI agent performance, including workflow adherence, function-calling correctness, and comparisons against human performance. Coval's real-time monitoring helps customers catch and resolve problems quickly, such as infrastructure outages or new patterns of user behavior. We chose to start with voice agents because the space is developing rapidly, and because voice, as a relatively constrained medium, makes it easier to develop advanced metrics and workflows. The potential of voice agents goes far beyond replacing phone calls; they can create entirely new modes of interaction and use cases, such as a universal natural-language API between businesses. Going through Y Combinator helped me refine Coval's business direction and gave me inspiration and support from other founders. Jon Krohn: (asks questions, guides the discussion of Brooke Hopkins's points, and summarizes the conversation)

Chapters
This chapter introduces Brooke Hopkins and Coval, a platform for simulating and evaluating AI agents. It highlights Brooke's background and Coval's recent success, setting the stage for a discussion on AI agent reliability and the future of AI.
  • Brooke Hopkins, founder and CEO of Coval
  • Coval is a simulation and evaluation platform for AI agents
  • Coval recently closed a $3.3 million fundraise
  • AI agents are poised to be the next major platform shift after mobile

Transcript


This is episode number 857 with Brooke Hopkins, founder and CEO of Coval. Today's episode is brought to you by ODSC, the Open Data Science Conference.

Welcome to the Super Data Science Podcast, the most listened-to podcast in the data science industry. Each week, we bring you fun and inspiring people and ideas exploring the cutting edge of machine learning, AI, and related technologies that are transforming our world for the better. I'm your host, Jon Krohn. Thanks for joining me today. And now, let's make the complex simple.

Welcome back to the Super Data Science Podcast. Today, I'm delighted to be joined by the dynamic AI entrepreneur, Brooke Hopkins. Brooke is founder and CEO of Coval, a Y Combinator-backed San Francisco-based startup that provides a simulation and evaluation platform for AI agents. They also recently closed a $3.3 million fundraise that includes heavy hitter venture capital firms like General Catalyst, Mac, and Y Combinator.

Previously, Brooke was a tech lead and senior software engineer at Waymo, where she worked on simulation and evaluation for Waymo's self-driving cars. Before that, she was a software engineer at Google. She holds a degree in computer science and mathematics from New York University's Abu Dhabi campus. Despite Brooke's highly technical background, our conversation is largely conceptual and high-level, allowing anyone who's interested in developing and deploying agentic AI applications to enjoy today's episode.

In today's episode, Brooke details how simulation and testing best practices inspired by autonomous vehicle development are being applied by her team at Coval to make AI agents useful and trustworthy in the real world. She talks about why voice agents are poised to be the next major platform shift after mobile, creating entirely new ways to interact with technology.

She talks about how companies are using creative strategies like background overthinkers to make AI agents more robust. Those overthinkers are AI agents themselves, by the way. And she provides us with a glimpse of what the rise of AI agents means for the future of human work and creativity. Indeed, how agents will transform all of society. All right, you ready for this fascinating episode? Let's go. ♪

Brooke, welcome to the Super Data Science Podcast. I'm excited to have you here. Where are you calling in from today? Calling in from San Francisco. You and I met in San Francisco. I'm going to butcher the exact name of this event, but it was an event run by the Gen AI Collective, I think is their official name.

And it was a startup competition. From what I understand, a huge number of startups, over 100, maybe several hundred Gen AI startups, applied to be part of this competition. You were one of 10 startups selected to present at this Gen AI Collective event. So it was a cool thing where you had like two minutes to demo the product. You weren't allowed to have a slide deck.

And not only were you one of the 10 companies invited to do this, which was an extremely high barrier to clear in and of itself, but you won.

Yeah, that was a really exciting day. We actually had also launched on Product Hunt that day. And this event was co-hosted with Product Hunt, and the CEO of Product Hunt was there, as well as all the people from the Gen AI Collective. So it was a very exciting day: we launched on Product Hunt, we got number one on Product Hunt, and then went to this event. The energy was electric. It was really exciting. Yeah, it was a really cool event.

And I was delighted that you took the time to speak to me afterward and that you were interested in being on the Super Data Science Podcast. Thank you, Brooke. Let's dig into why you won that day at the Gen AI Collective event with your company, Coval. So you previously led the evaluation job infrastructure at Waymo, which is Google's self-driving car project. I'm personally a huge fan of Waymo. I've

love being in them. I feel so safe when it's driving. So after I've been in San Francisco riding around in Waymos and then I'm back driving a car myself, I think to myself, drive like a Waymo, be patient.

Totally. I think it's so fun to be in a Waymo because it feels so magical every time. How Waymo was able to transform the fear and kind of uncertainty around self-driving cars to the other extreme, where now you get in a Waymo and feel safer than if you were in another ride-sharing service, really speaks, I think, to amazing technical talent and deployment as well.

Yeah, being in a human-driven car, it feels like you're in the Wild West. You're like, what is this wildness? Totally. And every once in a while, you get the rogue driver who is a little bit crazy in some way. It makes you...

Yeah, wish for it way more. Absolutely, yeah. And we, in episode 849, which was an annual episode that we do where we predict trends for the coming year, data science trends, my guest in those episodes for many years now is Sadie St. Lawrence. And we did something new this year, which is we also, in addition to making predictions for 2025, we also...

We created some awards, so things like our biggest wow moment of the year, our biggest disappointment, what company we think made the biggest progress in AI in the past year, that kind of thing. And our wow moment of the year was being in a Waymo.

Wow, that's amazing. I'm super glad to hear that. It definitely would make my moment of the year, probably for the last five years. Yeah, it's one of those things. It's my go-to example now, a question that I get from people, lay people, friends, family, that they're like, what's something that we need to know about AI? And my go-to answer, since I was in a Waymo in the Northern Hemisphere summer in 2024, was,

My go-to answer is: you can go to San Francisco, use an app like Uber, and have a car come and pick you up with nobody in it and drive you to wherever you'd like in the city, and drop you off. And you feel safe. When you relay that to people, it gets them thinking about how there is going to be a huge amount of change in the coming years. It just hasn't proliferated around the world yet.

100%. I think the future is here. It's just not evenly distributed. It's very real. It's just not evenly distributed. That's the quote I was looking for. Yeah, and I think seeing how fast Waymo is deploying to new cities is also a really exciting part of this.

To get from Mountain View to San Francisco took so long. It probably took, you know, Waymo was started 10 years ago. We only deployed fully rider only in San Francisco two years ago. And now we're already deploying in Los Angeles, in Phoenix. They're expanding to all sorts of new cities. And the speed of deployment is just accelerating for each new city. And I think that speaks to a lot of the developments, both in like

how the model development works, but also how simulation was able to aid where you don't have to have that manual deployment in all these cities. You don't have to be running nearly as many driving logs because our simulations have become so accurate and you're able to scale them to a level that they previously didn't have confidence in. Yes, yes. And speaking of simulations, Waymo,

You're the founder and CEO of Coval, which is a simulation and evaluation platform for AI agents. So you're starting with voice and chat assistants. Can you parse for us what this means? We kind of have a sense; maybe you could even use Waymo as an analogy, because it's quite easy to visualize in that scenario how the simulation and evaluation work that you were doing at Waymo works.

how that then transferred over to what you're doing with the hottest topic in, I think, the world, although obviously I'm super biased as an AI person, but AI agents, you know, applying that Waymo knowledge to now this super hot field of AI agents. Totally. Yeah, so what Coval is doing is we're building a simulation, evaluation, and monitoring platform for voice and chat agents, but eventually we want to do any autonomous agent platform.

So an autonomous agent is an agent that's navigating the world and responding to the world. So think like a web browsing agent or a voice agent or a chat agent are all responding to what you say back to the agent.

And so in the same way that when Waymo is driving from point A to point B, it needs to respond to a pedestrian crossing the street, or it needs to respond to maybe some changing features on the road, such as construction or a new road that has been created. There are parked cars. There are all these different changing environments.

We're trying to take the same learnings from how we conquered that at Waymo in order to create really robust, scalable self-driving software, and transfer that into how we can build really reliable, robust voice and chat agents or web agents that are able to navigate autonomous situations, while balancing

running tons of tests that are really expensive while also having really high coverage. And so these were a lot of the trade-offs that we made at Waymo, as well as how do you run these really complex simulations on distributed systems? How do you do it at scale? How do you distill a lot of the complexity of this to something that's really simple for model and ML and AI engineers to be able to understand so that they can focus on the other hard parts they're working on?

And then also just how do you measure this? What do the metrics look like? How do you interpret the results and how do you get signal from all of these massive amounts of data? Yeah, and I think we're going to dig into basically all of those topic areas right now. Before I kind of dig into those kinds of things like balancing accuracy and scalability, compounding error issues that you get when you have a chain of AI agents, real-time monitoring, all these kinds of things that Coval offers, maybe you could illuminate

illustrate for our listeners. And this can be maybe kind of a tricky thing to do in an audio-only format. Actually, I had the pleasure of you providing me with a demo of your platform, sharing your screen and showing that to me last week. So I kind of have in my head some visuals of how the platform works. But maybe you could use a case study with a client or two. You don't need to necessarily name them by name if that's not something, obviously, you're authorized to do with the client. But just kind of

describing a situation that a client's in and how Coval is able to automate reliably their simulation and evaluation. Totally. So a common pattern that we see when you're developing agents is that in order to test these multi-step

agent workflows, there are a couple of things that make it really hard. So one, to test this manually often takes a lot more time because instead of just putting in one input, such as clicking a button or getting an LLM response from a call, you have to go through multiple steps. And so with a phone call, this could take as long as a phone call takes, which might be minutes or even 15 minutes.

or longer. And then as well, you have to recreate all these different contexts and states. And these are really hard to manage. Even if you are willing to put in the time to do this, you have to remember, okay, I went down this pathway, but I haven't tested this pathway. And then can I remember what it was like when I tested it the first time? And so similar to self-driving cars, you have

To get from point A to point B, you have all these possible paths that you can go, and some of them are right and some of them are wrong, and some of them are hard to tell. And so really what you want to be doing is running all of the possible pathways, or at least a representative subset of the pathways, so that you can have high signal and high confidence in what you're testing.

And then you want to see how often do certain types of events happen. So, for example, how often am I seeing the agent get stuck? How often am I seeing the agent mispronunciate things?

That's perfect. I love that. Just like me. Just like a human. So how often are you seeing transcription errors? How often are you seeing logic errors? All these types of things. You want to see how often they're getting the conversation wrong or right. So some of our customers, for example, a lot of our customers are customer service agents. So this is an area we're seeing just really explode within voice agents.

And so you have a customer that's calling you and they want to book an appointment. So they want to book an appointment for tomorrow or next week, or they want to book an appointment for the next available time or Tuesday the 24th. And you should assume that that's in 2025 or all these different permutations of possible ways to book an appointment. And so what our customers will do is they simulate

booking an appointment and the prompt for this simulation would be book an appointment for some time in the future or just book an appointment. And then you can vary how deterministic or non-deterministic these simulations are with temperature and other things, and then be able to map out all these different pathways. Or if you care more about, I want to test booking an appointment for tomorrow because I've seen some errors with that case. I'll prompt it with

Book an appointment for tomorrow and then run that 10 times or 100 times. See how often it's failing. If I feel like sometimes it works and sometimes it doesn't. Can I see how often it's not working? Another thing that we can do is we can re-simulate from transcripts. So if you're a user, you go into your logs and you find examples where your voice agent is performing in a way that's an unexpected way.

So you can go into your logs, find those examples, and then re-simulate them. And so this is actually borrowed from self-driving as well. This is a really common developer workflow where we'll drive manual miles on the road or we'll drive supervised autonomous miles, take those logs from production, and then re-simulate them through our simulation system. So this allows you to reproduce issues to a much finer granularity than if you use fully synthetic data.
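
To make those two workflows a bit more concrete, here is a minimal Python sketch of the idea. It is purely illustrative: the class and function names are hypothetical and are not Coval's actual API. One path generates fresh simulated conversations from a scenario prompt (optionally repeated many times, with a temperature knob for variability), and the other replays the user turns from a logged production transcript against a new build of the agent.

```python
from dataclasses import dataclass

@dataclass
class Scenario:
    """A simulated caller's goal, e.g. 'Book an appointment for tomorrow'."""
    prompt: str
    temperature: float = 0.7   # higher = more varied simulated-caller behaviour
    repetitions: int = 10      # rerun the same scenario to estimate a failure rate

@dataclass
class Turn:
    speaker: str  # "user" or "agent"
    text: str

def call_agent_under_test(history: list[Turn]) -> str:
    """Placeholder for the real voice/chat agent being evaluated."""
    return "Sure, what day works for you?"

def simulated_user_reply(scenario: Scenario, history: list[Turn]) -> str:
    """Placeholder for the LLM playing the caller; scenario.temperature would be passed to it."""
    return "Tomorrow afternoon, please."

def run_scenario(scenario: Scenario, max_turns: int = 6) -> list[Turn]:
    """Generate one fully synthetic conversation for a scenario."""
    history = [Turn("user", scenario.prompt)]
    for _ in range(max_turns):
        history.append(Turn("agent", call_agent_under_test(history)))
        history.append(Turn("user", simulated_user_reply(scenario, history)))
    return history

def resimulate_from_transcript(logged_user_turns: list[str]) -> list[Turn]:
    """Replay a production transcript: the logged user turns drive the conversation,
    so an issue seen in production can be reproduced against a new agent build."""
    history: list[Turn] = []
    for user_text in logged_user_turns:
        history.append(Turn("user", user_text))
        history.append(Turn("agent", call_agent_under_test(history)))
    return history

if __name__ == "__main__":
    scenario = Scenario(prompt="Book an appointment for tomorrow")
    runs = [run_scenario(scenario) for _ in range(scenario.repetitions)]
    print(f"Ran {len(runs)} simulated conversations for: {scenario.prompt!r}")
```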

AI is transforming how we do business. However, we need AI solutions that are not only ambitious, but practical and adaptable too. That's where Domo's AI and Data Products Platform comes in. With Domo, you and your team can channel AI and data into innovative uses that deliver measurable impact.

While many companies focus on narrow applications or single-model solutions, Domo's all-in-one platform is more robust with trustworthy AI results, secure AI agents that connect, prepare, and automate your workflows, helping you and your team gain insights, receive alerts, and act with ease through guided apps tailored to your role. And the platform provides flexibility to choose which AI models to use.

Domo goes beyond productivity. It transforms your processes, helps you make smarter, faster decisions, and drive real growth. The world's best companies rely on Domo to make smarter decisions. See how you can unlock your data's full potential with Domo. To learn more, head to ai.domo.com. That's ai.domo.com.

To kind of give a visual of this and kind of bring the analogy to life a little bit, I was recently in Austin, Texas and saw Waymos driving around with somebody in the driver's seat. And so that's an example. You were talking about expanding to new regions. And so in that scenario, you have somebody maybe getting footage of something that's specific to Austin, Texas that would have been difficult to

to simulate from data in San Francisco or Mountain View. And now with Coval, a developer can be taking a chat experience and going through something that they think, like going through a specific flow that they think is really important, but then also simulating based on that to have more variability without all the effort. Totally. And I think that's why Waymo is actually a great analogy to this, or a place where we can draw a lot of learnings

Because in the same way that Waymo doesn't 100% rely on their simulations, they use it to filter like what should humans really look into? How can we move faster? How can we discover issues faster?

And how can we have much larger scale coverage than we ever would if we were doing only manual testing? But that doesn't mean that you don't have humans reviewing all of the performance or looking into specific issues. And so in that same way, how can you use the manual driving time, once you're really sure that the software is up to snuff and doing what we expect, to find those really long-tail cases?

Or really just proving out the true reliability of Waymo versus, I think, previously in robotics, a lot of this would be done manually. You would be manually testing all of these different scenarios and then trying to reproduce these each time.

And so I think that's where we're seeing voice AI right now, where people are going back and forth with their agents manually. That's what most companies are doing. And it's really painful.

So what Coval does: a lot of companies, a lot of engineers, are going back and forth with their agents all day. Maybe, in the best case, they have a script where they're simulating a transcript. We help those engineers who are going back and forth with their agents reduce that developer time, and also run far more tests with Coval than they ever would be able to with

manual testing alone. Nice. Yeah. And so I realize that this is going to be tricky without visuals, but can you explain maybe even just at a high level how that happens in the platform? And I know that a lot of

design work has gone into building your platform effectively to strike that balance of making Coval both intuitive to use for maybe even a first-time user while simultaneously offering the breadth of functionality that you've been describing that the power users might want to have. Totally. I think this is an amazing challenge of developer tools, and we definitely have idol companies like Vercel or Linear, where they take really complicated things and distill them into really simple products.

And I think developer tools, really well done developer tools, take a really complicated thing and make it obvious what you should do next. Because I think at the end of the day, AI engineers have so many complex problems that they're solving all at once. You know, take voice alone: voice, video, streaming.

This has been a hard problem for over a decade. I think it's amazing how hard it still is to deal with audio bytes, video streaming, et cetera. There's all of the complexities of prompting and dealing with models, building out your RAG infrastructure, building out infrastructure, traditional infrastructure,

understanding your user, filling out the right workflows. And so testing this is just one more piece that often isn't the core competency of these companies, nor should it be. And so what we're trying to do is make it really simple and obvious so that they don't have to spend tons of time thinking through their eval strategy, figuring out how they should be evaluating these setups, figuring out how can we build out metrics, how can we build out really complex systems to do this evaluation, but instead

we streamline them through that process. So it becomes overwhelmingly obvious what to do next, overwhelmingly obvious what their problems are in their system. Nice. Yeah. So that's the design challenge. How do you tackle it? Something that has been nice is that we built this so many times at Waymo. We built several iterations of it and saw a lot of the common design patterns that happen when you have complex configuration files for simulation. And so we've taken a lot of those learnings away from

when we thought something was going to be really obvious or we thought something would stay simple for a long time, knowing how it might evolve over time. So for example, things like configuration files for the simulator, what types of arguments go in there? Where might we go in the future, even if that's not what we have today? And I think that's been really helpful for knowing

how to modularize things so that you have small digestible components, such as we have the simulator, we have metrics, we have analysis, but then also still making it so that you don't have to have a thousand different configuration pieces. So I think that's been, we've taken a lot of learnings from there.

The other aspect is just making a lot of it really easy to use via our UI. So making it

Visually, I think a lot of this is a UX problem of how do you take large amounts of data and distill them for the users so that they can understand what they're simulating and what the results mean. And those two things, while really simple to say, I think are a really hard problem that we spent a lot of time at Waymo solving of how can you tell the user what they're simulating and make it really clear visually.

Often, one of the failure modes is you just ran the wrong tests, right? Your data set wasn't representative of all the cases you're trying to test. Your configuration wasn't enabling the right modules. You weren't running the right setup in some way. And so...

For our agents, it's really important to figure out how can we distill this so it becomes really clear what they're simulating and what they're analyzing. Nice. So let's say I'm a client of yours and I have a customer service agent and I'm going into the Coval platform for the first time, blank slate.

What do I do? Where do I go to start to make my life easier and start to have comprehensive testing and simulations going? Like, yeah, what's the flow like as a user? I'll talk through the whole developer lifecycle. So it's day one. You're building a voice agent. You go and you find a pretty easy-to-use platform to build a voice agent and you build an MVP. Then you can come into our platform and you can iterate on your prompt directly so that you can say,

How does the prompt play out in just a super basic environment? Not even enabling voice necessarily, just: can I see how the conversation plays out with this single prompt? That's usually the first stage. And then you might make your agent a bit more complicated. You might add in some RAG, or you add multiple agents, or you add some flows to it.

And then what you can do is simulate, set up some simulated tests through our system. You'll create a test set. That test set might have a bunch of different scenarios like book an appointment, book an appointment for next week, call to issue a refund, complain about the recent experience you had on their airline, etc. And you'll run all of these scenarios through our simulator.

Then you'll have a bunch of simulated conversations. That alone is super helpful because now you can look through lots of different, you can run a hundred simulated conversations at once and then digest the ones that maybe failed to complete or had a flag, a failed metric,

Maybe it shows that the conversation was ended abruptly, or the user wasn't able to achieve their goal, or the appointment was not successfully booked. So then you can go in and manually review those and try and understand what's happening. And here's where our users really iterate on both their evals and their system. So you might go in and realize, these are the things that I'm trying to manually detect. I'll create a metric in order to detect those things.

And then I realize that I'm interrupting the user. So I'll change some parameters so that I'm not interrupting the user as eagerly. Then I'll rerun it through simulation and be able to say, okay, is it now clear that my interruptions have decreased?
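
As a rough illustration of that loop (the scenario names and metric fields below are made up for the example, not Coval's schema), the iteration Brooke describes looks something like: run a test set through the simulator, then pull out only the conversations whose metrics flagged a problem for manual review.

```python
# Hypothetical sketch of the iterate-on-your-evals loop described above.
TEST_SET = [
    "Book an appointment",
    "Book an appointment for next week",
    "Call to issue a refund",
    "Complain about a recent experience on the airline",
]

def run_simulation(scenario: str) -> dict:
    """Placeholder: run one simulated conversation and score it with a few metrics."""
    return {
        "scenario": scenario,
        "completed": True,          # did the conversation reach a natural end?
        "goal_achieved": False,     # e.g. was the appointment actually booked?
        "interrupted_user": True,   # a custom metric added after manual review
    }

results = [run_simulation(s) for s in TEST_SET]

# Surface only the runs worth a human look: goal not achieved, or a flagged metric.
to_review = [r for r in results if not r["goal_achieved"] or r["interrupted_user"]]
print(f"{len(to_review)} of {len(results)} simulated conversations need review")
```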

So once you have a good workflow there, you can automate those and then start monitoring how well your system's doing in production. Nice. I might be interrupting you or you might have gotten through developer lifecycle, but I have a couple questions for you. So you mentioned at the beginning on day one, the developer selects a

platform to build the agent on. Do you have, can you disclose, preferences that you might have or Coval might have as kind of preferred agent providers? Is that something that you do, or that you can provide guidance on in public? Yeah. So I think there really is not necessarily a right platform or a one-size-fits-all. And I'm saying that honestly, because we've seen this work, we've seen all sorts of different platforms work well.

We also want to be the evaluation framework that's agnostic of which framework you're using, because this allows you to really easily switch between platforms. Everything is moving so fast in voice AI. And also, as prices go up or as your requirements change or as your product evolves, different solutions might make sense for you at different times.

So I don't think that we're kind of biased towards one or the other. I think there's a bunch that makes sense for different use cases. And there's a couple of axes on which I would make that decision. So, for example, on the scale of low code to more configurable, you have much more low-code solutions that are servicing business owners or anyone from a non-engineering background, where they can really set up a voice agent as easily as setting up, you know, an email newsletter or any other easy-to-configure service.

But here you're going to have a lot less configurability. Setting up function calling or RAG is going to be a lot more limited. Whereas a higher configurability option, such as some of the open source voice orchestrators, those are going to give you a lot more control over function calling, over being able to add in different infrastructure and mix and match that with your own in-house built infrastructure.

And so I think figuring out where on that spectrum you are, then also looking at which companies kind of support the developer needs that you're looking for. Some other considerations that we've seen are

So there's a couple of important things when you're building voice agents, such as instruction following, function calling, workflow following, conversational or how natural the voice sounds, creativity. So if you're building an application that's talking to you as a friend versus you're building an application that has to follow a very strict workflow in order to collect a certain amount of data for a patient intake versus a

voice application that is calling out to do a bunch of function calls, such as updating records or booking appointments.

Or you're in a very high-compliance industry where instruction following is really important. It really needs to do the things that you tell it to. These are all different trade-offs, and I think different platforms excel at different things. So for example, for the conversational and creativity side, you might be looking at different models than if, for example, you really care about function calling and making sure that you can do really complex function calls; that might not fit into

the more opinionated platforms that require you to set everything up in the way that they determine. That being said, there's some platforms that allow you to set up workflows in these really beautiful ways and makes it really easy so you don't have to code this giant mess.

So those are kind of the trade-offs that we make. And we actually work with our customers to figure out what the right platform is for them. So if you're having these questions, reach out to me and I'm happy to even just like bat around some ideas. Excited to announce my friends that the 10th annual ODSC East, the Open Data Science Conference East, the one conference you don't want to miss in 2025 is returning to Boston from May 13th to 15th. And I'll be there leading a hands-on workshop on agentic AI.

Plus, you can kickstart your learning tomorrow. Your ODSC East Pass includes the AI Builders Summit running from January 15th to February 6th, where you can dive into LLMs, RAG, and AI agents. No need to wait until May. No matter your skill level, ODSC East will help you gain the AI expertise to take your career to the next level. Don't miss out; the early bird discount ends soon. Learn more at ODSC.com slash Boston. Nice. Yeah, so you were mentioning there, Brooke,

how agentic AI platforms are designed to try to provide ease of use, and they might kind of have graph visualizations perhaps to allow that to happen. That reminded me how, during the demo that you gave me of Coval last week, you had a graph aspect of the platform that allowed users to create nodes and connections between those nodes to map out conversation flows, which you could imagine would be very helpful in, say, a customer service example.

Somebody comes in and you could have one flow where it's dealing with an issue that they're having or booking appointments. And then down the issue leg of this graph, you could have a whole bunch of common questions or flows that happen when somebody is encountering an issue versus when somebody is looking to book an appointment. There's a completely different way that the conversation could go.

So that's an example of how Coval is trying to combine, or succeeding at combining, precision and scalability, because these are often conflicting goals. So when you think about trying to have an AI agent work effectively, the most precise thing to do,

but would be extremely time consuming would be to create maybe thousands or tens of thousands of different conversation flows that cover the gamut of possibilities and really comprehensively cover all the possible scenarios that your users could go through, which might be impossible, but let's say, let's pretend that it's possible to do all that coverage. It's going to take thousands, tens of thousands of manually created conversational flows.

That's not very scalable. You know, you make a change to your platform, you're offering some more flexibility, you open up to a different kind of customer base, any of those, it could be very small shifts. And then all of a sudden, wow, we're going to need thousands more conversation flows to handle this new niche that we're covering and this new functionality that our AI agent has. So that's the far end of precision on the very far, on the opposite end of the spectrum.

to maximize scalability, you could have something like, hey, you know, you could have a chat with some kind of conversational Gen AI agent and say, I'm going, you know, I'm creating a conversational agent that will work in this particular scenario, create a whole bunch of tests for me. And then you just use those without kind of reviewing them,

So, yeah, so I think hopefully I've done a passable job of kind of explaining this spectrum of scalability to precision. And yeah, I'd love to hear your thoughts on that and how Coval addresses it. Yeah, I love that you're already realizing this because I think that it's a non-obvious piece of the puzzle, but it's not only of can I run the right test and get signal out of it, but what should I even be running? How do I make this work?

How do I make these trade-offs of scalability and signal? So we made these trade-offs a lot at Waymo, basically always balancing cost and latency with signal. So you can obviously always spend more to make things faster and run more scenarios. But this obviously comes at the cost of it being more expensive, it taking longer in your developer iteration cycle.

On the flip side, you could run no scenarios and it would be very fast and cheap, but you will have no visibility into your system. And so this is something actually we work with our customers to figure out what is the right balance of how many scenarios should they be running at which points in their workflow. So what does it make sense to run, say,

on every PR that you submit or what makes sense to run every six hours or nightly. And then what types of sets should you be creating to run regression sets with or run before big releases? And so I think this is a really big problem is figuring out not only once I know what to run and how to get signal from that, like run the right metrics, then how do I scale that as I add more customers, as I add more use cases?
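
One way to picture that trade-off (the tier names, set sizes, and schedules here are illustrative assumptions, not guidance from the episode) is as a small config that pins down how much simulation budget each stage of the workflow gets:

```python
# Illustrative-only tiering of eval runs by cost versus coverage.
EVAL_TIERS = {
    "per_pull_request": {
        "scenarios": 10,          # small smoke set: fast, cheap, high-signal cases
        "runs_per_scenario": 1,
    },
    "nightly": {
        "scenarios": 200,         # broader regression set
        "runs_per_scenario": 3,   # repeat to catch flaky, non-deterministic failures
    },
    "pre_release": {
        "scenarios": 1000,        # widest coverage before a big release
        "runs_per_scenario": 5,
    },
}

def estimated_calls(tier: str) -> int:
    """How many simulated conversations a tier costs per run."""
    cfg = EVAL_TIERS[tier]
    return cfg["scenarios"] * cfg["runs_per_scenario"]

for tier in EVAL_TIERS:
    print(tier, "->", estimated_calls(tier), "simulated conversations")
```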

And I think here is also where that kind of developer experience that I mentioned is really important, is being able to show what is the distribution of the data set that you're running? How many examples are you running? Are they all on the same topic or are they on different topics? How do they compare to what you're seeing in production? Are they pretty similar to those examples in production or are you running examples that are completely different than what we're seeing in production?

And so that's where we think it's really important to have this end-to-end workflow, where you can go from monitoring, simulation, testing, then see how it's behaving in production, and then be able to rerun those logs through simulation, or check whether what we're testing actually surfaces the issues we're seeing in production. Great answer. I love that. Crystal clear answer.

In addition to this complexity of accuracy, sorry, of precision versus scalability, another big issue that happens with AI systems, with agentic AI systems, is that there can often be a lengthy cascade. You know, you mentioned earlier this idea of tool calling. So you could have

an AI agent that is kind of triaging the call that is figuring out, okay, based on the conversation so far, it seems like I'm going to need to call tool A. And then maybe later in the conversation, they need to call tool B. Or maybe tool A, in order to do its job effectively, needs to call on tool C.

And so that was a bit of a vague example, but it was to kind of illustrate that you can end up with this cascade of multiple agents in a sequence, potentially making requests in parallel or having multiple things happen sequentially, multiple calls happen sequentially, all in parallel. And AI agents are responsible for all that without a human in the loop.

So in that kind of scenario, even a small error, especially early on, like what if the triaging agent right at the beginning got it wrong? It called tool A and it should have called tool D. Totally. So that can lead to a butterfly effect where one small error in an earlier step can lead to a massively wrong output later on in the conversation, right?

So what strategies can we employ to mitigate this kind of butterfly effect? And how do you ensure graceful failure when these errors do happen?

Totally. I think this is one of the reasons why evaluation of agents and multi-step evaluations is so different than evaluations of LLMs or any call where you have some input and some output, because not only do you have the non-determinism of a single call, you have the non-determinism in all the possible pathways through the conversation,

these cascading failure points that just explode in terms of the possible pathways and the possible types of failures. There's also an interesting case where it goes off track, but then the agent saves it. It realizes that it made a mistake, which

leads me to: a lot of the ways people have been solving this is with self-healing agents. Some interesting things I've seen include having an agent in the background for voice, where you have a cheaper, faster,

low-latency model coming up with responses, but then you kind of have an overthinker in the background that's looking at the whole conversation. Maybe it takes longer to make that call, but the latency is okay as long as it's in the background and can help prompt the agent to get back on track, saying, you messed up this order, or you forgot to ask something.
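
Here's a minimal sketch of that "background overthinker" pattern as Brooke describes it; both model calls are stand-in stubs, and the threading layout is just one assumed way to keep the slow review off the latency-critical path.

```python
import queue
import threading
from typing import Optional

def fast_low_latency_reply(history: list[str]) -> str:
    """Cheap, fast model: answers immediately so the caller isn't kept waiting."""
    return "Got it, I'll book that for tomorrow."

def overthinker_review(history: list[str]) -> Optional[str]:
    """Slower, more careful model: re-reads the whole conversation in the background
    and returns a correction if the fast path went off track, else None."""
    if "phone number" not in " ".join(history).lower():
        return "You forgot to ask for the caller's phone number."
    return None

corrections: queue.Queue = queue.Queue()  # nudges consumed on the agent's next turn

def respond(history: list[str]) -> str:
    reply = fast_low_latency_reply(history)     # low-latency path, returned right away

    def review() -> None:                       # slow path runs off the hot loop
        note = overthinker_review(history + [reply])
        if note:
            corrections.put(note)               # used to steer the agent back on track

    threading.Thread(target=review, daemon=True).start()
    return reply

if __name__ == "__main__":
    print(respond(["Hi, I'd like to book an appointment for tomorrow."]))
```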

You can also employ other strategies for graceful failures around having multiple redundancies in your systems. I think there's a lot to learn from aerospace and self-driving here as well. I think self-driving has really mastered graceful failures, where it has fallback mechanisms. There are ways that it can pull over, or it can ask questions to rider support.

There's all sorts of systems in place so that it's not just reliant on any one system within the voice agent or within the agent at play. I think how this translates to voice agents is

can the voice agent self-determine when the request is too complex for itself? This is already happening a lot today. I would say the majority of our customers already have the ability to transfer to a human when they determine that the task is too complex for them. Yeah, that's something that hadn't even occurred to me until now, of course. Exactly. So that's an example of a redundant system. But I think we can create redundancy in all these other ways. And something that's interesting about

kind of a counterpoint to people who say that voice agents will never be reliable enough. They're inherently non-deterministic and very hard to kind of corral into doing a task reliably right. I think we've seen that

in infrastructure, this is not the case, because servers are inherently very unreliable. And yet we were able to create the cloud, to create infrastructure on top of all sorts of unreliable systems at many, many layers that theoretically should all compound into a massive error percentage.

And we've seen people create six nines of reliability for those systems. And that's through redundancy, that's through fallback mechanisms, that's through all sorts of other engineering techniques. And so I think we're going to see the same thing happen with agents, where you can create reliability out of unreliable systems. Classic Luddite thing. This will never be possible. And then I can point you in the direction, Luddite, of...

Waymo. You know, it's the kind of thing people say: oh yeah, it's the same thing as nuclear fusion. It's like self-driving cars are kind of always 20 years away, had been for decades, but now it's happening. And it's the same kind of thing with this, with voice agents, with more and more kind of agentic systems. The server example that you gave was beautiful because it's,

Yeah, you know, six nines of reliability that is possible with agentic systems as LLMs themselves get better at not hallucinating on an individual call, but then also as these kinds of redundancies are built in, like you've been discussing. And so...

It is inevitable. It is not impossible. It is inevitable. This is what is going to happen. And if you don't think agents are going to be able to handle a huge swath of complex tasks in the coming years,

You're wrong. You heard it here first. Yeah, yeah, yeah. It's not even like a risky thing for me to say that. That is what's happening. Just as there are going to be self-driving cars in more and more cities from more and more providers handling a broader range of tasks. It's just going to happen. So, yeah. I think there's an interesting parallel there of how some agents

acting reliably can actually raise the tide for everyone, where if you have lots of good examples of agents being deployed in enterprises successfully, that's going to create an environment where more agents are able to take on larger and larger tasks. And I think we saw this with self-driving, where as you're able to carefully, safely scale out self-driving,

it doesn't matter which company does it, that's going to make it a more favorable environment for any company to be able to develop self-driving. And so I think something we want to do at Koval is also giving companies the tools to be able to show to their customers that this is an agent that is going to perform reliably and you can trust that this is performing not just on the demo cases that I showed you,

which, you know, might be smoke and mirrors, but that it's actually working for all of the cases that you're interested in. And then you can go and explore those cases and have confidence that the agent will be behaving as you expect, and then monitor that over time.

And so something Koval wants to do is we want to be able to power enterprises to be able to understand how their agents are behaving in a world where these systems are so much more complex than just knowing if your web app that you use for accounting works. You just log in and it's working or it's not working. But with agents, there's just so much less visibility. And I think that makes people inherently more

distrustful of the systems, even if they can produce so much value and the technology is already there. I just realized I've been butchering the pronunciation of your company this entire episode. You've been saying Coval like "oval", and I've been saying Coval like "Albert". Actually, this is funny. I think we don't have a consistent pronunciation, so you don't have to worry about it. Okay.

Our name actually comes from, we are named after, or we named the company after, Sofia Kovalevskaya, who was the first female mathematician to get her PhD. And then also it's collaborative eval, or conversational eval. And so it kind of has this double meaning. That's super cool. I love that.

I'm going to have some info in the show notes on Sofia Kovalevskaya for people, so you can click through and read about her, probably her Wikipedia profile or something. I'll be sure to include that. That's really cool. Yeah, and I'm sure I'm butchering her name, which I should really know by now, but there's a lot of consonants in there. Yeah, yeah, yeah.

I'm sure you're doing better than I am. At least I should be pronouncing it the same way you do on air, so I'll try to switch to Cove-all. Cove-all, yes, that's what you could do. I think I honestly switch back and forth, so no stress. Gotcha. We can do Cove-all. Nice. Well, sweet. I think part of why that one seems so right to me is it's like eval.

Yes, that's what we try and say, like we copied it after that. That's how I'm trying to pronounce it. I can't say that word for some reason. Pronounce and mispronounce. Nice. All right. So anyway, back to the conversation flow. You were just giving a great answer for me on the butterfly effect. Yeah.

And yeah, crystal clear, tools like Coval are going to be able to move us in the direction of having agents handling more and more kinds of tasks in more and more kinds of scenarios. They are going to be ubiquitous in the future. It is inevitable. Something else that Coval offers is

is custom metrics. So there could be complex scenarios where standard metrics, just plain old accuracy, aren't useful. I mean, actually, that would be something. How do you, in a scenario where

This isn't like a math test. Scoring a conversation isn't like a math test where there's a correct answer. You just get to some integer or some float and you're like, okay, that is the correct answer. Nice work, algorithm. When you have an agent handling a complex task, there's an effectively infinite amount of variability there.

where, you know, there's an infinite number of ways that it could be right. Not even, you know, not even including the infinite number of ways that it could also be wrong.

So what kinds of metrics do you use to evaluate whether an agent is performing correctly? And then maybe building on that, what kinds of custom metrics might your clients need? I think you're exactly right that it's really hard to find the line between this is objectively a good conversation and this is objectively a failing conversation; rather, it's a spectrum.

And so what we find works really well is layering metrics. So being able to run a whole suite of metrics and then looking at trends within those metrics. And this allows you to make trade-offs as well. So maybe you're a little bit worse at instruction following, but you get the cases that you care about most 100% correct. Because the distribution of how well you do on all these cases isn't like machine learning where you just care about

you know, getting 99% of examples right. Because if you're getting the one most often used case wrong, it doesn't matter if you get the other 99% right, because when someone tries to book an appointment, they fail. And so we see that these patterns of what matters are different than in other traditional software applications or machine learning applications or even robotics. And the other piece of this is being able to show

By having a variety of metrics, you can create a whole picture of how the system is behaving. So for example, a short conversation isn't inherently bad, but a short conversation where the goal wasn't achieved and the steps that the agent was supposed to take were not executed, that's an objectively bad conversation. So you can filter down by potential true failures, false positives,

false failures, et cetera, and basically figure out which ones are worth looking into by filtering on these metrics. So I think while we aim to provide automated metrics for things like: did it follow the workflow? Was the conversation successfully completed? Were all the right function calls called with the right arguments?
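
To show how layered metrics combine in practice, here's a toy sketch; the field names and thresholds are invented for illustration. The point is that no single score decides pass or fail; it's the combination (short, goal not achieved, required steps skipped) that marks a conversation as a likely true failure worth a human look.

```python
# Sketch of "layering" metrics: combinations of signals, not one score, flag failures.
conversations = [
    {"id": 1, "num_turns": 4,  "goal_achieved": True,  "workflow_followed": True},
    {"id": 2, "num_turns": 3,  "goal_achieved": False, "workflow_followed": False},
    {"id": 3, "num_turns": 22, "goal_achieved": True,  "workflow_followed": False},
]

def likely_true_failure(c: dict) -> bool:
    # A short conversation isn't inherently bad...
    # ...but short + goal not achieved + required steps skipped almost certainly is.
    return c["num_turns"] <= 5 and not c["goal_achieved"] and not c["workflow_followed"]

worth_human_review = [c for c in conversations if likely_true_failure(c)]
print("Conversations to hand to a human reviewer:", [c["id"] for c in worth_human_review])
```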

There's also always going to be space, I think, for human review and really diving into those examples. And the question is, how can you use that time most effectively? So it's not that you never look at all these examples, but you're looking at the most interesting examples. Did you know that the number one thing hiring managers look at are the projects you've completed? That's why building a strong portfolio in machine learning and AI is crucial to your success.

At Super Data Science, you'll learn how to start your portfolio on platforms like Hugging Face and GitHub, filling it with diverse projects. In expert-led live labs, you'll complete an exciting new project every week. Plus, through community-driven projects, you'll tackle real-world multi-week assignments while working in a team. Get hands-on experience with projects like retail demand forecasting, building an AI model from scratch, deploying your own LLM in the cloud, and many more. Start your 14-day free trial today and build your portfolio with superdatascience.com.

Nice. Very cool. That's a great example of what to prioritize. Are you able to give concrete examples of metrics? What are the most common metrics for evaluating performance?

Yeah, so we have a metric that allows you to determine if you're following a workflow. So for a given workflow described in JSON, which is pretty common in a lot of these different voice platforms, can you determine if you're following these steps outlined in that workflow and determine when you're not meeting those in the conversation? And this is super useful, I think, especially for objective-oriented agents where they're trying to complete a task.

Often, if they miss a step in that workflow, it's a really good indicator that the task wasn't completed correctly. So, for example, if you're booking an appointment, just to use a consistent example, if you're booking an appointment and it asks for the email and the day that they want to book the appointment for, but they forget to ask for the phone number, that task has been completed technically, but hasn't been completed correctly because it missed this key step in the workflow.
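
A stripped-down version of that workflow-adherence check might look like this; the JSON shape and step names are hypothetical, not any particular platform's format.

```python
# Hypothetical workflow-adherence metric: flag required steps the agent never performed.
import json

workflow_json = """
{
  "name": "book_appointment",
  "required_steps": ["greet", "collect_email", "collect_phone_number", "offer_dates", "confirm_booking"]
}
"""

# Steps actually observed in one simulated conversation (in practice these would be
# tagged by an LLM judge or by matching agent turns against step descriptions).
observed_steps = ["greet", "collect_email", "offer_dates", "confirm_booking"]

workflow = json.loads(workflow_json)
missing = [step for step in workflow["required_steps"] if step not in observed_steps]

# "Completed technically, but not correctly": the booking happened,
# but the phone number was never asked for.
print("Workflow followed:", not missing)
print("Missing steps:", missing)   # -> ['collect_phone_number']
```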

Another interesting thing that we do: we dynamically create these workflows in monitoring, so that you can see what workflows your agents are actually going through in production, and see how often that matches with your expectations, or where you're seeing new use cases or new patterns of user behavior. We also have metrics around function calling. So, yeah,

you know, were the right arguments called for these different tool calls? And that's all custom configurable.

And what's interesting here is I think we try to make all of our metrics reference-free. So there's two types of metrics. There's reference-based and reference-free. Reference-based is metrics where you have an expected output and you must curate that expected output with a golden data set and maintain that as your agent behavior changes. Reference-free, we infer what the correct answer should be based on the context of the conversation.

And I think for LLMs in general, reference-free evaluation is really helpful because of the non-deterministic nature, whereas traditional unit testing and software is all reference-based, right? It's easy to make some assertions about what an API call should look like.

But even more so with voice and chat agents, the conversations can go so many different ways. And this changes when you change your prompt, when you change the models, when you change your infrastructure. So having reference-free metrics or at least a strong subset and test sets that rely on those is really important for being able to iterate really quickly.

So we've tried to create reference-free evaluation for function calling. So we say, for example, if we're taking an order, can we confirm that the right function call was made based on what was described in the order from the user? Those two things should match, based on a prompt and a set of heuristics. So this gives you, the users, a lot more flexibility.
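
As a toy example of the reference-free idea (in practice an LLM judge plus heuristics would do this matching; the order-taking scenario and function names below are invented), the check grounds each function-call argument in the user's own words instead of comparing against a curated golden call:

```python
# Reference-free check of a function call: infer the expected arguments from the
# conversation itself rather than from a maintained "golden" expected output.
user_turn = "Hi, I'd like to order two large pepperoni pizzas for delivery."

actual_function_call = {
    "name": "place_order",
    "arguments": {"item": "pepperoni pizza", "size": "large", "quantity": 2},
}

def check_against_context(call: dict, utterance: str) -> list[str]:
    """Flag arguments that don't appear to be grounded in what the user asked for."""
    problems = []
    text = utterance.lower()
    for key, value in call["arguments"].items():
        if str(value).lower() not in text and not _number_mentioned(value, text):
            problems.append(f"argument {key}={value!r} not supported by the user's request")
    return problems

def _number_mentioned(value, text: str) -> bool:
    """Crude check for small quantities spelled out in words."""
    words = {2: "two", 3: "three", 4: "four"}
    return isinstance(value, int) and (str(value) in text or words.get(value, "") in text)

issues = check_against_context(actual_function_call, user_turn)
print("Function call grounded in conversation:", not issues)
print(issues)
```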

So those are just two examples. We've been building out a lot of metrics for new use cases and pulling them from all over the map of using off-the-shelf models, drawing from inspiration in self-driving of, can we measure, for example, the agent performance

against the human performance. If it took the agent longer to perform a task or shorter to perform a task, that's interesting intel. It's not necessarily good or bad when it stands alone, but if the agent takes significantly longer to perform a task and then ultimately doesn't or is repeating itself a lot, it's a good indication that your agent is going in circles. Nice. That was a great comprehensive answer. If I try to recap back for you how we can effectively evaluate

conversational agents. It would be to have lots of, in a way that Coval makes it easy to have lots of permutations of relevant conversations. So you can have lots of different examples that you test over, and then you have a handful of metrics that you evaluate

each of those scenarios in. And so, through scale, you end up being able to ensure robustness, and you can then watch those changes over time. So you could say, well, there's probably not a huge number of, I don't know, maybe there are, I was going to say there aren't a huge number of people doing agentic AI who are training or fine-tuning their own LLMs for doing this, but

Let's just, you know, I'm thinking to my experience, training a deep learning model where you can over time see how the training accuracy and the validation accuracy are trending over time.

You could imagine that same kind of thing here, where if you were training your own LLM to be handling some agentic task, you could then run your suite of examples and suite of metrics provided by Coval at some reasonable kind of number of training steps. And you could be watching how that

curve, how your metric curves change over time. And you're like, okay, you know, we're kind of plateauing across the board. We've probably trained the LLM enough. Similarly, you could compare multiple different LLM providers and

Or you could, with your tech, you can actually also monitor in real time. So you can see how these metrics are performing for your customer. Your customers can see how their agent's metrics are performing over time in real time to see if something's going off the rails. Maybe one of the tools that's required for fulfilling a

a common request in the agentic workflow is down. You know, maybe AWS in Virginia has gone down. And so being able to monitor in real time allows your customers to be able to fix things before they become even bigger issues. Exactly. And I think every piece of this puzzle, as you mentioned,

shed light on is really important, where you might discover some issues are much easier to detect in production monitoring. For example, AWS going down: you can obviously have recurring tests, but it's going to be really overwhelmingly clear when you start to see this happening in monitoring, or these really long-tail issues.

For example, being able to see new user trends, unanswered questions from users. So this is something else that we do is we can detect the unanswered questions within your transcripts and then be able to help you either then answer these by adding things to your knowledge base or adding those capabilities or using UX to let your users know that this isn't something we support.
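
A crude sketch of that unanswered-question detection follows; a real system would use an LLM classifier rather than the keyword heuristic here, and the deflection phrases and transcript are made up for illustration.

```python
# Toy heuristic: a user turn ending in "?" followed by an agent deflection
# is a candidate gap in the knowledge base or capabilities.
DEFLECTIONS = ("i'm not sure", "i can't help with that", "i don't have that information")

transcript = [
    ("user", "Do you offer weekend appointments?"),
    ("agent", "I'm not sure, let me transfer you."),
    ("user", "Can I book for tomorrow at 3pm?"),
    ("agent", "Yes, you're booked for tomorrow at 3pm."),
]

unanswered = [
    user_text
    for (role, user_text), (_, agent_text) in zip(transcript, transcript[1:])
    if role == "user"
    and user_text.strip().endswith("?")
    and agent_text.lower().startswith(DEFLECTIONS)
]

print("Candidate knowledge-base gaps:", unanswered)
```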

So just as much as it is covering the things that you know you should be doing, it's also understanding what the user behavior is or how your system is behaving unexpectedly. And then, yeah, every layer, I think, will catch different issues and is an important part of that workflow. From, yeah, what do you simulate versus what do you catch in monitoring versus what do you do just by manually testing things?

We also have the capability to send things for review. So you can actually send these out to labeling teams and be able to go through tons of examples and then be able to feed that back into your metrics, into your evaluations. So this is really helpful for being able to understand the effectiveness of your metrics over time.

But yeah, as you mentioned, I think the long-term vision of being able to have self-improving agents, so that they get better over time based on these metrics you define, is a really exciting goal. I think it's still too early to do this. We get a lot of questions around doing automated prompt optimization and automated agent optimization. I think we're still so early in agents that having visibility into how these systems are improving ultimately produces better results than

the time savings from having self-improving agents. But I think that will change a lot. Who knows? At this rate, in the next few months, who knows? Right at the end there, with this idea of self-improving agents kind of being in the loop and, you know, adapting their own prompts,

that relates to a question that our researcher Serg Masís pulled out to ask you, which is related to, so in self-driving cars, and I guess in autonomous systems in general, according to what Serg has written here, level five autonomy refers to complete autonomy. So this is a self-driving car that can operate in all conditions without a human behind the wheel.

So bringing that analogy over to these kinds of conversational agents or agents more broadly, web-based agents that you'll be supporting in the future at Coval. Yeah, I guess you kind of answered the question there, which is that at this time, it seems like it would be premature to try to have a fully automated system without a human in the loop at all.

And without, in some scenarios, being able to have the redundancy of a human operator come in and help out. But it also sounds from your response like we could potentially be months away from some of that, though certainly years away from that kind of complete autonomy. Yeah. And who knows what the timelines are. But

I think there are two parts to autonomy here. One is how the agents are developed and how autonomous that development lifecycle is. And then, on the flip side, how autonomous the agents are once they're released, within a task. And I think the exciting parallel with self-driving there is: can the agent figure things out on its own without having to be programmed? So

there are many systems right now that respond to the non-determinism and create reliability by making the steps of the agent even more explicit, even more restricted, with more heuristics or programmatic logic to determine what the agent should do next.

The flip side of this is having an agent that's more autonomous: you give it more context about what a good next step would be, so that when it encounters unexpected situations, it's able to adapt better. A good example is the one I've been using, booking a calendar appointment. So if you have an agent with a very restrictive

workflow where you say: first you should say hello, then you should ask for their email, then their phone number, then offer some dates. If the person says, "Hello, this is Brooke Hopkins, I'm calling to book an appointment for tomorrow," and

maybe they're a bit unusual and give their email in that first message, now your agent isn't able to respond appropriately. Or maybe the caller asks some questions about the company, which the agent actually should be able to answer. You're trading off being able to adapt to new scenarios

versus precision. And so I think with self-driving there's something to think through there, like how Waymo is able to adapt to construction sites versus relying on logs of roads it has already seen before. And I think there will always be this trade-off, and I'm hoping that agents go more in the direction of true autonomy.

Take function calling, for example. There's a lot of work around whether we can call these five sets of functions with these arguments. Instead, could agents work out what APIs exist on the internet, go read the documentation, and then come up with the right API format themselves, with no function calling provided? So I think there's a lot to explore in terms of truly autonomous agents.
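As a rough illustration of that trade-off, here is a minimal sketch under stated assumptions, not how Coval or any particular vendor builds it. It contrasts a rigidly scripted booking flow with one that hands the model the goal and the full conversation and lets it ask only for what is still missing; the `call_llm` function is a stand-in, not a real API.

```python
# Rigid flow: always asks in a fixed order, even if the caller already gave the info.
RIGID_STEPS = ["greet", "ask_email", "ask_phone", "offer_dates"]
SCRIPT = {
    "greet": "Hello! How can I help you today?",
    "ask_email": "Could I get your email address?",
    "ask_phone": "And your phone number, please?",
    "offer_dates": "I have openings tomorrow at 2pm or Friday at 10am.",
}

def rigid_agent(step_index: int) -> tuple[str, int]:
    step = RIGID_STEPS[min(step_index, len(RIGID_STEPS) - 1)]
    return SCRIPT[step], step_index + 1

def call_llm(prompt: str) -> str:
    # Stand-in for whichever chat-completion API you use; returns a canned
    # reply here so the sketch runs end to end.
    return ("Thanks, Brooke. For tomorrow I have 2pm or 4pm; which works, "
            "and what email should I send the invite to?")

# More autonomous flow: give the model the goal plus everything said so far.
def autonomous_agent(goal: str, conversation: list[str]) -> str:
    prompt = (
        f"Goal: {goal}\n"
        "You are a scheduling assistant. Use any details the caller has already "
        "provided (name, email, requested date) and ask only for what is missing.\n"
        "Conversation so far:\n" + "\n".join(conversation)
    )
    return call_llm(prompt)

caller = "Hello, this is Brooke Hopkins, I'm calling to book an appointment for tomorrow."
reply, _ = rigid_agent(0)
print("Rigid:", reply)  # ignores everything the caller just said
print("Autonomous:", autonomous_agent("Book a calendar appointment", [f"Caller: {caller}"]))
```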

Great. Yeah. So you're giving us a glimpse into the immediate hurdles that autonomy faces and how we might be able to mitigate them. Assuming that we will be able to mitigate all of them, we're going to have more and more agentic systems. Are you able to try to see into the future? This is a tricky question, but thinking in decades:

You're relatively young. At the end of your career, what do you think the state of the world might be like? How different might society be as a result of agentic AI systems, AI in general, maybe other exponential technologies like nuclear fusion? Is that something that you spend time thinking about, or is this just a silly question?

No, it's definitely something we spend time thinking about, especially where Coval will go in the not-so-distant future where agents are exceptionally capable, where they have near-human intelligence and, given a task, can execute really well. And I think even in that near future,

the vision of Coval is being able to manage and understand how these agents are behaving at scale. Even if you have agents that behave exceptionally well, we still care about, for example, human performance at scale: these call centers and large companies care about performance reviews. And so being able to monitor and understand how agents are behaving is, I think, just paramount to having agentic systems that

don't, in the worst case, in the sci-fi sense, take over the world. But in a maybe less dramatic sense, just having agents that we understand and are able to reason about, legislate, employ for the right use cases, and understand how they're impacting our users: all of these things are really important for the well-being of everyone. And for an even more distant future, I think

things are never, I guess, as bad or as good as they seem. The same could be applied here: things are neither as dramatic nor as linear as they seem. I'm sure the future 50 years from now, for our grandchildren, is going to be just dramatically different, in the same way that 100 years ago was very different from now.

So I think there's definitely a rapid pace of adoption, but I think humans are so adaptable that even if you start to have agents that are able to do the majority of mundane tasks like spreadsheets and things,

emails and communication and whatnot, I really believe in human creativity. I think humans are exceptionally creative and will continue to build on top of those tools and become even more capable, in the same way that computers today haven't replaced us; we've only become more creative, more connected, and more global

as a society. That was a great answer, Brooke. And probably on some of our evaluation metrics, it was the correct answer. But the absolute correct answer would have been: John, you will be able to upload your brain and live forever. That was the correct answer we were looking for. Right. That was actually just what I was getting to before you interrupted me. I was about to say that

You'll be on a beach on Mars. Your brain will just be making money, more money than everyone else. Everyone will be making more money than everyone else through agents. Everyone will be the richest person on the planet.

Yes, exactly.

Why is it that Coval is so excited about voice agents in particular? Yeah, well, the reason we started with voice agents is because, on one hand, compared to self-driving agents or web agents or all of these more complex agents,

voice is a really great medium where you have one person talking to another. It's a little bit more constrained, and so we're able to develop these more advanced metrics and workflows. And we've also seen voice agents just taking off, exploding in a way that no other agent

was, at least as of six months ago. And so that's a great place to start: an exploding space that's building on top of existing infrastructure. Companies are used to having call centers; they're used to having phone trees. So taking that one step further, to an automated voice agent that's even smarter than what came before, is a much easier step than going from someone spending all day, every day thinking about a problem to saying an agent is now going to handle it.

But beyond just the practical reasons why voice is really interesting, I think people are underestimating how exciting voice is as a space, because we're not just replacing all of these phone calls that you would otherwise make back and forth.

I think there are several other really exciting things about voice. One is that now you have this universal API between any two businesses or establishments. You have essentially a natural language API where you can say, these are all the things that my agent

knows about my company and is willing to reveal about my company. So, either via text or via voice, I can call you and ask about claims data, appointment availability, opening hours, or all sorts of other data.
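One way to picture that universal natural-language API is the hypothetical sketch below; the endpoint, data, and keyword-matching answer logic are all made up for illustration, and a real system would put a model behind the answering side. Each business exposes a single plain-language entry point and answers only from the data it chooses to reveal.

```python
# Hypothetical sketch of a "natural language API" between two businesses.
# Business B exposes one function that takes a plain-English question and
# answers from the data it is willing to reveal; Business A's agent just asks.

CLINIC_DATA = {
    "opening hours": "Mon-Fri 9am-5pm",
    "appointment availability": "Next openings: Tuesday 2pm, Wednesday 10am",
}

def clinic_answer(question: str) -> str:
    """Business B's side: answer in natural language from its own records."""
    q = question.lower()
    for topic, answer in CLINIC_DATA.items():
        if topic in q:
            return answer
    return "Sorry, I can't share that information."

def caller_agent(need: str) -> str:
    """Business A's side: no schema, no SDK; it just asks in plain language."""
    question = f"Hi, could you tell me your {need}?"
    return clinic_answer(question)

print(caller_agent("appointment availability"))
print(caller_agent("opening hours"))
print(caller_agent("claims data"))  # politely declined: not something B reveals
```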

And then conversely, it doesn't matter if the other party has a traditional phone, and so likely wouldn't have an API, or if they also have their own agents. Now you have this very flexible agent interaction where these two systems can talk to each other without any sort of maintenance. And then I think there are also going to be so many new types

of voice experiences that come out, in the same way that, as Steve Jobs talks about in a talk I just watched, when they first had TV, they just pointed a camera at plays and put that on TV. And in the same way with computers, they just put essentially static pages on the web and then discovered all of these interactive capabilities.

And with mobile, they started with, like, a website on the phone, but then they evolved all these mobile-native applications. I think voice is the first big platform since mobile, where every company is going to be expected to have a voice agent or a chat agent, and those chat and voice agents are going to be expected to have a lot more capability. You'll be able to do anything, with all of these natural experiences, online.

So what does that look like, beyond just what I would call a business for? When I'm on a website and something is pretty verbose to type out, maybe it's better to explain it by voice, and then I go in and enter a bunch of numbers into a form, which would be super annoying to say out loud over the phone.

Or are there ways you can interact with web applications more seamlessly via voice because you're driving around as a delivery driver, or you're a police officer or a truck driver, where you often aren't at your computer, but then when you get back to it, you're able to go through all of your orders in the web browser?

So I think we're just on the cusp of figuring out the role of these really advanced voice agents. And that's really exciting: what new experiences can we create with this new medium? That was a beautiful answer. Something else that

occurs to me, related to this, is that having a great voice conversation requires a great world model in the model, say the large language model that the agentic system is relying upon.

And so it's also really cool in that respect. In the same way that a model like Sora, creating a video based on some text prompt, needs to encode in some abstract way within its embeddings that, say, a bullet flying through the air has to keep going straight across all the frames in the video clip, it has this kind of physics understanding built into its embeddings somehow.

In a similar way, when you're having a conversation, especially a complex conversation, which agents in the future will definitely be able to handle, you need to have these really sophisticated world models or great understanding of how the world works in order for that conversation to go well. So it's cool in that respect too.

I think that's such a great concept: what are all the accidental things that agents will discover in the process? In the same way that you said models accidentally discover physics, what will agents accidentally discover?

Mm-hmm. That's really interesting. Something else that occurred to me, and this is going to be my last big question for you before we do the wrap-up questions, since you've been very generous with your time today: Coval graduated from Y Combinator. And so as you were talking about these kinds of things, like

refining what your business is doing, I mean, you may have gone into Y Combinator with all that already figured out, but it occurs to me that Y Combinator is the kind of place where you really kick the tires on questions like what market you should be going after first. So I'd love to hear what your experience was like with Y Combinator. Why did you choose to apply? What was it like going through it? Would you recommend it to our listeners? All those kinds of things.

Yeah. Well, I had pretty much the idea before Y Combinator. I honed it in specifically to voice; we were doing more generalized applications before, but really honed in on voice based on our first customer. Still, I went into Y Combinator with that idea. And I think the reason I did Y Combinator is because I'm a solo founder. So,

obviously, I think it's easier to run a race with lots of other people. And I knew the kind of person I am: I'm a rare extroverted backend engineer, an ML engineer. So I knew I was personally going to really enjoy the program and get a lot out of it. I think one of your biggest challenges as a founder is not the idea or external factors or your luck; your own personal psyche is half the battle.

And so if you can find environments in which you thrive and you're inspired and pushed, and you're constantly pushing harder and harder, that's so important. So I knew that being in the context of a lot of other really smart, inspiring founders was going to be

just great for the company and for my own experience. But I've also found starting a company so much fun. And it's really great to be in a community of people who think the same way, who say this is the best job they've ever had, being able to build something from scratch.

I think YC does a pretty good job of filtering for people who are really excited about their company. Yeah, I mean, it was great to meet you in person in December, and then we spent time together with you showing your platform to me in a demo last week, and now we're recording together. And, unusually

for a founder, and maybe even more unusually for a solo founder, you don't convey the sense of having the weight of the world on your shoulders. It seems like this is the right fit. Obviously, it's going to be challenging; obviously, it's going to be a huge amount of work. But you seem like such a safe horse to back because it seems like you just

have just the right personality to stay calm, to figure it out, and to enjoy the process. So it's really cool. I really appreciate you taking this time with me and with our listeners. Oh, well, thank you so much. It's been a pleasure to be on this podcast. Yeah. And before I let you go, though, I do have two quick questions, the first being: do you have a book recommendation for us? Yes, I have many book recommendations. I'll do one that's more personal, and then one that's pretty related to my work, or has informed a lot of my work.

On a personal note, I think Kim Stanley Robinson's books in general are great. I love Ministry for the Future. I think he did an excellent job of painting a very realistic near-but-far-term version of the future. It's mostly about climate change, what does our world look like with climate change, but that builds on the question you asked about what our world looks like with agents changing things.

100 years into the future can often be really hard to imagine. And I think Kim Stanley Robinson does just a beautiful job in all of his books of painting that future.

And on the side of what has informed a lot of my work, I really love the book Creativity, Inc., as well as reading it in parallel with Bob Iger's biography and Steve Jobs's biography, seeing how these leaders were able to cultivate creativity within their organizations and what it means to really

create a company that builds really novel, beautiful products. It was really exciting to read through how Pixar, which has built some of the most advanced technology and some really novel films, goes through that process, as well as seeing it through the lens of Apple, which is obviously interlaced with Pixar through Steve Jobs's involvement in both

and his close relationship with Bob Iger, and just how all three of those books paint this image of how you build really large organizations that are creative and inspired. Nice, great recommendations. I love those. And I would add, actually, that they're both creative and also very technically advanced, and those things can often be at odds. But Pixar and Apple, I think, are great examples where, both technologically and from a design perspective,

they're incredible. For sure. And it does sound like something you have prioritized at Coval in terms of your user experience as well, which is

super cool. The very last question for you now, which should be a layup: how can people follow you after this episode? It's been so great to learn from you throughout the episode. If people want to get in touch (you mentioned, for example, being able to reach out and ask you about what kind of agentic systems or platforms they might want to consider for their scenario), how can they reach out to you or follow you after the episode? Yeah, definitely. You can always find me on LinkedIn.

There, always feel free to shoot me a message. Also, when you sign up through Coval, you can always book time with me for a quick session going over your voice architecture. Or even if you're not using Coval, feel free to book some time to talk through your voice agents with me. And you can also see a less filtered version of me on X, or Twitter, whatever people call it these days.

So, yeah, we'll add those to the show notes, my LinkedIn and Twitter. Fantastic. Yeah, we will have those. Brooke, thanks again for taking the time. I appreciate it.

I know that you and Coval are going to be such a great success. So, yeah, it's been an honor to have you on the show in these early days. And maybe we can catch up in a few years and see how the product and the world of agentic AI have been evolving since. Totally. Or in 50 years and see how your brain is doing on Mars. Exactly. We can do it in exponential increments. So we'll do it in, like, 3, 30, 300, 3,000 increments.

Which we'll be living to see because of our longevity efforts. Exactly. Well, thank you so much, John. It's been awesome to chat through all of these really exciting topics that I love nerding out about with someone as smart as yourself. Perfect. Thank you so much.

I really enjoyed having Brooke Hopkins on the show today. In today's episode, Brooke covered how Coval is building a simulation, evaluation, and monitoring platform for AI agents, starting with voice and chat agents, applying lessons learned from Waymo's self-driving car testing. She also talked about how the Coval platform helps companies balance precision versus scalability by enabling comprehensive testing across many conversation flows while maintaining high signal quality.

She talked about how key conversational agent evaluation strategies include reference-free metrics, workflow validation, function call validation, and comparison to human performance benchmarks. She talked about how companies are building redundancy into AI agents through techniques like fallback mechanisms, self-healing capabilities, and human backup options.

She talked about how the development of reliable AI agents will likely follow a similar path to cloud infrastructure, building robust systems from inherently unreliable components through redundancy and engineering. And we talked about how voice agents are taking off because they provide a universal natural language API between businesses and with consumers.

As always, you can get all the show notes, including the transcript for this episode, the video recording and materials mentioned on the show, the URLs for Brooke's social media profiles, as well as my own at superdatascience.com slash 857. And if you'd like to connect in real life as opposed to just online, I'll be giving the opening keynote at the RVA Tech Data and AI Summit in Richmond, Virginia on March 19th.

Tickets are really reasonable and there's a ton of great speakers, so this could be a great conference to check out, especially if you live anywhere in the Richmond area. It'd be awesome to meet you in person there.

Thanks, of course, to everyone on the Super Data Science Podcast team, our podcast manager, Sonia Brajovic, our media editor, Mario Pombo, partnerships manager, Natalie Zheisky, our researcher, Serge Massis, our writers, Dr. Zahra Karche and Sylvia Ogwang, and our founder, Kirill Aramenko. Thanks to all of them for producing another fascinating episode for us today.

Thank you.

the episode on your favorite podcasting app or on YouTube, subscribing if you're not a subscriber. And something that I've only recently started saying is you're also welcome to edit videos into shorts and post them on social media, YouTube, TikTok, whatever. Just refer to us and we'd love for you to be doing that. But most importantly...

I just hope you'll keep on tuning in. I'm so grateful to have you listening, and I hope I can continue to make episodes you love for years and years to come. Until next time, keep on rocking it out there, and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.