After helping people optimize models for about a decade and a half, I came to a really interesting realization that I was solving the wrong problem. And foundationally, it was that the thing that's holding back people getting value from these AI systems is not performance. It's not about squeezing out that last half a percent from some eval function or some performance metric. It's about being able to confidently trust these systems.
And I can't tell you how many times over the decades we would help someone optimize a system and they would say, okay, well, what did you break? What bad behaviors are you introducing? What lack of robustness do I now have because I've overfit this system?
And we're seeing people do the exact same thing again today with LLMs, where they're focusing on these high-level metrics, these end outputs, these performance evals, and that ends up masking all of these potentially undesired behaviors within the system itself. Welcome to the A16z AI podcast.
I'm Derek Harris, and I'm joined this week by A16Z partner Matt Bornstein and Distributional co-founder and CEO Scott Clark for a deep dive on deploying and testing AI systems, particularly, but not exclusively, LLMs in enterprise environments. We cover a number of topics under this broad umbrella, but the focus was really on how enterprises can and are trying to establish a trusting relationship with AI so they can actually put it to work on important work and scale their deployments beyond small projects.
The discussion begins with Scott giving his somewhat tongue-in-cheek definitions of machine learning and artificial intelligence before sharing some of his background,
including the hard-earned realization that trust, not performance optimization, is the biggest factor in how heavily large companies will deploy AI. As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. For more details,
please see a16z.com/disclosures. So can you just define for us what's machine learning, what's AI? Like as someone who's lived through, you know, the ups and downs of this market. I remember answering a similar question, I think on an A16Z podcast, maybe eight years ago. All right, maybe we can pop in the old answer right here. I think the answer is somewhat the same. Like machine learning is the stuff that's now become easy. And then AI is all the fun new stuff. And then as soon as it stops becoming...
the cutting edge, then it just becomes, oh, that's just machine learning. It's not magic anymore. It's not magic anymore. And I mean, you can go all the way back to the old Dartmouth conference and things like that, where spell check was AI, because that was an intractable problem. But I don't think anybody would consider that AI today.
Now, of course, with generative AI, I think one of the big differences is instead of these systems being focused on just kind of like classification or regression, trying to determine whether something exists in a set or what the next element should be, now they're actually much more interactive. And like the types of
applications, the types of ways that you can use these systems are becoming so much larger because, I mean, the generative aspect of them is in the name, but that's fundamentally different, I think. And I think that's opened this whole new wave of value for enterprises. Before we get too deep into the state of the world today, can you quickly talk through your background and then how you ended up with the realization that inspired Distributional? It starts...
about 10 or 15 years ago with my first startup, SigOpt. That company was based off of my PhD research, and it was all about how we help companies optimize AI systems. So how do you get a little bit more performance out of a very sophisticated model? And so this is in the sort of traditional machine learning days. Exactly. This was all traditional first wave ML and AI. The definition of AI has obviously shifted over the last decade or so. But it was, yeah, tuning XGBoost models,
tuning reinforcement learning algorithms, that sort of thing. But it was all about how do I tune all the different parameters? And now, of course, these parameters still exist with temperature and different ways that you can segment various aspects of these foundational models.
but it was all about how do I get this peak performance? And we did this for about a decade and ended up selling the company to Intel in 2020, where I led the AI and HPC division for their supercomputing group. And after helping people optimize models for about a decade and a half, I came to a really interesting realization that I was solving the wrong problem.
And foundationally, it was that the thing that's holding back people getting value from these AI systems is not performance. It's not about squeezing out that last half a percent from some eval function or some performance metric. It's about being able to confidently trust these systems. And I can't tell you how many times over the decades of working with SigOpt and going all the way back to my PhD, we would help someone optimize a system and they would say, okay, well, what did you break?
Like, what bad behaviors are you introducing? What lack of robustness do I now have because I've overfit this system? And we're seeing people do the exact same thing again today with LLMs, where they're focusing on these high-level metrics, these end outputs, these performance evals, and that ends up masking all of these potentially undesired behaviors within the system itself.
But it's really a much harder problem because instead of just having a binary output, now you have maybe more freeform text. Maybe you have an agentic system that can do all of these different things. And behavior matters more now than ever.
But there aren't great systems out there to help you understand, define, and test these systems. And that's really what Distributional is aiming to do. It's trying to attack this problem of confidence through testing instead of just optimization through performance like SigOpt was.
You were managing a pretty large team at Intel, right? And managing a pretty large customer base. Was that part of what kind of helped you see a different form of this problem? Definitely. That was a big jump from managing a team of about 25 when we were acquired by Intel to leading a team of about 200 by the time I left Intel. I got to see problems at a higher level. And I got to start to see some of the things that some of our customers now are frustrated with, where it's like,
Yes, it's great to build these models. Yes, it's great to have them be performant and things like that. But at the end of the day, if you're responsible to your customers, whether they be internal or external, you care about reliability. You care about consistency.
And we kept running into problems there. And I mean, people are running into this problem across the Fortune 500 and Global 2000 today of, okay, how do I sleep at night effectively? And I couldn't find a good solution when I was at Intel. I couldn't figure out a way to outsource this.
And so like any good entrepreneur, if you have a problem and then you see it's a very pervasive problem, you can't see the solution, you go try to build it yourself. And it's been really exciting to do that again the second time. Again, taking all of those past learnings, all those past mistakes,
made a ton of mistakes as just a naive PhD student trying to build a company. I won't force you to go through the mistakes that you made. Maybe that's for a follow-up podcast. Yeah, we'll need a handful of podcasts to enumerate all of those. But now we get to do it again. And now make all new mistakes 10 times faster. We work with a lot of founders who cycle through focus areas very quickly, which can be great. That's a great way to sort of find product market fit. I think you're...
You know, it is rare in a sense where, actually spanning three companies, you've sort of worked on a similar core problem and you're just really almost
obsessed with finding the right way to solve it. What was it like to sort of have this light bulb moment where you're like, oh crap, I wasted, you know, X years doing the wrong form of this? Or is it like, oh my God, this is so exciting. Like I finally found the right angle on this problem. Or can you just talk a little personally what that was like for you? Yeah, I think it's more of the latter. And one of the joys of being persistent for better or worse is that it allows us to really empathize with the people who are building these systems and
We've been able to work with people doing machine learning, AI for the better part of a decade or more now. And we get to see a lot of the same patterns, a lot of the same systems, a lot of the same mistakes being made.
And we get to help them do a little bit better, do a little bit more. That's really been the ethos of both SigOpt and now Distributional. It's about building tools to allow these domain experts to use their expertise better, to do it more confidently, to really not have to worry about some of the things that they know they should be worried about and really focus on where they can apply their expertise to make something great.
And it's been great to be able to work with, honestly, a lot of the same types of people. A lot of the people who are now in charge of building Gen AI platforms or productionizing these massive use cases are the same people who built those original machine learning systems. At a systems level, though, how does generative AI compare to those other machine learning systems that these folks have been working on, let's say, in the pre-LLM era? I think one of the main distinctions is
how atomic some of these units are. And so with a lot of traditional ML and AI, it was all about, can I make a specific decision? Can I make a yes-no decision on whether or not to give a loan? Can I make a specific prediction? Do I know which direction the stock is going to go in the next minute or something like that?
And with generative AI, it becomes much more collaborative and it's much more expansive in what it can actually do. From literally just being able to have a conversation with it to some of these agentic systems that we're seeing come up where now you have a model calling a model, calling an MCP server, calling a model, making a decision.
And these pipelines have always existed, but the kind of end-to-end nature of them and how hands-off they can be, how dependent they can be on these internal components and how some of these behaviors can propagate through this system, I think is fundamentally different from the more atomic, like unit-based, almost microservice aspect of traditional ML and AI systems.
So I guess what does that mean for, like, a product owner? What do they kind of worry about when something goes to prod in kind of new AI land versus traditional machine learning land?
Yeah, so I think some of the concerns are the same in the sense that they want these systems to be performant. They have to do better than whatever the status quo is today, whether that's a human or whether that's a traditional machine learning and AI system. But fundamentally, again, and this is the learning that I got after doing this for such a long time focused squarely on optimization, it's about more than just performance, too. It's the behavior of these systems.
And it's making sure that not only does it hit whatever KPI you want it to do, but it also doesn't do the bad thing. It doesn't have some sort of undesired behavior. And that undesired behavior could be, I just don't want it to change dramatically without me knowing, or it could be that I explicitly don't want it to exhibit a specific bias or a specific type of response.
And I think the main thing that's shifted, though, is just how large that output space can be. There's really three main things that make it difficult to quantify and understand the behavior of these AI systems. And one is that they're inherently non-deterministic.
And so this is true of some traditional machine learning and AI systems as well. But basically, the idea here is the same question can get you different answers. And this non-determinism isn't just that the exact same question can get you different answers, but it can also be a very chaotic system where like slightly different questions can get you very different answers.
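To make that concrete, here is a minimal sketch of how one might quantify that non-determinism: ask the exact same question many times, then ask a lightly rephrased version, and summarize how much the answers move. The call_llm helper is a hypothetical stand-in for whatever client an application already uses, and the crude lexical similarity here is just one possible choice, not anything the speaker specifically prescribes.

```python
import itertools
from collections import Counter

def call_llm(prompt: str) -> str:
    """Hypothetical stand-in for whatever LLM client the application already uses."""
    raise NotImplementedError

def jaccard(a: str, b: str) -> float:
    """Crude lexical similarity between two answers (token-set overlap)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    return len(ta & tb) / max(len(ta | tb), 1)

def repeat_stability(prompt: str, n: int = 20) -> dict:
    """Ask the exact same question n times and summarize how much the answers vary."""
    answers = [call_llm(prompt) for _ in range(n)]
    unique = len(Counter(a.strip().lower() for a in answers))
    pairs = list(itertools.combinations(answers, 2))
    avg_sim = sum(jaccard(a, b) for a, b in pairs) / max(len(pairs), 1)
    return {"unique_answers": unique, "avg_pairwise_similarity": avg_sim}

def perturbation_gap(prompt: str, rephrased: str, n: int = 20) -> float:
    """Chaos check: a lightly rephrased question should not produce wildly different answers."""
    base = [call_llm(prompt) for _ in range(n)]
    alt = [call_llm(rephrased) for _ in range(n)]
    return sum(jaccard(a, b) for a, b in zip(base, alt)) / n
```

Tracked over time, even rough numbers like these make the "same question, different answer" problem visible rather than anecdotal.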
Another aspect of this is that they're inherently non-stationary. And so they're literally shifting underneath you. And this can be because your LLM provider decided to change their infrastructure. And then that changes the way that memory is accessed, which for some reason changes the types of responses. Or it can be because maybe upstream of your application, somebody added more things to your vector database, or they changed the retrieval prompt, or they changed something else entirely.
These systems are constantly shifting underneath you from a product perspective. And this gets to the third component: the complexity of these systems is just getting larger and larger. Again, they're no longer these atomic units where you're just getting a single yes-no answer out of them. It can be these systems where, yeah, you're retrieving and then you're generating a response. And then that response is being fed into another system, being fed into another system where maybe an autonomous decision is being made.
And so some of those issues around non-determinism and non-stationarity end up just propagating through that system. If it's chaotic at the beginning and chaotic at every step along the way, these very small changes to the input that were creating large changes to a single output can now create massive behavioral changes by the time it actually starts to affect an end user.
And so it's really important. And what I think a lot of firms are running into right now is if you're only looking at that last step, if you're only looking at the system's performance as a whole, it can be very difficult to understand when, where, and why behaviors are shifting within this application upstream and be able to use that information to adapt, make changes to your application, or understand exactly what's happening. I think we're pretty rapidly entering a world where
Trust in AI systems is even more important than their raw performance, right? Because these things are really good at doing a lot of different stuff.
But how do users, how do customers actually accept that? Is that sort of part of what you're talking about when you describe this problem? Yeah, and that trust can come in many different forms. It can be making sure it's reliable, making sure it's consistent, or even going so far as to say making sure that these kind of latent behaviors are aligned with my desires. And there's obviously a lot of great companies out there trying to solve this at the kind of AGI global level.
But for individual enterprises, you want these applications to also be aligned with your business, with the values of your business, with your individual goals, not just trying to squeeze out a little bit more click-through rate, a little bit more retrieval performance or whatever. So it's almost like the enterprise has to trust what the model or what the system is going to do in order for their customers to kind of trust them. Is that sort of a fair way to think about it?
Definitely. And like anything in life, it's important to trust, but also to verify. And that's where testing comes in because you do need to be able to trust these systems. But you want a mechanism to be able to
consistently, reliably, and adaptively verify that they're doing this, both as you're making changes, as the world changes underneath you, and as these models change as well. And this is what you're doing at Distributional, right? Definitely. So Distributional is an enterprise platform to allow teams to test these applications in production to make sure that they are behaving as expected.
You said behavior a handful of times, but behavior can mean a lot of things to a lot of people when it comes to AI. So how do you define behavior in the context of what
you're trying to solve for, what Distributional is trying to solve for? Behavior for these applications ends up being not just what it produces, but how it produces it. So it's all of these characteristics potentially of the text itself. So not just was this a good answer, but maybe what was the toxicity of the answer? What was the reading level of the answer? What was the tone? How long was the answer? All of these are just properties that can be characteristics of the language itself.
But if it's part of maybe a RAG system too, it's like, what was retrieved? How often is that what's being retrieved? What were the timestamps related to these individual documents? Was it starting to ignore things that it used to cite all the time or vice versa?
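As an illustration of the kinds of properties just described, here is a rough sketch of reducing a single RAG response to a small behavioral profile: a few characteristics of the text itself plus a few characteristics of the retrieval step. The field names (answer text, document id, timestamp) are assumptions for the sketch; a real pipeline would use whatever it actually logs.

```python
import statistics
from datetime import datetime, timezone

def behavior_profile(answer: str, retrieved_docs: list[dict]) -> dict:
    """Reduce one RAG response to a handful of behavioral properties.

    `retrieved_docs` is assumed to look like
    [{"id": "doc-123", "timestamp": "2024-05-01T00:00:00+00:00"}, ...];
    swap in whatever fields the pipeline actually logs.
    """
    words = answer.split()
    sentences = [s for s in answer.replace("!", ".").replace("?", ".").split(".") if s.strip()]
    now = datetime.now(timezone.utc)
    ages_days = [(now - datetime.fromisoformat(d["timestamp"])).days for d in retrieved_docs]
    return {
        # Characteristics of the text itself
        "answer_words": len(words),
        "sentences": len(sentences),
        "avg_words_per_sentence": len(words) / max(len(sentences), 1),  # crude readability proxy
        # Characteristics of the retrieval step
        "docs_retrieved": len(retrieved_docs),
        "median_doc_age_days": statistics.median(ages_days) if ages_days else None,
        "cited_doc_ids": sorted(d["id"] for d in retrieved_docs),
    }
```

None of these properties says whether an answer is good on its own; the point is that trending them across many responses is what surfaces shifts like "it stopped citing the documents it used to cite."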
And as we start to get into these agentic systems, and a lot of teams speak about this, I know Google's deep research team spoke about this recently. It's like, well, in the reasoning step, how long did that take? How many reasoning steps did it take? All of these are characteristics for how the model is actually behaving online.
On the way to that end answer, on the way to that performance metric. And I definitely don't want to say that performance doesn't matter because it definitely does. And that's an aspect of behavior. You want good behavior. You want good performance. You want consistent performance. But it is fundamentally this very high level bit.
And it can mask all of these underlying latent behaviors that could have an effect on the system. And especially because these are inherently non-stationary and chaotic systems, you want to be able to catch those latent behaviors as quickly as possible. And some of them might actually turn out to be things that you want to be performance metrics, much in the same way that with traditional machine learning fraud detection, you want to be as accurate as possible, but
by detecting that you have bias, maybe you want to factor that into your performance metric over time. And this is about adapting that desired behavior over time as you learn more and more too. Sounds like the first time I ever did a job interview. It's not just what you say, but how you say it. Exactly. And then what you do as well too, because at the end of the day, if you have a lawyer or something...
Having a high score on the LSAT's great, but that doesn't tell you anything about how they're going to behave in the courtroom. So AI, you know, bedside manner or courtroom gravitas. And there's so many classic examples, too, from traditional ML and AI. And you'll remember some of these, I'm sure, too. You train a reinforcement learning algorithm to, like, pitch a baseball. And it decides that just running the baseball up to the catcher is the most effective way to do it. You're like, well, okay, I did just tell you to get the ball in the mitt.
You technically solved it, but that's not great. But even in these Gen AI systems, like there's this classic example that came out recently with chess, where if you just tell it to win a chess game and you give it access to the board state, it's going to rewrite the board state to win. And that's an example of a behavior where, yes, if the performance metric was games won, it's got a great performance metric. But the behavior was it cheated.
One thing that we've seen that's really interesting over the last year, year and a half is people have started to shift from kind of science project prototype land where they have a bunch of individual teams trying to roll their own stack
and trying to build everything that they need to get something out individually into starting to build more centralized platforms. This is, again, very similar to what we saw in the early ML and AI days where maybe every individual data scientist was like, I'm going to spin up scikit-learn, and I'm going to have all my data locally, and I'm going to have some model. That works incredibly well for rapid prototyping and exploration.
But when it comes to exploiting these models and really making sure that you can do it at scale, making sure that the organization can protect itself, can really leverage all of its resources, you want to have more centralized tooling.
And so much in the same way that we saw the rise of ML and AI platforms over the last decade, we're now starting to see the rise of these Gen AI platforms. And this has organizational benefits from, again, making sure you can scale, making sure you have proper cost allocations. But it can also really tamp down on a lot of what we're hearing from CIOs and CTOs of kind of shadow AI. Right.
People calling models they shouldn't, feeding them things that they shouldn't, creating vulnerabilities that they shouldn't. And so this is the perfect kind of harness to start to do testing, though, as well, too. Because once you have the centralized Gen AI platform, once you have a gateway or a router that's logging all of your API requests, all of your traces, etc.,
Testing can live on top of those logs, on top of that data store, to give you this more holistic view across all of your applications. How are they behaving? Which ones are behaving differently? And provide that behavioral analysis and testing to the end developer for free as part of that platform. And so is this shadow IT problem worse with language models and generative AI compared to the past?
I would say so because everybody's doing it. You had to know what scikit-learn is. Yeah, exactly. And it was a somewhat localized problem because you're doing data science on your laptop versus now I'm just shipping off secret IP to some SaaS company or something like that. And it's really easy to do it. You just need an API key, basically. Exactly. I mean, it's great that all these...
developer tools exist and they've made it really easy. The downside is they've made it really easy for people to do things they shouldn't do either. And I think a lot of organizations are starting to realize that. So not only do you have the benefit of centralization and scaling and support and things like that, you're also mitigating the harm that sometimes is accidentally being done where it's like, oops, I guess our code base is now public.
or whatever it may be. So if I'm a technology executive at a big company right now, experiencing this phenomenon of little AI projects popping up everywhere, some of them doing really well and actually becoming important parts of our business,
What should I think about doing just practically speaking to kind of get my arms around this? And what should I think about building to reach kind of like this target, you know, state of like platform nirvana? So we see a lot of different variations of this, to be honest, across all the different companies that we're talking with or working with.
And I'd say there's two things that they're trying to solve here. One is trying to make a platform be useful enough to basically draw in these people who are saying, no, don't worry, I've got it covered. Like I've built my own stack. And so some of that is by providing value add services. So it's like, we're going to take care of scaling for you. We're going to take care of cost optimization for you. We're going to take these things off your plate.
And one of the things that they do is like, well, we'll build kind of a centralized router so you can access all of these great LLMs. And we're going to basically create a store so you can switch between versions and models and things like that. We can centralize logging as part of this. So you don't have to deal with the fact that this is producing maybe a large number of logs.
On top of that, too, you can provide testing. So now we can do this layer of making sure that you can detect and understand these underlying behaviors. And this is fundamentally different than the way a lot of people are rolling this today because if you're using one of these kind of all-in-one, out-of-the-box platforms focused on developers...
They allow you to get going very quickly, but maybe they're not logging everything. Or maybe they have very rudimentary monitoring that is just looking at a handful of performance evals. Or they have the ability to kind of look at individual inputs and outputs and do hand annotations. But it's very difficult to do that at scale. And so for these technology executives, they need to be able to come up with this way to create this kind of
lowest common denominator interface so that everybody can leverage it. But then they also have to provide that value add, and we're seeing this from a lot of executives, to entice people to move on to that platform. I mean, you made an interesting point about giving developers a reason to want to get onto the platform. If I now switch to my developer hat, it's like,
Consume logs for me. That sounds great. I don't know where to stick logs. Test for me. That sounds great. I don't like writing tests. Give me a store that somehow standardizes the interface across a bunch of different LLMs. That sounds less good to me. It sounds like you're just adding a layer between the thing I actually want to use. Just practically speaking, what pieces often come first? And how do you actually do this if this is your job? One of the first pieces that we see is this kind of gateway or router interface.
And some of it is less a value add and more of a, this is the only way we're going to let you access these things. It comes down to some of these models need to go through a GRC process. It's like, we're going to let you use OpenAI, but we're not going to let you use hosted DeepSeek or something like that. And
By centralizing that, they can then start to rein in some of the chaos. Here are the 30 different models that we do support and the 20 different versions of each that we do support. Oh, wow. So you're seeing enterprises will actually support that. Because I was sort of picturing in my mind, like, here are the two or three models you're allowed to use. But it sounds like you've seen one company may have 30 models available. People want to use the right tool for the job. And different models, different versions have different trade-offs, different costs, different context windows, different rate limits, all these sorts of things.
And we're also starting to see more and more people wanting to fine-tune or create SLMs or use more kind of static weight models as well, too. And so it ends up being this difficult infrastructure problem of hosting against a handful of non-stationary APIs, against a handful of internal models and things like that. Creating this uniform interface ends up being extremely valuable for that developer to be able to pick and choose, to be able to A-B test, et cetera. So one big complication is...
that sometimes the incentives are misaligned. So OpenAI obviously wants to create the best general-purpose foundational models. But an individual business may want a model that solves a very specific problem a very specific way very well. And a developer may just want to be able to get something stood up and integrated into their application as quickly as possible.
And so this platform engineer, this technology executive is kind of stuck between these two things where they need to be able to provide access to all of these cool tools and all of these great abilities while also making it as easy as possible for the developer. And so we've seen kind of
Two approaches here. One is where they will leverage kind of a generic platform from maybe a cloud provider, and there's many great examples of this from the various clouds, or build up from best-in-class tools. They're going to pick the vector database that fits their needs the best. They're going to pick the log store that fits them the best and the testing solution that fits them the best, etc.,
But fundamentally, they're trying to solve the same problem of fitting these two puzzle pieces together. Here's all the great research that's happening across the world and changing every day. And here's what my developers need to move their things forward.
How do I create this kind of universal kind of puzzle piece adapter in between and then provide enough value on top of that? And so that's a real technical problem. If I have 30, as you said, 30 models with variants and fine tunes and all, like that's actually a real problem that needs to be solved, not just I'm going to lock this down. Yeah, exactly. And something that if you were just trying to stand something up, you're just going to pick one route, one path, etc.,
After you've built that router or gateway, though, the very next obvious thing to do is say, okay, well, this is actually, there's a lot of data going through this. I should probably log that data somewhere instead of just throwing it away. And that's where they start to leverage sometimes more traditional data stores that they have in place today, where it's like, okay, we already have a way to log API calls or traces or some of this richer information.
But once you have those logs, then you can start to do analytics, testing, monitoring, and things like that on top of it. And so it starts to accrue this value and end up looking more like a traditional platform as well. And meanwhile, the developer starts to get some of these capabilities kind of off the shelf. And so they don't have to think about logging. They don't have to think about testing. They don't have to think about any of this. They can focus on
fine-tuning their prompts. They can focus on building these agentic systems into their current user workflows. I think a lot of developers, to your point, when they're trying to solve a narrow problem, kind of think they have it covered
Right. They've got a handful of test cases. They just kind of play with it for a little while. You know, what people used to call vibe check. I don't know if that's still a term in use. What really should they be thinking about? And from an enterprise standpoint, what should people be thinking about? We believe that they should be thinking about the holistic behavior of these applications at scale.
And so that's not just looking at a handful of performance checks on a small data set. That's not just throwing in your 100 favorite inputs and making sure that you always get your 100 favorite outputs. We talk to a lot of firms that are terrified to cross this AI confidence gap from, I've developed something that works well
in theory, to, how do I actually scale it up in practice? And a lot of times we'll talk to individuals who say, every single time I bring on a new user, every single time I add more data to this, it changes a little bit. And
Right now, I'm doing that incrementally. But when I turn on the fire hose, I have no idea what's going to happen. And I'm terrified about what that is. And this can create this gap where things languish in this prototype phase. They languish in this, I've got this great proof of concept, but I still am terrified to turn it on to a million users, 10 million users, whatever it may be, or turn it on to real enterprise value. And so I think...
The shift in mindset needs to be from, does it work on the cases that I want to look at, to, does it work more holistically? Do you have any good stories about things that have gone wrong when people don't test?
Yeah. I mean, there's a handful of stories where people think that they're doing the right thing, that, of course, this is a no-regrets value add to the system. And it ends up having these weird trickle-down effects. And so RAG has obviously become very prevalent in a wide variety of industries and people use it for a lot of different things. We've spoken with different firms
that were like, okay, I'm just going to continue to add more and more data to the corpus because more data is better. Like, of course. But this ends up messing up the retrieval mechanism. And so where before it all had very recent data and it was giving very good responses because people were asking about things that had a lot of recency, now they put their whole history into it. Now it's grabbing old stuff and
Pretending like it's new. 1902. Yeah, exactly. I wanted last quarter's earnings and now you're giving me six quarters ago's earnings. Or I wanted this specific entity.
And that was relatively unique to begin with. But now that you've flooded it with all these other things, I'm picking up all these things that are kind of around the edges of it as well, too. Have you seen anything released that upset all their customers or trading strategy that just was a bottomless pit of money? So thankfully, we haven't seen anything like that quite yet. But there's definitely...
I mean, hallucination remains a problem. And it's one of these things where the system can convince itself that there's evidence where there isn't any, or will convince itself that you want an answer that's different from what you actually want.
And sometimes that's because it's been given information that it tries to interpolate between. Sometimes it's just trying to fill in the gaps. Like fundamentally, that's what these systems are trying to do. And that can lead to really bad behavior for users. And it can lead to... Another example is...
People using the system in the way that they've always used it, but for some reason it's starting to trigger all these guardrails. And it's because intermediate parts of the system have transformed it or morphed it in different ways that all of a sudden it's flipping a switch somewhere. And now they're getting this terrible user experience because they're being told they violated some policy when in reality maybe they only changed a single word in their prompt.
Is your belief that AI systems should basically be tested atomically, like in the same way that we had unit tests for traditional software, you can sort of isolate each piece and make sure it's performing up to spec? Is that sort of the same idea for AI systems? I think you need to be able to quantify the behavior atomically, but then...
be able to test across that behavior more holistically, like a regression test. And so you do want to be able to quantify the characteristics of how the retrieval step is happening. You don't want to just look at the very end answer, but you also want to be able to see how changes in inputs have propagated through the system. I see. And so it's a mixture. And I know you've
done some pretty intense math. I'm resisting the urge to ask too many math questions because I know the two of us will go a little too far down the rabbit hole. But I know you have done some pretty sophisticated work as a team to think about the right way to approach these sorts of tests. Do you mind giving just a brief overview of why it's a hard problem and how you've addressed it? Yeah. So one thing that's a little bit counterintuitive, and maybe a little bit different than the way people approach evals today, is that we're not after a small number of strong estimators that can conclusively tell you whether or not something's performing.
Testing is a difficult problem because, I mean, just being able to quantify behavior and trying to understand what it is that's intrinsically happening within these systems is a very difficult problem. And this is where we differ a little bit from maybe traditional approaches to LLM evals.
Because instead of trying to come up with a small number of strong estimators for performance, where we want to be able to conclusively say A is better than B, instead what we want is a large number of potentially weak estimators to be able to determine whether or not A is different than B.
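One way to picture that "many weak estimators" idea, as a sketch rather than Distributional's actual implementation: compute a bag of behavioral metrics per response over a baseline window and a current window, then run a two-sample test per metric to flag which distributions have shifted, without claiming which side is better. This uses SciPy's Kolmogorov-Smirnov test; with many metrics you would also want to correct for multiple comparisons in practice.

```python
from scipy.stats import ks_2samp  # two-sample Kolmogorov-Smirnov test

def shifted_metrics(baseline: list[dict], current: list[dict], alpha: float = 0.01) -> list[tuple]:
    """Compare two windows of per-response behavior profiles, metric by metric.

    Each element of `baseline` / `current` is a dict of behavioral properties
    (answer length, document age, reasoning steps, latency, ...). Returns the
    metrics whose distributions look different, with no claim about which is better.
    """
    flagged = []
    shared = set(baseline[0]) & set(current[0])
    for metric in sorted(shared):
        a = [r[metric] for r in baseline if isinstance(r.get(metric), (int, float))]
        b = [r[metric] for r in current if isinstance(r.get(metric), (int, float))]
        if not a or not b:
            continue  # skip non-numeric or missing properties
        stat, p_value = ks_2samp(a, b)
        if p_value < alpha:  # the distribution of this behavior has shifted
            flagged.append((metric, stat, p_value))
    return flagged
```

Each individual metric is a weak signal on its own; the value is in the pattern of which distributions moved together, which is what points you toward a root cause.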
And you're asking a fundamentally different mathematical question at that level because having these weak estimators that are maybe higher entropy and things like that can give you an insight into the subtle shifts in the way that the system is behaving or acting or processing information that will then have some end result in performance that you actually care about. But being able to correlate that, being able to go back and root cause and be able to say my performance dropped
because this shifted, because this component changed. Here's evidence of what distributions shifted. Here's evidence of the results that are no longer the same. It can be an extremely powerful tool to not only give teams understanding, but allow them to actually react to this. Because one of the things we've seen is
When your performance drops, that's helpful because you know something must be broken. But it kicks off this net new research process to be like, okay, now I need to build again from scratch, versus, performance dropped because this thing is not what it used to be. That gives you this foothold to be able to go fix that. Or maybe not all change is bad. Maybe that's actually something
that you want to factor into your performance as well. I see. So you're almost saying it's like rather than a normal software stack where you're just kind of testing end-to-end what's happening, you're sort of saying you've got some little lab subjects in this glass tank and you need to attach all sorts of little sensors and probes to see exactly how it's behaving in order to kind of know what's going on. Exactly, exactly. And it's really upping all of those sensors and probes, basically. So don't just see whether or not
lab subject A versus lab subject B was able to complete the maze, but like, what was their heart rate? What were all of these types of things that would then enable you to say more? And so are you effectively doing statistical testing to sort of understand the change, you know, in each of these sensors from one iteration to the next? Or just, again, at a very high level, how does it work? So this gets into the name of the company. We think of all of these things as distributional. And so fundamentally, it's not about having a single input be bad that you might want to trigger on or something, but it's about,
holistically, how is this behavior changing in a population setting? How does one distribution of behavior today compare to a distribution from yesterday? And that distribution isn't just the distribution of performance, but this higher-dimensional distribution, this distributional fingerprint of behavior. How has that shifted? And that allows you to get these insights, which fundamentally allow you to root cause and
understand your models in a way that you couldn't otherwise. Which is very interesting because there's a lot of talk, especially with large language models, about what's in distribution versus what's out of distribution. And typically people are talking about what was sort of well represented in the training data versus what wasn't, which is important because these models tend to do well at things that were in their training data or similar to it and not so well at out-of-distribution things. I've seen very little actual kind of quantification of this, right? There's a few good papers about it. But what you're sort of saying is for any particular system,
meaning using a certain set of models in a certain way with a certain set of prompts. You can actually characterize literally what the distributions are and then sort of check how those change over time. Yeah, exactly. The distributions of the outputs, but also of the entire process
of getting to those outputs. And there's a lot of rich information there. And a lot of it's being masked today by only looking at individual things. It's very cool. I mean, it's really an industry-wide problem. And so what's the benefit to an enterprise in the end of deploying Distributional, or, you know, a testing solution in general? Yeah, so it's more confidence, which allows them to tackle harder problems, to be completely honest. We see some firms...
attacking the low-hanging fruit, internal chatbots to ask questions about HR, because they're afraid to take that leap to tackle the difficult problem, because it's so unwieldy and there is so much risk associated with it.
A lot of the most valuable use cases also have the most inherent risk. And testing and confidence is a way to understand and mitigate those operational risks, whether they be financial, reputational, or regulatory. Given what you're seeing in that regard, do our definitions of things like reliability need to evolve as we adopt more generative AI? Like, for example, how can someone be confident a system is going to act the way they expect it to if, say, it's never exposed to real-world conditions? Yeah.
So being able to define like what does reliable mean, it starts with change. It starts with at least is today different than yesterday? Like that's a question you can ask in an unsupervised way because you don't need to have any preference between the two. But then from that, you can start to say, okay, I liked this change. I didn't like this change. You can start to become more and more specific about what types of change are acceptable or not.
But just by being able to see differences, then you can start to apply supervision, start to apply a preference for one versus the other. And I think fundamentally, people are starting to kind of
accrue more and more sophistication. Maybe originally they were just replacing kind of classic NLP models with LLMs and things like that. And they're starting to push the boundaries of what used to be possible, not just making it better. Yes, it's easier to just hit an LLM than to do LDA or something like that. But now, especially with agents, they're starting to really see
this completely new frontier. But with that comes a lot of uncertainty, and with that you need kind of a flashlight to be able to say, this is how I understand this, this is how I can really get confidence. So kind of on the same topic, how should organizations think about change management, cost creep, and that sort of thing, or tweaking things, or I guess even swapping models, in order to settle on a system that both delivers what they want and that they can also trust?
One thing that we've seen with enterprises, especially over the last six to 12 months, is as they've started to roll more and more applications into production, they start to be able to make different trade-offs and they start to accrue tech debt, to be completely honest. And that doesn't exist when you're first building, but of course, like any technology, it accrues over time.
And so they're starting to try to make tradeoffs between like, hey, can I use a cheaper model? Can I clean up this system prompt that I've just appended to over and over again over the last year? But in order to make that decision, you need to understand what the tradeoffs are. What if I switch from this expensive model to this cheaper model? What if I refactor my system prompt or whatever it may be?
And so, again, understanding the performance impact is an aspect of that. But understanding how does this change the way that I'm tokenizing things? How does this change the way that this application is behaving can also lead to an ability to make these decisions in a more clear eyed way. And like traditional software, you have a build. You have a test suite when you're refactoring to say whether or not you actually made it cleaner or you broke the build.
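In that spirit, here is a hypothetical sketch of what such a "did I break the build" gate could look like for a system-prompt or model change, reusing the behavior_profile and shifted_metrics helpers sketched earlier; run_old and run_new are assumed callables that execute the pipeline under the old and new configuration.

```python
def prompt_refactor_gate(inputs: list[str], run_old, run_new) -> bool:
    """CI-style behavioral regression check for a system-prompt (or model) change.

    `run_old` / `run_new` are assumed callables that take one input and return
    (answer, retrieved_docs) under the old and new configuration. Reuses the
    behavior_profile() and shifted_metrics() helpers sketched earlier.
    """
    old_profiles = [behavior_profile(*run_old(x)) for x in inputs]
    new_profiles = [behavior_profile(*run_new(x)) for x in inputs]
    drift = shifted_metrics(old_profiles, new_profiles)
    for metric, stat, p_value in drift:
        print(f"behavior shifted: {metric} (KS={stat:.3f}, p={p_value:.4f})")
    return not drift  # the "build" passes only if no behavior distribution moved
```

In practice you might allow some metrics to move intentionally and only fail on the ones you have decided to hold constant, which is exactly the adapting-your-test-coverage loop described here.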
And by having good behavioral test coverage, you can now ask that exact same question with these Gen AI apps. Did I make it better or did I actually break the build? So you're saying my system prompts shouldn't be full of capital letters and exclamation points and pleading, please, please do what I ask.
Potentially. Whatever works for you. Again, the right tool for the job. Maybe you can go back and make it one long capitalized please instead of... There's a hero on Twitter, whose name I won't say, who extracts system prompts from all these things. And my biggest takeaway from this is that the system prompt often...
reflects the organization it came from. It's some new form of Conway's law. Conway's law is this thing that you ship your org structure as a software company. I think as an AI system company or a prompt writer, you're shipping your org too. It's like the Google one is very comprehensive and technical and dry. And then there's the startup one that's all over the place. So it's an interesting thing you say.
The system and even the prompt sent to a language model kind of reflects the organization it's coming from and the behavior they're looking for in a way. Exactly. And as you modify that or as you, I mean, it's nested system prompts all the way down, right? As you modify yours that's calling theirs, like you're going to get different behaviors. And it's a combinatorial mess, even just selecting a model, let alone you exploring the
system prompt space yourself. And again, having at least insight into the behavioral implications of your changes is incredibly important because
I mean, fundamentally, once something's actually making money, once something's actually useful, you need to take a little bit more of a conservative stance to make sure you don't break things as opposed to just trying to build something as fast as possible. Just love the idea of every new person, you know, it's like the engineer, then the compliance person, then marketing, everybody's just tacking on to the prompt. This poor, confused language model probably has no idea what to do. And some of them are conflicting and like, whatever it may be. Yeah.
Yeah, policies shift. We've talked to a lot of firms that are afraid to change that prompt because they were like, well, the GRC team wanted this line in there and this team wanted this line in there. And what happens if we combine them? But with a system like yours, basically, they're able to check
how things change once they do make some of these fixes. That's very cool. So one thing that seems to be true to me, at least, is that there's a pretty big delta between the people building a lot of the major foundation models, you know, in the sense that they're more researchers than traditional entrepreneurs, and then enterprise buyers. But at least with many previous tech shifts, like it's enterprise adoption and sales that ultimately provide the bulk of revenue for these new technologies. So how do you see enterprise users ultimately
exerting influence over product design? Or is this wave of AI sufficiently novel that even large buyers are just going to have to keep reacting to the model changes that the large labs introduce and maybe have less control or less influence than they might have had historically? Most of these AI labs are not enterprise folks, which is fine. You know, it's a lot of very, very, very smart researchers who know their field very well. But yeah, I'm curious if you see this interface, you know, because I guess some of the very big customers for the
big labs do have more influence than ordinary devs? I mean, I think it's going to be a co-evolution where obviously some of these labs are very focused on the cutting edge of research and pursuing AGI and things like that, but they also need to make money. And so they're going to adapt to what their users need, and then the industry is going to adapt to the tools that are available, and it's going to be back and forth. It's going to be like the finches on the Galapagos Islands. And
Overall, we're going to get some, I think, specialization. Certain models are going to come out. They're going to solve specific enterprise needs incredibly well. But in general, too, the enterprise is going to continue to adapt to, hey, I've got this new tool. Hey, I have access to this new thing. How can I make it fit?
But unfortunately, again, I feel like the platform owner within these enterprises ends up being this connector. And they need to be able to be this interface to this technology while also providing it in a way that's accessible, consumable, and understandable and ultimately fits the needs of the enterprise, whether that's through testing and auditing and things like that or just scalability.
That makes a lot of sense. And you're sort of saying that these enterprise platforms actually have a pretty important role to play in the whole industry. If I'm sort of understanding, right, because the labs put out what they put out. Developers love experimenting with things. Some of them work. But if there's nobody connecting the puzzle pieces, as you say, that actually could be a real problem. Definitely. And somebody needs to...
make sure that these models continue to work, these applications continue to work over time as well, too. And I've seen this with traditional ML and AI, and this happens in traditional software, too, where
The developer moves on and then it needs to be maintained. It needs to stay up. Yeah, because we don't have like AI ops teams yet, right? There's no dedicated group of people that just try to make sure the system is running, in the way that we do with sort of traditional, you know, DevOps and things like that. Who gets paged in the middle of the night when your AI bot just sold the office building by mistake? Exactly. And who has to figure out whose fault it was?
At the end of the day, too. Yeah, I think as we see the rise of these Gen AI platforms, we're going to see the rise of more AI ops, the people who have to make sure the system's working and understand when it isn't and then fix it. We talked about global versus local solutions to this reliability problem. What's your view? What should be solved kind of by the industry as a whole versus what needs, you know, this kind of like local context? Yeah.
Definitely. So I think there's aspects of this that are universal. Just being able to define some of these behaviors and detect large scale changes in them, being able to set up a system that's able to give you that level of insight.
again, at that extremely high level. But then every individual team has different behaviors that they want or things that they don't want. And so very quickly, you need to be able to take from a system like ours an ability just to detect change
at a global level, and then be able to go through a workflow to adapt that and build better behavioral test coverage that becomes more and more specific and bespoke to your individual application. But fundamentally, that same
global level can help many different teams within an organization or even across different organizations get that foothold. And that's what we're trying to develop with our platform. But then beyond the platform itself, that workflow is about fine tuning it and specifying it to the behaviors that you care about individually.
And there you have it, another episode in the books. As always, if you enjoyed this discussion, please do share the podcast far and wide. And keep listening because we have some great discussions lined up in the weeks to come.