
Making AI Reliable is the Greatest Challenge of the 2020s // Alon Bochman // #312

2025/5/6

MLOps.community

AI Deep Dive Transcript
People
Alon Bochman
Topics
Alon Bochman: I think building reliable AI systems is the greatest challenge of the 2020s. We shouldn't blindly trust authorities or popular opinion; instead, we should take a data-driven approach, running experiments to validate different models, prompts, and configurations and find the solution that best fits our own task. Evaluation of an AI system should start with the simplest setup and add complexity gradually, testing and improving continuously. Users will use AI systems in unexpected ways, so we need to keep testing and improving the system to handle all kinds of situations. How many evals you need depends on the complexity and stakes of the system and on user feedback. Evals can become redundant, so we should focus on the critical paths and the edge cases. Early in a project, LLM-generated eval cases help you iterate quickly; later on, human evaluation should play a larger role to cover the edge cases. To make effective use of subject matter experts' knowledge, we need to create a learning feedback loop in which the LLM judge and the human evaluators learn from and improve each other. The LLM judge can be improved by fine-tuning or by updating its prompt. When there are multiple subject matter experts, techniques such as cluster analysis can normalize their feedback and surface the areas where experts disagree. AI's answers are not always black and white; we have to judge case by case. AI can help us capture and apply domain experts' knowledge more efficiently, improving efficiency and creating value. Through LLMs, we can apply domain experts' knowledge and experience at scale, improving efficiency and the user experience. When building AI copilots, you must make full use of subject matter experts' knowledge, or it will be hard to succeed. To speed up learning, you can first train the LLM judge on general-purpose datasets or cases and then bring in the domain experts' knowledge. The LLM judge's evaluation criteria will keep changing as the system evolves and need continuous refinement. We can use visualization to present AI outputs, which improves user engagement and evaluation efficiency. The ways we visualize AI outputs will keep improving over time, from simple metrics to richer charts and visualization tools. Demetrios Brinkmann: (Demetrios Brinkmann mainly guides the conversation and asks questions; he does not put forward distinct core arguments of his own, so his points are not summarized here.)


Shownotes Transcript

So it's Alon Bochman. I'm the CEO of Ragmetrics. Our website is ragmetrics.ai. And I don't drink coffee, it's true. Sugar is my vice. We talk deeply about how to bring the subject matter experts into this eval life cycle, if you can call it that.

We also talk deeply about evals, as you could guess since we were on the topic. He is doing some great stuff at Ragmetrics on LLM as a Judge. Let's get into it, and a huge shout out to the Ragmetrics team for sponsoring this episode. Dude, I just want to jump right into it because you...

said something when we were just talking before I hit record. And it was all around testing each piece of the entire system. And I am always fascinated by that because there are so many

different pieces that you can be testing and you can be looking at. And I know that a lot of folks are trying to go about this. You had mentioned, hey, do you want to try and use big chunks or smaller chunks? Do you want to try and use one agent or 10 agents? So there's all this complexity that comes into the field, and especially when we start having more and more agents involved.

How are you looking at that? How do you see that? And how do you think about testing all of that? Yeah, absolutely. So first of all, let me just talk about the approach a little bit, because I think that a lot of engineers out there are...

you know, we're all new to this thing. And so it's natural when you're new to something to just kind of take somebody's word for it. So we read what, you know, maybe what the big labs have to say about how to prompt our models, or we read what the

what the vector database vendors have to say about how to set up, you know, vectors or we read, you know, and then there's maybe like a YouTube influencer that has like an idea. So, you know, if they're popular, if they're charismatic, then we listen to what they have to say. And that's all fine. Like, you know, we should listen to everybody, but,

You probably didn't get into engineering just to take other people's word for it. You probably got into it because you enjoyed building. You want to see it for yourself. And this is an amazing opportunity to do that. Like amazing. Because let me tell you, Demetrios, nobody knows what's going to work for your task. Like definitely not me, but definitely not OpenAI, Anthropic. They don't know. You know, Weaviate, they don't know. Like nobody knows. We're inventing this as we go. And that is awesome because we're

If you get into the mindset of, I'm going to follow what the data tells me, then you will come up with some amazing solutions. And that is how like every magical product that we hear about that I know of, you know, I haven't spoken with all the teams that are building magical products, but like every company

Every time there's a magical product, there's a data-driven process behind it where the team that's building it is not taking anybody's word for it. They're actually like trying out the different models, trying out different prompts, trying out like different configurations of an agent. And, um,

Sure, you can't try everything. You definitely can't just shut the world out. You have to learn. You have to build on the shoulders of giants. But you have to build. And it's awesome. It's an awesome privilege to build. It'd be super boring if we figured it all out. We'd probably all move on to some other place. So this is the good part.

Yeah, there would be a lot less hype influencers if it was all figured out already. That's right. You know, I think that, let's take, I don't know, a RAG pipeline, a typical RAG application that you're building. So if you are following sort of this empirical philosophy,

You'd probably set up something that is pretty common, right? Maybe you'd set up the easiest one. So you take the most popular model, take the most popular vector database, maybe you have like one prompt, one task. And then you think through sort of what do you want out of that?

step in the process. Like what outcome are you looking for? Regardless of what technology you use, think about like, what does it mean to generate value from this? What are your users like? What are your requirements? And then in my opinion, you set up an evaluation. All an evaluation is, is, you know, you set up a couple of examples. If a user asks me this, this is what I want out. This is what I don't want out.

And you come up with, like, it doesn't have to be a massive amount. It can be like 30, 40 of them. It's totally possible to write out an eval in a couple of hours. It's not like a major project. But it is a major sort of journey of self-discovery, because you're thinking through what you want. It's a way of taking what you want from your head and putting it on a piece of paper. It's not unlike writing a requirements document, but it's like requirements for AI.
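To make that concrete, here is a minimal sketch of what such a first eval might look like in code. This is not RagMetrics' format or anything Alon prescribes; the case fields, the ask_model callable, and the substring checks are all illustrative placeholders, and a real harness would quickly grow past this.

eval_cases = [
    {
        "input": "What's the refund window for digital purchases?",
        "must_include": ["14 days"],          # what you want in the answer
        "must_not_include": ["no refunds"],   # what you don't want
    },
    # ...30-40 of these is a perfectly good starting point
]

def run_evals(ask_model):
    # Run every case on every change, so a fix for case 21
    # can't silently break cases 1 through 5.
    results = []
    for case in eval_cases:
        answer = ask_model(case["input"]).lower()
        passed = (
            all(s.lower() in answer for s in case["must_include"])
            and not any(s.lower() in answer for s in case["must_not_include"])
        )
        results.append((case["input"], passed))
    return results

Even something this crude gives you the benchmark described next: swap the model or the prompt, rerun, and compare pass rates.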

And then you have this great thing: now you have a benchmark, and you can actually benchmark. Different models can give you really different results. You'd be surprised. And OpenAI and Anthropic and Mistral, they have no clue which of their models is going to be the best one for your task. I assure you, they don't know. Nobody will know until you run this benchmark on your task. And the same thing with all the other pieces of the pipeline. Yeah.

The funny piece is you have that trial and error and you have to test it out and see in a way, like how few examples can you give to get a bit of a signal and at least know directionally we're going the right way. But then...

Later, to get that last 10%, it is just mayhem. And that's why I think it's fascinating tracking all of the changes across the whole life cycle. And it reminds me a little bit of Jupyter Notebooks and how you have final.whatever.ipynb. And here it's, in a way...

That exploration is happening just like you were doing this data exploration back in the day. And so now you're trying to track it and you want to know each little thing that's changed and you want to see against the eval set. How are we doing here?

Right. I think it's not unlike building code, right? You start with, you just start with a happy path, probably. Maybe you're starting with a demo and it's like famously easy to make a demo with AI, right? Because it says some crazy stuff. You try enough things, it's going to say something amazing. If you just show that, like you don't show any of the other stuff, you're going to look fantastic. It's not hard to do, right? So you make a demo, maybe you get a little bit of buy-in, you get a little bit of support, maybe some resources. Yeah.

And then you have the next step, which, you know, maybe some people call it hill climbing, but it's just like, you know, instead of just one happy path, maybe you come up with like 10 or 20 happy paths. So whatever it is that you're

chatbot or copilot or AI step is supposed to handle, just think of a couple of different examples. So it's not just one path. And that's usually also fairly easy. It's not as easy as the demo, but it's really not hard. You're just thinking through a couple of different, you know, examples. And then, like you said, like it gets bigger. Hopefully, you know, maybe you have a second demo, you have a

eventually you let users in. And when you let users in, it's just going to hit the fan. It's going to hit the fan in the most amazing, beautiful way, because they will put stuff in there that you would not imagine in a million years. You know, back when testing software became a popular thing, there was this guy who was like, I don't need no automated testing tools. He'd just bang his fingers on the keyboard, like seven keys at once.

And it would break a lot of software that way. I'm talking about 20, 30 years ago. That was a really good way to test software. You just bang the keyboard. Well, that's what users do. They will really bang up your bot in a way that you just never imagined. And it gives you the scope to make it a lot better. And like you said, you know, the challenge is that it's very easy for an engineer to focus on

the edge case that you're fixing and like break a whole bunch of other ones because you don't

You don't know that you broke them. You just change the prompt. The prompt used to work for 20 cases. You're focused on case number 21. You're working really, really hard on it. It's not easy to shoehorn AI into doing what you want on case 21. And you finally get case 21 working. Yay. Oh my God. Cases one through five just broke. You have no idea until the next user comes in and tries that. So to me, the solution for that is to build up that eval. Yeah.

So every time that you run an eval, don't just run it on case number 21. Run it on all the cases. And then if you broke something, you will at least know right away. That's kind of like the first step to fixing the problem is knowing the problem. How many evals are enough? That is a bit like how long is a piece of string. It depends on, you know, it depends on what you're building. I mean, if you're building a, you know, I don't know,

an entertainment application, poetry generator, probably fewer than if you're building like, I don't know, a government service. True. Depends on the stakes, depends on the breadth of like functionality. In general, I would say like the, maybe the short answer is it's a dialogue with your users, right? So,

You want as few evals as necessary to keep your users happy. And the longer the thing works, like the more feedback you get from users. So it tends to expand. It's like scope creep, right? It tends to, users ask for more stuff. So it expands the functionality. And as you get more functions, you, you know, you add your evals. But another way to think about it, in my opinion, sorry, long answer for a short question, but like, okay, let's, let's say that you're, let's say that you're

comfortable with automated testing for regular software. Let's park AI for the moment. And somebody were to ask you, how many tests do you need in your test harness? You'd probably say, well, I have some intuition, because I have some idea about how complicated the code is, how many different code paths it needs to go through. But I'm also not so literal, and, um,

And I'm not such a novice that I have to get 100% coverage. Like it's not about checking the box. There are some paths that are more important than others. And so as a good software engineer, I'm going to make the trade-off. I'm going to try to figure out the, I have an idea of what's most likely to break.

What's the critical path that is like right on the edge of working? It's not so easy. If I mess up a little bit, it's going to break. And I have like, they're kind of like veins going through the body. Like I can trace them because I built the thing. And if I can only write like three, four, five, 10 tests, I know which are the critical cases that I need to cover. I think it's the same kind of intuition when you're building evals for an AI system.

Until you give it to the users and they teach you. Until that's right. And they realize, oh no, like, yeah, what if I, what if I ask it to, you know, play chopsticks? So, yeah. Well, I guess another way of phrasing the question is, can you ever have too many evals? I think you can. So first of all, I think that evals can be redundant in the same way that

automated tests of non-AI systems can be redundant. You can have this false sense of security if you've got, you know, a thousand tests, but they're all like really testing the same like 10 paths through your code. First of all, you're spending a lot of extra time that you don't need to, to get through your test harness. And just because you have that high count should not make you extra confident because you're not actually testing breadth.

So in the same way, you can have redundant evals that are not actually going deep enough and testing the edge cases that your users care about. So that's one way of too many. Another way of too many is you think of evals are kind of more interesting than regular tests, in my opinion, because they, you know, in the same way that you use RAG,

because you can't jam everything into the prompt, right? If we could jam everything into the prompt, we wouldn't need RAG, but it's expensive and needle and haystack and all that stuff. So we use RAG to kind of expand the memory or the mind of the model. In the same way, I think evals are expanding the knowledge base or the area of responsibility of the model

And just like with any kind of learning process, you want to have a healthy intake that takes in new cases. And you also want to be able to forget things; you need a cleanup. So maybe some evals are no longer relevant. The facts have changed. The law has changed. Your users have changed. The application's changed, right? It's a new year. Crazy things are happening. Some things that we believed are no longer true. We need to update our view of the world.

So to the extent that evals are a shorthand for your view of the world, you need to have like just a healthy digestion system. Stuff comes in, stuff comes out. Is that just every quarter or year or half year?

you're going through the evals and you're checking them? Um, so you, you definitely need a cleanup stage and the frequency of that stage, you know, it depends on how fast moving your industry is. Right. So, um, if you're doing constitutional law, right, it's probably going to be really slow moving. If you're doing, I don't know, like news, then, you know, yeah, you're going to, you're going to need to clean it up a little, a little faster. Um,

You don't want to evaluate current events based on stuff that happened like 10 years ago necessarily. So yeah, it's basically like an update speed. And it depends on how volatile and fresh your knowledge needs to be, in my opinion. I like that visual too of the trash collector or the garbage collector. You come through and you also need to remember to have the trash collector for your evals. Otherwise, you could potentially be spending money where you don't need to.

Here's another way to think about it. Think about if you're, you know, if you're planning out a product, you're writing out, you know, let's say you want to communicate with a team that's going to build this product for you. You want to write a requirements document. Now, you could think of,

There's many ways to go wrong when you're writing a requirements document. You could go too detailed. Like if you write a requirements document that is super, super detailed, first of all, it's kind of obnoxious for the engineers that are working on it because you're specifying things that are really like probably they know better than you. They could probably, if you empower them, they could probably make better decisions than if you just like spell out like all the, how the code should be organized and stuff like that.

You can also go the other way, and that's usually the problem: usually the problem is it's not detailed enough. And then there's a lot of misunderstanding. An engineer thought that you meant something, you meant something else, you used shorthand, they misinterpreted it, and they build something that doesn't work the way the customer wants. That kind of error, the not-detailed-enough error, in my career has been a lot more frequent than the too-detailed error. And I think that when you're communicating with AI through evals,

The overwhelming risk is not enough. Like, I think it is possible to have too many. But, you know, if we look at 80-20, you know, more than that, like 99-1 are not enough. 99% of the teams that I know don't have enough evals. Let's talk about handcrafted evals versus the...

evals that you just ask an LLM to generate for you. How do you view those and the differences there? So first of all, like it's a matter of cost, right? So let's put that in the context of what stage you're in in the project. So maybe, and we talked about some of those stages earlier, right? In the beginning, you just have the one happy path, you've got a demo, and then you've got maybe

Like a little plant, it's, I don't know, it's like a couple of weeks old. It's got a couple of leaves. And then you got this whole like tree and it gets bigger and bigger, right? So in the beginning when you're just working on the happy path,

you probably don't need a ton of evals. And when you're starting your evals, it's probably fine to have them synthetically generated. You just want to move quickly. You don't want to wait for other people. And it's not so critical that each one of them be letter perfect. You're just trying to grow the surface area quickly so that you get feedback faster, so that you can address it faster. It's like a learning exercise. So the faster, the better.

Over time, as you get exposed to more users, synthetic, in my opinion, loses some value because you have, it becomes more important to cover the edge cases. And there's more of them. And you have access to user data. And when you have access to user data, it's usually better than synthetic. Now, you could still get some help, in my opinion, like just processing the user data into something useful.

It's just a slightly different, I don't know what the right label for it is. You're not asking a model to generate from scratch, which is what I think of as synthetic evals, but you could be asking the model to say, okay, help me scan through this like massive ocean of stuff that the model has said to people. And I want to boil it down into the ingredients. Like, you know, take the soup and I want to get the ingredients out of the soup. I just want to like, give me the carrot, give me the celery, give me the Brussels sprout.

And I'm meaning that I want like a few examples that are different from each other. They're all orthogonal and together they collectively will give me the same flavor as the soup. And then you end up with, so AI can actually help you with that. It can help you boil down this like massive log file into themes, groups, and then like kind of reduce them so that you have like a few examples to work with, but it's different than generating it synthetically.
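One way to picture that "boil the soup down to its ingredients" step is to embed the logged user queries and keep one representative per cluster. This is a hedged sketch, not the speakers' implementation: embed stands in for whatever embedding model you use, and the cluster count is a knob you would tune.

import numpy as np
from sklearn.cluster import KMeans

def representative_queries(user_queries, embed, k=20):
    # Group semantically similar queries, then keep the query closest to each
    # cluster centroid as that theme's representative eval seed.
    k = min(k, len(user_queries))
    vectors = np.array([embed(q) for q in user_queries])
    km = KMeans(n_clusters=k, n_init=10, random_state=0).fit(vectors)
    picks = []
    for c in range(k):
        idx = np.where(km.labels_ == c)[0]
        dists = np.linalg.norm(vectors[idx] - km.cluster_centers_[c], axis=1)
        picks.append(user_queries[idx[np.argmin(dists)]])
    return picks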

Yeah, and probably also, I've never tried this, but it makes me think about it as you're saying it, recognizing where we have

fewer evals, where we're a little skimpy on the evals, and can you just help us augment that? Because we have a few users that have gone down this path, but we don't have a very robust eval set in that regard. So let's augment it with a few examples from the LLM. Yeah, absolutely. So AI can be like a sparring partner when it comes to growing your evals. Exactly like you said, maybe

maybe you noticed a couple of failures in a particular area that your application is supposed to handle, but there's only been a few user inputs. But you're anxious about it because you know how the thing is built, and you feel like there's going to be more and I'm just not ready. And so AI could be a really nice, safe sparring partner that could just elaborate on those examples, come up with harder ones. And it could set up this sort of dialectic, right?

You know, help me ask some questions that are going to be hard for me to answer. Well, it brings up this topic that we were also talking about before we hit record, which was how can you loop in the subject matter experts, especially the non-technical stakeholders, to craft better evals, to create a much more robust system?

Totally. So let me just maybe set up. This is my understanding of the status quo. This is how I think things work today.

Most teams don't test. In my experience, they just don't test. So if they even just tune into this conversation, listen to two minutes of it, thank you. Just thank you for that. I appreciate it. Hopefully they get a ton of value from just setting up the first eval with 30 examples. Amazing. That would be a win. But the teams that do test, they get into these ruts where on the one hand, you

you know, maybe they generate the data synthetically. Usually the people that know the synthetic tools are the engineers. They don't have the domain expertise. And so if they show the synthetic data to the domain experts,

The domain expert looks at that and it looks like a fifth grader did it. To them, it looks like you don't even know what you're talking about. I can't understand what this is. You're asking me questions about the footer, and, like, what? And so it makes engagement hard for the domain experts. On the other hand, if an engineering team asks the domain experts, hey, can you help us with an eval? They're just like, what the heck's an eval? That is not the word that they use.

So that hurts engagement. If you are able to have an hour with them and you show them what an eval is, and you show them the spreadsheet, then it's a blank page problem. Because nobody wants to start with a blank page. The columns are a little unclear. It's very difficult to get domain experts to start writing these evals. So the way that we think about it is, it's more about creating a learning loop, a feedback loop.

And the feedback loop should be that you have an LLM judge that evaluates what the application does. That's the typical LLM judge. But then you also have a human being that potentially evaluates what the application does. And that human being can either be the user or it could be the domain expert. It's up to them how much they engage.
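As a rough illustration (field names here are illustrative, not a product schema), one row of that loop can be as simple as a record that keeps the judge's verdict and the human verdicts side by side, which is exactly the "on the same screen" comparison described next:

from dataclasses import dataclass
from typing import Optional

@dataclass
class ReviewRecord:
    question: str                                    # what the user asked (X)
    model_answer: str                                # what the application said (Y)
    judge_verdict: str                               # LLM judge's score or critique of Y
    user_correction: Optional[str] = None            # what the user wanted instead (Y')
    expert_correction: Optional[str] = None          # what the domain expert wanted (Y'')
    expert_feedback_on_judge: Optional[str] = None   # e.g. "ignore the footer, grade the body"

Keeping all three verdicts on one record is also what later lets you measure judge-versus-expert agreement and mine fine-tuning pairs from the corrections.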

But it's really, really important to have the LLM judge scores alongside the human scores. So you have, let's say, the user asked X, the model said Y, and then the user said, you know, Y is not correct. I actually want Y prime. And the domain expert said,

Y is not correct. I actually want Y double prime. So it's really important to have Y prime and Y double prime like on the same screen. So you can compare what the model said, what the user wanted, what the domain expert wanted. And they can review each other. That's what makes, in my opinion, the feedback loop really powerful. What does it mean to review each other? So the domain expert sees what the LLM judge said.

and gives feedback to the LLM judge to make the LLM judge better. So, you know, hey, LLM judge, you keep focusing on the footer, but really you should be focusing on what's between the header and the footer because that's like the part of the document that we lawyers care about. Or, you know, hey, LLM judge, you keep picking on my spelling, but just ignore my spelling and focus on the math. Whatever is the right thing for your application. And so the feedback from the domain expert

to the LLM judge can make the LLM judge a better stand-in for the domain expert the next time around. And that kind of stand-in quality, there's a statistical definition for that. So you basically can, you can measure the correlation between LLM judge decisions and domain expert decisions. And the higher that correlation goes, the happier everybody gets. So domain expert trusts the LLM judge more, the LLM judge is a lot more scalable, and

and good things happen. At the same time, domain experts need the LLM judge because sometimes they

the LLM judge can notice things that, just as human beings, we just don't have the bandwidth for. Because maybe there is an error in the footer. Do you actually look at the footer? Maybe it's got the wrong year in there. I don't know. You have all these websites that say copyright, like, 2022. It looks terrible because nobody looks at the footer. So my point is, sometimes the LLM judge can notice things where, as human beings, like, oh my God, I just don't have the,

There's not enough years until the heat death of the universe for us to look at that. So the LLM judge actually makes the domain expert better at their job too, because it's noticing that. And they say, you know what? Yeah, you should keep that. So that feedback loop makes the LLM judge better and makes the domain expert better. And the fact that they review each other

improves the chance that your domain expert will actually engage because they're getting feedback. Now, instead of a blank page, they give a thumbs up, thumbs down. They explain why. And the net effect, in my opinion,

is that this is really like, it's a way of sucking in domain expertise into your application. And there's not a lot of other ways for it to get there. If you think about it, all of these, the AI labs are solving for general intelligence, right? They want to make the best model. So they use general benchmarks, benchmarks like MMLU and theoretical math and physics and

You know, when we're building stuff for customers, like, who could care less about how well it does in theoretical? It just doesn't matter. So you need to know, like, I don't know, the last app that you built, you know, you need to know how it's, you know, whether it's extracting the right bits from an earnings release or whether it is being rude to a customer or not, or whether it's using the up-to-date HR manual or not. That's really what you care about. And there's no benchmark for that.

you need to kind of make it. That's why it's your software. That's why you get to charge the money. That's why people come to you. So when you're building one of those, the value that you build over time is going to be proportional to

how much domain expertise you can bring into your bot. And the domain expertise is going to come from your domain experts, your application, your knowledge, your inside-out view. And it's going to be different from mine, even if we're in the same niche. Like we might both be writing financial parsers, but you've got a different view on, you've got a contrarian finance view. And that, you know, that's why you have alpha. So,

When I'm building mine, like I can't just copy yours. First of all, I don't know yours. It's private to you. Secondly, like my customers are looking for something different. Like there's a reason why your company exists and why my company exists and we have like different business philosophies. That's why we're different companies. And so it's a beautiful thing. So this spurs a whole load of questions. I want to start with, are you just, when the expert is having that conversation

exchange with the LLM judge? Are you updating the prompt every time? How are you making sure that the judge then is able to solidify that and do it every time? Yeah, there's a couple of different update mechanisms. Let me go through the easiest one. The easiest one is just with fine tuning. It's really, it's like dead simple. So every time that a domain expert

reviews what an LLM judge said and says, actually, you're wrong about this part. It should be that part. You can construct a better LLM judge response based on taking the domain expert view and the original and combining them into like an improved, like what I wish the LLM judge response would have been now that I know what the domain expert thought about it. And so that becomes a question, an input output pair.
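A minimal sketch of assembling those input-output pairs into a fine-tuning file. The chat-style JSONL below mirrors common fine-tuning APIs, but the exact schema depends on your provider, and every field name here is illustrative rather than anything specific Alon describes:

import json

def build_judge_finetune_file(corrections, path="judge_finetune.jsonl"):
    # Each correction holds the graded question/answer, plus the verdict we wish
    # the LLM judge had given, reconstructed from the domain expert's feedback.
    with open(path, "w") as f:
        for c in corrections:
            record = {
                "messages": [
                    {"role": "system", "content": "You are an evaluation judge for our application."},
                    {"role": "user", "content": f"Question: {c['question']}\nAnswer to grade: {c['answer']}"},
                    {"role": "assistant", "content": c["improved_judge_verdict"]},
                ]
            }
            f.write(json.dumps(record) + "\n")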

And if you've collected, let's say, even 50 to 100 of those input-output pairs, you can fine-tune the model that you use for the LLM judge, and it gets better. It gets better and better and better. That's the simplest update mechanism. You can, of course, also update the prompt of the LLM judge. I think that it's a little bit quicker, maybe, than fine-tuning.

But it can get brittle over time, because the bigger the prompt gets, the slower the judge gets. And eventually you end up with this sea of edge cases and everybody gets scared to update that prompt. So I think it's really good for the early stages when you just have a few use cases. But over time, you probably want to switch to fine-tuning. Fascinating. Now, the other question that came up right away was when you have

different subject matter experts, how do you normalize all of the feedback? Oh, that's amazing. I'm so glad about it. I have this problem. So I was head of AI for this company called FactSet. And let me just give you this example. So FactSet, it sells financial data like Bloomberg. You've heard of Bloomberg. Oh, yeah. Imagine you're a financial analyst. You're building an Excel spreadsheet. And it's a financial model of a company. So you don't want to hard code the

sales and the income. You just want to pull in like the last five quarters. Yeah. And this way you could just hit refresh and it pulls that. So, so there's like an Excel formula you'll put into some cell to pull in the revenue and a different one to pull in the EPS and a different one to put in the expenses. And after, after,

building this Excel language for 40 years, you end up with half a million formulas. Half a million formulas, and absolutely nobody knows them. Nobody can even remember them. Nobody can remember what they're called. So, okay, so then users start calling you and saying, what's the formula for EPS? The manual is giant and out of date, they can't keep up. And, like, um,

And more and more users call. So like two thirds of the calls are from users just asking like, what's the formula for X? It's a super annoying, you know, thing for you to answer because it's a really, it's just like a lookup answer. But there's just a huge volume of users asking because, you know, nobody, the manuals can't keep up with half a million formulas. So then, and it's not so easy to answer because there are nuances. Like if I ask you, what's the formula for cash?

I might mean cash in the bank, or I might mean cash and marketable securities, like short-term debt. And it's arguable which one I mean. Cash is kind of a gray-area word. Some analysts interpret it one way, some interpret it the other way.

So when we wanted to build a copilot for this, we were like really, really excited about it. And the domain experts told us, you know nothing about finance. And so how are you going to build a copilot? We're like, we got this. So we built it. And we very quickly got up to like 70% accuracy. And we were feeling like rock stars, Demetrios, rock stars. And then...

But the domain experts basically said, look, we're not going to let your bot talk to actual people because it could mess up. We cannot have the wrong answer go out. So we're going to have a human in the loop, and the human is going to look at what your bot says, and they're either going to accept it or reject it, and you'll know. And they kept rejecting the cash question. We kept getting it wrong. And then when we looked at the data, we realized that it was like an edit war in Wikipedia, where you have two people

They're not talking to each other, but they're sure that they're right about what cash means. And they keep overriding each other's edits. So no matter how we update the prompt, it's never right. The bit just keeps flipping. Oh, cash means this. Oh, no, cash means that. Oh, you know, my mother, my sister, my mother's sister. So it's like, this is why we cannot get above 70% accuracy. And I realized that 70%,

is the threshold at which people agree with each other. And 30% is actually the area, that's actually the gray area in the system. And this is an amazing opportunity because these people, the agents that are actually answering these questions for people, they didn't know that there was a 30% area. They thought that it was just all black and white.

It's like an Excel formula. There's a right one and there's a wrong one. But it turns out that we have a unique ability to organize this knowledge base better than it was before the co-pilot started. So to come back to like machine learning terms and like the nitty gritty,

We did a cluster analysis where we embedded the questions and we identified clusters of similar questions, semantically similar, that had opposite answers. And that's how we were able to find this, like, all of these people are asking, you know, where's cash? What's the formula for cash? Tell me cash. I want cash. Or sometimes they don't use the word cash, but they use a close synonym for cash. And all of those questions got grouped together

purely through semantic clustering, through the embedding model. And then we basically say, okay, the clusters that have consistent answers, I don't care about those. Those are good, solved, check. I want the clusters where it's a cluster fuck, right? I want the clusters where it's like all different answers. They're all over the, the questions seem identical and the answers seem all over the place, right? How could that be? And then, you know, so we, but the beauty part is we could identify those automatically, right?
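A hedged sketch of that detection step, roughly as described: embed the questions, cluster them, and flag clusters whose answers disagree. Here embed, the distance threshold, and the string comparison of answers are all placeholders you would swap for your own pieces.

import numpy as np
from sklearn.cluster import AgglomerativeClustering

def find_disputed_clusters(qa_pairs, embed, distance_threshold=0.3):
    # qa_pairs: list of (question, answer) as actually given by the human agents.
    vectors = np.array([embed(q) for q, _ in qa_pairs])
    labels = AgglomerativeClustering(
        n_clusters=None,
        distance_threshold=distance_threshold,
        metric="cosine",      # sklearn >= 1.2; older versions call this `affinity`
        linkage="average",
    ).fit_predict(vectors)

    disputed = []
    for c in set(labels):
        group = [qa for qa, l in zip(qa_pairs, labels) if l == c]
        answers = {a.strip().lower() for _, a in group}
        if len(answers) > 1:   # same question, conflicting answers: send to the experts
            disputed.append(group)
    return disputed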

We didn't know the right answer automatically, but we could identify disagreements between domain experts automatically. And this was hugely value added because they didn't have a process for finding those before. And then it's just a matter of like basically getting the two sides engaged.

Like, how do you solve a Wikipedia edit war? You get the two people in a room and they start to talk to each other. And like, hopefully one convinces the other. And then, you know, then the edit war settled. So we kind of did that. There's like a little bit of plumbing. There's a little bit of organizational design. There's a little bit of, you know, ML programming. And the knowledge base got much, much better. And we were able to push that gray area from 30 to 25 to 20 to 15 to five. And that was not just great for the

for the humans that were dealing with all these user questions because now, you know, now they can have consistent answers. But it was great for the co-pilot because now the co-pilot could clear up that knowledge base as well. But there are going to be scenarios where you do not have a black and white answer, right? That's right. And so are those, I've heard folks say, steer clear of those with your AI. It really depends. Look, I think AI is not going to be any more black and white than people are.

Right. So we can't expect AI to give us black and white answers to questions that are just part of the human condition, where there's no right answer. Just because there's a new technology doesn't mean that suddenly there's a right answer where there wasn't one before. But we can generate a boatload of value.

If we go to the areas of human knowledge that are where like most people agree and there's value to giving the answer that most people agree on. And that value, like if it's very expensive to get to that answer today because you need very experienced, you know, expensive people to do it and you're able to get the 80-20.

with, you know, 80% of the accuracy, of the value, of the area of agreement, at 20% of the cost, 10% of the cost. That's a huge value. We don't have to all agree on what poetry we like. We could just focus on, whatever, extracting data from financials. You know, like all the use cases that we know of for AI copilots. And I think, um,

You would think like, how does a company compete in this kind of area? Like, how does an engineer compete? But also, how does an organization compete? So there's going to be, there's usually some secret sauce to every business.

And right, you know, in the beginning, that secret sauce is in the head of the founder, in the head of the partners, the senior, let's say, 5%, 2%, 3% of the people. And if there's a way to scale that, basically to apply that, not to publish that secret sauce, but to apply it at scale so that the bot applies the same level of nuance and the same level of special sauce that the senior partner does, that's a huge value unlock for the organization, right? Yeah.

Like in my example, there were probably a couple of people that had a really good, nuanced understanding of cash. It's just that that value that was in their heads was not communicated to all the callers. So it was basically a lottery. You'd get somebody who thought it was one thing or the other. You wouldn't even know. You wouldn't know that you're getting part of the value. But as a company, we all wanted to figure out what is the best answer and give it to everybody, even if the best answer includes some options. Right.

And this kind of process is a way not just to make the AI more accurate, but to help domain experts like kind of settle that value pyramid. Then when you talk about the nuances in someone's head and then being able to translate that to reality, for lack of a better term, the thing that instantly came to my mind was, wow, I could do that with folks that I get documents from.

And sometimes one thing that I do not like doing is getting on calls with folks if it could have been an email or if it could have been a document. I would prefer for it to be a document first. And then I'm happy to get on a call afterwards once we have established what it is that we are talking about. Right. And what the problems are, et cetera, et cetera.

What I find, though, is sometimes you can be working with someone and the document is amazing. And it's like, wow, this is incredible. There's...

so much here. It's very clear. It's really well put together. It flows and you can just comment on it and whatever you can't figure out in the comments, that's when you get on the call. Other times I'm like, holy crap, this document is a mess. And it's not a mess because I make a lot of documents that are just brain dumps and those are a mess, but it's a mess because it is so...

verbose or it is so AI generated that it hurts my eyes to even read it and then it's not clear what is this document for what are we even doing here where's the meat and potatoes of this whole document and so I was thinking about

I wonder how I can start getting the ways that I enjoy, like documents that I really appreciate, using those as my eval set and then saying, all right, anytime anyone brings me a document, just ask, run it through an LLM and say, here, I want to make sure, how can I make this better so that it aligns with these documents or so that it's more in Demetrios' style? Yeah.

A hundred percent. You know, I mean, Demetrios, you, hands down, built an amazing community. There is a special sauce to building a community like that. I don't know how to do it. So right now, in order to get that special sauce out there, you've got to put your own personal touch on it. And I can tell your style, because I read some of your LinkedIn posts and I've been to some of your conferences. It's distinctive.

Um, but you and your community would benefit a lot if, uh, in addition to when you're able to be there in person, you could figure out a way to scale your style to like all the touch points that are just like, you only have 24 hours a day. You got to sleep like I do. So if there was a way to scale your style, your community would benefit.

grow that much faster, get that much more value. It doesn't, it's not about replacing you. It's about like just taking the, the thing that's unique, the thing that made you guys grow and making it accessible to people, even if like, whatever, you know, even if they can't be at the show. Yeah. There's something that I wanted to mention before we move on to around the whole subject matter expert piece, because you had said this before and it's worth repeating how you

traditionally when engineers are building SaaS products, it's okay if they are not a subject matter expert and they don't bring in the subject matter expert because at the end of the day, the users will suffer through it. And if it kind of helps and it kind of gets the job done, that's great. It's useful. However, now if you're trying to create a co-pilot,

and you're doing it without the subject matter expert, you're probably not going to get very far because it's not going to be that valuable. So the crucial piece here is, again, like hammering on this, how you can bring in that subject matter expert and make it easy for them so they don't have the blank page problem where they don't have to learn how to code and they're not firing off Python scripts or having to really update prompts to make it

So that is something that I really wanted to hit on because it resonates with me a ton. Yeah, thanks. Thanks, Demetrios. I think it's, yeah, like if you imagine, you know, you're building, I don't know, building SAP or you're building Salesforce.com, maybe it's an inventory management system or ERP or whatever it is. It's basically a couple of forms.

And all you really have to do as the engineer, you have to make sure that the data goes where it needs to go. Like it's pretty black and white. You just have to like not lose the data. And the form has to render, you know, it's pretty clear like when you've messed up. It's pretty clear. And you will on purpose avoid any kind of situations where, I don't know, you need to...

you need domain expertise. You will avoid those because you're going to try to stay to the objective piece. And in the AI world, it's kind of flipped on its head because the value is in delivering domain expertise more broadly, cheaply, scalably. So if that's the goal, like if you're trying to, and whether that domain expertise is finance, legal, architecture, whatever it is, like AI is reaching into all these places, that's really the new part. The new part is not

being able to draw a GUI from scratch. Like, yeah, it's kind of exciting when you talk to AI and you can draw a GUI from scratch, but we know what it's like to draw GUIs. Like, we've seen that before. The exciting part, the economic value, the new part, the unlock, is, you know, you deliver legal expertise that used to cost 500 bucks an hour. You deliver it for five bucks an hour or architectural expertise or medical or whatever it is, like stuff that people really, really need

and used to only be accessible to the few, and now it's way cheaper accessible to the many. That's the unlock. And in order to get that unlock,

you need a different skill set. Like that's, that's our challenge. We're so new. And it's the same challenge with every new technology. Like we, we got here because we're into the AI. Like we're both you and me, we're into the gears. Yeah. Like I love to see those gears going. Yeah. I just want to like jam my hand in there and like fix it. And, you know, I love playing with it. I love getting my hands dirty, but the value from those gears is domain expertise that

We don't really have like, so in a sense, the most successful features, the most successful products are going to be the ones where we can make the gears invisible. Everything that we're doing will be invisible and the user will just have the experience of talking to an experienced lawyer or, or, you know, or an experienced engineer or an experienced podcaster, right. Or, or community builder. Like we're, we're the, that's the magic, right.

And in order to get that magic, we have to get these domain experts like right in there with us. And my suggestion, my humble suggestion is that evals are the most friendly medium to get domain experts into your workflow. They're much more friendly than asking them to write stuff from scratch, to edit prompts. Because they're very flexible, they can grow to your needs.

It's so funny you say that because the last podcast that I did an hour ago before we were on here, um,

The guy was talking about how in his product he put, he went and he had this idea of some fancy AI that he was going to add to his product. And after talking to all of his current user base, he realized, wow, people just want like to save some money on their law fees or their legal fees. So maybe we should try and figure out if we can use AI for that. And they ended up doing it and implementing it. And it's been a huge success.

And so you, I really like that, like taking something that you're used to paying a whole lot of money for and then can you, maybe it's not a hundred percent, but it is at least trying to figure out a way to get expertise for much cheaper and figuring out how you can fit that into your product with the help of the subject matter experts. So yeah.

Now I want to jump on the topic of the LLM as a judge because we kind of danced around it with the subject matter experts and how they can help the judge. What I...

am fascinated by is how many different ways you can do this LLM as a judge. And I've seen some papers that come out and say, no, just one LLM isn't enough. You need all 12 of them. So it's like a jury or you need to be doing it where I thought it was hilarious. It's like some researchers are hanging around in the dorm room and having their pizza and beers and saying, well, what if we did

not one LLM judge call, it was two LLM judge calls. And then somebody else is like, what if we did three of them? And so you just keep stacking more LLM calls on top of it. Do you see, is there a place where that will kind of top out? I know that you've been doing a lot with this, so give me the download on it.

Yeah, that's an ICLR paper for sure, right? Just N plus one LLM judges. Like, you know, get me the seat. Okay, so before we talk about all the different ways of doing it, the how, I propose a metric for deciding. And the metric I propose is the human agreement rate. Because, like, okay, we're talking about, we're swapping tips. Like we...

you've read a paper, and there's lots of ways to do it, and we're swapping techniques. And anytime that you're debating, is technique A better or technique B better? Should I have two or three or five? Any questions that you have about which one is better as an LLM judge,

Like there has to be a way to answer those questions. That's got to be like more than whichever one of us is like, has the bigger audience or like yelling more or, you know, is higher paid or whatever. There should be a metric. So the metric I propose is whichever LLM judge approach reaches the highest human agreement rate. And it's not just like not a human off the street, but like the humans that matter for you. Maybe it's your users. Maybe it's your domain experts. We can define them in advance.

And we're basically trying to optimize for that. And, you know, we're ML people. We optimize. We understand what optimize means. So, okay. So we start with the same approach, right? We start with a simple LLM judge, probably pick a cheap model, pick a simple prompt, and we measure the baseline human agreement rate. So we have our eval set. We evaluate it with people. We evaluate it with the LLM. We measure the agreement.
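That baseline is easy to compute. A sketch, assuming you have matched lists of judge verdicts and human verdicts for the same cases; the same function also lets you compare configurations later, such as a single judge versus a jury:

from sklearn.metrics import cohen_kappa_score

def human_agreement_rate(judge_labels, human_labels):
    # Plain percentage of cases where the LLM judge and the human gave the same verdict.
    assert len(judge_labels) == len(human_labels)
    return sum(j == h for j, h in zip(judge_labels, human_labels)) / len(judge_labels)

# rate  = human_agreement_rate(judge_verdicts, expert_verdicts)   # e.g. a 0.70 baseline
# kappa = cohen_kappa_score(judge_verdicts, expert_verdicts)      # optional chance-corrected view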

And it's whatever, it's 70%. And then we could try things. We could try, like the jury could be good because different models have different strengths. So like for use cases where, let's say some of the questions are heavy math and some of the questions are heavy creative writing. Well, you know, maybe Claude is better at creative writing and maybe...

I don't know, DeepSeek is better at math. And so if we jury them together, maybe we can have a higher human agreement rate. Or maybe they'll just fight with each other and we'll have less consistency and the human agreement rate will be lower. My point is the empirical approach over the theoretical,

and try it with your data and with your eval instead of taking my word for it, your word for it, or the paper. The guy who wrote the paper does not know what's going to give the highest human agreement rate for your task. And if you can unlock 80%, 90%, 95% human agreement rate with your domain experts, you'd be amazed how happy they would get because it's a lot of work off their plate. They don't have to look at

Any of the, like, whenever the LLM judge expresses confidence, which is going to be 80, 90% of the cases, if they have high human agreement, then the domain expert is going to be like, I don't want to look at any of those. Just give me the exceptions. So immediately their work drops down a couple of orders of magnitude. That's awesome. So then would you say you have to always have those human-curated

eval sets before? Like, you can't just raw dog it with some LLM as a judge right from the... Yeah, you bring up a good point. There's a bit of a chicken and egg, right? Because, like we said earlier, if you don't have anything and you start by coming to your human domain experts and you say, like,

here's a blank spreadsheet and show me all the things that teach me law, basically. That's what you're saying. And they're going to be like, get the hell out of here. Like I, you know, I went to

I spent years, and I have clients, and I don't have time, I can't teach. I'm too important for this. Right. So I think in the beginning, yes, you should probably come up with the eval yourself, and you are the human that the LLM judge should be agreeing with. Because if the LLM judge does not agree with you on the first, let's say, 20, 30 examples in your eval set, that's really, really easy to fix. It's a very tight loop. It's like you yourself, you,

me, myself, and I, and I test a couple of things. If the LLM is doing things that I totally don't expect, I can just assert like, okay, maybe I don't need to be the world's greatest domain expert to deal with these like really simple few cases. I'm going to start there. And then once you have it, once the LLM judge behaves in a way that you trust on a small group of cases, which should not take long, we're talking about like days of work, maybe,

Maybe not hours, maybe days of work, less than a week. So if you have like some agreement on a very narrow base of knowledge, you could say, hey, domain expert, I basically got it as far as I can get it. Like I'm not a law expert, but it doesn't say totally crazy things that I don't understand. So I trust it. And I could show you some things that it says. If you trust it, we could just let it go. If you want to get in the loop, you can.

And I think they will want to. Like if you've cleared the bar that you can clear yourself, they're much more likely to engage at the next level, in my opinion. I like how you bring up this idea of we need a baseline of what we're comparing it to. So if we're using an LLM as a judge, what are we comparing it to? Let's start with something from me.

And then we'll bring in the subject matter expert when it's needed. But we'll get the basic stuff out of the way so that I'm not wasting some famous or some highly... Yeah, and there's a couple more hacks there. Like you're trying to sort of accelerate learning, right? So imagine that you're really like trying to teach somebody law without taking them through law school. So, okay, you could start with just common sense. There's also probably...

you know, there's probably some data sets out there that are available where you're just teaching them like basics. And then there might be a couple of examples where, you know, maybe your law firm has a different view on things. Like maybe your accounting firm has an idea that it's okay to take a chance in this particular and be aggressive in this like one niche because they have a variant view. Yeah. And,

Because you know that you work in the firm, you could just like build it in as a little, as an example that knowledge is plastic and it can fit to what the firm thinks and not just what the standard benchmark thinks. And you could just build one or two of those to show your domain expert that what can be accomplished and they will put in more. When you're grading the LLM judge responses,

I imagine there's different vectors that you can grade it on. How do you look at that? Right. We call that criteria in our product. And there are hundreds. And just like every organization has a secret sauce, every evaluation will have different criteria, what it means, what success means. So...

So I think it's a really juicy, challenging area for a lot of people. I think there's a lot of value to, first of all, having a library of criteria that's available off the shelf. And I don't think that it's enough to have five or 10. I think you need a couple hundred to cover all the different use cases, because they can really be very, very different.
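For illustration only (the actual RagMetrics criteria library isn't spelled out in this conversation), an entry in such a library might be no more than a name, a definition, and the judge prompt that operationalizes it:

criteria_library = [
    {
        "name": "grounded_in_source",
        "definition": "Every factual claim in the answer is supported by the retrieved documents.",
        "judge_prompt": "Score 1-5: are all factual claims supported by the provided context? Explain briefly.",
    },
    {
        "name": "uses_current_policy",
        "definition": "The answer relies on the up-to-date HR manual, not a superseded version.",
        "judge_prompt": "Score pass/fail: does the answer rely only on the current policy document?",
    },
    # ...a real library grows to hundreds of these, plus task-specific criteria you write yourself
]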

And then probably beyond the standard library, there's, you know, you will want to create some that are especially for you, like special for your task. So not just creating the eval, let's say the input and the output, but also creating the criteria, the prompt for the criteria and the definitions for the criteria are going to evolve as your system evolves. There's a lot of value to breaking down. Usually people will start with,

really broad criteria like accuracy. Accuracy is kind of a catch-all, but it means different things to different people. And the way I've seen it evolve, maybe you've seen something different, but the way I've seen it evolve is, you know, typically an engineer or engineering team will just define accuracy and they'll use like a non-domain specific way to define accuracy. And the domain experts will shit all over it.

They'll be like, well, this guy's got a five in accuracy, but it's totally missing that this guy's going to get sued. Or he's got a one in accuracy, but there's all these good things about it, and you should get partial credit for that. And that's great, because they're engaged, you have feedback, and the opportunity there is to break down the one accuracy criterion into probably three or four that you can observe from the pattern of feedback that you get.

That's a very valuable activity. So maybe they care about, I don't know, naming things correctly. That's one kind of accuracy. And maybe they care about the length. That's a totally different axis of accuracy, the level of detail. Maybe they care about the refusal rate. Maybe they care about how persistent you are, like how many times you've tried. So these are all different dimensions, and I find that the best way to elicit those is to show people

obviously wrong things, and then they jump in and correct it. And that feedback loop is a really, really good, it's a good flywheel to get people to think more about what they want. Yeah, it reminds me a lot of a talk that Linus Lee gave at one of the conferences that we had last year. And he was really banging on about how in music,

you can see sound waves in different ways. And you have, like if you're a producer, you have an equalizer that you can play with in the production or the DAW. And in filmmaking or in Lightroom for photographers, you have these histograms and they're so rich and you can change the photos output

by playing around with the histograms. And his whole thing was, how can we bring that

to AI output? Is there a way that we can now start creating more of a visual aspect of this output? And it makes me think about what you're saying here. There's all these different vectors and you have all these different criteria, as you call them. And so is there a way that we can visualize this in like different ways so that it is more engaging for folks to look at?

Yeah, 100%. You know, I think of it like an evolutionary process. We start with really big blocks that are really easy to put in place, and we get more refined and more sophisticated over time. And it kind of never stops. So, you know, it starts with accuracy, and then it evolves into maybe three or four criteria. It can eventually evolve into a checklist.

Checklists are super useful in a lot of medical scenarios, a lot of high risk scenarios where you know that you need to basically meet all these different criteria to be successful. It's hard to remember them all. It's hard for the best domain experts to remember them all. It's a perfect, perfect job for an LLM judge because they're tireless. They'll just be like,

super rigid and check all the things that you ask them to check, even if they're, you know, even if they've been doing it for a day. And humans are not like that. So it's a great way to scale a rubric, which is a set of criteria,

And scaling it makes things fair and it makes things cheaper and makes things valuable. And as an organization, you want to have a feedback mechanism about the rubric, not just about the eval for the same, like very same things. And we're, you know, we're,

We're finding that, you know, social media companies are revising their views about what is appropriate moderation. Yeah. So like there used to be a rubric of how to, you know, how to grade a social media post and it was implemented one way.

And for a long time, every post was graded on this rubric. There were people grading it. And now we're in a different era and it's being graded in a different way. The rubric changes. Yeah. So my point is that for a commercial application, LLM application, AI application, you want to have that flexibility as well. Your rubric can change.