Hi, I'm Tanmay Chopra. I'm the CEO of Emissary. We're an AI infrastructure platform for model optimization. And how do I take my coffee? I actually do a double-shot latte in the morning and then another in the afternoon, every day.
We're back with another MLOps Community Podcast. I'm your host, Demetrios, and today we got into how traditional ML systems can inform and help you level up your game on the new GenAI systems. Tanmay's
an original thinker in this space, and I really enjoyed the last part of our conversation, when Tanmay talks about the Pythonic universe and how you really got to have it when you are dealing in AI. Let's jump into this conversation.
I just posted something and it said, if LLMs are so smart, why are AI products so dumb? And I cannot take any credit for that because I completely stole it from one of the Agent Hours talks that we had. And it was from Srivu Shankar. But I think that resonates with you, right? What were you saying?
100%. I think that's kind of the big mismatch we're seeing right now, right? We see that these systems have so much potential, but realistically, in production, we're actually not seeing them realize more than a fraction of that potential. You're seeing this now from, you know, the leaders of large tech companies saying,
this is promising, but actually maybe not all of it works, which is a kind way of saying, yeah, most of it doesn't work right now. And so this sentiment of, you know, AI systems could be amazing, and AI systems are amazing. We're kind of in the gulf between the two right now. Yeah. It's so funny that you mention that, because just yesterday I was talking to a friend who was saying, you know,
apps won't exist in a few years, it's just going to be all agents doing stuff. And I'm like,
Dude, give me a break. I've heard this argument a few times and I really am having a hard time getting on board with it because let's just take my favorite video editing software like DaVinci Resolve. I know where all the buttons are on that that I need to press to get what I need done. I know the hotkeys. I know how I really like the visuals in the color grading, all of that stuff.
If now I need to just tell a large language model or language model or diffusion model, whatever model it is, a foundational model that I want you to edit this podcast and color grade it, then...
At the current rate, I don't think that is possible. And then when I have to debug it, it's going to take me more time than if I just used DaVinci from the beginning. So maybe helping me on a few steps is...
really where I see the value. But actually, what this guy was saying yesterday is, yeah, we're just gonna spin up apps, the agents will create apps anytime that you need something done. And I'm like, dude, that is very hard to grapple with. Maybe I'm pessimistic, but I don't think that's a reality that we're gonna see.
Yeah, I think there's this really interesting distinction there, right? Between like what's possible and what's probable. I think it's very possible that this happens. I think the probability of this happening in the next five years is probably not that high. And so that's how we think about it. If you think about ChatGPT now, we're...
on year three, right? Like that's kind of the, in the post-ChatGPT era. We're on year three. And if you were to go back two and a half years and say, hey,
in 2025, we're not actually going to have a lot of AI in production. We're going to have a lot of funded companies, but we're not going to have a lot of AI in production. Most people would be surprised, right? Because when this thing came out, everyone was like, hey, two-year timeline. We're going to automate our particular jobs. And what we've really seen is there's two personas, right? So you have sort of AI as your supervisor, and then you have AI as your assistant, right?
So the scenario you're describing is AI as your assistant. You're a really good video editor.
and you're now being asked to use this AI system for better or for worse. And the other scenario is you're a really bad video editor and you're being asked to use AI to just get started, right? You have no idea where to go. You have no hotkeys. You're sort of the zero to one user. And I think CodeGen is a really good example of this. So this is played out in front of me in like real life.
Most of the time, all of the, you know, fascination over CodeGen is coming from CTOs or CEOs, right? It's actually coming from people that don't write code as frequently as sort of a frontline engineer. And I love CodeGen, for context. I'm a huge fan. But now if you go talk to an infrastructure engineer that's dealing with a genuinely hard infra problem,
they have the experience that you're having, right? So given that they're an expert, this system actually slows them down. But it only slows them down in the context of infrastructure. If you ask an infrastructure engineer to do front-end engineering, it's actually amazing. They all love them, right? So for the bucket of tasks where you're not the expert, but you have some verification capability, CodeGen is really good. And it's taken up so much of that space.
But you really got to think about, like, okay, what is that user journey when you're not an expert, so AI is your coach, and then that user journey where you are an expert, so AI is your assistant. And in the latter, repeatability, determinism, accuracy, latency, these things matter a lot. An expert is not going to waste their time trying to prompt ChatGPT four times, they'll just go do the job.
But someone who knows nothing, they'll take that alpha, and actually they're learning in the process. So as the system fails, they're also learning. And so I think that's how I've been looking at these tasks more and more. So where I think, in my mind, from a user experience perspective, this whole thing breaks down, especially if we are...
Trying to do tasks that we're not necessarily experts in and we don't know all the lingo and we don't know the best ways of doing them is trying to explain what we want done. That is very difficult. Like I was just playing around with this tool this morning that
The whole thing is you create a prompt and it gives you a pipeline. It will create the workflow for you and then you can go in and you can tweak that. Yeah, it didn't really work. And even me explaining it and then going back and trying to prompt it better. And I knew what I wanted. I knew what the workflow was going to look like.
But inevitably, I had to go in there and create new nodes and create new calls. And so it's this low code, no code experience that I had with the GUI. And at the end of the day, it's like, yeah, cool. The prompt kind of got me there, but I'm not sure if it really saved me that much time.
Yeah, I think this is actually a big problem with generalized systems. Right. They're just not aligned enough. And so if you look at a model that's sort of trained on everything in the world, it's a great model for everything in the world. But it's not the best model for trying to do what you're trying to do. And so.
In an ideal world, right, whatever system was generating that pipeline out of your prompt would have seen hundreds of thousands of prompts and the ideal pipeline that was generated out of them.
Even then, it's not going to be perfect, but your edit distance from first output to what you finally ran is going to be a lot shorter. So there's a couple of things there. The first is inherently AI was, or ML, old school ML, was about trying to digitize systems and processes you could not explain. And so we are doing a little bit of morphing here, right, of like what this was meant for, right?
Machine learning was always meant for like, you give me inputs and outputs and I will converge to the right function. So don't tell me your process because there are processes better than yours that I will discover in this larger search space.
And so that was kind of why we used ML, right? If you think about all the use cases before, like I cannot write software to do recommendations. There is not enough lines of code that I can generate that would create a recommendation pipeline for every user in the world. Great. That's a machine learning use case. If you physically cannot describe enough of the process, it's an amazing machine learning use case. So we are morphing that by saying, hey, actually now you have to explain your process in the prompt.
Don't give me inputs and outputs. Give me your actual like chain of thought process. Right. So whenever you start writing sort of the chain of thought prompts. So that's one thing. And this is sort of a more general philosophical thing. There's two others. The first is, is the AI system itself good? Right. And then the second is, is it good for you?
And those are two very, not very different alignment vectors, but they are distinct alignment vectors. A lot of AI systems are still stuck at the first one. So they're not even thinking about hyper-personalization for the user. They're thinking about like, how do we be good?
And I think this is coming more and more to the forefront. This was sort of very popular towards, you know, 2021 in ML, but now very popular in AI, is evals, right? And not LLM as a judge. No knock on LLM as a judge, I've just never seen it work. But the thing is, you need to know what good looks like for you. The first thing we ask every customer to do on our platform is,
Just stop and think, right? Like, tell me what good looks like. Does it mean I have the right files in place? Does it mean I have the right text in place? Like, in your universe, what does success look like, right? And ideally, this evaluation is not gut-based. Because if it's gut-based, you're adding another layer of uncertainty.
So we now have some folks that are just not evaluating their systems. And then we have some folks that are evaluating their systems on vibes. And the bucket of folks that are sort of evaluating their systems deterministically is actually very, very small. But if you were doing that, everything would get better over time because then you can start using math. You can use backpropagation to start building systems that are better over time. So that's kind of where we think a lot about this.
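For instance, a minimal sketch of what a deterministic, non-vibes eval could look like in Python; the required files, phrases, and file names here are hypothetical stand-ins for whatever "good" means in your universe:

```python
from pathlib import Path

def eval_one(output_dir: Path, expected: dict) -> dict:
    """Deterministic checks: did the system produce the right files and the right text?"""
    results = {}
    # Check 1: the right files are in place (hypothetical file list).
    results["files_present"] = all((output_dir / f).exists() for f in expected["required_files"])
    # Check 2: the right text is in place (exact substring match, no LLM-as-a-judge).
    summary_path = output_dir / "summary.txt"
    summary = summary_path.read_text() if summary_path.exists() else ""
    results["has_required_text"] = all(s in summary for s in expected["required_phrases"])
    results["passed"] = all(results.values())
    return results

def eval_suite(runs: list[tuple[Path, dict]]) -> float:
    """Aggregate pass rate over a fixed regression set, so every change can be compared to the last."""
    scores = [eval_one(out, exp)["passed"] for out, exp in runs]
    return sum(scores) / len(scores)
```

Because the checks are deterministic, the pass rate is a number you can optimize against rather than a gut feeling.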
But why do you feel like people aren't using evals? Is it because they don't know what good looks like? Is it because it is a new product and they want to get it out there? And then once they get users giving them that feedback, then they can create that eval loop? What is it that
is missing? So I think it's two things. One, this is the obvious one, eval is hard. Right? So I don't think people are like actively choosing not to do eval. I think this thought process of what does good look like, in some cases, is hard to describe. In some cases, it overlaps between technology and business. So this is also why machine learning used to be hard, right? You've got people trying to do calc, and then you need to kind of map that to a business metric, because you don't actually have any signal
till the business metric comes out. So that's sort of that number one piece, right? That this is actually a really hard problem. I think number two piece is a lot of people are thinking about AI systems a lot more statically than we thought about ML systems, or at least did for the last couple of years. They would ship something that looked decent and then kind of say, okay, this is it.
That works in software. You can front load a lot of thinking in design and then you ship it. Because over the years, we've gotten really good at creating software abstractions that result in good output software the first time or the second time or maybe the third time. AI systems are inherently perishable, right? So you kind of have to think about from moment one, how are you going to keep retraining and improving this?
But this is very new. Just as a paradigm, it's very new. And so if you're a software engineer building AI, that's the thing I always encourage folks to think about: how is this different from your software development lifecycle? It's not one and done. It is inherently improving over time. That's a feature, not a bug. I remember there was a great article that came out from Google, probably in 2021, talking about continuous training.
And it showed the whole maturity levels that you could get to that would automate that retraining pipeline. And it's a little bit going back to that, even though now we're using large language models instead of these traditional ML models, you still want to look at it as as soon as you get the model out there, in a way,
You need to be ready or you need to have a plan for how you're going to keep iterating on making it better and what you can do to fine tune it. And I think a lot of people are having success with fine tuning prompts. I know that you don't necessarily like that idea.
I don't think I dislike that idea. I think I've just seen that sort of hit a ceiling with the early movers from the last two years. And so in some ways, I try to preempt folks struggling when they hit that ceiling. But actually, the approach we take is we say, don't fine tune till you have to. Right. So.
do the best you can in the prompting world and if it's good enough, like fantastic, let's help you maintain those systems through the retraining piece, right? So part of what our platform does is it actually streamlines the retraining piece because retraining used to be this most painful part of machine learning. If you have your own model, you're now like signing up to retrain this model month over month. And so this is actually why we discourage people from fine tuning their own models, right?
Because you need to know that it's a lifelong investment, right? It's like having a child where you kind of have to take care of that child month over month, year over year. You don't get to just like give birth to a child and then be like, I'm out. And so...
I actually don't, you know, knock on prompting or prompt fine tuning. I think it's more a function of you should be intentional about when you're doing that versus when you're fine tuning your own model. So the big challenge with any optimizations at the prompt level is that you can't change the objective function. So the model still only cares about the next word.
most enterprise tasks care about things other than the next word, right? You might care about some form of classification. You might care about, you know, you generate an investment memo and you're like, okay, did this memo map to reality? You don't actually care about the words of the memo, right? Or like words of memo one versus ideal memo. What you care about is like, what was the grounding in reality of memo one versus memo two? So yeah,
improving your model or your system, your AI system, based on your objective function is something you do not have access to until you start owning your own models. That's the only reason that we encourage folks to think about this: it can improve for you. Right. It doesn't just improve for words. It improves specific to your AI system objective. So the idea here is that you're doing as much work as you can with foundational models and then
And eventually you feel like going back to what you said before, there's these two vectors that we need to look at. Is AI good? Is AI good for you? Or is the AI product, I guess, that we wrap it up in, is it good? And is it good for your use case? You can get pretty far, I imagine, with the foundational models. And indeed, a lot of companies, that's as far as they get. Yeah.
their AI center of excellence says, we've got a win under our belt. Let's go play around with other stuff and wait till the new model comes out. But what you just described is very much like going back
back into the old world in a way, the old traditional ML world. So how do you look at that? Is it not just a ton of burden? And when do you want to use those types of use cases? Because I think when these AI centers of excellence are talking about the use cases that they're going to tackle,
If you bring up, oh yeah, we now have to do AI like we did ML, it gives people cold feet.
Yeah. So, you bring up a bunch of interesting things. So in terms of the burden side, that's why we're building Emissary, right? That's the whole purpose of our platform, to say, hey, it was actually really hard to build models before and it shouldn't have been. So let's try and make it easier, right? Like, how much easier can we make it? And our bar to success is really, can we make it easier than prompting?
That is the level that we're pushing to go to. We're pushing to go to, say, if you want to do a GRPO-based optimization, right? So the system that DeepSeek used to build their model. We want to make it as easy...
as it is for you to prompt. In fact, in the longer run, we want to make it easier because you don't have to be guessing around, removing a word here and there and being like, oh, did my prompt regress? You can just be like, hey, math, please take care of this. Optimize to this function. So burden is an infrastructure problem, right? That's something that, you know, even if we don't solve, somebody should and will solve. If the industry becomes... Sorry to cut you off, but I want to stress this point
You said, hey, math, go and figure this out. So it's not that it's someone that is pulling something out of their ass with their prompt and trying to explore latent space. It is math equations, which that resonated with me a lot the first time you told me it because it was like, huh, that seems like a much better way to systematically go about improving your product.
We have a deterministic way to reduce the loss, right? Like that's all of what machine learning was. We were like, hey, if you can define what good looks like into some equation, we can use that equation to start approximating outcomes based on your input. This is what all of machine learning was. Somewhere along the line, I think we kind of lost the plot and said,
Let's not use that. Let's generalize that, right, to the next word. And let's try and guess the series of words that result in the right word outputted instead of actually trying to use the math that we've been relying on for this long. But we made so many influencers so much money with their prompting guides. 100%.
I still think it's useful. So here's the thing. I think the real LLM unlock is you need a lot less data to bootstrap your first version, right? So if you wanted to start doing machine learning three years ago, your first call to action was, I need to go to, you know, Scale AI or Surge AI or somebody like that and get 100,000 samples labeled.
Now, you can say, hey, let me start guessing some words. And this is where those guides are really useful. And let's get us to V0. This V0, if it's good enough for customers to use, becomes our pipeline to get better every day for the rest of our lives. Our only burden is the infrastructure burden. And maybe you can, you know, check out Emissary for the infrastructure burden, or you can check out other sources. But there are people helping you with this infra burden.
And now you have this self-improving system. And I know there's a lot of founders or a lot of AI people that say, hey, why don't you just wait for the next GPT? Or what happens when there's the next GPT? And this is my standard response there, right?
They don't care about every task in the world. They cannot. They can care about every task in aggregate, but they do not care about your specific workflow. And so GPT-5 could be much better, but will it be much better for you is a very interesting question to ask. That's number one. Number two is if they do start caring about your task,
That's a whole other can of worms, right? So you see with Anthropic CodeGen, they started caring enough about optimizing for code. And so now you're competing with them. And you're not improving your systems based on your users' interactions, but they are. And so you're now in a universe where they came to your game six months or a year later, but because they're better at improving day over day, they're going to be better in weeks.
So actually, you want to be as orthogonal as possible in terms of better for you versus better from the foundational models if you want to keep space between what you do and a foundational model sort of trampling on you in some sense. Everyone has to go up against OpenAI or Anthropic at some level, right? Like that is, if you're building an application layer company,
at some point, they will come for your application. You can't simultaneously believe that codegen will make all software moats zero and then go out to public markets and say our moat is our software, right? Those two things cannot exist in parallel. So you either say codegen is going to be really bad, and so there will still be a software moat, or you say, oh shit, we need something more than just a software moat.
Not the start, but over time. It goes back to your last point or one of the points that you made before with the workflow solution that wasn't working up to standards. And that is because it is trying to be generalized workflows. It's not a here's a...
workflow that scrapes keywords from SEMrush and then goes out and looks on Reddit and looks at all the posts that are for those keywords and then looks at all the blog posts that have been written about those keywords and then creates a
an intro or an outline for a blog post that you can write, and then it creates the intro and the body text, and then it has that run through some kind of SEO optimizer. I'm just explaining the very deep, deep workflow that I've seen my friends in marketing create for generating a shit ton of SEO-optimized blog posts
from AI generated content, right? And so whether or not that is actually useful for the internet is a whole nother conversation, but it's out there, it's happening. It is super advanced. And that happens because a marketer or a few marketers sat down and they said, you know, it'd be great if we had all these different steps.
And then if I go and I say, I want a workflow that will do all these different steps, even if I know exactly which step is which and I prompt it or I try to prompt it, the model isn't going to understand or the magical workflow that is created from that isn't going to understand because it's, again, going back to what you're saying, they didn't have thousands of these steps.
And so it couldn't know what I was trying to look for because it is such a broad type of a product. It's not that verticalized product. Yeah. And what's really interesting there is you added another layer of uncertainty, right? So this is one of the...
I always, like, everyone laughs when I say this, but you should use as little machine learning or AI as you feasibly can to build your system. Right? And obviously this is counterproductive to my business. But the idea here is you had multiple layers of uncertainty. You prompted, and that's one layer of uncertainty.
And that resulted in the creation of a workflow. So you have what we think of as like orchestration uncertainty, right? And then that workflow needs to get executed. So these are different pieces that need to get executed. And so you have some level of execution uncertainty within each piece, right?
That execution is also volatile. So you also have component uncertainty, right? Now, if you look at the folks that are in marketing that are doing this, if they're marketers, the workflow is coming from them. So they have no orchestration uncertainty. They will say, do this first, then do this, then do this. They have an SOP and they'll execute on that SOP.
They still have execution uncertainty, but even within that, because they're experts, right? This expertise is sitting in their head. They're actually a lot better at aligning these systems. So the whole game now is just a question of like, who is the best at aligning the models? We've kind of flatlined in terms of how good we're going to get by throwing more compute at the problem.
to a large extent, right? And this happens every so often. And so I'm sure 10 years from now or five years from now, we'll have a much larger model that'll do much better than everybody else. What's really interesting is a lot of value is created at the layer above the foundational model, right? So when CNNs came out, it was huge. A lot of value was created. But actually most of the value is captured between the times
that one large model comes out and the next larger model that blows the last large model out of the water comes out. This is what we're very excited about, right? Is the actual value capture in this journey has always been the applied AI people. It's always the people who are like, these are the new tools we've gotten, this new large model, this new way of training. How are we going to orchestrate these things?
to actually generate value for someone that will be willing to pay for these systems. Ideally, more than what it costs you to run them. That's kind of what's really exciting about this period in time. But yeah, there is the more uncertainty you add to the system, the less likely it is that the system will do well. Okay, so getting back to the question I asked before we went on this epic tangent around
these AI centers of excellence thinking, oh no, I don't want to go back in time to traditional machine learning methods. Yeah. I think, um,
I would focus less on method and more on outcomes. The thing we've been encouraging centers of excellence to focus on is the bigger bets. And what I mean by that is it's equally hard to build an AI system for customer service as it is for something that will completely change the unit economics of your business more often than not.
And so one of the things we focus really deeply on when we start working with folks is what's the biggest problem you can tackle, right? If you were going to commit 10% of R&D spend to a specific problem or 20% of R&D spend to a specific problem, would it be worth it? If not, just buy the solution, right? Like there are companies out there that are doing amazing work
at doing the things that are not worth 10 to 20%. And because these AI systems are perishable goods, you're now putting somebody on the hook for maintenance every time you build a new system. If you can outsource that, you probably should. You want to build the stuff where your feedback loop is optimal. That's the shortest way I can put it. If you have a differentiated feedback loop, right? If you're an insurance company and you write insurance policies, like,
Figure out how you can deal with the insurance broker shortage by building a co-pilot for insurance workers. That's the alpha for you, right? Not trying to figure out like, hey, can I build a marketing AI tool? Because that thing is going to cost you a hundred bucks a month to buy. It's probably going to cost you like a hundred thousand dollars to get right. Right? Like the economics don't work there. I think invariably going back to your original question, I know I've been going a couple of circles, but I
AI systems are going to become a component in the ML system pipeline. So you're not going to forget classifiers, but your classifier might sit on top of an LLM. That's how we think about this universe standing out.
What does that mean? Sit on top of an LLM? Yeah. So say you're building a chatbot, right? Common use case. Right now, there is like so much work being done on trying to figure out how to get the LLM to only answer the questions that it should. The easiest way to do that is to build a classifier that says, should I answer this question or should I not? Stacks on top of an LLM. So basically upstream from an LLM. Question comes in, should I answer, should I not? No answer.
End the chat. Or, you know, politely end the chat. Yes, let your LLM answer. Instead of trying to get this LLM to be this one unit that's perfect.
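A rough sketch of that routing pattern, assuming a small Hugging Face classification checkpoint; the model name and the `call_llm` helper are placeholders, not anything the guest's platform ships:

```python
import torch
from transformers import AutoTokenizer, AutoModelForSequenceClassification

# A small "should I answer this?" classifier that sits upstream of the LLM.
# The checkpoint name is a placeholder for a model fine-tuned on a few hundred labeled questions.
CKPT = "my-org/should-answer-classifier"
tokenizer = AutoTokenizer.from_pretrained(CKPT)
classifier = AutoModelForSequenceClassification.from_pretrained(CKPT)

def call_llm(question: str) -> str:
    # Placeholder for whatever downstream LLM call the chatbot actually makes.
    raise NotImplementedError

def handle_question(question: str, threshold: float = 0.9) -> str:
    inputs = tokenizer(question, return_tensors="pt", truncation=True)
    with torch.no_grad():
        probs = classifier(**inputs).logits.softmax(dim=-1)
    p_should_answer = probs[0, 1].item()  # assumes label 1 means "safe to answer"
    if p_should_answer < threshold:
        # Politely end the chat instead of letting the LLM guess.
        return "Sorry, I can't help with that one."
    return call_llm(question)
```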
Just use old school ML to solve the problems that old school ML could solve. So I think there is that kind of fear. That's actually one of the things we've spent a lot of time doing is converting LLMs into classifiers, regressors, and other like old school models. So you can still use their inherent knowledge, right? That's what's exciting about them is the knowledge that sits between the first layer and the last layer. You can still use it.
But you can use it for your tasks. So now, with 500 samples, you can build a classifier. We're really excited about this universe, right? But I also understand that a lot of people are pretty hesitant to sort of slip back into old school ML. I don't think that's a... I think you just have to get over that fear at some level. Yeah. Well, it's interesting you say that, because it is a much different scenario if it is only 500 examples.
Even though at the end of the day, you got to go and you got to have those chops to know what you're doing. It no longer becomes I'm pinging an API. Exactly. And that transition is going to be rough.
But that transition is going to fundamentally change how you do business, right? Or within the AI world. Because now you can get a precision-recall curve. This is the most exciting part about a classifier. And I know I'm going to get assailed for saying precision-recall curves are exciting, but to me they are. Now you have error margins, right? Like, this is game changing. You go from a world where you just have to hope that the next token is the right one to a world where you're like, oh, if your confidence score is below 0.9, just don't answer the question.
And then you say, oh, we're making a lot of mistakes there. Let's raise that confidence score requirement. So if your confidence score is above 0.95, answer the question. Below 0.95, don't answer the question. That's kind of the world we need to be in, right? Because if you're an enterprise, you can't go out there and have no... AI is never going to be perfect, but at least you need to know roughly how often it's going to go wrong.
And it's interesting, like, the classifier is one example. The other example that I've heard, from the folks at Prosus for their AI agent, is they want to know if their agents have enough information to go and execute a task. And that feels like something that could be a distant cousin or a close cousin of a classifier. Exactly the same math problem, right? You're basically going to say, what's my threshold
confidence score above which I think we have the right answer. You train the model over time to learn, hey, when I'm below 0.9, I'm like never right. Okay, so I'm only going to answer above 0.9. It's like very simple to improve AI systems when you start adding thresholds to them.
Because then you can just work in the world where you're good, right? And you can say, hey, we'll default to the old workflow when we're bad. Right now, there just isn't that distinction. The LLM is never able to say no. That's the manifestation of this problem that everyone's seen, right? It answers with equal confidence when it knows for sure and when it has no clue.
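One way to pick that threshold from a labeled regression set, sketched with scikit-learn's precision-recall curve; the numbers below are toy data:

```python
import numpy as np
from sklearn.metrics import precision_recall_curve

# Labeled regression set: 1 = the answer was correct, 0 = it was wrong,
# plus the classifier's confidence score for each example (toy values).
y_true = np.array([1, 1, 0, 1, 0, 1, 0, 1, 1, 0])
scores = np.array([0.97, 0.95, 0.93, 0.91, 0.88, 0.86, 0.80, 0.72, 0.66, 0.40])

precision, recall, thresholds = precision_recall_curve(y_true, scores)

# Pick the lowest threshold that still hits the target precision: below it,
# the system defaults to the old workflow instead of answering.
target_precision = 0.95
ok = precision[:-1] >= target_precision
chosen = thresholds[ok].min() if ok.any() else 1.0
print(f"Answer only when confidence >= {chosen:.2f}")
```

Raising the target precision pushes the threshold up, which is exactly the "let's raise that confidence score requirement" move described above.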
That was the biggest problem that they ran into in the podcast that I did on their data analyst agent is that folks would ask it for data or how many partners have done XYZ in the last month. And they could know that information, but sometimes...
they wouldn't be given the right amount of detail. And so the LLM would come back with some kind of plausible answer. It would go off and it would be thinking for a little while and then it would come back and it would give you this full story. And if you weren't really intimate with that data, you would be like, yeah, that seems about right. Yeah. And so that is a really dangerous scenario because you base decisions off of that, which if you're not careful...
can be completely fabricated. 100%. That is the most dangerous chunk of answers is when it's plausible but wrong, right? Everything else is actually in that grid is actually pretty fine.
It's that part which is like, oh, God, what do we do? And especially in regulated industries, right? So if you're in finance, if you're in healthcare, if you're in cybersecurity, you don't get to make these mistakes more than once or twice before, like, meaningfully having to pay for them. And so that's where we spend a lot of time thinking about how do we make sure that when it's not confident...
It's not answering. That's the fundamental problem you need to solve. No one's going to be mad at you if you default to the old ways, right? So if you have a search engine and it always answers the question, but occasionally it says, hey, I'm sorry, here are the search results, I don't think I can answer the question, that's amazing. But every time it gives a made-up answer, you start trusting it a little less. Yeah, yeah. And it's really easy to lose that trust. That's the other thing is that
It's hard to gain it and easy to lose. But the idea that you're bringing up too, I have heard from Igor who works at Wise and he was talking about how LLMs should just be viewed as another node in a DAG that can do
a bit more fancy stuff. You don't need to think of it as this whole revolutionary new product or new thing. Let's just think about it in a more unsophisticated way and say, all right, cool. It can take unstructured data and make it structured. That's what an LLM is great for.
Yeah, so I see it as two ways, right? I see it, one, as wet clay. So it can sort of fit into any position before you mold that node.
Anything can be an LLM. That's the best part of it being so flexible. You can treat it as a classifier today. You can treat it as a regressor tomorrow. You can treat it as a generator the third day. So as you're coming up with your MVP system, it's amazing. It's a prototype solution at that stage. And then when you harden that clay, it's really good at one thing, and that thing is fluency.
So it's not good at decision making. And if you use it for decision making, at some point in your journey, you will be like, oops, maybe we shouldn't have done that. But it's really good at fluency. So back when I was in grad school, I used to build, I was doing some work around persuasive language modeling. So how do you generate
language that could potentially persuade people. And our biggest blocker there was fluency. So given a user query, we could come up with the best facts to try and convince you otherwise. But we couldn't present those facts in a fluent way.
And now LLMs can do that. So this is four or five years ago, right? And that's kind of where we were stuck with those systems. So if you start thinking about LLMs as another node in this like AI systems world or ML systems world, you start seeing a whole new set of problems that can be solved.
It was the same with sort of the marketing use case, right? Targeting has always been good. Retrieval has been decent. The only thing we didn't have is fluency. So if I could target you with the right material, how do I actually give you that final piece of text that you're going to read, or video that you're going to watch? That is now unlocked.
That's, I think, the hard clay version of the LLM. So there is a purpose for the LLM when you're demoing or MVPing. And then there's a separate purpose for the LLM when you're like in production. What are some ways that you have seen folks actually having success with AI? And going back to what you were saying before, having good AI, I think, is the key here and specifically good for them, too.
Yeah. So I'm a big fan of the CodeGen universe. And specifically for dev tools or MVPs.
I've seen it actually, I'm very, very excited about the no-code, low-code DevTools universe. Because basically what LLMs have or CodeGen has made possible is having complete flexibility over your components.
without losing any control. So without losing, with still having a lot of complexity abstraction, right? And I'll rephrase that. Essentially, the trade-off always used to be flexibility for complexity. And that trade-off is now gone. That's what's really exciting to me. You can have all your code
But you can also have it generated in a few minutes. These systems, I still have yet to see them work well with larger code bases. So I think code context retrieval is a really big problem that pretty much everybody in this world is sort of trying to solve. But where you have no baggage, where you have no code base, that's dev tools. That's exactly it. You build a bunch of these things every day or every week.
And now you can build them for a lot cheaper. So I think a dev tools team that now has five people can ship so much more than a dev tools team yesterday that had five people. So this one I'm very, very excited about. I think the other one is localization. That's been really hard for us to do, right? So I used to be an ML engineer at TikTok, and we spent a lot of time thinking about internationalization because we were in about 150 countries altogether.
I think that's one of those other low hanging fruits that I'm very excited about. Can you localize content? Right. And, and furthermore, hyper-personalization is obviously one of those big bets. It's always been a big bet in advertising and now you can go even further. Yeah.
To be fair, these things are both hard, right? There are a lot of problems in both worlds. With dev tools, you might want to think a lot about like constraining universes. So you don't want to use any database, but like if you use a third party dev tool system, like a code gen system, it could call any DB. And so there's some degree of like constraining that needs to happen for a good dev tool code gen company. Yeah.
And then with advertising, I think it's just like there's a huge talent shortage in engineering for any industry that does localization, which is usually sort of the content generation industries. I'm not sure what the localization use case is. Is that just serving the right ads to the right people? No. If I create an ad in the US, right, and I want to serve it in a manner that appeals to people in Portugal,
Right, so say it's about coffee. You probably have coffee with a donut if you're in the U.S., but you'll probably have coffee with a pastel de nata if you're in Lisbon, right? And so it's a tiny change, but your ad means so much more. Uh-huh. And it's the same for blog content. It's going to be in English in the U.S., in Portuguese in Portugal. And I imagine you would want
more Portuguese-looking people in the Portuguese one and American-looking people in the American one. Exactly. They can relate to it better. And I think that'll happen for news. I think it'll happen for ads. I think it'll happen for...
blog posts, right? Which case study are you serving to whom, and what do you emphasize where? There's a lot of these slices to that. We've always been trying to do this. It's just that fluency has been the blocker. Everything else in that pipeline... I know where you're calling me from. I probably know something about your business, right? Identity resolution is pretty good. I just don't have the fluency engine, and now I do. So that's what I'm excited about.
But the way that you can use AI to get that last mile, if we're talking about, oh, I want different looking people in the ad in Portuguese or in Portugal versus in the US, unless you're doing video generation,
then you're not really going to get that, right? Or not necessarily. You can make it a retrieval problem. So how you design your AI system is actually very interesting, right? For example, here, you could have a collection of models from all over the world.
And whichever country you're in, you just retrieve that model. And now it becomes an interpolation problem. It's not a generation problem. You're not generating a fake Portuguese human. You're taking real human models. When you say models, you mean human models or? Yeah, human models, like the influencers. So you get a Portuguese influencer, you get a picture of them, and you say, hey, please give us the right to put your picture in our ad, this is what the ad's for. And then all you have to do is inpaint.
Right. Which is a much different problem from generation. And actually, inpainting is a lot more reliable than pure generation. So now you also have a higher degree of determinism around your system. And so we actually think a lot about this, right? Like, the way you design your AI system can be very, very different from system to system.
And you're trading off for different things. So right now, what we did is we turned a generation problem into a retrieval plus infill problem. And now we have a different pipeline where we can optimize both of those systems. And the retrieval problem can be purely deterministic. You don't need to use AI-based retrieval. You literally just go to a database and say, who are the models from Portugal? And you can infill. So you've reduced your uncertainty drastically over like a one-minute conversation.
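A loose sketch of that retrieval-plus-infill pipeline, assuming a plain SQLite lookup of licensed model photos and an off-the-shelf Stable Diffusion inpainting checkpoint; the table, file names, and prompt are illustrative only:

```python
import sqlite3
from PIL import Image
from diffusers import StableDiffusionInpaintPipeline

def pick_model_photo(country: str, db_path: str = "models.db") -> str:
    # Deterministic retrieval: no AI involved, just a lookup of licensed human models by country.
    con = sqlite3.connect(db_path)
    row = con.execute(
        "SELECT photo_path FROM models WHERE country = ? LIMIT 1", (country,)
    ).fetchone()
    con.close()
    return row[0]

# Start from the retrieved local influencer's photo and inpaint only the product region,
# which is far more constrained (and more reliable) than generating the whole ad from scratch.
pipe = StableDiffusionInpaintPipeline.from_pretrained("runwayml/stable-diffusion-inpainting")
base = Image.open(pick_model_photo("PT")).convert("RGB").resize((512, 512))
mask = Image.open("product_region_mask.png").convert("RGB").resize((512, 512))  # white where the product goes

localized = pipe(
    prompt="a cup of coffee and a pastel de nata on a cafe table",
    image=base,
    mask_image=mask,
).images[0]
localized.save("ad_pt.png")
```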
Yeah, okay. And it's really thinking creatively about that and thinking about that in these different ways where if you've seen it enough, I'm sure you as a person who has worked with companies trying to do this or doing it yourself, you've seen patterns. And so you're able to go and say, well, yeah, I mean, you can do it like that, but
The way that I would do it is like this. And so there still is almost like these design patterns that sound like they're kind of stuck in your head and you need to get them on a blog post or something so I can go and scrape them and then have an AI generate a blog post that has my name on it, but it's really your blog post. Yeah.
Yeah, I think ML intuition is very tribal. This is actually a big problem that our industry has, right? Most of the people that have been building these systems for a while, we learned by failure. So I failed 50, 60, 100 times at building ML systems before I was like, oh, I'm starting to get a hang of this. I think...
companies at large need to be a little bit more patient with their ML teams or their AI teams, because they're also developing this intuition. It's a matter of time.
One of the things we're doing on our platform is we're trying to embed some of this intuition into the platform. So the things that I'm telling you, right? Hey, maybe we should design it this way, or maybe you should use this base model instead of this one for this reason. We're pushing some of that onto the platform, and the goal is to try and push as much of that as we can onto the platform.
So we don't actually think of Emissary as, you know, just a fine-tuning platform. We think of it as ML intuition delivered through infrastructure. How do we make these AI engineers who are struggling with this stuff, who are maybe one or two or ten fine-tunes or prompt optimizations in, start getting access
to this brain that might have done hundreds of thousands of fine-tunes or optimizations. And that brain is sort of the Emissary engine. So we are working towards lowering that barrier, because that is a big barrier right now. But yeah, in the meanwhile, we just hop on calls and say, hey, maybe this is how you should try it. Until you can...
take those opinions and bake them into the product. And I really look at it as, yeah, the pattern recognition that you've seen over the years. And I appreciate the fact that you are calling out the obvious, which is it's a system. And it's not just, okay, we've got this ML model or we've got this AI model. We need to really look at the system. It goes back to that old
D. Sculley paper from back in the day that everyone used to quote, where it was a lot of different boxes in the infrastructure and only one of those small boxes was the actual model. The rest were the different pieces around it. And granted, it's been updated since then, because you don't necessarily have the data pipelines anymore and you don't have that training pipeline. But you need to think about
The AI product and the AI system that is around it, as opposed to just, all right, cool. I'm hitting this API and then serving it up and, you know,
that does feel like what a lot of people are learning these days. 100%. And actually, they're learning it. The one thing I'm always very excited about is they're learning it a lot faster than we did, right? The ML industry, we took decades to learn some things that the AI industry is learning in months. And so it's very exciting to sort of see the pace at which things are developing. You know, even evals, we took so long as an industry to realize how important they were.
And I think it's taken maybe AI like a year and change from the first systems and at least in demo to get to a point where everyone's like, okay, let's stop and think about what a good outcome is here. And then we can sort of optimize towards that. So I think that's sort of a big step in the right direction. Yeah, and realizing that certain evals mean nothing, like the leaderboards. And you need to have your own evals that you keep close to your chest. Otherwise...
in a few model iterations, they're also going to mean nothing. Yeah. You've got to keep your evals, sort of make them more robust over time. And your system has to be learning from those evals. It means nothing if you're like, my model is doing X or is at quality X. If that's not feeding back into the model to make it X plus one, right? Like at that point, maybe don't do evals because you're just wasting time.
It's you have to close the loop from generation evaluation to improvement. Without that last piece of the loop, your system is static, right? And that's sort of the worst thing you can have in a perishable universe is, you know, the baseline is going down and you're not doing anything to sort of keep improving it. Perishable goods, man. I love that idea. And the vision of as soon as you put something out, it's already stale.
And we knew that. It's so funny how, speaking to you, you're really bringing some of this stuff back. As soon as we would put out an ML model, we knew it was stale, and you would have to go and figure out the features, or you would have to retrain it as much as possible. And you would be looking at the model drift.
And when it would hit a certain level of drift, or if there was some kind of a big thing that would throw it out of whack, you would try and trigger a retraining and make that happen as fast as possible. And then you kind of have that champion/challenger as you deploy one new model to see, is it working as well as this other one? Let's see. And so now we're
We're taking that same idea, but we're saying, all right, well, what about these prompts? Can we have champion challenger prompts that will kind of work? Are they going to? Exactly. And I think once you start getting good eval out, right, it's very easy to start doing those experiments. So the step two here, after that step one of what good and bad looks like, is how can we start experimenting? Right. Can we do like a... For us, it was so normal before.
to launch a new model, do an offline eval, which was on your existing regression set, and then an online eval on 5% traffic. These were standard mechanics that we ran for every model that we trained because you knew that a lot of the ground truth was limited in offline eval. So you had to see what this new challenger prompt looks like online
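A minimal sketch of how that 5% online split for a challenger prompt might be wired up; the prompts and the traffic share are placeholders:

```python
import hashlib

CHAMPION_PROMPT = "You are a support assistant. Answer concisely using the provided context."
CHALLENGER_PROMPT = "You are a support assistant. Think step by step, then answer using only the provided context."

def assign_prompt(user_id: str, challenger_share: float = 0.05) -> tuple[str, str]:
    """Deterministically route ~5% of traffic to the challenger prompt for online eval."""
    bucket = int(hashlib.sha256(user_id.encode()).hexdigest(), 16) % 10_000
    if bucket < challenger_share * 10_000:
        return "challenger", CHALLENGER_PROMPT
    return "champion", CHAMPION_PROMPT

# Log the arm with every request so metrics can be compared per arm later.
arm, prompt = assign_prompt("user-1234")
print(arm)
```

Hashing the user ID keeps the assignment sticky per user, so the online comparison isn't muddied by people bouncing between arms mid-session.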
And you start to get comfortable with this idea that machine learning is iterative, that you're almost always going to be experimenting. And I think that sort of scares folks a little bit because with software, you sort of put your head down and then release. And if you have to patch, it's a bug. It's a bad thing, right? In ML, you're always patching. Like, that is all you do. You're just, like, sticking patch on top of patch. Yeah.
And it's a feature. It's not a bug. So I think that mindset is something that's undergoing a transformation there. And I'm very excited about it. So basically, what you've seen as the graduation from just hitting an API, calling some kind of open AI API, anthropic API, whatever, and calling it from any language that folks want.
and then embedding the prompt in that API call, that's great until you get to a certain maturity level, and then you have to do what? Then you have to start thinking about what are the other steps that you might want to do locally, right? So you might want to have an AI server that's sitting right next to you. You might want to embed locally. This is one of the simplest ones: you probably don't need to call an API endpoint to embed your query when the user sends it. You can just host it yourself, because the embedding models are super small.
So you can host your own embedding model on the same server, and now you've cut down the network latency by about a second, right? So if you're running this a million times a month, you're reducing quite a bit of time for your users. And to do all of this, though,
is actually pretty hard in any language other than Python. This is kind of a big challenge. Why is that? The ML community thrived in Python. And so all of the support that exists is in Python. That's not to say you can't load a model in another language, right? That would be sort of superfluous to me to say. It's that the amount of time it would take you to do that and do that right is usually just not going to be worth it. We have like a decade's worth of
you know, ML support in Python, whether that's your PyTorch, whether that's TensorFlow, whether that's now Transformers. There's so many layers of abstraction that make your life a lot easier. And so we're seeing this sort of universe and it could be a good universe or a challenging one. We don't know in a few months where folks are building AI systems into their existing code bases.
That is the fastest way to go. But that usually means that once you hit the ceiling that you can with foundational models, you have to rip everything out, create a separate Pythonic universe, and then start loading everything in there. It's almost starting over from scratch if you aren't doing it in Python. Exactly. And so...
Now, we could get lucky and there could be support for ML in other languages by the time you need to do that. But this is why the gentle recommendation I tend to give is: I know it's going to be a big lift to now start working in a different language, but if you're taking this AI universe very seriously, then you should probably start thinking about an AI backend, right? What does it mean for you to have a backend service dedicated to your AI systems?
then you can hand off eval to that system. You can hand off API calling, model management, gateway management. It just like the separation of concerns is so clean when you start doing that. But there is an upfront investment of, hey, are we really going to take on a whole new language? So I think there is some back and forth there to be had. Yeah, because that was one of the key things is that
people aren't using Python, or not as much as you would suspect, especially if they're not used to living and breathing in the AI/ML world. 100%. I think people are using Python a lot less than folks would believe, because there is no immediate reason, right? When you start by hitting an API, you can hit an API in any language. So maybe you're not using, you know... OpenAI's and Anthropic's SDKs are very limited, I think there's three or four languages, but
you can just call an API endpoint. Especially if you're used to coding in JavaScript, why would you switch over to Python? Exactly. Until you hit that limit. And then, like you're saying, you're seeing folks... you try and tell them in the beginning, when they're starting off their journey, and then they say, yeah, yeah, yeah, whatever. You don't know what you're talking about, kid. I think there's an interesting enterprise decision there, right? So,
there is a big lift to move to Python. So maybe it is optimal to MVP in your existing language as long as you know that if the MVP works, you're going to have to move, right? So maybe you say, hey, we don't want to move to another language before we see some AI value. But if you're convinced that your business is going to run on AI or be AI native in the near future, then I always recommend, you know, go all in.
If you're not convinced, then yeah, it might actually be worthwhile to stay. But think through that journey. Like make sure that you're cognizant of the decision you're taking is sort of my focus there versus sort of pushing anyone in one direction over the other. Well, and what was it you were saying about, okay, then you need to eventually bring your model in house. And I was saying, well, it's probably through
the Microsoft OpenAI API, or the Anthropic AWS version. But what was it? The model isn't necessarily the hard part here. Yeah, it's everything around it. So that's why we were talking about the querying stuff, right? Like, how do you generate a query embedding? Or if you're trying to rewrite your query, so query reformulation.
There's no reason for you to call a foundational model to do query reformulation. You could just do it locally with a tiny model. So you don't need to use Anthropic through AWS or Azure OpenAI. You just host that next to you, and it's like a dollar an hour if you're using an A10. And it's faster, it's cheaper. There's no reason that you wouldn't do that.
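A small sketch of hosting those pieces next to your app in Python, using an off-the-shelf sentence-transformers embedding model and a tiny seq2seq model for query reformulation; the specific model choices are just examples:

```python
from sentence_transformers import SentenceTransformer
from transformers import pipeline

# Both models are small enough to host next to your application server,
# so there is no network round-trip to a hosted API for these steps.
embedder = SentenceTransformer("all-MiniLM-L6-v2")  # small embedding model, ~80MB
rewriter = pipeline("text2text-generation", model="google/flan-t5-small")  # tiny model for query reformulation

def embed_query(query: str):
    # Local query embedding, no API call.
    return embedder.encode(query, normalize_embeddings=True)

def reformulate(query: str) -> str:
    # Local query rewrite before retrieval, no API call.
    out = rewriter(f"Rewrite this as a clear search query: {query}", max_new_tokens=32)
    return out[0]["generated_text"]

print(reformulate("how do i, like, make the thing retrain itself every month"))
print(embed_query("monthly retraining schedule").shape)
```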
But if you're not living in a Pythonic universe, you basically cannot do that, right? Because you can't do that in Ruby, you can't do that in JavaScript. Well, you could do that in any language: you could do this in Ruby, you could do this in TypeScript. It just would be really hard, right? Like, you'd have to rewrite a lot of libraries from scratch. That's kind of the trade-off there: if you're working in, you know, Java, for example, you
you basically will need to rewrite a lot of libraries. I actually went through this personally where I was doing this for Go, right? And it became such a big blocker that eventually I was like, you know what, we'll just call the API endpoint. Like, I really don't think I should rewrite transformers in Go. And it's an interesting challenge to wrangle with.