
Tricks to Fine Tuning // Prithviraj Ammanabrolu // #318

2025/5/26

MLOps.community

Topics
Raj: TAO is a way to fine-tune a model for a specific domain without labels. Using reinforcement learning and synthetic data, it lets the model evaluate and improve itself: the model generates responses, a reward model scores them, and reinforcement learning then adjusts the model's weights so it becomes more likely to produce high-scoring outputs. This avoids depending on large amounts of human-annotated data and lets the model learn from its own mistakes, so it adapts to specific tasks more effectively.

Raj: The key to Test-time Adaptive Optimization (TAO) is that it spends extra inference-time compute during training, so inference latency stays the same once the model is actually deployed. After a customer provides task prompts, the system runs this reasoning process at training time and tunes thresholds dynamically. Generating diverse responses is also critical, to avoid redundant compute and repeated information. Training and redeploying iteratively, while continually bringing in new signal, avoids over-optimizing against the reward model and keeps the method effective.

Demetrios: As I understand it, TAO's value is that it removes the dependence on labeled data and can produce a better model from the prompts a user provides. It uses a reward model to evaluate and improve the responses the model generates, which makes fine-tuning efficient. And because the extra inference-time compute is spent at training time, TAO keeps inference latency the same at deployment, which makes for a better user experience.

Chapters
TAO is a method for fine-tuning models without labeled data, using reinforcement learning and synthetic data. It addresses the challenge of high annotation costs and allows for customized models without labels.
  • TAO fine-tunes models without labeled data
  • Uses reinforcement learning and synthetic data
  • Addresses high annotation costs
  • Enables customized models for specific domains

Transcript


I'm Prithviraj, or I also go by Raj. I'm an assistant professor at the University of California, San Diego, and also a research scientist at Databricks. I was previously with Mosaic before we got acquired. So I actually have two jobs; I do two full-time jobs, effectively. And I usually take my coffee black, but with sugar.

So usually no cream, but some amount of sugar in it. Welcome back to the MLOps Community Podcast. I'm your host, Demetrios. And today I'm talking with Raj all about fine-tuning, but not just any kind of fine-tuning. We're talking about TAO fine-tuning, the method that he came up with and wrote an incredible blog post about. Why don't we just start this conversation? And if you are listening in,

let me just rock and roll with one of my favorite songs. This is an old one from Nick Mulvey, an unplugged version called Unconditional. I highly recommend it. It's from the Wake Up Now Unplugged sessions. Go check it out. Unconditional Venus by my side Unconditional

♪ Know where you can hide ♪ ♪ It's something to know ♪ ♪ No matter where you are ♪ ♪ The Venus light is shining in your star ♪ ♪ And it's back in my heart ♪ ♪ Saying that you do ♪ ♪ It's a full loving inside ♪ ♪ It's nothing to know ♪ ♪ But knowing that you are ♪ ♪ And the Venus light is shining in your star ♪ ♪ It was right here ♪

Do you call it the DAO? Do you call it DAO? Or do you call it TAO? Yeah, we're just going to call it TAO. It's gone through multiple naming changes internally.

TAO is actually the final name, the one that ended up sticking, and we came up with it maybe a day before the blog post went live. So internally it's actually keyed to something else in my head, because for the majority of the time we were thinking about it under a different name. But we decided that this was a better name overall.

Because whenever I see spelling like that, I always think of the Tao Te Ching, you know? Yeah, yeah. We were just thinking Tao. Yeah. Tao is perfect. So what is it? So TAO is basically, at a high level, a way for people to fine-tune their own models for their own domains. Okay.

Right. So there's always been this sort of tension between, well, do you have one model to rule them all, or do you have a separate model for every single individual domain? And given that a lot of people's data is private, we seem to be gravitating towards a world in which these aren't necessarily entirely orthogonal.

But we're very much in a world where a lot of people do need customized models for their data and their use cases. And so that's what it is. It's a way of doing that. And most importantly, it's a way of doing that without people having labels.

Right. The bane of every machine learning person trying to build a custom model is: oh no, where am I going to get the labels for this data? Expensive. Yeah, these are expensive. The annotation costs are going to be super expensive. And we've heard this time and time again from so many customers.

That's actually where some of the initial motivation came from. So we were thinking, okay, the initial version was: what would it take to have some kind of system where people are deploying their models, collecting feedback in some form with these models, and then using that to continuously improve them through time?

But not everyone has access to that sort of feedback in real time. And a lot of people just straight up have an idea of the tasks they want to do, right? They have some prompts and whatnot, but they don't have a particularly large dataset for how to do it. And so the basic idea there was: okay, you don't have a large dataset. You also don't have a labeled dataset, or you don't have any labels at all, because it takes a lot of time and effort to create those labels. So how can we still get something of value without any of that?

Yeah. And it's a little bit counterintuitive from all the classical machine learning, supervised learning perspectives, right? We're taught to always think really hard about your data; you need good data. And if you think about how a lot of the big companies are making their models, too, they do have a ton of human-annotated data. So that was the initial key challenge that we spent a lot of time brainstorming through: okay, well, if we want people to use something like this, this RLxF-as-a-service (I'll talk about the RL bits later), this continuously improving, domain-specific, specialized-models thing, then we need to make the barrier to entry as small as possible.

Yeah, it's funny. Anybody that has taken the Andrew Ng course is like: you can't do that! No, it's all about the data. What are you talking about, no labeled data? Yeah, well, it turns out that it kind of still is about the data. It's just that we abstract a lot of it away for you. Tell me more about that. And the data is synthetically generated

under the hood, so to speak. And it's synthetically generated from the model itself, right? So here's the way I like to think about it. The way we do this is very reinforcement learning based, and the way I usually like to teach reinforcement learning is: think of it as personalized supervised learning for the model.

If you do normal supervised learning, you just have a ton of labels that a human has annotated, and you're trying to learn how to mimic the human's labels. Whereas in the reinforcement learning case, the model is itself generating data. The model is making these generations. So you have a prompt, the model generates one or N responses for that particular prompt, and then it gets feedback: okay, which of these things was good, which of these was not? And then it tries to learn from that. So it's learning from its own mistakes here.

Whereas if you were training it entirely on human data, it would be learning from a human's mistakes. And the key aspect of this is that the mistakes models make when they're reasoning about something are different from how humans make them. So it actually turns out that in a lot of cases, the human data is not necessarily gold truth.

And that's probably one of the key reasons this actually works: being able to have these models learn from their own mistakes in a way that's specialized to the models. Okay. So break down exactly how this is working, because I think we jumped into it just saying TAO, but

TAO stands for something, right? And it's not the land of 10,000 things. It is Test-time Adaptive Optimization. Yeah. So at a higher level, it's very much the process I was just explaining. People will come and they have a bunch of prompts, right? That's the barrier to entry now. A customer comes and they say: hey, we have these tasks, these are the types of prompts that we want to give this model. And then what we do is we take those prompts. Internally, we have these base models, these to-some-extent pre-trained LLMs or whatnot, and we have them generate a bunch of responses.

There are various techniques for how to do this, but the general gist they all boil down to is: you are trying to generate different responses. And then we have another model, which is called a reward model.

And this is pretty important. This reward model will take the responses that your generating model (which I'm going to call a policy from here on out, to use more RL language) has outputted, and then score them. It's going to be like,

this is plus 1, plus 0.5, minus 1, minus 0.5, right? And then, via reinforcement learning, we can train this policy: okay, some of these responses were really good, and because they were really good, we're going to change the weights of the model to upweight them, to make it more likely to produce outputs like the ones that got high scores, and less likely to produce outputs like the ones that got low scores. And then that process repeats iteratively.
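
Putting that loop into code can help; here is a minimal sketch of what has been described so far, with hypothetical interfaces (policy.generate, policy.update, reward_model.score) standing in for whatever actually runs under the hood at Databricks: sample several responses per prompt, score them with the reward model, and nudge the policy toward the responses that beat the batch average.

```python
"""Sketch of the loop described above; interfaces are hypothetical stand-ins."""
from dataclasses import dataclass
from typing import Protocol, Sequence


class Policy(Protocol):
    def generate(self, prompt: str, n: int) -> list[str]: ...
    def update(self, prompt: str, response: str, advantage: float) -> None: ...


class RewardModel(Protocol):
    def score(self, prompt: str, response: str) -> float: ...


@dataclass
class Scored:
    response: str
    reward: float


def tao_style_round(policy: Policy, reward_model: RewardModel,
                    prompts: Sequence[str], n_samples: int = 8) -> None:
    """One round: sample several responses per prompt, score them, and
    upweight the ones that beat the batch average."""
    for prompt in prompts:
        # Extra inference-time compute is spent here, during training.
        candidates = policy.generate(prompt, n=n_samples)
        scored = [Scored(r, reward_model.score(prompt, r)) for r in candidates]

        # Centre rewards so above-average responses get positive advantages.
        baseline = sum(s.reward for s in scored) / len(scored)
        for s in scored:
            # REINFORCE-style update: make high-scoring responses more
            # likely and low-scoring ones less likely.
            policy.update(prompt, s.response, advantage=s.reward - baseline)
```

Note that the multi-sample generation only happens inside this training loop; at deployment the tuned policy answers each prompt once, which is the point made later about inference latency staying the same.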

Up to this point, what I've described is mostly standard reinforcement learning. The key things we've had to do in this particular case are, well, one: if you think about it, this reward model is kind of providing labels, right?

It's doing a lot of the job of ranking these inputs and outputs of the policy; it's giving some kind of scores. And so it's really important to have a reward model that is trained really well across a wide range of tasks.

That's one of the things that we talk about in the blog post. We have this reward model called DBRM, a very enterprise-focused reward model that's able to judge tasks across a wide variety of possibilities. The reason this is simpler is that it is easier to train models to judge whether something is correct than to actually come up with the correct answer yourself. This is the verification problem, and the verification problem is easier than the generation problem, and it's

usually cheaper, right? And so we spent a lot of effort gathering the data to make as wide-ranging and generic a reward model as we possibly could. And then for the vast majority of use cases, at least everything that we've shown in the blog post, we use the same reward model.

Across text-to-SQL, across FinanceBench, which is the finance-domain question-answering kind of thing, it's actually the same reward model. It turns out that the same reward model is able to judge outputs for all of those tasks. Was the reward model a fine-tuned model, like some Llama base or something that was fine-tuned, or was it a completely separate, smaller model?

It was another model that was a base. So think Llama-style, or open-source-style: Llama, DBRX, whatever, a base model that we did fine-tune on top of. The conventional wisdom for training reward models is that you usually start with a pre-trained and somewhat instruction-tuned model, and then you add an additional head on top. So instead of giving you logits, a probability distribution across your entire vocabulary to figure out what token to predict next, it's giving you a single scalar value now.
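
A rough sketch of that architecture change, using a Hugging Face-style stack as an assumed (not confirmed) implementation: a pre-trained LM backbone whose vocabulary head is replaced by a single-scalar reward head. The base model name is a placeholder.

```python
import torch
import torch.nn as nn
from transformers import AutoModel  # assumed stack, for illustration only


class ScalarRewardModel(nn.Module):
    """Pre-trained LM backbone plus a one-scalar head.

    Instead of projecting the final hidden state onto the vocabulary
    (next-token logits), we project it onto a single number: the reward."""

    def __init__(self, base_name: str = "some-open-base-model"):  # placeholder
        super().__init__()
        self.backbone = AutoModel.from_pretrained(base_name)
        self.reward_head = nn.Linear(self.backbone.config.hidden_size, 1)

    def forward(self, input_ids: torch.Tensor,
                attention_mask: torch.Tensor) -> torch.Tensor:
        hidden = self.backbone(input_ids=input_ids,
                               attention_mask=attention_mask).last_hidden_state
        # Pool at the last non-padding token of each (prompt, response) pair.
        last = attention_mask.sum(dim=1) - 1
        pooled = hidden[torch.arange(hidden.size(0)), last]
        return self.reward_head(pooled).squeeze(-1)  # one scalar per sequence
```

How the head itself gets trained (typically with ranking or preference data) isn't covered in the episode, so it is left out of the sketch.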

Yeah. Okay. Now, the other piece that I wasn't super clear on here: you're saying the reward model is basically giving you, if you squint, some annotated data? Kind of. It's very sparse annotated data, right? It's feedback data, effectively.

The reason this is possible: let's look at it in terms of information density. If you have pure imitation-learning data, supervised fine-tuning sort of data, you have to have a human write down exactly how to do a task, exactly all of the outputs for how to do a particular task given the input prompt. That's very information dense. Whereas the information this model is getting is: you generate the entire sequence, and at the end of the sequence, or at predefined breakpoints in between, we just give you a scalar value.
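
A back-of-the-envelope way to see the "fewer bits" point: a fully written-out label pins down one of V tokens at every position, while the feedback here is a handful of coarse scores. All the numbers below are illustrative choices, not figures from the episode.

```python
import math

vocab_size = 128_000       # e.g. a modern tokenizer's vocabulary (illustrative)
response_tokens = 500      # length of one fully written-out reference answer
breakpoints = 4            # scalar scores at a few predefined breakpoints
score_levels = 5           # e.g. scores drawn from {-1, -0.5, 0, 0.5, 1}

# Supervised label: every position pins down one choice out of the vocabulary.
sft_bits = response_tokens * math.log2(vocab_size)

# Reward feedback: a few coarse scalars over the whole sequence.
reward_bits = breakpoints * math.log2(score_levels)

print(f"full written-out label: ~{sft_bits:,.0f} bits")   # roughly 8,500 bits
print(f"scalar reward feedback: ~{reward_bits:.1f} bits")  # roughly 9 bits
```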

So it's a lot fewer bits of information that it needs to be able to predict, and that problem is easier. So now where does the test-time adaptation part come in? I feel like there's another part to this story. So the other part of the story is: okay, you have the reward model that's scoring stuff, but what is it actually scoring? What it's actually scoring is what's on the policy side. The policy is generating responses. You can generate one response, score that one response, and learn from that. Or you can generate two responses and score those.

And how you do it is also important. You can condition the second response that you generate on the first one: okay, generate something different from what I generated the first time around, or whatever. But if you think about it, that is test-time compute being used. The more responses you're generating, the more test-time compute, so to speak, is actually being used here. Now, the terminology can get a little bit confusing. The reason we call this test-time compute is that it's just the model,

during training, using additional inference-time compute. But once this whole training process is done and the model is actually deployed, it doesn't do multiple-response generation or anything like that. And so the inference cost for the actual user is the same; we are eating up a lot of the inference cost during the training process. I see. The additional adaptive test-time bit is eaten up ahead of time, so that when the model is actually deployed, you can expect the same inference latencies and whatnot.

Yeah, so it's not like a deep research or an R1-style reasoning model where you're going to give it a prompt and then come back 15 minutes later. It is doing that test-time compute at the training level. Yeah. So the process would go: a customer comes, they have a bunch of prompts that they give us, and then we do this kind of reasoning process at training time. So we would just drop it in and then come back a couple of days later,

or with whatever amount of test-time compute we want to use for this particular task. These responses all get scored in various ways. And it turns out how you generate the responses also matters. There are very popular techniques like best-of-N, where you just generate N different responses, all IID, independently sampled from each other. It turns out that's probably not the greatest idea; you're actually burning a lot of redundant compute. It works, but in terms of trying to be at least somewhat compute-efficient, it gets a little bit redundant. If your underlying model doesn't have very much diversity in the range of things it can say, then when you generate N responses, it'll turn out that some K-sized subset of those N responses will be very similar to each other. You've just burnt a bunch of compute, the scores for those will all be the same, and there's no additional learning signal to be had in just trying the same thing over and over again. So you have to be a little bit smart about how you actually do this kind of sampling. And that's one of the things we developed internally: being smart about this.

Because it's not the amount of... or I guess to put that differently, you quickly realize that if you're scoring the same thing, basically the same words,

or the same idea in different words, it's getting the same score, and that's not adding any richness to that base model. Yeah, you're not extracting any new information. From an information perspective, you're kind of dead in the water. So you have to be careful about how these responses are generated in the first place, and how you're doing this exploration, from the RL perspective, such that the feedback given by the reward model is actually useful.

I was figuring you were just going to turn the temperature up to 11 and then see what happens. We did try that. That was actually probably one of the first things we tried: okay, let's just turn the temperature up and see. But it turns out that doesn't work particularly well. It kind of depends on the use case. If you turn the temperature up really high and you're doing creative writing, say, or you want Character.AI-style chatbots, it could work.

If you're trying to do enterprise things, you probably don't really want that level of super-high-temperature stochasticity. So you have to do things like conditioning on previous responses. You have to have some definition of, given one response and another response, how close or far apart the two are from each other, and then make sure that the next responses you generate are sufficiently different from all the previous responses you've generated.
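
One hedged reading of "being smart about sampling", as opposed to best-of-N with IID draws: condition each new draw on what has already been kept, and discard anything too similar, so the reward model isn't scoring near-duplicates. The similarity measure and threshold below are placeholders, not whatever is actually used internally.

```python
from difflib import SequenceMatcher
from typing import Protocol


class Sampler(Protocol):
    def sample(self, prompt: str, avoid: list[str]) -> str:
        """Generate one response, conditioned on responses to avoid repeating."""
        ...


def similarity(a: str, b: str) -> float:
    # Cheap stand-in for a real semantic-similarity measure (e.g. embeddings).
    return SequenceMatcher(None, a, b).ratio()


def diverse_responses(sampler: Sampler, prompt: str, n: int,
                      max_similarity: float = 0.8, max_tries: int = 50) -> list[str]:
    """Collect up to n pairwise-dissimilar responses for one prompt.

    Anything too close to an already-kept response is dropped instead of
    being scored redundantly by the reward model."""
    kept: list[str] = []
    tries = 0
    while len(kept) < n and tries < max_tries:
        tries += 1
        candidate = sampler.sample(prompt, avoid=kept)
        if all(similarity(candidate, prev) < max_similarity for prev in kept):
            kept.append(candidate)
    return kept
```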

Have you tried to do this over and over, and do you see loss when you've done it two or three times? Because I can imagine a world where you say: all right, cool, let's just set up a retraining pipeline that kicks off every week, or every couple of days, or every month, whatever it may be. But then you start to see almost a kind of compression loss. Yeah, yeah, that's a great question. So, like the answer to most things in ML land: it depends.

I'll tell you cases where it works and some cases where it's not so great. It turns out that one of the things we did in TAO is we just ran this process iteratively, more and more. And we were like: oh damn, the number is just going straight up. It just keeps going. Yeah, this is the best graph ever, it's just going up and to the right, it just keeps working. And this was a year, a year and a half ago at this point. We were like, this is great, our reward model works, all of this works. But it turns out there are two things that could go wrong here. One is this obvious issue, or in hindsight obvious issue, of reward hacking, where it turns out that your reward model is not entirely 100% accurate.

There's some amount of noise in it. And if you spend too much compute just looping around, at some point you've extracted all of the useful signal from the reward model, and now you're just learning the remaining error noise.

Oh, interesting. And this is, you know, I don't know what overfitting really means anymore, but for a classically trained ML person, this would be the way to think of it: an RL-ish version, possibly, of overfitting, where you've gotten really good at optimizing the reward for a particular model. But because that reward is itself just a proxy for what a customer or someone might actually want at the end, it doesn't really do so well downstream.

It's almost like when you're squeezing a lime: in the beginning you get all that juice, and then later you're trying to squeeze, squeeze, squeeze, and it's so much effort just for one drop. Yeah, and it starts getting bitter too, if you keep squeezing at the end.

It's somewhat well known, but it's really interesting to see it in practice. And it's very much somewhat of an art to figure out what that tolerance threshold is. What is that cutoff? And how do you get that cutoff in a generic way, such that multiple different customers can use a singular reward model?
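
A crude illustration of that cutoff problem: keep running optimization rounds while the reward-model score on a held-out prompt set is still climbing meaningfully, and stop once gains stay tiny for a few rounds, on the theory that what remains is mostly reward-model noise. The stopping rule here is a guess at the general shape, not the actual heuristic used in TAO.

```python
from typing import Callable


def train_until_plateau(train_one_round: Callable[[], None],
                        heldout_reward: Callable[[], float],
                        max_rounds: int = 50,
                        min_gain: float = 0.002,
                        patience: int = 3) -> list[float]:
    """Run RL rounds until the held-out reward-model score plateaus.

    Once per-round gains stay below `min_gain` for `patience` rounds in a
    row, we assume further "improvement" is mostly fitting the reward
    model's noise (reward hacking) and stop, keeping the score history."""
    history = [heldout_reward()]
    stalled = 0
    for _ in range(max_rounds):
        train_one_round()
        history.append(heldout_reward())
        gain = history[-1] - history[-2]
        stalled = stalled + 1 if gain < min_gain else 0
        if stalled >= patience:
            break
    return history
```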

Because, well, what would you do then? Do you just swap out the base model and start over from zero again? Or how do you...

Well, I mean, you have checkpoints from all through the training process, right? So you could roll back to a previous checkpoint. Ideally, the way this would work, the ideal loop, would be: you have some kind of generating model that burns a bunch of inference compute, and simultaneously you have this reward model being trained. At some point you cut it off and deploy the model, and then you collect new feedback so that you can actually update your reward model. And now that there's more signal in the reward model, you can train against it again.

That would be the ideal loop: this cycle of redeployment. And this is what you were talking about, right? This redeployment and retraining across deployments only really works if you are able to get new signal in between each one.

Right. So if it's the same questions, and you've already optimized up to a certain extent, you're not going to get anything more after the initial training run, because we're already doing a lot of that optimization for you, and we're figuring out dynamically what those thresholds and so on are.

So, ideally, somebody comes to us and says: okay, we have these prompts, this is cool. We then use this TAO method overall, figure out all these sorts of thresholds, optimize against this reward model that we have dynamically, and then the model goes back to them, right? Yeah.

The reason you would want to retrain it further would be: actually, we have these new tasks now that we want the model to be able to do, here's a bunch of new prompts. And we are reasonably confident that these new prompts are actually doing something different from the old prompts; to use your own words, they're not the same thing in different words.

Because we kind of account for that too, under the hood: taking your prompt dataset and making sure we're covering all our bases of all the ways people could possibly say or ask for that same set of prompts.
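
A hedged sketch of what "covering all the ways people could ask the same thing" might look like: expand a small seed prompt set with paraphrases from some generation call you already have, then drop near-duplicate phrasings. The `paraphrase` callable is a placeholder, and nothing here is the actual TAO pipeline.

```python
from difflib import SequenceMatcher
from typing import Callable


def expand_prompts(seed_prompts: list[str],
                   paraphrase: Callable[[str, int], list[str]],
                   per_prompt: int = 20,
                   max_similarity: float = 0.9) -> list[str]:
    """Grow a small seed set (say 10 to 50 prompts) into a bigger, more
    diverse one by paraphrasing each seed and keeping only new phrasings."""
    expanded: list[str] = []
    for seed in seed_prompts:
        for candidate in [seed, *paraphrase(seed, per_prompt)]:
            is_new = all(
                SequenceMatcher(None, candidate, kept).ratio() < max_similarity
                for kept in expanded)
            if is_new:
                expanded.append(candidate)
    return expanded
```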

Yeah. And then, if you have a bunch of new tasks that you're reasonably certain are different from the capabilities the model has already been trained for, then you come back to us and say: okay, we would like to optimize again, please. And then you do it. That would be the reason to retrain the model.

So what are some of the results that you saw when you did this? Compared against different ways of fine-tuning, I guess, would be the main thing. Most of the time, and I'm sure you've seen this a lot, it's: try and optimize the prompts, try and prompt-tune it to the best of your ability, because you get such a fast feedback loop. And if you can't do that, then figure out if you need to go and fine-tune, and then what kind of fine-tuning, with how many GPUs and all that, what data, what labels, all of that fun stuff. There's so much of a headache that comes along with fine-tuning and not doing it right. And then later, after you fine-tune, you realize the model actually performs worse. What's going on? And so here I see the value prop really clearly. You say: look, the labeled data you don't need to worry about. Just give us your prompts, and we'll make a model that works better on those prompts.

Yeah, yeah. So again, the standard machine learning adage that there's no free lunch is very much true. These prompts can be from multiple different tasks, and we can make your model better at those multiple different tasks. But you're trading off performance somewhere else.

Right. So most of the customers who come to us will have a bunch of enterprise-focused prompts, but you're probably reducing the model's underlying creative writing ability, which they may or may not care about. Yeah. Which they probably don't really care about. Right. And if they do care about it, they can come back to us with creative writing prompts as part of it, and then we can optimize for that too. Give me my financial statements in the style of a Bob Dylan song. Yeah, exactly. That'd be cool, actually. Maybe I should figure out ways of teaching students that way.

But the other thing that just popped into my head as you were saying that is: do you need a certain amount of diverse prompts, or does it perform better if you give it a much bigger set of prompts? Because you were saying that, at the end of the day, all of the output gets scored and you reach that threshold. So in my mind I'm wondering: if the input is more vast, does that mean you can get more out on the other side? And I'm sure you looked at that.

Yes, we absolutely did look at that. The high level here is that it is absolutely true that the bigger your prompt dataset and the more diverse your prompt dataset, the better the performance is going to be. And for the longest time, actually, we only found that we were able to get good results, where good performance means we could train a relatively small-parameter-count model to match the performance of, say, GPT-4o on a particular task, if we had a really big, really diverse prompt set. And really big is like a thousand? Yeah, big would be in the thousands. Okay. Right. But...

That's a pretty big barrier to entry again, right? If we're asking for no labels, only prompts, but the asterisk is that you have to give us 10,000 prompts or whatever, that's a pretty big barrier to entry. And so we don't require that. We found a way around that, basically, such that the end customer can give us just a subsample of the kind of range of things they think the model will be used for. And then we'll take it from there: we'll figure out how to get that kind of diverse prompt set, or whatever, and learn from it.

Yeah. So it's like I come to you with my three prompts and you're like, don't worry. Hopefully you have more than three, but... you work with what you get.

Yeah. But, you know, if you come to me with only three prompts, that would make our life easy: I just need it to do well on these three types of prompts. Okay, sure. But we'd be okay with that kind of range, like 10 to 50 prompts, even. Right, that range of stuff. And then we can go from there and internally get to the stage where we see prompts that are a lot more along those lines, a much bigger dataset, a much more diverse dataset, and so on.

And so the thing that is also fascinating here is that you were able to get such high performance from small models. And I wonder, if you had no real constraints on how you were to do it, and I just told you: hey, smallest model, best performance, highest accuracy on X amount of tasks,

how would you go about it? And it can be anything from: all right, I'm going to distill this model, and then I'm going to fine-tune it, and then I'm going to... whatever you want. How would you look at that? Or is it not even worth the headache of all these extra hoops to jump through, and you just say: I'm going to do TAO and it's going to be good enough?

I actually think "I'm going to do TAO and it's probably going to be good enough," because I think we've done a pretty good job of figuring out what the optimal setup is, even given all of these constraints. These constraints are fairly realistic, but under the hood we were relatively unconstrained; we have no compute constraints and things like that under the hood. And so I think we've done a good job of being as unconstrained as we can on our side, while still accepting constraints and making sure the customers don't have a particularly high barrier to entry. So I actually wouldn't change too much.

Honestly, if I'm thinking about it, "if you were fully unconstrained, how would you do it?" is actually the question I asked myself, because that's what would be ideal. I guess the worst feeling would be: somebody comes to you and they're like, oh, okay, well,

we have this particular task, we want to train models on this task or whatever. And then we're like: okay, here's this model that kind of works, asterisk, but if we were more unconstrained, it would do better. That's a scenario I wanted to avoid. Yeah, I hear you.

So the other interesting piece here is: how small of a model can you get to do how great of things? Or, how good can you get these small models? Yeah, again, it's a no-free-lunch question, but it turns out that the smaller models, the sort of 8B range of stuff, you can get them to do pretty darn impressive things. And again, it really depends on the difficulty of the actual tasks people are trying to do. But what we found is that in a narrow range, if you take away its broad intelligence and specialize it for more narrow data intelligence, so to speak, the small models are actually perfectly fine. The relatively smaller models, right? There's a threshold at which they're good enough. Now, obviously, you might be like: oh, Raj, does that mean you don't believe in scaling? That's not true. I do believe in scaling. It's just that

scaling is an axis where the more complicated a task you have, the bigger the model you will need in order to achieve it. But not everyone needs that. Like, you see the new Llama 4, right? Behemoth is two trillion parameters. Only a fraction of those are active, but that's still going to be a massive pain to deploy for the vast majority of people. Yeah, people are not stoked about the size of Behemoth. Even Scout is relatively large, right? And the performance is...

Yeah. So people have these kinds of cost-quality trade-offs, and every single customer lies somewhere different on that Pareto frontier of cost and quality. And so when you're thinking about setting up any sort of product where you're trying to get people to make their own customized models, things like adaptive compute are great, because they basically allow them to control where on the cost-quality

frontier they are. It's like: spend more cost, burn more compute during training, get better quality at the end. Yeah. That's a great point. You may need to optimize for a certain place in this whole life cycle, and you're okay with spending a little bit more money on that training to fine-tune, because later it means you have that small model. And I imagine... have you played around with trying to compress these models, or distill them even further, to make them even smaller?

We haven't... Well, yes, internally we've played around a little bit with distilling down from the 8B size even further. And you can still do pretty reasonably well on certain tasks with those. But you start seeing pretty sharp drop-offs if you're doing a 1B or 3B sort of model, even for some of the benchmarks we were thinking of, or that we showed in the blog post, like the text-to-SQL one, BIRD-SQL and whatnot. The drop-offs were pretty high. But for some of the other benchmarks, it worked okay. Yeah.

Right. So again, it depends a little bit on how difficult the task is. But my impression is that the difference between 3B and 8B in terms of performance is disproportionately high compared to the cost difference. Oh, okay. Yeah.

Which is kind of why we presented a lot of results at the 8B scale. It was not a linear correlation; the performance dropped off a lot more than, say, three-eighths of the compute or whatever. Yeah. And the cost wasn't outrageously different. Yeah, the cost isn't outrageously different between a 3B and an 8B.

Right. Now, of course, there are all sorts of inference-time things you can do. You can do multi-tenant serving, all that kind of jazz. But we felt that an 8B was a reasonable trade-off: still fast and lightweight enough to deploy in a lot of scenarios, but still able to perform pretty well. And the beautiful thing about TAO is that it just works; it's size independent. We have results on 70B. We have results internally on 405B. And it just keeps working better and better. Did you notice any

drastic differences, or notable differences, maybe not drastic, between the models that you were using to fine-tune? Did some take better to this method than others?

Yes, kind of. It really boils down to the models being able to generate diverse responses. There are certain models out there that have kind of been RLHF'd to death.

Okay, so my primary field of research is RL. So what is RLHF really doing here, underneath? Pre-training lets you learn all these token distributions over the entire internet language distribution. Supervised fine-tuning makes that distribution smaller: okay, I'm going to bias you towards this range of instructions that I think the model is most likely to hear people say to it and interact with it. Now, with RLHF, there's been a lot of contemporary work out in the open which shows that it makes these probability distributions super spiky.

Instead of being able to generate a wide range of responses, you are now super constrained; you can only say, like, one thing. And this is why you get all these safety-type responses, like "I can't do that for you." Yeah. It's kind of been beaten out of the model. That's not great from the perspective of doing further optimization with RL, because, like I was saying, a lot depends on being able to extract good training signal from this really great reward model that we have. Which means that the more diverse the responses the underlying base model is able to generate, the better. So that's part of the trick: don't use models that have already been RLHF'd too hard. They won't take particularly well to this method, because they've already been RL-trained for a different set of tasks, and now they're no longer particularly great at this.
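
One way to sanity-check whether a candidate base model has been "RLHF'd to death" before committing to it, in the spirit of the internal research Raj mentions (the metric choice here is mine, not theirs): sample several completions for the same prompt and measure how much n-gram variety they have. Very low scores suggest a spiky distribution with little left to explore.

```python
from typing import Callable


def distinct_ngram_ratio(texts: list[str], n: int = 2) -> float:
    """Share of unique n-grams across a set of completions (0 to 1)."""
    ngrams = []
    for text in texts:
        tokens = text.split()
        ngrams.extend(tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1))
    return len(set(ngrams)) / max(len(ngrams), 1)


def diversity_check(sample: Callable[[str], str], prompt: str, k: int = 16) -> float:
    """Sample k completions (your own generation call, at a fixed moderate
    temperature) and report distinct-2; values near zero mean the model
    keeps saying essentially the same thing."""
    completions = [sample(prompt) for _ in range(k)]
    return distinct_ngram_ratio(completions, n=2)
```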

Do you remember way back in the day, two years ago, when the first Mistral model came out and it had pretty much no RLHF on it? I think that's why people loved it so much. It was like: whoa, we get to do it ourselves, that's amazing.

Yeah. And, you know, there are pros and cons to releasing a model with no RLHF, but it certainly became really popular in certain subsets of the open-source community that wanted to use it for all sorts of use cases that a model like Llama was kind of too RLHF'd to do well on.

But yeah, that's the thing to think about here. Well, I guess that's from a researchy kind of perspective. From a customer perspective, they don't really have to think about it, because we've already thought about it for them. Yeah, you've already tested a bunch of different models. Yeah, we already have a pretty good idea of the distributions: which models are spiky, which are not, and if they are spiky, how to get rid of some of the spikiness in the distribution so that it's more useful for generating a diverse range of things. We've kind of already taken care of this with internal research.

Okay, we've been dancing around this and really talking about how the sausage is made, but

what we didn't talk about is the actual results. So maybe you can give us a quick TL;DR: how is this better, and in what ways? I guess that's the big question going through my mind right now.

Yeah. So the one thing that's super obvious, the very obvious baseline, is just supervised fine-tuning, right? If you had supervised fine-tuning labels... it turns out that TAO does way better than supervised fine-tuning with labels, even though it doesn't have labels. That's pretty clear. We tested it on text-to-SQL, and then some other benchmarks that we developed, that other people on my team had developed previously, like the enterprise arena, and we have our own version of FinanceBench and whatnot, which test all these open- and closed-book question-answering scenarios in very enterprise-y ways. I would never go back to supervised fine-tuning. I think this is just way better, unless you have a ridiculously massive dataset of labels that's rich and whatnot, which effectively nobody does. And even then, TAO is not orthogonal to that; you could still run TAO on top of it. Oh, nice. Right, because if you have a labeled dataset with prompts and responses, guess what you also have? You have the prompts, which is all TAO needs.

So you can still run TAO together with a supervised fine-tuning step up front.

The other interesting thing, and I'm going back and looking at the exact numbers here, is that even the 8B models end up being comparable to GPT-4o for that narrow range of tasks, on some tasks. So the 8B models, if you have a task like FinanceBench or whatnot, will do as well as GPT-4o or o3-mini or whatever, which are at least one to two orders of magnitude bigger in parameter count.

For the tasks that are a little bit more complicated, like BIRD-SQL, which, yeah, is a more complicated task, you see that the performance of the 8B models is still slightly less than GPT-4o or o3-mini, but that gap narrows and basically vanishes when you get to 70B. Oh, nice. Yeah.

And so this is what I was talking about: it's where you are on the cost-quality trade-off. What would be good enough for you? If you're doing financial QA along the lines of FinanceBench, 8B is probably good enough for you. If you're doing something significantly more complicated, you'd probably need something a little bit larger in order to do it, and if you want to be at, like, o3-mini level, then you're going to need to spend a bit more.

The other interesting thing I will say is that in the comparison we did do, we didn't show results for the actual inference-time cost; we kind of just presented them all as equal. But o3-mini is really doing test-time inference. They're doing a lot of additional compute, whereas we've baked a lot of that additional compute in during training time itself. And so the real correct comparison, so to speak, would probably be against GPT-4o in terms of test-time latencies after the model is deployed. Yeah. So it's like: what's your threshold? How complicated is the task that you're doing? But the results are pretty promising, in the sense that with a 70B model, and even the 70B is still

smaller than GPT-4o and o3-mini and all of these things, you can basically get to near parity. And then you control your fate, so to speak. You have this exact model; nobody does anything under the hood that changes the model's behavior. The models don't swap out under the API hood. If you give it a particular prompt one day and the model behaves a certain way, you can build guarantees around that. Right, the model is going to continue behaving the same way, because, you know... deprecating a model after a six-month stint? No deprecating of the model. Yeah, like GPT-4.5 is being deprecated already, which is crazy, dude. Well, this is awesome, man.