I think even within in-context learning, some of these works have shown that sometimes you run into cases where the data that you provide to a model in context conflicts with the information it has been shown during pre-training, and then some unexpected things can happen. Part of intelligence is coming up with these abstractions that, regardless of the environment, allow you to adapt to the environment and be able to fulfill your fundamental desires. And those will depend on...
...any system that does retrieval. There still needs to be some kind of manifold, or some sketch of a future situation, which we could then lean into to...
To the degree that any intelligent system constantly learns from its environment, and learns what the right abstractions are to make good predictions in that environment, any intelligent machine has to do the same.
So the million dollar question is, how do you do retrieval, taking into account the interactions between the data?
This is actually quite straightforward. So what we do is we...
This episode is sponsored by CentML, which is a model serving platform for machine learning workloads. When you log into their platform, the first thing you see is the SaaS option. This is the really simple option, just like on OpenAI.
You can send requests up, you stick your access token in there, and you can access all of the latest open source models. And it's fast and cheap. On Llama 405B I was getting about forty-five tokens a second; they quote sixty-five or so.
But anyway, it's very fast, and faster than the competitors. They also have PaaS options, which means platform as a service. Here's an example: we're going to spin up an LLM inference service. These guys support Ollama and vLLM out of the box.
So, for example, when you've got the model up and running, you can use the OpenAI API, right? You can use the Python SDK for OpenAI and use that to talk to your model. I was just using the cheapest option and was getting very good performance on the new Llama 3.3 model.
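For reference, here is a minimal sketch of the pattern being described: pointing the OpenAI Python SDK at an OpenAI-compatible serving endpoint. The base URL, environment variable, and model identifier are placeholders, not the platform's actual values; check their documentation for the real ones.

```python
import os
from openai import OpenAI

# Placeholder endpoint, token variable and model name; substitute the values
# that the serving platform actually gives you.
client = OpenAI(
    base_url="https://example-serving-platform.com/v1",   # hypothetical endpoint
    api_key=os.environ["SERVING_PLATFORM_TOKEN"],          # hypothetical token variable
)

response = client.chat.completions.create(
    model="llama-3.3-70b-instruct",  # placeholder model identifier
    messages=[{"role": "user", "content": "Summarise the Pile benchmark in one sentence."}],
)
print(response.choices[0].message.content)
```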
By the way, when you sign up, you get ten free credits. Everyone gets ten free credits, which means you can do some inference without spending a penny. Also watch the interview I did with Gennady, the CEO, the other day. Enjoy. Introduce yourself, and why are people going to love this episode? What are they
going to love? So, I am Jonas Hübotter. I am a PhD student at ETH Zurich, in the Institute for Machine Learning, where I'm working with Andreas Krause on local learning and sequential decision making. Recently I have worked a lot on making these ideas scale to state-of-the-art models.
We work with large language models, and we're very excited to be able to outperform some very large models on the Pile benchmark, which is this huge benchmark that comprises multiple sources of language, from language that you would find on the internet, maybe Stack Exchange or Hacker News, to language that is used for math or coding. And we have been able to outperform the previous state of the art, which was a more than thirty times larger model than the model we use, by spending additional compute at test time when you are faced with a certain problem. So what we will go into in today's episode is really the idea of why it can be useful to spend this additional compute at test time.
And the key aspect to this, which is really fundamentally important, and without which this kind of learning at test time would not work, is how to automatically select the right data. So how can we automate this: how can the LLM say what data it needs to make a good prediction? And so essentially, how can we use the intuitions, the abstractions, learned by the LLM to decide how it should spend its compute at test
time. With the Pile, shout out to EleutherAI; I've got Belrose coming on to talk interpretability at ICML, and of course I know Connor and some of the folks there. But the Pile is interesting because you actually have the data, right? We don't have the data that OpenAI trained on.
But I guess there's a couple of things. First of all, you could in principle still do retrieval against the Pile even to augment inference that you do on OpenAI models. It shouldn't matter in principle that the distributions are completely different. But maybe you can talk about that. Does it matter that the distribution of the retrieval data is different?
I don't think it does, because we evaluated on a bunch of different models, and for some of them I know they didn't use the Pile as a training dataset. For example GPT-2: as you said, they didn't train on the Pile; I think the Pile was actually released after GPT-2 was trained.
So with GPT-2 it still works really well if we augment GPT-2 with information from the Pile. But at the same time, I think you can think of information in the Pile as very central to the data distribution that a lot of state-of-the-art LLMs are trained on, because these are very academic datasets. A lot of these datasets are, in a sense, good data: there's a lot of scientific publications, for example, and again coding and math, and these are things that today's LLMs are very much trained on. And what is quite remarkable to me is to see that this information, which should already be very well encoded in this huge parametric model, can still be useful if you show it a certain sliver, some very informative bit of it, at test time.
The other thing I wanted to quickly comment on is that I'm a big fan of machine teaching. That was invented at Microsoft Research, and it is essentially an interactive form of transductive active fine-tuning. It's mostly used for classical models.
But the idea is that you have an application, and if you allow a supervisor to deliberately choose the most informative training examples, just like with dataset distillation, it's remarkable how much the sample efficiency of your training set goes up when you actually pick the most informative examples. Something like that could be used, I guess, with your approach, because rather than using the Pile, why not just use Google search? You could create an interactive workflow where the supervisor guides the search and selects the most informative examples. Maybe you could still run your non-interactive version on top of that, and then you've got a really informative
transductive form of learning. I think that's a very powerful paradigm. You can still think about this in terms of a memory that you can access, but where a certain piece of memory might not be static; it might be something that actually involves some underlying computation, right? Where the outcome is maybe not clear a priori, but which has a certain description.
And we actually also, in our group, with Mark Batta, worked on a setting like this, where your information gain is essentially not precomputable, in the sense that the information is not static. Instead, you have to leverage your model's expectations of what might be the outcome of an experiment to say, okay, given that I expect this experiment to go a certain way, how informative would that be?
So an extension question: I'm interested in retrieval augmented generation. I interviewed Patrick Lewis recently about that, and I could imagine that this could improve retrieval augmented generation. But rather than doing an in-context documentation version of RAG, you could do a fine-tuned, transductive active fine-tuning version of it. Do you think that a version of that using your methodology could improve on RAG?
So I think RAG is really this big ecosystem of various different methods that people have worked on. Certainly one of the big problems with RAG, or some RAG systems, is rooted precisely in the fact that they use this nearest neighbor search over some dense embeddings, dense vector embeddings. Often that's not exclusively what they do, but often that's very core to what they do.
And as we have discussed, this can be very limiting, in the sense that you can select redundant information. So empirically, it may very well be that selecting the most informative data, relative to some useful metric of informativeness, can be better than just selecting the most similar data, which may be redundant.
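To make the distinction concrete, here is a minimal sketch, assuming unit-normalised embeddings as NumPy arrays, of a maximal-marginal-relevance style greedy selection that trades similarity to the query against redundancy with what has already been picked. This is one simple instantiation of the idea, not the selection rule from the paper discussed later in the conversation.

```python
import numpy as np

def mmr_select(query: np.ndarray, docs: np.ndarray, k: int, lam: float = 0.7) -> list[int]:
    """Greedily pick k documents, trading relevance to the query against
    redundancy with already-selected documents (rows are unit-normalised)."""
    relevance = docs @ query                                  # cosine similarity to the query
    selected: list[int] = []
    for _ in range(k):
        if not selected:
            scores = relevance.copy()
        else:
            redundancy = (docs @ docs[selected].T).max(axis=1)  # similarity to picked docs
            scores = lam * relevance - (1 - lam) * redundancy
        scores[selected] = -np.inf                            # never pick the same document twice
        selected.append(int(scores.argmax()))
    return selected
```

Plain nearest-neighbour retrieval is the special case lam = 1, which is exactly where near-duplicates can crowd out complementary information.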
So that's one thing I would say. The other thing I would say is that the question of whether you should put data in context or whether you should ingest data via gradient steps is still a very open research question. There have been a bunch of different works that have looked at the difference between the two, and it's not yet very clear, in general, when which approach is most effective. What is quite interesting to me is that both approaches seem to be quite different in when they work and how they work. For example, what was very interesting to us when we worked with this Pile dataset, which again consists of multiple sub-datasets relating to multiple things you could express in human language, is that with certain types of these datasets, fine-tuning was much more advantageous over in-context learning than with others. And those tend to be the math dataset that is part of the Pile, DeepMind Mathematics, which comprises school-level math questions like compute a derivative or solve this equation for x, but also coding, or scientific papers from arXiv, or FreeLaw, which is the dataset that comprises court opinions.
So just addressing both of those things. With RAG, improving the retrieval mechanism, so doing some of the stuff you were speaking about, not selecting redundant information, could improve RAG. And with RAG as well, sometimes just because of the architectural complexity, maybe you're doing retrieval from diverse data sources, and maybe there's some subsequent step to again remove the redundant information, or re-rank them, or do something like that.
I'm not sure whether existing systems do that. But then there's the question of in-context learning versus fine-tuning, and there must be some interesting trade-off there, because I've always thought that in-context learning is better just anecdotally; the precision seems better. If there's something in your context, it doesn't seem to hallucinate, whereas when it's inside the model, the information is lower resolution, because it gets entangled and expressed as part of the low-level representations in the model. That might confer some advantages, because the model can use more of its base knowledge to reason about the thing that you're doing, but disadvantages in the sense that it's more likely to hallucinate about the thing that you're talking about.
There has been a bunch of work also just related to trying to understand how in-context learning works. I think even within in-context learning, some of these works have shown that sometimes you run into cases where the data that you provide to a model in context conflicts with the information it has been shown during pre-training, and then some unexpected things can happen.
Generally, I agree that with these LLMs that we train today, the next-token distribution is very heavily influenced by the data that you put into context. And so your model is much more likely to defer to some specific piece of information that is in context than to some specific piece of information that it has been trained on. On the other hand, though, backpropagation and gradient descent are very effective methods. At least they work during pre-training; it seems that they are very effective methods to teach a model how to reproduce a certain pattern.
And of course, there's this argument of the connectionists that if you do that long enough, if you scale that up enough, you will have to develop some abstract forms of understanding, because to be able to predict the next token from previous tokens, you maybe have to learn abstract learning algorithms, and so on. So for example, if you are faced with a math question, and now you're doing backprop over this example that computes the derivative of a certain term, then essentially what backprop will teach the model is how to reproduce this. And if it teaches that on multiple of these examples, arguably the model will find some form of reproducing that behavior. Whether that is the algorithm that we want or not is a bit of a different question, but it will try to imitate it. I love thinking about purpose at
different levels of description. I mean, we were talking about this on our Patreon call last night; Keith Duggar gave the example of an ant hill. So it has a purpose, right? And it programs
all of the little ants, and the ants can adopt different roles. And it's very, very simplistic: they smell pheromones or something. And you know, sometimes they go on a path, and when they find food, they drop the pheromones.
Sometimes they don't. And that means other ants will follow the trail, and then the trail gets reinforced, and so on. And even though the ant is just following these very, very simple rules, it serves the purpose of the ant hill. But then there's the question of, well, it doesn't have an intrinsic purpose; the purpose is something we ascribe
to it. Maybe. I mean, it certainly makes you think, if you talk about these simulation games that simulate humans in their lives, to what degree we have more purpose than the ants. Because certainly some of the city-builder games, I mean, they get pretty close, not perfectly accurate, but pretty close, to how humans move around in a city. And there's certainly a lot of society-imposed purpose that makes us go around: be it going to school, be it going to work, be it going to a restaurant or the grocery store to get food. And those are fundamental needs that either come directly from being a human or just from society, and certainly a lot of the purpose that is driving us around is not necessarily intrinsic in that sense; it comes from external forces.
It really does make you wonder how much of it is internal.
I don't think I have the kind of background to be able to actually answer that question. But probably, I mean, staying with the city-builder kind of metaphor,
I don't think they in any way, shape or form represent truly internal motivation, right? So you can already say that anything that is not in the gap between these simulations of cities and what you observe in the real city, anything that is not inside that small gap, cannot be internal, right? Because it is somehow external, coming from society or as a by-product of the larger simulation. And then there's only a small sliver remaining, right? Of course, I think part of the human constitution is much more than how we move around in a city, and there's a lot of stuff happening in our brains which is not covered by that. But yeah.
This chimed with Eliezer Yudkowsky and Stephen Wolfram from the other day, because they were talking about where wants come from. You have an intelligent system, and you can model it as a thing that has a planning horizon, and the bigger the horizon, the more intelligent it is. And then, okay, well, why would he argue for something like instrumental convergence?
The idea being that when you have a superintelligent AI system, it will have predictably bad intermediate or instrumental subgoals. And you can think of it just in this city setting that you were just talking about: all of these different agents have constraints, like they have to flow through certain rivers, or in the case of a city, they have to go through certain routes.
In order to transact with each other, they have to use money, and they have to do all of these different things. So it kind of canalizes or traces out the space of future behavior. And it's not just physical, extrinsic constraints; it's also interactions with the behavior space of other agents in the future.
Yeah, I think my view is that a large part of this is abstraction. A lot of these systems, even these artificial systems that we build today, do have some kind of extrinsic goal that we ascribe to them, either just to imitate the distribution of things that we give to them, or some explicitly formulated goal. And then I think the wow moments that the AI community has accumulated over the years are, to a large part, situations where, in pursuing that goal,
the agent, if you want to call it an agent, came up with some subgoal that was actually the result of abstraction: the agent was abstracting from the immediate goal. It wasn't greedy in trying to solve the goal; it came up with this abstraction that allowed it to take an intermediate step, which maybe wasn't obvious to a human, to then still serve the purpose of solving the goal that was obvious to the human. And I don't know, I think it's definitely an interesting question whether coming up with these intermediate goals, or intrinsic goals, as you might say, is purely a function of being able to abstract, or whether there's something else going on. I tend to err to the side that this is really purely the power of abstraction, but I don't have all the answers there. Well, Schmidhuber said
that intelligence is compression, by which he meant abstraction. But something I think about as well, and again, we were talking about this last night:
the reason why asynchronous distributed software systems are so exciting is because that's how the natural biological world works, right? To use an extreme example, if we had some space station thirty light years away or something like that, and you have to get a message to it, you have to have a topology which has a kind of locality prior, right? And that's the way biological systems work: the cells send messages to each other, and it's all just emergent.
Now, in this asynchronous way, you kind of send messages. But when we design AI systems, it's different. They don't emerge, they're not grown; I mean, they learn using global credit assignment. So at some point, when we build our systems, we have to design them. We actually have to put in information boundaries, and we need to say, you are talking to this thing at this level of abstraction. And that seems completely divorced from how nature does
things. This I don't know whether I fully agree with, in some sense. I agree with the framing that we design our systems and we put some boundaries in place: what is the amount of compute you get, what is the amount of memory you get, and in some sense what is the amount of data that you will see. But I think we are going beyond that. You can certainly think about systems that follow these fundamental principles, whether that is some form of gradient descent or some other learning mechanism that is human-engineered, to, in the end, interact with the world and maybe design their own experiments, to then be able to decide for themselves what type of data to accrue from their environment; in some sense, how they move in input space and in what direction they evolve their compressed representations.
Well, let's talk about this kind of behavioral plasticity that we were talking about. What I love about active inference is that there are no explicit goals. There's no explicit reward function; you can think of it as a form of maximum entropy inverse reinforcement learning. It actually starts with the statistical distribution of the data, and then it infers, within constraints, what the reward might be. And I really like that, because I feel that we need to have systems that can materialize their own goals, rather than us as designers saying, you need to do this thing.
I think on a different level of abstraction, that is still a goal that is human-engineered, right? We still, as you said, design the AI system and design the learning mechanism. Even the various forms of active inference, however they materialize in practice, because you have to come up in any case with some probabilistic model over which you can optimize the expected free energy.
So I think in the end, even that form, on an abstract level, you can think of as some form of human-engineered reward. I don't like the term reward, because it's so linked to the reinforcement learning community, but it's a similar thing. And then I think really the argument that people make is that, of course, that will lead to different emergent behavior
than if you gave it a different reward, like something that is more commonly done in reinforcement learning; let's say you are working with an Atari game and you just give it the game score. Of course, also in reinforcement learning, if you give an agent a different reward, the agent will start learning differently.
And again, going back to what we discussed earlier, with intermediate subgoals actually being a function of the abstractions, or a by-product of the amount of compression the agent was able to achieve of its information, I think to that degree both ways can, and probably will, find these intermediate subgoals that we as humans would find interesting behavior. And I think that is, to a large part, what, in my view, people in active inference are hoping for: that if you scale these probabilistic models up, then the by-product of maximizing this expected free energy will be these interesting behaviors that lead to, I don't know, self-sustaining organisms and that
kind of thing. So yeah, I agree with what you set out; even with active inference, I think part of its selling point is that it's kind of human-centric, in the sense that when you build the systems, they will have agency in a constrained way, with preferences that are compatible with ours. And that's fine.
But don't we want to have AI systems with more plasticity, right? And I think what you say is true: we design all of these inductive priors.
We constrain the systems in such a way that they will behave in a manner we want. But I guess you could make the same argument about DNA. So DNA, some people might say, has a kind of constraining force on our behavior, but it doesn't seem to do that, right? If you look at our society, the way we construct language, there's all this beautiful diversity all over the place. So there's this asymmetry: even though it's all based on DNA, and if we deleted all of the DNA we would all be gone tomorrow, it allows this complete fan-out of behavioral complexity, while still being constrained in the same way.
I think this is very compatible with the view that, in the end, the DNA and our form of life give us some very hard constraints, right? We have to consume water, we have to consume food, right, to survive. But within that, there are quite a lot of pathways to achieve that.
And they are also very dependent on the society around us, right? And not just the society, the environment. And I think that's part of intelligence: coming up with these abstractions that, regardless of the environment, allow you to adapt to the environment and be able to fulfill your fundamental desires.
And those will depend on the environment. In the case of humans, because arguably we have some non-trivial amount of intelligence, how we achieve our goals also differs between humans. But I think this is really an aspect shared across all animals.
And you can see now, with climate change, that the animals that are very tuned to their environment, while they can cope with small changes in the environment and adapt to those, with drastic changes in the environment their learned abstractions, if you will, are useless, right? And then they cannot survive anymore. So yeah, where do the abstractions
come from? And what kind of abstractions do we hard-code into the architectures, and what kind of abstractions are meta-learned?
Yeah, I think this is a fundamental question that has been at the core of machine learning, and even more so AI, since its inception. A lot of people started with hand-coding these abstractions, in the sense of leveraging human-learned abstractions, or what we deem to be useful abstractions, but fundamentally being limited by the language in which we can express these abstractions, and also by our finite time. And then neural networks went beyond that and learned these abstractions
end to end. Are you a fan of Chollet's kind of type 1 / type 2 dichotomy?
Generally, I try not to use this kind of system one / system two dichotomy, and also the word reasoning that is sometimes linked to system two, because I don't always find it so informative, and it's definitely a loaded term that is interpreted very differently in various communities. I think my hunch is, though I might very well be very wrong about this,
that really this more, if you will, planning behavior can emerge from learned abstractions. So as we go up layers of abstraction, upon abstraction, upon abstraction, if we make predictions on a very high level of abstraction, that may seem like something where planning, more intelligence, is going on. So maybe this system one / system two thing is just a question of what level of abstraction we are operating on. And certainly, going up levels of abstraction requires more energy, more time, and is computationally harder, strictly harder, than finding these shallow abstractions. So in that sense, I resonate with the idea, but I think there are very different interpretations that different communities have come up with of how system one versus system two should be materialized. I don't necessarily find myself in those.
Very cool. Why does fine-tuning a model, let's say on a batch of examples, cost less than doing inference with in-context learning?
It's essentially because the information gets compressed into the weights. And in-context learning has this problem that for every new token that you generate, you have to attend to all previous tokens, right? And you also have to go once over all previous tokens. But I guess I'm
just trying to understand, fundamentally, why would the backward pass not be symmetric with the forward pass from a
computational point of view? I think the key here is that it's really amortized into the weights. In-context learning has to present all the evidence at once, otherwise it doesn't work, because there's no amortization going on. In that sense, it's really like a full working memory: every bit of information is accessible at once, and you
have to filter it out at once. Oh, I see. So when you do the fine-
tuning, there is obviously a batch size, right, and eventually you pack it into multiple boxes, like multiple sequences of tokens, and separate them. And then...
Ah, I got it. So it's still quadratic, but it amortizes to linear, because the chunks are so small.
Yeah, yeah, exactly. And really, that's the key, and that's why we train parametric models in the first place: because we try to amortize the data, or extract some compressed representation of the data, as opposed to keeping all the data at once, which you could also do with a non-parametric model.
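A back-of-the-envelope sketch of the cost asymmetry being described: self-attention over a context of length n costs on the order of n squared pairwise interactions per layer, whereas fine-tuning packs the same retrieved tokens into fixed-length chunks, so the quadratic term only applies within each chunk. The constants here (chunk length, the factor of three for forward plus backward) are illustrative assumptions, not measurements from the paper.

```python
def attention_pair_count(total_tokens, chunk_len=None):
    """Rough count of token-pair interactions in self-attention.
    With chunk_len set, tokens are packed into independent chunks (as in
    fine-tuning batches); otherwise they sit in one long context."""
    if chunk_len is None:
        return total_tokens ** 2
    full, rem = divmod(total_tokens, chunk_len)
    return full * chunk_len ** 2 + rem ** 2

retrieved_tokens = 50_000   # illustrative amount of retrieved evidence
in_context = attention_pair_count(retrieved_tokens)
# backward pass taken as roughly 2x the forward pass, hence the factor of 3
fine_tune = 3 * attention_pair_count(retrieved_tokens, chunk_len=2_048)

print(f"in-context  : {in_context:.3e} pair interactions")   # ~2.5e9
print(f"fine-tuning : {fine_tune:.3e} pair interactions")    # ~3.0e8
```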
Yes, yes, because, as opposed to something like a kernel function or something like that, you would need to
have all the data, exactly. And that is where the first ideas of local learning come from, right? The trajectory of local learning is actually quite fascinating to me, because it's quite linear going back.
It's quite a linear trajectory, working with these separate components. So actually, in the fifties, people came up with nearest neighbor retrieval, where you have this distance function, and then, to get a prediction for a new point, you just average the predictions of all the points around it. And then people came up with kernel regression in the sixties, which is essentially doing the same thing, but now you're weighting the importance of the points relative to their distance, or relative to some similarity measure, which people often call the kernel.
Yes, Vlad made
that, yeah, exactly. And then from there on, people went in different directions; there was more development on the non-parametric side. Then in the seventies, they started to do locally weighted linear regression, where they would train a separate linear regression head for every prediction that they make, and locally weight the training data around the prediction. So I ascribe a high weight, based on some kernel function, to the data that is around the point where I want to make the prediction, and less weight away from it. But that already adds this parametric component, which I think is key to this framework.
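To ground the historical thread, here is a compact sketch of the two classical local-learning estimators just mentioned: Nadaraya-Watson kernel regression, and locally weighted linear regression, where a separate linear fit is made around each query point. The Gaussian kernel and its bandwidth are arbitrary illustrative choices.

```python
import numpy as np

def gaussian_kernel(query, X, bandwidth=1.0):
    """Weight each training point by its distance to the query."""
    d2 = ((X - query) ** 2).sum(axis=1)
    return np.exp(-d2 / (2 * bandwidth ** 2))

def nadaraya_watson(query, X, y, bandwidth=1.0):
    """Kernel regression: a similarity-weighted average of the training targets."""
    w = gaussian_kernel(query, X, bandwidth)
    return (w * y).sum() / w.sum()

def locally_weighted_linreg(query, X, y, bandwidth=1.0, ridge=1e-6):
    """Fit a separate weighted linear model around every query point."""
    w = gaussian_kernel(query, X, bandwidth)
    Xb = np.hstack([X, np.ones((len(X), 1))])        # add an intercept column
    W = np.diag(w)
    theta = np.linalg.solve(Xb.T @ W @ Xb + ridge * np.eye(Xb.shape[1]),
                            Xb.T @ W @ y)
    return np.append(query, 1.0) @ theta
```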
There's always been this kind of juxtaposition between inductive learning, which is when we learn a general decision function that could work for any new piece of data, and transductive learning, which is particular to particular. This is when we learn a statistical function for a given input. And this is quite a new idea to a lot of people, but actually, as you were saying, this has been spoken about for decades by people like Vladimir Vapnik,
before this revolution of deep learning took off. So this idea has definitely been around since the seventies at least, but with early indications even before, with people doing this nearest neighbor search in non-parametric models.
Ah yes.
This is kind of the spectrum. I like to think about it as a spectrum where, in some problem domain, let's say it's natural language, you try to fit some data manifold that describes some aspect of natural language that you care about, and this inductive paradigm is all about fitting all of that at once.
So you extract some general rules that hopefully generalize to describe accurately the function that you want to learn over the entire manifold. And that is still, in some way, goal-directed, in the sense that you are limited by what is expressible in the problem domain that you work with. And traditionally, people have worked with rather small problem domains, right? So you start with MNIST, like
recognizing handwritten digits, and then you quickly realize, okay, it's a fairly narrow, small amount of information that you need to extract to do this prediction, because you're working with a very small problem space. But nowadays, we are scaling this problem space to natural language and beyond: images, video, really a huge problem space where a lot of really hard computational challenges can be expressed in these formats that we are learning over.
And this paradigm has roughly stayed the same, where you still try to learn one function that amortizes everything, right? So you learn amortized intuitions about how your data manifold looks everywhere. And then, at test time, when you do inference, when you do a forward pass, when you make a prediction, you use a fixed amount of compute to access your amortized intuitions of how your function should look everywhere.
But of course, in practice, there are certain parts of the data manifold that will be very easy to predict and easy to learn, and there are other parts of the data manifold that are very hard. And on these more complex problem domains, this is very evident; in human language,
there are simple tasks, like what humans do when they go on autopilot: we talk about the weather, right? You have these small chats where you talk about random stuff. But then you can also go deeper, and that's where humans need to think. That's when we also spend more of our brain's computation to be able to solve these problems, or even attempt to solve these problems. And that is not really captured by this inductive paradigm.
When I learned about machine learning, it was in the days of, I guess, the second AI winter. So it was support vector machines and kernel methods and stuff like that. And I learned about conformal prediction, and I learned about transductive inference, because Vladimir Vapnik was at my university.
And it's a little bit like, remember, in the 1980s, we used to have to do these insane optimizations with memory, because we just didn't have things that worked very well. So we were leaning into the optimization back then. And I guess now we've been in a regime, because of the deep learning revolution, where we felt that we haven't needed to do it, because the deep learning models work so well.
Apparently even in the multimodal, high-scale setting, they just seem to work very well. And it's almost like people have started to delude themselves that these aren't even statistical models anymore, and why would we need to do any kind of local optimization? So that's very interesting.
The other interesting thing that you said is, I think humans do this as well: when we deal with situations that are unpredictable and full of ambiguity, we do more processing. I interviewed Professor Mark Solms, who is a neuroscientist, or perhaps neuropsychoanalyst is the best way to describe him, and he says that we become conscious of situations when we are faced with ambiguity.
And it's counterintuitive, because when you're outside, sitting in the sun, and you are very mindful, you think of that as your brain doing less processing. But you can actually interpret it, as you start noticing the clouds and the trees, as your brain processing more information, not less. So having a variable amount of computation that we perform, if you like sensing more information, processing more information, is a very interesting thing.
I mean, it's certainly fascinating with the human brain, which is not my expertise, but for sure it's the case that there is a certain amount of energy, right, that we can turn over to do processing in the brain at a given time. So we cannot do everything at once; it's impossible. So we have to be strategic in how we use that energy to achieve our goals. And I think, fundamentally, the same is true for machines.
Yeah, yeah, which brings me to the next thing. So the other day, we came up with this really cool analogy to describe the behavior of your proposed machine learning models, and it's like Google Maps. So in Google Maps, you have this variable resolution: you start off up here with big, coarse tiles, and then you zoom into London or Zurich, and the tiles get smaller and smaller. And it's a great locality method, which allows you to use the compute where you need to use it. Is that a good analogy for the
kind of work you're doing? Yeah, I love that analogy. I think it really captures the essence, where you can think of the number of pixels you have as a given representational power, or maybe the compute power, that you have at a moment to represent something, to make a certain prediction.
And in this inductive paradigm, where you want to represent the picture of the entire world at once, there's only so much resolution that you can afford to spend on any individual prediction, let's say of the city of Zurich. So that will maybe be a single pixel, if that. And the power that you get if you zoom in is evident to anyone who has ever used Google Maps.
If you were to zoom in by just making the pixels larger, and not changing how you use your resolution to represent the local information, then pretty quickly Google Maps would be completely unusable, right? And that's actually what's super powerful. So if you, let's say, work on a computer with a 1080p display, and you have a friend who has a 4K display but has not figured out that you can actually zoom by reallocating the compute that each pixel does, and thinks you can only zoom by making the pixels larger or getting closer to the display, then you can pretty quickly out-smart him. While he maybe initially has a little bit of an advantage, because he can represent more information at once, if you want to go a little deeper, say you want to look at Zurich from space, then you already have much higher resolution than he does, because you're using your representational capacity, using your compute, in a smarter way.
Yes, this is one of the reasons why I've been so taken by this concept of active inference. Karl Friston is one of my heroes, and he's got this wonderful model of agency. And I guess the stark difference between an active inference agent and a lot of traditional machine learning is that it's actually doing this kind of situated computation, right? And that seems so incredibly important to me, and I feel that you're introducing that, but in a slightly more internalized way. Can you explain what you're doing?
Sure. So the recent work that we have done is all about locally zooming in to the distribution that language models learn, so locally learning that data manifold better. And the way it works is that you stay with the normal language model that you are working with, some fixed parametric model.
And then on top, at test time, you can look back into your memory, or big dataset, and find examples that describe how the manifold looks locally around the prediction that you are trying to make, and then just refresh your memory: spend a little bit of additional compute to use your representational capacity specifically for the prediction that you are trying to make. So instead of trying to learn the entire data manifold all at once, now we're in the game of making predictions specifically targeted to a certain task that we are faced with at a given time. And it's really about using compute and using representational capacity to its fullest to make a certain prediction.
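A minimal sketch of the loop being described: retrieve examples near the test prompt, take a handful of gradient steps on them, predict, then discard the update. It assumes a HuggingFace causal language model and a caller-supplied retrieval function; the model name, step count and learning rate are placeholder choices, not the settings used in the paper.

```python
import copy
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

def test_time_finetune(prompt: str, retrieve_fn, model_name: str = "gpt2",
                       steps: int = 20, lr: float = 1e-5) -> str:
    tok = AutoTokenizer.from_pretrained(model_name)
    base = AutoModelForCausalLM.from_pretrained(model_name)
    model = copy.deepcopy(base)          # keep the base model untouched for the next query
    model.train()
    opt = torch.optim.AdamW(model.parameters(), lr=lr)

    # retrieve_fn is assumed to return text snippets relevant to this prompt
    for _, snippet in zip(range(steps), retrieve_fn(prompt, k=steps)):
        batch = tok(snippet, return_tensors="pt", truncation=True, max_length=512)
        loss = model(**batch, labels=batch["input_ids"]).loss   # next-token loss on the snippet
        loss.backward()
        opt.step()
        opt.zero_grad()

    model.eval()
    inputs = tok(prompt, return_tensors="pt")
    out = model.generate(**inputs, max_new_tokens=64)
    return tok.decode(out[0], skip_special_tokens=True)
```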
So there's a beautiful figure, I think it's figure two in your SIFT paper, and I'd like to talk about that a little bit as we go. But you've got the data space, okay.
So the data space, I guess, is the Euclidean space of the selected data, and then you've got the data manifold. The data manifold is this kind of surface, or structure, which is actually learned from the data. And then let's say you have an example that comes in, and you want to make a really good prediction for that example.
So it kind of seems logical that you should go and sample a bunch of the information in the neighborhood of that sample, because that will improve the statistical power of your model. Now, there's a few things here. The first question is, where do you get that data from? I'm guessing you use something like FAISS; there's a whole bunch of things for doing nearest neighbor lookup, and they might do vector quantization or some kind of sub-linear retrieval-time method, and then inside that you do an improved search as well. But maybe we'll start with this: you could have a really, really big base model, and there's a ratio between how big the base model is and the cone of your selection when you do information retrieval. What's the relationship between those?
So I think what you are alluding to is the fact that with larger base models, we are able to capture the data manifold in a better way. So even if these larger base models try to essentially make all predictions at once, or amortize these predictions so that we can access them at constant cost when we do inference, larger models mean we get better representations, more representational capacity, and we can learn the statistics better, essentially.
And what we are showing is that this local learning at test time is an additional mechanism on top: regardless of what representational capacity you start with in your base model, if you add this on top, you get that additional extra bit of representational power to make a better prediction, essentially using your representational capacity to its fullest when you make that prediction. Because at any point in time, your pre-trained language model, if you will, has to encode all of this information, since it has to solve a lot of different tasks at once. And now, when you are tasked with making a certain prediction, usually it can just forget, it can ignore, most of the information that it has compressed. And that means it can represent the pieces of information that are critical to actually making a good prediction at higher fidelity, right? So it can also make the prediction at higher fidelity.
So I'm thinking, we are kind of assuming that there's one data manifold, one data space. Why not partition it up? So instead of having a multimodal embedding space, let's say we've got something that can do text and images and video, why not separate them out and have different modalities, different data spaces?
And in what sense do you think that might... do you mean one could train a separate model for each of them? So you would pull your data manifold apart into multiple sub-manifolds, train a separate model on each of them, and then be done with it?
I think so. One of the things we are playing around with here is that we're in this regime at the moment of having a single model which is trained on every data modality, and your work is hinting that specificity helps a lot, that having a local, situated method helps a lot.
So couldn't you say the same for even the modality of the data? I mean, couldn't you just keep partitioning and create some hybrid system? And would that give you more or less statistical power?
So I think we have to separate a few things here. The first thing is, of course, if you are making a certain prediction in some specific modality, being able to really focus on that modality is useful. But if, let's say, your entire data manifold, which is now cut into sub-parts, contains information that is necessary to make a good prediction, but it's accessible only in a different modality, then you need to have this cross-modality to be able to access that information.
So there are really two key elements here. The first key element is to be able to use the entire representational capacity to make a certain prediction, and in a sense overfit to the test problem at hand. But the other key aspect is that we need to have the right abstractions, or we need to have learned the right representations, to be able to decide what data is actually useful, what data contains the necessary information to make a good prediction, and also the right abstractions, or the right learning algorithms, to turn this information into a good prediction. And generally, what we have seen over the last couple of years is that if you scale these deep learning models up, and if you show them a wide variety of data during pre-training, then they get better at finding these similar patterns across modalities, and also across problems, to come up with more general solution algorithms, or simple mechanisms, that allow them to synthesize. So for example, if you ask ChatGPT to write you a song about a certain topic in a certain style, it will do that, but that's certainly not possible if you have never shown it any examples of songs in its training data. So I think this cross-modality is really key, and that doesn't necessarily detract from the fact that you want to, at test time, use your representational power, after having seen a little bit of everything and understanding how everything behaves, to then focus on those things that, given your abstractions and your knowledge, you think are important to make a prediction.
So I think we should contrast your work with active learning. In, I guess, the olden days of machine learning, we would produce a decision function, and we would have a test/train split, or maybe a validation split, and so on. And then active learning came along.
And active learning said, well, actually, you might be faced with a shifting data distribution. So when this thing is used next week, the distribution might be different. So why don't we continuously retrain the model on the shifting data distribution, with active learning selecting, let's say, diverse and new training examples. But what you're doing is a type of active learning that is deliberately honing in on the specific rather than looking at the general.
Exactly. I think you have put it very well. I would extend the characterization of active learning a little bit, in the sense that a lot of active learning has just focused on how we can sub-select the training data that we use in the standard training-and-then-evaluation paradigm, so that we can use less data, but already compress the information that is contained in the data in such a way that the model will still be as powerful as if we had trained on the entire data.
And people have shown in some cases that that can be useful and powerful, but in those cases, random is a really strong baseline, and random has often outperformed a lot of these methods. Now, what I am working on is very different, but similar in some ways. The setting that I'm interested in is local learning.
So, making these specific predictions, and in that case, there is only some small sliver of the entire information that is in your memory that is actually relevant to making that prediction, and actually searching for that is a key aspect. But you also want to maintain diversity in the information that you obtain. So the methods that we end up working with are some kind of mixture between methods that people have traditionally looked at in the literature on search, and methods that people have looked at in the literature on active learning, which are all about finding diverse samples.
Very cool. So it's a kind of blending of information retrieval and active learning in a very specific way. Maybe we should start with the search.
So you said in your paper, very beautifully, that the naive way to do this is: I have an input example, and I have a training dataset or something like that, and I should just go and retrieve the nearest neighbors, again using some kind of vector quantization retrieval system like FAISS. So I retrieve a whole bunch of the nearest neighbors for this input example and fine-tune my model on them. What could possibly go wrong?
There are a lot of things that can go wrong. And the main thing is that nearest neighbor has really been designed as a search method. So if you have a big bag of possible solution candidates that you're looking through, like a needle-in-a-haystack problem, it tries to give you as many candidates as closely matching your description as possible. What you realize in local learning is that you actually want to learn something, and learning something requires synthesizing different pieces of information.
So for example, if your task is not a simple retrieval task, in the sense that the information is exactly encoded in your data and you just have to find this one piece of information and return it to the user, but instead there are a lot of different pieces of information disseminated across your data, and you need to find all of the relevant pieces and return them to the learning algorithm at once, then nearest neighbor does not work, because nearest neighbor just focuses on the dominant frequency, or the most similar aspect that some sub-cluster in your data has to whatever you are asking it about. So for example, if you give your engine a question, and we have this example where someone was asking about the age of Michael Jordan and how many kids he has: the data in your memory you can think of as representing some clusters. Some part of the data is just information about his age, and some part of the data is just about his kids. And in practice, what happens is that in this latent abstraction space, just one of the clusters will be closer to your question, which combines both, and then your nearest neighbor search will return all pieces of information about one of the topics, I think in that case what his age is, before it ever finds any of the pieces of information about the number of kids.
Yeah, so we'll show this figure on the screen, but the prompt was, what's the age of Michael Jordan and how many kids does he have, and the nearest neighbor process basically doesn't even address the number of kids that he has, because it's almost like it
latches on to the age. Exactly, and this was really, we did it in a very simple setup, right: there were really just four pieces of information in the memory, two about his age, two about the number of kids he has. I think we just requested the closest two, and normal nearest neighbor would just return redundant information, because there's nothing in the objective of nearest neighbor that's actually quantifying how informative the set of returned examples should be. It's just caring about the marginal proximity to what you asked about.
Okay, so this is really interesting. So the way we do the information retrieval is using a bunch of embeddings, and I think in your paper you're using RoBERTa embeddings, but that may or may not be relevant.
But you're saying, rather than just searching using some similarity metric in Euclidean space and getting the nearest neighbors, we should have the concept of information gain with respect to some task. So in this particular case, the task is: I need to know this, and I need to know that. And we're saying, retrieve me the examples that actually give me the most information for the thing I want to
do. Exactly. And the funny thing is, if you just try to retrieve one piece of information from your data store, those two views turn out to be pretty much equivalent, right? Because if you can only access one piece of information from your memory, the best thing that you can do, your best shot, is to just take the thing that is most relevant.
But as soon as you have picked that piece of memory, and now you're looking for the next piece of memory, just finding something that is as close as possible, or as related as possible, to your task is not the best thing anymore, because that might be exactly the same thing that you've already seen, in case that information is literally duplicated in your data. So instead, what you should do is look both at how relevant a new piece of information is, and also at how non-redundant it is relative to the pieces of information you have already assembled. So you are essentially trying to solve some trade-off between finding examples that are as related as possible to your prediction, and finding examples that are a diverse representation of all the information that is encoded in your memory, if that makes sense.
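Here is a hedged sketch of that trade-off in its simplest mathematical form: treat the quantity to be predicted as a linear function of the embeddings with a Gaussian prior, and greedily pick the data point whose observation most reduces the predictive variance at the test embedding. This is the generic transductive, information-gain idea rather than the exact objective from the paper, and the noise and prior scales are arbitrary.

```python
import numpy as np

def greedy_information_gain(test_emb: np.ndarray, cand_embs: np.ndarray,
                            k: int, noise: float = 0.1) -> list[int]:
    """Select k candidates that most reduce predictive variance at test_emb,
    under a Bayesian linear model f(x) = w @ x with prior w ~ N(0, I)."""
    d = test_emb.shape[0]
    cov = np.eye(d)                              # posterior covariance of w
    chosen: list[int] = []
    for _ in range(k):
        best, best_var = None, np.inf
        for i in range(len(cand_embs)):
            if i in chosen:
                continue
            x = cand_embs[i]
            # rank-one posterior update after observing a noisy value of w @ x
            cx = cov @ x
            new_cov = cov - np.outer(cx, cx) / (x @ cx + noise ** 2)
            var = test_emb @ new_cov @ test_emb  # predictive variance at the test point
            if var < best_var:
                best, best_var = i, var
        x = cand_embs[best]
        cx = cov @ x
        cov = cov - np.outer(cx, cx) / (x @ cx + noise ** 2)
        chosen.append(best)
    return chosen
```

Note how the two failure modes discussed map onto this: a near-duplicate of an already selected snippet barely changes the covariance, so it scores poorly, while a merely close but uninformative snippet also buys little variance reduction at the test point.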
So, devil's advocate: you're saying that we do some retrieval activity around the cone, and if the cone is very small, so if the specificity is high, nearest neighbors will just produce relatively redundant information and will miss a lot of it. It won't be sensitive enough to capture all of the information in the prompt.
But what if we just increase the cone? What if we select nearest neighbors, but we select, let's say, two thousand nearest neighbors, something like that? Then, yeah, we've got to fine-tune on two thousand examples now instead of five, but probably some of those examples would have information about how many kids Michael Jordan had.
That is definitely a fair point, and there are multiple things to say here. One is that it is actually not just important what you see; it's also important how you see it. And it turns out that it is sometimes useful to see the same piece of information repeated.
But it depends on that piece of information, and on how good your learning system is, how strong your learning system is. So, for example, let's say you are working on some math question, and you are faced with the problem of computing the derivative of some equation, and then you look into your memory for similar problems that you have solved before. Now, maybe there's a cluster of a lot of other problems that were about finding derivatives, but all of them use some trick to find the derivative that is not applicable here.
And now you will just train on them a lot, and you will overfit to them a lot. So it turns out to be very important to be careful with how many gradient steps you do on a certain example. And it turns out that the solution method that we propose, which is called SIFT, takes care of that explicitly.
We have this one figure in the paper, which I think is quite insightful in that regard, where we actually look at all the examples where SIFT decides to just fine-tune on the nearest neighbor repeatedly for fifty steps, because there is no notion in SIFT that all pieces of information have to be separate, like there is in nearest neighbor. So SIFT is perfectly fine with just fine-tuning on the same piece of information if it is helpful, and in all those cases, it is drastically better than nearest neighbor, because nearest neighbor would just take this piece of information once and then move on to other parts, even though they might be less insightful. So I think there's really a spectrum, and as you said, it might be good to train on some examples a lot, but nearest neighbor just doesn't take into account this question of where it's actually good to train on examples more or less.
Another thought that occurs to me is, I guess, representation bias, and also the robustness of models to fine-tuning. I remember reading in François Chollet's deep learning book that you have to be careful with fine-tuning, because fine-tuning can quickly overwhelm a model. If you turn the learning rate up too high, for example, or you do too many gradient steps, at some point the model kind of forgets everything it knew previously and leans too far into your examples, which can cause overfitting. And the representation bias might be that you don't want the model to be overweighted by the thing you're retrieving. So is there some trade-off there?
There is some trade-off, of course. If you were to use a very high learning rate, your locally tuned model wouldn't work well anymore. But I think it's really a very natural thing. What you want is to somehow take this local information and use these examples as a way to tell your model: it's fine to forget some information and fit this new information more closely. But in a similar vein, you don't want to only encode this information: if you do a lot of fine-tuning steps, eventually it will just literally memorize, or just try to predict, the specific examples that you retrieved from your memory. So there is some trade-off, and you want to be somewhere in between. Yeah, that's what we said.
Yes.
Yeah. I think one of the key things that came out of our work, and this is part of how we actually do the information retrieval, if you want to frame it that way, is that you need to get a hold of the uncertainty of the model. That's really a key aspect. The uncertainty tells you, based on the current model's representations, how the model would change, or how much more certain the model would become, if you gave it a new piece of information. Now you can use that to say: okay, this piece of information is very relevant to the prediction, it's very informative, it's needed to make a good prediction, so I'll show it to the model. At the same time, maybe there are pieces of information that are not relevant to making a good prediction, and then you can exclude them. So you can, in some capacity, determine for a given example whether you have important information in your memory, in which case you should use it, whereas if you don't have relevant information in your memory, then maybe you shouldn't do fine-tuning at all, or not as much.
And how does the embedding function affect the result?
So the embeddings are crucial, right? The embeddings in a sense describe the data manifold. The embeddings essentially describe how informative a certain piece of information is about something else. And this is closely linked to the linear representation hypothesis that is widely studied, especially in interpretability. Essentially, the linear representation hypothesis says that abstract concepts, within complex enough neural networks, LLMs, are represented as linear directions in some representation space. And that representation space is accessible, so we can tap into it. Now you can think about these representations as describing the data manifold: concepts that are very aligned in the latent space are very relevant to one another. So if you want to make a smart statement about one of these concepts, you'd better know about this other concept that is closely related to it. And in a sense, if your abstractions are not good, then you will not identify these related concepts: concepts that to us seem very related, and that we know are related, the machine will not have identified as related, and then it's useless. That is part of the story of why, twenty years ago, when the representations were not as good, people did not talk about the linear representation hypothesis: machines were not yet able to identify these concepts that we as humans know are related.
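As a toy illustration of what "linearly represented" means in practice, here is a small sketch that fits a linear probe on frozen embeddings to check whether a concept is linearly decodable. The synthetic embeddings and labels are placeholders, not data from any experiment discussed here.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression

# Hypothetical setup: frozen embeddings of 1000 items and a binary concept label.
# If the concept is encoded as a linear direction, a linear probe separates it well.
rng = np.random.default_rng(0)
embeddings = rng.normal(size=(1000, 256))      # stand-in for model activations
labels = (embeddings[:, 0] + 0.1 * rng.normal(size=1000) > 0).astype(int)

probe = LogisticRegression(max_iter=1000).fit(embeddings[:800], labels[:800])
print("probe accuracy:", probe.score(embeddings[800:], labels[800:]))
# probe.coef_ is then the estimated linear direction for the concept.
```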
Yeah, Neel Nanda was telling me about this when I spoke with him last time; he's doing amazing work in mech interp. And yes, this idea essentially means that, although we've been speaking about a manifold, concepts and so on can be thought of as having some linear direction in that space. And there are examples of using linear probes: there's that Othello board-game example, where a linear probe can extract the board state. But I also think it's useful for surrogate models. There are loads of examples in machine learning, like datamodels, I spoke with Andrew Ilyas about those, or even LIME, and indeed your work as well: you create a very basic linear data model, where you learn a bunch of parameters, and a bias term, with respect to an input example and your data, right?
So I think, generally, the idea of using a surrogate model, that is, using a simpler model to understand the behaviour of a more complex model, is a very powerful idea. In our case we use a linear model: we assume that the logits of this big LLM are a linear function in this abstract representation space, which is highly non-linear in the inputs, but we treat that representation space as fixed. So treating all these embeddings as fixed, and treating your final head that produces the logits, which, after you pass them through the softmax, give you next-token probabilities, as a linear function now allows you to analyze it, right? Things become tractable, and that's very important if you want to make these kinds of decisions and optimize an objective. That's why these surrogate models are so powerful. You're right that in other domains people have also looked at surrogate models and found them very powerful. For example, datamodels, where you want to learn how certain data influences a certain prediction, or LIME, where you want to fit a linear model that is actually interpretable, not in some abstract representation space but in a space that you actually understand. Then, when you have trained this linear model locally for your prediction, you can look at it and interpret the importance of the weights, because the weight ascribed to a certain input feature has a specific meaning.
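As a rough sketch of the local-surrogate idea in the spirit of LIME (not the method discussed in this episode): perturb the input around a point of interest, query the black-box model, and fit a proximity-weighted linear model whose coefficients can be read as local feature importances. The black-box function and kernel width below are stand-ins.

```python
import numpy as np

def black_box(X):
    """Stand-in for an expensive non-linear model we want to explain locally."""
    return np.sin(X[:, 0]) + X[:, 1] ** 2

def local_linear_explanation(x0, n_samples=500, sigma=0.5):
    """Fit a proximity-weighted least-squares linear surrogate around x0 (LIME-style)."""
    rng = np.random.default_rng(0)
    X = x0 + sigma * rng.normal(size=(n_samples, len(x0)))         # local perturbations
    y = black_box(X)
    w = np.exp(-np.sum((X - x0) ** 2, axis=1) / (2 * sigma ** 2))  # proximity weights
    sqrt_w = np.sqrt(w)
    A = np.hstack([X, np.ones((n_samples, 1))])                    # features + bias term
    coef, *_ = np.linalg.lstsq(A * sqrt_w[:, None], y * sqrt_w, rcond=None)
    return coef[:-1], coef[-1]      # local feature importances, intercept

weights, bias = local_linear_explanation(np.array([0.3, -1.0]))
print("local importances:", weights)
```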
Isn't that crazy? As you say, these are highly non-linear models, and I guess LIME is a kind of influence function, in that you have a linear model that tells you what the weighting of the component features was for a given input example in making that prediction. It just seems crazy to me that you can use such simple linear surrogate models for interpretation.
But I think it's really a wide spectrum. With LIME, the reason it works is actually fundamental to the reason local learning works: if you want to make predictions that generalize across the entire data manifold, you need a very non-linear function, whereas if you just want to explain one local prediction, the hypothesis is that you can get away with a linear model, because you need to encode much less information. You need a much less complex model; the representational capacity doesn't have to be as big to explain this one prediction. Now, there are of course limits to this, and in some cases it doesn't work so well, and LIME will tell you: okay, we found this linear model, but it doesn't actually track what the big model did very well. In the datamodels case, or in our case, we don't use that linear model to make predictions, and we also train that linear model in this non-linear representation space. Essentially, I think you can fundamentally view virtually any neural network as a big encoder up to the penultimate layer, followed by one final linear layer that projects to your output dimension. In that way, neural networks are just sequences of linear functions composed with non-linearities.
Yeah, this makes a lot of sense. So LIME is short for local interpretable model-agnostic explanations: for a given input example, in that local area, you can have a linear model. And I guess it's the same thing with datamodels; you're basically saying, for this single example, what is the influence of the data set? And actually even the spline view is the same kind of thing. The spline theory of neural networks says that models are locally affine: for a given input example, you can represent the behaviour of the neural network with a simple affine transformation, determined by which region the input falls into. So that's how deep learning models work, essentially: a massive collection of linear transformations, one for each input region.
Exactly. You are essentially trying to move the data into some space where it is represented in such a way that you can make predictions with linear functions. And that can be very powerful in the sense of actually designing your model, your representations, but also your functions, in such a way that they will make good predictions, because you know the structure. Yeah, I love it.
So, datamodels: people have also tried to use them to select data. But a fundamental assumption, or a limitation that people have found, is that if you use these linear models to learn how a selected subset affects your test loss, say, or your prediction, then it implicitly assumes that the influence of data adds up linearly. There is this big line of work in interpretability on influence functions, and maybe we should explain what that is. Essentially, influence functions try to estimate the loss on a test point that you would have had if you had trained your model with or without a certain data point. So you can use it to understand that this data point in my data set influenced the loss on this prediction to this degree, right? You can ask these kinds of counterfactual questions; that's where it attempts to be helpful. Now, a problem is how to make this tractable, of course, and one approximation people typically make is to use a first-order Taylor approximation of the loss. With datamodels, in a similar way, people use first-order methods. And what that does is ignore the mutual dependence of data points relative to your prediction. It only looks at the singular, individual importance of one data point for a prediction. But of course you don't train your model on just one data point; you train it on multiple data points, and how you compose your training set is very important, not just in terms of what data you give, but also in terms of what data you show relative to the other data it has already seen. In that sense, when you use these methods for data selection, you end up with something like nearest neighbour: you are still going to do some nearest-neighbour retrieval in some embedding space. But as we discussed, this nearest-neighbour retrieval completely ignores the fact that you can end up with completely redundant information. If you show the same example to the model repeatedly, it will think that it keeps improving, because it will think every example has the same value; in fact it doesn't, and the marginal gain you get from each example quickly diminishes.
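For readers who want the first-order picture in symbols, the standard influence-function approximation (textbook notation, not quoted from the paper) scores a training point z against a test point as

\[
\mathcal{I}(z, z_{\text{test}}) \;\approx\; -\,\nabla_\theta L(z_{\text{test}}, \hat\theta)^{\top} H_{\hat\theta}^{-1}\, \nabla_\theta L(z, \hat\theta),
\qquad
\Delta L(S) \;\approx\; \sum_{z \in S} \mathcal{I}(z, z_{\text{test}}),
\]

where H is the Hessian of the training loss at the fitted parameters. The sum on the right is exactly the additivity assumption being criticised here: the estimated effect of a subset S is just the sum of individual effects, so two near-duplicate examples count double even though the second one adds almost nothing.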
This is really interesting. So you're saying that there are these first-, second-, and maybe higher-order interactions between the data examples; we shouldn't think of them as separate. And this reminds me of the ML security literature. There's data poisoning, for example, which shows that just by manipulating the order of the data that a machine learning model gets trained on, you can manipulate the model to do almost anything; you can even put backdoors in the model. So we need a method for understanding the interactions between the data examples, as well as the data examples themselves.
Right, exactly. I think that is really critical, and you see it empirically, and we also see it day to day in our lives, in how we operate. The decisions about what information we access today will influence what we find interesting tomorrow. And this doesn't happen independently, right? We don't start every day from scratch and just sample a new point from the same distribution. We think: okay, I now have knowledge in this field and in that field, and maybe there's some field in between that would combine this knowledge; and now I would sample knowledge from that third field rather than from one of the first two, because I already know those quite well.
So the million-dollar question is: how do you do retrieval taking into account the interactions between the data points?
This is actually quite straightforward. What we do is build a simple, tractable surrogate model with which you can estimate the uncertainty that the model has about making a certain prediction. So you get a tractable quantity that describes how good your prediction would be if you had shown the model a certain set of data points, and that quantity can be optimized. In closed form, you can minimize the uncertainty that this model has about making a certain response. And it turns out that this borrows really important ideas from nearest-neighbour retrieval: in particular, the first example that you take will be the nearest neighbour. But as soon as you look for the second example, you will try to find examples that are as aligned as possible with your task in this latent space, where similarity is encoded in a linear way, and at the same time try to find examples that are as orthogonal as possible to the pieces of information that you have already accumulated.
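Here is a minimal toy version of that kind of uncertainty-driven selection, assuming a Bayesian linear surrogate in a fixed embedding space (a naive illustration, not the released code): greedily add the candidate that most reduces the posterior variance of the prediction at the query embedding.

```python
import numpy as np

def greedy_uncertainty_selection(query, candidates, k, noise=0.1, prior=1.0):
    """Greedily pick k candidate embeddings that minimise the epistemic variance
    of a Bayesian linear regression prediction at `query`.

    Assumes a Gaussian prior N(0, prior * I) on the weights and Gaussian noise.
    Naive O(n * d^3) per step; fine for a sketch, rank-one updates would be faster.
    """
    d = query.shape[0]
    selected, X = [], np.empty((0, d))
    for _ in range(k):
        best_idx, best_var = None, np.inf
        for i in range(len(candidates)):
            if i in selected:
                continue
            X_try = np.vstack([X, candidates[i]])
            # Posterior covariance of the weights: (X^T X / noise^2 + I / prior)^-1
            cov = np.linalg.inv(X_try.T @ X_try / noise**2 + np.eye(d) / prior)
            var = query @ cov @ query          # predictive (epistemic) variance at query
            if var < best_var:
                best_var, best_idx = var, i
        selected.append(best_idx)
        X = np.vstack([X, candidates[best_idx]])
    return selected
```

The first pick is essentially the candidate most aligned with the query; later picks are automatically penalised for pointing in directions already covered, which is the relevance-versus-redundancy behaviour described above.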
We'll get to the closed-form solution in a minute; there's some beautiful mathematics around it. But just stating it as an objective function, what are you actually trying to do?
Yeah. Fundamentally you can think about it from multiple different perspectives, but you're trying to minimize some intrinsic objective of this machine, of the LLM, which is its uncertainty about making a good prediction. So you come up with some measure of uncertainty that describes how good the prediction will be, and then the machine tries to go out into its memory and find the relevant data that lets it end up with a good prediction. You can frame this in multiple ways. I really like a framing that is more probabilistic, where you think about your LLM as having epistemic beliefs about what the right function to describe your prediction could be. Slowly, by showing it more data, you are manipulating those beliefs; the machine is changing its epistemic beliefs through Bayesian updates, through probabilistic updates, so it's computing a posterior epistemic belief. And what this objective says is that the LLM should take those examples from memory that lead to the most certain posterior beliefs relative to the prediction it is trying to make.
Okay. So what does certainty mean? Does that mean you are trying to, like, minimize the variance of the posterior?
Exactly, in a sense. Of course, when you are working in this probabilistic modelling framework, a key aspect is making it tractable. That's one of the key limitations of Bayesian inference, of probabilistic inference: computing this posterior is in general a very hard problem. But using this linear surrogate model, essentially treating the initial random variables as Gaussian and using a Gaussian observation model, this posterior update is tractable, so you can compute it in closed form. Then you have this big distribution over the possible predictions that the model could make, and there is one prediction that you care about, right? So you are trying to minimize some measure of uncertainty related to that prediction, and for Gaussians that's usually the variance. You just minimize the epistemic variance of your model relative to the prediction that you're making.
This is really exciting, right? Because I love Bayesian analysis, and unfortunately we don't have access to a hypercomputer that can do all the computations in the universe. Neural networks just produce point predictions; they're like maximum likelihood estimators, they just say "this is the most likely thing", and we don't have the confidence intervals, we don't have all of the uncertainty that we would have had with a Bayesian method. So what you're saying is that we have a linear surrogate model, and in that linear surrogate model we can model the uncertainty, we can do confidence estimates and so on. And then the other thing is that you can use kernels for this, and Gaussian processes as well, to some degree?
Yes. I think the simplest form of this linear surrogate model is just Bayesian linear regression. It's the first Bayesian model you're introduced to when you start working with Bayesian machine learning: your standard linear regression, but on top of that you have a Gaussian prior over the weights, and then a Gaussian likelihood. So you have an observation model that says that whatever observation you get is the ground-truth function plus some Gaussian noise that is i.i.d. In that framework, you can do the posterior Bayesian inference exactly, in closed form, so you can write down the closed-form solution. And this is fundamentally what we do when we use a linear surrogate model: we approximate this very complicated neural network with a Bayesian linear regression, which, right, is a bad approximation in terms of making good predictions, but it turns out to be good enough for deciding what information is relevant to improving the prediction of the LLM.
And can you explain how it's possible to do that in closed form? Because typically you think of Bayesian analysis as needing complex solvers, Markov chain Monte Carlo and all this kind of stuff, but you can just write it down as a formula. Maybe you should explain what we mean by closed form.
What we mean by closed form is generally that we can describe the posterior probability distribution over weights as a mathematical formula, and that it can be derived directly from the formulas for the prior and the likelihood. The way standard Bayesian inference works, Bayes' rule, is that you compute the posterior probability as being proportional to the likelihood of your data times the prior, divided by some term that normalizes this to be a true probability distribution. And the beautiful thing is that if you do this with Gaussians, that is, if you use a Gaussian prior and a Gaussian likelihood, then this posterior distribution will still be a Gaussian. That is fundamentally because Gaussians only have these first-order and second-order terms, and they are the one probability distribution where, if you only have first-order and second-order terms, you know you're dealing with a Gaussian. That's why you can do this in closed form.
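Concretely, the textbook closed form for Bayesian linear regression (standard notation, not quoted from the paper) is: with prior w ~ N(0, τ²I) and observations y = Xw + ε, ε ~ N(0, σ²I), the posterior over the weights and the epistemic variance at a query embedding x⋆ are

\[
\Sigma \;=\; \Big(\tfrac{1}{\sigma^{2}} X^{\top}X + \tfrac{1}{\tau^{2}} I\Big)^{-1},
\qquad
\mu \;=\; \tfrac{1}{\sigma^{2}}\, \Sigma X^{\top} y,
\qquad
\operatorname{Var}\!\big[x_\star^{\top} w \mid X, y\big] \;=\; x_\star^{\top} \Sigma\, x_\star .
\]

Note that the variance depends only on which rows go into X, not on the targets y, which is why the data-selection objective can be evaluated and optimised in closed form before any label is seen.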
Very cool. So we've described how we have a way of estimating the confidence for a given prediction. Now what we want to do is use only as much inference-time computation as is required, because if you think about it, we could just use an unbounded amount, but that's not very good. Wouldn't it be cool if we could link the amount of inference computation to the confidence estimate?
Absolutely. And we looked at this a little bit recently, because you can think of it as a powerful tool when we have knowledge of how certain a model would be if it had seen certain data. Then we can think about: okay, how much do we actually want to pay, in terms of compute, to get a certain improvement in our model? In that sense, we can use these projections of how uncertain the model would be to stop computation early, when the uncertainty reduction stops being proportional to the amount of compute we're paying. Usually, for any type of prediction, you get this submodular kind of return curve where the marginal gains diminish over time, because you are slowly accumulating information, and at some point you will have accumulated all the information that is required to make such a prediction.
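A toy sketch of that stopping rule, with an assumed threshold and a simple per-step cost model (not tied to any particular system): keep spending test-time compute only while the marginal uncertainty reduction per unit of compute stays worthwhile.

```python
def adaptive_compute(variance_after_step, cost_per_step, threshold=0.01, max_steps=50):
    """Stop test-time computation once the marginal uncertainty reduction
    per unit of compute falls below `threshold`.

    variance_after_step: callable mapping a step count to the predictive variance
                         after spending that many steps (e.g. from the surrogate).
    """
    prev_var = variance_after_step(0)
    for step in range(1, max_steps + 1):
        var = variance_after_step(step)
        gain_per_cost = (prev_var - var) / cost_per_step
        if gain_per_cost < threshold:        # diminishing returns: stop early
            return step - 1
        prev_var = var
    return max_steps
```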
So I like to think about this framework as having a controller and a memory, right? Essentially, in the standard computational framework people think about the same two components, for example with Turing machines: you have the head, which is the finite automaton, and then you have the memory, which is the unbounded tape. Now, what we can do today, which is quite amazing, with these learned abstract representations, is essentially jump to any piece of content that is stored in this memory, just using the intuition of the abstractions that we have learned, whereas in the standard computational framework you would have to move left and right on the tape of the Turing machine. Now, in essentially log-linear time, we can access the entire memory at once, or at least the relevant bits.
The controller has two important functions in this framework. The first one, as I said, is deciding what pieces of memory it should operate on, and it can do so by leveraging these abstractions and representations to move in this memory space, using shortcuts instead of going left and right. The second key capability of this controller is that it learns representations and abstractions that allow it to ingest information from the memory: it reads from the memory and then updates its weights. If you think about it in terms of a neural network, we want a controller that is very good at making these gradient-step updates. So you can think about better models, stronger models in this framework, stronger controllers, as being more robust to the types of information they read from memory, and as leveraging their abstractions, their knowledge, to actually use that information to make good predictions. I think a good analogy is that if you give a kid some complex math textbook, it will probably not learn much from it immediately, because the representations and abstractions needed to make sense of that information are not there yet. What we want from a strong controller is that whatever piece of information it fetches from its memory, it is able to interpret it and use it to its full potential relative to making a good prediction.
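Putting the controller-and-memory picture together in a fully toy example (a tiny linear model standing in for the LLM, and plain relevance-based retrieval standing in for the uncertainty-driven selection discussed earlier): a throwaway copy of the base weights is fine-tuned on retrieved memories for each individual query, and the base model itself is never changed.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy "memory": embeddings with scalar targets, and a tiny linear "controller".
memory_X = rng.normal(size=(200, 16))
memory_y = memory_X @ rng.normal(size=16) + 0.05 * rng.normal(size=200)
base_weights = np.zeros(16)                      # stands in for the pre-trained model

def local_learning_predict(query, k=10, steps=20, lr=0.05):
    """Retrieve k relevant memories, fine-tune a per-query copy, and predict."""
    idx = np.argsort(-(memory_X @ query))[:k]    # controller decision 1: what to read
    w = base_weights.copy()                      # throwaway, per-prediction copy
    for _ in range(steps):                       # controller decision 2: ingest it
        for i in idx:
            err = memory_X[i] @ w - memory_y[i]
            w -= lr * err * memory_X[i]          # one SGD step per retrieved example
    return query @ w

print(local_learning_predict(rng.normal(size=16)))
```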
Yes. So the FSA is the controller, and the two push-down stacks are essentially the memory. But this is interesting, right? Because LLMs are FSAs, in a sense, and Keith Duggar, my co-host, is big on Turing machines. He always says that there is a class of algorithms that could run on a finite state automaton acting as a controller for a Turing machine, and that class of algorithms, even though they could in principle run on an LLM, aren't learnable via stochastic gradient descent, because of course this is a thing with a fixed amount of memory; it can't expand its memory. But the broader point is that the Turing-machine class of computation can do essentially any type of computation. And we're not just talking about an improved method to boost predictive performance through specificity; we're actually talking about a new paradigm of Turing-complete computation.
Absolutely. In this sense, you can think about this memory as potentially being unbounded, and you get rid of the limitation that current LLMs have, namely that they have to compress all the information at once into a limited format that you cannot extend afterwards; if you want to come up with something different, you basically have to retrain from scratch. Now you can augment that controller, which has learned these strong intuitions, strong representations that allow it to quickly adapt to new information, with this potentially unbounded memory which, as you evolve the system over time, you can extend: you can add pieces of information or remove pieces of information as you please.
Yes, so it's a much more powerful modality. It's not something that is trained end-to-end and static, in a sense: we learn the memory and then we hand-craft the controller, and I guess it's a tiny bit janky, because we have a fixed embedding space and we retrieve from a certain data space. So let's use the ARC challenge as an example. The winners of the ARC challenge, good friends of ours, the MindsAI team, are doing test-time, you know, transductive active fine-tuning; that's what they're doing. And how general is it, right? What they are doing is saying: okay, we know what the priors are, we can build a data generator, and then when an input example comes in, we can lean into that example, generate data, and fine-tune the language model, and then we've got some kind of verification on top. So it is kind of Turing-complete, but it doesn't seem general; it still seems ARC-specific. But it's certainly significantly more powerful than any other architecture we know about so far. So we are moving in that direction a little bit, I think.
I think if you think long-term about where the trajectory of these types of architectures is headed, I don't really see so many fundamental limitations, whereas the current implementations definitely have a lot of limitations, as you said. These representations are learned once, the controller is learned once, and then kept fixed. Ideally, what you want is systems that learn over time, open-ended systems that learn from their mistakes, improve their representations, and improve the ability of the controller to ingest new information and adapt to a certain task over time. Certainly that is what truly intelligent systems do, and it's what we don't really have at large scale right now. That granted, I would say that if you scale this up as a system that has, if you will, a finite working memory that is part of the controller, and that can add to its memory and remove from its memory, doing so efficiently, efficiently finding relevant pieces in its memory, then philosophically speaking, that is a general mode of computation that can be quite powerful.
Will it always be specialized, though? I guess it comes down to the philosophy of Chollet. François Chollet really thinks that there are primitives of knowledge, and if you have some set of primitives that you bake into the model, then you could, in principle, deal with any form of novelty. But for any system that does retrieval, presumably there still needs to be some kind of manifold, or some sketch of a future situation,
which we could then lean into. Absolutely. To the degree that any intelligent system constantly learns from its environment, and learns what the right abstractions actually are for making good predictions in that environment, any intelligent machine has to do the same, in whatever environment we put it. And if we keep the environment static, which is kind of what we're doing now, then that is a much simpler task. In whatever environment we put the machine, the machine has to figure out what the right abstractions are so that it is able to find pieces of information and then combine those pieces of information to make good decisions. I think there's nothing in this paradigm, no fundamental limitation, that says you cannot do this in an open-ended system; it certainly hasn't been done yet, though, and that's a super exciting direction.
Another thing that springs to mind: I'm excited about active inference, and maybe in the future we'll have a very distributed nexus of agents doing something very similar to what you're describing, very situated active inference. What you're describing right now is that we have a monolithic language model, which is updated every six months, and then some kind of information retrieval store, which is periodically updated. At the point of making a prediction, we basically do inference, right? And then I think we throw it away afterwards. So we could get into this federated paradigm where, rather than throwing it away, we remember it. Initially we remember it just for me, but maybe I start sharing it with you, and maybe eventually it goes back to the mothership.
Or we could remember the prediction. What we could also do is just remember its own prediction, right? That's, I think, what a truly learning system does: at test time it does additional computation, and then eventually it sees, okay, to what degree was that computation useful, fruitful? To what degree did I end up with a good prediction? And then you can change the learning mechanism, right? You can update based on that prediction; you can change what you do next.
And what would the cash value of that be? Like just updating the manifold in the original model? Or, you know, there's this kind of MIT approach, like the DreamCoder-type approach, where you do some kind of abstraction library learning. Do you think there's some way of introspecting useful abstract knowledge, or should it just go back into the original neural network?
I think there are multiple ways to go about this. Certainly, what you definitely want to do is improve your representations and abstractions over time, but fundamentally I see those as strong intuitions that allow you to fetch the right pieces of information and then combine those pieces of information to come up with a strong prediction. For that, these abstractions are necessary. Of course, what you also do, and I think eventually what we also do as humans, is continuously store the patterns we encounter in our memory, with varying degrees of fidelity, and a machine can probably, if you give it sufficient memory, store these patterns at much higher fidelity than a human could ever do. So I think both storing information and updating your representations, to account for the fact that your environment is changing underneath you and the types of information you're dealing with are changing, are both key aspects of getting a really, truly intelligent system.
Yeah, do you think the o1 model does this? By the way, maybe we should talk about this concept of inference-time scaling, which that model introduced. But do you think they do something like this? Do you think they estimate their confidence or knowledge and then do a variable amount of computation? Or do you think it's still quite basic?
It's really hard to comment on what OpenAI does or does not do with o1, because unfortunately we really don't know much about what they're doing internally. All they are saying is that they are spending some amount of compute at test time to change their model, or at least to work the model around, so as to end up with a different prediction than they would otherwise have ended up with from just one forward pass through the model. To that degree, I think it is related; it's part of this paradigm of spending computation locally, of changing the accessible resolution locally around the prediction that we're working on. But I don't think they are necessarily doing this in the intrinsic, uncertainty-minimizing way that is related to active inference and these types of things, because I haven't seen them talk about that.
They kind of should be doing that, though. When you use language models all the time, you get a feel for them, and you can feel when they go out of distribution. Because what's actually happening is, when you ask about something that sits on a well-sampled, dense part of the manifold, you get a rich answer. But that manifold is like a landscape, and sometimes you're in no man's land and it just gives you the most bland, almost nonsensical answers. Wouldn't it be cool if it knew that?
Yes. I think you're now touching on a very important point, which is: can we tell whether we have information that is relevant to making a good prediction or not? That's actually something we can extract from these uncertainty estimates. Getting a handle on our uncertainty about making a prediction can be very useful to tell whether the information that is stored in memory, given my current abstractions and the task I'm faced with, can actually be used to solve that task or not. In the Michael Jordan example, if the ages of Michael Jordan and his kids are not actually stored in memory, and also not in your weights, then there's no way to make a good prediction, no way to produce a good output. We actually discuss to some degree that you can use these uncertainty estimates to provide insight into what your model is and is not capable of doing. One of the key places where this shows up is in improving on convergence guarantees.
One of the cool things you can do is this: if you take this principled approach of selecting the most informative data, the data that maximally reduces your model's epistemic uncertainty, then you can show that, relative to the data that is available and relative to how good your model's abstractions are, you will eventually make the best possible prediction; eventually your uncertainty will shrink to be as small as possible. That is not something you can show if you select data using nearest-neighbour search. But going back to the other point you made, about whether in these dense parts of the data manifold we actually have models that are as good as they can be, I wouldn't be so confident about that. Clearly, in the very sparse parts of the data manifold where the model hasn't seen much, it is not good, and we notice that today. But I think a fundamental problem is that in this inductive paradigm, where you pre-train the model to be good everywhere, clearly in the parts that are best represented in your training distribution, if it weren't doing well there its loss would be very high, so it ends up trained to be very good there. But still, if you zoom into those parts, with whatever representational capacity you have, you cannot represent them at full fidelity. Mm-hm.
I guess one thing I was asking earlier, but maybe I didn't ask it in a very good way, is that right now we've got this modality with two extremes. We've got the most general possible thing, which is induction, and we've got the most specific possible thing, which is transduction. Could there be a middle ground? For example, could you have a kind of pyramid scheme where you do the most specific thing, but you also have an ensemble at varying levels of coarseness? Could that be better?
I think, to some degree, this is happening already. If you look at how models are trained today, they are trained by first, in the pre-training phase, just fitting this big mass of human-generated data, and you try to give the model as much data as possible. But then people realize: okay, I don't actually care about the specific data that I find on the internet in some obscure forum. What I actually care about is maybe math, or coding. So what people do is curate data that is very specific to that part of the data manifold and fine-tune the model on it. You can think of that as an amortized form of transductive learning. Actually, I like to think of transductive learning as an entire spectrum. There's a special case of transductive learning, which is inductive learning, where you care about the entire data manifold. Then there are all these cases where you care about some subpart of the data manifold: you carve it up into the region that you find interesting for the task you are trying to solve, but you still train your model once, freeze it, and deploy it, just on that subpart of the data manifold. What local learning does is push this to the extreme: whenever you use one of these language models, you are, let's say, always interested in just making one prediction, so you might as well train a specific model just for that one prediction. Then for the next prediction, you can again train a specific model.
Yeah, I mean, it's quite interesting that we're almost going back to where we started, in the sense that Google search is really good and that's pure information retrieval. Then we went one-eighty and did pure, you know, inductive inference, and now we're kind of meeting in the middle. But I think this has the potential to solve a lot of the failure modes and problems that we have when doing general coding or something like that. The problem is always situations where you're dealing with ambiguity. And yeah, you can do a lot with prompt engineering; you can say, no, I meant this, no, I meant that. But I think a lot of that ambiguity resolution could be automated by something sitting somewhere along that transductive spectrum.
I think, in practice, when we spend compute at pre-training time, usually, and this is what people do with scaling laws, they say: okay, I am able to spend this amount of compute, and the scaling laws tell you that you're able to train a model of a certain size, right? But of course, if you had more compute, what you would want is to train a bigger model on more data. And instead of just training bigger models, another way to increase your representational capacity, at least your effective representational capacity for making a certain prediction, is not to increase the model size, but to use that model size in a smart way. Instead of using that model size to try to solve all problems at once, you can use it by essentially duplicating the model and training a separate model for every prediction. Now, you don't have to train that separate model from scratch; that would certainly be very inefficient. Instead, you can train a model that is amortized, that is still trying to solve all problems at once, at least to the degree it is capable of with its representational capacity and the compute it is given, and then leverage that as an initialization to learn a specific model, where all of the representational capacity is then available to fit, or compress, the information that you need to make a certain prediction.
So let's talk about how this might pan out practically. I don't think these huge language models are going away anytime soon, right? Right now we've got Sonnet 3.5 for doing coding. And I see a future where interesting things might start happening. For example, laptops are getting ridiculously fast. I've just ordered a new M4 MacBook Pro; it's going to be amazing, and it could, in principle, fine-tune a Llama model as I go. So, you know, I'm working on my repo and it fine-tunes on the repo, and I can imagine there being some kind of hybrid thing where we get the small model doing the active fine-tuning, then we generate a completion with that, and then maybe we send it over to Claude to check, and then we do some verification. Do you see some kind of practical hybrid use case of this coming up?
I think that is certainly possible. What certainly seems interesting is that with a model that is smaller, and that compresses more of this abstract information into a smaller package, it is easier to learn at test time, simply because backprop through that model, but also forward passes through that model, are cheaper, so leveraging that intuition to search is much easier. Nevertheless, what we've seen is that at all scales, learning at test time can improve performance. So even if you work with more state-of-the-art models, doing this local learning at test time still improves performance. What I expect is that where you spend compute, and how much compute you spend, will in future depend very much on the concrete problem you're facing, because there is certainly some fixed amount of compute that you have on your machine. But I believe the providers that serve these big LLMs over cloud services will, over time, move to a setting where they allow variable amounts of inference-time compute, and you can essentially tell them how much compute you want to spend on a certain problem. There might be a very nasty research problem that you want to offload to an LLM in the future, one that seems obvious to you but involves a lot of tedious work, and I believe in the future you will just be able to tell it: okay, use whatever compute is necessary to solve this at test time, even if that is more than what your local computer can handle. So I think we will see this expansion of compute at two scales: on one hand, leveraging the compute that is available on our machines and would otherwise sit unused; on the other hand, I think we will also start paying to acquire additional compute that would otherwise be used for something else, say pre-training. We will start paying for this additional compute to solve more complex problems, just because we have seen that at all scales you can get better performance if you spend this additional compute.
Yes. And it also fixes the fundamental problem with the monolithic paradigm, which is that right now it's not open-ended, right? When we actually do have the ability for people to explore different parts of the landscape, and then take those lessons back to the collective, we've got a divergent, creative AI system as a whole. But anyway, this has been amazing. First of all, I'm really excited about your work, and it's great that you're telling the audience about it. I think uncertainty and confidence estimation is really important, so I hope people have learned about that. I think intelligence is about being able to do a commensurate amount of computation based on your uncertainty; that's really important. And I think transductive active fine-tuning is going to be absolutely huge over the next five years. So thank you very much for joining us today. It's been amazing.
Thank you so much for having me on. I'm really excited about this.