
Test-Time Adaptation: the key to reasoning with DL (Mohamed Osman)

2025/3/22

Machine Learning Street Talk (MLST)

AI Deep Dive Transcript
People
Mohamed Osman
Host
A podcast host and content creator focused on electric vehicles and the energy sector.
Topics
Mohamed Osman: I see test-time fine-tuning as a new paradigm for deep learning, one that goes beyond deep learning's traditional scope. We maximize the model's dynamism and contextual understanding by feeding the entire prompt into the forward pass at once. We train the model to be a relatively weak contextualizer and then fine-tune it to improve its reasoning. Our pre-training target is a meta-model that learns reasoning patterns rather than exact transformations; fine-tuning a meta-model is easier than fine-tuning a model that only receives a single input-output pair. Our approach is similar to Brenden Lake's MLC work: both improve generalization by training at test time. We use a pre-trained encoder-decoder model because it puts more emphasis on contextual understanding. Our pre-training recipe includes code and synthetic tasks, but the number of new concepts is not large; even so, the pre-training process still matters. At test time we feed all of the instances into the forward pass and fine-tune the reasoning. Through test-time fine-tuning we adjust the implicit transformation function inside the model so that it moves closer to the correct solution. Framing the problem as a meta-problem lets the model learn more and reduces how much fine-tuning is needed at test time. Our forward pass simply feeds the representation of the problem into the model, with no filtering. We use a voting mechanism to select the most consistent solution; it suits ARC because there is only one correct answer. We sample in several ways, including beam search and temperature sampling, and choose the final result by majority vote. Beam search suits ARC because there is only one correct answer and it is hard to recover after a wrong decision. The goal of pre-training is to encode the right priors and core knowledge into the model and to let it generalize to new puzzles; we have to balance the model's flexibility against its correctness. We encode the problems as simple numbers-as-text and avoid special representations, because the point of ARC is raw input and the flexibility of the network: to understand ARC you need to understand the importance of feeding raw representations into the Transformer. Even powerful multimodal models are a poor fit for ARC if they lack flexibility; solving ARC requires focusing on the model's contextual understanding and flexibility. Evaluating a Transformer's contextual understanding through the forward pass and then fine-tuning the reasoning is a new way to approach ARC. Information leakage from the ARC dataset is very low, and we are trying to make it algorithmic; refreshing the dataset and removing brute-forcible tasks is a good idea, and I believe ARC v1 can be solved with more compute and time. In our experiments, counting tasks have the lowest accuracy. Fixing architectural issues in the Transformer, such as the softmax function and multi-layer processing, could improve its counting and copying abilities and thereby solve many ARC problems. Doing all of the processing in a single layer leads to overfitting, whereas processing across layers is closer to the nature of an algorithm.

Host: Neural networks do not lack the capacity for abstract reasoning; these results push back on that view. ARC puzzles are fundamentally perceptual reasoning problems, and by integrating the optimizer directly into the evaluation process the model can develop new abstractions at test time. The approach comprises two main techniques: test-time active fine-tuning and augment-inference reverse voting. The reverse-voting mechanism improved performance by 260%, test-time active fine-tuning added a further 300%, and the final result was the top ARC score of about 58%. Model architecture scale matters more than pre-training for building new abstractions: larger models are more expressive and can achieve better abstraction and reasoning during inference. Contrary to Kevin Ellis and others, they favor solution-space prediction and do not create intermediate Python functions. Neural networks are not compositional by default, and it takes a lot of work to make them compositional; with sufficiently deep biases you can obtain compositionality in a particular domain, but it is not an elegant solution. Whether in Python space or via direct output, as long as the right level of abstraction is reached, the model can relate inputs to outputs and generalize. Even if the model does not explicitly generate code, fine-tuning on code can coax a neural network toward an approximate form of compositionality, and code pre-training has been shown to improve reasoning across many domains. How the problem is framed matters a great deal; test-time training gives the model its generalization ability, and the key to efficient test-time learning is how you learn. Francois Chollet is bearish on test-time compute strategies because he believes neural networks have inherent limits in compositionality. DreamCoder's output space is too restrictive and it does not address perception. Generating Python programs is hard, whereas acting directly in the game is easier. Kevin Ellis now uses language models rather than DSLs, because language models encode knowledge and priors that we cannot put into words. Going forward the team will explore more angles on ARC and study the Transformer's contextual understanding, running more experiments to improve Transformer performance on ARC. They did not open-source their approach because the open-sourcing requirements were too demanding; they have already contributed a great deal to the community, and the incentive structure was lacking: the prize would have been 25,000 US dollars, out of proportion to the effort involved. They are now at Tufa Labs, with ample funding and compute, and will focus on ARC research; after ARC they plan to study the compositionality of large language models and other System 2 goals. ARC v2 will use the same format as ARC v1 but will be harder and will contain more highly idiosyncratic puzzles. Even if ARC v1 stayed the same, there are still many angles for improving Transformer generalization, and ARC v1 will eventually be solved with scale. Transformers cannot do counting or copying even on trivial tasks.


Shownotes Transcript

Test time fine-tuning is a new paradigm to deep learning, right? It's something completely outside of the deep learning paradigm. What's the most efficient way to learn at test time? That's a very interesting question. How did you encode the problems? The whole point of ARC is they're going to trick you. Whatever specialization you put in the input, you can create a problem that's adversarial to that tokenization scheme or special representation scheme for ARC problems because it's so arbitrary.

The problem is so different, right? It's a very new problem. But the really cool thing that you get out of this is you train this model to be a very weak dumb contextualizer maybe, right? But that's what you're tuning for. Transformers, even in a trivial sense, cannot do counting or copying. They just can't do it. You count up to 100 and you just say, can you count these numbers up? And it just fails abysmally. What we do, which is really interesting, we prompt everything into the forward pass, all at once.

We're looking for people that are interested in changing the paradigm, going into test time compute, that like working in small and nimble teams, tackling really big problems. Tufa AI Labs is a very, very exciting new research lab that's just started in Zurich. They are looking for amazing ML engineers to join their team. It's a very small team. If that sounds like you, go to tufalabs.ai.

You might remember when we interviewed Mohamed last year. He's part of Minds AI, along with Jack Cole and Michael Hodel, a couple of legends. Of course, they were acquired by Tufa Labs. They got the highest score on the ARC Challenge, about 58%, and they just released their paper where they spill the beans on how they did it.

It's called "Don't Throw the Baby Out with the Bathwater: How and Why Deep Learning for ARC". Now the prevailing view has been that neural networks lack the necessary capabilities for abstract reasoning tasks and at least to some extent they

proved that wrong. Consider this particular ARC puzzle. The task requires inferring complex transformational rules from minimal examples, a challenge where vanilla LLMs like GPT-4 get no better than around 10%. Now, Mohamed thinks that ARC puzzles are fundamentally perceptual reasoning problems. They incorporated the optimizer directly into the evaluation process, which allowed

the overall approach to develop new abstractions during test time. And their methodology introduces two principal techniques. Number one, we've spoken about this a lot on the show, test time active fine tuning or what we've been referring to as test time active or transductive fine tuning, where you generate synthetic training data derived from each puzzle's examples and you fine tune the model as you go. The second approach is what they call augment inference reverse votes.

where you apply transformations to input puzzles, generate predictions, reverse the transformation and implement a voting mechanism to identify a consistent solution. They found that the latter improved performance by 260%,

with test time active fine-tuning providing an additional 300% improvement, which is how they yielded the highest score on ARC, about 58%. Now, another thing they found is that the model architecture scale has a greater impact than pre-training for building new abstractions. Larger models are simply more expressive models

enabling better abstractions and reasoning during inference. Now, we are counting down the days until the version two of the ARC Prize. We are launching it on MLST on Monday, and I hope you're excited because we've got Francois coming over and he's going to tell us all about it. Suffice to say that all of the Frontier models are going to be going down to negligible performance on it, and I'm excited to show you. See you on Monday and enjoy the show.
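To make that second technique a little more concrete, here is a minimal sketch of the augment / infer / reverse / vote idea as described above. The `predict` callable, the choice of symmetries, and the voting key are illustrative assumptions for this sketch, not the team's published implementation.

```python
# Hypothetical sketch of "augment inference reverse votes": transform the task,
# predict, undo the transform, then majority-vote over the restored predictions.
import numpy as np
from collections import Counter

# The eight rotation/flip symmetries of a grid, each paired with its inverse.
DIHEDRAL = [
    (lambda g: g,                         lambda g: g),
    (lambda g: np.rot90(g, 1),            lambda g: np.rot90(g, -1)),
    (lambda g: np.rot90(g, 2),            lambda g: np.rot90(g, -2)),
    (lambda g: np.rot90(g, 3),            lambda g: np.rot90(g, -3)),
    (lambda g: np.fliplr(g),              lambda g: np.fliplr(g)),
    (lambda g: np.flipud(g),              lambda g: np.flipud(g)),
    (lambda g: np.rot90(np.fliplr(g), 1), lambda g: np.fliplr(np.rot90(g, -1))),
    (lambda g: np.rot90(np.fliplr(g), 3), lambda g: np.fliplr(np.rot90(g, -3))),
]

def augment_infer_reverse_vote(predict, demo_pairs, test_input):
    """`predict(demo_pairs, test_input)` is a placeholder for one forward pass
    of the trained model; everything here is an illustrative assumption."""
    votes = Counter()
    grids = {}
    for fwd, inv in DIHEDRAL:
        aug_pairs = [(fwd(x), fwd(y)) for x, y in demo_pairs]
        prediction = predict(aug_pairs, fwd(test_input))
        restored = np.asarray(inv(prediction))           # map back to the original frame
        key = tuple(map(tuple, restored.tolist()))        # hashable key for voting
        votes[key] += 1
        grids[key] = restored
    best_key, _ = votes.most_common(1)[0]
    return grids[best_key]
```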

Mo, you've been on the show before, but basically you won the ARC Challenge with Jack and Michael. Technically, you didn't win because you didn't choose to share your solution, but certainly in terms of the leaderboard, you guys did the best solution. Welcome back to MLST. Thank you so much for having me. It's always a pleasure. It's really nice seeing you here in beautiful Vancouver.

I feel like you're kind of a guest to me in Canada here. It's really nice seeing you. Amazing, amazing. So congratulations on the incredible result. Tell me more. Thank you so much. We've been obsessed with the ARC Challenge for a very long time. Me and Jack have been working on it for two years as part of the same team. Michael has been working on it

And he's now joined us, also for two years, by the way. So, like, we've always thought that this benchmark was going to get more and more important. And this is the case now: in Europe, lots more people know about the ARC Challenge. And yeah, there's been lots of great popularization of it happening. Really happy to see that.

Yeah, we continued developing our methods based on kind of similar philosophies that we had. And I'm really excited to kind of dive deeper into those and also, yeah, give you a little bit of some of our results that we're planning to share very soon in a paper. Very cool. So by the time this goes out,

You probably would have released that paper, so you can probably tell us some of the headlines. The paper is not going to be super interesting technically, because we've shared our ideas before. We've shared them on MLST and we always like sharing. And if you look at the top 10 on the current leaderboard on ARC, I don't know how many, but maybe close to 80% were using

similar ideas, very similar ideas in certain cases. So we were always like happy to share and yeah, we've been

pretty open about that, and we're happy to see everyone kind of converging on test time fine-tuning and the voting mechanism that we found to be super useful. Very cool. Well, why don't we break it down: what are the key innovations? Maybe you can't talk about all of them, but what are some of the key innovations that led to your win? Yeah, I'll preface this by saying lots of

the current methods are using very similar things, right? And again, I'm very happy to see that everyone has kind of leveled up, and, you know, I'd be remiss to say it's not because of the amount of sharing

we were doing. So, very happy to see that, and I'm happy to see what new innovations are going to come up now that we've all leveled up here. Let's break it down. So, the innovations: I think we should start with test time fine-tuning, because it's really the big one, and I like framing it in this way. Okay, so just to get the general idea, there is a sense in which you could say that test time

fine-tuning is a new paradigm to deep learning, right? It's something that's completely outside of the deep learning paradigm, right? You know, you're changing parameters at test time. That's not really, that's a fair thing to say, right? But there is a way to look at it in which it exactly fits the deep learning paradigm. And that's how, that was how we arrived at it. So the idea is we really see Arc as this

perceptual problem, right? You have an interpretation problem. It's kind of subjective in a way, because you have these biases, right? You're looking at the problem and you have an almost infinite set of possible transformations from inputs to outputs, and

it's really hard to narrow down what you should look at. It's really hard. Imagine a riddle where there is a box and inside there is another object, and there are different types of objects, and it's really hard to find the right level of representation

to start to even search around the solutions, right? So you kind of have to take it in all at once and hope that something pops up. It's very similar to looking at an image. There is an almost infinite set of different colors and different

lightings, and they're all the same thing, but you have to contend with this infinity before you can kind of proceed to abstract over the image itself, to say, okay, now there are four apples, you start counting. Or first you have to identify the apple in infinite lighting, infinite coloring, right? And that perception part is really important in images, but we think it's really important in ARC also. So you could easily imagine, once you have the right level of abstraction,

Search becomes much easier, right? Searching for the right function is very, very easy. If I know this is the relevant object, I need to count or I need to do something around this specific object or this object to object mapping. So very, very important to get the right level of perception. So, okay, so what's the best way that we know of to tackle a novel perceptual problem?

It's the deep learning paradigm, right? If you want to learn a new skill, a perceptual skill, you want to classify mugs: start with an untrained network and train it on a bunch of images of these mugs, right? And so we take that idea but apply it to ARC. So the claim with ARC is that the test-time examples are completely novel, so that's

the logical thing to do. And you have to really take seriously this idea that one of the most important, or at least one of the very difficult, parts of ARC, if not the most difficult, is this perception problem. So then you just

apply that paradigm, but at test time, right? And, you know, it's known to acquire skills really well, so you apply the whole paradigm to acquire skills at test time. Right, so that's certainly true, that there are things that neural networks can do that we cannot write programs to do

Just because they are... Chollet kind of distinguishes between perceptual problems and type 2 problems or whatever. So many of those problems are perceptual problems. One really interesting thing, though, is that you guys, counter to people like Kevin Ellis, are fans of solution space prediction. So you don't create intermediate Python functions. Is that the case? Yeah, that's true. So that's a pretty huge...

kind of strategy right there. So I have this intuition, and Chollet does too, that there's something special about Python programs, right? And the special thing is mostly that they have this kind of compositionality,

which means they can be composed together, they can be decomposed into small parts, you can construct a library and you can take bits together and so on. And we intuitively feel that neural networks, for whatever reason, they don't have this compositionality. Maybe we can make neural networks that do in the future, but right now they don't. So people who are really bullish about neural networks, and certainly I think you guys are and Tufa is as well,

think that no, actually they can do this kind of compositionality. They just need to be coaxed in the right way. Yeah, that's a very poignant point that it's absolutely true that neural networks by default are not compositional and you need to do a lot of work to get them to be compositional. And that's...

you know, not easy to do, and we did do that work, and it took us a lot of time to really understand that point, right? That by default they're going to just learn statistics, and it's not elegant because they're not composable. But I think, and I'll speak to this later, that for a certain domain you can get the biases

deep enough, and the biases are really important, such that you're able to tune the reasoning part really easily, and in that sense you do get compositionality, but it's not an elegant solution, in that

you know, without the test time aspect, without putting the biases deep enough, you don't get that. So now what do you do and how do you do that more efficiently? These are all things that we want to explore in the future. And we're going to do that at Tufa Labs. With the Python programs, there's something, again, that's really important. You can

perceive the problem, if you perceive it right, you can take action in Python or in the neural space. I think that's fine. As long as you see the right level of abstraction, you have that dynamism inside the model to be able to

do some correlations from input to output, and then match those correlations and force them in the other input to output, and then make sure it generalizes. And then after that you can either output in Python or in a direct output manner. Right, well, let me press on this a little bit, because I think this is potentially the most exciting part. So we have an intuition that Python programs are compositional

And I interviewed Laura Ruis the other day. She's working at Cohere and she did this paper showing that

Basically, you train influence functions on neural networks, and then you get them to do fact retrieval or you get them to do reasoning tasks, and you can see how many of the source documents light up essentially based on the task. So when doing reasoning tasks, it had a very diffused activation in terms of the source documents. And what she noticed is that it was looking at code

So when calculating the slopes of lines and so on, it was looking at things on stack overflow and it was looking at procedures of like how to perform some reasoning. And it was applying that in a very diffused way to a specific problem. So the fascinating thing is that even though it's not explicitly generating code, you can fine tune it on code.

And code has some form of compositionality. And you can coax the neural network to do an approximate form of compositionality, even though you're doing solution space prediction. That is fascinating. Yeah. One of the things that we do for pre-training. So code...

I think there's something, there's another thing, really another way of saying what you're saying about code. And by the way, I know about that work. It's really amazing work. So with code, it's really hard to predict the next token. You have to contextualize

a lot more. Like, you have to know: okay, what are we doing? What is the exact name of the variable? What is the exact relevant variable here? With language, that's not the case. You can shortcut, you can cheat easily, right? You can use a close-enough word and that'll probably be fine. With code you have to be very precise, and so you have to be very contextual, right? And I think that's like

another way of kind of looking at the influence of code. And code pre-training has been shown to improve reasoning across many domains, in a few new papers now. So this thing has been reinforced, but yeah, training on code is really interesting. I also want to kind of go to this idea of tuning the reasoning. So, you know, you mentioned that the neural networks, while they're solving this reasoning task,

they're looking at code, right? I think framing the problem also matters in a very important way here. So, yeah, first of all, we established that test time training gives you the generalization ability. Okay, so what do you have to do,

what's the most efficient way to learn at test time? That's a very interesting question, right? And so what we do is we prompt everything in context. We're just speaking about contextualization ability, and that's a really important thing for us to maximize for Arc, right? Because if you're prompting things into Arc, like in the Arc format, you need the model to be as dynamic as possible, right? And so you have your inputs and outputs and

and inputs and outputs. And now I have this test input and the problem is so different, right? It's a very new problem. So contextualization and the ability of the model to kind of be steered is very, very important. And what we do, which is really interesting, we prompt everything into the forward pass all at once. So inputs and outputs, input, input, input, and then test input, uh, like new input, uh,

is all in the forward pass all at once. So it's one way to look at it, it's a measure of the novel contextualization ability of the transformer, right? But the really cool thing that you get out of this is you train this model to be a very weak, you know, dumb contextualizer maybe, right? The modeling ability in the forward pass is not that good, but that's what you're tuning for. It's kind of like a meta...

model in a sense, right? That's what you're pre-training for. Okay, and so the model is going to learn reasoning patterns; it's not going to learn exact transformations, like Clem's work, right? So in Clem's work, the function learns

just the transformation, right? But with our model, by putting everything in the forward pass in pre-training, and training it over many different ARC riddles, you're kind of saying: okay, the thing to learn here is this

meta task of looking at the context and then generalizing from it or doing your best at modeling from it. And so you have this weak meta model and now that's much easier to tune than a Clem type model where it's, you just have, so for Clem's model, the encoder only receives one pair of input output. You get stronger

induction or regularization, I don't know what we want to call it, with that model. But you don't get that meta ability, right? There are many possible sets of transformations given only one instance, right? You need two or three instances to kind of narrow that infinity down. So you're training this

meta model, and now tuning that meta model is a much easier task, right? That's really interesting. Again, the thing is, you have a predictive model that's going to be kind of wrong, and you can just tune tiny pieces of it to get the reasoning to click, right? So that's really key, and it's work that

is not super present in the literature. I didn't survey it super well, but there was one interesting paper, the MLC paper by Brenden Lake in Nature,

which was looking at reasoning. They don't do test time training or anything like that, but implicitly there's a way to look at it where it's exactly this, actually. It's exactly doing what we're doing. So for that paper, what they do is they have this input to output kind of task that they want to learn. What they do is they retrieve similar things. They put them in context.

and then they put the new thing in, and that's their forward pass. Okay. And then they train over that, and that helps them do much better. So they're kind of training over that. So that's kind of test time tuning, if you take that whole paradigm and put it at test time. So we train over that, but at test time, and it's the framing of the model as a meta model. And now the tunings that you have to do are much smaller. So I think that's

maybe one of the biggest differences between our work and Clem's. Can you just summarize that again, just so the audience understands it? So you said that you're doing more kind of training of the model before you do anything at test time. Can you explain that? Yeah, so we do pre-train the models on ARC. So you start with a language model, you know, like, I don't know what you use, maybe Llama or something like that. A T5 variant, LongT5. So you start with a T5 language model. Yeah. And...

What kind of a language model? It's not a normal auto-regressive language model. It's an encoder decoder. So it's an encoder decoder, but it's pre-trained. So we don't train it from scratch. Pre-trained on language, we think, has that contextualization aspect a little more. And we also train it on code further. So then it also gets...

stronger emphasis on this contextualization ability, right? You want to train the best kind of forward pass steerable model, basically dynamic model that can change a lot based on its inputs. And that is possible with transformers, but only to a certain extent. So that's what we start with. Even that is very interesting, though, because no one else has done that. So you're starting with an antique model from about 2020.

which is an encoder-decoder, which is something that doesn't really exist anymore. I mean, those were originally used for machine translation tasks about a million years ago. And it's presumably trained on a tiny, tiny corpus compared to modern things. I mean, it might be trained on, like,

I don't know, like less than a billion tokens probably. Yeah, yeah. So you start with that. But the benefit of it is I'm guessing it's a tiny model. It's like what in the millions of parameters, like 340 million parameters or something. Exactly, yeah. So it's a tiny model. Yeah.
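For orientation, loading a small pre-trained encoder-decoder of roughly that size from the Hugging Face hub looks something like the sketch below. The checkpoint name is an illustrative LongT5 variant; it is an assumption, not confirmed to be the exact model the team used.

```python
# Sketch: loading a small pre-trained LongT5 encoder-decoder.
# The checkpoint is illustrative, not confirmed to be the one used by the team.
from transformers import AutoTokenizer, AutoModelForSeq2SeqLM

checkpoint = "google/long-t5-tglobal-base"   # a few hundred million parameters
tokenizer = AutoTokenizer.from_pretrained(checkpoint)
model = AutoModelForSeq2SeqLM.from_pretrained(checkpoint)

n_params = sum(p.numel() for p in model.parameters())
print(f"{n_params / 1e6:.0f}M parameters")
```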

Start with that, then what? So then we have this pre-training recipe that involves code, it involves auto-generated ARC tasks in various ways, and Jack did lots of really good work there, Jack Cole. So there's this magic recipe, but the

bottom line is it has lots of code. It has lots of kind of synthetic tasks. By the way, the total number of new tasks is not big, right? It's not like we are able to sit down and generate so many synthetic riddle generators. Like, the total number of new concepts in pre-training is low. But there is an important thing that we think is happening

during pre-training, even with the small number of concepts trained. But okay, so we have a recipe that has some code and some synthetic tasks, and then we get our pre-trained ARC model, right? Okay, now we go into test time, where we feed everything into the forward pass like we were doing in pre-training, and then

So all of the instances are in the forward pass and then we get the test input, the new input that we need to predict the output for that's also in the forward pass and we need to predict the output. And what I was saying about tuning the reasoning is the framing of the problem is really important, right? Everything in the input, you need to kind of learn

okay, how do I compare the inputs and then get at an almost-okay function? Okay in some ways, but probably not okay in very important ways, but already very, very close to your solution. So now you can test-time tune and kind of search over the reasoning, tune the reasoning, rather than

a different way of looking at it, right? And then, okay, so first of all, that compares to like the MLC work. It compares to lots of ideas where you're kind of, when you frame the problem in a meta way, you give the model more to learn. It kind of is always the best thing to do in the forward pass. And then you have to kind of scaffold and tune and do everything else. But you're kind of,

you know, halfway there, you know, with that step. And so the tuning is much less work at that point. Okay, at test time, so you call it the forward pass. So you put the test instances in there, you represent it in a really clever way that kind of helps the language model work.

you know, the seq-to-seq model, do its thing. And then you do some Greenblatting, for want of a better term. You do lots of sampling until you get ones that fit all of the specifications. Do you do any augmentation in the forward pass? Yeah. So the forward pass is just prompting the model

with the representation of the board, and the representation isn't super special, it's just a plain representation. So just a plain representation into the model. We don't do any filtering, if that's what you mean by Greenblatting. It's very transductive: you just produce the output directly, so we can't do any filtering, actually. Right, but do you do loads and loads of

sampling until you get ones that you think are good? Yeah, so we have a method of voting that we also introduce in the paper, and voting, I think, is especially suited to ARC. It follows this idea that there are many ways you can be wrong about an ARC riddle, but there's only one correct solution. So just try all of these ways

and then hope that the majority vote is going to go to the only right way to do it, right? So yeah, we do that. And tell me more about the voting. I think that summarizes it pretty well. We do

augmentation. So there are many ways to sample things out of a model. One way to sample something out of a model is beam search, right? Or you could do temperature-based sampling, which

I don't think is a good idea for ARC, because, again, if you just get bad RNG, you know, you're kind of done for. But beam search is really interesting, because ARC is very special: there's only one right way to do it, right? So you can imagine all of these beams, and the model is kind of uncertain about a certain next token, right? And that's totally fair. So we're going to take a beam. We're going to try all of them.

And then iteratively, if you've made the wrong decision, you're going to be kind of lost, right? Like in the pixel space, if you're outputting, if you've made the wrong decision, you're going to be very lost because you don't know, like...

The mistake is already done. Obviously these models can't backtrack. So how do you really continue after a mistake? It's unclear, things are ambiguous, so the probabilities are going to disperse and then these beams are going to fall. But if you've done the right thing, you're going to be more and more certain of the next token. So that's one aspect,

one way that we get the samples. But we have a variety of ways that we get samples for voting. We go into it a little bit more in the paper that will hopefully be out.
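As a rough illustration of the sampling side, drawing several candidates with beam search via the transformers generate API might look like the sketch below. Here `model` and `tokenizer` are assumed to be the encoder-decoder loaded earlier, `prompt` stands in for the serialized task, and the beam settings are illustrative values, not the team's configuration.

```python
# Sketch: drawing several candidate outputs with beam search, so each beam can cast a vote.
prompt = "..."  # placeholder: the serialized task text fed into the forward pass

inputs = tokenizer(prompt, return_tensors="pt")
beams = model.generate(
    **inputs,
    num_beams=8,               # keep several hypotheses alive rather than sampling with temperature
    num_return_sequences=8,    # return every surviving beam
    max_new_tokens=1024,
    early_stopping=True,
)
candidates = [tokenizer.decode(b, skip_special_tokens=True) for b in beams]
```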

And, yeah, so that's the voting; that was the sampling. How do you do the voting? So it's just majority vote. Oh, okay. And you said that there's only one way of getting it right, but aren't there uncountably many ways of getting it right? Yeah, well, there's only one way of getting it right with the human priors and with the core knowledge, hopefully; otherwise it's an ill-designed riddle.

Right. Like for an arc riddle, there's only one correct answer officially. But of course, you can have different systems that have different biases that they think this is the right way and so on. But hopefully it's very clear for the human biases and core knowledge priors that there's only one way of getting right. We should explore that. So the ridiculous example is I could...

I could have a program which says, if you see this specification, then give this answer. If you see this specification, then you give this answer. And if you see the third one, then give this answer. So it's just like explicitly encoding the answers without any generalization or whatever. And you're making the argument that because we have a certain structure of priors that can be composed in a certain way, you think that it's very unlikely that it will...

find a solution other than the ones which humans would agree on in that compositional space. Well, that's what we hope to do with the training, right? So with the pre-training, that's what we hope to encode. We hope to encode two things: the priors, really, really well. And by the way, I really want to get into prior encoding also, and what that means, because there's a lot

that we get about that, and there's lots of people saying, okay, you're memorizing certain reasoning patterns and so on. So let's pin that for later. But that's what we're hoping to do with the pre-training: we're hoping to get the right priors and the right core knowledge into the model in such a way that it only outputs the

relevant right solution. But the second thing we want to do is we want to make the model steerable, right? For a new riddle, to not rely on the prior too much and to be able to actually produce the right one, the one that generalizes. So it's kind of this very difficult balance. You want the model to be contextual, but

also have the priors very deeply in there, to be able to use them to search. Yeah, so there's a trade-off between flexibility and correctness. The reason I ask this is, in Wen-Ding and Kevin's paper, they said that, because they did an ensemble with induction and transduction, but they favored the induction. And I say transduction, they're both transduction, but

they said they had a nine percent false positive rate on the explicit function generation, which meant that it was creating functions which gave the correct answer but was actually wrong. Yeah. So it worked on the test specification but it was actually wrong. Yeah, yeah. Have you seen that? So we don't output programs, so we haven't seen that, but you can have transductive

Python programs, in my definition of transduction. Like, you can have intra-riddle transductive Python programs, meaning just like you said earlier: if this is the question, give me this answer. If this is the input board, give me this answer. You can have that with Python programs. You can have over-fit Python programs, right?

which just means that they didn't get the right prior from this one and this one and this one to be able to generalize to new ones. Exactly, yeah. Exactly, yeah. So how much of an issue was that for you guys? Well, we kind of just sidestep this whole thing, and we just kind of

find different ways of giving feedback to the model, i.e. test time fine-tuning, to kind of hope that the built-in model has the best chance to generalize to the new test riddle. And so basically the answer to that is everything we do at test time, right? We test time tune, which means, you know, you start with

the initial kind of guess from the model based on the ICL, the in context learn or whatever you want to call it, the contextualization ability of the model. And then you tune that. And what's happening is there is an implicit model

inside, an implicit kind of transformation function, you could say, inside the model, that is getting tuned, right? That implicit model is giving a guess, and then the guess is incorrect in maybe major, maybe minor ways, and it's getting the feedback through gradient descent and it's updating that implicit model through the weights, and

then it's giving another guess, right? And another guess. So that's test time tuning. And then the hope is that ARC is a perceptual problem, and that neural networks can learn these generalized perceptual problems if you put them in domain and if you tune the reasoning that way. So that's, I guess, how we deal with it. How did you encode the problems? It's very plain, just numbers as text. There's nothing special there.
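Putting those pieces together, a minimal sketch of the plain numbers-as-text encoding and a test-time fine-tuning loop over one task's demonstration pairs might look like the following. The prompt format, the leave-one-out construction, the optimizer, and the step count are illustrative assumptions, not the published recipe.

```python
# Sketch: plain numbers-as-text serialization plus a tiny test-time fine-tuning loop.
# Hyperparameters and the leave-one-out scheme are illustrative assumptions.
import torch

def grid_to_text(grid):
    """Serialize an ARC grid as rows of digits, one row per line, with no special tokens."""
    return "\n".join("".join(str(cell) for cell in row) for row in grid)

def task_to_prompt(demo_pairs, query_input):
    """Put every demonstration pair plus the query input into one flat prompt."""
    parts = [f"input:\n{grid_to_text(x)}\noutput:\n{grid_to_text(y)}" for x, y in demo_pairs]
    parts.append(f"input:\n{grid_to_text(query_input)}\noutput:")
    return "\n\n".join(parts)

def test_time_finetune(model, tokenizer, demo_pairs, steps=20, lr=1e-4):
    """Hold one demonstration out, predict it from the rest, and take gradient steps on the error."""
    optimizer = torch.optim.AdamW(model.parameters(), lr=lr)
    model.train()
    for step in range(steps):
        i = step % len(demo_pairs)
        held_x, held_y = demo_pairs[i]
        context = demo_pairs[:i] + demo_pairs[i + 1:]
        batch = tokenizer(task_to_prompt(context, held_x), return_tensors="pt", truncation=True)
        labels = tokenizer(grid_to_text(held_y), return_tensors="pt").input_ids
        loss = model(**batch, labels=labels).loss     # seq2seq loss on the held-out output grid
        loss.backward()
        optimizer.step()
        optimizer.zero_grad()
    model.eval()
```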

Okay. Absolutely nothing special. This is a very interesting question. You shouldn't, in my opinion, you shouldn't generate any

special representation for ARC. The whole point of ARC is they're going to trick you. Whatever specialization you put in the input, you can create a problem that's adversarial to that tokenization scheme or special representation scheme for ARC problems, because it's so arbitrary, right? The main point of ARC is raw. You get the problem raw, and then

be flexible about combining it, because it's going to be novel and it's going to be raw. So, and this is another thing, VLMs are really bad for ARC.

Right. Maybe this is giving a lot of alpha to people. But I think for some people, for a big majority of people, the first thing they try is: oh yeah, this is a visual problem, let's use a VLM. Right. Well, that's a really bad idea. And that gets to some of the stuff I was seeing at NeurIPS, which is

People are coming up with lots of data sets where VLMs do really bad, like the six finger problem, right? If you have a hand with six fingers, you give it to a VLM, ask it how many fingers, it'll tell you five, right? Why? Because the VLM is a fixed representation machine. This really gets to the core of ARC. Like if you want to understand ARC, you need to understand that you need raw representations into the transformer rather than

you know, pre-encoding some stuff. So the frozen encoder in most VLMs will take some kind of pre-existing perspective on the ARC problem, but you haven't seen the rest of the input-output pairs yet; you don't know what the right framing is, you're guessing, and there are many ways that you could be wrong. It's the same thing with the six fingers, right?

The VLM only knows certain things, and it's going to give you, in the output tokens, a representation of only that framing on certain things, like whether it's a hand. It's not flexible enough to be able to recombine. It also doesn't know what the question is with the VLM:

You already encoded the image before asking the question, oh, I need to count the fingers in the image so that the transformer model and the VLM encoder can condition on that. A couple of things on that. I mean, there's the notion of, I mean, you know, Greenblatt did GPT-4O, the vision model, but...

I'm not entirely sure how that works, but the best way to do a visual model is to have it multimodal. So it's a language model and it's a vision model all in one and it can do some kind of crossover between them. So there's the notion of whether having a vision model actually helps with ARC or whether it's better to think of it as a reasoning problem and whether to kind of like skip that.

the interpretation part and give it a structured representation. Maybe we should start with that. I mean, do you think, in principle, if we had a strong multimodal model which could transfer between the reasoning domain and the visual perception domain, it would work? Or, the other point of your statement was that it's not flexible enough; but could we not do test time training and all of this transduction stuff with a VLM as well? Yeah, you could. I just don't think it's elegant. I think it

could get you to like 60% or whatever, but it's not going to get you 100% of the way there unless you change the architecture. So I'm making a thing about a specific architecture, just to be clear; multi-modality might help. The thing that I'm advocating for is: you need a flat representation and an intermixing. So you could have vision patches

intermixed with language tokens, and they all condition on each other. That seems reasonable. But if you have a pre-encoding that is frozen and you don't test time tune, that's just a bad idea. If you do test time tune, maybe that works, but again, it's a stretch, it's not reliable. It's better to just really focus on

contextualization and on being flexible because that's the point of arc. And it's a very good benchmark because of that, because it found a way to test that. And I think one thing that's really good about our approach is there's a very different way of seeing it. Like, okay, if you want to measure the contextualization ability, the true contextualization ability of these models, of these architectures,

then one thing you could do is just prompt it in the forward pass and see if it's able to solve it. So that's a really cool way of attempting to solve Arc. Obviously, that's what we're doing, so I think that. But that allows you to ask certain research questions that I'm really excited to answer at Tufa Labs.

So yeah, it's not just about solving ARC. You could start framing things about measuring what transformers are doing, or the contextualization ability. I hope more people start doing that as well. So one of the really cool things about test time compute: I don't know if you saw, Hugging Face have just released this kind of O1-type thing,

and they're showing that they can make a 1 billion parameter Llama solve the same types of math problems as, you know, a vanilla 8 billion Llama model. So what it seems to do is it gives you this ability to make a small model behave as if it's a big model. And that seems to be a superpower. I am not familiar with that release that Hugging Face did, so...

I can't comment on it. I can give you some more details about how our models scale on the hidden test set. So I think that's really interesting, something people really want to know. So, yeah, so what do the scaling laws look like for the hidden test set that is just completely uncontaminated, right? By the way, the public... Not completely. The hidden test set?

Well, yeah, I think there is a potential objection that because we have now hit the hidden test set a lot of times that there's some unintentional information leakage. Could you comment on that? I think the information leakage, like from an information theory perspective, is very, very low. But I'll tell you what we are trying to do with the information leakage. We're trying to

make it algorithmic, right? Like, where there's information leakage about the type of things that work on Arc. And that was kind of the hope. But I don't see it as significant at all, honestly. So the Arc...

data set was actually available for 100 submissions a day for a couple of years, and people didn't seem to gain too much from that. We didn't gain too much from that; we weren't honestly using it that much. I'd say it's very minimal. Yeah, because Chollet thinks that

The reason he needs to make the next version of the ARC Challenge is that if it's not overfit now, it kind of is, but just in a latent way. So he says that when you look at the ensemble of all of the various different approaches to ARC, even back in 2020, you know, the ensemble was getting about 49%.

So he said that, you know, if you did a targeted, you know, imagine we're like security researchers. That's what they do. They like, they look for sources of entropy and they mix things together in a targeted way. And so he said, even though it hadn't really been attacked in this way yet, it's only a matter of time until it was. Yeah. Yeah. I mean, yeah.

I think that's fine. I think it's good to refresh the data set and also maybe to remove the brute forcible one. So, yeah, I mean, when I spoke to Francois a little bit and he mentioned that even in his talk,

He also mentioned the aspect of, okay, I'm going to find the brute-forcible ones, right? Because that was the main approach in 2020. And I'm going to remove those. I think that could be a good thing to do. I do think the domain is so huge and the possible variation in the ARC riddles is so big that it's totally fine to kind of use it. But maybe, yeah, again, from an information theory perspective, the bits are not that much, but yeah.

I mean, yeah, who knows. I'm totally fine with a new ARC data set that is harder. I think we're up for the challenge. The refresh is good. The guys, I spoke to them a little bit, and they're doing really good calibration with human testing to make sure that

the data sets are well calibrated. So yeah, it's going to be fun. Chollet is quite bearish about some of the test time compute strategies that are being used. And maybe could you just reflect on why that is and what you think he would rather people did? Yeah, he spoke about that a little in his talk. So, is he bearish on test time compute strategies

or the specific strategies that people are using for test time compute? I think both. And I asked him like, what would you do? Because he's doing his own startup now. And he has advocated almost for something that resembles the original Dreamcoder. So he thinks that we should have... So first of all, he's a big program space guy.

So he thinks that there are inherent limits in compositionality and whatnot using neural networks. So we should do the program space thing, whether it's a DSL or an actual programming language; I can't remember exactly what he said, let's say programming language. And he says that we should have a neurally guided search, which is what the original DreamCoder thing did. The reason I don't like

DreamCoder-type approaches is not because of DreamCoder specifically. And I spoke to even Kevin Ellis about this, and it's interesting, their current approach, right, when you consider that he was the first author there. But I think there are two kinds of things that you have to focus on, especially when talking about DreamCoder. The output space, that's what Kevin Ellis mentioned. It's too restrictive.

Right, this lambda calculus, it's very inflexible, right? And that's something also that I think, in my opinion: generating Python programs

is hard. If I ask you to write a Python program for certain ARC riddles, it could take you up to 30 minutes, even if you're a good programmer, maybe 10 minutes to 30 minutes. But if I actually ask you to color them in, you can do that instantly. And that gets to Piaget's

theory of incremental development. So you can be in a game without being able to perfectly describe the rules of the game, but you can act in the game. So you can color the things, you can represent the thing and act in it. But the level of being able to fully specify the rules of the game and write them down, you don't have to have that; that's a higher level according to Piaget. So

that exactly hits it home for me. So that's the first problem with DreamCoder. The second problem is it just doesn't focus on the perception. Right.

It could, and if it did, maybe it would do well, but it doesn't. And I think the main thing should be, okay, how are you tackling being flexible in the perception space? And then you can go on to do whatever you like to learn programs in the sleep phase and compose them and all of that stuff is fine. But the first question is that, so...

So that's exactly right. And what's also interesting is that Kevin has abandoned using DSLs. So he has, with open arms, embraced language models. And why is that? Well, it's because, as you were just saying, Python code is Turing complete. You can represent any concept. It's incredibly flexible. The reason why he didn't generate Python code in the first place was because there was this intractable search problem. It wouldn't have been possible. Now with language models, we can.

And it's because language models encode our knowledge, our notions of what's interesting, even just basic priors that we couldn't even put into words, like the complexity of a program and the intuition and creativity and all of this kind of stuff. So now we can generate these programs and we can mix them and remix them and do all of this kind of stuff. Everyone in NeurIPS is doing the same thing. I guess the question to you is,

This was a restricted compute benchmark. So of course you made certain trade-offs that you wouldn't make otherwise. Like now at Tufa Labs, you've got all of the compute in the world, all the time in the world. What would you do differently? That's a great question. Okay, I think the most interesting thing to me right now is the thing I mentioned about prompting everything in the forward pass at once and measuring the transformer ability, innate ability, and then tuning the reasoning, these types of lines. So there are so many angles to tackle arc. So it's not about...

it's the compute to do many experiments and to tackle the different angles, and to get people on that will help, that are really interested in tackling these things and interested in thinking along these same lines. Right, like, transformers are really bad at ARC, that's the statement, or language models are really bad at ARC. And they are, but let's measure it, and let's

try crazy ideas to get them to improve. That's what I'm after. Tell me about this paper that you're writing with Jack and Michael. Me and Jack are working on a paper to outline

kind of the test time tuning. There have been papers now that outline the technical content of what we have, but we wanted to put the paper out there for people to cite the original source. And people right now, by the way, are citing our MLST podcast. No way. Yeah, like lots of papers. Really?

Yeah, the test time paper from MIT by Ekin cites the podcast. Lots of people cite that MLST video podcast, which is not great, right? Good for me. It's probably because you didn't have it in writing anywhere. Yeah.

Folks, you can continue to cite the MLST podcast as much as you want. I mean, that's actually amazing. You should have like a kind of a counter. I mean, I hope people do that more, but I also want to put a paper out there so that people can cite the written work and we can track it better. But yeah, I mean, I think...

I mean, this is a great format. We shared a lot about our method and people got inspired by it and implemented it and put a lot of papers out. So we want to put our paper there. But technically, it's very similar. Like with the stuff we're doing now, it's not far away from the stuff everyone else is doing. Yeah.

And some people are complaining about us not open sourcing. But again, the stuff is already out there. And, I'd be more worried about, some prof told me this, I'm more worried about OpenAI not open sourcing rather than Minds AI.

So I think that's the key thing. And the other key thing is we're working on more papers, not just that one, that explore different angles. And we're going to be putting lots of stuff out there in the very near future. And yeah, I'm super excited about that.

Quick point on that. Why did you not open source the approach and already put more out there? We were in a very tough position from many different sources.

So, the requirement for open sourcing was a little too extensive. They required weights, but also training code or training examples and the entire test time code. So there were a lot of things. We felt like...

you know, with Michael's DSL, that so many people are now using and have cited for everything ARC-related, and the test time tuning and voting that everyone is using to get the top scores, we've contributed a lot to the community.

And yeah, there was kind of a little too much, especially for us targeting that 85%, right? So it's just kind of the incentive structure really didn't make sense. If you look at the prize money and the potential gain. So it was kind of, I think lots of people were making the comment, the incentive structure, if you don't get 85% from the first competition, just wasn't great. But yeah.

you know, we're super excited to do that moving forward. And also the competition team have told us that they're working on that. Absolutely. And they're just taking that feedback. Just to be clear to the audience, what was the incentive, given the score that you got? So was it around 56, 57%? Yeah. What was the incentive? I mean, if you did open it up, what would you have got?

Yeah, that's a very good question. So, by the way, just for reference, we started out at 33% at the beginning of the competition. So we did all of the work to get it to 55.5. We even got 58 score on the hidden set, but it wasn't on the leaderboard because the time had passed. But so we moved up a lot. And what we would have gotten was $25,000 US dollars.

Right. Like after taxes, you know what I mean? Like not, not a, not a huge incentive, but again, like I'm,

just want to say it was a great competition, and the guys did a lot to put the word of ARC out there into the world, and they've taken that feedback, and we have also leveled up the community and shared a lot. So I feel like we're both happy, and we don't need to dwell on it too much, and we can

just look forward to next year. You guys are now working with Tufa AI Labs in Zurich. As I understand, Tufa have acquired Minds AI. Yeah. Is that... yeah, tell me the story. Yeah, we are now, so the whole team is now working out of Tufa Labs. We have a lot of funding. We have a lot of compute coming in.

So, okay, so we're going to be purely focused on ARC for the first around year and different angles to ARC. Again, we're working on not just this paper, but another one and a couple of other angles also that we hope turn into paper. So we are going to be putting stuff out there and we're hiring and we have compute coming. So do apply if any of that sounds interesting to you. We are going to...

We have also plans for things to do after Arc. Again, we spoke a little bit about this with the compositionality of these large language models. It's not there. How can we get them there in a more general format? Lots of system two goals and test time, obviously, is a very ripe area more generally for

for research, so we're going to be exploring all of that beyond ARC. Yeah, all of that is very, very exciting. So yeah, the compositionality problem, Clem was saying the same thing. I mean, that is the golden ticket. If we can solve the compositionality problem, I think reasoning opens up. But presumably though, the approaches that you're going to be working on now are going to be slightly broader than what was appropriate for this particular benchmark.

So you're going to be doing a few things differently. What particular kinds of strategies and approaches are you going to be looking at now? Different creative ways of doing test time is really interesting. I'm also interested in, okay, first I'll say: you always have to target the 100% with ARC. It's very hard to optimize for a certain competition with something like ARC, or rather it's a really bad idea. It's not going to

give you lots of score because it's because of the, like how private the data set is and you can't make guesses about it. It's just formulated in a really nice way where you have to kind of go for broke

Whether you are working in a small team with very little funding or with a lot more funding, I think it kind of is the same. But what we get now is different angles. Right. And research angle and the competition angle. And these are like super well aligned, by the way, like they mesh together really well. So, yeah. So just more angles of that. I'm also like.

Yeah, just thinking about different benchmarks around ARC. I have, I think, some really good ideas there. ARC is a special format, I think, that allows you to benchmark certain things. Maybe we can go into that at another time. What do you think will happen when Chollet releases the new version of ARC? And by the way, one thing that Kevin and Zena said, which I didn't predict, was that

has ARC become a lame duck benchmark now that Chollet has said he's going to invalidate it next year? Invalidated meaning he's making a new version of it, so it's almost like, are you wasting your time working on ARC 1 when everything changes with ARC 2? So I think ARC 2 is going to be the same format as ARC 1, right? So I think what's going to happen is it's just going to get harder. So what you said to me was that he's

employing loads and loads of humans to design and select tasks that are at the appropriate level of difficulty. So some tasks are insanely difficult and most humans don't get them right. And some are like too easy. And even then there's an interesting overlap between what's easy for a human and what's easy for a computer.

So Chollet, of course, is a big believer that they are a proxy for intelligence, that there should be some kind of G factor or generalization between them. The other thing I wasn't entirely clear on is, is he protecting ARC 2...

you know, the second version of ARC, through diversity, so just reducing information leakage through sheer diversity? Or is he talking about creating a dynamic benchmark which is almost entirely impervious to overfitting? I will say this: I was talking to Chollet and I asked him, are you going to have a dynamic benchmark where you iteratively,

Or you have some kind of iterative framework where you can ask for more. And he did say, no, it's going to be the same format, just different data and parameters.

Possibly, very likely, more difficult in whatever way they are calibrating for. So I think one thing that Chollet speaks about, related to your first question, the getting the right level and so on, is this idea of idiosyncratic riddles, right? Meaning one-off riddles, those riddles that are

so creative. I think it requires real creativity to come up with those riddles, right? They're so creative, they're just very novel, right? You can't label them: if you were working on a labeling approach for ARC and you're trying to find the right label, it would be hard to categorize. They're one-offs. And I think what he's going to do is have a lot more of these idiosyncratic riddles, and I think that's interesting, I think that's a good thing.

But I also think that the original ARC formulation was already good enough. And it's already enough for us to come up with new methods. And even if it remained the same, I think there are still lots of angles where you can measure generalization in transformers and improve on it, even with this fixed thing. And I do think it gets solved with scale. I think ARC v1...

you know, give us four 3090s or two 3090s and more time on the problem next year. And absolutely, we get to 85%. So, yeah.

for ArcV1. So I do think it scales, and I do have some data on that with the hidden test set that it does scale, actually, in a very interesting way that you wouldn't expect. But yeah, no, I think it does scale, and I'm okay with more idiosyncratic riddles. I think maybe that would tune our signal for what is generalization a little bit better. So they're putting a lot of hard work into that, and I'm excited for it. I just...

Maybe if they do this, there's just one request or hope that they also open up the V1 again so that we can have a kind of just a benchmark for progress on the methods. Right. Like we know the scores really well on V1. If we can still submit to V1, even though now there's a V2 and it's very different, we can still submit to V1. That'd be really good to have maybe as like a validation set or a kind of another benchmark. You know what I mean? Yeah. Yeah.

I mean, Chollet's measure of intelligence was so fascinating, because he was talking about this adaptation, you know, being able to create a new skill program in response to novelty. And if you think about it, it's not a foregone conclusion. This is actually a very difficult thing to do from a psychological point of view, because you're trying to create a general benchmark

such that, given the base knowledge of the average person, they would be able to do this generalization. You gave the example of the one-offs. That would be: why are manhole covers round?

You know, like these ridiculous riddles that they used to use at Microsoft to hire people, and only one in a thousand people get them. So the information gain on that is basically zero. So you need to create a set of programs that average people, with their base priors, would be able to do. And ARC solved that really well. And I think maybe Chollet doesn't get enough credit for how cleverly he selected challenges that would work well for that. And of course, now he needs to diversify across that. One other quick question: in your experimentation,

did you notice patterns, like which types of tasks are you failing on and which ones are you doing well on? That's a good question. So Prof Melanie Mitchell has a great benchmark, ConceptARC. Yeah, she's been thinking about this for a long time and she has lots of great ideas there.

One thing I'm noting is, I've spoken to lots of people that are on the benchmark and are using neural methods, and they all say, yeah, ConceptARC counting is the lowest, right? Yeah. Like it's just a thing. And if you try to add more counting-based riddles or...

I don't think you should add priors. We haven't tried that, like priors in terms of like feature engineering and that kind of a thing is not good. But yeah, just counting for some reason is really abysmal for neural networks. Yeah, and I can tell you why. So I interviewed two guys at DeepMind on Monday and they've studied this in self-attention transformers. They actually say it's because of representational squashing.

I won't spoil the surprise, you can watch the interview, but there's a couple of problems with transformers that because of the way they're set up, almost all of the attention gets kind of like focused on, in the limit actually, if you scale transformers all the way up, all of the attention gets focused on the first token. And there's also another problem with the softmax function that like creates this kind of like directedness problem.

And for certain types of reasoning tasks, you want it to be directed. And for certain types of creative reasoning, you want it to be diffused. But they were basically saying that Transformers, like even in a trivial sense, cannot do counting or copying.

They just can't do it. Like you can give it a trivial example where you say, you know, you count up to 100 and you just say, like, can you count these numbers up? And it just fails abysmally straight away. Copying is an interesting example because even if you use tools, you know, like people say, well, it's okay. You can just use tools. You can just like, you know, stick it into a Python tool. If you can't even copy the tokens into the tool, then you can't do that. But the upside of this is all we need to do is fix this problem.

Right. All we need to do is, whatever the problem is in these architectures, if we can make them copy and count, then maybe all of these problems will just disappear. Yeah, it's exactly that. You need to kind of dive really deep into the architecture and see where the problem is. And in this case, yeah, it can be very clear what the problem is, right? The softmax is kind of a cheat-code way to achieve something similar to a max, and you can see how that would bottleneck things. You can also see how more layers...

You know, there's a really brilliant postdoc who told me this. He told me you don't want to do the processing in a single layer; you don't want to do the processing of these things in one layer, you want to do them as you go up the layers. And what he means is, if you do the adding or the counting

all in one layer, that means you're overfitting, right? Why? Because you're not really running an algorithm, right? You're just kind of taking this feature, this feature, and you're doing it in this one layer. But if you just have one layer that goes and attends to this thing and updates itself, and then goes and updates this thing, that's the general algorithm that you want. It's not this heuristic thing,

kind of an MLP in one single layer that will just get this feature and this feature and this feature and do everything in one layer. That's not what you want. And so there are lots of ways of doing that that I'm, again, super excited to explore. And I think, yeah, you hit the nail on the head there with what you just described. Mo, it's been amazing to have you here. Thanks so much for coming on. Thank you so much.