Jan and Daniel, it's amazing to have you on MLST. Welcome, both of you. Thank you for having us here. So you guys are the official winners of the ARC Challenge. The results were finalized, I guess, about a month ago or so. And why don't we just start by-- tell us a little bit about your solution. It was a process with lots of ups and downs. We started with a simple LLM fine tuning and had pretty good results.
But during the competition, we tried to add additional computation steps outside of the LLM, which helped us to increase our score substantially. Our LLM alone reached a score of 41 points, I think. But with the additional external help, we could get the LLM to improve substantially to like 56 points, but they didn't count. So 53.5 points at the end. So can you talk us through the overall architecture?
Yeah, so we actually started with a basic LLM that was trained on language. We started out with a rather big model. It was a 12-billion-parameter model. And what we essentially do is we just take the ARC tasks, which are essentially located on a grid with pixels of different colors. And we just have one token for each color and we tokenize them line by line. And we then put them directly into an LLM. We do no program search or other pre-processing. We just
convert them to text and put them into the LLM. And that worked amazingly well, better than I expected. And of course there are quite a few tricks we need to do to make it work better on these tasks, because the reasoning capabilities of LLMs without essentially fine-tuning on the specific task are not so good. So we essentially did some test-time fine-tuning. So we had essentially two training processes. We first did a long pre-training on the official ARC training set.
We later replaced it with ReArc, which can generate more tasks, more examples. And this model did not perform so well on the validation tasks. So what we do is test-time training. We do another training process on the examples of the validation set, which we get during inference, without the final challenge output, which we then try to predict with the model. And that gives a big improvement to the score. There were quite a lot of other improvements that we
tried during the challenge. Some did work very well, but there's also a lot that didn't really turn out well.
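As a concrete sketch of the serialization just described: each grid is written out row by row with one symbol per colour and an explicit end-of-row marker. The `I`/`O` markers and prompt layout below are illustrative assumptions, not the team's actual format.

```python
import numpy as np

# One symbol per colour (0-9); "\n" closes each row. In the real pipeline each
# symbol maps to a single token in the LLM vocabulary.
def grid_to_text(grid: np.ndarray) -> str:
    return "\n".join("".join(str(c) for c in row) for row in grid) + "\n"

def task_to_prompt(train_pairs, test_input) -> str:
    # Hypothetical layout: demonstration input/output pairs, then the test input.
    parts = []
    for inp, out in train_pairs:
        parts.append("I\n" + grid_to_text(np.asarray(inp)))
        parts.append("O\n" + grid_to_text(np.asarray(out)))
    parts.append("I\n" + grid_to_text(np.asarray(test_input)) + "O\n")
    return "".join(parts)

print(task_to_prompt(
    train_pairs=[([[0, 1], [1, 0]], [[1, 0], [0, 1]])],
    test_input=[[1, 1], [0, 0]],
))
```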
So Tufa Labs is a new AI research lab I'm starting in Zurich. It is funded by PASS Ventures, which is involved in AI as well. And so we are a Swiss version of DeepSeek: a small group of people, very, very motivated, very hardworking. And we try to do some AI research, starting with LLMs and o1-style models. What we're looking for now is a chief scientist and also research engineers. You can check out positions at tufalabs.ai.
MLST is sponsored by Sentinel, which is the compute platform specifically optimized for AI workloads. They support all of the latest open-source language models out of the box, like Llama, for example. You can pay on consumption, essentially, or you can have a model which is always working, or it can be freeze-dried when you're not using it.
All of the models that they deploy support the OpenAI API specification out of the box, which means it's just a one line change in your application to switch over to Sentinel and start saving money and make your application go faster. Yeah, it must have been a fascinating process kind of trying all of these different ideas and seeing what stuck and what didn't. I've updated so much.
So, I mean, Melanie Mitchell had quite an interesting paper out talking about her ConceptARC and, you know, what it means for LLMs to understand and all of this. And the common-sense rationale was very much that language models will never be able to do this.
And the only way it could possibly work is if we generate an intermediate program because symbolic things are better. They're more compositional. They generalize better. And certainly, I think if you took GPT-4 and just did zero shot solution space prediction, you were looking at around 10%, which is non-zero. It's certainly much better than something like GPT-3.5. But many of us, myself included, just ruled out the possibility that these things could be reasoning.
What did you discover on that journey? I at least started with the same assumption. But we quickly found that the LLMs have far more computational capability than we thought. For example, all tasks seem like 2D vision tasks, right? They are on a 2D grid. If you see them, you use your perceptual knowledge to move blocks or move pixels around.
And my heuristic was that an LLM works in 1D, like it takes text. We have \n for going to the next line. And it seemed crazy to me that the LLM just functions in that space. Like it has to learn to infer the 2D structure of the problem without ever working in 2D. And
The more we trained the networks, the more it seemed like that just wasn't a problem. The LLMs were strong enough, had enough capability to infer this structure implicitly. And we did some experiments where we tried to help the LLMs, giving them more information about the structure, like appending dimensions or coordinates.
But it wasn't needed. It rarely improved the score. And if it improved the score, it did so by a very small amount that wasn't worth the additional compute time, basically. So I think one of the intuitions there is
It's a 2D grid. I mean, certainly in Transformers models, you know, we put in positional tokens and stuff like that. So the remarkable thing is that it generalized to different grid sizes. And you're saying without you needing to explicitly give it any, like, compasses or markers or anything like that. We actually tried to do a 2D positional encoding on the data, but it didn't really improve the results.
So it just learns to detect the end-of-line markings and to output lines with the correct length. It just worked fine. And we then stopped with the positional encoding approach because it was too complicated and didn't give us any advantage. You did fine-tune the model. I think you ended up using a Llama 8 billion. Is that what you're using? That's what we actually started with. Then we went to a 12 billion model and then
Because this model was too big to do many things on Kaggle, we actually went back to a smaller model. But it was the Llama 3.2 3B model,
which we found to be essentially as strong as the 8B model, but much faster in execution. So we could try a lot more things. Yeah, I'm very excited about that new Llama model. It's amazing how much they've gotten out of it. So just with that model, out of the box, zero shot, you know, in solution space, what kind of accuracy are we talking about? Without training? Without any pre-training. Without any training, I think close to zero or zero.
Interesting. The model can't even predict the structure really. Sometimes it just outputs end-of-sentence tokens indefinitely. Sometimes it predicts some numbers, like it should, with end lines, but then some lines are too long, some lines are too short. It hardly produces any solution that is even possible without any training process. Very interesting. So you trained it on ReArc,
which is Michael Hodel's dataset. Maybe talk a little bit about that. It was very useful to have a dataset that is basically unlimited. Like, we have the 400 tasks in the easy dataset, and ReArc helps to generate more examples for each challenge. In a way, at some point you saturate because you don't get novel ideas in these tasks, but it was very useful to just have virtually infinite,
generatable training data. Very cool. So Michael Hodel's thing is a dataset generator, so given a task you can create a whole bunch of additional, you know, augmentations of that task. So you had about 10,000 of those across all of the ARC tasks in the public set, and you fine-tuned, let's say, Llama 3.2 3B on that. And then zero shot,
you know, solution space out of the gate? What are we looking at? Let me think for a moment. The Llama model, I think, was at maybe about 10 to 20% performance, I think. Interesting. I'm not sure, because we did multiple inference steps and so on. So you could get the zero-shot model with some tricks to maybe 20 or 25%. That was the max we could get out, I think. On the eval dataset, right? Yeah, on the eval dataset.
Well, that brings us to the magical test time inference. There are lots of different names for it. People have been using different names for it. But tell us about that. Yeah, what we actually did was after the model was fine-tuned the first time, we did an additional training step on the evaluation data. So not on the full evaluation data, because we only had the examples and not the challenge itself, but the examples essentially looked the same as the challenge. So you could just pick one output from...
the examples as the final example and then just train on these inputs. On the Kaggle contest, we essentially did a retraining on some of these examples, and after that we did the inference step, and this improved the results quite a lot. I suppose you could argue that's, I don't know whether you would call that cheating or not, but
In the real world, I suppose you would be doing an online prediction and what you're doing is you're making your predictive model as good as it would have been at the end of a line of doing online prediction. You're correct on that, yeah. But we tested different versions and it's also possible to essentially retrain on every single task. So that would essentially be a fairer version. So the model doesn't see the other evaluation tasks.
That works just as well, but in our experiments it required more time to retrain than just putting in all the tasks at the same time. And because time was so limited in this challenge, we
choose the option to train on multiple tasks at once. Yeah, so there's this thing that is transductive active fine tuning, which is when at test time you have the specification and you generate augmentations from the specifications and then you fine tune on those augmentations and then you do the prediction.
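A rough sketch of how such test-time training data can be assembled from a task's demonstration pairs, each pair taking a turn as the held-out target (illustrative only; the serialization markers and layout are assumptions, not the team's code):

```python
# Each demonstration pair of a task takes a turn as the "test" pair, with the
# remaining pairs serving as in-context examples; the model is then fine-tuned
# on the held-out output tokens. Grids are assumed to be already serialized.
def make_test_time_samples(demos):
    """demos: list of (input_text, output_text) pairs from one task."""
    samples = []
    for i, (held_in, held_out) in enumerate(demos):
        context = demos[:i] + demos[i + 1:]
        prompt = "".join(f"I\n{x}O\n{y}" for x, y in context) + f"I\n{held_in}O\n"
        samples.append({"prompt": prompt, "target": held_out})
    return samples

demos = [("01\n10\n", "10\n01\n"), ("11\n00\n", "00\n11\n"), ("00\n01\n", "01\n00\n")]
for s in make_test_time_samples(demos):
    print(repr(s["prompt"]), "->", repr(s["target"]))
```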
Do you mean from the test? Do you mean from the new challenges? Yeah. So let's say you get three new specifications for a task, and then do you augment that and fine-tune on it? Yeah, we heavily use augmentation in basically all our inference steps, also in training. We also used it for the pre-training with ReArc, actually,
but it helps a lot to get enough data. During the retraining or test time training,
We are heavily data limited, right? Because each challenge only has three examples. But we can use the symmetry of the problem because all the problems work on a 2D grid and there are some implicit biases in it. So we can rotate it, we can mirror it, we can even shift the colors in specific ways without changing the challenge actually. And since the model
is not rotation invariant, right? It's a 1D LLM. We can create the rotated augmentations and use them to train on the challenge too. And it looks like a novel task to the LLM because it's rotated differently,
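A small sketch of the geometric and colour augmentations being described, assuming the same transformation is applied to every grid of a task (the choice to keep the background colour fixed is just an illustrative assumption):

```python
import numpy as np

def augment_grid(grid, rot, flip, color_perm):
    g = np.rot90(np.asarray(grid), k=rot)   # rotate by rot * 90 degrees
    if flip:
        g = np.fliplr(g)                     # mirror left-right
    return color_perm[g]                     # remap colours 0-9 through a permutation

rng = np.random.default_rng(0)
# Example colour permutation that leaves the background colour 0 untouched.
perm = np.concatenate([[0], rng.permutation(np.arange(1, 10))])
grid = [[0, 1, 2], [2, 1, 0]]
# Apply the same rot/flip/perm to every input and output grid of the task.
print(augment_grid(grid, rot=1, flip=True, color_perm=perm))
```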
which helps a lot with the amount of data we have. So from the test specifications you do symmetry augmentations and fine-tuning. Do you do any additional augmentations? We were thinking about a lot of augmentation stuff. Most of our tests didn't pan out or were too time intensive. For example, we have a selection step where we
take the generated example, like our solution, and then we do an augmentation again. We ask the LLM basically, does this solution look correct from a different perspective too? And theoretically you can do more augmentations there. You can shift into another space where the problem looks simpler, as long as there's some consistency in the solution. Like, for example, many problems would
have the same solution if they were in black and white instead of in color, for example. And in theory you could use that to
to filter wrong solutions very easily. But we didn't have the time to get it to work, basically. So other than the symmetry augmentations, there were no additional augmentations? Symmetry, color shift, and shifting the examples because the transformer is not invariant to the order of the examples. So we also shuffled the examples a little bit around. Very interesting.
So then we get to the, we're getting close to putting the, you know, the pin in the middle of the dartboard here, which was like the main thing that you guys did. So you do some kind of a search and evaluation process as part of your test time inference.
Can you tell me about that? There are actually different methods of sampling for an LLM. You could just do greedy sampling. This means you always take the token with the highest probability when sampling. Or you could do sampling in a stochastic way, by just taking a token according to the probability that's predicted by the network. And all these sampling methods didn't work really well when we wanted to generate lots of different candidates.
So what we actually did, we implemented a selection process later on, which we can maybe talk about later. So the focus here in this generation step was essentially to generate lots of candidates to increase the chance to find the correct solution. And yeah, we first started out with sampling and
We also tested beam search, because often there are different paths through the network which give different predictions and one of them is correct. But in the end we settled on a depth-first search, which we implemented ourselves because we hadn't found it anywhere. And that worked quite well, and it had quite some advantages over beam search. So what we do is
We just treat the tokens that are predicted from the network as a search tree. And we then search through this tree to find all solutions which have a sampling probability above a certain cutoff value. So, for example, we set this probability to 10% and then we get a variable number of candidates. This could be up to 10 candidates for a 10% cutoff, of course.
This search has quite some advantages. The first one is that it's very memory efficient. You only need to store exactly one path. So it's as memory efficient as a single inference step, and much more efficient than beam search, which essentially needs the same memory for each beam. And the other advantage is that it gives us multiple solutions at once for the problem.
And the third one is that we have a cutoff. So if there are paths that are not promising, so in these cases, the search stops early and doesn't even follow them. And this gave us a big efficiency boost in
generating solutions or solution candidates, I would say. This is so fascinating. I interviewed these guys at the University of Toronto and they were talking about the reachability space of language models. And they had this famous Roger Federer game, which was like, you know, you try and you make the language model say, great, Roger Federer is the greatest
And you have to kind of find what prompt will reach that word. And their theorem essentially is that the capacity or the reachability of transformers is incredibly flexible.
There's a remarkable divergence between the way that we sample tokens from a language model and what they're capable of. And, you know, like a clearer way of saying that is that there is a misalignment between how we sample language models and the epistemic truth or the correctness of the program that we are sampling.
And I find that fascinating. It's also related to this creativity problem, which is that when we do this greedy sampling, which means we just take the best token and the best token and the best token, we're kind of missing out on all of these creative possibilities. And this is something that a lot of people aren't really thinking about yet. I think that's a very interesting problem. And we have to distinguish between like a language context and the ARC context.
that there's a substantial difference, namely that we only have 10 possible tokens that might be the next correct answer. And also we know there's only one correct answer. So in language you can have millions of ways of rephrasing the same idea and all of these are valid or good continuations. In our case, which is one of the reasons why the DFS works so well, we only have one correct continuation and we have to find it.
It's very easy to evaluate and it's very easily searchable, also because the set of possible continuations is much, much smaller. We can even calculate how probable each solution is by simply taking the solution, doing a forward pass, calculating the logits, summing them, and then we know the cutoff needs to be this low for our DFS to find the correct solution guaranteed. And that makes the problem very tractable in this sense.
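A toy sketch of the depth-first sampling idea described here: extend a prefix only while its cumulative probability stays above the cutoff, and collect every terminated sequence. The three-token "model" below is only a stand-in for the LLM's next-token distribution:

```python
import math

def dfs_sample(next_logprobs, prefix, logp, cutoff_logp, eos, max_len, out):
    # Collect a candidate once it terminates; its cumulative probability is exp(logp).
    if prefix and prefix[-1] == eos:
        out.append((math.exp(logp), prefix))
        return
    if len(prefix) >= max_len:
        return
    for tok, lp in next_logprobs(prefix):
        if logp + lp >= cutoff_logp:          # prune branches that fall below the cutoff
            dfs_sample(next_logprobs, prefix + [tok], logp + lp, cutoff_logp, eos, max_len, out)

def toy_next_logprobs(prefix):
    # Stand-in for the LLM: fixed next-token probabilities over tokens 0, 1 and eos=2.
    return [(t, math.log(p)) for t, p in {0: 0.5, 1: 0.3, 2: 0.2}.items()]

candidates = []
dfs_sample(toy_next_logprobs, [], 0.0, math.log(0.05), eos=2, max_len=4, out=candidates)
for p, seq in sorted(candidates, reverse=True):
    print(f"{p:.3f}", seq)   # every completed sequence with probability >= 5%
```

Only one path is held in memory at a time, and any subtree whose cumulative probability falls below the cutoff is never expanded, which is where the efficiency comes from.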
and all the sampling algorithms are built for language, because LLMs work on language, obviously. Which is also why we didn't find the DFS anywhere and had to implement it ourselves, because it's not a valid way of sampling in language. It just can't work. Language has too many possible paths, and each path, as long as it's valid, should roughly have the same probability. So if you have one million possible continuations,
each probability is one in a million. So it's very hard to use it in this context. But because our problem is discrete, small,
we can sample very, very efficiently. This raises so many interesting questions, because you could argue then that the reason why these models are misaligned between program or knowledge correctness and how we sample them is, by your argument, that there are too many degrees of freedom in natural language. So, you know, it's almost like they're dealing with this ambiguity in their training, and therefore, when we sample from them, we're not very likely to get the correct answer.
You have to be careful about that because we have to differentiate between the probability of a path and the probability of the answer. You can have a very improbable path that leads to a specific answer and then you can have multiple variations of it. But the probability of the answer is not dependent on any path but on the sum of all paths that lead to this answer.
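Stated as a formula for clarity: the probability of an answer sums over every path that produces it, so any individual path can be far less likely than the answer itself,

```latex
P(\text{answer} \mid x) \;=\; \sum_{z \,\to\, \text{answer}} P(z \mid x)
```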
So in language you have this broad set of paths that all lead to the same solution. And when you use standard sampling techniques, you sample the final answer correctly. That's the use of sampling. In the ARC contest there are no alternative intermediate steps.
you don't need to marginalize all possible paths, basically. You need to get the correct path, basically. So in a normal LLM, this is completely fine. But I think in the age of reasoning models we are getting into, like o1 or o3,
You also need the correct reasoning steps, the correct intermediate steps to get to the right solution, especially if you use it for code generation or answering complex mathematical questions. So this will be, I think in the future, this will be a much larger problem. But up till now, sampling was simply the correct approach. A couple of things on that.
So certainly, perhaps in your solution, the way that you traversed that sparse space, and the reason I say sparse is it's a bit like in reinforcement learning, that you need to take several steps without knowing what the value is going to be. And this is really, really difficult when we have some kind of traversal optimization algorithm where we rely on a monotonic signal. So we're getting better, better, better, better. In this case, the space is sparse and we have to take several steps into the unknown.
But the thing that really interests me though is even if we don't know what the reasoning steps are, Ryan Greenblatt's solution really impressed me and surprised me because I assumed that this is just an exponentially large problem and that using a language model to guide the search would not be significantly better than just exhaustively finding programs. And what he demonstrated quite clearly
you know, succinctly is that it's tractable. I mean, yeah, we still need to generate loads of completions and do some search, but a reasonable amount of search gets you to the solution. And what I'm saying there is that even though there is an apparent orthogonality between correctness and searching this space, the correct solution isn't that far away. That's fascinating. Yeah, I think the interesting thing about Ryan Greenblatt's solutions from my perspective is that
that he showed that code generation can work, like that the models can see the visual problems, like the 1D representation of the grids, and then produce code that works to move these 2D objects even though they were never seen, even without fine-tuning. On the other hand, code generation has a massive advantage
compared to our approach because you can just test your code. You can check: Does the code that's generated perform in the way I want it to perform? And if it does, then it's probably a good solution. Whereas our approach doesn't have that. We had to find a way to select which candidate we want to submit
without the possibility of having any guarantee that it's right. Well, this gets us onto the most delicious part of all. So as you just articulated, Ryan Greenblatt's solution was technically a neurosymbolic architecture. So Subbarao Kambhampati would have been delighted, because he loves having formal guarantees. He loves actually knowing for sure
I mean, actually, in this case, we don't know for sure, because many programs that pass the examples are not actually correct, of course. But at least, you know, we can run a program on a Python interpreter and we can get a yes or no answer. It might be a false positive. Now, you guys have done something very interesting. You're using the language model to verify its own value as you traverse through this tree structure. Can you tell me about that? Yeah. So...
Our generation process essentially generates multiple solution candidates. Usually it's up to 10 or 20 in practice. And now we need to find a way to select the correct one. And what we actually did is we used our model also for judging how good these solutions are. And the interesting part is it's difficult if you do it without augmentations because a model of course favors its own predictions because
what we use as score is the sampling probability, the probability if we got it in the first place. So what we essentially do at this point, we also use a lot of augmentation for the judging. So we essentially use 16 different augmentations of the problem and put them through the model to calculate scores for each augmentation. And then we average them. And this is an interesting part.
we tried different algorithms. One was just, for example, to sum up the actual probabilities, and the other one was to sum up the logarithmic probabilities, which would essentially be equal to multiplying the probabilities. And the second one worked much better than the first one in selecting. And we really asked ourselves why this is. And so our hypothesis is essentially that the good, the correct solution
looks mostly correct from every angle, no matter how you rotate it, it gets maybe not a really high score, but it doesn't get an extremely low score. And the incorrect solutions, they might also look good in some perspectives, but usually there's one or two or even more perspectives where they get an extremely low score, like 0.01% or something in probability. And so essentially these low scores are important.
during the calculation. And you can't get these scores during the generation process because these things are never sampled. So we actually need the extra scoring step to calculate these low scores to sort out the false solutions. This is absolutely fascinating.
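A condensed sketch of that selection step: score each candidate by its log-probability under the model for several augmented views of the task and sum them (equivalently, multiply the probabilities), so that a single very unlikely view can veto a candidate. The numbers below are invented for illustration:

```python
import math

def pick_best(per_view_logprobs):
    """per_view_logprobs: {candidate: [log P(candidate | augmented task view), ...]}"""
    scores = {c: sum(lps) for c, lps in per_view_logprobs.items()}
    return max(scores, key=scores.get), scores

views = {
    "candidate_A": [math.log(p) for p in (0.30, 0.25, 0.20, 0.28)],    # decent from every angle
    "candidate_B": [math.log(p) for p in (0.60, 0.50, 0.0001, 0.40)],  # one view exposes it
}
best, scores = pick_best(views)
print(best)   # candidate_A: summing log-probs punishes the single near-zero view;
              # summing raw probabilities would instead have picked candidate_B
```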
So you generate 16 augmentations for, let's say, a test specification. And as you said before, you had already fine-tuned the model on symmetry permutations, which means it has awareness of different symmetry transformations. And you're saying that using it to do self-reflection on the value of those symmetry augmentations gives it different perspectives on what good looks like.
and then you can kind of look at those, you know, look at the coherence of the different symmetry perspectives to give you an idea of which one is correct. Yeah, exactly. Ironically, it's very useful to us that the LLM does not have symmetry in its predictions. And the interesting part is that it's useful to us that the language model
does not have the 2D symmetries. Like, if it had 2D symmetries, we couldn't use our augmentation for scoring because the score would be the same every time. So in a way, we abuse the fact that LLM is not perfect at 2D tasks. Also because it's always generating from left to right, from up to down, right? So some parts of the puzzle get generated later with more information.
And some parts get generated earlier with less information. And if we rotate it, we generate other parts of the solution first. So a lot of the score differences between the different augmentations come from the fact that it can generate maybe complex parts or parts of the solution first or later. It helps that sometimes
a solution depends on previous input. And if we generate the wrong solution and then rotate it, it sees it much earlier. Like it can see, ah, it makes no sense that like the line that goes from left to right here ends at an empty pixel. It should never happen. And because we rotated the problem, it can see it from a new perspective and then decide that it's an invalid solution or very improbable solution. How much does this depend on the...
the symmetry augmentation before test time. So do you guys say that before we even get to any test time computation, we take all of the evaluation tasks and then we do some symmetry transformations and we fine tune? If you don't do that, how does that affect this evaluation? I think we have never tested that. So you mean not doing the augmentations during training? Yes, yeah.
Yeah, we have never tried that out. We always use augmentations from the beginning. So I'm a bit confused because you seem to be saying it's actually a good thing that it doesn't know about the augmentations. So doing the augmentations, it seems to me that that would be a good thing, but maybe it isn't a good thing. It's a trade off. Like we would lose a lot of training data and it is good if it can just generate the correct solution from every perspective, right?
But some problems are simply easier from some perspective than others. And in theory, the network should converge to a correct solution from every perspective. If it's deep enough, if it's strong enough, but sometimes it just can't. And then we can use that fact
by generating different augmentations and scoring them correctly i'm trying to understand as well so that there are different perspectives on on language models so um sabaro kambahati he called it jagged intelligence i think which is what andre kapathi called it you know that and that's this idea that point realistically they're actually very good which means they have these islands of knowledge
So you can say, I'm going to do a symmetry transformation and it knows. So given an example of something, it knows if it's good or bad, but it doesn't know how to get there. So there seems to be like these concentric circles where they are better at discriminating than generating. And they also are capable of knowing if they don't know how to get there.
I think in some ways, yes. It depends so strongly on the task, like always, right? There are problems where it's true that they have a much easier time of discriminating solutions than generating them. But at the same time, it depends how you measure it, basically. Like, we measure the score of a task or of a solution by calculating basically the probability that it would be generated from normal sampling. So...
In our case, generation probability or generation and evaluation is exactly the same. The only way we can then use the discriminatory abilities of the LLMs is by using these transformations because, like Daniel said, LLMs prefer their own outputs.
So we have to trick the LLM to think that it's an output it would have never produced. There's also this thing that Thomas Dietterich was telling me, that it is very interesting looking at the divergence between epistemic risk and aleatoric risk. And that's simply that there's a divergence between things that are true and things that the language model will kind of, like, you know,
give you if you stochastically sample them. And he said that before RLHF, there is much more of a correspondence between the likelihood of a trajectory and epistemic factuality. So whether the thing is actually true or not, and RLHF deranges that.
But I'm thinking that doesn't really affect your situation because you're doing fine tuning on the top. So you're kind of, I mean, what you're doing is either overwriting or orthogonal to any RLHF training. So you're actually deliberately saying, you know, here is what good looks like. And if you can generalize from that example of goodness, you can use the likelihood of a trajectory as a proxy for, you know, correctness.
We actually used a Llama uncensored model in our experiments, and we also compared it to just the normal model, and the uncensored model was a little bit better. So maybe that makes a little difference, yeah. Was it retrospectively uncensored, or was it just never RLHF trained? Retrospectively. It was from Schwann? Yeah.
Yeah, because I've played with uncensored models, you know, like the Qwen models. And I found that they are lobotomized.
And for whatever reason, like however they are, I think there are various different approaches to uncensoring them, but they just seem to degrade to the performance of a model, you know, half the size. So like a 72 billion model will now have the reasoning abilities of a 32 billion model. But that's just my experience. We are actually not sure if it's at all important that we use a pre-trained model. Like, one of our experiments in the pipeline at the moment is just resetting all the weights and then training,
to check if we even use the pre-trained language capabilities or if it's just the architecture, actually. That's very interesting. Yeah, I interviewed Randall Balestriero at NeurIPS and he made the same discovery. So he's got this really interesting paper out. And, you know, we are taught that we need to have these large pre-trained models and we're leveraging the knowledge inside the model and so on. And apparently no one has really done this experiment, that you can just kind of
And this is for discrimination, not generation, right? So if you're just doing some kind of classification problem, you can take a very large model, you can train it from scratch for doing some like discriminative task, and you can actually get better performance than like frontier models, pretty much with hardly any training and hardly any data.
Isn't that, and it almost goes back to those papers from years ago, you know, that there are so many inductive biases just in the architecture of the model itself that, you know, you don't actually need that much training. I think one of the interesting parts is that we lobotomize our LLM too. Like at the end, we have an LLM that has basically only 140 tokens or something. Like, we remove all language capabilities from the model, just delete it,
and really make sure that it can only operate in the space of ARC. Like, it can only output numbers 0 to 9, end-of-line tokens, and some letters.
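A sketch of what stripping the vocabulary down can look like in practice; the token list and ids are hypothetical, and the tensors are dummies standing in for the model's embedding matrix and output head rather than a real checkpoint:

```python
import torch

# Hypothetical mapping from the kept symbols to their ids in the original tokenizer.
kept_tokens = ["<eos>", "\n", "I", "O"] + [str(d) for d in range(10)]
original_id = {t: i * 37 for i, t in enumerate(kept_tokens)}   # made-up ids for illustration

full_vocab, hidden = 128_256, 3072              # roughly Llama-3.2-3B-sized, for illustration
embedding = torch.randn(full_vocab, hidden)     # stand-in for the input embedding matrix
lm_head = torch.randn(full_vocab, hidden)       # stand-in for the output projection

keep = torch.tensor([original_id[t] for t in kept_tokens])
small_embedding = embedding[keep]               # 14 rows instead of ~128k: a big memory saving
small_lm_head = lm_head[keep]                   # logits are only ever produced for ARC symbols
print(small_embedding.shape, small_lm_head.shape)
```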
And we did so mostly because of computation requirements. It surprisingly saves a lot of RAM. But yeah, we remove all language capabilities. So we are often asked, do we do any pre-prompting? Is it like chain-of-thought thinking? It can't. It's all gone. Well, that's really interesting because certainly the MindsAI team, I was talking with Jack, and they're very bullish on this idea of multimodal or cross-domain transfer. So they've
I think Jack was saying, for example, they might even give English language descriptions of the ARC challenges and they're looking for as many ways as possible to leverage the base model in the language model. And there are two views on this. One view is that these things are a general intelligence.
And the whole point of o3, for example, is that it's kind of like it's creating a novel skill program by doing some kind of compositional generalization. And it's like, you know, working in this novel situation, whereas this alternative view is very much that they are a jagged form of intelligence and a very specialized form of intelligence. And they're mostly just working on specific instances of very similar things that they've seen before.
Indirectly, I think we could say that we found that the model performed much better at tasks that seemed similar to tasks it had seen. Like, in our paper we claimed a performance of 76%, 72% on a left-out part of the evaluation dataset.
But its performance drops or dropped at the time noticeably if we never trained on any evaluation dataset tasks. And we believe that to be because there is some conceptual leakage, we called it, from some tasks of the evaluation dataset to other tasks. Like they're not the same, but they have similar ideas.
Which is probably also a reason why the additional datasets like ConceptARC or BARC helped, because they maybe introduced some novel ideas in some way. And if the LLM has never seen an idea conceptually in training, then it's very unlikely that it will perform well on it.
So, for example, if we train the LLM on ARC challenges that are, I don't know, flood-fill algorithms or something like that, then it will not transfer its skills to a counting task or something like that. But as long as it has seen some counting tasks, it might be able to combine the ideas to form, like, novel challenges. It is very interesting that, I mean, just take multiplication, for example.
Frontier models, they don't generalize very well. I mean, they do a little bit, three digits, maybe four digits or something like that. But you can take GPT-2 and you can fine tune it up to 20 digit multiplication and it works better than any frontier model. So there's always this notion that doing fine tuning for a specific task is better.
Architecturally, that doesn't work very well, right? Because, you know, we're in this world and we're subject to novelty all the time, and we can't possibly be fine-tuning in every possible situation. So maybe that's when this active fine-tuning, like this test-time computation, comes into play, because increasingly we build architectures where we are learning online and maybe sharing that knowledge in a clever way with other models. But it kind of feels that with this new online
version of prediction, we can kind of have our cake and eat it. So the interesting thing is, for the multiplication, tokenization plays a big role, because the models tokenize numbers differently. And the model we used was using just one token for some digits, but sometimes it combined two digits into one token. So essentially there were digits 0 to 9 it could tokenize, but then there were some two-digit numbers
and some three-digit numbers. And yeah, I think newer models do this differently or probably have all three-digit numbers integrated. But for us, this was a problem in ARC. So we also removed all this from the tokenizer. So we just have single digits because this would have been a big problem, I guess, if
the length of the outputs changes. Oh, interesting. So you modified the tokenizer to remove all of the compounded numbers. We essentially removed anything except numbers, end of line,
and end of problem and input and output prefix. Very interesting. Yeah. And of course, you weren't deranging it because you were doing the training on the top. So you're kind of like doing a higher resolution token version of training. Exactly. Yeah. And I think for the multiplication task too,
A common problem I see on Twitter is that people misunderstand how complicated the problem is for LLMs. I think OpenAI has almost every three-digit number as a token. So if you tell ChatGPT to multiply two three-digit numbers, it has to basically remember one million multiplication pairs. It's a very, very complicated task.
And then people are like, yeah, it can't multiply three-digit numbers, but it's a hard task for ChatGPT. The same with the typical strawberry task. Like, my way of explaining it is: imagine, like, every syllable is a
token in a language you have never seen, and you have to remember how many Rs are in each of these tokens, and you have to remember that for 130,000 different things. Like, it's a complex task, and then you have to learn it without ever seeing the letter R. That's hard.
LLMs are very, very strong problem solvers if you fine-tune them, but they are not the intended solution of ARC in the way that we use them. So in a way, it's a hacky solution. But because of the challenge, you have to use the best performing model. Like, you could do a very creative solution that goes further into the direction of AGI, right? But you wouldn't win. You win with LLMs and abusing, like, fine-tuning speed and stuff like that.
So LLMs are too strong for this contest to enable any other solution, basically. Interesting. What about the thing that right now, this is quite an interesting coincidence, that ARC was designed to be solvable by humans.
And one of those restrictions is that they shouldn't be more than about 30 by 30 cells on a grid, because otherwise humans would start making loads of mistakes. But that also just almost coincidentally happens to be the upper bound of what we can correctly do with an LLM. If it was 50 by 50 or even 60 by 60, no one would be using LLMs to solve ARC. We would have only solved the small problems here. Yeah.
Because even now, like when we look at the failure modes of o3, there's definitely like, you can see it tailing off with solution size. So it starts to make more and more mistakes, you know, when it gets to 30 by 30. But it does raise the question though of the complexity of the problem. So again, my intuition is that this problem has a kind of exponential complexity to it. And I would have said the same thing about language. It's interesting that language, if you think about
all of the ways that reference can hierarchically, you know, connect to each other. It seems like as the length of even a book, for example, the information complexity of a book should be exponentially large because there are all of these like connecting things between all the paragraphs and the sentences and the words and so on. And language models don't have any problems understanding and explaining books.
And I want to understand what that is. So there's the potential complexity of all of the possible connections between the terms, and then there's the actual embedded complexity of language. And I think you were making this argument earlier that you feel that actually the sort of the embedded complexity doesn't scale exponentially with the problem size. Yeah, it's in many challenges at least, like
You can see that many problems are mostly trivial. There are just a few pixels where you have to make the right decisions. We even measured that, and we found that in an intermediate stage of our solution, where we still did normal sampling, where we didn't use DFS, we had to get the model to choose the second-highest-probability token
three times, and then we would have solved 80 of the challenges. But, um, we had to get it to choose these, and we don't know for which of the 900 tokens the second best is the correct one. You know, the space of
solutions the LLM generates is much, much smaller than the theoretical large space, because a lot of the pixels are trivial. Like, the background is often trivial. Like, many problems are "move this object here." And if you start the object at the right position, then the LLM generates the rest of the object completely correctly. But it has to decide where to put this object. And that's the hard part.
But the number of decisions we have to make to generate the correct solutions is surprisingly low. So does that imply that there's no reason in principle why an LLM wouldn't be able to scale to 100 by 100 or even 1,000 by 1,000? I think it's one of the reasons why Daniel's DFS solution worked so well for us. Because when you do normal sampling, each token has a probability of being
selected wrongly. There are many ways of fixing stuff like that, like min-p sampling, top-k sampling, or something like that. But for every token we generate, we have a small chance of doing it wrong. And our DFS solution just generates all solutions that are above a certain probability. So we are basically guaranteed to get the most likely solutions and some below that,
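Relatedly, the measurement mentioned a moment ago (how many non-greedy choices the correct solution actually requires) can be sketched like this: teacher-force the known solution through the model and count the positions where the correct token is not the argmax. The probabilities below are invented:

```python
import numpy as np

def count_non_greedy(probs_per_position, gold_tokens):
    """probs_per_position: (seq_len, vocab) next-token distributions under teacher forcing;
    gold_tokens: the correct token at each position."""
    greedy = np.asarray(probs_per_position).argmax(axis=-1)
    return int((greedy != np.asarray(gold_tokens)).sum())

probs = [
    [0.10, 0.80, 0.10],
    [0.40, 0.50, 0.10],   # gold token is 0 here: one non-greedy decision needed
    [0.70, 0.20, 0.10],
]
print(count_non_greedy(probs, gold_tokens=[1, 0, 0]))   # -> 1
```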
which reduces a lot of the exponential problems of sampling, as long as our probability bound is low enough. How did you decide on the depth of the search and the threshold? We decided on the depth by essentially just testing it out. So how deep can we go to find,
in most cases, the correct solution. And it's, of course, a trade-off, because if you search down to 1%, you can get 100 solutions in theory. In practice, you get far fewer. And if you search to 10%, you can only get 10 solutions, and you get even fewer. So in some problems where the solution is clear, it doesn't make a difference. You just find the solution. But when the solution is unclear, you get...
a lot of samples going down to 1%, and that would have been computationally infeasible on Kaggle. So in the end it was a trade-off that we just tested out, and we ended up with different values between 10 and 17% that we tried in practice. And the solution was mostly in there. Going down from 10% to 1%
gave an improvement, I think, from 70 to 71% on the score. So not much. And still technically a variable computational budget, because if you think about it, at 1%, when you do it 100 times, you could get a huge variance in the amount of searching that you're doing. But it's kind of vaguely budgeted. Yeah. Is that correct? So it's not clear beforehand how many solutions you will get from the DFS.
Yeah, so this can be more or less depending on the problems. And something that was very interesting: at the end of the contest, we switched back to a larger model, because we fine-tuned a lot. And when you do normal sampling, switching to the larger model has a certain factor by how much more compute you need for the larger model, because the inference essentially just costs something like n times the compute. But for the DFS,
because the larger model got better at deciding what the correct solution is, it could actually prune earlier in the DFS. So it still took a little longer, but not as much longer as we expected it to take from our previous tests. And that was quite surprising, and that also enabled us to do more in the short time, short compute time on Kaggle. Did you guys just for fun at home try a 72 billion model or something bigger?
I tried it on the evaluation set after the contest, but I did just one pass. I don't think it was a 72 billion model, but a 32 billion model or something in that range. But I didn't have time to fine-tune it really, so I ran it with the same parameters, and it didn't make much of a difference in that setting. But of course that's not final, because I didn't really have time to tune it.
Yeah, there seems to be an interesting relationship between the strength of the base model and how much test time computation you do. Which is to say, there's an argument that it's almost better to do, if the budget allows, it's better to do more test time computation on a smaller model. I think so, yes. I think going up in size is not necessarily always good in this case.
the more parameters you have, the more parameters you have to fine-tune. And for a fixed compute budget, I think the smaller models are much better. You can generate more, you can evaluate more, you can filter more. But also, I think, in some experiments, larger models just didn't perform better. Like, they
converged roughly to the same score. So then there's this notion of adaptability. So let's imagine that you were deploying this thing. Let's say you wanted to make some money and you deployed a web service called Arc Solver. And I'm guessing what you would notice over time is that if you persisted the LLM and you are always continuously fine tuning it and adapting it and so on,
you would probably notice its improved knowledge by virtue of the fact that it was doing less tree searching because it found the solution quicker. Do you think that that would just be the case, that it would just get generally better at ARC and it would just know the solution more quickly? Or do you think you would get weird things happen where it might learn distractors on some tasks and get worse at other ones? It's a good question. Yeah.
I mean, part of the reason I'm saying that is I think that that's what's happening with the o1 models. So I think the reason why we're seeing a dramatic improvement even in three months is because they are learning from the users. So they're continuously adapting and augmenting their knowledge and the models are getting better quicker, needing to do less thinking over time. Yeah, maybe I have an interesting fact to share.
For the fine-tuning, we tried different numbers of tasks to train on at the same time for the second fine-tuning. So we did something with one task, just a single task, with 50 tasks on Kaggle, and we also tried the full 400 tasks of the evaluation set. And that led to a very much degraded performance. So up to 50 was fine, but maybe for more tasks it just didn't work. Maybe it was too much for the model to
store at the same time. I'm not sure. We also use LoRA, which also has a limited number of parameters. So if we would do some online training and start with new examples, maybe it would have to forget the old ones to solve the new ones. That is quite possible.
Yeah, because, you know, we're always told in the theory of neural networks that there's a problem with catastrophic forgetting and continual learning and all the rest of it. And, you know, even Chollet wrote in his book that when you do fine-tuning you have to be really careful, you have to, like, turn the learning rate right down, because you'll just destroy anything it already knows and it will only learn the thing that you're teaching it, and so on. And we're just becoming quite red-pilled on OpenAI now, because, you know, we believe
that these things are like magical memory machines and they're just sucking in all of this knowledge and the more knowledge you give them, the better they get. And that seems to fly in the face of practical experience. Transformers are just very, very good at learning facts. I think it's frankly incredible what they can store.
But there's also a slightly different thing that I find interesting. In our fine-tuning we use LoRAs, obviously, because fine-tuning the whole network is very expensive. There's a small, subtle thing with LoRAs. If you use weight decay on a LoRA on a network,
then it degrades the weights of the LoRA. But you always have the base network below it. So you can't fall below a specific performance. You're always around the correct full network and just move a little bit in another direction. So if you do test-time training or fine-tuning with a LoRA,
you can in a way have the safety blanket of your base model below it, which can be very useful, I think, in our challenge. Like, we merged, I think, the first LoRA of the pre-training into the weights and then did test-time training on another LoRA. So in the worst case, it would degrade to the base model again,
like the base model from pre-training. And just for the audience, can you quickly, because LoRA is a low-order rank approximation, if I remember correctly, can you explain how it works? Is it approximation or adaptation? Adaptation. Adaptation, yeah. It's a very simple idea, a very elegant idea. You have all these square weight matrices in the attention layers, right?
And instead of fine-tuning all the weights in them, you fine-tune two smaller matrices that you can multiply to get a full-size matrix. And that's mostly enough. Like, if you have a 1000 by 1000 matrix, instead of fine-tuning 1 million parameters, you use a 1000-by-1 and a 1-by-1000 matrix, that would be rank 1, and fine-tune those two small vectors.
And because you go down to a smaller problem, because you have a simpler problem you want to learn, it's enough to get the network to go in the right direction, basically. I think we had a surprisingly high rank, right? We tried out different ranks from, I think, 64 to 256. And the original model, I think, has a rank of 4096.
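A minimal LoRA-style layer as a sketch of the idea just described (not the actual library the team used): the pre-trained weight stays frozen and only the two small factors are trained, so the effective weight becomes W + B·A.

```python
import torch
import torch.nn as nn

class LoRALinear(nn.Module):
    def __init__(self, base: nn.Linear, rank: int = 128, alpha: float = 1.0):
        super().__init__()
        self.base = base
        for p in self.base.parameters():
            p.requires_grad = False                       # freeze the pre-trained weights
        self.A = nn.Parameter(torch.randn(rank, base.in_features) * 0.01)
        self.B = nn.Parameter(torch.zeros(base.out_features, rank))   # update starts as a no-op
        self.scale = alpha

    def forward(self, x):
        return self.base(x) + self.scale * (x @ self.A.t() @ self.B.t())

layer = LoRALinear(nn.Linear(4096, 4096), rank=128)
print(sum(p.numel() for p in layer.parameters() if p.requires_grad))   # ~1M trainable vs ~16.8M frozen
```

After training, the product B·A can be merged back into the base weight, which is the kind of merge mentioned above for the first LoRA from pre-training.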
And anything above 128 didn't actually make a difference for final performance. We also tried full fine-tuning once. It was very memory-consuming. And that also didn't make a difference for us. So this low rank is probably enough for the problem. And just to give folks an idea of how long it took, I mean, how long?
Are we talking like half an hour, two hours a day? So competing in the challenge? As in to do, let's say at the beginning when you're fine-tuning the model, how long does it take? At the beginning, we had smaller models and we had also shorter training times, especially when we were still working on the original training set, which we couldn't put in too often without overfitting.
So then it was a few hours, maybe two to four hours. And it got longer and longer during the challenge. So at some point we were at two days on an NVIDIA H100. And the final model was essentially eight days on an NVIDIA H100. So that was quite long. And the interesting thing is when the HeavyArc dataset appeared, when we found it, there were only, I think, four days left till the end of the contest. So...
What we did there, we actually started with multi-GPU training to do the work of eight days in two days. So luckily we got a lot of compute. Lambda Labs offered us a machine with eight H100 GPUs. And yeah, so we could still train the model in time and submit it on the last day.
That was the model with the 56.5 score, which sadly didn't finish in time then. Yeah, they say that necessity is the mother of invention. I'm one of these guys that, you know, like what you're supposed to do when you're coding is you add one feature at a time, you check it, and then you like commit it and so on. And I have this kind of, not ADHD, not quite, but, you know, like I've got so many cool things I want to do and I want to just like stick them all in there. Where do you fall on that spectrum? Normally...
I would fall similar to you, but in the setting of the contest we just didn't have time. It didn't matter. We had to push new solutions, we had to experiment quickly and trying to clean code or trying to refactor in a way that makes it more flexible would have taken, I don't know, one or two days.
But that was too long. We had so many ideas we wanted to experiment on and try that we had to use the codebase that we had. So when we would have put in new features again, we would have to essentially refactor it again because we didn't know exactly what things we would try out during the time. So that was probably a little bit pointless to refactor it all the time during the contest. It felt a little bit like a mad dash.
So there were some refactors. When it got too complicated, there were, like, two or three points where I completely refactored it during the challenge. But in between these refactorings, we just used it over and over and just added stuff essentially.
And would you say your intuitions were fairly correct throughout? Or, because, you know, like we get a bit superstitious about things. So we think, oh, that thing's really important. We need to keep that in there. Were there any situations where you realized to your horror later on that something you thought was good was actually not good? Yeah, but we were constantly, like, testing different things, like leaving some stuff out and putting some stuff in. And we also used a lot of our submissions for testing. Like, what happens if we remove this feature? What happens if we remove that feature?
Often our favorite ideas that we tried didn't pan out and then we had to decide, do we invest more time in making them work or do we just say, screw it, we don't have the time to investigate it deeper, we have to push the things that work.
Often it was the right decision to ignore our crazy fun ideas and just do the thing that works better. And what were the top two things out of all of this stuff that you did that you think moved the needle the most? So the things that were most important, I guess, in our implementation were first of all, the scoring process. So scoring with the augmentations, which allowed us to select the solutions reliably. So this was one thing that gave a big jump.
I think it was from 30 to 37 at that point when we implemented it. And the other thing, I think, was the depth-first search algorithm, because it gave us such a performance gain that we could change a lot of things afterwards, like using a bigger model, doing more augmentations,
and things like that, that we couldn't have done without it. Well, that's interesting because part of experimentation is actually having a useful signal. And you were saying that certain features actually unlocked a signal on other things you could try. Yes, the DFS was a very natural extension of the scoring, I think. After we had a good working scoring and selection algorithm, the challenge became to generate good candidates.
And we tried a lot of stuff to do that. Some LLM stuff and some external stuff like combining snippets of different solutions to generate new solutions. But what really worked well was Daniel's DFS algorithm. It was very quick and gave us so many different candidates we could score and very few bad candidates. So yeah, it
The two approaches worked very well together. It was a very synergistic approach. Did you ever see any catastrophic failure modes with the DFS where it would just search for an unreasonable amount of time before hitting the threshold? No, not really, because we limited the search depth to
10 percent. It essentially happened that it searched quite a lot of solutions there sometimes, but we set the limit in such a way that it didn't lead to a big problem. Because the probability mass has to distribute over all paths, right? So if the lower limit is 10 percent, then there can't be more than 10 paths that are valid at the same time.
Do you think it would be possible before you do the depth-first search to predict what the computational budget is or to predict what the search tree might look like? That's an interesting question. I think it might be quite difficult because you don't know until you have actually done some searching. So you could probably predict it in an estimation after you've searched a part of it.
Or you could go for iterative deepening, like you first go to 10%, then to 5%, and so on, if you want to search deeper. That would also be a possibility if you want to limit the compute.
I'm just thinking you could take the, you prompt the model with that solution and then you could just use a classifier or something and train the classifier on different topologies of trees that you found previously. And you might be able to sort of predict a priori, how much is this going to take me? You know, how much computation should I invest in this problem? If we have the solution, then that works. Like, you can just check the
logits for each pixel and then measure basically the entropy, like how distributed is the probability. And if the probability is distributed enough, then that is a branch in the tree. But then you only have like the one path through the tree and then branches from there. Predicting the whole tree is probably basically equivalent to predicting the correct solution in some ways. Like you have to know the solution to know what the tree looks like.
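A sketch of that entropy check: with a candidate solution teacher-forced through the model, positions whose next-token distribution is spread out are the likely branch points of the tree. The distributions and threshold below are made up for illustration:

```python
import numpy as np

def branch_points(probs_per_position, threshold=0.5):
    """probs_per_position: (seq_len, vocab) distributions; returns indices with high entropy (nats)."""
    p = np.clip(np.asarray(probs_per_position), 1e-12, 1.0)
    entropy = -(p * np.log(p)).sum(axis=-1)
    return np.where(entropy > threshold)[0]

probs = [
    [0.97, 0.01, 0.01, 0.01],   # confident: not a branch point
    [0.50, 0.45, 0.03, 0.02],   # two plausible continuations: branch point
    [0.90, 0.05, 0.03, 0.02],
]
print(branch_points(probs))     # -> [1]
```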
We didn't touch on this before. So you were doing a minimum entropy search. But what would happen if you tried-- I'm not sure if we do a minimum entropy search. It's a little bit like Monte Carlo tree search. So it's not doing a minimum entropy. But if you look at the entropy distribution, it's searching the space of low entropy. The problem is-- the problem here is, I think, that we have-- you can imagine we have a rank order for each pixel for the next prediction.
Pixel color 1 is most probable, then pixel color 0. And for basically every problem there are never more than three possible continuations. So you have to be very careful, if you want to tweak it or try different things, that you don't get into the seven colors that are basically impossible. Like, for many problems there are, for example, only three colors.
And if you do a maximum entropy or very broad search, you would get colors that are just impossible. And you have to safeguard against that. But at the end, every smart algorithmic improvement we tried,
performed worse. Like, the LLMs are just smarter than us. There must have been some relationship between, you stick a candidate solution in there and you've got this entropy distribution, and you're saying in some circumstances it's quite pointed. So it's like the model would say, I want to go in this direction and no other direction. And sometimes it's spread out. What can you infer from that? In general, I think the pointed ones, like if you have very few candidates, they are often just correct.
Like, the model has a very good idea of how the problem is structured. So if there are several solutions, then that is mostly because there's some link missing in the understanding of the problem, or the model just doesn't understand the problem at all. Like, we did a lot of testing and...
the problems that are hardest for models were often counting or I think flood... No, flood fill worked very well, but mostly counting or size estimation problems. And in that case, because the model was very unsure, it predicted often a large amount of possible candidates.
In the other cases, when there are few candidates, then they're almost always correct, I think. And how did you deal with false positives? So I guess there are some situations where you found solutions which, you know, now when you look at the sum of the log probabilities, you know, like it's telling you this is the solution. And it might just not be the solution because it's not the solution, or it might be a false positive because it's, you know, like the right solution for the wrong reason or whatever. What do those failure cases look like?
depends strongly on the problem, I think. One of the most interesting things I've seen in analysis was a task where you have color shifting. Like, you have four colors in the upper left corner and the goal is to shift the colors in an object clockwise direction. And the model solves that problem. The second best solution is also a color shift just in the wrong direction, which I found very fascinating because it means that the model
conceptually understands the problem and even the second best solution is a conceptually right solution, it just goes in the wrong direction. In the cases where the model has a false positive or doesn't solve the problem, it's almost always still conceptually in the right direction. Like it understands I should fill the space with something or I have to generate an object here.
But sometimes it just doesn't understand which object, sometimes it fills the wrong space, stuff like that.