The accuracy in the ARC-AGI Prize competition in 2024 rose from 33% to 55.5% on a private evaluation set.
The two main successful approaches were deep learning-guided program synthesis and test-time training paired with transduction, where models directly predict output grids from a task's demonstration pairs.
Test-time training fine-tunes models on the demonstration pairs at inference time, unlocking a higher level of generalization and lifting transduction accuracy from below 10% to around 55%.
The logarithmic relationship between test-time compute and performance indicates that while more compute can improve results, better ideas provide significantly more leverage, as seen in solutions achieving roughly 55% accuracy with $10 of compute matching others that used $10,000.
Induction involves writing programs to map input to output grids, while transduction directly predicts output grids. Induction is formally verifiable, whereas transduction relies on guessing, making induction more reliable for generalization.
Consciousness acts as a self-consistency mechanism in system 2 reasoning, ensuring that iterated pattern recognition remains consistent with past iterations, preventing divergence and hallucination.
Chollet envisions a future where programming is democratized, and users can describe what they want to automate in natural language, with the computer generating the necessary programs iteratively through collaboration.
The original ARC benchmark is flawed due to task redundancy and susceptibility to brute force and overfitting. ARC-2 addresses this by increasing task diversity, introducing a semi-private test set, and calibrating difficulty across evaluation sets.
Deep learning-guided program synthesis combines intuition and pattern recognition with discrete reasoning, allowing models to iteratively construct symbolic programs through guided search, which Chollet believes is closer to how humans reason.
Chollet defines reasoning in two ways: applying memorized patterns (e.g., algorithms) and recomposing cognitive building blocks to solve novel problems. The latter, which involves adapting to novelty, is more critical for AI advancement.
I've never been in the purely symbolic camp. Like if you go back to my earliest writing about, yes, we need program synthesis, I was saying we need deep learning guided program synthesis. We need a merger of intuition and pattern recognition together with discrete step-by-step reasoning and search into one single data structure. And I've said very, very repeatedly for the past eight years or so that
human cognition really is a mixture of intuition and reasoning, and that you're not going to get very far with only one of them. You need the continuous kind of abstraction that's provided by vector spaces, deep learning models in general, and more discrete symbolic kind of abstraction provided by graphs and discrete search. So why do your detractors see you as a symbolist when you're clearly not?
Well, I'm not sure. I've been into deep learning for a very long time, since basically 2013. I started evangelizing deep learning very heavily around 2014. And back then the field was pretty small. Especially with Keras, I think I've done quite a bit to popularize deep learning, make it accessible to as many people as possible.
I've always been a deep learning guy, right? And when I started thinking about the limitations of deep learning, I was not thinking in terms of replacing deep learning with something completely different. I was thinking of augmenting deep learning with symbolic elements. What is your definition of reasoning? I don't really have a single definition of reasoning. I think it's a pretty loaded term and you can mean many different things by that. But there are at least
two ways in which I see the term being used and they're actually pretty different. So for instance, if you're just, let's say, memorizing a program and then applying that program, you could say that's a form of reasoning. Like let's say in school you're learning the algorithm for multiplying numbers, for instance. Well, you're learning that algorithm, and then when you have a test, you're actually applying the algorithm. Is that reasoning? I think yes, that's one form of reasoning.
And it's the kind of reasoning that LLMs and deep learning models in particular are very good at. You're memorizing a pattern, and at test time you're fetching the pattern and reapplying it. But another form of reasoning is when you're faced with something you've never seen before, and you have to recompose, recombine the cognitive building blocks you have access to, so your knowledge and so on, into a brand new model and do so on the fly. That is also reasoning.
But it's a very, very different kind of reasoning and it involves very different kinds of capabilities. So I think the important question about deep learning models and LLMs in particular is not can they reason? There's always some sense in which they are doing reasoning. The more important question is can they adapt to novelty? Because there are many different systems that could just memorize programs provided by humans and then reapply them.
What's more interesting is can they come up with their own programs, their own abstractions on the fly? Broadly speaking, I think programming from input-output pairs will be a widespread programming paradigm in the future and that will be accessible to anyone because you don't need to write any code. You're just specifying what you want the program to do and then the computer just programs itself.
And if there's any ambiguity, by the way, in what you meant, and there will always be ambiguity, right? Especially if the instructions are provided by a non-technical user.
Well, you don't have to worry about it because the computer will ask you to clarify. It will tell you, "Okay, so I created basically the most plausible program given what you told me, but there's some ambiguity here and there. So what about this input? Currently I have this output. Does that look right? Do you want to change it?" As you change it, you know, iteratively, you are creating this correct program in collaboration with the computer.
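As a rough sketch of the kind of interactive loop being described here, the toy below resolves ambiguity between candidate programs by probing the user; the candidate space, the probe inputs, and the `ask_user` callback are all hypothetical stand-ins, not anything from the conversation.

```python
# Toy illustration of "programming from input-output pairs" with a
# clarification loop. Everything here (the candidate space, the probes)
# is a hypothetical stand-in for a real synthesizer.

CANDIDATES = {
    "double": lambda x: x * 2,
    "add_two": lambda x: x + 2,
    "square": lambda x: x * x,
}

def consistent(fn, examples):
    """A candidate survives only if it reproduces every demonstration pair."""
    return all(fn(x) == y for x, y in examples)

def synthesize_interactively(examples, probe_inputs, ask_user):
    # Start from every candidate that matches the user's demonstrations.
    survivors = {name: fn for name, fn in CANDIDATES.items()
                 if consistent(fn, examples)}
    for probe in probe_inputs:
        if len(survivors) <= 1:
            break
        # Where candidates disagree, ask the user to resolve the ambiguity.
        outputs = {name: fn(probe) for name, fn in survivors.items()}
        if len(set(outputs.values())) > 1:
            chosen = ask_user(probe, sorted(set(outputs.values())))
            survivors = {n: f for n, f in survivors.items()
                         if outputs[n] == chosen}
    return survivors

# Example: the single pair (2, 4) is ambiguous between all three candidates;
# probing x=3 forces the user to disambiguate (5 vs 6 vs 9).
result = synthesize_interactively(
    examples=[(2, 4)],
    probe_inputs=[3],
    ask_user=lambda probe, options: options[1],  # pretend the user picks 6
)
print(result.keys())  # dict_keys(['double'])
```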
Is there just a massive new type of architecture we need to build for this? Yeah, I think we're going to need a completely new type of architecture to implement lifelong distributed learning, where you have basically many instances of the same AI solving many different problems for different people in parallel.
looking for commonalities between the problems and commonalities between the solutions. And anytime they find sufficient commonalities, they just abstract these commonalities into a new building block, which goes back into the system and makes the system more capable, more intelligent. I think I've got it now. So what you're building is a globally distributed AGI, on the basis that we find a good solution to ARC.
Well, I can't really tell you exactly what we're building, but it's going to be cool. Yeah, it sounds pretty cool. What's your theory on how O1 works? Well, you know, we can only speculate. I'm not sure how it really works. But what seems to be happening is that it is running a search process in the space of possible chain of thought.
trying to evaluate which branches in the tree work better, potentially backtracking and editing if the current branch is not working out.
it ends up with this very long, sophisticated, and plausibly near-optimal chain of thought, which represents basically a natural language program describing what the model should be doing. And in the process of creating this program, the model is adapting to novelty. And so I think something like O1 is a genuine breakthrough in terms of the generalization power that you can achieve with these systems.
We are far beyond the classical deep learning paradigm. MLST is sponsored by CentML, which is the compute platform specifically optimized for AI workloads. You might have seen the interview we did last month with Gennady, their CEO and co-founder.
He spoke about some of the optimizations in CentML, which only they have done, and which make it dramatically faster than the competition. Anyway, if you're interested in hacking around with CentML, I'm going to be doing a live stream with one of their engineers in the coming weeks, so feel free to jump on that and ask their team any questions.
Tufa AI Labs is a very, very exciting new research lab that's just started in Zurich. They are looking for amazing ML engineers to join their team. It's a very small team. It's very exciting. They are doing things like test-time compute, inference-time scaling. They want to get models to reason, to do this amazing system 2 thing that we've been talking about for years on MLST. If that sounds like you, go to tufalabs.ai.
I love the show in general. I think of it as like the Netflix of machine learning. Well, you're going to learn about ARC 2. You're going to learn about what we've learned from running the ARC Prize competition in 2024. And maybe you're going to pick up on some of my current thoughts on abstraction generation and how to implement AGI. And you're going to find out what Francois is doing next? Well, a little bit about it. A little bit. A little bit.
Wonderful. Francois Chollet, it's an honor to have you back on the show. Thank you so much. Thanks so much for having me. So we're now at the end of the ArcAGI Prize, and you just released a technical report about it. But can you reflect on the competition?
Right. I think we learned a lot and overall it's been a big success. I think, you know, in 2024, we've seen a huge shift in the narrative around AI where previously, you know, the mainstream narrative was that
we could just train larger models on more data, you know, 100x larger models, 100x more data, and get something that's basically AGI. And while more recently, in the past year or so,
there's been this realization that actually we were going to need something akin to System 2 reasoning, that it is not something that will simply just emerge from pre-training larger models on the larger datasets, that you needed to
add it to the system somehow. And of course, you can use test-time search. You can use program synthesis. You can use these sorts of techniques to do it. And I think ARC Prize was really part of that narrative shift. And it's part of the reason also why ARC Prize has been very popular with lots of teams entering, lots of people talking about it, and using it as a kind of reference for whether we might have achieved AGI or not, which, by the way, is not what
ARC is intended to be. It's not intended to be an indicator of whether we have AGI. It's just intended to be a tool, a research tool that gets you to think about the right problems, to focus on the right directions. And
Really, the reason why it's been successful is because there was a latent demand for something like this. Many people had this intuition that, yes, plain LLMs weren't going to get us to AGI, that we needed something more, either as a replacement for LLMs or some kind of superstructure around them that would implement system 2.
And there was this intuition floating around. And I think lots of people just latched onto ARC Prize as a concrete sign that their intuition was right.
So there were two flavors of the ARC prize. There was a compute-restricted one and there was the main one. Can you reflect on the difference between the two in terms of the entries? Sure. So we had the main track for the competition on Kaggle, and this was only for submissions that were self-contained.
And the reason why is primarily because we need to keep the private test set fully private, right? So we cannot send it to a third-party server via an API. And so on this track, people are submitting notebooks effectively that are going to have to run on a VM in less than 12 hours, and the VM just has a P100 GPU. So this is equivalent to a total of roughly 10 bucks worth of compute per submission.
And then we had the public leaderboard, which was targeted at frontier models. And it was evaluated on a different set of tasks because, of course, we cannot evaluate on the private test set that would just leak the private test set to a third-party server. So instead, it's evaluated on an entirely new set of tasks, which we call the semi-private test set.
test set. So it's semi-private because it's not published anywhere, so it's not public. But it's also not entirely private since we are actually sending the task data via API to a company like Anthropic and so on. And each submission on that leaderboard can use up to $10,000 in API credits. So that's 1,000 times more
in terms of compute worth than the private leaderboard. And on a per-task basis, it's actually a little bit less because we're evaluating on more tasks. We're evaluating on the 400 tasks from the public eval and then the 100 from the semi-private eval, which is really what we're looking at.
And so on a per-task basis, that's about 200 times more compute. And what was really remarkable and frankly quite shocking is that the scores you end up seeing on the public leaderboard track what you're seeing on the private leaderboard. Like in both cases, we're at about 55%. And this tells you that really it
It's not just about throwing more compute at the benchmark. Compute is really a multiplier for ideas. And of course, if you have infinite compute, you can solve the benchmark in a very stupid way, like via brute force search, for instance. But having better ideas
really gives you dramatically more leverage for your compute, right? And so that's why we end up with solutions consuming 10 bucks of compute that are doing like 55%, and meanwhile solutions consuming 10k worth of compute doing the exact same. They're just not nearly as compute efficient.
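A quick back-of-envelope check of those figures, purely illustrative and assuming a 100-task private test set alongside the 400 public-eval and 100 semi-private tasks mentioned above:

```python
# Back-of-envelope check of the compute figures quoted above.
kaggle_budget_per_submission = 10        # ~$10 of compute per Kaggle submission
public_budget_per_submission = 10_000    # up to $10,000 in API credits

kaggle_tasks = 100                       # assumed size of the private test set
public_tasks = 400 + 100                 # public eval + semi-private eval

per_submission_ratio = public_budget_per_submission / kaggle_budget_per_submission
per_task_ratio = (public_budget_per_submission / public_tasks) / (
    kaggle_budget_per_submission / kaggle_tasks)

print(per_submission_ratio)  # 1000.0
print(per_task_ratio)        # 200.0  (~$20 per task vs ~$0.10 per task)
```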
So which was the most successful method, which worked across the board? There were really two categories of approaches that worked well. So one is deep learning-guided program synthesis, which is really my favorite approach personally and what I've been advocating for for many years. And so most people nowadays are doing deep learning-guided program synthesis using LLMs. They're using LLMs to generate code, LLMs to iteratively debug code.
Some people are trying to do deep learning-guided program synthesis using building blocks from a DSL. I think this is a very under-explored approach, but I think it should be an effective one.
And the other category of approaches has been test time training, where you're going to be using an LLM to directly try to predict the solution given a task description. So you're looking at a set of demonstration pairs, and then you're looking at an input grid, and you're directly trying to generate the output grid, which is a process that we call transduction.
As opposed to induction, that is, program synthesis or program induction, where you're trying to write down the program that will map the input grids to the output grids.
And in transduction, you're just trying to directly predict the output grid. And of course, if you try to do this with LLMs and you stay within the classical deep learning paradigm, where you have this big model that's pre-trained on tons of data and then at inference time it is static and you're really just running the model, if you're staying within that paradigm, you really cannot adapt to any meaningful amount of novelty, right?
You are stuck with memorizing patterns and at test time fetching and reapplying the patterns that you've memorized. And to go beyond that, people have started using test-time training, which is the idea that you start by pre-training this base model that knows about ARC, knows about ARC tasks,
and at inference time, on each new task that you see, you're going to try to fine-tune the model, the base model, on the demonstration pairs to basically try to recombine the knowledge contained within its latent space into a new model that is adapted to the task at hand.
And if you don't do this, if you don't do this test time adaptation, then LLM-based models, transduction models, they are stuck below 10% accuracy, roughly. But if you start doing test time training, then you're unlocking a dramatically higher level of generalization, and you can go well into the 50%, 55%, probably even 60% soon.
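A heavily simplified sketch of the test-time-training idea, under stated assumptions: the tiny PyTorch model and the flatten-and-pad grid encoding below are toy stand-ins, whereas actual entries fine-tuned LLMs on the demonstration pairs, typically with LoRA adapters and heavy data augmentation.

```python
# Minimal sketch of test-time training: clone a pretrained base model and
# fine-tune it on the task's demonstration pairs before predicting the test
# output. The tiny model and the flatten() encoding are toy stand-ins.
import copy
import torch
import torch.nn as nn

def flatten(grid, size=100):
    """Toy encoding: pad and flatten a small colour grid to a fixed-length vector."""
    flat = [c for row in grid for c in row]
    flat += [0] * (size - len(flat))
    return torch.tensor(flat[:size], dtype=torch.float32)

base_model = nn.Sequential(nn.Linear(100, 256), nn.ReLU(), nn.Linear(256, 100))

def test_time_train(base, demo_pairs, steps=50, lr=1e-3):
    model = copy.deepcopy(base)          # never touch the shared base weights
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    xs = torch.stack([flatten(i) for i, _ in demo_pairs])
    ys = torch.stack([flatten(o) for _, o in demo_pairs])
    for _ in range(steps):               # fine-tune on this one task only
        opt.zero_grad()
        loss = nn.functional.mse_loss(model(xs), ys)
        loss.backward()
        opt.step()
    return model

demos = [([[1, 0], [0, 1]], [[0, 1], [1, 0]]),
         ([[2, 0], [0, 2]], [[0, 2], [2, 0]])]
adapted = test_time_train(base_model, demos)
prediction = adapted(flatten([[3, 0], [0, 3]]))   # transduce the test input
```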
So the big question is whether that's in the spirit of the challenge or not. So I've done some interviews on this transduction with active fine-tuning where you take the test instances, you do some data set generation, some augmentation, you fine-tune the model, you do really well. And certainly as a methodology to broadly generalize to lots of tasks,
It's good, but the problem is it still has human supervision. So you've stressed from the very beginning in your measure of intelligence that we need developer-aware generalization, which is simply that we can't have a human supervisor specializing the thing for every downstream task. We need to make a system that can itself generalize to tasks that the developer of the system wasn't aware of. So by that metric, do you feel that it's not in the spirit of ARC?
No, I think it's a completely legitimate way to approach the challenge. And I also think it represents a very significant breakthrough in generalization power and in the ability of these models to adapt to something they've not seen before. And I don't think the supervision that we are talking about is really done by humans.
you're using the demonstration pairs to fine-tune the model. This is actually fairly autonomous. Of course, this needs to have been programmed by a human aware of the task format, but the same would be true for a program induction type approach. So I think it's very much in the spirit of the challenge, and further, I think it does demonstrate a legitimate breakthrough in generalization.
So pressing on the legitimacy thing, I mean, it stands to reason that we do some kind of active inference. Of course. So we're always adapting to novelty, building new models and so on and so forth. So what is the difference between transductive active fine tuning and what we do?
Well, I'm not sure what we do exactly. Of course, we are doing active inference. What does that mean exactly? What algorithms, what data structure are we leveraging? We don't know. So I can't really tell you what's the difference. It does seem to me, I will say one thing. So when you're doing test time training with an LLM, you are letting...
a gradient descent process do the knowledge recombination. So to adapt to novelty, it is necessary to take the knowledge that you have and recombine it in some way. And there are multiple different things you could be doing to achieve that. You could be doing program search, right? Where the thing that is adapted to the new task is a program, and you are building this program via a search process. You could also
do something like what the O1 model from OpenAI is doing, which is similar to that, where you are doing search effectively in the space of chains of thought and you are writing down this chain of thought, which is basically a natural language program for the model to execute. And you're doing this search via an AlphaZero-style search process, like a tree search process.
So that's one take. You can also just use discrete program search to write down the program. The program is the artifact that models the task at hand. Or you can try to modify the weights of the model, modify the representations of the model, to create a new model that's adapted to the task. In this case, the artifact that's adapted to the task is the model itself. And that's what test-time-training does, and it does it via gradient descent.
And well, my take is I don't think humans adapt to novelty by recombining what they know via gradient descent specifically. I think the level at which we represent knowledge, especially in the context of solving ARC puzzles, is much more abstract and symbolic in nature.
and the way we combine it is much closer to function composition than it is to what you can achieve with gradient descent. In general, I don't think it's a good idea to try to use gradient descent as a replacement for a programming process. So I'm actually more of a fan of what a system like O1 is doing than trying to do test time training.
We'll get to O1 in a second, but there are folks who are really bullish on this transduction thing. And I think you and I agree that the reason it doesn't work in principle is because language models are just finite state automata. They don't have this compositional generalization even in principle, right? But we know there's evidence that transformers on their own, they can't do basic things like copying and counting and all of these kinds of things.
Some people are bullish because they think we could improve the architecture so that it could do those kinds of things. And then a transductive approach might work in the future. I mean, would you rule that out?
No, I think it's entirely plausible. As you point out, even given lots of data, there are many algorithmic tasks that you cannot train a transformer to do. Or even if you can, it will learn a solution that does not generalize very well. It will work on inputs that are pretty close to what it's been trained on, but if you try an input that's very far away from the training distribution, it will just fail.
And, well, people who think we can move past these limitations, they're saying that we can make architecture tweaks. And, well, you know what? They are right. Like, it is always possible to take a deep learning model, modify the architecture to bake into it some strong structural prior about the algorithmic problem that you're trying to solve.
And now you can actually use gradient descent to find a solution that will generalize. But the way this works is by asking a human engineer effectively to first understand the task at hand and convert that understanding, that symbolic understanding, into a better architecture, an architecture that is in some important ways isomorphic to the causal structure of the problem.
And so, of course, you know, if you want to autonomously adapt to novelty, you cannot just require a human engineer to intervene and rewrite your architecture, right? The process has to be fully autonomous. And so the question is, can you create an architecture search or architecture generation machine that will take a problem, identify
the key elements of the problem structure that you need to bake into your architecture, and then generate the architecture? If you can do that, then sure, maybe you can leverage that to achieve strong generalization. But I think that's a problem that is at least as difficult as problem-solving in the general case.
Yeah, it's also pretty overfit as it turns out. So you know, on the public leaderboard, we are evaluating on the public eval set, but we're also evaluating on this semi-private eval set. And the reason we're doing that is to test for overfitting. Some solutions might be overfit to the public eval. And it was actually the case for their solution. They scored something like 10 percentage points lower on the semi-private set. And other solutions,
based on program synthesis in particular, are not featuring that drop at all. They're actually scoring the exact same on both sets. So it kind of tells you that whatever they did was in some important way overfit to the data that they had. Can we touch on that as well? So you said when you take an ensemble of all of the original Kaggle 2020 competition results, it got to about 49%. That's right. Tell me more.
Right, so in the very first ARC-AGI competition on Kaggle back in 2020, the highest score by a single submission was only 20%. So that was the winner, IceCuber. And he was doing just basic brute-force program enumeration. But if you looked at all of the submissions in the competition and you assembled them together, you would see a high score of 49%,
which would have been until very recently state-of-the-art. And that was, again, like four years ago. And what that tells you is that there's about half of the private test set that's easily brute-forceable, because every single entry in the competition back then, in 2020, was doing some kind of brute-force program enumeration. That is not AGI. That is not the sort of solution that we are looking for.
And so the fact that doing this kind of stuff at scale could get you 49% is a very strong sign that the benchmark is flawed.
And well, today, you know, if you look at the 2024 competition, the state of the art for any single submission is about 55%, right? So you could say, okay, so we are very far from having solved the benchmark, since the bar is 85%. And we also know that humans can solve very close to 100%. Like if I showed you the private test set, you'd probably do like, you know, 97, 98, 99% type thing.
And if you take an ensemble of everything that was submitted in the competition in 2024, you would get to a high score of 81%, right? Which is pretty close to what we are looking for. And I don't think, you know,
Anything that was tried in the competition this year is really close to AGI in the meaningful sense. And the ensemble of everything is still not close to AGI. It just shows that scale, just brute force compute scale, will eventually crack the challenge. And what that says is really that the benchmark is flawed and now it is close to saturation.
and that we need something else. And that's why we're working on ARC2. So ARC2 is not exactly a novel idea. It's not a reaction to the results that we got in 2024.
I first publicly announced ARC2 in early 2022, so a while back. And it was back then kind of a reaction to the 2020 competition results where I was aware that the benchmark had flaws, that it was not quite diverse enough in terms of task diversity. Not every task was quite unique. There was some amount of redundancy.
it might not have been quite challenging enough. And so I wanted to do V2. And so back in 2022, in partnership with Lab42 in Davos, we started crowdsourcing a bunch of new tasks. And since then, we've kept making a lot more new tasks.
We've started filtering them, trying to analyze which ones were difficult for humans, which ones were difficult for AI. We've collected a lot of human testing data as well. We hired people to actually try to solve them. That gives you a lot of information about how many attempts they use for different tasks.
how many people solved each task. And you can turn this information into a kind of human-facing difficulty rating. Then you can try to cross-correlate it with what AI can do. So we are going to be releasing ARC 2 early next year. It's going to be addressing all of the flaws of ARC 1. So it's going to be slightly more data,
It's going to be leveraging three sets. So that's going to be the public eval, of course. That's going to be the semi-private eval, and that's going to be the private eval. And one problem with the first few competitions on Kaggle is that we've been reusing the
the same private test set across every competition, and anytime anyone made a submission they could immediately see their score on the private eval. And so this can actually lead to some amount of information leakage about the private test set over time. And in fact, there are very well documented techniques for how, given enough submissions, you can start to reconstruct the contents
of the private eval. So we want to avoid this in the competition next year. And the way we're doing this is, obviously, that we're going to be evaluating on the semi-private eval during the competition. So when you submit something, you get your semi-private eval score. And only at the very end of the competition, to create the final leaderboard, are we actually going to run the submissions on the fully private test set.
And another nice thing if you do that is that this enables you to make direct apples-to-apples comparisons between the Kaggle leaderboard and the public leaderboard with all the frontier models. And of course, it's not going to be the same amount of compute, but the score is going to be apples-to-apples. You're going to be able to say, like, okay, so MindsAI on this test set, they're scoring as well as, I don't know, O1 Pro or whatever, you know.
So, you know, the Kaggle 2020 ensemble's 49% when you take the aggregate.
First of all, which ones were more brute-forceable? Because I suppose one way of looking at it is, certainly with the mosaic patterns, that in the solution space there's an exponential number of combinations. But of course, in the compositional space, it might be brute-forceable. But we're also starting to see with the induction and transduction methods that they can solve different types of problems. And maybe when you do human evals as well, that humans can solve different types of problems. And there are all these overlapping Venn diagrams. How do you think about that?
Yes, it's actually one of the most interesting findings of the 2024 competition, that program induction and doing transduction with deep learning models, typically an LLM, lead to solving substantially different sets of tasks. And so this was surprising to me, but in retrospect, it makes sense, especially if you start looking at the tasks
and analyzing what makes them different. There are some tasks that are very perceptual in nature. They are effectively pattern recognition problems. And this is the kind of task that you solve well with transduction methods.
And there are other tasks that are much more algorithmic in nature, much more discrete in nature. And you cannot provide an easy solution based on pattern recognition, but it is very easy to just write down an algorithm to produce a solution. And in reverse, if you look at
puzzles that are very perceptual, it is very very challenging to write solution programs for them, right? Because the program would have to formalize a lot of perceptual concepts that make intuitive sense for us but that are actually very difficult to express in program form, right? Kind of like imagine
trying to recognize, let's take something outside Arc. Let's say, trying to recognize the letter A, for instance. If you're just looking at letters,
you can immediately see it. If you have pre-trained visual knowledge of the shape of letters, this is a trivial problem. You just recognize letters. It's just what you do. It's pattern recognition. But if you try to write down an algorithm that would take a handwritten letter and tell you whether it's an A, for instance, that's actually a tremendously difficult problem.
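The contrast being drawn here can be seen in miniature below: a few lines of gradient descent yield a workable digit recognizer, whereas a hand-written rule program for the same job would be enormous. Standard MNIST stands in for the handwritten-letter example; this is only an illustrative toy, not anything ARC-specific.

```python
# A perceptual problem handled by a learned model rather than explicit rules.
# One pass over MNIST is enough to make the point (downloads the dataset).
import torch
import torch.nn as nn
from torchvision import datasets, transforms

train = datasets.MNIST("data", train=True, download=True,
                       transform=transforms.ToTensor())
loader = torch.utils.data.DataLoader(train, batch_size=128, shuffle=True)

model = nn.Sequential(nn.Flatten(), nn.Linear(28 * 28, 128),
                      nn.ReLU(), nn.Linear(128, 10))
opt = torch.optim.Adam(model.parameters(), lr=1e-3)

for images, labels in loader:            # a single epoch of fuzzy pattern learning
    opt.zero_grad()
    loss = nn.functional.cross_entropy(model(images), labels)
    loss.backward()
    opt.step()
```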
Right. A couple of thoughts on that. I mean, first of all, on that Kevin Ellis paper, the induction transduction paper, maybe for political reasons, it feels like they wanted to find an inductive explicit function. So in their ensemble method, they first searched the inductive functions by greenblatting it, and then they failed over to the transduction. And the thought occurs that just because we can't write a Python program to recognize a digit,
Surely such a Python program could exist and shouldn't we be thinking about making models generate such a program? Maybe the reason they can't is because nothing like that is in distribution. I think the reason why it's difficult is that we are talking about an input space that is structurally continuous and where
decision boundaries are fuzzy, basically. So you're talking about a problem that is fundamentally a pattern recognition problem, and neural networks are just intrinsically a good data structure to approach this type of problem, and discrete symbolic programs are not. So I really think it depends on what problem you're looking at. There are problems where vector spaces are the right data structure, and problems where symbolic discrete programs are the right data structure.
I guess an interesting thought experiment is there must exist a Python program which does what an MNIST model does. And what is the simplest possible representation of that program? Would it still be ridiculously complicated? I think the simplest representation of the program will look a lot like what the convnet is doing, to be honest. Yeah, I think that that's the clincher, isn't it? So there's no way of kind of decomposing it into a much simpler version.
Yeah, I think it really depends on the nature of the problem. And for some problems, program synthesis is just a bad idea, right? And perceptual problems are certainly in this category. And the other way around, for some problems, trying to use a pattern recognition machine is just a bad idea, problems that are algorithmic in nature. How do you think we could effectively combine induction and transduction methods?
Well, the way Kevin Ellis and team are doing it in their paper is basically they're starting with induction and then they're falling back on transduction when it doesn't work. And I think that's a very smart strategy because induction is actually formally verifiable. You can try to run your candidate program on the demonstration pairs that you have access to and see, well, first of all, does it run and so on.
does it get you the right result? And if it does, you can have a fairly high degree of confidence that it is going to generalize. Whereas when you're doing transduction, it's more like you're guessing where the answer might be.
and you don't really have any way to verify that this is the right guess. So one thing you can do is just increase the sample size. You can make many independent guesses, and then you can look at which answers come up the most often. But you're basically making the assumption that
Wrong guesses are all going to be wrong for different reasons. So you're going to end up with different wrong answers. But correct guesses are all going to be correct for the right reasons. So the correct answer will show up more often. But really, you have no way to--
make sure 100% that your guesses are correct. So it's much smarter to start with induction because then you can have a high degree of confidence that the solution you have is the right solution and then fall back when it just doesn't work. So basically induction is just the method you should prefer and you should only use transduction if it's not working.
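A sketch of that verify-first strategy, with the program generator and the transduction sampler left as hypothetical callables rather than the actual models used in the competition:

```python
# "Induction first, transduction as fallback." `propose_programs` and
# `sample_transduction_guess` are hypothetical stand-ins for an LLM-based
# program generator and a transduction model.
from collections import Counter

def verified(program, demo_pairs):
    """Induction is checkable: the candidate must reproduce every demo pair."""
    try:
        return all(program(inp) == out for inp, out in demo_pairs)
    except Exception:
        return False

def solve(task, propose_programs, sample_transduction_guess, n_guesses=32):
    demos, test_input = task["demos"], task["test_input"]

    # 1) Try induction: any program that passes all demos is trusted.
    for program in propose_programs(demos):
        if verified(program, demos):
            return program(test_input)

    # 2) Fall back to transduction: sample many guesses and majority-vote,
    #    assuming wrong guesses disagree while correct ones coincide.
    guesses = [sample_transduction_guess(demos, test_input)
               for _ in range(n_guesses)]
    hashable = [tuple(map(tuple, g)) for g in guesses]   # grids -> tuples
    best, _ = Counter(hashable).most_common(1)[0]
    return [list(row) for row in best]

# Toy usage with stub generators (an identity program; a constant guess).
task = {"demos": [([[1]], [[1]])], "test_input": [[5]]}
print(solve(task,
            propose_programs=lambda demos: [lambda g: g],
            sample_transduction_guess=lambda demos, x: [[0]]))   # -> [[5]]
```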
Should we think of them as being completely different? So hypothetically, if you used a shared model for doing induction and transduction, do you think there could be some crossover between it?
Absolutely. In fact, this is something that some people in the competition have tried. They're using the so-called OmniARC approach, where the team is using the same model to solve a range of different arc-related tasks, including writing down the program, but also interpreting programs, doing transduction, generating more inputs.
and so on. So all these different tasks with one single model and that does lead to learning better representations for the concepts that you find in ARC. Can you give me some more intuition on that? So in a sense, you can get the network
to think about the symbolic version at the same time as the solution space? The basic intuition is that if you look at the same problem from different angles, you are more likely to come up with the true shape of the problem. And this is especially true if your data structure of choice is a neural network because neural networks have this tendency to latch on
to noisy statistical regularities. And if you're only targeting one problem and only using one input modality, you're much more likely to overfit to elements of noise within that problem. But if you're forcing the same representations to work across many different views of the problem, it acts as
well, first of all, you get better information about the problem, right? Because there is some knowledge transfer, some information exchange, between things like trying to predict the output grid and trying to generate more input grids, for instance, right? But also it acts as a regularization mechanism, where the noise that you might be learning with one of the modalities is going to be countered with what you're learning with another modality.
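A sketch of what those "many views of the problem" could look like as training data for a single model; the prompt formats here are invented for illustration, and the real Omni-ARC task formats may differ.

```python
# One ARC task rendered as several training views for a single shared model.
# The prompt formats are invented for illustration only.
def grid_to_text(grid):
    return "\n".join(" ".join(str(c) for c in row) for row in grid)

def multitask_views(demos, solution_program_source):
    views = []
    for inp, out in demos:
        # View 1: transduction -- predict the output grid from the input grid.
        views.append(("predict output:\n" + grid_to_text(inp), grid_to_text(out)))
        # View 2: induction -- write the program that maps input to output.
        views.append(("write program for:\n" + grid_to_text(inp) + "\n->\n"
                      + grid_to_text(out), solution_program_source))
        # View 3: generation -- produce another plausible input for this task.
        views.append(("generate new input like:\n" + grid_to_text(inp),
                      grid_to_text(inp)))
    return views  # (prompt, target) pairs, all trained with one shared model

examples = multitask_views([([[1, 0], [0, 1]], [[0, 1], [1, 0]])],
                           "def solve(g): return [row[::-1] for row in g]")
```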
Another approach from Clement Bonnet, I think his name is, is searching the latent space better. So not necessarily greenblatting, but actually searching the latent space for quite a while before you present a prediction. What do you think about that? Yeah, this is actually personally one of my favorite papers that was submitted in competition. It is this very, very original idea that you're going to be learning...
a latent space of programs. And then at test time, so of course at test time you need some adaptation to the problem at hand, you need some recombination of your knowledge.
And while some people use test-time training and some people use search, what Clement Bonnet and his colleague, Matthew Macfarlane, are doing is very original. So they are learning this latent space of programs. And then at test time, they are doing gradient descent in latent program space to basically move around latent program space and find the point of the space that best matches the task.
I think this is a great idea. There are many ways, I think, to improve the idea, but it's a very, very original take on test-time adaptation that is not search, that is not fine-tuning. So I like it a lot. Would you still call that process thinking? Is thinking a system two process or would you call that thinking as well?
It is a form of test-time search, except it's not discrete search. It's based on gradient descent. So sure, I don't see why you couldn't implement some form of system 2 processing with that. Yeah. I wonder whether that breaks the analogy with human thinking, that it's doing perceptual deliberation.
It's quite an interesting category. It's doing deliberation in latent space, yes. So I think one way to improve the process is that you could also try to decode your latent programs back into a symbolic, discrete form. And then you can start doing local discrete search around the decoded programs. And the benefit of that approach is that you would have the ability to actually run the programs and verify whether they work.
And as long as you stay in latent space, even if you're doing this gradient descent guided search within latent space to find the best possible points,
that represents a target program, you are very much limited to guessing. You have no way to assert that what the latent space is telling you matches the reality on the ground. So the ability to decode back into real program space and run these programs would be a very good addition to the system.
Are there any potential issues with that approach? So I guess it helps if the latent structure is quite homogenous and the modes are easy to find, could it be improved?
So what do you mean by that exactly? So he does gradient search over the latent space and finds some optimal position and then does the inference from there. But wouldn't that work very well if it was quite a convex space, but not so well if it was a very heterogeneous space? Of course. In order to be able to do gradient descent, you need a relatively smooth surface. But I think that's why they're using a VAE.
And I'm not sure if they've tried just directly learning a program embedding space where one point is one program. As it turns out, this is not what they are doing. They are using a VAE. And the reason why is because when you're using a VAE, you are learning much more structured, much smoother latent spaces. And I think this is key to making test-time gradient descent work.
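A minimal sketch of test-time gradient descent in a latent program space, assuming a pretrained VAE-style decoder; the toy decoder below maps a latent code and an input grid straight to an output grid, standing in for the latent-program decoder in Bonnet and Macfarlane's setup.

```python
# Keep the decoder frozen and run gradient descent on the latent code z so
# that the decoded behaviour fits the demonstration pairs. In the actual
# paper the decoder produces a latent program that is then executed; here
# the toy decoder maps (z, input) directly to an output grid.
import torch
import torch.nn as nn

GRID = 4  # toy 4x4 grids

decoder = nn.Sequential(nn.Linear(16 + GRID * GRID, 64), nn.ReLU(),
                        nn.Linear(64, GRID * GRID))
for p in decoder.parameters():
    p.requires_grad_(False)        # assumed pretrained; stays frozen at test time

def latent_search(demo_pairs, steps=200, lr=0.1):
    z = torch.zeros(16, requires_grad=True)            # start from the prior mean
    opt = torch.optim.Adam([z], lr=lr)
    xs = torch.stack([torch.tensor(i, dtype=torch.float32).flatten()
                      for i, _ in demo_pairs])
    ys = torch.stack([torch.tensor(o, dtype=torch.float32).flatten()
                      for _, o in demo_pairs])
    for _ in range(steps):                              # move through latent space
        opt.zero_grad()
        pred = decoder(torch.cat([z.expand(len(xs), -1), xs], dim=1))
        loss = nn.functional.mse_loss(pred, ys)
        loss.backward()
        opt.step()
    return z.detach()                                   # best-fitting latent "program"

demos = [([[1] * GRID] * GRID, [[0] * GRID] * GRID)]
z_star = latent_search(demos)
```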
So other unexplored avenues of ARC, maybe a better way to ask this is, if you yourself spent a year working on ARC, what would you do?
I would be doing deep learning-guided program synthesis. And you know, I think the way people are doing deep learning-guided program synthesis today is wrong. So everyone is leveraging LLMs, which of course makes sense, because these are very powerful tools that contain a lot of useful knowledge that can be reapplied to any problem. We've invested billions and billions of dollars into creating these tools. So not using these tools
would feel like missing out on a lot of power.
But I think it is not the right approach to treat program synthesis as just token by token code generation. I think the right way to think about a program is as a graph of operators and program synthesis is basically a tree search process. I think you're better off trying to use deep learning models and in particular LLMs.
to guide that search process. This is not something that many people are trying today, but I think this would be closer to the right approach. And another thing that people are not doing today, but should be, is that, you know, if you look at the way humans solve ARC puzzles, they are not trying
many different solution programs in their mind. They're only trying a few. I think humans have the capability to first build a descriptive model of what they're seeing, basically describing, like, a grid, for instance, in terms of the objects it contains and their properties,
and their relationships with other objects, with a focus on causal relationships in particular. So you can use these descriptive models to constrain the search space when you're finally looking for input-to-output programs, and that's the reason why
we only need to consider a handful of programs before finding the correct one. So in that sense, it might be possible to do enough modeling of the task to almost make search irrelevant, to almost entirely remove the need for search. It's so interesting what you just said. We should meditate on that just for a second.
First of all, loads of people I've interviewed this week, especially in the neuroevolution kind of space, you know, under Jeff Clune, for example, many of them are latching onto LLMs because they say we need to have a measure of interestingness or novelty. And LLMs, because they're trained on all of the data in the world, capture our instincts really, really well, our intuition and so on. So they're a great way of generating programs.
But you said, though, that it's not such a good idea just to generate the program. The next step of the evolution is guiding the search. And I think we're starting to see this enlightenment in the use of LLMs in many commercial bits of software. So for example, the original use of an LLM in an app was you just have a chatbot and you just stick it in there. Now things like Cursor, for example, are exposing a low-level API and using tool use and so on. And the LLM is actually...
guiding the low-level API interactions in the app. And so you're advocating for a similar evolution here where the LLM actually guides the discrete search process rather than just generating code. That's right. And the idea being that...
By creating your program via this iterative discrete search process, you actually have the ability to make targeted modifications to your program graph that would be significantly harder to make, I think, if you just treat the program as a sequence of tokens. And also, you're changing the nature of the space in which you are making editing decisions.
And graphs are just the natural data structure to represent programs. Programs are not sequences of tokens.
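A sketch of what "program synthesis as guided search over a graph of operators" could look like; the three-operator DSL and the `dummy_scorer` standing in for a deep learning prior are invented for illustration.

```python
# Program synthesis as guided tree search over DSL operators rather than
# token-by-token code generation. `score_candidates` is where an LLM or
# other deep learning model would plug in; here it is a trivial stand-in.
import heapq

DSL = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "transpose": lambda g: [list(r) for r in zip(*g)],
}

def run(program, grid):
    for op in program:
        grid = DSL[op](grid)
    return grid

def fit(program, demos):
    return sum(run(program, i) == o for i, o in demos) / len(demos)

def guided_search(demos, score_candidates, max_depth=4, beam=8):
    frontier = [((), 0.0)]                       # (program graph so far, priority)
    for _ in range(max_depth):
        candidates = [prog + (op,) for prog, _ in frontier for op in DSL]
        scored = score_candidates(candidates, demos)
        for prog, _ in scored:
            if fit(prog, demos) == 1.0:          # verified against all demos
                return prog
        frontier = heapq.nlargest(beam, scored, key=lambda x: x[1])
    return None

def dummy_scorer(candidates, demos):             # stand-in for a learned prior
    return [(p, fit(p, demos) - 0.01 * len(p)) for p in candidates]

demos = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]   # target: horizontal flip
print(guided_search(demos, dummy_scorer))        # ('flip_h',)
```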
Another thing we're seeing is an intermediate solution. So Kevin Ellis, he did this thing called remixing, where you have 100 handcrafted solutions. And for every single ARC task, you have a generator which can generate new instances. So you know, they can generate sprites and layouts and so on. And then they did this kind of expansion where they use retrieval-augmented generation, and it was doing an implicit form of library learning. And it was like mixing all the solutions together. Is that an intermediate step towards what you're talking about?
Not quite. I think that's a separate avenue. The idea being that, yes, in order to get LLMs to perform well on the dataset, you need to expose them to as much, as dense a sampling as possible of ARC space. And of course, there's not a lot of ARC tasks available, so you have to make new ones. And a very easy way to make new ones is to leverage an LLM to sort of extract
the programmatic concepts found in, for instance, the training set and then remix them into new tasks. But at the end of the day, this leads you to pretty severe overfitting, and this is exactly what we're seeing with their solution. Because there's been a bit of an interesting evolution where even Kevin,
in DreamCoder there was an explicit concept of wake and sleep, you know, dreaming, and in his newer work on learning from example pairs he's kind of made that dream-sleep process implicit. And it just feels like maybe we could achieve some of the same stuff by coming up with a proxy or implicit version of it. Yeah, yeah, yeah.
Any other avenues for ARC that you're interested in? By the way, I think you spoke with George from Symbolica and he had this kind of program verification approach that he said he discussed with you.
Yes, so what we discussed, what you described to me, is basically this idea of using a symbolic process to turn a problem definition, a task, into a deep learning architecture and then training that deep learning architecture. I think that's a very, very original approach. I don't think there's anyone else, to the best of my knowledge, working on something similar.
I'm very curious about what he's going to be doing with it. It sounds fascinating. Yeah, absolutely. Do you think, you know, just talking about benchmarks in general, do we need to incorporate the compute budget in the benchmark?
Yes, absolutely. And I think this is going to be a very pressing need in the future and in particular next year for the public leaderboard of ARC because it's always possible with test-time compute like test-time search, test-time training and so on to buy higher performance at the cost of more compute. And typically you're going to see a logarithmic relationship at test-time between compute and performance
And it kind of means that if you want an apples-to-apples comparison between two systems, you have to look at the compute budget. You can only compare systems that are using the same amount of compute, right? And...
For instance, if you look at the O1 model from OpenAI, you cannot really attribute to it a fixed score on ARC-AGI unless you're also limiting yourself to a certain amount of compute. It's always possible to logarithmically improve your performance by just throwing more compute at the problem. And of course this is true for O1, but even before that, the same was true for brute-force program search systems.
Assuming that you have the right DSL, then extremely crude, basic brute-force program enumeration can solve ARC at human level. It would just take hundreds of millions of dollars of compute to crack the entire dataset that way. It's an extremely inefficient, extremely stupid idea, to be honest. But in theory, it's possible, right?
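For contrast, here is what crude brute-force program enumeration looks like over a toy DSL; the operator set is invented, and the point is the |DSL|**depth blow-up in candidates, which is why this approach only works with enormous amounts of compute.

```python
# Crude brute-force enumeration in miniature: try every sequence of DSL
# operators up to some depth and keep whatever reproduces the demo pairs.
from itertools import product

DSL = {
    "flip_h": lambda g: [row[::-1] for row in g],
    "flip_v": lambda g: g[::-1],
    "transpose": lambda g: [list(r) for r in zip(*g)],
    "identity": lambda g: g,
}

def run(program, grid):
    for op in program:
        grid = DSL[op](grid)
    return grid

def brute_force(demos, max_depth=3):
    tried = 0
    for depth in range(1, max_depth + 1):
        for program in product(DSL, repeat=depth):   # |DSL|**depth candidates
            tried += 1
            if all(run(program, i) == o for i, o in demos):
                return program, tried
    return None, tried

demos = [([[1, 2], [3, 4]], [[3, 1], [4, 2]])]       # target: rotate 90 degrees
print(brute_force(demos))   # finds ('flip_v', 'transpose') after a handful of tries
```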
Philosophically speaking, do you think there's always a commensurate relationship between the amount of compute that is taken to do a task? What I mean by that is when we use language and when we use cognitive tools, we might not think we're using compute, but the universe presumably expended a lot of compute in order for those things to emerge. So in some sense, is it really possible to compress the amount of compute that we use?
I think so. I think humans are tremendously compute efficient, and you see this in the fact that, let's say you're solving ARC tasks, for instance, you can solve the entire private test set and only expend a few calories, basically. And you could say, okay, but it's because, you know,
we're just using extremely little energy per operation that our brain does. That's actually not true at all. If you're comparing transistors and neurons, for instance, you find that neurons are tremendously more energy-hungry than transistors. It just so happens that we are managing to solve extraordinarily hard problems using a comparatively small amount of neural computation.
And we're just tremendously energy efficient compared to current AI. And we're going to have AGI when we get to the same level of energy efficiency. What's your opinion on using a programming language like Python, a Turing-complete language, versus using a DSL in these approaches?
Well, I think using a DSL, like for Arc, for instance, is fundamentally limiting. No matter what you do, no matter what base language you're using, you should be able to learn the functions that you're applying from
the data that you have. In fact, you should be able to do this as a lifelong process. So every time you find a new task and you're solving it, in the process you're going to be coming up with useful abstractions, maybe abstractions that relate to problems you've seen in the past. And so you're going to want to
turn that into reusable functions, reusable building blocks, and store them so that the next time you come across a similar problem, you can reapply the same building blocks and save compute, like solve an equally difficult problem in fewer steps. So no matter what you do, you want to learn
the language that you're going to be using. And of course, that could mean learning the DSL, that could also mean using something like Python, but within it, increasingly writing higher order functions and classes and other reusable building blocks to enrich your language.
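A crude sketch of that "learn your own building blocks" idea: mine operator subsequences that recur across solved programs and promote them to named abstractions. Frequency counting over n-grams is a deliberately simple stand-in for real abstraction discovery such as DreamCoder-style compression.

```python
# Mine recurring operator subsequences across solved programs and promote
# them to new named building blocks that future searches can reuse.
from collections import Counter

def mine_abstractions(solved_programs, min_count=2, max_len=3):
    counts = Counter()
    for program in solved_programs:                  # programs = operator tuples
        for n in range(2, max_len + 1):
            for i in range(len(program) - n + 1):
                counts[program[i:i + n]] += 1
    return [seq for seq, c in counts.items() if c >= min_count]

def add_to_library(library, abstractions):
    for seq in abstractions:
        name = "_then_".join(seq)                    # e.g. "flip_v_then_transpose"
        library[name] = seq                          # reusable building block
    return library

solved = [("flip_v", "transpose", "recolor"),
          ("crop", "flip_v", "transpose")]
library = add_to_library({}, mine_abstractions(solved))
print(library)   # {'flip_v_then_transpose': ('flip_v', 'transpose')}
```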
I wondered if you're softening your position at all. So you tweeted, it's highly plausible that fuzzy pattern matching when sufficiently iterated many times can asymptotically turn into reasoning. And it's even possible that humans basically do it in that way, but it doesn't mean it's the optimal way to do reasoning. Is that a shift in your position? Compared to what position? I don't think it's a shift. Um...
I suppose it's saying that, first of all, it's possible that we think in this way. I think we do, yes. Oh, interesting, because I would have thought that, because of the Spelke view on things, you would have thought that we do this high-level reasoning and we don't do the fuzzy matching. I think the fundamental cognitive unit in the human brain is actually fuzzy pattern recognition. That's the core thing that you do. And then when you're
doing something that's more akin to reasoning or planning, when you're doing basically system 2 processing, this sort of slow, logical, step-by-step processing, what you're really doing is iteratively applying your intuition, but in a structured form.
And by the way, this is exactly what deep learning guided program synthesis is about, which is the approach I've been advocating for since 2017. What are you doing when you're doing deep learning guided program synthesis? You are building a program, so basically a graph
of operators, but you're building step by step and at each step when you're choosing what to edit in your graph, what to add, what branching decision to make, you're applying your intuition, you're applying a guess provided by a deep learning model, right? So you're iteratively guessing to create this highly structured, symbolic, discrete
artifact, so this program. And when you're running this program, that's system 2. I think this is basically the way we do system 2 as humans: we are iteratively guessing, iteratively applying fuzzy pattern recognition, to construct an artifact that is in fact symbolic in nature. Like let's say you're playing chess, for instance.
When you're calculating in your mind, you're unfolding some moves step by step, but you're only going to be doing it for a few of the moves that are on the board. And how do you do this selection of which moves to look at? You're applying pattern recognition. And then when you're sort of like,
simulating one move into the future, you're not going to be simulating the entire board, you're going to be focusing on some areas. So again, that's pattern recognition. And sometimes, by the way, any sort of pattern recognition is basically a guess at heart, so it might be wrong in some way. And while sometimes in chess you're calculating and some of your
intuition about future states of the board is wrong and then you play the move and then you realize, oh, oops. Right? And so I think this is basically how humans implement this thing. So this is not a shift in my position. I've been thinking about these ideas, you know, for quite a while. In fact, this is the basis of my current favorite theory for how to interpret consciousness. It's this idea that
Well, in order for something like system 2 to arise from iterated fuzzy pattern recognition, that iteration sequence needs to be highly self-consistent. So everything you add, you need to double check that it matches what came before it. If you're just iteratively pattern matching,
with no guardrails whatsoever, you're basically hallucinating, you're dreaming. This is exactly what happens when you're in dream state, by the way. You're just repeatedly intuiting what comes next, but with no regard whatsoever for consistency, for self-consistency with the past.
And I think this is the reason why any sort of like deliberate logical processing in the brain needs to involve awareness, needs to involve consciousness. Consciousness is this sort of like self-consistency check.
it's the process that forces the next iteration of your intuition, of this pattern recognition process, to be consistent with everything that came before it. And the only way to achieve this consistency is via these sorts of back-and-forth loops that are bringing the past
into the present, bringing your prediction of the future into the present. So you have this sort of nexus point in the present, this thing you're focusing on. And that nexus is basically your consciousness. So consciousness is the process that forces iterated pattern recognition to turn into something that's actually like reasoning, that's actually self-consistent.
Yeah, so I think the issue was you haven't changed your position. It's just a case of people understanding what your position is.
So I think people dichotomize: you know, symbolic people as thinking it's all like this discrete world, and then the alternative, you know, the connectionist approach, is that it's all fuzzy matching. I've never been in the purely symbolic camp. Like if you go back to my earliest writing about, yes, we need program synthesis, I was saying we need deep learning-guided program synthesis. We need a merger of intuition and pattern recognition together
with discrete step-by-step reasoning and search into one single data structure. And I've said very, very repeatedly for the past eight years or so that
human cognition really is a mixture of intuition and reasoning, and that you're not going to get very far with only one of them. You need the continuous kind of abstraction that's provided by vector spaces, deep learning models in general, and more discrete symbolic kind of abstraction provided by graphs and discrete search. So why do your detractors see you as a symbolist when you're clearly not?
Well, I'm not sure. I've been into deep learning for a very long time, since basically 2013. I started evangelizing deep learning very heavily around 2014. And back then, the field was pretty small. Especially with Keras, I think I've done quite a bit to popularize deep learning, make it accessible to as many people as possible.
I've always been a deep learning guy, right? And when I started thinking about the limitations of deep learning, I was not thinking in terms of replacing deep learning with something completely different. I was thinking of augmenting deep learning with symbolic elements. So you commented as well, in this tweet that we were just talking about, on consciousness. You suggested that all system 2 processing involves consciousness. Yes. Explain more what you mean by that.
So any sort of explicit step-by-step reasoning needs to involve awareness. And in reverse, if there's any cognitive process that you're running unconsciously, it will not have this strong self-consistency guarantee. It will be more like a dream, a hallucination.
It's basically the idea that if you're just iteratively guessing, unless you have this strong self-consistency guarantee, you will end up drifting, you will end up diverging. Consciousness is the self-consistency guardrail, basically. This is why you cannot have system 2 without consciousness. What is your definition of reasoning?
I don't really have a single definition of reasoning. I think it's a pretty loaded term and you can mean many different things by that, but there are at least two ways in which I see the term being used and they're actually pretty different. So for instance,
if you're just, let's say, memorizing a program and then applying that program, you could say that's a form of reasoning. Like, let's say in school you're learning the algorithm for multiplying numbers, for instance. Well, you're learning that algorithm, then when you have a test, you're actually applying the algorithm. Is that reasoning? I think yes. That's one form of reasoning.
And it's the kind of reasoning that LLMs and deep learning models in particular are very good at. You're memorizing a pattern, and at test time you're fetching the pattern and reapplying it. But another form of reasoning is when you're faced with something you've never seen before, and you have to recompose, recombine the cognitive building blocks you have access to, so your knowledge and so on, into a brand new model and do so on the fly. That is also reasoning.
But it's a very, very different kind of reasoning and it can realize very different kinds of capabilities. So I think the important question about deep learning models and LLMs in particular is not "Can they reason?" There's always some sense in which they are doing reasoning. The more important question is "Can they adapt to novelty?" Because there are many different systems that could just memorize programs provided by humans and then reapply them.
What's more interesting is can they come up with their own programs, their own abstractions on the fly? And what would it mean for a system to come up with its own abstraction?
To start coming up with abstractions, first of all, you need to be solving novel problems. Solving a novel problem means that you're starting from some base of knowledge, some building blocks. Then you're faced with a new task, you are recombining them into a model of the task, and you're applying this model and it works. And in the process,
as you solve many problems, you will start noticing that some patterns of recombination of the building blocks that you have happen often, right? And when you start noticing this, well, it means you can tag them and abstract them, refactor them into a more reusable form. And then this reusable form, you can just add it back to the set of building blocks that you have access to. So next time you encounter a similar problem, you're going to be able to solve it in fewer steps, expending less energy, because you have access to this higher-level abstraction that fits the problem. Is there a way of measuring the strength of reasoning?
So again, you would need to start by defining precisely what you mean by reasoning. You can, for instance, precisely define generalization power, which is basically the amount of novelty that you can adapt to. Yeah, it's quite interesting because I suppose you define it in terms of performance, as in how good is my model.
Another way of describing it is let's imagine we think of reasoning purely as traversing the deductive closure, so just composing together knowledge we already have in new configurations and then we make that leap in solution space because we found a new model that works really, really well. Is there an intrinsic way of measuring the type of model rather than its generalization power?
No, I think you really have to observe what the model does. You cannot just inspect the model and tell how strong it is at reasoning. So there's no intrinsic notion of what good reasoning is? Given two models of a problem, for instance, which model is better? Can you just look at them and tell which one is better? I think it's very much goal dependent, right? You cannot really evaluate...
a model like a simulation of the thing, for instance, if you don't have something that you want to do with it. But if you do have a goal, then you just look at what are the causal factors required to achieve that goal. And then the best model is going to be probably the simplest model that retains these causal factors. François, what are you doing next? Well, so I just left Google a few weeks ago. And so I'm starting a new company, a new research lab with a friend.
And yeah, so I can't really share much for the time being, but we're going to be tackling program synthesis and, in particular, deep learning-guided program synthesis. And we're currently building the team. Amazing. Are you looking for people to join? Yes.
Tell me more. What are you allowed to tell us? So, yeah, I mean, what I can tell you is, you know, I've been talking about some of these questions, like the best way to get to AGI, about deep learning guided program synthesis, about...
ARC-AGI, methods surrounding ARC-AGI, and what the next benchmark after ARC-AGI might be. So I've been thinking about this question sort of on the side while at Google, where my full-time job was developing Keras. And now I feel like the time has come to just focus full-time on the research questions. So make the research not a side project, but the main thing.
And is the focus creating the new benchmarks or beating the benchmark? I think both. I'm a pretty strong believer in the idea that you need to co-evolve the solution together with the problem. And that was actually the motivation for creating ARC-AGI in the first place: to have the right challenge that forces you to focus on the right questions, on the main bottlenecks to achieving strong generalization with AI.
And so I don't think, you know, ARC-AGI is the last benchmark. And of course, there's going to be a V2 of ARC-AGI, but that also is not going to be the last benchmark. I think we're always going to be needing new benchmarks, exploring new things that are hard for AI and easy for humans. Is it cheating in any way if you, you know, do work on your own benchmark?
I don't think so. The benchmark is meant as a tool for research. Again, something like ARC-AGI, for instance, is not really meant as a binary indicator telling you "oh, do we have AGI or not?" It's really meant as a research tool. It's a challenge that forces you to work on the right questions, that kind of
directs your attention to the right problems and helps you make progress. So you could say that ARC-AGI is basically a compass towards AGI. It's not a test for AGI.
And, you know, there's a spectrum of possible solutions to ARC. So, for example, because you know what's in the private test set, you could just put the answers in directly. But I'm not going to be entering ARC Prize in any way. I mean, obviously not. I'm running ARC Prize anyway, so why would I enter it? But on that spectrum of generalization, there's the moonshot, which is going for extreme generalization, or there's one notch below that. Which are you going for?
I would like to build AGI. I would like to build something with human-level capability. And what would that mean? What kind of things do you think you could achieve with that kind of AGI? Well, the most obvious thing is solving programming, right? If you solve AGI, then you can just
describe what you want to a computer and the computer will build it for you. And if it's really AGI, then it will scale to the same level of complexity, the same level of codebase complexity, that you can achieve with a human software engineer. And probably it's not going to stop there either. How long do you think, what's the role of humans in software engineering when we start to get there?
Well, we'll see. I think we're going to start creating entirely new tools, entirely new interfaces to work with this technology when it's ready. We're still pretty far from it. We're talking about something that doesn't quite exist yet. I don't think frontier models, not even O1, are quite at that level. Right now, programmers are very technical.
Do you think that programming might be democratized in the future? I think so, yes. I think in the future, anyone should be able to basically develop their own automation processes based on their own domain expert knowledge of the problems they're facing. Everybody should be able to program, not really in the sense of writing down code, but describing to the computer what they want to automate and how they want to automate it, and the computer will just do it.
A big thing in software is tackling complexity at different scales. It's certainly what you've been doing for your entire career. Do you think we'll always have this problem that we'll always be on the boundary of this incredible complexity? What do you mean by that? Well, as in, even if we democratize one or two steps up the hierarchy, wouldn't we still always just build software which is just really complicated?
Yeah, quite possibly. But the idea is that we're going to be able to offload that complexity to an external complexity processing AI.
So we will transition into a future where we no longer understand the code that's being run in any way. Absolutely. I think to a large extent, this is already true. If you look at any sizable code base, there is no single software engineer that actually understands it all. We're always limited to a fragmented understanding of what we are doing, which is fine as long as we have a good grasp on the high-level goals and constraints of the system.
So where should the source of agency be there? Because you know you're describing a blind men and the elephant type challenge, where loads and loads of developers have their own perspective on a very small part of the system. But when we have the AGI version, how could that change? Broadly speaking, I think programming from...
input/output pairs will be a widespread programming paradigm in the future and that will be accessible to anyone because you don't need to write any code, right? You're just specifying what you want the program to do and then the computer just programs itself. And if there's any ambiguity, by the way, in what you meant, and there will always be ambiguity, right? Especially if the instructions are provided by a non-technical user,
Well, you don't have to worry about it because the computer will ask you to clarify. It will tell you, "Okay, so I created basically the most plausible program given what you told me, but there's some ambiguity here and there. So what about this input? Currently I have this output. Does that look right? Do you want to change it?" As you change it, you know, iteratively, you are creating this correct program in collaboration with the computer.
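As a rough illustration of that interaction loop (nothing here is a real product; the primitive set and examples are invented for the sketch), a synthesizer could enumerate every short program consistent with the user's input/output pairs and, if several survive, probe an input where they disagree and ask the user which output is right:

```python
# Hypothetical sketch of programming from input/output pairs with a clarification
# step. The primitives and examples are made up purely for illustration.
from itertools import product

PRIMITIVES = {
    "first_two": lambda s: s[:2],
    "drop_digits": lambda s: "".join(c for c in s if not c.isdigit()),
    "upper": str.upper,
}

def run(program, x):
    for name in program:
        x = PRIMITIVES[name](x)
    return x

def consistent_programs(examples, max_depth=2):
    """All compositions of primitives (up to max_depth) matching every example."""
    return [prog
            for depth in range(1, max_depth + 1)
            for prog in product(PRIMITIVES, repeat=depth)
            if all(run(prog, i) == o for i, o in examples)]

examples = [("ab3", "ab")]                      # what the user demonstrated
progs = consistent_programs(examples)
probe = "a7b"                                   # an input where candidates may disagree
outputs = sorted({run(p, probe) for p in progs})
if len(outputs) > 1:
    # Ambiguity: ask the user instead of silently picking one interpretation.
    print(f"For input {probe!r} I could return any of {outputs}; which looks right?")
else:
    print("Unambiguous program:", progs[0])
```

In this toy case several candidate programs fit the single demonstration, so the system asks about the probe input rather than guessing, which is exactly the collaborative refinement described above.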
So these future systems will have program synthesis as a core component, an explicit component. But how will the humans interface with it? Are we still going to describe with natural language and gestures and images and things like that? It could be natural language. It could also just be drawing interface elements on your screen. It could be...
You could always try to generate a high-level representation of the program that's being generated, at a level where it can be visualized and understood by a non-technical user. It could show, for instance, a kind of data flow graph and ask the user for input about it.
Very cool. So one of the characteristics of LLMs at hyperscale is that, in a sense, serving them is not that difficult: it's like a CDN. They just copy all the weights and move them all over the place. You're describing something which is very sophisticated. It might be a little bit akin to a globally distributed database, where the skill programs move around all the different nodes and so on. Is there just a massive new type of architecture we need to build for this?
Yeah, I think we're going to need a completely new type of architecture to implement lifelong distributed learning, where you have basically many instances of the same AI solving many different problems for different people in parallel and looking for commonalities.
between the problems and commonalities between the solutions. And anytime they find sufficient commonality, they just abstract these commonalities into a new building block, which goes back into the system and makes the system more capable, more intelligent. I think I've got it now. So what you're building is a globally distributed ARC on the basis that we find a good solution to ARC. Well, I can't really tell you exactly what we're building, but it's going to be cool. Yeah, it sounds pretty cool.
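A toy sketch of that abstraction loop, entirely hypothetical and just to make the idea concrete: solve tasks by composing library entries, then promote step sequences that recur across solutions into new named building blocks that future searches can reuse.

```python
# Toy sketch (hypothetical) of abstracting commonalities across solutions into
# new reusable building blocks, shrinking the search needed for similar tasks.
from collections import Counter
from itertools import product

LIBRARY = {
    "reverse": lambda xs: list(reversed(xs)),
    "sort": sorted,
    "double": lambda xs: [2 * x for x in xs],
}

def run(program, xs):
    for name in program:
        xs = LIBRARY[name](xs)
    return xs

def solve(task, max_depth=3):
    """Return the shortest composition of library entries consistent with the task."""
    for depth in range(1, max_depth + 1):
        for prog in product(LIBRARY, repeat=depth):
            if all(run(prog, i) == o for i, o in task):
                return prog
    return None

def abstract_commonalities(solutions, min_count=2):
    """Promote step pairs that recur across solutions into single building blocks."""
    pairs = Counter(p for prog in solutions if prog for p in zip(prog, prog[1:]))
    for (a, b), n in pairs.items():
        if n >= min_count:
            LIBRARY[f"{a}>{b}"] = lambda xs, a=a, b=b: LIBRARY[b](LIBRARY[a](xs))

# Two different "users" whose tasks happen to share a sort-then-double pattern.
tasks = [[([3, 1, 2], [2, 4, 6])], [([2, 5, 1], [2, 4, 10])]]
solutions = [solve(t) for t in tasks]
abstract_commonalities(solutions)
print(solutions, "->", [k for k in LIBRARY if ">" in k])
```

Both toy tasks are solved by the same sort-then-double composition, so that pair gets folded back into the library as a single reusable block, which is the "commonality goes back into the system" step in miniature.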
How do you think that folks like OpenAI are going to start incorporating not only test-time inference, but some of your ideas realistically into their systems? How might frontier models incorporate program synthesis, for instance? Well, I think something like O1 is already doing precisely that. When you look at what O1 is probably doing, it's writing its own natural language program describing what it's supposed to be doing, and it is itself executing this program. And the way it's writing this program is via a very sophisticated search process. So this is already program synthesis in natural language space.
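Purely as speculation about the shape of such a process (nothing here reflects how O1 actually works), a search over candidate chains of thought might look like the sketch below, with stand-in functions where a model sampler and a learned evaluator would go.

```python
# Speculative sketch of search in chain-of-thought space: expand candidate step
# sequences, score them, keep a small beam, and stop when a chain looks complete.
# propose_steps and score are placeholders for an LLM sampler and a learned scorer.
import heapq

def propose_steps(chain):
    # Stand-in for sampling a few candidate next reasoning steps from a model.
    return [chain + (f"step{len(chain)}.{i}",) for i in range(3)]

def score(chain):
    # Stand-in for a learned evaluator of partial chains of thought.
    # Toy heuristic: prefer longer chains, with a deterministic tie-break.
    return len(chain) + (hash(chain) % 100) / 1000.0

def search(target_len=5, beam_width=4, max_expansions=100):
    frontier = [(-score(()), ())]                # max-heap via negated scores
    for _ in range(max_expansions):
        if not frontier:
            break
        _, chain = heapq.heappop(frontier)       # expand the best partial chain
        if len(chain) >= target_len:
            return chain                         # a "complete" chain of thought
        for nxt in propose_steps(chain):
            heapq.heappush(frontier, (-score(nxt), nxt))
        # Prune to a small beam; keeping alternatives is what allows backtracking
        # if the currently best branch stops scoring well.
        frontier = heapq.nsmallest(beam_width, frontier)
        heapq.heapify(frontier)
    return None

print(search())
```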
There are other ways you could leverage programs. You could do programs in latent space potentially, kind of like what Clément Bonnet and friends are doing. You could also just be generating actual programs, like why do it in natural language? Sometimes, maybe you might want to use an actual programming language. So I think we are definitely seeing a
shift towards leveraging more and more test-time compute, and that's going to accelerate. It's a phenomenal trend, is it not? The transductive test-time fine-tuning, though, that's a little bit more architecturally difficult, isn't it? Because my model is always being fine-tuned. So I can imagine they might build something a bit like Docker, where there's the base layer and then there's my fine-tuning layer and another fine-tuning layer. And it's very kind of fragmented. So the difficulty in applying test-time training in actual frontier models is not so much the infrastructure. It's definitely true that current serving infrastructure is absolutely not set up for doing per-task fine-tuning, but you could re-engineer it for that. The main bottleneck is actually the task format. You can only do test-time training if you have pretty clear inputs and targets. You basically need input/output pairs, right? And for most prompts, you don't have that. You have that for ARC, obviously, but for most problems, you don't have that.
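To make the mechanic concrete, here is a minimal, generic sketch of per-task test-time training, with a toy PyTorch regression model standing in for a real model with adapters (the actual ARC recipes are far more involved): copy the shared base weights, fine-tune the copy briefly on the task's demonstration pairs, then predict with the adapted copy.

```python
# Toy sketch of test-time training on a task's demonstration input/output pairs.
# The tiny regression model is a stand-in; real systems fine-tune an LLM, often
# via adapters, but the mechanic is the same.
import copy
import torch
import torch.nn as nn

def test_time_train(base_model, demo_inputs, demo_outputs, test_input,
                    steps=50, lr=1e-3):
    model = copy.deepcopy(base_model)        # never touch the shared base weights
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.MSELoss()
    model.train()
    for _ in range(steps):                   # fine-tune only on this task's demos
        opt.zero_grad()
        loss = loss_fn(model(demo_inputs), demo_outputs)
        loss.backward()
        opt.step()
    model.eval()
    with torch.no_grad():
        return model(test_input)             # prediction from the adapted copy

# Toy usage: a made-up "task" with two demonstration pairs.
base = nn.Sequential(nn.Linear(4, 16), nn.ReLU(), nn.Linear(16, 4))
demo_x, demo_y = torch.randn(2, 4), torch.randn(2, 4)
print(test_time_train(base, demo_x, demo_y, torch.randn(1, 4)))
```

Very cool. What's your theory on how O1 works?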
Well, you know, we can only speculate. I'm not sure how it really works. But what seems to be happening is that it is running a search process in the space of possible chains of thought, trying to evaluate which branches in the tree work better, potentially backtracking and editing if the current branch is not working out. It ends up with this very long and plausibly near-optimal chain of thought, which represents basically a natural language program describing what the model should be doing. And in the process of creating this program, the model is adapting to novelty.
And so I think something like O1 is a genuine breakthrough in terms of the generalization power that you can achieve with these systems. Like we are far beyond the classical deep learning paradigm. So one school of thought, which I think you agree with, is that there's some kind of active controller at inference time. So it's actually doing multiple trajectories in an isolated way. Yes, it's doing search.
Okay, because some people think that there's process supervision and whatnot at training time, but at inference time, it's all just one forward pass. No, that is certainly not plausible because of the amount of compute that's being spent at test time. It is very clearly doing search at test time.
I think it is trained at training time to reproduce the best available chain of thought for the current problem, kind of like AlphaZero-style training, basically. But it's also doing search at test time in chain-of-thought space. And the telltale sign is the compute it's expending, like the number of tokens and the latency.
Are there any other telltale signs that that kind of thing is happening? For example, it might have explored a particular area and then, after the consolidation, that is now gone. So when you talk to the model, it's almost like it's forgotten part of its thinking. Yeah, honestly, this is a little bit too specific. I don't have any insider info about what OpenAI is doing, so I can only speculate. Okay.
So people like Noam Brown, they're really bullish on this new scaling law for test-time compute. And certainly I love O1 Pro. I think it's really, really good. It's qualitatively a big improvement. What do you think there? Sure. So the test-time scaling law is basically this observation that if you expend more compute, if you search further, you see a corresponding improvement in accuracy.
And that relationship is logarithmic, by the way. So accuracy improves logarithmically with compute. And while this is not
really new, like anytime you do test time search, you will see this relationship. Like if you're doing brute force program search, for instance, you will find that your ability to solve a problem improves logarithmically with the amount of compute. If you have more compute, you can just search further into the space of possible programs and logarithmically you find more solutions. So anytime you do any kind of test time search, you will see this relationship.
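A quick back-of-the-envelope version of that argument: with k primitives, exhaustively checking all programs up to length L costs on the order of k^L evaluations, so the program length you can reach, and with it the fraction of problems you can solve, grows roughly like the logarithm of the compute budget. The numbers below are assumed purely for illustration.

```python
# Toy illustration (assumed numbers) of why brute-force program search shows
# logarithmic returns: reachable program length ~ log_k(compute budget).
import math

k = 8                                       # assumed number of primitive operations
for budget in (1e3, 1e5, 1e7, 1e9, 1e11):   # program evaluations you can afford
    reachable = int(math.log(budget, k))
    print(f"budget {budget:>10.0e} evaluations -> exhaustive search up to length {reachable}")
```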
What is your current go-to model and what do you use it for? For the most part, I'm using Gemini Advanced. And I've actually just started using the new Gemini Flash, like the latest one. Me too. So I'm paying for Gemini Advanced. I'm also using Claude 3.5. I think it's very good for programming. So these are the two I'm using. What's your programming workflow with LLMs?
I don't use LLMs all that much when I'm programming, but typically, if I'm currently facing a problem that I feel might be a good fit for LLMs, I will just open my browser, prompt the LLM, and ask it for a function that does X, Y, Z. It usually doesn't really work on the first try, but after a little bit of debugging and nudging, I think it's a big time saver.
What kind of failure modes do you see when you're programming? With LLMs, you mean? Well, so the failure modes are kind of different based on the model that you're using. I think in general, Claude 3.5 Sonnet is the best one. But sometimes, you might have
code that's there for absolutely no reason, like variables that are not used or assumptions that are being made by the code that are not verified by the data that comes in. So it's pretty clear that the code is generated in terms of statistical likelihood. There's no effort to actually make it self-consistent, make it correct, try to execute it beforehand and so on.
So I think there's actually a lot of room for improvement there. And you could imagine LLM-based software developer assistants in the future that actually do all these things as you're prompting them, that actually write you the code but then try to debug it themselves before actually showing it to you.
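A hypothetical sketch of that kind of assistant loop (ask_llm is a placeholder, not a real API): draft code, execute it against quick checks in an isolated namespace, and feed any failure back for another attempt before showing anything to the user.

```python
# Hypothetical draft-execute-repair loop; ask_llm is a stand-in for a model call.
import traceback

def ask_llm(prompt: str) -> str:
    # Placeholder: a real system would call a code model here.
    return "def add(a, b):\n    return a + b\n"

def check(namespace):
    # Quick sanity checks the assistant runs before showing the code.
    assert namespace["add"](2, 3) == 5

def draft_debug_loop(spec: str, max_rounds: int = 3) -> str:
    prompt = spec
    for _ in range(max_rounds):
        code = ask_llm(prompt)
        namespace = {}
        try:
            exec(code, namespace)      # execute the draft in an isolated namespace
            check(namespace)
            return code                # only verified code reaches the user
        except Exception:
            # Append the failure so the next draft can attempt a repair.
            prompt = spec + "\n\nPrevious attempt failed:\n" + traceback.format_exc()
    raise RuntimeError("no candidate passed the checks")

print(draft_debug_loop("Write add(a, b) returning the sum of two numbers."))
```

What's your opinion on LLM agent systems, which are all the rage at the moment?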
Right, so, you know, agents have been all the rage for quite a while now. People started talking about agents being the future something like almost two years ago, a year and a half ago-ish. And so far, agents have not really taken off. So the fundamental problem here is that LLMs are not quite reliable. You know, if you look at one forward pass of an LLM, it's basically a guess.
And you can think of the LLM as a guessing machine, right? And the guesses it makes are much better than random, obviously. They're very useful guesses. They're directionally accurate, but they have some probability of being wrong.
And when you look at an agentic workflow, you are chaining many of these guesses. And so the likelihood that you will end up not where you would like to be gets dramatically higher as you chain more guesses like this. And so this is the big bottleneck. Agents are just not reliable. They just don't have a sufficient level of autonomy. And so people say that with better models, this will get fixed.
I think it's an empirical question. I'm waiting to see when agentic workflows actually start working. I don't think we're there today.
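The reliability point can be made with trivial arithmetic: under the (optimistic) assumption of independent errors, a chain of n steps that are each right with probability p succeeds end-to-end with probability p^n, which collapses quickly as the chain grows.

```python
# Back-of-the-envelope: per-step accuracy p, n chained steps -> p**n end-to-end
# (assuming independent errors, which in practice is optimistic).
for p in (0.99, 0.95, 0.90):
    row = ", ".join(f"n={n}: {p**n:.2f}" for n in (5, 20, 50))
    print(f"p={p:.2f} -> {row}")
```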
I've softened my position a tiny bit on this. I agree because of the ambiguity problem, they are misdirected. And when you chain this, they're very misdirected. But there is something to be said for just having more computation at hand. So I interviewed the guys that did the AI scientist paper. And certainly if you take Claude Sonnet 3.5 and you say, generate me an entire paper, it will be banality beyond belief. It would just be a sketch.
And what they did was they decomposed it into lots of agent workers that just, you know, on the Google Maps analogy, kind of zoom in and zoom in and zoom in many, many times over. And it produced dramatically better results. Yeah, this makes sense. I think this basically tracks the idea we were talking about earlier of
system 2 being something like iterated system 1 with strong guardrails. And the guardrails are very important. So in this case, the superstructure is provided by the human programmer. The human programmer is kind of breaking down the problem into the right subproblems and sort of orchestrating the whole thing in the right way. And then each subproblem can actually be solved by guessing and producing a good enough guess.
Amazing. When's ARC v2 coming up? Early next year, probably. So we are currently finalizing human testing. As I mentioned, every puzzle is going to be solved by a bunch of humans, so we know it's solvable and we have some data to tell how difficult it is for the average human. And the goal is going to be to have three sets that are difficulty calibrated. So if you get a score on the public eval,
and you're not overfit to that data set, you can be very confident that you're going to get a very similar kind of score on the other two sets. When you tested with humans, you know, you wrote about this in On the Measure of Intelligence, that one school of thought in intelligence is that there's this G factor, and another school of thought is that it's very specialized. Did your experiments reveal that a group of humans generally performed quite well across all of the tasks, or did you see huge specialization?
No, it's absolutely the case that there are people who are just more intelligent, and they are just better at solving ARC tasks. And you do see that in the human testing data. What about on the long tail? Do you see specialization in the types of tasks, or is it fairly flat? I think it's pretty flat. I mean, either you're good at it or not. Interesting. François Chollet, it's been an honor to have you on the show. Thank you so much. Thanks so much for having me. It's been great.