This is not just more of the same that we've seen in the past. We have now an existence proof that computers are able to do something that they've never been able to do before in the history of humanity. ARC version 2 has just been released and even the frontier foundation models are failing spectacularly.
Today we are using ARC-AGI-2, the next version of the benchmark. ARC-AGI, and now ARC-AGI-2, is pretty much the only unsaturated benchmark that is feasible for regular people, and so it's a very good yardstick to measure how much fluid intelligence
these models have, how close we are to true AGI. And alongside that, we're really excited to be welcoming everyone to ARC Prize 2025. The contest kicks off officially now. It's going to run all the way through the end of 2025. The structure of the contest is very similar to last year. We're going to have the Kaggle leaderboard running. We're going to be testing this year on the semi-private dataset and wrapping up and testing the final leaderboard on the private dataset. We've still got the big prize. It's unclaimed.
In order to get the big prize, you have to open source your solution and have a high degree of efficiency running on Kaggle. And yeah, we're really, really excited to see all the new ideas. I think there was a lot that came out last year in 2024 that really pushed the frontier. The next version of the benchmark
is more challenging, and it's extremely unsaturated. All frontier models are scoring effectively within single-digit percentages. It's the first time where we've calibrated the human-facing difficulty of the tasks. So we actually hired roughly 400 people. We tested every single task and every single task has been solved by at least two people. So we know it's very feasible for humans. It's extremely out of reach for any system today. This is the frontier.
The Arc benchmark forces us to confront an uncomfortable truth about our pursuit of artificial general intelligence that the field has been overlooking. Intelligence is not just about capabilities, it's also about the efficiency with which you acquire and deploy these capabilities. Intelligence is about finding that program in very few hops, using actually very little compute. Like, look at the amount of energy that a human expends to solve one Arc task over, you know, two, three, four minutes.
It's almost zero, right? And compare that to a model like O3 on high compute settings, for instance, which is going to use over 3,000 bucks of compute. So it's never just an economics problem. Efficiency is actually the question we're asking. Efficiency is the problem statement. It's not capability. The goalpost is AGI.
That's like what we're here to do. That was the whole point of launching ArcPrize in the first place: to raise awareness that there was this really important benchmark that I thought showed something important that the research community at large was missing about the nature of artificial intelligence. You know, this is like one of the things I think makes Arc
special and very unique and important, I would argue. You know, there's a lot of benchmarks in the world today. And to my understanding and knowledge, pretty much every other benchmark, all the frontier benchmarks basically, are trying to test for these superhuman capabilities, right? These PhD-plus-plus
type skills that you need to have in order to succeed at the benchmark. - It's not just compute, it's not just scale. You have to be scaling the right thing. You have to be scaling the right ideas and maybe you have them. - I personally just keep getting like surprised and impressed by the ArcPrize community.
how much folks are pushing the frontier. And I think it's really exciting too, because it means that individual people and individual teams can actually make a difference. If we're kind of in an innovation-constrained world, an idea-constrained world, which I think ARC-AGI-2 shows, that means you out there could actually make a significant contribution to the frontier of AGI. So if you're going to enter the contest, go to arcprize.org and good luck. Good luck. See you on the leaderboard.
This sort of test-time optimization technique or test-time search technique, that's the current frontier of AGI, right? And there are many ways to approach this. Of course, you can do just test-time training, or in this case, you know, test-time search.
You can do search, you can do search over symbolic space, you can do search over chain-of-thought space, over token space, or you can do search in latent space as well. So you have many fun ways to do it, but really the frontier is how do you adapt to novelty at test time by recombining what you know into some novel structure.
MLST is sponsored by Tufa AI Labs. Now, they are the DeepSeek based in Switzerland. They have an amazing team. You've seen many of the folks on the team. They acquired MindsAI, of course. They did a lot of great work on Arc. They're now working on O1-style models and reasoning and thinking and test-time computation. The reason you want to work for them is you get loads of autonomy, you get visibility, you can publish your research. And also they are hiring; as well as ML engineers, they're hiring a chief scientist.
They really, really want to find the best possible person for this role and they're prepared to pay top dollar as a joining bonus. So if you're interested in working for them as an ML engineer or their chief scientist, get in touch with Benjamin Crouzier. Go to tufalabs.ai and see what happens. Well, Mike, it's amazing to have you on MLST. Welcome. Yeah, thank you so much. We're very excited to be here today. Mike, I hear that you guys have got some very exciting news today. Tell me about it.
Yeah, super excited today. We're back. We're really excited to be launching ARC-AGI-2 alongside an updated ARC Prize 2025 contest. Both are going to be launching today and you can go to arcprize.org to learn more and enter the contest. Okay, and in a nutshell, what is v2 and how is it different from v1? The way I think about it is ARC-AGI-1 was a benchmark that was designed to sort of challenge deep learning.
And ARC-AGI-2, in contrast, is really a benchmark that's designed to challenge these new AI reasoning systems that we're starting to see from pretty much all of the frontier labs. And one of the really cool things about ARC-AGI-2 is we're basically seeing models, AI systems that are purely based on pre-training, effectively scoring 0%.
And some of the frontier AI reasoning systems, we're in the process of testing them right now and we're sort of expecting single-digit performance. So a really big update over ARC-AGI-1 from 2024. So the original version of Arc was very much aimed at these kinds of foundation models that didn't do reasoning. Version 2 is tuned for the reasoning models.
What would you say to the charge that you're moving the goalposts? I mean, how is ARC v2 meaningfully an evolution of the benchmark? Yeah, I mean, I think the way that I think about it is that the goalpost is AGI.
That's like what we're here to do. That was the whole point of launching ArcPrize in the first place: to raise awareness that there was this really important benchmark that I thought showed something important that the research community at large was missing about the nature of artificial intelligence.
So that's kind of our goalpost. And the definition that I use for AGI, and the one that the foundation adopts, is assessing this capability gap between humans and computers. And the ARC Prize Foundation's purpose is really to drive that gap to zero. I think it would be hard to argue that we don't have AGI if you look around and you can't find any more tasks that are very straightforward, simple, and easy for humans that computers can't do as well.
And the fact is that we were able to find still lots of those tasks. In fact, all of the tasks in the ARC-AGI-2 dataset sort of fit into this category of things that are relatively easy and simple and straightforward for humans and comparatively very, very difficult and hard for AI today. Okay, cool. So I know you guys have done loads of human calibration and we'll talk about that in a minute, but the fundamental philosophy of the Arc challenge is focusing on human
gaps, but at the same time, AI models are becoming superhuman in so many respects. So is the big story the human gaps or is the big story the expansion of capabilities that are superhuman? You know, this is like one of the things I think makes Arc special and very unique and important, I would argue. You know, there's a lot of benchmarks in the world today. And to my understanding and knowledge, pretty much every other benchmark, all the frontier
benchmarks basically, are trying to test for these superhuman capabilities, these PhD-plus-plus type skills that you need to have in order to succeed at the benchmark. Humans can't solve the problems that are in these benchmarks, so you have to have a lot of experience, a lot of education, a lot of training in order to be able to even get close to solving the benchmarks as a human.
I think those are important. Those are useful. But I think it's actually more illustrative of something that we're missing about the nature of artificial intelligence to look at the gaps, the remaining gaps, between what's simple and easy for humans and what's hard for AI. I think that's much more of an inspiring story. I think it's one where it's actually necessary to target this in order to actually get AGI that is capable of innovation.
I think this is one of the main reasons I got into AI and AGI in the first place was being really inspired and excited about trying to build these systems that would be capable of compressing science timelines.
And if all we have is AI that looks like what we had at the beginning of 2024, right, based on pre-training, based on a memorization regime, you're never going to get to that, because these are systems that are merely going to reflect back the experience and the knowledge that humanity has gained over the last 10,000 generations, as opposed to being ones that are capable of producing new knowledge, new technology, adding to humanity's colossus of knowledge and technology.
If we want systems that can actually do that, we need AGI. And this definition that we've used for the foundation, easy for humans and hard for AI: I think if we can close that gap, we'll actually get technology that's capable of doing that. I wonder whether you think we are just about five discoveries away from AGI
because there's going to be version three of the ARC challenge. Presumably there'll be a version four. Intelligence is multidimensional. And I can see this both ways, right? Because, you know, many critics of AI, they are almost gaslighting us. They're saying that this amazing technology that you're using, it doesn't work. And I'm like, well, yeah, it does work.
And would it be the case that the criticisms will become more and more kind of philosophical and they'll say, oh, you know, because it's not biological or whatever, it's not the same thing? Or do you think meaningfully we're about five steps away from AGI? I think this is why benchmarks are important. And I had a similar question, actually, you know, when I was starting to get back into AI in 2022 and trying to understand the world, like, are we on track for AGI or not? How far off are we?
And I find that it's really, really hard to get a sense of understanding of the capabilities of all these systems purely by using them.
You can certainly get a sense by just interacting with them. But if you really want to understand what they are capable and not capable of, you really need a benchmark to discern this fact. This is one of the interesting things that I picked up from building AI products at Zapier as well. It's very different building with AI than it is with classic software. One of the big differences is when you're building classic software, you can build and test with five users and know, okay, hey,
this product can scale to millions. It's going to work the exact same way. And that's fundamentally just not the case with AI technology. You really have to deploy it to a large scale in order to assess how it works. You need a benchmark alongside that scaling in order to tell you, hey, is this system working or not? What were the main lessons that you learned from version one that you moved into version two?
I think-- so ARC-AGI-2 has been in the works actually for several years. Francois started working on it, crowdsourcing some tasks for it, years and years ago. There were a bunch of inherent flaws we ran into with ARC-AGI-1 that we learned about as we started popularizing the benchmark over the last year or so. One of the things we learned was that a lot of the tasks were very susceptible to brute-force search. That's something that has zero intelligence at all, and we wanted to minimize the incidence of tasks that were susceptible to that.
We hadn't human-calibrated it. We relied on some anecdotes to say that, hey, ARC-AGI-1 is easy for humans. We had a couple of STEM folks, two STEM folks, who had taken the whole dataset, including the private set, and were able to solve 98%, 99%. But we were relying on anecdote. We didn't have that calibrated across the three different datasets that we had. And then we had all these frontier AI reasoning systems come out over the last three, four months. And we've gotten a chance to study these.
and learn what are the sort of qualities of Arc tasks that remain very, very challenging for these AI reasoning systems, which we can get into if you're curious. And so those are the main sort of insights and learnings that we took from ArcAGI 1 to try and produce an ArcAGI 2 benchmark that I think will be a useful sort of signal for development this year in artificial intelligence.
Can we quickly touch on the OpenAI situation? So in December, they didn't launch, but they gave you access to O3 and it got incredible performance on ArcV1, human level performance, something that we just didn't think really would be possible so quickly. Yeah, it surprised me. It came out of nowhere. I mean, can you just tell me the story behind that?
Yeah, yeah, I'll tell you, this is one of the reasons why I'm always hesitant to make predictions in AI about timelines. I think it's very easy to sort of make predictions along smooth scaling curves. But the nature of innovation is it's a step function, right? And step functions are really, really hard to predict when they're gonna come out.
I think the best thing that I can say, having spent some time with O3 and looking at how it performs at Arc, is that systems like O3 demand serious study. This is not just more of the same that we've seen in the past. We have now an existence proof that computers are able to do something that they've never been able to do before in the history of humanity, which is, I think, really, really exciting.
I think there's still a long way to go to get to AGI. But I do think that these things are important to understand and sort of even discern how they work from a capability standpoint in order to make sure that future AI systems that we're developing and building look more like this and not like the sort of pre-training pure scaling regime that we've had in the past.
So I still remember, to sort of give you the anecdote, I still remember the two-week period, the sprint we had on testing O3. It was right at the end of the contest. We had wrapped up ARC Prize 2024 in, I think, early November last year. And we had a three-week, four-week period where we were really, really busy on judging all the final submissions, the papers, getting together the technical report. And we were dropping all of the results on a Friday.
And I was really hoping and anticipating that I was going to have a nice relaxing holiday period in December. And the day that we dropped the technical report, we had a reach-out from one of the folks at OpenAI who said, hey, we'd really love you to test this new thing that we're working on. We think we've got some impressive results on ARC-AGI-1.
And so that kicked off a very, very hectic, fast, frantic two week period to try and understand, OK, what is this system like? Does it reproduce the claims that OpenAI had on testing it? And what does this mean for the benchmark? What does this mean for AGI? And I think we were able to show the final result was that O3 on its sort of high efficiency setting, which fit within the sort of
budget constraints that we'd set out for our public leaderboard, got about 75% or so. And then they had a high compute version, which used, I think, maybe 200x more compute than the low compute setting, and which was able to score even higher, around 85%.
And this is really impressive. I think this shows a system like O3 marks this sort of binary switch. We've gone from a regime where these AI models have no ability to adapt to novelty to something like O3, which is an existence proof of an AI system that can adapt to novelty in a small way. Breaking this down a little bit, there were some interesting caveats that you just alluded to.
First of all, they did some kind of fine-tuning, and people at the time joked that, you know, isn't it scandalous that they were training on the training set? So based... Yeah, this is like a very bad... This is a very poor critique. I think it misses the point of the benchmark. I think the folks who feel this way, it's just because they're so used to thinking about benchmarks and AI from the pre-training scaling regime where, like, hey, if I trained on the data, you know, then it's cheating to test on the data, right?
And that's true in the pre-training regime. But Arc is a very, very special different benchmark where it explicitly makes a training set available with the intention to train on it. This is very explicit. This is like what the benchmark expects you to do. We expect AI researchers to use the training set in order to teach their AI systems about the domain of Arc.
And then what's special is we've got a private data set that very few humans have ever seen. The private data set does not look like the training set. It requires you to generalize and abstract the core knowledge concepts that you learned through the training set at test time. Fundamentally, you cannot solve like the Arc AGI one or two private data sets purely by memorizing what's in the training set.
This would be like, maybe a crude analogy would be, if I was going to teach an AI system on grade school math and then test it on calculus. This is very similar to the type of thing that we do with ARC where the training set is much simpler, easy curriculum to learn on.
And then the test is a much more difficult one where you actually have to express true intelligence. You have to have an actual capability of adapting to novelty at test time to solve it. Okay. Okay. All of that is fine, but there's a couple of things, right? Two and a half thousand dollars per task or more. That means they were probably doing sampling. They were doing a ridiculous amount of completions. They were doing solution-space prediction, which is very interesting. But the main thing, Mike, just deep in your bones, deep in your bones, do you think...
that they were training on API data or surely they were training on a whole bunch of data to do that well. And the extension of the question is when they release the vanilla version of it, what performance would it get compared to their tweaked version?
We will test that as soon as it comes out and I would love to report the results on that. They told us all they did was train it on the training set and I believe that's what they did. Okay, very interesting. And just comment on the solution space prediction. I mean, I was amazed that just predicting the output space directly, they could do so well. I mean, doesn't that almost take away from the idea that we need to have discrete code DSL type approaches if you can just predict the solution space so well? Effectively, what O3 is doing is...
It's able to use its pre-trained experience and recombine it on the fly in the face of a novel task. It does this through a regime called chain of thought. This is all informed speculation, by the way. We don't have confirmed details. This is just my personal assessment of how these systems work, particularly things like O1 Pro and O3. If you compare them with systems like R1 or O1, those are systems that basically spat out a single chain of thought and then used that chain of thought in order to ground a final answer.
This is distinct from how systems like O1 Pro and O3 work, where they actually have the ability to do multi-sampling and recomposition at test time of that chain of thought. This allows them to build novel CoTs that don't show up anywhere in the pre-training, not in the existing experience, and allows these systems to effectively reach more of the solution space than what was in the original pre-training.
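To make that picture concrete, here is a minimal sketch of the kind of test-time sampling-and-selection loop being described, under the speaker's own caveat that this is informed speculation rather than OpenAI's confirmed design. The `sample_candidate_program` function is a hypothetical stand-in for drawing one chain of thought from a model and interpreting it as an executable transformation.

```python
# A minimal sketch (not OpenAI's confirmed design) of the test-time idea being
# described: sample many candidate chains of thought, abstracted here as
# candidate programs, keep only the ones consistent with the task's
# demonstration pairs, and apply a survivor to the test input.
# `sample_candidate_program` is a hypothetical stand-in for drawing one
# reasoning trace from a model and turning it into an executable transform.

from typing import Callable, List, Optional, Tuple

Grid = List[List[int]]
Program = Callable[[Grid], Grid]

def solve_with_test_time_search(
    train_pairs: List[Tuple[Grid, Grid]],
    test_input: Grid,
    sample_candidate_program: Callable[[], Program],
    num_samples: int = 64,
) -> Optional[Grid]:
    """Sample candidates and return the output of the first consistent one."""
    for _ in range(num_samples):
        program = sample_candidate_program()  # one sampled "chain of thought"
        try:
            # A candidate survives only if it reproduces every demonstration.
            if all(program(x) == y for x, y in train_pairs):
                return program(test_input)    # recombination applied to the novel input
        except Exception:
            continue                          # malformed candidates are simply discarded
    return None                               # no consistent candidate was found
```

The key property is that the selection criterion, consistency with the demonstration pairs, is applied at test time, which is what lets the system recombine what it already knows against a task it has never seen.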
Fundamentally, these systems are a combination of, like, a deep learning model and a synthesis engine that is put on top, and I think the right way to think of them is that these are really AI systems, not single models anymore. Yeah, and I agree with you. It's really funny, though, how you see the critique in the community, because, you know, Gary Marcus is now saying, oh, it can't draw pictures of bicycles and label the parts, whereas we see O1 Pro and O3 and it really does seem like a dramatic move towards, um, you know, intelligent systems.
But Mike, can we just quickly talk about the testing methodology? So the real work that you guys did was you got a whole bunch of human subjects, and, I think you had, was it 400 test subjects? And at least two people needed to solve every single task. And you had to do this experiment design and you had to balance complexity of the tasks and so on.
How did you do all of that? Yeah, so this was one of the biggest things we wanted to fix with ARC-AGI-1. We never had a formal human calibration study on how humans actually do on these things. We relied on anecdote. So we had set up a testing center down in San Diego. We recruited tons of just folks from the local community, all the way from Uber drivers to single moms to UCSD students.
and brought these folks out to go take ARC puzzles. It was really cool. We'll have to share some of the photos. These testing shots where you have dozens and hundreds of people taking ARC tasks on laptops.
Our goal originally with the dataset was to ensure that every single task that we put in ARC-AGI-2 is solvable by at least one human. And what we actually found was something of an even higher standard, I think, which was that every single task in the new V2 dataset is solvable by at least two humans under two attempts. And these are the same rules that we give to AI systems on the benchmark, both on the contest as well as the public leaderboards. I think this is a pretty good sort of assertion,
you know, a straightforward comparison we can actually use now between, hey, are these tasks easy and straightforward for humans? Yes. Are they hard for AI? Yes. Like I said before, frontier systems generally are getting close to zero or single-digit percentages on these tasks now. Okay, but the idea though is this Moravec paradox, right? Which is that, you know,
Basically, while we can select problems that are easy for humans and hard for AIs, we haven't got AGI yet. But I was looking through some of your challenges and I felt that some of them were very difficult. Like it would have taken me five or six minutes of deep thought to get it. Like, are you finding that it's still easy to, you know, to find these things that are easy for humans and hard for AIs? Or are you kind of scraping the barrel a little bit? So I think easy for humans, hard for AI is a relative statement.
The fact is these ARC v2 tasks were solvable by humans on a $5-per-task budget. They were solvable in five minutes or so, and AI cannot solve these at all today. And so, yes, I do think if you look at the tasks, you have to think about them. There's, you know, some thought you need to put in to sort of ascertain the rule. But the data, I think, speaks for itself: you know, we've got every single task now in the v2 dataset, from the public training set, I'm sorry, the public eval set, to the semi-private set, to the private eval set.
Every single one is solvable by at least two humans under two attempts. And yeah, these frontier systems can't solve these things at all. Or if they can, it's with a really, really expensive budget, to your earlier point, thousands of dollars per task.
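As a rough illustration of the scoring rule being referenced here, "solvable under two attempts" with the same criterion applied to humans and AI systems, a simplified sketch follows. The helper names are illustrative, not the official ARC Prize harness, and the real harness additionally handles tasks with multiple test outputs.

```python
# Simplified sketch of the two-attempt, exact-match grading rule described
# above. A task counts as solved only if one of (at most) two attempted
# output grids matches the target grid cell for cell; no partial credit.

from typing import List

Grid = List[List[int]]

def task_solved(attempts: List[Grid], target: Grid, max_attempts: int = 2) -> bool:
    return any(attempt == target for attempt in attempts[:max_attempts])

def benchmark_score(per_task_results: List[bool]) -> float:
    # The headline number is simply the fraction of tasks solved.
    return sum(per_task_results) / len(per_task_results) if per_task_results else 0.0
```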
So you guys have been cooking. You're already working on version 3 of Arc. What can you tell us about that? So the way I kind of think about the multiple versions here: ARC-AGI-1, again, was designed to challenge deep learning as a paradigm. ARC-AGI-2 is designed to challenge these AI reasoning systems.
I don't expect that ArcAGI 2 is going to be as durable, right? ArcAGI 1 lasted for five years. I don't expect ArcAGI 2 is going to be quite as durable as that. I hope that will continue to be a very useful signal for researchers over the next year or two. But yeah, we've been working on ArcAGI 3 and I think the pithy way to talk about ArcAGI 3 is it's going to challenge AGI systems that don't even exist in the world yet today. Can you tell me about the foundation that you're setting up?
Yeah, so this is one of the big cool things, I think, from ARC Prize 2024. When we launched it, it was very much an experiment. Our ambitions were not quite what they are now. I think when we went into 2024, our main goals were just to raise awareness of the fact that this benchmark exists. And I think what we found was-- or what I personally found-- I just kept getting surprised by the community around Arc.
I remember this really specific moment when o1-preview came out and there were thousands of people on Twitter demanding that we test this new model on Arc. And that was not my mental model of what this benchmark was or what the community was. And that was so cool. And that moment happened again when we ended the contest. That moment happened again when we launched the results on O3.
And this kind of showed, I think, hey, there's a real demand for what Arc is providing. There's a real demand for benchmarks that look like this, that ascertain these like capability gaps between humans and computers.
And so we set up this foundation in order to basically be the North Star for AGI and continue to produce useful, interesting, durable benchmarks in the sort of spirit of trying to discern like what are the things that are simple, straightforward, easy for humans and still remain impossible or very, very difficult for AI. And we're going to carry that torch all the way until we get to AGI. As you can see now, all of the large AI labs, they're focusing on reasoning. And I'd like to think that ARC was at least a small part of that.
And you folks are very focused on open source as well. Mark Chen said specifically on the OpenAI podcast that they'd been thinking about ArcV1 for years. There you go. Well, yeah, exactly. But just tell me a little bit about that. So there's the industry impact, but you guys are really focused on open source as well. So how do you see those two things? So my sort of overriding philosophy at this point is AGI is the most important technology that humanity is going to develop.
And if it is true that we are in an idea-constrained environment, that we still need new ideas to get to AGI, which I think ARC-AGI shows is true, if that's true about the world, then I think we should be designing the most innovative sort of ecosystem and environment across the world
that we possibly can. This is one of the reasons why we launched ArcPrize originally internationally: to reach solo researchers, to inspire researchers again to go try and work on these new ideas, to get past this pre-training regime, to try something that we knew needed to be something beyond this, and even beyond what we have today.
And I think if you look at a really healthy, strong innovation ecosystem, you're going to look at one that is very open and there's a lot of sharing and there's a lot of diversity of approach. And this is in contrast to an ecosystem that would be very closed, very secretive, very dogmatic, very monocultural.
And so those values of openness, those values of sharing, are what the ARC Prize Foundation stands for, in order to sort of increase the chance that we can get to AGI soon. So talking about version two of the ARC Challenge, can you just give us the elevator pitch of that?
Sure. So Arc 2 is basically a new version of Arc that keeps the same format but tries to address the main flaws that we saw in Arc 1. So for instance, in Arc 1, we knew that there was a little bit of redundancy across tasks. We saw that actually very early on, as early as the 2020 Kaggle competition. And also, Arc 1 was way too brute-forceable. So back in 2020, what we did after the Kaggle competition is that we tried to look
at all the tasks that were solved at least once by at least one entry in the competition. And we found
that half of the private dataset could be solved, in fact, just with, yeah, the sort of basic brute-force program search methods that were deployed during the first competition. So that means half the dataset actually doesn't give you very good signal. So the other half was actually good enough, required enough generalization, that the benchmark overall was still useful. And it still lasted quite a few years after that.
But it told you from the start that there were some pretty significant flaws, which is expected, by the way. When I started creating Arc back in 2018, 2019, I was flying blind. I was trying to capture my own thoughts, my own intuition about what does it mean to generalize, what is abstraction, what is reasoning, and that turned into this benchmark. But I could not anticipate what kind of AI techniques would be used against it.
And so yeah, as it turns out, a lot of it could be brute force. So ARC 2 completely addresses that. You cannot score higher than 1% or 2% at most using brute force techniques in ARC 2. So that's good news. And other than that, we generally try to make it a little bit harder. So what we saw with ARC 1 is that it was very easy to saturate for humans. If you're a STEM grad, for instance, you could very easily get 100%.
or within noise range of 100%, like something like 97, 98. And so that means that you were not getting a lot of useful bandwidth to compare AI capabilities with the capabilities of smart humans. And if you just make it a little bit harder, then you get more range, where if you're not very intelligent, you'll score lower. If you're very intelligent, you'll score higher. And you're not super likely to completely saturate it until you're at
the very top end of the distribution. So that's what Arc 2 is. Same format, same basic rules. So we're only using core knowledge. You have these input/output pairs of grids that are at most 30 by 30. But the content is very different. You're not going to find tasks where you only have to apply one basic rule that could be anticipated in advance, like some kind of gravity
things falling task or symmetry task. All the tasks are very compositional. So you have multiple rules, you have more objects, the grids are generally bigger, and the rules can be chained together or can be interacting together. And that makes it completely out of reach for brute force methods.
And as it turns out, it also makes it out of reach for the base LLM pre-training paradigm. You're saying that you've made them more compositional and iterative and harder for humans. That's right. Could you give me a little bit more detail on that?
If you think about it, there are different dimensions of things that AI models can do and there are different dimensions of things that humans can do. Have you sort of quite diversely explored that? Or I mean, could you just give me a bit of a breakdown of the task characteristics? So in Arc One, you had many tasks that were very basic, where you just had one rule. Let's say, for instance, you have a few objects.
and you have to flip them, right? So this is an example of a task that's easy to brute force because flipping is something that you can acquire via pre-training as a concept or that you could just hard code in a brute force program search system. So if that's the only rule you have to apply and just apply it once, that's not compositional. That's actually pretty easy to anticipate. That's easy to brute force.
So a compositional task is going to be a task where you have more than one concept, and typically they're going to be interacting together. An example of a very simple compositional task is, let's say you have object flipping but also the objects are falling, right? So you have two rules to apply to each object at once. But that again is a kind of task
that could still be found via brute-force program search if you have gravity and flipping as key elements in your DSL, for instance. And so you want to create tasks where the rules are chained to a sufficient level of depth that there's no way you could find the chain by just trying every possible chain; it would become too expensive.
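To illustrate that point about brute force, here is a minimal sketch of the kind of program search being described: a tiny hand-written DSL (the flip and gravity primitives are assumed examples, not the actual 2020 Kaggle solutions) and an enumerator that tries every chain of primitives up to a given depth against the demonstration pairs.

```python
# Illustrative sketch of brute-force program search over a tiny assumed DSL.
# The enumerator tries every chain of primitives up to `max_depth` and keeps
# the first chain that explains all demonstration pairs. The candidate count
# grows as |DSL| ** depth, which is why deeply chained, interacting rules put
# this approach out of reach.

from itertools import product
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

def flip_vertical(g: Grid) -> Grid:
    return g[::-1]

def flip_horizontal(g: Grid) -> Grid:
    return [row[::-1] for row in g]

def gravity_down(g: Grid) -> Grid:
    # Let non-zero cells "fall" to the bottom of each column.
    h, w = len(g), len(g[0])
    out = [[0] * w for _ in range(h)]
    for c in range(w):
        stack = [g[r][c] for r in range(h) if g[r][c] != 0]
        for i, v in enumerate(stack):
            out[h - len(stack) + i][c] = v
    return out

DSL: Dict[str, Callable[[Grid], Grid]] = {
    "flip_v": flip_vertical,
    "flip_h": flip_horizontal,
    "gravity": gravity_down,
}

def brute_force_search(
    train_pairs: List[Tuple[Grid, Grid]], max_depth: int = 3
) -> Optional[Tuple[str, ...]]:
    """Return the first chain of primitives consistent with every pair."""
    for depth in range(1, max_depth + 1):
        for chain in product(DSL, repeat=depth):      # |DSL| ** depth candidates
            def run(g: Grid, chain=chain) -> Grid:
                for name in chain:
                    g = DSL[name](g)
                return g
            if all(run(x) == y for x, y in train_pairs):
                return chain
    return None
```

With these three primitives and a maximum depth of 3 there are only 3 + 9 + 27 = 39 chains to try; with, say, 50 primitives and chains of depth 6 there are over 15 billion, which is the explosion being described.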
Of course, humans can still do it because humans are not just trying every possible combination of things they know on every problem that they see. They just have this
very efficient, very intuitive way of searching for a theory that explains what they've seen. You know, my co-host Keith, he had this idea of doing a recursive version of ARC. But the thought occurred to me is that even though we do this systematic compositional reasoning, we still have some kind of cognitive limit. So if we nested, let's say, four levels of ARC challenges within the same problem, wouldn't you find very quickly that humans just can't solve it?
Right, so if you just concatenate two Arc tasks, for instance, you get something that's much less brute-forceable, that's much harder, because there are more rules going on. It's not quite what I would call compositional though, because even though you have two rules at once, they're not interacting with each other, right? You can solve them separately and then concatenate the solutions.
And I think it's not a bad idea at all. It will work as a way to make Arc more difficult with, again, this caveat that you're not actually testing for depth of compositionality. One issue though is that it will only really work once because as soon as the person developing the AI system notices that the task can actually be decomposed into subtasks, then it's game over.
So I think it's actually more interesting to have multiple rules at once, but they're actually being chained together or they're interacting together in some way where for instance, one rule might be writing some information on the grid that needs to be read by the second rule, right? What performance do the frontier models get on ARCv2? So what we saw was a big gap between models that don't do any kind of test time adaptation, like any kind of test time search or test time training and models that do.
And the base LLMs, even models like GPT-4.5, they're basically scoring zero. I think one of them, I think it was R1 maybe, scored slightly above zero, something like 1%. But it's within noise range of zero. So any model that cannot do test-time adaptation, that is to say, that does not possess fluid intelligence, scores effectively zero. So in that sense, ARC 2 is actually a very strong signal that you have fluid intelligence,
better than ARC 1. I think ARC 1 could already tell you that, but less perfectly. So on ARC 1, if you do not do test-time adaptation, you can still do up to roughly 10%. On ARC 2, that's actually 0. So it's a better test.
Now, when it comes to models that do test-time adaptation, we tried for instance some of the top entries from the Kaggle competition last year, the models that were doing test-time training in particular, or some kind of program search. And the best model, the model that actually won the Kaggle competition, can do, I believe, 3% on ARC 2. And if you take an ensemble of the top entries from the competition, you get to 4%.
So that's not very high. We also estimate that-- so O3 would be the current state of the art in terms of an AI model that does exhibit fluid intelligence. And so we haven't been able to test O3 on low-compute settings on all of the tasks that we wanted to test it on. But we've tested it on a subset. And so we can extrapolate what it would score on the entire set. And it sounds like it's going to be about 4%.
So not super high, there's a lot of room to go higher than that. And we haven't been able to test O3 on high compute settings.
So the model that was scoring 88% on Arc 1. So I can make a guess, based on what we saw from O3 low and other models. I think you might get up to like 15, maybe even 20%, if you were really maxing out the compute setting and spending, like,
10K per task, for instance. But that would still be far below average human performance, which would be more like 60%. So that 4% that O3 gets on ARC v2, do you think of that as fluid intelligence, or do you think of that as a potential gap? I mean, presumably you could have designed ARC v2, if you selected the correct sets of human-calibrated challenges, you could have found a set which was still 0 for O3.
Yeah, absolutely. You could have adversarially selected against O3 and then O3 would do zero. It would be very easy to go from 4% to 0%, right? It's just a few tasks that you would need to change. So we're not actually trying to do that. Yes, I do believe that 4% does show that you have non-zero fluid intelligence, which is also something that you could get as a signal from ARC 1.
And I think the sign that you're seeing fluid intelligence is the performance gap between the huge pre-trained-only models that don't do test-time adaptation, which score effectively zero, maybe one. And you could say that that 1% is in fact a flaw in the dataset, sure; it should in practice be zero. And then the models that do test-time adaptation and score non-zero: that 3%, 4%, maybe 5%, right?
And that means that there's something like 95% of the data set that will actually give you this useful bandwidth for measuring how much fluid intelligence the model has. And that's something you were not getting with ARC-1. ARC-1 was more binary, where if you don't have fluid intelligence, you're going to do very, very low, like below 10% roughly. If you do, you're going to score significantly higher, and getting above 50% would be very easy.
But because the measure would saturate very quickly, as soon as you start adding non-zero fluid intelligence, you did not get that useful bandwidth that you're getting with ARC2. So I think ARC2 should allow for answering the question, is this model actually as fluidly intelligent as the average human, which is something you could not get at ARC1.
I guess it's just an economics thing at this point. So if you spent, let's say, a billion dollars or half a billion dollars, you could saturate ARC v2. I'm not sure if you would agree with that. But if that isn't the case, I mean, what do you think are the specific things that are missing from O3 that are stopping it from doing better?
So it's never just an economics question, because intelligence is not just about capabilities. It's also about the efficiency with which you acquire and deploy these capabilities. And sure, if you spend billions and billions of dollars, maybe you can saturate ARC 2. But that would already have been true back in 2020 using extremely crude brute-force program search.
If you have a DSL that's actually Turing-complete, then you know that for every ARC task there exists a program, that may not in fact be all that long, that will solve the task. And all you need to do to find it is iterate over all possible programs in order of length. And then the first one you find is the one that's going to generalize, right? Because it's the shortest, it's the most parsimonious. So if you spend unlimited resources,
you already have AGI in that sense, just in the pure skill sense. You can always just try every possible program until you find one that works. But that's not what intelligence is. Intelligence is about finding that program in very few hops, using actually very little compute. Like, look at the amount of energy that a human expends to solve one ARC task over two, three, four minutes. It's almost zero, right?
And compare that to a model like O3 on high compute settings, for instance, which is going to use over 3,000 bucks of compute. So it's never just an economics problem. Efficiency is actually the question we're asking. Efficiency is the problem statement. It's not capability.
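For a rough sense of scale on that efficiency contrast, here is a back-of-envelope calculation. The ~20 W brain power figure and the electricity price are assumed round numbers of mine, not figures from the conversation; the roughly $3,000-per-task cost for O3 on high compute settings is the one quoted above.

```python
# Rough back-of-envelope for the efficiency contrast being drawn here.
# BRAIN_POWER_W and ELECTRICITY_USD_PER_KWH are assumed round numbers,
# not figures from the interview.

BRAIN_POWER_W = 20                # assumed: typical human brain power draw
SOLVE_TIME_S = 3 * 60             # ~three minutes on one ARC task, as described
ELECTRICITY_USD_PER_KWH = 0.15    # assumed retail electricity price

energy_kwh = BRAIN_POWER_W * SOLVE_TIME_S / 3.6e6   # joules -> kWh
human_cost_usd = energy_kwh * ELECTRICITY_USD_PER_KWH

o3_high_cost_usd = 3000           # order of magnitude quoted in the conversation

print(f"human:   ~{energy_kwh:.4f} kWh  (~${human_cost_usd:.6f} of energy)")
print(f"o3 high: ~${o3_high_cost_usd} of compute per task")
print(f"ratio:   ~{o3_high_cost_usd / human_cost_usd:,.0f}x")
```

Even with generous rounding, the gap comes out on the order of ten million to one per task, which is the sense in which the human energy cost "is almost zero."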
So intelligence is knowledge acquisition efficiency. O3 did very well on Arc V1 and now that it does so badly on Arc V2, the whole point of your definition of intelligence is that given some basis knowledge, you efficiently recombine, you produce new skill programs. You're saying that in the absence of the base knowledge in V2, there is no intelligence. Therefore, is O3 not actually as intelligent as we thought it was?
I think O3 is one of the first models, perhaps arguably the first model that does show fluid intelligence.
So now what the results on ARC 2 are telling you is that it's not human-level fluid intelligence, right? But still, I would consider O3 as a kind of proto-AGI, with two big flaws, two big caveats. One is, of course, efficiency. Efficiency is part of the problem statement, in fact, the central point. So as long as you're not as efficient in terms of, for instance, data efficiency, compute efficiency, energy efficiency, then
it's only a temporary solution. We'll find a better solution in the future. And also, it's not quite human-level. If it were human-level, you'd expect something like O3 low to score over 60% on ARC 2. And we don't know what the exact number is going to be, but it's probably around 4% to 5%, right? Do you think that general intelligence is a category or a spectrum? So general fluid intelligence, it's
I would say it's both, because there's a huge difference between just having memorized a bunch of skill programs and knowledge that are static, versus being able to adapt to novelty to a non-zero extent. So that is a binary distinction. Either you have fluid intelligence or you don't. Right.
And arc one could answer that question for any system. But once you have non-zero fluid intelligence, then the question is how much do you actually have and how it compares to humans. And that's related to the notion of recombination of the skill programs that you have, the knowledge that you have.
and depth of recombination. So if you do no recombination at all, you don't have fluid intelligence. If you do some recombination, you do. But then the question is how deeply can you recombine. Like, for instance, if you're using a program synthesis analogy, the question is
how big of a program can you write on the fly to adapt to a new problem, right? And of course, as well, how fast and how efficiently can you write it, right?
So it is a binary, but it's also a spectrum. And ARC-1 was trying to ask the binary question: Does this system have any fluid intelligence at all? And ARC-2 is more on the side of trying to measure how much fluid intelligence you actually have compared to humans. How long do you think it will take for V2 to be saturated? And do you think it will survive until V3 comes out? So that's a question where you have to take into account resource efficiency.
So if you're asking how long it would take before we have a system that can score higher than let's say 80% on Arc 2 using less than $10,000 of compute for instance.
I think probably around a couple years. So it's very difficult to make predictions here. I think if you're just looking at current techniques and scaling up current techniques, I think it could take a while. I think that ARC 2 is actually way out of reach of current techniques. But of course, we are not limited to current techniques. In 2025, we're probably going to see new breakthroughs in the same way we saw new breakthroughs last year.
This breakthrough is actually very difficult to predict. I was personally very surprised with the performance that O3 could get on ARC1 last year. That came as a surprise. So maybe we'll have new surprises this year. But I would be extremely surprised if we see an efficient solution that's human level on ARC2 by the end of 2025. I would basically rule that out. By the end of 2026, maybe, right?
which is why we have the ARC 3 coming of course. So on analysis of failure modes, I'm sure you saw the blog post that I read where it went through all of the different failure modes of O3 and of course it was solution space prediction which made it more surprising to me. My take on it was I was really impressed that even when it failed
it was because the solution space got too big or it was just getting minor mistakes, but broadly it got the direction of many of the problems quite well. Similarly, tell me about the failure modes on V2. Right, so we were not able to test O3 as much on V2, but I can tell you about failure modes based on what we saw on V1. And well, there are many, but generally
This is a model where reasoning abilities can decrease exponentially with problem size. If you have more objects in the scene, if you have more rules, more concepts interacting, you see this exponential decrease in capabilities.
It's also, you know, because it's a model that works by writing a kind of natural language program that describes what it's seeing, that describes the problem and the sequence of steps to solve it. So in that sense, it's 100% a natural language program. And that means that in order to solve a problem, it has to talk about it
using words. And as a result, if you have a task where the rule is very simple to grok for a human, but in a non-verbal way, and it's very difficult to put into words, it has no verbal analogy, that's actually much harder to solve for this chain-of-thought model. Other than that, we saw that one of the big challenges is, you know, compositionality, having multiple rules interact.
But there's also, it seems there's a bit of locality bias going on as well, where if you have to combine together bits of information that are spatially co-located together on the grid, that's easier for the model than if you have to do the exact same thing, but the two bits of information you have to synthesize are pretty distant.
So having to combine together bits of information that are separate, having to... So it seems as well that the model has trouble simulating the execution of a rule and then reading the result. Like, for instance, you're solving an ARC task and you grok a certain rule and then you start applying it. Let's say it's something like
line continuation or something. And then you have to take another rule and use that rule to read a bit of information that you have written in the process of executing the first rule. That sort of thing is completely out of reach for the chain-of-thought models. - How multi-dimensional do you think intelligence is? You know, one school of thought, and I think you might subscribe to this, is that the universe is kind of almost made up of platonic
rules that are disconnected from the world that we live in. And then there's this kaleidoscope idea you talk about and they get combined together and that's what we see. But another school of thought is that there'll always be another dimension of intelligence. We'll always need ARC V4, V5, V6, and there'll always be something missing. Each step of generality that you cross,
you're gaining a nonlinear amount of capabilities, right? And so after a few steps, you are so overwhelmingly superhuman across every possible dimension that yeah, you can say without a doubt that you have AGI, in fact, you have superintelligence.
But yeah, intelligence is in a sense multidimensional. And what Arc is trying to capture is really just this fluid intelligence aspect, this ability to recombine core knowledge building blocks.
In my definition of intelligence, intelligence is about efficiently acquiring skills and knowledge and recombining them to, well again, efficiently recombining them to adapt to novel tasks, to novel situations that you cannot prepare for explicitly.
Purely the ability to take a bunch of building blocks and recombine them, doing a kind of program synthesis, that's one aspect of that. That's probably the most central aspect, which is, you know, what we are focusing on with Arc. But it's not the only aspect, because this is assuming that you already have this pile of knowledge
available. So it's overlooking the acquisition of these abstractions. It's also overlooking the acquisition of information about the task. In Arc, you're provided all the information about the task at once. But in the real world, you have to collect that information, you have to take actions, set goals, to discover what your environment is even about, what you can do within it. And you have to do these things efficiently, of
course. And that efficiency aspect is very important because intelligence was developed by evolution, it's an evolutionary adaptation. And when you're exploring the world, you are taking on some risk. You might get killed by a predator, for instance. And so you want to be--
you want to gain the maximum amount of information, and thereby power over your environment, by taking on a minimum amount of risk and expending a minimum amount of energy. That's not something that you can measure, that you can capture, with ARC V1 or V2 alone. Can you just expand on the significance of the solution space prediction with O3? Because that rather suggests to me that
It's almost this Rich Sutton idea, where it's nearly a blank slate and it's very empiricist and we just take the data in and the neural network does all of the things. I always imagined that we would need to have some kind of structured approach which took the core knowledge into account. Do you think that it's actually simpler than we thought?
Trying to directly predict the output versus trying to write down the steps to get the output: they're not entirely separate things, because of course once you've written down the steps, you can do what looks like transduction. And O3 is not actually a real transduction model, because it's
much closer to a program synthesis model, where it's searching for the right chain of thought to describe the task and list the sequence of steps to solve it. And once you have the chain of thought, you can just use the model to execute it and it gives you the output. So from the outside, if you treat the entire system as a black box, it looks
like transduction, but the same would be true of any program search system. What it's actually doing, and the reason why it's able to adapt to novelty so well, is because it's synthesizing this chain of thought, which serves as a recombination artifact for the knowledge and the skills that the model has, a recombination artifact that is adapted to the particular task at hand.
So it's much closer to a program synthesis model. This is something that the community found very confusing, because in the last interview you were describing, I think it was O1 Pro, as being a kind of explicit search process. And what seems to be the case is that, you know, there is some kind of reinforcement learning thing in the pre-training and then it maybe does some sampling at inference time. So, you know, it's doing a whole bunch of completions.
And are you saying it's as if it's doing a program search, or are you saying it's somehow explicitly doing a chain-of-thought, you know, program search? It's searching over the space of possible chains of thought and finding the one that seems most appropriate. So in that case, it's entirely analogous to a program search system where the program you're synthesizing is a natural language program, right?
a program written in English. Okay, it just seems a bit strange doing auto regression on a language model, how that could be characterized as a search process. So a model like O1 Pro, for instance, or O3 is not just auto-regressive. It has actually this test time search step, which is why
it can adapt to novelty much, much better than the base models that are purely auto-regressive. And that's why, again, you see on-- so in general, ARC, even ARC 1, has completely resisted the pre-training, the purely auto-regressive pre-training scaling paradigm.
Like from 2019 to 2025, we scaled up these models by like 50,000x, like from GPT-2 to GPT-4.5. And even on Arc 1, you went from 0% to something like 10%. On Arc 2, you're going from 0% to 0%.
And meanwhile, if you have any system that's actually capable of doing test time adaptation, like test time search, like O1 Pro or O3, then you're getting much, much better performance. There's this huge performance gap. So in short, you can tell the difference between a model that does not do test time adaptation and the model that does.
by looking at this performance gap, this generalization gap on Arc, also by looking at latency and by looking at cost. So of course, the model that does test time search is going to give you your answer. It's going to take much longer. Like if you look at O1 Pro, for instance, it's taking 10 minutes to answer your queries and it's going to cost you
much more as well because of all this work it's doing. So I could download the DeepSeek R1 model and I'm running it on my machine, and as far as my machine is concerned, it's just a normal LLM, it's doing greedy sampling, autoregressive... That's right, which is why it does not adapt to novelty and it scores basically zero on Arc, or like 1% maybe. Oh, so you're saying there is something different about... O3 is qualitatively different, that's correct.
It is qualitatively different from all the other models that came before. It is actually a model that has fluid intelligence, it has a non-zero amount of fluid intelligence, and R1, for instance, does not. Okay, so categorically it's doing some kind of active search process at inference? That's what it looks like. So of course I don't actually know how it works, but that's what I would speculate it looks like, yes.
And you see it in the latency, in the cost, and of course the ARC performance. Would you be shocked and surprised if it came to light that it was just doing like auto-regressive, greedy sampling? Honestly, I think it's very, very unlikely because it's completely incompatible with the characteristics of the system that we know of, that we were exposed to when we tested O3.
- Awesome, and do you think that there will always be human gaps? - Probably not always. Today, there are very clear, very significant gaps, right? Like we're not actually that close.
to AGI right now, but eventually, as we get closer and closer, there will be fewer and fewer gaps. And at some point, we're going to have AI systems that are just overwhelmingly superhuman along every possible axis you choose to look at. So I don't think that there will be gaps forever. All right, Tim, thank you so much for doing this. Thank you, Tim. Looking forward to seeing you in a couple weeks.