Hi everyone, I'm François. I'm super excited to share with you some of my ideas about AGI and how we're going to get there. This chart right here shows one of the most important facts about the world: the cost of compute has been consistently falling by two orders of magnitude every decade since 1940, and there's no sign that it's stopping anytime soon. And in AI, compute and data have long been the primary bottleneck to what we could achieve. And around 2010,
as you all know, with the abundance of GPU-based compute and large datasets, deep learning really started to work. And all of a sudden, we were making fast progress on problems that had long seemed intractable, across computer vision and natural language processing. And in particular, self-supervised text modeling started to work,
and the dominant paradigm of AI became scaling up LLM pre-training. And this approach was crushing almost all benchmarks. And remarkably, it was getting predictably better benchmark results as we scaled up model size and training data size with the exact same architecture and the exact same training process. Those are the scaling laws that Jared told you about a few minutes ago. So it really seemed like we had it all figured out.
And many people extrapolated that more scale was all that was needed to solve everything and get to AGI. Our field became obsessed with the idea that general intelligence would spontaneously emerge by cramming more and more data into bigger and bigger models.
But there was one problem. We were confused about what these benchmarks really meant. There's a big difference between memorized skills, which are static and task-specific, and fluid general intelligence, the ability to understand something you've never seen before on the fly. And back in 2019, before the rise of LLMs, I released an AI benchmark to highlight this difference.
It's called the Abstraction and Reasoning Corpus, or ARC-1. And from back then in 2019 to now, with a model like GPT-4.5 for instance, there's been a roughly 50,000x scale-up
of base LLMs. And we went from 0% accuracy on that benchmark to roughly 10%, which is not a lot. It's very close to zero if you take into account the fact that any one of you in this room would score well above 95%.
So to crack general fluid intelligence, it turns out we needed new ideas beyond just scaling up pre-training and doing static inference. This benchmark was not about regurgitating memorized skills. It was really about making sense of a new problem that you've never seen before, on the fly. But then last year, in 2024, everything changed. The AI research community started pivoting to a new and very different paradigm:
test-time adaptation: creating models that could change their own state at test time to adapt to something new.
So this wasn't about acquiring preloaded knowledge anymore. It was really about the ability to learn and adapt at inference time. And suddenly, we started seeing significant progress on ARC. So finally, we had AI that was showing genuine signs of fluid intelligence. In particular, in December last year, OpenAI previewed its o3 model, and they used a version of it that was fine-tuned specifically on ARC,
and it showed human-level performance on that benchmark for the very first time.
And today, in 2025, we have suddenly moved on from the pre-training scaling paradigm, and we are now fully in the era of test-time adaptation. Test-time adaptation is all about the ability of a model to modify its own behavior dynamically based on the specific data it encounters during inference. That covers techniques like test-time training, program synthesis, and chain-of-thought synthesis, where the model tries to reprogram itself
for the task at hand. And today, every single AI approach that performs well on ARC is using one of these techniques. So today I want to answer the following questions. First, why did the pre-training scaling paradigm not get us to AGI? If you look back just two years ago, this was the standard dogma; everybody was saying it. And today almost no one believes it anymore. So what happened?
And next, does test-time adaptation get us to AGI this time? And if that's the case, maybe AGI is already here; some people believe so. And finally, beyond test-time adaptation, what else might be next for AI? To answer these questions, we have to go back to a more fundamental question: what even is intelligence? What do we mean when we say we're trying to build AGI?
If you look back over the past decades, there have been two lines of thought to define intelligence and the goals of AI. There's the Minsky-style view: AI is about making machines capable of performing tasks that would normally be done by humans. And this echoes very closely the current mainstream corporate view that AGI would be a model that could perform most economically valuable tasks,
like 80% of them, which is the number often quoted. But then there's the McCarthy-style view: AI is about getting machines to handle problems they have not been prepared for. It's about getting AI to deal with something new.
And my view is closer to the McCarthy view. Intelligence is a process, and skill is the output of that process. So skill itself is not intelligence, and displaying skill at any number of tasks does not demonstrate intelligence. This is like the difference between a road network and a road building company. If you have a road network, then you can go from A to B for a specific, predefined set of A's and B's.
But if you have a road building company, then you can start connecting new A's, new B's on the fly as your needs evolve.
So intelligence is the ability to deal with new situations. It's the ability to blaze fresh trails and build new roads. So attributing intelligence to what is really a crystallized behavior program, a skill program, is a category error. You are confusing the process and its output. So don't confuse the road and the process that created the road.
So to formalize this a bit, I see intelligence as the conversion ratio between the information you have, mostly your past experience, but also any developer-imparted priors that the system might have,
and your operational area over the space of potential future situations that you might encounter, a space that will feature high novelty and uncertainty. So intelligence is the efficiency with which you operationalize past information in order to deal with the future. It's an efficiency ratio.
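As a rough formula (this is a loose paraphrase for illustration only; the formal definition in "On the Measure of Intelligence" is grounded in algorithmic information theory, and the symbols here are not the paper's notation):

$$\text{Intelligence} \;\propto\; \frac{\text{skill attained, weighted by the scope and novelty of the situations it covers}}{\text{priors} + \text{experience}}$$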
And that's the reason why using exam-like benchmarks to evaluate AI models is a bad idea. They're not going to tell you how close we are to AGI, because human exams weren't designed to measure intelligence. They were designed to measure task-specific skill and knowledge.
They were designed according to assumptions that are sensible for humans, but not for machines. Like, for instance, most exams assume that you haven't read and memorized all the exam questions and the answers beforehand. So if you want to rigorously define and measure intelligence, here are some key concepts that you have to take into account.
The first is the distinction between static skills and fluid intelligence. So between having access to a collection of static programs to solve known problems versus being able to synthesize brand new programs on the fly to face a problem you've never seen before. And of course, it's not a binary, it's not one or the other, there's a spectrum between the two. The second concept is operational area.
For a given skill, there's a big difference between being skilled only in situations that are very close to what you've seen before and being skilled in any situation within a very broad scope. For instance, if you know how to drive, you should be able to drive in any city, not just in a specific geofenced area. You can learn to drive in San Jose, then move to Sacramento, and still drive, right? Again, there's a spectrum there. It's not binary.
And lastly, you should look at information efficiency. For a given skill, how much information, how much data, how much practice did you need to acquire that skill? And of course, higher information efficiency means higher intelligence. And the reason these definitions matter a lot is that as engineers, we can only build what we measure.
So the way we define and measure intelligence is not a technical detail. It really reflects our understanding of the problem of cognition. It scopes out the questions we are going to be asking, and so it determines the answers that we are going to be getting. It's the feedback signal that drives us towards our goals.
And a phenomenon you see constantly in engineering is the shortcut rule: when you focus on achieving a single measure of success, you may succeed, but you will do so at the expense of everything else that was not captured by your measure. So you hit the target, but you miss the point. And you see this all the time on Kaggle, for instance.
We saw it with the Netflix prize, where the winning system was extremely accurate, but it was way too complex to ever be used in production. So it ended up never being used. It was effectively pointless.
We also saw it in AI with chess-playing AI. The reason the AI community set out to create programs that could play chess back in the '70s was because people expected this would teach us about human intelligence. And then a couple of decades later, we achieved the goal when Deep Blue beat Kasparov, the world champion. And in the process, we had really learned nothing about intelligence.
So you hit the target, but you miss the point. And for decades, AI has chased task-specific skill because that was our definition of intelligence. But this definition only leads to automation, which is exactly the kind of system that we have today.
But we actually want AI that's capable of autonomous invention. We don't want to stop at automating known tasks. We want AI that could tackle humanity's most difficult challenges and accelerate scientific progress. That's what AGI is meant to be.
And to achieve that, we need a new target. We need to start targeting fluid intelligence itself, the ability to adapt and invent. So one definition of AGI only allows automation. It increases economic productivity, which is obviously extremely valuable. Maybe it also increases unemployment.
But the other definition unlocks invention and the acceleration of the timeline of science. And it's by measuring what we really care about that we'll be able to make progress. So we need a better target; we need a better feedback signal.
What does that look like? My first attempt at creating a way to measure intelligence in AI systems was the ARC-AGI benchmark. I released ARC-1 back in 2019. It's like an IQ test for machines, and also for humans. ARC-1 contains 1,000 tasks like this one here, and each task is unique.
So that means you cannot cram for ARC. You have to figure out each task on the fly, using your general intelligence rather than your memorized knowledge. And of course, solving any problem always requires some knowledge. In most benchmarks, the knowledge priors that you need are typically left implicit. In the case of ARC, we made them explicit. So all ARC tasks are built entirely on top of core knowledge priors,
which are things like objectness, elementary physics, basic geometry, topology, counting: concepts that any four-year-old child has already mastered. So solving ARC requires very little knowledge, and the knowledge it does require is very much not specialized. You don't need to prepare for ARC in order to solve it.
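To make the format concrete, here is the shape of a single task, in roughly the public ARC-1 JSON structure of a few demonstration pairs plus a test input. The particular grids and the solve function below are a made-up toy, far simpler than a real ARC task:

```python
# A single ARC-style task: a few demonstration input/output grids plus a
# test input whose output the solver must produce. Grids are small 2-D
# arrays of integers 0-9 (colors).
task = {
    "train": [
        {"input": [[0, 0], [0, 1]], "output": [[1, 1], [1, 0]]},
        {"input": [[1, 0], [1, 1]], "output": [[0, 1], [0, 0]]},
    ],
    "test": [
        {"input": [[1, 1], [0, 1]]},  # expected output: [[0, 0], [1, 0]]
    ],
}

# The rule here (flip every cell) only needs core-knowledge-level concepts.
# What the benchmark probes is whether a solver can infer the rule from two
# examples and apply it to a grid it has never seen.
def solve(grid):
    return [[1 - cell for cell in row] for row in grid]

for pair in task["train"]:
    assert solve(pair["input"]) == pair["output"]
print(solve(task["test"][0]["input"]))  # [[0, 0], [1, 0]]
```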
And what makes ARC unique is that you cannot beat it purely by memorizing patterns. It really requires you to demonstrate true fluid intelligence. Meanwhile, pretty much every other benchmark out there targets fixed, known tasks, so they can actually be solved, or hacked, via memorization alone.
That's what makes ARC fairly easy for humans but very challenging for AI. And when you see a problem like this, where a human child can perform really well but the most advanced, most sophisticated AI models still struggle, that's a big red flashing light telling you that we're missing something, that new ideas are needed. One thing I want you to keep in mind is that
ARC is not going to tell you whether a system is already AGI or not. That's not its purpose. ARC is really a tool to direct the attention of the research community towards what we see as the most important unsolved bottlenecks on the way to AGI. So ARC is not the destination, and solving ARC is not the goal. ARC is really just an arrow pointing in the right direction. And ARC has completely resisted the pre-training scaling paradigm.
Even after a 50,000x scale-up of pre-trained base LLMs, their performance on ARC stayed near zero. So we can decisively conclude that fluid intelligence does not emerge from scaling up pre-training. You absolutely need test-time adaptation in order to demonstrate genuine fluid intelligence.
And importantly, when test-time adaptation arrived last year, ARC was really the only benchmark at the time that provided a clear signal about the profound shift that was happening. Other benchmarks were saturated, so they could not distinguish between a true increase in intelligence and just brute-force scaling. So now you see this graph and you're probably asking: clearly, at this point, ARC-1 is also saturating, so does that mean we have human-level AI now?
Well, not yet. What you see on this graph is that ARC-1 was a binary test. It was a minimal reproduction of fluid intelligence. So it only really gives you two possible modes. Either you have no fluid intelligence, in which case you will score near zero,
like base LLMs, or you have non-zero fluid intelligence, in which case you will instantly score very high, like the o3 model from OpenAI, for instance. And of course, every one of you in this room would score within noise distance of 100%. So ARC-1 saturates way below human-level fluid intelligence.
And so now we need a better, more sensitive tool, one that provides more useful bandwidth and a better comparison with human intelligence. That tool is ARC-AGI-2, which we released in March this year.
So back in 2019, ARC-1 was meant to challenge the deep learning paradigm, where models are big parametric curves used for static inference. And today, ARC-2 challenges reasoning systems. It challenges the test-time adaptation paradigm. The benchmark format is still the same, but there's a much greater focus on probing compositional generalization. So the tasks are still very feasible for humans, but they're much more sophisticated.
And as a result, ARC-2 is not easily brute-forceable. In practice, what this means is that in ARC-1, for many tasks, you could just look at a task and instantly see the solution without having to think too much about it. With ARC-2, all tasks require some level of deliberate thinking,
but they still remain very feasible for humans. And we know this because we tested 400 people in person in San Diego over several days. And we are not talking about people who have physics PhDs here. We recruited random folks: Uber drivers, UCSD students, people who are unemployed, basically anyone trying to make some money on the side.
And every task in ARC-2 was solved by at least two of the people who saw it. Each task was seen on average by about seven people. What that tells you is that a group of 10 random people with majority voting would score 100 percent on ARC-2. So we know these tasks are completely doable by regular folks with no prior training. How well do AI models do? Well, if you take base LLMs,
models like GPT-4.5 or Llama 4, it's simple: they get 0%. There is simply no way to do these tasks via memorization. Next, if you look at static reasoning systems, systems that use a single chain of thought that they generate for the task, they don't do much better. They're on the order of 1 to 2%, so very much within noise distance of zero.
So what that tells you is that to solve ARC-2, you really need test-time adaptation. All systems that score meaningfully above zero are using TTA.
But even then, they're still far below human level. So compared to ARC-1, ARC-2 enables much more granular evaluation of TTA systems, systems like o3, for instance. And that's where you see that o3 and other systems like it are still not quite at human level. And in my view, as long as it's easy to come up with tasks that any one of you can do, but that AI cannot figure out no matter how much compute you throw at it,
we don't have AGI yet. And you will know that you are close to having AGI when it becomes increasingly difficult to come up with such tasks. We are clearly not there yet. And to be clear, I don't think ARC-2 is the final test. We're not going to stop at ARC-2. We've started development on ARC-AGI-3.
And ARC-3 is a significant departure from the input/output pair format of ARC-1 and 2. We are assessing agency: the ability to explore, to learn interactively, to set goals and achieve them autonomously. So your AI is dropped into a brand new environment where it doesn't know what the controls do. It doesn't know what the goal is. It doesn't know what the gameplay mechanics are. It has to figure out
everything on the fly, starting with what it's even supposed to do in the game. And every single game is entirely unique. They're all built on top of core knowledge priors only, just like in ARC-1 and 2.
So we'll have hundreds of interactive reasoning tasks like this one. And efficiency is central to the design of ARC-3. Models won't just be graded on whether they can solve a task, but on how efficiently they solve it. We are establishing a strict limit on the number of actions that a model can take, and we are targeting the same level of action efficiency as we observe in humans. So we're going to launch this in early 2026,
and next month, in July, we're going to release a developer preview so you can start playing with it. So what's it going to take to solve ARC-2, which we are still very far from today? Then to solve ARC-3, which we're even further away from? Maybe, in the future, to solve ARC-4, and eventually get to AGI? What are we still missing? I've said that intelligence is the efficiency with which you operationalize the past
to face a constantly changing future. But of course, if the future you face had really nothing in common with the past, no common ground with anything you've seen before, you could not make sense of it, no matter how intelligent you were. But here's the thing: nothing is ever truly novel. The universe around you is made of many different things that are all similar to each other. One tree is similar to another tree, which is also similar to a neuron. Or
electromagnetism is similar to hydrodynamics, which is also similar to gravity. So we are surrounded by isomorphisms. I call this the kaleidoscope hypothesis.
Our experience of the world seems to feature never-ending novelty and complexity, but the number of unique atoms of meaning that you need to describe it is actually very small. And everything around you is a recombination of these atoms. And intelligence is the ability to mine your experience to identify these atoms of meaning that can be reused across many different situations, across many different tasks.
and this involves identifying invariants, structure, principles that seem to be repeated.
And these building blocks, these atoms, are called abstractions. And whenever you encounter a new situation, you're going to make sense of it by recombining on the fly abstractions from your collection to create a brand new model that's adapted to the situation. So implementing intelligence is going to have two key parts. First, there's abstraction acquisition.
You want to be able to efficiently extract reusable abstractions from your past experience, from a feed of data, for instance. Then there's on-the-fly recombination. You want to be able to efficiently select and recombine these building blocks into models that are fit for the current situation. The emphasis on efficiency here is crucial. How intelligent you are,
is not just determined by whether you can do something; it's determined by how efficiently you can acquire good abstractions from your experience, and how efficiently you can recombine them to navigate novelty. So if you need hundreds of thousands of hours to acquire a simple skill,
you're not very intelligent. Or if you need to enumerate every single move on the chessboard to find the best move, you're not very intelligent. So intelligence is not just demonstrating high skill, it's really the efficiency with which you acquire and deploy these skills. It's both data efficiency and compute efficiency
And at this point, you start to see why simply making our AI models bigger and training them on more data didn't automatically lead to AGI. We were missing a couple of things. First, these models lacked the ability to do on-the-fly recombination. So at training time, they were learning a lot. They were acquiring many useful abstractions. But then at test time, they were completely static. You could only use them to fetch and apply a prerecorded template.
That is the critical problem that test-time adaptation is addressing. TTA adds on-the-fly recombination capabilities to our AI. That's a huge step forward that gets us much closer to AGI.
That's not the only problem. Recombination is not the only thing missing. The other problem is that these models are still incredibly inefficient. If you take gradient descent, for instance, gradient descent requires vast amounts of data to distill simple abstractions. Many orders of magnitude more data than what humans need, roughly three to four orders of magnitude more.
And if you look at recombination efficiency, even the latest state-of-the-art TTA techniques still need thousands of dollars of compute to solve ARC-1 at human level. And that doesn't even scale to ARC-2. The fundamental issue here is that deep learning models are missing compositional generalization. And that's the thing that ARC-2 is trying to measure.
And the reason why is that there's more than one kind of abstraction.
And this is really important. I said that intelligence is about mining abstractions from data and then recombining them. Well, there are really two kinds of abstraction: type one and type two. They're pretty similar to each other; they mirror each other. Both are about comparing things, comparing instances, and merging individual instances into common templates by eliminating certain details about the instances. So basically, you take a bunch of things, you compare them,
you drop the details that don't matter. And what you're left with is an abstraction. And the key difference between the two is that one operates over a continuous domain and the other operates over a discrete domain. So type one or value-centric abstraction is about comparing things via a continuous distance function
And that's the kind of abstraction that's behind perception, pattern recognition, intuition, and also, of course, modern machine learning. And type two, or program-centric abstraction, is about comparing discrete programs, which is to say graphs. And instead of trying to compute distances between them, you're going to be looking for exact structure matching. You're going to be looking for exact isomorphisms, subgraph isomorphisms. And this is what underlies
much of human reasoning. It's also what software engineers do when they're refactoring some code. So if you hear a software engineer talk about abstraction, they mean this kind of abstraction. So two kinds of abstraction, both driven by analogy making, either value analogy or program analogy.
And all cognition arises from a combination of these two forms of abstraction. You might remember the left brain versus right brain metaphor: one half for perception and intuition, the other half for reasoning, planning,
and rigor. And transformers are great at type 1 abstraction. They can do everything that type 1 is effective for: perception, intuition, pattern recognition. It all works well. So in that sense, transformers are a major breakthrough in AI, but they're still not a good fit for type 2. And this is why you will struggle to train one of these models to do very simple type 2 things, like sorting a list or adding digits provided as a sequence of tokens.
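To make the distinction concrete, here is a toy contrast between the two regimes. It assumes nothing about how transformers or any real system implement them; numpy cosine similarity just stands in for "continuous distance", and a networkx subgraph-isomorphism check stands in for "exact structure matching":

```python
import numpy as np
import networkx as nx
from networkx.algorithms import isomorphism

# Type 1 (value-centric): compare instances with a continuous distance in
# some embedding space. The judgment is graded and approximate.
a = np.array([0.9, 0.1, 0.3])
b = np.array([0.8, 0.2, 0.25])
similarity = a @ b / (np.linalg.norm(a) * np.linalg.norm(b))
print(f"type-1 similarity: {similarity:.3f}")  # "these feel alike"

# Type 2 (program-centric): compare discrete programs, represented as graphs
# of operations, by looking for an exact subgraph isomorphism. The judgment
# is binary: the structure either matches or it doesn't.
program = nx.DiGraph([("load", "sort"), ("sort", "dedupe"), ("dedupe", "save")])
pattern = nx.DiGraph([("sort", "dedupe")])
matcher = isomorphism.DiGraphMatcher(program, pattern)
print("type-2 match:", matcher.subgraph_is_isomorphic())  # True
```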
So how are we going to get to type 2? You have to leverage discrete program search, as opposed to purely manipulating continuous, interpolative spaces learned with gradient descent. Search is what unlocks invention, beyond just automation.
All known AI systems today that are capable of some kind of invention, some kind of creativity, rely on discrete search. Even back in the '90s, we were already using genetic search to come up with new antenna designs. Or you can take AlphaGo with move 37: that was discrete search. Or, more recently, the AlphaEvolve system from DeepMind. All discrete search systems.
So deep learning doesn't invent, but search does. So what's discrete program search? It's basically combinatorial search over graphs of operators taken from some language, some DSL.
And to better understand it, you can draw an analogy between program synthesis and the machine learning techniques you already know. In machine learning, your model is a differentiable parametric function, so it's a curve. In program synthesis, it's going to be a discrete graph, a graph of symbolic ops from some language. In ML, your learning engine, the way you create models, is gradient descent,
which is very compute efficient, by the way. Gradient descent will let you find a model that fits the data very quickly, very efficiently. In program synthesis, the learning engine is search, combinatorial search, which is extremely compute inefficient, obviously. In machine learning, the key obstacle that you run into is data density. In order to fit a model, you need a dense sampling of the data manifold. You need a lot of data.
And program synthesis is the exact reverse. Program synthesis is extremely data efficient. You can fit a program using only two or three examples. But in order to find that program, you have to sift through a vast space of potential programs, and the size of that space grows combinatorially with problem complexity. You run into this combinatorial explosion wall.
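Here is a minimal sketch of that trade-off, using a deliberately tiny, invented DSL. Real program synthesis systems use much richer languages and much smarter enumeration, but the data efficiency and the combinatorial growth are the same in kind:

```python
from itertools import product

# A tiny, made-up DSL of list-to-list primitives.
PRIMITIVES = {
    "reverse": lambda xs: xs[::-1],
    "sort": sorted,
    "drop_first": lambda xs: xs[1:],
    "double": lambda xs: [x * 2 for x in xs],
}

# Two examples are enough to pin down the target program: extreme data
# efficiency, paid for with search.
examples = [([3, 1, 2], [2, 4, 6]), ([5, 4, 9], [8, 10, 18])]

def run(program, xs):
    for name in program:
        xs = PRIMITIVES[name](xs)
    return list(xs)

def brute_force(max_depth):
    tried = 0
    for depth in range(1, max_depth + 1):
        # The candidate space grows as |DSL| ** depth: 4, 16, 64, ...
        for program in product(PRIMITIVES, repeat=depth):
            tried += 1
            if all(run(program, i) == o for i, o in examples):
                return program, tried
    return None, tried

program, tried = brute_force(max_depth=3)
print(program, "found after", tried, "candidates")  # ('sort', 'double'), 12
```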
I said earlier that intelligence is a combination of two forms of abstraction, type one and type two. And I really don't think that you're going to go very far if you go all in on just one of them, like all in on type one or all in on type two. I think that if you want to really unlock their potential, you have to combine them together. And that's what human intelligence is really good at. That's really what makes us special.
We combine perception and intuition together with explicit step-by-step reasoning. We combine both forms of abstraction in all our thoughts, all our actions, everywhere. For instance, when you're playing chess, you're using type two when you calculate, when you unfold some potential moves step by step in your
mind. But you're not going to do this for every possible move, of course, because there are too many of them. You're only going to do it for a couple of different options. Like here, you're going to look at the knight, the queen. And the way you narrow down these options is via intuition, via pattern recognition on the board. And you build that up very much through experience. You've mined your past experience, unconsciously, to extract these patterns,
and that's very much type one. So you use type one intuition to make type two calculation tractable. So how is the merger between type one and type two going to work? Well, the key system two technique is discrete search over a space of programs, and the blocker that you run into is combinatorial explosion. And meanwhile, the key system one technique
is curve fitting and interpolation on the curve. So you take a lot of data, you embed it on some kind of interpolative manifold that enables fast but approximate judgment calls about the target space. And the big idea is to leverage these fast but approximate judgment calls to fight combinatorial explosion and make program search tractable.
A simple analogy to understand this would be drawing a map. So you take a space of discrete objects with discrete relationships that would normally require combinatorial search, like pathfinding on a subway system, for instance, and you embed these objects into a latent space where you can use a continuous distance function to make fast but approximate guesses about discrete relationships. And this enables you to keep combinatorial explosion in check while doing search.
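Here is a minimal sketch of the map analogy itself, with plain A*-style pathfinding standing in for the much harder problem of guiding program search with learned intuition. The station graph and the 2-D "embedding" coordinates are invented:

```python
import heapq
import math

# A discrete graph (stations and connections) that would normally require
# blind combinatorial search, plus an embedding of each node into a
# continuous space. The continuous distance gives fast but approximate
# guesses that decide which discrete paths get expanded first.
edges = {
    "A": ["B", "C"], "B": ["A", "D"], "C": ["A", "D", "E"],
    "D": ["B", "C", "F"], "E": ["C", "F"], "F": ["D", "E"],
}
embedding = {  # hypothetical learned 2-D coordinates for each station
    "A": (0, 0), "B": (1, 1), "C": (1, -1),
    "D": (2, 0), "E": (2, -2), "F": (3, -1),
}

def guess(u, v):
    # Type-1-style judgment: continuous, cheap, approximate.
    (x1, y1), (x2, y2) = embedding[u], embedding[v]
    return math.hypot(x2 - x1, y2 - y1)

def guided_search(start, goal):
    # Type-2-style search: exact and discrete, but ordered by the guesses,
    # so only a handful of candidate paths ever get expanded.
    frontier = [(guess(start, goal), 0, start, [start])]
    visited = set()
    while frontier:
        _, cost, node, path = heapq.heappop(frontier)
        if node == goal:
            return path
        if node in visited:
            continue
        visited.add(node)
        for nxt in edges[node]:
            heapq.heappush(frontier, (cost + 1 + guess(nxt, goal), cost + 1, nxt, path + [nxt]))
    return None

print(guided_search("A", "F"))  # ['A', 'C', 'D', 'F']
```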
This is what the full picture looks like.
This is the system that we are currently working on. AI is going to move towards systems that are more like programmers, systems that approach a new task by writing software for it. When faced with a new task, your programmer-like meta-learner will synthesize, on the fly, a program or model that is adapted to the task. This program will blend deep learning sub-modules for type 1 sub-problems, like perception for instance,
and algorithmic modules for type two subproblems. And these models are going to be assembled by a discrete program search system that is guided by deep learning based intuition about the structure of program space.
And this search process isn't done from scratch. It's going to leverage a global library of reusable building blocks of abstractions. And that library is constantly evolving as it's learning from incoming tasks. So when a new problem appears, the system is going to search through this library for relevant building blocks.
And whenever, in the course of solving a new problem, you synthesize a new building block, you're going to upload it back to the library, much like a software engineer who develops a useful library for their own work will put it on GitHub so that other people can reuse it.
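Here is a minimal sketch of that loop, under the simplifying assumption that abstractions are plain Python callables. This is an illustration of the search-reuse-upload cycle, not Ndea's actual architecture:

```python
# A global library of reusable abstractions. The synthesizer searches over
# the current blocks, and any useful composite it discovers gets named and
# uploaded back, so later tasks start from a richer vocabulary.
class AbstractionLibrary:
    def __init__(self):
        self.blocks = {
            "reverse": lambda xs: xs[::-1],
            "sort": sorted,
        }

    def relevant(self, task_hint):
        # Stand-in for type-1 intuition: rank blocks by a crude string match
        # instead of a learned embedding of tasks and programs.
        return sorted(self.blocks, key=lambda name: name not in task_hint)

    def upload(self, name, fn):
        # Like pushing a useful helper to GitHub: future searches can reuse it.
        self.blocks[name] = fn

library = AbstractionLibrary()

# Solving a "sort descending" task yields a new composite abstraction...
sort_desc = lambda xs: library.blocks["reverse"](library.blocks["sort"](xs))
library.upload("sort_desc", sort_desc)

# ...which the next task can fetch as a single building block.
print(library.relevant("sort_desc the scores"))  # ['sort', 'sort_desc', 'reverse']
print(library.blocks["sort_desc"]([3, 1, 2]))    # [3, 2, 1]
```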
And the ultimate goal here is to have an AI that can face a completely new situation and is going to use its rich abstraction library to quickly assemble a working model, much like a human software engineer can quickly create a piece of software to solve a new problem by leveraging existing tools, existing libraries. And this AI is going to keep improving itself over time, both by expanding its library of abstractions
and also by refining its intuition about the structure of program space. This system is what we are building at Ndea, our new research lab. We started Ndea because we believe that in order to dramatically accelerate scientific progress, we need AI that's capable of independent invention and discovery. We need AI that can expand the frontiers of knowledge, not just operate within them.
And we really believe that a new form of AI is going to be key to this acceleration. Deep learning is great at automation; it's incredibly powerful for automation. But scientific discovery requires something more. Our approach at Ndea is to leverage deep-learning-guided program search to build this programmer-like meta-learner.
And to test our progress, our first milestone is going to be to solve ARC-AGI using a system that starts out knowing nothing at all about ARC-AGI. And we ultimately want to leverage our system for science, to empower human researchers and help accelerate the timeline of science.