
Greg Kamradt: Benchmarking Intelligence | ARC Prize

2025/6/24

MLOps.community

People
Greg Kamradt
Topics
Greg Kamradt: I believe accelerating progress toward artificial general intelligence (AGI) is critically important, because I believe AGI will be one of the greatest technologies humanity has ever had. To push that progress forward, we've chosen to drive AGI development through a benchmark. The benchmark was created by Francois Chollet in 2019 to evaluate AI on problems that are easy for humans but hard for AI. We focus on that class of problems because the human brain is the only instance of general intelligence we currently know of. Our working definition of AGI is that we've reached it when we can no longer come up with problems that humans can solve but AI cannot. To validate this, we released ARC-AGI-2 and tested 400 people to make sure humans can solve the tasks while AI still can't. We therefore think that understanding how the human brain works, and where the gaps between human intelligence and AI lie, is the fast track to AGI. By focusing on closing those gaps, we can get to AGI faster. I firmly believe that with sustained effort and innovation we will eventually build machines with true general intelligence, and that will bring enormous benefit to human society.


Chapters
Greg Kamradt discusses the Arc AGI benchmark, which focuses on problems easy for humans but hard for AI. The goal is to identify the gap between human and artificial intelligence, using human performance as a benchmark for AGI.
  • Arc AGI benchmark focuses on human-easy, AI-hard problems.
  • Human brain is the only proof point of general intelligence.
  • Arc AGI 1 and 2 are unsolved by AI, but solvable by humans.
  • A capable human, not a PhD or toddler, is the benchmark for human performance.

Transcript


This is too good. You were awesome, man. This is great. That's how we know it's legit now.

I'll tell you what, I was looking forward to this. Dude, I can't tell you how stoked I am. Tell me exactly how it went down that you were on a live stream with Sam Altman. Yeah. So I run ARC Prize right now, right? And we run an AI benchmark called ARC-AGI. Which is? We want AGI progress pulled forward. Like we want tech progress pulled forward because we believe it's going to be one of the best technologies that humanity's ever had, right? Yeah.

There's a big question on, well, how the heck do you make progress go faster? And the route we've chosen is through a benchmark, created by Francois Chollet in 2019.

It takes a very interesting approach. There's a lot of benchmarks out there that go PhD plus plus problems. And so they'll ask you like the hardest questions and then even harder. And they'll say, this is the last test we're ever going to have to take because we can't come up with any harder questions. AI ends up solving those. Like it ends up doing it well. Like the ceiling on AI is like really high. Like it's insane. It's already doing some superhuman stuff. So we take a different approach on that.

We want to know what types of problems are easy for humans but hard for AI. I love that. And the reason why, just getting into it, like the whole shtick behind it is because we have one proof point of general intelligence right now.

And that's the freaking human brain. So these are things like strawberry. I would say that that is a class of problems where if you can find things like that, it's like, dang, AI can't do that, but humans still can. We probably don't have AGI if we can come up with those problems, right? Now, the hard part is those are one-off questions. And so it's easy to find one-off questions. But if you want to find a domain where you can come up with like 200 questions in the same class that you can actually quantify this for.

then that becomes a lot more difficult. And so our theory about AGI, and this is more of a working, this is an observational definition rather than like an inherent one, is when we can no longer come up with problems that humans can do,

but AI can't, then we have AGI. Wow, okay. The inverse of that, though, is if we can come up with problems that humans can do and AI can't do, then we don't have AGI. We're not there yet. And by virtue of Arc AGI 1, our first version of our benchmark being out there, the fact that it's even out there and unsolved

That's a class of problems that humans can do. We just came out with ArcAGI 2 and we actually went and gathered, we gathered up 400 different people and tested them on ArcAGI 2 on every single task within there. And we made sure, because if we're going to claim that humans can do this, humans better be able to do it. So we got 400 different people down in San Diego and we tested them on all this and every task that was in there was solved by at least two people in under two attempts. So humans can do it. We have first party data for that, but AI still can't do it. So

We claim that we don't yet have AGI for them. But they're kind of hard tasks. They get harder for sure. Yeah. Well, that's what's crazy is the way that I think about it is there's a gap in between what humans can do and what AI can't. That gap is narrowing. And so we need to make sure that humans can still do it within a reasonable attempt. We're not looking at PhDs. We're not looking at two-year-olds to see if they can do these. Uh-huh.

A competent person, give them these tasks and see if they can actually do it. So if you pluck somebody off the streets, they have college education type thing? More or less. So when we did our filtering, we made sure that they could use the internet. Things like that. So my mom is out. We didn't want to teach them how to use a computer when we taught them what ARC was. You know what I mean? And so that doesn't allow us to make the claim about the average human.

So we're careful about not saying that. That's not what we're going for. We're going for a capable human. Some people like to argue with us on it, but that's a different conversation. So we run this benchmark, ARC-AGI-1. Okay, great. We get an email from one of OpenAI's board members, who we have a relationship with, in early December, and it more or less said, hey, we have a new model.

We want to test it on ARC. And back in that day, it was Strawberry? It was the Strawberry model? There's so many names going around. I mean, there was Orion even at that point. There was Strawberry. There was What Did Ilya See? You know, there was so much stuff going around. It's like, who the heck knows what rumor refers to what production version? And it hasn't gotten better, to be honest. The official names are probably worse than the rumors. And I think that that tells you don't expect it to get better. Because it won't get better. So again, the email says...

We got a new model. We want to test it. Okay, cool. Yeah, sounds great. It's OpenAI. They have a new model. And they claim to have a very good score, but they didn't say what their score was in the email. On the ARC Prize? On ARC Prize. Because we have public data. And so the way we run our benchmark is there's a bunch of public data that you can go train on, and you can go kind of test yourself on it. But then we have a hidden holdout set. Nice. That...

We can get into why that's important in the first place. Yeah, that's the only way to do it. It's the only way to do it. You have a hidden holdout set for it. And they said, we want to see, are we overfit to this? Because we think we're doing pretty good, but we want to try it on your holdout set. Will you come and test it for us? So we spent the next two weeks testing it, basically, working with their team to go do it. This was through NeurIPS 2024 of last year, too. So I'm at NeurIPS in Vancouver thinking I'm going to relax and just watch talks the whole time. I'm literally testing and hitting OpenAI's API endpoints for that. But we get through, and...

It's like, holy shit. This is good? I mean, freaking SOTA. It was multiples better than we had ever seen another model do beforehand. And keep in mind that this thing had been out there for five years so far and there hadn't been this type of progress. So we're like, holy cow. And so I get on a meeting with Jerry and Nat.

To kind of, like, pre-brief before we go and do the testing. And I say, what score do you claim on ARC-AGI? Because we're going to go and verify that. Because if what they claim is different, then that'd be a big story. They claimed 87%. And keep in mind, the highest on here with a publicly available model is like in the 20s. And then custom-built solutions, purpose-built just to try to beat ARC, were scoring in the 40s and 50s at that time.

And they're claiming 87%. And so it's like, all right, this is a really big deal for us. Anyway, long story short, we go through it. And what's interesting is in inference time compute world, you can no longer just say, here's our model, here's our score. It's here's the model. Here's how much inference time compute we spent. Here's our score. So now there's another variable on it. And what we confirmed for them is...

is that on low compute, and we can get into what low compute actually means in a second here, they scored 75%. And on high compute, we saw, yep, more or less we validated their 87% score. Which is like, okay, yep, it's validated. So we write up our blog post, and it's just like a one-pager Google Doc, and...

somehow, or not somehow, but just like along the lines, Sam ended up getting pulled up on the thread, like on the email thread that we had going back and forth. And we said, we have our results. Here they are. We want to discuss them live. He goes, great. I'm free Tuesday at 530 or whatever. And it's like,

Oh, yeah. Let's go. Yeah, let's go. Like, it's wonderful. I mean, it's like, it's a huge opportunity. It's like, we put together a blog post and we get inside the room and we show them, basically put up on the screen. We show a blog post. Everyone reads through it. And discussion, discussion. And he goes, okay, great. You guys should join our live stream on Friday.

And so we're sitting there, not even — like, we hadn't even considered that we're going to be testing this new model. Like, you know, if it were even real. And then he says, you know, on Tuesday, you guys should come join us on Friday. And our one requirement was that we didn't want to just go up there and have them tell us what to say. Like we didn't want them to write the script for us. And so Mike and I — Mike Knoop, who co-founded ARC Prize — we basically wrote the script that we were happy with, that we were comfortable with. We gave it to them. They said, yeah, looks great.

All right, cool. That's that. See you Friday. Exactly. Well, so they had a very big production. No, I wouldn't say big production crew, but call it 12 people between marketing, comms, videographers and events and sound and everything that were in there. And so we had two rehearsals that went through for that. Two rehearsals went through, went great, made edits. And then...

You know, the live stream comes out on Friday, or went out on Friday. Wow. It was funny because the room we were in wasn't much bigger than this — well, it was a much bigger room, but it was partitioned off on all the sides, you know, with kind of just, like, staging or whatever. Yeah. Meant to look all... It was just a table. You guys are sitting around the table. Yeah. They did a really good job with that. But so I'm on the other side of the partition. I'm hearing Sam and Mark talk about it. Mark Chen, at the time, their SVP of research.

And they're like, now we'd like to invite Greg and then just walk from behind the curtain and just go jump on there. And what's wild is that you knew how many people were watching on the live stream, but it was a small room. Like I said, only like 10 people in there. But it was cool. So did that. But this was an unreleased model. So now we call it O3 Preview because there's a preview model for it. And it was more of a capabilities demonstration about if you push this to the max, what could you actually do for it? And it put this into perspective.

On the low compute, they were spending about $20 per task on this, and we verified with 500 tasks. So that's about $10,000 of compute that they spent on this, just for ARC. It's like, dang, that's a lot, right? Yeah. Like, you're not going to find that many people that want to spend 10 grand on solving our tasks, right? But that was low compute. And then the o3-preview was the first reasoning model? No, no, no, it wasn't the first reasoning model, because, you know —

Depending on when you want to call it, I believe their first reasoning model was when they did o1-preview. I don't even know when that was. That must have been early 2024, mid-2024, maybe even 2023. So we tested o3-preview low compute, and then we also tested high compute.

Which was way more money. That used — I forget the exact dollar amounts — it was in the thousands of dollars per task. Oh, what? Per task. And again, remember, we tested 500 tasks or whatever. I think it was 170 times the amount of compute that was used on the low compute side. But either way, the TL;DR — it's like, okay, so what? That's a lot of money. What's the important part? The important part is...

It just reconfirmed that you can pay for better performance, which is crazy, right? And there's open questions as to where that scaling actually tops out. And so we haven't done as thorough analysis on that system as we'd like to. There's a big open question on does it asymptote?

Or can you get to a hundred percent if you just give it more and more and more? But keep in mind, it's logarithmic in what you give it. So, you know, let's just say you spent a million dollars in order to get a couple percentage points of upgrade. Then you need to spend 10 million, and then you spend a hundred million. It's like, well, where does it actually end up stopping for us? Yeah, what's that trade-off? Yeah, and also, are you going to have to wait a year before it gets done? It's a long time. So for the high compute, the job took

overnight, more or less. Like, overnight, maybe even took longer. I forget the exact duration, but it was not a short amount of time. You're not going to be sitting there waiting for the response. Yeah, it's something you do. You let it run, go out, have your life, and then come back and see if it was able to do it. But actually — OpenAI, last week, they launched their o3 production model. Great. I love o3. Yeah.
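A back-of-envelope sketch of the compute figures mentioned above ($20/task at low compute, 500 tasks, roughly 170x more compute at high compute), plus a toy curve for the "it's logarithmic" point. The score function and its constants are made up for illustration; this is not OpenAI's or ARC Prize's actual data.

```python
# Back-of-envelope: cost of the o3-preview ARC-AGI-1 runs described above,
# plus a toy illustration of logarithmic returns on inference-time compute.

import math

TASKS = 500                      # tasks in the verification run
LOW_COST_PER_TASK = 20.0         # ~$20/task at "low" compute (from the conversation)
COMPUTE_MULTIPLIER = 170         # high compute used ~170x more compute

low_total = TASKS * LOW_COST_PER_TASK         # ~= $10,000
high_total = low_total * COMPUTE_MULTIPLIER   # ~= $1.7M, order of magnitude

print(f"Low-compute run:  ~${low_total:,.0f} total")
print(f"High-compute run: ~${high_total:,.0f} total (~${high_total / TASKS:,.0f}/task)")

# Toy model of "score grows with log(compute)" -- illustrative only.
# a and b are made-up constants chosen so the curve passes near 75% and 87%.
def toy_score(cost_per_task: float, a: float = 0.75, b: float = 0.054) -> float:
    return min(1.0, a + b * math.log10(cost_per_task / LOW_COST_PER_TASK))

for spend in (20, 200, 2_000, 20_000):
    print(f"${spend:>6}/task -> toy score {toy_score(spend):.0%}")
# Each 10x increase in spend buys roughly the same fixed number of points,
# which is why "does it asymptote below 100%?" is still an open question.
```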

And so there was a big open question: okay, well, how does what we tested in December match to what was publicly released today? And we asked OpenAI — we asked Jerry — for confirmation on this, and there's a bunch of nuance. TL;DR: it's not the same model. Exactly. It's not the same model, and there's less compute being used. So we should expect not the same scores. And so we tested on it today and, yeah, as expected, it does really, really well. It doesn't do as well as the model that we tested. It's not the 87%.

But also they released o4-mini. So they're just — they're freaking keeping the models coming. Right. And so, um, a lot of good testing, a lot of good stuff. But what's cool is that ARC-AGI is the tool that we're using to evaluate these things. We have ARC-AGI-2, and all these models are still scoring really, really low on ARC-AGI-2.

So basically Arc AGI 2 is meant to be like that next step, like order of difficulty more? So I've been looking for the good analogy with it. So apologies, I don't have it down yet. But the way I think about it, and this may be an incorrect one, so apologies if I butcher it. But like if Arc AGI 1 measures...

It's really good at measuring car speed between 20 miles an hour and 40 miles an hour. Below 20, it's not very good. Over 40, it's not very good because it just maxes out. It's like literally redlining. It's over at the top.

Arc AGI 2 is measuring cars from 40 miles an hour to 80 miles an hour. So below 40 miles an hour, you're not going to get much signal. You're going to get a little bit of stuff. So it's got to be those premium models. It's going to have to be premium models. And we're not yet seeing that models are making substantial progress on it. So I think that the best open source model right now I think is getting like 3% to 4%. I'm sorry, not best open source. Even when we tested O3 medium, it was getting like 3% to 4% on this.

And down at that range, we're only talking about 120 tasks on that. So like down at that range, we're talking about noise. It's not until it starts to get like 10, 15 that you're really going to start to see something substantial from it. Yeah. And how do you go about deciding these questions? Yeah. And then there's also the other side of it where you can't just have it be a super hard task. You have to almost, you have to be creative about it. That's not like, again, because we hold ourselves to the restriction that humans need to be able to do it.

And that restricts you from just doing hard, hard, hard, hard. Yeah, it can't be this PhD plus. It can't be the PhD plus. But as long as we can come up with those problems, that tells you there is a gap between human intelligence and AI. And like people argue with me and say, oh, you don't need to aim for human intelligence if you want to aim for AGI because they're two different things.

I agree. But like our hypothesis is that the fast track towards AGI is understanding how the human brain works and understanding where the gaps are. Cause if we aim for those gaps, that's going to tell us something interesting from there. Um, we can talk about this later, but like it's nowhere. The human brain is nowhere near, um, theoretically optimal intelligence. Like we got a lot of biological baggage. Yeah.

I could tell you that right now, man. My human brain is not working to full capacity ever. So by no means am I saying it's the best example, but it is our only example of general intelligence. And so we see it as a useful model to go after. Anyway, so how do we pick the problems? So in 2019, Francois Chollet came out with this paper called On the Measure of Intelligence, which is so fascinating because it's like, how do you come up with the problems? That's actually not the question to start with. The question to start with is, how do you define intelligence? Uh-huh.

Because if you can define it clearly, then you can come up with problems for it, which is the freaking fascinating part. So Francois came out with this paper on the measure of intelligence. And so his definition of intelligence was what is your ability to learn new things? It's not how good you are at chess. It's not how good you are at go. It's not how good you are at self-driving.

It's if I give you a new task and a new domain and new set of skills that you need to learn in order to do it, can you successfully learn that thing? Is it how fast you learn that? So now that's a great question. So my opening definition of intelligence is always just binary. Can you learn new things or can you not? But his actual definition of intelligence is your efficiency of learning new things. So just for example, I like to do efficiency in terms of two axes. Number one is the amount of energy required to learn new things. Yeah.

And we'll get into that in a second. But the second dimension is the amount of training data that you need to learn that new thing. So basically how many times you need to do it before you learned it? Exactly. So a crude, crude, crude example is if I'm going to teach you how to play Go, we might need like six hours. I'll teach you the rules and you'll become like basic at it. We can at least have a conversation around it. Think about how much training data went into the...

the system that ended up beating Go, right? And so, of course, that was better skill for it, but there was almost outsized training data that went into it. So another way to do this is: do humans have an internet's worth of training data in their head to output the intelligence that you see from us right now? And the answer is no, we don't. Language models do. And so on the recent podcast — it was the internal one with OpenAI, Sam Altman, and I believe the fellow's name is Daniel — he was talking about the efficiency of language models

and what is an LLM's efficiency of language versus a human's efficiency of language. And he said, by his estimate, I think this might be a little low, but he said that humans are 100,000 times more efficient with language than with current LLMs, which speaks to, and one of the underlying things that they kept on talking in the pod was that, look, compute isn't what's blocking us anymore. We have a shit ton of compute. Like,

We have a lot of compute — like Stargate, all the NVIDIA stuff. We have so much freaking compute. What's blocking us right now is more on the data side, but underlying all that, what's blocking us more is also on the algorithmic side. It's like we just literally need new —

We just need new algorithms, basically breakthroughs, in order to get to human levels of efficiency. Just a random point to really drive this home: the other reason why I love using the human brain as a benchmark is because, you know how much energy the human brain takes? Like literally calories. How many calories does the human brain consume? You convert calories into energy, and then you compare that to the inference energy used to solve ARC, and

like, you can already tell you're miles and miles ahead. So the human brain is a good benchmark for us.
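A rough, hedged illustration of the efficiency gap being described, comparing the order of magnitude of language data a person is exposed to against a modern LLM's training corpus, and the brain's energy budget. The specific numbers below are loose public ballpark estimates for illustration, not figures from the episode or from OpenAI.

```python
# Back-of-envelope on data efficiency: how much language a human "trains" on
# versus a large language model. All numbers are rough order-of-magnitude
# estimates for illustration only.

HUMAN_WORDS_HEARD = 3e8        # ~hundreds of millions of words by adulthood (ballpark)
LLM_TRAINING_TOKENS = 1.5e13   # ~10^13 tokens for a modern frontier model (ballpark)

ratio = LLM_TRAINING_TOKENS / HUMAN_WORDS_HEARD
print(f"LLM sees roughly {ratio:,.0f}x more language data than a person")
# -> on the order of 10^4-10^5, the same ballpark as the
#    "humans are ~100,000x more data-efficient with language" estimate quoted above.

# Energy axis (the "calories" point): the brain runs on roughly 20 W,
# i.e. about 20 * 24 / 1000 ~= 0.5 kWh per day -- far below the energy
# behind a large inference run, though exact GPU figures vary widely.
```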

Also, we should note — did this all just start you down this path from Needle in a Haystack? Was that what blew up? You know, um, Needle in a Haystack is a fun bullet point on my journey that I've been on so far. I wouldn't call it the thing that did it. It was cool, but it didn't make me rich. It didn't blow me up. It was a small little thing. You got a few retweets. I got a few retweets from it. I got a few likes on Twitter, but it wasn't much from that. No, but the inherent...

thing — like, whatever it is about what drives me, whatever it is about me that makes me put my energy where I do — Needle in a Haystack came out of that spot, other stuff comes out of that spot, and, like, you know, everything. And so I would say that

All the activities that happened were symptoms of where I choose to put my energy and consequences of it. And those consequences line themselves up to put myself on the path of where I am. And then it opens doors. It's like, hey, this happened. Because for those listeners also, they should know that you were doing

Amazing tutorials. I was doing YouTube work. Yeah, that's how I found you. Back in the day when you were doing the YouTube tutorials, you were like the first guy making LangChain tutorials. So that's another wild story. I'll be super brief on that. I remember the first LangChain — well, I was scrolling Hacker News, just trolling or whatever. And I saw — this was October '22. Right when ChatGPT came out. Right when ChatGPT, maybe even a hair before. Yeah.

And it said, Show HN: LangChain. So it was literally the launch blog post of LangChain. And I'm looking at this and I'm like, holy shit. This is solving a lot of the problems that I had building with the raw API at the time. Because keep in mind, at the time, there wasn't a chat model. It was just davinci-003. And so trying to work with that thing was obnoxious.

Like, you had to go — there was a lot of friction to get the value out of it. Anyway, LangChain helped out with that a little bit more. And I was like, this is so cool. And I had a previous history of doing pandas tutorials on YouTube that went nowhere. They freaking sucked. Like, we're talking me in my mom's basement, in my underwear, making pandas tutorials. At least you were on it. It wasn't exactly that, but it was along those lines. So —

So I'd made, I think — I don't want to, call it like 80 pandas tutorials or something like that. Because that was my craft. Data analysis was my craft at that time. That's what I pride myself on. And I went to YouTube and I typed in LangChain, and nothing — there was one tutorial, by a guy who I ended up getting to know a little bit later on, who's super awesome. His name is James Briggs. And there was one LangChain tutorial. And I had just one of those small little light bulb moments. And I was like, dude, Greg, you should do what you did for pandas.

But you should do it for Langchain. And all I did for Pandas was just see what I was curious in, go and make a bunch of tutorials and functions.

And so at the time, just based off riding my pandas kind of success, or ripples — and it wasn't success, I just mean whatever was coming from it — I was getting like three or four new YouTube subscribers a day. And I did my first LangChain tutorial, and I got 16 new subscribers after that one. I was like, that's 4X. Success. That's 4X of where I was. Anyway, I did number two, and that next day I got 25. And then I did number three, and that next day I got 50. And keep in mind, that's 10X what I was doing beforehand. I pulled my wife in the room. I'm like, holy shit, Eliza.

This is like, there's something here. Like, I've retold the story a few times, but there's a few times in life when you notice the ROI on the energy that you put out. Often in life, it's like you put out one unit of energy and you're getting like 20 percent back — it's really not much. Like, you might get some money, but you're not getting fulfillment, you know, blah blah blah. Yeah. At that moment in life, I was putting out one unit of energy and I was getting like two or three times back. Because you were getting energy. I was getting energy.

Like I couldn't sleep. I was just like, I've got to wake up — what am I doing today? What tutorial am I making today? Just upgraded my setup. Like, I was so freaking jazzed on it. Next thing I know, I met Harrison, did all this other stuff. And just through that, natural questions came around, like, how do you do better retrieval? All these business questions that I had beforehand — how do you do better on that? And one of them was Needle in the Haystack, which was — everybody was talking about long context. Oh,

Oh, it's longer, longer, longer, longer. And I'd seen some tweets that were like, yeah, but it's actually not that good at long context. And I was like, you guys are idiots. Let's just go and test this thing. Yeah, there's a process we can follow. I was like, remember, I'm a data dude, so that's my craft. And all I saw in my head was a heat map.

And I was like, the length? And then there was that whole question around whether the position of where your needle was had a factor in it. I was like, might as well throw in a two-by-two, because it's going to look pretty if nothing else. And so I ended up doing that, and that's where Needle in the Haystack came around.
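For readers who haven't seen it, here is a minimal sketch of the Needle in a Haystack setup being described: bury a known "needle" sentence at varying depths in contexts of varying length, ask the model to retrieve it, and collect the grid that becomes the heat map. The `ask_model` function and the specific needle text are placeholders, not the original harness.

```python
# Minimal sketch of a Needle-in-a-Haystack style test:
# vary (context length, needle position), check retrieval, collect a 2D grid.

NEEDLE = "The best thing to do in San Francisco is eat a sandwich in Dolores Park."
QUESTION = "What is the best thing to do in San Francisco?"
FILLER = "The quick brown fox jumps over the lazy dog. "   # stand-in for long filler text

def build_context(total_words: int, depth_pct: float) -> str:
    """Bury the needle at depth_pct (0=start, 1=end) inside ~total_words of filler."""
    filler_words = (FILLER * (total_words // 9 + 1)).split()[:total_words]
    insert_at = int(len(filler_words) * depth_pct)
    return " ".join(filler_words[:insert_at] + [NEEDLE] + filler_words[insert_at:])

def ask_model(context: str, question: str) -> str:
    """Placeholder: call your LLM of choice with (context + question) here."""
    raise NotImplementedError

def run_grid(lengths=(1_000, 10_000, 100_000), depths=(0.0, 0.25, 0.5, 0.75, 1.0)):
    results = {}
    for n in lengths:
        for d in depths:
            answer = ask_model(build_context(n, d), QUESTION)
            results[(n, d)] = "dolores park" in answer.lower()   # crude pass/fail check
    return results   # plot as a heat map: context length x needle depth
```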

That's so wild, man. So now — we were talking before about the reasoning models, and just this test time compute. And you have thoughts. Here's the indisputable fact: you spend more money at inference time, you get better performance. The open questions are — and this is where people argue with me, but I still believe it's open — for top frontier models, does it asymptote sub-100%, or can you get to 100%?

I think there's too much money that you need to go figure out to go try to answer that question. That's a big one. So it's so high risk that why even try? Well, not high risk. The cost is guaranteed. You're going to spend a shit ton of money. What you get returned, TBD, on where it is. It's not worth it right now. But here's the other thing. It's like I harp a lot on AGI and a lot of that stuff. We have really, really useful, economically useful models right now without having AGI.

That's cool. Like, that's great. I love it. That's value to the world. I'm a capitalist at heart. Like, I want good tools to be used for the good of humanity. LLMs, O3, O4 mini, all that stuff are great tools that are going to bring us really, really good progress. The AGI conversation is a separate conversation. And that's more of a theoretical, philosophical, scientific one around, well, what is AGI? How do you actually define it? And how are we going for that? Yeah, what is intelligence? What is intelligence? And what's wild, what blows my mind is, you're freaking getting me going.

man. I'm already ranting. What's wild, man, is that we don't have a formal definition of intelligence that the community relies on. For something as hot topic as AGI and what we have right now, it's making me wonder if it can be formally defined, if it hasn't already beforehand. There's a few stories I can tell in my head. One is it can't be.

But that also takes a very humans are really smart approach. And we've seen many times over and over again that like humans are not as smart as we think we are. So the alternative is that maybe we just don't have a sensitive enough understanding about like the actual tools about what we need for it. But then the other story is the other story that you play in your head is like, yeah, it can be. We just don't. We just don't know. Potentially. And we're never going to know.

Uh, potentially. Um, yeah. And then there's a whole different subclass of intelligence, which is human-relevant intelligence. So there's a certain class of intelligence that you need to survive on Earth right here. That's what humans have. That's what we have been built up to. But if you really expand out — and this is where we get more philosophical — in the grand scheme of things, the Earth is a pretty small piece, right? So if you're talking about universal intelligence, and talking about theoretical intelligence,

Let's not go here, but I'll just light the match. Oh, here we go. People are going to think I'm going over the edge on this, but if you jump into, like, simulation theory: what's the intelligence that governs the type of thing that would make our own world? I guarantee it's not human-relevant intelligence, and there's a theoretical optimum that we're not even going to touch. But that's the other thing too: you've got to walk before you run. We're going to start with human intelligence first anyway.

Reasoning models. I mean, they're great. I mean, you can scale them, throw more money at them, get better performance. They take longer, thinking for longer. There's big open questions on how the reasoning models actually work. And so one simple way to do it is the very first reasoning model that people ever came up with was they told the model, please think out loud first, and then give me your answer. That ended up doing better performance. Crazy, right? And then what you go do is you go train on processes like that for much, much longer.

And that's another way to scale these things up. You say, think for longer, think for longer — wait, reflect, a reflection step, you know — and you say, keep on going. Another method to scale these things up is you say: all right, I want 10 of you to think out loud. And then I'm going to see what all 10 of you respond, and I'm going to pick the best answer that comes from there. There's even further ways to do it, which is like: I want you to think of

the first step in your process. Okay, now what are 10 potential steps that would come after that first step? All right, I'm going to pick the best one of those 10. Okay, now I'm on step two. Think up 10 potential step threes. I'm going to pick the best one and then boom, boom, boom and go all the way down. There's always latency and cost trade-offs that come with those things. But either way, it's undeniable the performance you're getting from these things and how good they are.
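A minimal sketch of the "ask 10 of you to think out loud and pick the best answer" idea just described — best-of-N sampling over chain-of-thought generations. `generate` and `score` are placeholders for a real model call and a real verifier or reward model; the demo stubs only show the control flow.

```python
# Best-of-N sampling: draw several think-out-loud samples, keep the best one.
# generate() and score() are placeholders -- swap in a real model call and a
# real verifier / reward model.

import random
from typing import Callable, List, Tuple

def best_of_n(prompt: str,
              generate: Callable[[str], str],
              score: Callable[[str, str], float],
              n: int = 10) -> Tuple[str, float]:
    """Sample n candidate answers and return the highest-scoring one."""
    candidates: List[Tuple[str, float]] = []
    for _ in range(n):
        answer = generate(prompt + "\n\nThink out loud first, then answer.")
        candidates.append((answer, score(prompt, answer)))
    return max(candidates, key=lambda pair: pair[1])

if __name__ == "__main__":
    demo_generate = lambda p: f"answer-{random.randint(0, 99)}"   # stub model
    demo_score = lambda p, a: random.random()                     # stub reward model
    best_answer, best_score = best_of_n("What is 17 * 23?", demo_generate, demo_score)
    print(best_answer, round(best_score, 3))
```

The stepwise variant mentioned next (pick the best of 10 candidate step twos, then step threes, and so on) is the same idea applied at every step of the chain instead of once at the end, which is where the latency and cost trade-offs come from.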

Even through vibe anecdotes and even through ARC-AGI performance. So they're very impressive. Yeah. So it's almost subjective and objective. Don't get me started, dude. I mean, this is another one of my things. ARC-AGI is a verifiable domain — you can just go check whether the answer is right, right? What blows my mind is, there's no right answer for how good a summary is, right? There's no right answer for how good the notes are when an AI takes notes on your call and then goes and puts them into Salesforce. Like, how good are the notes? Right. Um,

Yeah, and how good are they to whom? Well, so that's the whole point: you have to keep in mind, what is the background engine — what is the eval engine with which you're evaluating these things? With ARC, it's an equality check. We can tell whether we have the right answer or not, right? Much of what drives the economy and drives humans and everything — the eval engine is human preference. Well, that's what I was going to say. With ARC, don't you find that

The answers can be subjective. So if you're just looking at whether or not the task is correct, yes. If you're looking at claims like "this is human solvable" or whatever, then it's a lot more subjective. There's a lot more subjectivity that comes from there. But in terms of eval engines, I have a priority order of my favorite eval engines. Number one is going to be physics.

And what I mean by that is I think that the coolest thing that we could have AI do for us is discover new knowledge about reality, basically about physics. So you think about what is the right answer? Well, it's what does the scientific process say about physics as the eval engine? That's so freaking cool. There's no umbrella that encompasses physics. Physics is what we're in, right? And so I think that's number one that's super cool. Number two, capitalism. And so you think about...

capitalism is a human construct of a set of rules that we all play by. It's a system, and there's laws, and there's how we choose to do things. Running a business is an experiment playing in that world, right? And so it's almost like capitalism is the eval engine, and I'm going to go try to make a whole bunch of money, but you've got to do it within the rules, right? And so there's certain things you need to go for, so I think capitalism is a really interesting eval engine. And then human preference after that, which is like, how good is this summary? But the wild thing about human preference is there's no way to...

like at scale quantify that, which is really tough. Which is why when you do RLHF, you got to go spin up like data, not data centers, but like huge, huge conference rooms of hundreds and thousands of people giving you preference optimizations on which one's better, right? That's how you do it, which that's crazy, but that's what it takes in order to do these things. So go back real fast to this capitalism one or even the physics one, because in a way we are assuming that

what is happening to us as humans is discoverable or is the engine, the eval engine. But potentially it's not. It's just us as humans. Yeah. So a big caveat with that is the way that I think about it is if it's true, what we see is what we get. Like if it's true that like reality appears to be what it is. And I know there's going to be like, even as you start to delve into like the quantum stuff and we don't know what's on the multi-world side, like we don't know what's on that other side.

So pending something surprising coming out of there, which I would love because it's like I want the truth. And if that's the truth, then freaking so be it. That's freaking awesome. Pending all that, assuming what you see is what you get, then I think what I say still holds on. Like I still think that the reality is if there's some unexplainable thing that like it's just out of our reach to go do it.

I like answers less that we don't have an explanation for, at least. But I'm not ruling it out. I'm saying, yes, that is a caveat. I'm operating looking this way for it, though. When I think about it, it's like there's something beyond our understanding. Potentially, that is what we are going to get helped to understand. AI can help us understand it, but it's going to be outside of...

what we are looking at. Just like when you have the chess move that is played and then later it's like, oh, yeah, of course. I never would have thought of that or it would have taken us decades to figure that out. Now we get to see. But that's a wild one. I'm with you, man. And humans are...

very poor at forecasting the unknown unknowns. And right now that's all unknown unknowns. And countless examples, go and ask somebody about something in the 1800s, what would today be like? They have no freaking idea. They just had no idea what came from it. So that will happen to us, like whatever happens. And even with how accelerated these timelines people talk about, I mean, even call it 10 years from now. You know what? I had a great dinner two nights ago with a friend and he was saying, I had as a thought experiment,

to come up with headlines for what 2030 would be saying in different magazines. So he was saying, I created one headline for Wired, and it was that teen 3D prints microchip in their basement type thing. So that was one. And then another one was, and this is a complete tangent, but it's trying to think forward on like, oh, what could be possible? He was saying,

Data Center on the Moon opened or second Data Center on the Moon is opened by the US. Sure. So you're like, well, maybe that's not too far off. Yeah. Both those seem tractable to me because the path to do those is you could lay that out. It's straightforward. If you said something that didn't have a clear, obvious lineage to get towards that, then I would start to think about it a little bit more. But yeah, I'm of the David Deutsch philosophy that all problems are solvable.

And that's the argument for optimism is if you believe all problems are solvable, then there's nothing out there that like should really worry you that much because you can go figure it out. Go do it all. After he told me that, I was trying to think, what would my headline be for 2030? Where would I go with that? So it's only five years from now? Yeah. Four years and three quarters? If we're going to be specific. Yeah. I mean, you kind of got to be with these things. It's like...

That quarter could be a big difference. I've been going really deep on there's a big conversation around intelligence explosion.

30% GDP growth rates and all that. And one of the criticisms I have with some of the more outlandish ideas is that they're not as tactical and they're not as concrete as I really wish that some of these projections were. So getting concrete, saying four years and three quarters, it's like, well, damn, OpenAI just came out with o4-mini this past week. When is o4 coming out? When is o4 Pro coming out? Could it be like at the beginning of 2026? If so, you only got three years left.

With those types of things to go for it. And so concretely, how is the GDP going to grow 30%? How is that data center going to get up to Mars? How many launch windows are there left? Or even up to the Moon, or whatever it may be. So what I'm thinking about — one thing that's caught my attention, that I've been nerding out on a little bit, is: so Elon wants to go, he has a Mars window that he wants to shoot for. Humans are not going to be the first

Yeah, why would we? Why would you? We already have the Mars rover. We have the Mars rover. And so humans aren't going to be the first one. So that means they're going to send Optimus up there. Could you, are we going to have AGI on Earth before that window? If so, then you pretty much have AGI on Optimus because like you just go send a bunch of commands. And so next thing you know, I feel a little bit like almost insecure, but then I need to remind myself not to be so emotional. But it's like, damn, humans weren't the first on Mars. Yeah.

We missed that one. You know? I mean, sort of. It sounds so lame to think about. But that was my first reaction. I was like, damn, there's going to be this robot that is intelligent, that's its own being, but it's not a human. And then I think, it's like, damn, am I just speciesist? And I just love the human race so much. And now I need to open my eyes. We needed to claim this. I wanted to plant that US flag on Mars. You know...

I don't know. Even if it's just like humanity's flag or whatever it may be. But if you think about it that way, there's already been the Mars rover. So how is it different than the Mars rover? And that's where my biological baggage is bringing me down. Just because it's like a humanoid shape? I think it's less the humanoid shape for me and it's more just...

a generally intelligent being that can do its own thing. That's artificial. It doesn't need to be. But isn't the Mars rover, the Mars rover's not being controlled. I think it is. I think it is. Don't they send it instructions and tell it what to go do? Huh? That's a good question. I should figure that out. Because it moves pretty slowly. I thought it like waits for the instructions.

We got to fact check that one. It's hilarious. It's like, what next? And you're like, three minutes later or four minutes later, okay, turn right or pick up the rock or whatever. I mean, I don't think it's that far off. I'm pretty sure it's like that. That's funny. Yeah, I thought it was a bit more autonomous. Or maybe they send three or four instructions at once. And if it fails, then resend them or figure out where we're at now. Yeah, something like that.

Somebody will have to give us that one because that is hilarious. What else have you been thinking about? Yeah, well, in terms of headlines, I still haven't given you a headline I'm thinking about here. So headline 2030, Wired says it. I don't think it's outside of the question that there could be a headline that says humans are no longer able to come up with questions that AI can't answer, which isn't that sensationalist? It's kind of muted from a sensation standpoint. But if you use the observational definition of AGI,

What other problems are there, right? But I still wonder if there's a world where you have run out of questions, but you're still not seeing it. Where every once in a while, you'll find that question again.

It's not that you can find 100 of them, but there's still those stupid questions where it is like the strawberry or the 9.11. Yeah. And here's the thing. I don't want to give the viewer the impression that I'm relying on this as a formal definition. I think it's a pretty good working definition for sure. It's easy to communicate and it's easy for us to go against. I think we'll come up with a formal definition. But to your point, how often do you ask a human a question and it's like, what were you thinking? Yeah.

You know what I mean? So efficiency is such a big piece of this here. It's like that Will Smith iRobot meme that keeps on going around. It's like you ask me a question, can you? So then I could see that, though. Yeah, we can't come up with more questions. Or we have to have AI come up with the questions that it can't answer. Potentially. And that's a whole other, I think that's a very underexplored

There's talk about using AI to help build AI, to help align AI, to help test it, all that other stuff. And that will happen — because, again, definitions are important. But look at all the people using Cursor right now to go build AI models. Is that using AI to help you build AI? It's like, yeah, it is. So it just depends on how directly you want to use AI for it. But yeah — so, what we're thinking about for ARC-AGI-3,

because we're coming out with ARC-AGI-2, and it's going to get beaten one day, right? We know that ARC-AGI-2 could be brute-forced. So if you give it a data center's worth of compute and energy and time — like a month's worth of a data center — yeah, go brute-force it, literally try all random permutations using one of the DSLs to try to solve ARC, and you're going to figure it out. But that's why efficiency is a big piece of this: the energy and the cash that you needed to go do that doesn't make us interested, because it's a verifiable domain.
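A toy illustration of what "brute-forcing ARC with a DSL" means: enumerate compositions of grid primitives and keep any program that reproduces all of a task's training pairs. The primitives and task format here are simplified stand-ins, not an actual ARC solver or the DSLs used by Kaggle competitors.

```python
# Toy brute-force program search over a tiny grid DSL (illustrative only).
# Real ARC brute-forcers use far richer DSLs and heavy pruning.

from itertools import product

Grid = list  # list[list[int]]

def identity(g: Grid) -> Grid: return [row[:] for row in g]
def rot90(g: Grid) -> Grid:    return [list(r) for r in zip(*g[::-1])]
def flip_h(g: Grid) -> Grid:   return [row[::-1] for row in g]
def recolor(g: Grid) -> Grid:  return [[0 if c == 0 else 2 for c in row] for row in g]

PRIMITIVES = [identity, rot90, flip_h, recolor]

def search(train_pairs, max_depth=3):
    """Try every composition of primitives up to max_depth and return one that
    maps every training input to its training output."""
    for depth in range(1, max_depth + 1):
        for program in product(PRIMITIVES, repeat=depth):
            def run(g, prog=program):
                for step in prog:
                    g = step(g)
                return g
            if all(run(x) == y for x, y in train_pairs):
                return program
    return None

# Example: a training pair whose hidden rule is "rotate 90 degrees".
train = [([[1, 0], [0, 0]], rot90([[1, 0], [0, 0]]))]
found = search(train)
print([f.__name__ for f in found] if found else "no program found")
```

With a big enough DSL and enough compute you will eventually hit a program that fits, which is exactly the kind of win that the efficiency framing is meant to discount.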

Do you consider ARC-AGI-1 beaten because of that 87%? Like, is 87% a pass grade? That's like a B-plus. For a long time, we talked about 85% being basically the human threshold on ARC-AGI-1. I think that, much like

the battle of MMLU, where people were like, we got 88.8 — well, we got 88.9 — well, we got 90.1 — at that point, you're redlining what your signal is actually telling you, and you're actually losing signal; you get diminishing returns on the signal that comes from it. So I think ARC-AGI-1, for anything between like 5%

to probably 90%, I think it gives you really good signal on where something is. Anything outside those bounds isn't giving you a ton of signal. So I think it's still a really useful tool today.

It will eventually go out of vogue, though, once models get so good at it. It's getting closer and closer. So it's almost like you see the end of this lifespan. And here's the deal. There isn't one benchmark to rule them all. Even if you wanted to understand a model's capabilities, you need a portfolio. Not only that, but look how many benchmarks had their place and then were phased out because they did their job.

Like, just for example, look at ImageNet, what happened with that. 2012, a big data set of images. That had a huge impact on the industry, and it did its job. Would anybody go and do... They don't report on ImageNet today, and that's okay. They have other types of benchmarks where they need to go deeper into image and vision capabilities in order to do a better job of it. So that's where ARC-AGI-2 sits. But like I said, we're not seeing meaningful performance on it yet to give us a ton of signal. You could brute-force it, if you wanted, but...

then you're kind of defeating the purpose. Totally. And so even though you can do it, should you? So we run a Kaggle competition to try to beat ARC-AGI-2.

The incentives there are to beat it at any means necessary because we have money on the line. And within the competition rules, there's no type of solution requirements. So people brute force the crap out of that all the time. Like that's the, that's a whole other part of my life. It's like, if anyone was talking to me about benchmarks, great. I love, I can talk about all day long. If anybody wants to talk to me about running an AI competition,

Talk to me about it all the time. We did it all last year. We put a million dollars up for anybody who could beat ARC-AGI on Kaggle, and nobody was able to. But we saw leaderboard probing. We saw people getting around the rules. We saw where the assumptions we made about participants' incentives were not in line with their actual incentives. Wait, how so? Yeah. Basically, if you wanted to win prize money, you needed to open source your solution.

We thought that the money, being a monetary incentive, would be enough to make people open source their solution. There was one group out there who had a really strong solution, really, really awesome. And they made the choice. I'm not exactly sure to the exact reason. It was one of two. It was either we think that we have a better chance at not open sourcing our solution and competing next year for the grand prize.

to do really good at it. And so they wanted the $700,000 instead of just the yearly $100,000. Or it was because it was so close to their startup's proprietary information that they didn't want to open source it. Both of which are not necessarily in spirit of what we were aiming for as a competition, but we did not properly construct the incentives enough or communicate early enough that this was an issue. And so...

We basically did what we could, which is we took them off the leaderboard because you're not placing if you don't open source. And then this year we made a lot more, we're being much more clear about our intentions with this. How are you aligning incentives now? Yeah.

through better communication. And then not only that, this year we have a public and private leaderboard. So the leaderboard that's seen right now is all just based off of public data. But the final leaderboard that says whether or not you've even placed or done well is all in hidden data. And if you want to get your private score, you need to open source. You need to open source. Okay, yeah. So we're hoping that that does it. Either way, we're not talking about hundreds of thousands of teams here. We're talking about maybe 10 teams that are in the running.

I can go and have conversations with each one of those 10s and make sure that they're seeing it the same way. And also the gaming of the leaderboard. You saw that? Yeah. I mean, so people get creative because like... Money's on the line. Money's on the line and Kagglers are professional competition people. They're really good at data science stuff and they're really good at playing competitions. And so one thing that I saw is that

They will try to suss out attributes about ARC tasks one at a time. And what they'll do is they'll put in a wait statement in their script that says, if you see this current task attribute, wait 50 seconds. And then when they submit their solution to Kaggle, they say, did it run instantly or did it wait 50 seconds? And that's a way you can tease out some more information about it. Because the only other information you get is a score. You get a single integer, which is your score out the other end. You can't really tell that much information from that.
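A sketch of the timing-based leaderboard probing just described: a submission that stalls when it detects some attribute of the hidden tasks, so the wall-clock time of the scored run leaks one bit of information. The attribute check and solver are hypothetical, and this is shown to explain the technique, not to endorse it.

```python
# Illustration of the timing side channel described above: leak one bit about
# the hidden test set by sleeping when a chosen attribute is present.

import time

def has_attribute(task) -> bool:
    """Some yes/no question about a hidden task, e.g. 'is the grid larger than 20x20?'"""
    grid = task["input"]
    return len(grid) > 20 or len(grid[0]) > 20

def solve(task):
    ...  # the actual (possibly mediocre) solver goes here

def submission(hidden_tasks):
    if any(has_attribute(t) for t in hidden_tasks):
        time.sleep(50)          # runtime >= 50s  =>  "yes"
    return [solve(t) for t in hidden_tasks]

# The submitter then looks at how long the scored run took: an instant run
# means "no", a ~50-second stall means "yes" -- one bit per submission.
```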

Kaggle tries to prevent that a little bit more with some obfuscation about how long it actually took, but people get creative like that. And now with ARC Prize 2, do you have to create a variety of

different tasks, or is it very much, all right, we're in this one field, trying to do it well? So, you brought up a question earlier which was good and I didn't answer fully, which is: does it take a lot of research and deep thinking to build these things? I would say, for Francois's paper in 2019, it took a lot of work to put that hypothesis together — that formal definition of intelligence. Out of that came: okay, using this definition of intelligence,

What would a problem look like that would actually go and test these things? And that's where the arc paradigm came in. So, um,

What it is, is basically you have an input and you have an output grid, and it looks like a checkerboard. And you see, okay, the input turns into the output some way. I need to figure out how you transform the input into the output. You get a few examples, and then you get a test. And on that test, you only have the input. And your goal is, you have to go cell by cell and type out what the output would be. The important part is that each separate problem on ARC-AGI requires a different rule,

or a different transformation, to actually solve. So what I mean by that is, let's say... Super variety. Super variety. And it's almost like a meta thing — I'll get to why this is important in a second. The reason why this is important: let's just say one ARC task has a square on it. And on the input-output pairs, all you're doing is adding a border to the square. Okay. Now on the test input, we're going to give you a square, and on the output, you just need to put a border. Okay, cool. That border transformation rule will only be asked once.
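To make the grid format concrete, here is a tiny mock-up of the "add a border" task just described, in the train/test shape ARC tasks use. The specific grids and the helper function are invented for illustration; they are not an actual ARC task.

```python
# A toy ARC-style task: infer the rule from the train pairs (here: draw a
# border of color 3 around the grid), then apply it to the test input.

def add_border(grid, color=3):
    """The hidden transformation for this particular task."""
    width = len(grid[0]) + 2
    top_bottom = [color] * width
    return [top_bottom] + [[color] + row + [color] for row in grid] + [top_bottom]

task = {
    "train": [
        {"input": [[1]],          "output": add_border([[1]])},
        {"input": [[2, 2],
                   [2, 2]],       "output": add_border([[2, 2], [2, 2]])},
    ],
    "test": [
        {"input": [[5, 0],
                   [0, 5]]},      # solver must produce add_border([[5, 0], [0, 5]])
    ],
}

# A solver never sees add_border(); it has to induce the rule from "train"
# and then produce the test output cell by cell. The next ARC task needs a
# completely different rule, which is what forces on-the-fly skill learning.
```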

On another task, what we might ask you to do is fill in the corner of every single shape. You go fill in the corners of all those different shapes. And so what we're forcing the tester to do is learn a new mini-skill in each one of those questions. And then we're forcing you to demonstrate that you've learned that skill on the test. By doing it. By doing it — which goes back to Francois's definition of intelligence, which is learning new skills. That is...

So simple for us, right? Like, write a border on a square — but it is exactly that. And the reason why it's so hard for machines is because humans are very good at abstraction and reasoning. It's like, oh, duh, just put a border. Okay, but that's actually really hard for AI to go do. Now, with ARC-AGI-1, people are like, oh, it's so simple, it's not a good test of AI. Well, keep in mind, for five years it was unbeaten, right? And it actually pinpoints the moment — right when models started to get good was the exact moment that reasoning models took off.

Okay, that's just something really interesting about reasoning models. And using Arc 1 as a capabilities assertion, you can actually tell something about reasoning models, that there's a non-zero level of fluid intelligence that actually comes from that, which is very cool. Arc 2 is a simple extension of the Arc 1 domain. We still have input-output.

We still ask you to do rules. The difference is that the rules are much deeper, and they require a bit more thought from a human perspective to do. So instead of just doing a border, we might ask you to do a border and do the corners. Or put an X in. Or put an X. And now there's two rules. I won't go into the details on it. We actually have a full — Francois put it together. We hosted a private preview of ARC-AGI-2 for donors to our prize, because we're a nonprofit — I should have said that earlier, nonprofit —

and he gave a 30-minute presentation on ARC-AGI-2. - Wow. - But I want to talk about ARC-AGI-3. - Of course. - ARC-AGI-3 is going to be departing from the ARC-AGI-1 and ARC-AGI-2 framework. - Style of doing it? - Style of doing it. So it's a very scoped and narrow domain if you just have matrices — input, output, you know, fill them in. It's very scoped. You don't have very many axes of freedom with that. So we are taking inspiration from simulations and games.

So back in 2017, DeepMind, they put together an exploration. They called it Agent 57. So they tried to get an agent, more or less, an RL agent, to go and try to beat a bunch of different Atari games. There's like four that they didn't solve, which is super fascinating.

What ARC-AGI-1 and -2 don't make you do is figure out what the goal is. They don't make you figure out the rules of the environment. They don't make you have long-term memory with hidden states — like, you learned something early on in the game and you have to remember that that thing still applies later on in the game. And so what I tell people is: if you can make an AI that beats one game — well, we've done that a bunch. We've made AI beat chess, we've made AI beat Go. Okay, cool. If you can make an AI that beats 50 games,

hmm, that's much more interesting. But the problem is that those 50 games are all public, and you can have developer intelligence and developer intuition as to how to go beat those 50 games. What ARC-AGI-3 is going to be is: we're going to make AI beat 50 games it has never seen beforehand, and they're each novel from each other.

And that is a much further extension and axes of freedom about where we're taking this. What you can assert about the model that beats it is it will have had no choice but to interact with its environment, learn the rules of the game in 50 different novel situations. But you're not letting it just simulate for hours and hours. Or maybe you are. That's the test time compute type thing. We will. And that's where efficiency comes into it.
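A rough sketch of what "efficiency" could look like in this games setting — comparing how many environment actions an agent needs against a human baseline. The env/agent interface, function names, and all numbers below are hypothetical; this is not ARC Prize's actual harness or metric.

```python
# Hypothetical harness for the "how many actions did it take?" comparison.
# env is assumed to expose reset/step like a typical RL environment.

from statistics import median

def count_actions(env, agent, max_actions=10_000) -> int:
    """Run one episode and return how many actions it took to solve it (or the cap)."""
    obs = env.reset()
    for n in range(1, max_actions + 1):
        obs, solved = env.step(agent.act(obs))
        if solved:
            return n
    return max_actions

def efficiency_ratio(agent_actions: int, human_action_counts: list[int]) -> float:
    """>1 means the agent needed more actions than the median human tester."""
    return agent_actions / median(human_action_counts)

# e.g. if human testers solved a game in 40, 55, and 70 actions and the agent
# needed 600, the ratio is 600 / 55 ~= 10.9x less action-efficient.
print(efficiency_ratio(600, [40, 55, 70]))
```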

And so what we're going to do is we're going to go test 400 humans on those 50 games, and we're going to see how many actions does it take for a human to actually solve this, and how many actions does it take for AI to go solve it. So that's where we get our efficiency that comes from it, in addition to cost and energy that comes from that. I think we got to go. I just saw the boss man. That's a great way of ending it, though. We'll cut it there. This is too good. You were awesome, man. This is great. You got me freaking going, dude. But it was perfect. It was like...