
Hey everyone, welcome back to another Latent Space lightning pod. This is Alessio, partner and CTO at Decibel. There's no Swyx today, but we've got a special co-host, Vibhu, who, if you're part of the Latent Space community on the Discord, you've definitely seen. Welcome, Vibhu, as a co-host for the first time. What's up, guys?

And then we have David Hershey from Anthropic today, who's the person behind Claude Plays Pokémon. It's funny, I think we first DMed about playing Magic: The Gathering together in a session. Of all the different nerd angles you can get me on. And then people were like, David is the person doing this. And I was like, okay, I'll DM him. And then, yeah, it was cool. We already had a touchpoint. So...

Welcome to the show. This is our second Anthropic episode; we had Erik Schluntz on before for the SWE-bench agent, so welcome. Thank you. Glad to be here. Excited to talk Pokémon. Yeah.

So let's give a little background on this. Sonnet 3.7 came out a couple of weeks ago. I don't know, time goes by so quickly. This week? I don't know, man. It feels like two weeks ago. And then you had this Claude Plays Pokémon thing that kind of went viral. If people remember, there used to be this thing called Twitch Plays Pokémon, where people could go on Twitch, type in the chat, and collectively decide the next action the emulator would take. What you've done instead is hand it to Claude and basically

have Claude figure out how to walk through it. I'm looking at it right now. So far it's been stuck in Mt. Moon for 52 hours. Poor guy. It's probably met

15,000 Zubats. So yeah, let's talk about what gave you the idea for it, kind of the origin story, and then we can go through the implementation. Totally. Yeah. So I actually started working on it in June of last year for the first time. And for me, so I work with customers at Anthropic, and I just really wanted to have some way for myself to experiment with agents in a real way, some framework, some harness where I could actually just

go to town and try different things and see what actually worked to get Claude to do pretty long-running tasks in general. So I had that in one hand, and then I was like, okay, what is the thing that will make me the most addicted to making this work? How will I grind the hardest actually trying this? And Pokémon was a pretty clear answer. Someone else at Anthropic had actually tried once to hook it up, so I had a little bit of the shell

of what I needed to actually put it together and kick off what became a bit of an obsession in the coming months. So yeah, I played with it in June, just trying things out. Sonnet 3.5 came out in June of last year, which is when I started kicking it around. It wasn't very good; you could see kind of signs of life, but not much really happened. And then ever since, as we've released new models, it's sort of been the way that I get to know one of our new models a little bit. So we released

the new version of Sonnet 3.5 in October, and I used this to really see what it was better at. And it got better. You could see it start to, like, it could get out of the house somewhat reliably, which was not always true before. And it got a starter, and it even named it sometimes. It was doing stuff. Not great, but it

could move. Along the way, too, we have a Claude Plays Pokémon Slack channel, so I'm sort of just giving people updates. Over time, as I'm posting GIFs and progress updates, this is slowly growing in popularity, a bit of a cult following internally of people who are somewhat interested. But then, you know, a couple weeks ago I was bashing on an early version of

Sonnet 3.7, and you could just tell it was a little different. It's clearly still not great, as you said at the top; it's in Mt. Moon for its 50-something-th hour. This is a little bit worse than average from what I've seen so far, but it's about on brand. It doesn't really have a great sense of direction, it's pretty bad at seeing the screen, stuff like that. But

it plays the game, you know? It gets Pokémon, it catches Pokémon, it caught its first Pokémon, it got out of the early areas for the first time, a whole bunch of stuff happened for the first time where you could squint and see a thing playing the game. And yeah, posting updates internally, it was obviously very fun. People were kind of going wild at the fact that this was actually happening, finally.

And it was entertaining enough that I could kind of see it catching on. And on the other side, we finally got a sense that this was an actually useful way to measure what was going on with this model. You know what I mean? Like,

it's one thing that it's fun to follow along, but internally, I think we got more of a sense that you could actually use this as a bit of a measuring stick for what's going on in the model. I've spent, you don't want to know how many hours, staring at Claude playing Pokémon. I have to have seen and read millions of words that Claude has generated in the course of playing Pokémon over the last eight months. So...

You can kind of get a feel for what's actually going better, what it's getting better at, that kind of thing. And with this particular release, I think the fact that it got this much better at this reflects a lot of things that we wanted to be true about the model to begin with.

And those sort of lined up, like, okay, maybe this is an interesting way to actually tell people what's going on here for a crowd that maybe doesn't know quite as much about software engineering and all the other ways we've told people about agents in the past. Yeah. Were there any other games that you considered? To me, it seems like Pokémon is good because it's, you know,

isometric, kind of flat, so you can see everything, and it doesn't have too many hidden facts about objects; everything is kind of described. Did you consider anything else, or...

was Pokémon just kind of by far and away the first choice? I didn't, but that's mainly because Pokémon was the first game I ever got as a kid. This is purely coming out of my own nostalgia. But also Twitch Plays Pokémon was something that I cared a lot about, well, a decade ago or whatever that was. At least it's not a decade ago. I think it's actually a decade ago. I'm sorry. Yeah, painfully.

Eleven years ago, actually. Yeah. February 2014. Yeah, it's nuts. Okay. And Pokémon Red is, what, 20 years ago? Oh my God. 25 at least. So yeah, for me, it was that. Since then, a lot of people have pitched other games, like, oh, we could do this, we could do this, we could do this.

I think there are a lot of fun things you could do. Pokémon is actually really nice because if you don't do anything for five seconds, there's typically no consequence, which, given you're doing inference on a model at each snapshot in time, makes it a pretty good game to do this with. But yeah.

But yeah, it was mostly just my love for Pokémon coming through here. You put together a very nice architecture diagram. Do you want to screen share that so people on YouTube can follow along? We'll put it in the show notes if you're just listening. I know that Vibhu had a bunch of questions on that too. Yeah, let's do it. Very, very straightforward questions. Basically, can we just double-click into all of it? Yeah, yeah, yeah. It's easy. Okay.

I found it off Twitch and no one was talking about it, so I started sharing it around. I lost the original source, but basically everything in here is pure gold. The memory is a little interesting, but yeah, if you want to just go through it at a high level. Yeah, you got it. I want to preface this: I do not claim this is the world's most incredible agent harness. In fact,

I have explicitly tried not to hyper-engineer this to be the best harness that exists to beat Pokémon. I think it would be trivial to build a better computer program to beat Pokémon with Claude in the loop. This is meant to be some combination of understanding what Claude's good at, and benchmarking and understanding Claude, alongside a simple agent harness. What that boils down to is a pretty straightforward tool-using agent, from my perspective, is how I would frame it.

So at the end of the day, the core loop is just a conversation that rolls out. Essentially, you build the prompt, including everything we've had up till now, you call the model, it sends back some tool use, and you resolve those tools. Then, and I'll talk about summarization, there are basically a few different mechanisms to maintain the information you need to do something long-running inside the context window.
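To make the shape of that loop concrete, here is a minimal sketch of a tool-use loop against the Anthropic Messages API, not the actual harness; the tool dispatcher, the model alias, and the message-shuffling details are assumptions for illustration.

```python
# Minimal sketch of the core loop described above: build the prompt, call the
# model, resolve any tool calls, repeat. Tool names and the resolve_tool()
# helper are illustrative assumptions, not the real harness.
import anthropic

client = anthropic.Anthropic()

def resolve_tool(name: str, args: dict) -> str:
    """Hypothetical dispatcher: press buttons on the emulator, update the
    knowledge base, or run the navigator, then return a result for the model."""
    raise NotImplementedError

def agent_step(system_prompt: str, tools: list, messages: list) -> list:
    response = client.messages.create(
        model="claude-3-7-sonnet-latest",  # model alias is an assumption
        max_tokens=4096,
        system=system_prompt,
        tools=tools,
        messages=messages,
    )
    # Keep the assistant turn (text + tool_use blocks) in the history.
    messages.append({"role": "assistant", "content": response.content})
    if response.stop_reason == "tool_use":
        results = []
        for block in response.content:
            if block.type == "tool_use":
                results.append({
                    "type": "tool_result",
                    "tool_use_id": block.id,
                    "content": resolve_tool(block.name, block.input),
                })
        # Tool results go back to the model as the next user message.
        messages.append({"role": "user", "content": results})
    return messages
```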

So when you think about what an actual prompt looks like, it rolls out kind of like this. You've got tool definitions, which describe three tools that I'll get to in a second. A short system prompt, which is pretty boring; it basically tells the model how to use the tools. There are about six facts about Pokémon that I give it, and a few corrective things for behaviors I've seen it do really horribly wrong, like, hey, you might want to consider doing this a little bit better. But there's really not a lot of system prompting going on.

We have that knowledge base, which I referred to and will talk about. This is the main way it stores long-term concepts and memories as it's operating over time. And then the bulk of things is this conversation history, which is a chain of tool use. There are no user interjections at all, for the most part. So it's like, go, and then the model uses a tool, and then it gets a result back, and then it uses another tool, and it gets a result back. So pretty straightforward. Feel free to cut me off, too, if you've got questions along the way, but otherwise, I'm going to keep rocking.

Yeah, yeah, go ahead. Cool. Okay, so most of the money of this is just in the tools themselves. When you think about what's going on, it can press buttons and it can mess with its knowledge base, and that's about it. I'll talk about the navigator separately, because that's a patch for how it actually deals with some of its vision deficiencies. Using the emulator is basically: execute a sequence of button presses. It'll say press A, B, left, right, whatever. It gets back a screenshot

and a screenshot overlaid with coordinates of the game. Those coordinates are used for the navigator tool that I'll describe in a second, but it's basically there to help Claude get a slightly better spatial sense of what's going on on a Game Boy screen. I've been through it a lot. Sorry, does that come with the emulator, or are you adding those in? I add that in.
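For reference, the three tools just described, button presses, knowledge-base edits, and the navigator, might be declared to the Messages API roughly like this; the names and schemas are guesses for illustration, not the real definitions.

```python
# Rough guess at how the three tools described above could be declared for the
# Messages API; names and schemas are illustrative, not the actual ones.
tools = [
    {
        "name": "use_emulator",
        "description": "Press a sequence of Game Boy buttons and get back a screenshot.",
        "input_schema": {
            "type": "object",
            "properties": {
                "buttons": {
                    "type": "array",
                    "items": {
                        "type": "string",
                        "enum": ["a", "b", "start", "select", "up", "down", "left", "right"],
                    },
                    "description": "Buttons to press, in order.",
                }
            },
            "required": ["buttons"],
        },
    },
    {
        "name": "update_knowledge_base",
        "description": "Add, edit, or delete an entry in the long-term knowledge base.",
        "input_schema": {
            "type": "object",
            "properties": {
                "key": {"type": "string"},
                "value": {"type": "string", "description": "Omit to delete the entry."},
            },
            "required": ["key"],
        },
    },
    {
        "name": "navigator",
        "description": "Walk to a coordinate that is visible on the current screen.",
        "input_schema": {
            "type": "object",
            "properties": {
                "row": {"type": "integer"},
                "col": {"type": "integer"},
            },
            "required": ["row", "col"],
        },
    },
]
```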

I have somewhat extensively reverse-engineered Pokémon Red by this point to extract roughly every bit of possible information from it. I don't use most of it, but essentially everything you could know about the current state of the game, I have exposed programmatically to be able to tinker with at this point. I was just reading this diagram, like, yep, you just get what spaces are walkable based on what's stored in RAM, and I'm like, oh, you definitely reverse-engineered this. Yeah. Good news is we also released Claude Code this week, if you saw that.

And none of this would have been possible without having Claude also go figure out how to do all of this for me. I could have done it, but there's a lot of tedious, here-are-addresses-in-memory, map-that-to-a-Python-program work that I had no interest in doing. So thank goodness for Claude Code. So yeah, it gets these two screenshots, and it gets a small blurb of state, which I read straight from the game. There's a lot you could put in there, but actually, funny enough, the thing that matters is location.
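As one illustration of reading state straight from the game: the player's location lives at fixed addresses in Pokémon Red's RAM. A sketch like the one below, assuming the emulator exposes a read_byte(address) call, is enough to build that small blurb of state; the addresses follow the community RAM map for the game, and the wording of the blurb is invented.

```python
# Sketch of pulling location state straight from the game's RAM, assuming the
# emulator exposes a read_byte(address) call. Addresses follow the community
# RAM map for Pokémon Red (wCurMap / wYCoord / wXCoord) and are illustrative.
from typing import Callable

W_CUR_MAP = 0xD35E   # current map id
W_Y_COORD = 0xD361   # player Y tile coordinate
W_X_COORD = 0xD362   # player X tile coordinate

def read_location(read_byte: Callable[[int], int]) -> dict:
    return {
        "map_id": read_byte(W_CUR_MAP),
        "x": read_byte(W_X_COORD),
        "y": read_byte(W_Y_COORD),
    }

def location_blurb(state: dict) -> str:
    # The small "blurb of state" appended to each turn; wording is hypothetical.
    return f"You are on map {state['map_id']} at tile ({state['x']}, {state['y']})."
```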

Claude will pretty aggressively hallucinate that it succeeded in transitioning between zones if you don't tell it that it did not. This just comes down to literal vision issues, and so most of the patching and extra help I've given it has been attempts to make it so it can still play despite not being very good at seeing Game Boy screens in particular. And then it gets a handful of reminders. These reminders do a decent amount of work, but it's

things like, you know, remember to use your knowledge base occasionally. And we tell it if it gets stuck, for example, if we detect that it hasn't moved in 30 time steps. I once saw it see a red box on the screen that was the doormat, think it was a text box, and spend 12 hours overnight pressing A to try to clear the text box. You see that happen once and you add in some helpful reminders not to do that.
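The stuck-detection reminder he describes can be sketched as something like the following; the 30-step window matches the number mentioned, while the class shape and the reminder wording are made up.

```python
# Toy version of the "you haven't moved in a while" reminder described above.
# The 30-step threshold matches the anecdote; the reminder text is invented.
from collections import deque

class StuckDetector:
    def __init__(self, window: int = 30):
        self.recent = deque(maxlen=window)

    def update(self, location: tuple[int, int, int]) -> str | None:
        """location is (map_id, x, y); returns a reminder string if stuck."""
        self.recent.append(location)
        if len(self.recent) == self.recent.maxlen and len(set(self.recent)) == 1:
            return ("You do not appear to have moved in the last "
                    f"{self.recent.maxlen} steps. Consider a different approach, "
                    "and check whether a text box is actually open.")
        return None
```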

How much knowledge does the model have about the game itself? So for example, types, right? Does it know about types, weaknesses, and things like that? Or how much are you trying to put into it? Yeah, if you go to claude.ai, it will tell you about some stuff. I have not yet decided if

the knowledge that it has about Pokémon is helpful or harmful to it playing the game. Half of the time when it's like, oh, I know this about Pokémon, it then uses that to hallucinate something. So for example, at the beginning of the run on Twitch, you saw it go out of the lab, see an NPC at the bottom of Pallet Town, and go, it's Professor Oak, I found him. And it's very much not Professor Oak, but the fact that it has indexed on this concept is a little...

It's stuff like that where it's unclear to me where it comes from, but it clearly has some information. There are a million game guides about Pokémon sitting on the internet, so it's unsurprising that there's a decent amount of information there. I don't really give it a lot of extra information. It picks things up. I watched on the stream the other day, it tried to use Thundershock on a Geodude and it failed, and it went, hmm, I forgot about that, that does not work. So clearly it knows some stuff. It's not perfect. It picks some stuff up as it goes through the run.

Ideally, for me, I think it's just interesting to see what it actually learns as it's playing. So the more it does that, the more I'm actually interested in it. Yeah, one of our Discord members, Nodjung, had a good question about the sense of self. Yeah. Sometimes it gets confused about who the actual playable character in the scene is. How do you steer that? Yeah, I think "sometimes it gets confused" can be applied to many things in Claude playing Pokémon.

In particular, when it's trying to look at the screen and understand what's going on. I have attempted to prompt it all sorts of ways, like, you are at this exact coordinate and you're in the middle of the screen and you're wearing a red hat, and things like that. And that's all neat, but Claude doesn't particularly understand the middle of a Game Boy screen and a whole bunch of concepts like that, which means you can prompt around it all you want, but this kind of spatial awareness, where something is with respect to something else, is something that Claude is still just not

great at in its current incarnation. So one of the side effects is it sometimes loses track of who it is on the screen and thinks there's something else there. I'll keep tracking through this. So I hinted at this other tool that I give it called the navigator, and this is just the only other patch that I have for the vision issue. What the navigator basically does is let Claude say it wants to go to one of these coordinates that we provide in the screenshot.

And then we automatically press the buttons to get there. It has to be something on the screen; I'm not trying to let Claude just navigate a whole map by asking politely. But one thing you'll notice if you run it without this tool is that if Claude wants to get from one side of a wall to the other side, it happily just tries to walk through the wall repeatedly, because it doesn't quite have the concept of what's between it and where it wants to go.

And I spent a lot of time prompting around this, and it just isn't, it's one of those things it's not very good at. So in order to make it somewhat fun to learn from Claude playing Pokémon at all, we use this navigator tool, which helps it actually get around a little bit better.
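Assuming the walkable tiles are already known from the reverse-engineered RAM state, the navigator described above reduces to a breadth-first search over on-screen tiles translated into D-pad presses; here is a rough sketch, not the actual implementation.

```python
# Sketch of a navigator: BFS over the walkable tiles visible on screen, then
# translate the path into D-pad presses. The walkable-tile set is assumed to
# come from the reverse-engineered RAM state described earlier.
from collections import deque

MOVES = {(0, -1): "up", (0, 1): "down", (-1, 0): "left", (1, 0): "right"}

def path_to_buttons(walkable: set[tuple[int, int]],
                    start: tuple[int, int],
                    goal: tuple[int, int]) -> list[str] | None:
    """Return the button presses to walk from start to goal, or None if unreachable."""
    queue = deque([start])
    came_from = {start: None}
    while queue:
        current = queue.popleft()
        if current == goal:
            break
        for dx, dy in MOVES:
            nxt = (current[0] + dx, current[1] + dy)
            if nxt in walkable and nxt not in came_from:
                came_from[nxt] = current
                queue.append(nxt)
    if goal not in came_from:
        return None
    # Walk the path backwards and convert each step into a button press.
    buttons = []
    node = goal
    while came_from[node] is not None:
        prev = came_from[node]
        step = (node[0] - prev[0], node[1] - prev[1])
        buttons.append(MOVES[step])
        node = prev
    return list(reversed(buttons))
```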

So since we covered a bit about the different tools, the prompting, and the strategies, I'm curious how many tokens all this is using. There's a part about conversation history and truncating parts of the messages and state. But at a high level, how many tokens is this using, and can we go into where those are coming from and what's being truncated? Yeah, you got it. When you think about the prompts here, essentially every step, something that looks like this gets sent.

So if we just go through what each of these looks like: everything in the system prompt is probably a thousand tokens, pretty small, a handful of paragraphs. The knowledge base I let get up to about 8,000 tokens. I put some arbitrary cap on it, because Claude will put a whole bunch of BS in there if you just let it keep writing stuff, so the cap helps constrain it to think about what's actually important a little bit. And then the conversation history,

it's kind of finicky, but it basically rolls out to 30 messages. That's actually something you can tune; I've tuned it, and 30 messages is about the best performance I've gotten. So what that means is it basically uses a tool, gets a response back, uses a tool, gets a response back, and it's allowed to do that 30 times. At that point, it triggers the summary, which takes that conversation history, summarizes it, makes it the first user message, and then we roll back out again.

So the bulk of the tokens end up being in the conversation history once it's at its longest. In fact, the bulk past that ends up being these screenshots, which are scaled up a decent amount. I do allow it to see a number of the previous screenshots, but not all of them, because it ends up being a ton of context if you let it see even 30 turns' worth of screenshots. So I trim out a few.
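A rough sketch of the two context-management mechanisms just described, summarizing after roughly 30 assistant turns and keeping screenshots only on the most recent turns, could look like this; the 30-turn limit is the number from the conversation, while the helper names and the screenshot count are assumptions.

```python
# Rough sketch of the context management described above: after a fixed number
# of assistant turns, summarize the history into a single first user message;
# and only keep images on the most recent turns. Helper names are hypothetical.
MAX_ASSISTANT_TURNS = 30   # the value that worked best per the discussion
KEEP_SCREENSHOTS = 3       # illustrative; the real number isn't specified

def maybe_summarize(messages: list, summarize_fn) -> list:
    turns = sum(1 for m in messages if m["role"] == "assistant")
    if turns < MAX_ASSISTANT_TURNS:
        return messages
    # summarize_fn could itself be another model call that condenses the run so far.
    summary = summarize_fn(messages)
    return [{"role": "user", "content": summary}]

def trim_screenshots(messages: list, keep: int = KEEP_SCREENSHOTS) -> list:
    """Drop image blocks from all but the most recent `keep` user messages."""
    kept = 0
    trimmed = []
    for msg in reversed(messages):
        content = msg["content"]
        if msg["role"] == "user" and isinstance(content, list):
            has_image = any(isinstance(b, dict) and b.get("type") == "image" for b in content)
            if has_image and kept >= keep:
                content = [b for b in content
                           if not (isinstance(b, dict) and b.get("type") == "image")]
            elif has_image:
                kept += 1
        trimmed.append({**msg, "content": content})
    return list(reversed(trimmed))
```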

That's where the bulk of the actual tokens are. In practice, this rollout at max ends up around 100,000 tokens, I think, for the longest message you ever send to the API on one of these turns. And it will fluctuate with summarization, depending on the state of the knowledge base, between roughly 5,000 and 100,000 tokens.

And is that per action in the game? And roughly, do you have a high-level ballpark estimate of how much and how long it costs to run this? Let's say people want to compete. Yeah. Like, how much would this be? I think you'd really want to think about running this as a side project in terms of the impact on your personal wallet and how much you care about Pokémon. It's not clear to me that without the blessing of Anthropic I would have decided to take

on this project for my own wallet's sake, especially if you want to experiment and try 10 different things. I mean, it's costly. I don't know, I haven't spent a lot of time on the exact number. It's not that hard to estimate; I just told you a bunch of numbers, so you can kind of back it out. But to do a lot of experimentation, there are at least thousands of dollars of tokens being consumed. So it's not

a cheap rollout. Yeah. But in the scheme of how some people use tokens, it's not terrible. How many turns are you keeping in memory before you summarize? It's 30 right now. Yeah. I've tried more and less. I think one thing you see a lot when you talk to people building agents is that there's some effective context length that actually has the model be the smartest,

and that seems to vary slightly model by model. But for this model, for whatever reason, 30 messages worked better than 20 and better than 40, so I landed in between those and it worked pretty reasonably. Yeah. Does that change based on location? Like, how many would you want to give it to get it out of Mt. Moon? We've got to bring Claude home; we can't let him stay there another 57 hours. Yeah, yeah. I actually am not sure it does.

I've tried passing a ton of screenshots, like 20 or 30 screenshots at a time, for it to be able to see. And it's not obvious that that temporal concept is actually super relevant to it. And, again, this is just...

Trust me, as someone who has spent a lot of hours obsessing over this: you can try to prompt Claude a lot of different ways to understand how to navigate better, and anything short of telling it exactly what to do does not improve its actual navigation. It's just not a skill it's great at. It's good enough to random-walk its way through some of the complex mazes, and in good, easy areas it's pretty good at bopping around. But yeah, I think I could tell you if there were a way to prompt this slightly differently that

would navigate better, and I would believe there is something, but it is not an easy lift. Yeah. I just asked claude.ai right now, how do you get through Mt. Moon in Pokémon Red? It does have a plan, but I don't know if it's the right plan. I have seen it come up with a lot of answers to that question, and most of them are right.

This is part of the pain. When I say I'm not sure if its knowledge is helpful or harmful: it will fixate, like, oh, I know the exit is on the eastern wall, and then it just spends 12 hours trying that. Yeah, it's unclear to me that we're not just harming it by having it think it knows the answer.

Yeah. I think that's the interesting part, right? You don't want it to just know the answer. The model clearly knows a lot about the game; there's EV/IV maxing, Pokémon players get very, very extreme. But if that's what you wanted, you could just hook it up to a knowledge base, hook it up to a guide on how to beat Pokémon Red. The interesting piece here is actually, can it figure out what to do without just memorizing the path through? That's exactly right. That's part of why...

I don't know. Part of what I've realized putting this out in the world is that people will draw their line of where purity is anywhere on the spectrum. Is this cheating? Yeah, maybe. Who knows? Frankly, I don't particularly care. The main insight that I have is that when we put this out, you learn a lot about what the model's good and bad at by staring at it. And that's kind of what I like about it, so...

Evaluating the model is kind of separate from your emulator and how it can use an emulator, right? We can always improve those things.

I'm curious, as you switched from 3.5 to 3.7 and toward reasoning models, were there any degradations there? Did it kind of get worse at anything? And was the prompting somewhat consistent? A lot of what we've seen with different reasoning models is that you kind of prompt them differently, right? You tell them what to do and let them figure it out. But yeah, any insights there? Yeah, that's a good question.

One thing that's nice about 3.7 is that it's this hybrid reasoning model, so it can kind of do the old thing and the new thing. It's actually pretty good at just being an out-of-the-box model while also having this thinking mode where it can spend time reasoning. So I didn't really run into any serious degradations. The one thing I'll say is, with literally every model that has come out, the main change that I have made to this agent is deleting prompt stuff.

There's a whole bunch of band-aid-y prompt stuff I've added in the past, trying to steer it away from doing things it got horribly stuck doing before. And as the models get better, I've found that just making sure it's as simple as possible and giving them as much free rein to try to solve a problem as possible

is useful. The way I think about this is, I'm less confident over time that I understand exactly how a model is intelligent. It's capable of all of these ridiculous things; it does PhD-level stuff in some ways and is unable to see a screen as well as a four-year-old in other ways. So my confidence in exactly what I need to tell it to do to be smart at playing Pokémon is actually really small right now, you know?

If I tell it, this is the way you need to solve this problem, that might not actually be the best way for 3.7 to solve this problem. It's just different from me in terms of how it thinks about these things. I've found that pulling the unnecessary instructions, where I tried to use my intuitions about what would make the model better, out of the prompt over time is the thing that has consistently gotten more juice out of this as models have gotten smarter. I was watching the stream yesterday or the day before, and

it was a very tense battle. I think they were down to, like, 2 HP each, and the opposing Pokémon missed a Scratch or something, and it didn't die. And you could tell, I was like, wow. It was very dramatic, and it was talking about the game. Is there any thought being put into trying to have it be more rational, to let it know that it's not real life, that it's a game? It feels like it gets very distressed when

the Pokémon are actually going to die. It's funny, it knows it's Pokémon. It's like, you're playing Pokémon Red. It does know that and it has a sense of that, but it clearly grows some attachment. I'll tell you a fun story. We tell it to nickname its Pokémon now. It will occasionally do that unprompted, but it's more fun if it nicknames its Pokémon, so that's in the prompt: it's fun if you nickname Pokémon, you should consider it. And one thing we found when we started doing that is it got more protective of the Pokémon it nicknamed. It's pretty obvious when it catches a Pokémon:

now that it has a nickname, it will go heal it right away if it's hurt. And that did not ever happen before. So there are some cute little quirks about Claude really wanting to protect its precious nicknamed Pokémon, which is great. I will say, it's kind of normal. When I was five playing Pokémon Red and, you know, I had two HP and it missed a Scratch, that meant everything. That was existential. I agree. I agree completely. How about

skill transitioning between games? So one question that I had: you're playing Pokémon Red, right? Say you want to play Silver or Gold next. Have you thought about how models can learn from these games, store those learnings, and then use them again in the future?

I'm sure it's not part of the project today, but curious about your thoughts. I've thought about it only a little bit. When you actually read one of the knowledge bases it has built up on some of the longer rollouts, when they're good, there are actually some pretty decent tidbits about how it should act and try to do things, and some of the ways it succeeded. And actually, one of the things that's most unique about

3.7 Sonnet that I've seen is that it will have meta-commentary on what it's good at and bad at in its knowledge base, like, I misperceived this thing and so I need to be careful doing that again. You occasionally see that show up there, which is pretty cool. So I could imagine there being some way to translate that knowledge base from one game to another. I think my knowledge base is frankly kind of a kludgy implementation right now; it's more or less a Python dictionary that's appended to the prompt.

And I think, if your goal is to transfer across games and things like that, you could find better ways to manage a knowledge base that Claude can actually use well in different scenarios. But there are definitely pieces there; I think it would be off on a better foot on the next Pokémon game if it had that. Or even if I were to restart the stream, it would probably speed up if it had access to things that it learned in the past, which is interesting.
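A minimal sketch of the dictionary-style knowledge base described here, rendered into the prompt each turn with a soft 8,000-token cap, might look like the following; the class shape and the characters-per-token heuristic are assumptions, not the actual implementation.

```python
# Minimal sketch of a dictionary-style knowledge base like the one described:
# the model edits entries via a tool, the whole thing is rendered into the
# prompt every turn, and a soft cap keeps it near 8,000 tokens. The
# 4-chars-per-token estimate is a rough heuristic, not real token accounting.
MAX_KB_TOKENS = 8000

class KnowledgeBase:
    def __init__(self):
        self.entries: dict[str, str] = {}

    def update(self, key: str, value: str | None) -> str:
        if value is None:
            self.entries.pop(key, None)
            return f"Deleted entry '{key}'."
        self.entries[key] = value
        if self._approx_tokens() > MAX_KB_TOKENS:
            return (f"Updated '{key}', but the knowledge base is over its "
                    f"{MAX_KB_TOKENS}-token budget. Consider pruning old entries.")
        return f"Updated entry '{key}'."

    def render(self) -> str:
        """Rendered into the prompt each turn so it never has to be 'recalled'."""
        lines = [f"- {k}: {v}" for k, v in self.entries.items()]
        return "KNOWLEDGE BASE:\n" + "\n".join(lines)

    def _approx_tokens(self) -> int:
        return len(self.render()) // 4
```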

Yeah. I always think of that in card games, you know? You have the idea of tempo in a card game, and it's the same in Magic as it is in, you know, Star Wars, Flesh and Blood, all these different things.

I feel like games are similar, where learnings you get from Pokémon you can bring over to similar open-world games. I think it's also particularly interesting for some of the things that are about how Claude learns to play a game in general, like, pressing too many buttons at once is a bad idea, watch what's going on, that kind of thing.

That definitely is stuff that it has learned that is interesting in a meta way, and it's hard to give it that sense of self necessarily in training. I think sometimes it's hard for it to know what it's getting bad at in some scenarios, but it's interesting to think about how it can learn across things.

Well, some of this is also due to the emulator, right? So a lot of what it's learning is: how do I use an emulator, what am I good and bad at? But the model internally should know quite a bit about Pokémon, right? Like, if you've played Pokémon, going from Pokémon Red to Emerald to Diamond,

having played the first one doesn't help you that much in the second, right? You kind of get the general concept; you get what types are good against other types. And the model knows a good bit of this already. But it's still interesting to show. This more so shows that knowledge bases help with understanding how to use the emulator, right? It struggled and then it figured it out. So even though I know it knows Pokémon, it's like, this thing can now learn how to use an emulator. Yeah, which is pretty cool. That has been part of what's been fun,

seeing all of the progress on this thing. I had a bit of a follow-up question to Alessio's last one. So if people want to blow thousands of dollars and improve this a little bit, is there anything else that you'd want to see done? Whether that's improving the emulator or trying different stuff, is there anything you'd hint anyone watching this toward, what you'd want them to work on? Yeah, no doubt. If I had to guess, the biggest lift that exists around this is probably

something around the memory, which I don't think is hyper-optimized right now. The nice thing about the memory is it's always in the prompt; it doesn't go away. Sometimes if you leave it up to Claude to try to read and load and save to memory, it will underutilize it or forget things. But I think there's probably something there. I will say, of all the many, many hours I've spent tweaking around the edges of this thing, nothing quite does it like a new model, though. Fundamentally, I think the limitations right now are model smarts, things like...

I've seen, and I mean this in the kindest way, a lot of people on Twitch tell me about ways they can fix the navigation capabilities with a better prompt. People would be welcome to try, but I would guess that would be a somewhat fruitless avenue. I think it's just not very good at that kind of understanding. I'll give you a very quick anecdote, which I think is my favorite for why this is particularly hard.

I have this clip of Claude leaving Oak's lab and being like, great, I left Oak's lab, now I need to go up to the north end to get to Route 1. And it just hits up on the D-pad and goes straight back into the lab. And it's like, shoot, I'm back in the lab, I need to leave. And it hits down. It's like, great, I'm out of the lab, now I can go up to Route 1. It hits up. It just goes up and down 12 times. And it's like, you're not fixing that with a prompt. It just literally doesn't get it. It doesn't understand. And so...

it's pretty hard to make little around-the-edges changes that make a huge, huge difference. Yeah. I mean, I've always been fascinated by the fact that Twitch Plays Pokémon actually beat the game. You just look at it and you're like, this cannot possibly work, because you have people trying to sabotage it in the chat too; not everybody's trying to solve it. So I just looked that up: it took 16 days and seven hours for Twitch Plays Pokémon to beat Red.

How close do you think we are to a model that can beat it in less than 16 days? And do you think that needs some really big model jumps, or do you think we're close? I think there is model stuff, at least for Claude. I'm confident there's model stuff that needs to happen for it to be really capable. I have, like, four spots in the game stuck in my head where I think there's literally no hope it's going to get through.

So I think there's a gap that's mostly around its ability to see, navigate, and remember visually what's going on that I just don't think we've figured out yet. To me, that's a pretty big gap. I do expect it to keep getting better. I have no reason to believe that this is not just a fundamental ability to scale, learn, and understand problems, which is getting better as we train models to be more capable of these long-horizon tasks.

I actually do think this is a pretty reasonable proxy for that, and I think it will continue to get better for a little while. I don't know if there are affordances around images and videos and stuff like that that we need to figure out to make it work; it's unclear to me if that's true or not.

But yeah, I think we have a little ways to go before we can beat the game in 16 days. I do not have a lot of faith that the current stream is going to be standing in Victory Road in 13 days. What's been your favorite moment, from thinking of the idea, to building this, to just seeing it play? Any major highlight? I think the most hyped I have been is

when it beat Brock the first time, where I was just like, you know, I've been doing this for eight months. A few weeks ago, I kicked off a run, woke up the next morning, and it was like, oh my God, oh my God. And the other good thing about it is, I woke up at 8 a.m. and checked my phone. I have it send me updates to Slack, which is a ridiculous thing, but it was literally about to start the Brock battle. I opened my phone and it's like, oh, this is happening right now. And it was a pretty hype way to start a day.

I have a lot of other cute things, like some of the nicknames it's come up with over time are endearing. But that was the peak hype for me. It was like, we beat a gym leader, we've got a badge, Claude's doing it, you know? A bit of a follow-up: I noticed that you mentioned it eventually started beating multiple gym leaders. Were these all the same run, or was it different runs?

Yeah, the run that you saw on the graph we put out alongside our research blog is a single run that I have watched get through

at least Surge's gym, and then it got a little past that. And the reason that's where we stopped reporting is because that's the physical amount of time that occurred between when I started it and when we launched the model. So that was a very up-to-date graph of the best run we had. Awesome. I know we're running out of time. My last question is: are we going to work

on Claude Plays Magic next? Or maybe we can do the Magic Arena intro. Yeah, funny story: there was a project I did right before I joined Anthropic that was training an open-source model to be slightly better at picking cards in a draft. I was training it on the 17Lands data that exists, to learn how to pick cards out of a pack a little bit better. And I did talk about that in my interview to get hired at Anthropic.

So I've put time into this. I'm ready. I am ready for that project too; I have that code sitting around somewhere as well. I really get at all my nerd ML-slash-gaming hobbies here. Yeah, no, I'm ready. I don't know if you're planning on open-sourcing any of the Pokémon stuff, but

if you want to work in open source on the Magic stuff, I'll be happy to collaborate. Awesome. We've talked about it. I don't know yet what the plan is. There's a certain amount of "this is not my day job" that I have to figure out how I want to deal with. We'll see. Awesome, David. Any parting thoughts? Anything people have missed? No. I think the one thing I do like to drive home when I've been talking about this is I really do think this is just demonstrating

a thing that is going to make agents better with this model. This is a very fun way to see it, but I think the thing is that it has some ability to course-correct, update, and figure things out a little bit better than models have in the past. And even if there's stuff it's dumb at, it tends to have an ability to power through it in a new way. So what's exciting to me is that I think there will be some real-world stuff that comes out of this model once people play with it, and I'm pretty excited to see how people take

the skills we've put on display a little bit here, or lack thereof in some cases, and figure out how to turn them into actual agents that do stuff.

I have a quick last question on that, actually. Is there any guidance on, or any way that you quantitatively measure, the evals of this system? A lot of it is vibes, a lot of it is how far it gets and where it gets stuck. But are there any lessons or specifics about how you measure how it actually does? So I've done a lot of little small tests, like, put it in this scenario and see what it does. But frankly, the best test I have is just

run it 10 times on this configuration and see how quickly it progresses through milestones of the game. It's the best thing about games, right? It's why games are such a useful thing:

there are literal benchmarks, gym badges, that are moments of progress in the game, which are ways to evaluate what happens. So how quickly it's able to make progress is actually a pretty reasonable eval, if a slightly expensive one to calculate. It's an integration test, not a unit test. Wow. Awesome, David, thank you for joining. Thank you for filling in on the co-host side too. Yeah, my pleasure. Thanks for having me, guys. I appreciate it. Awesome. Good to see you.