“Research Notes: Running Claude 3.7, Gemini 2.5 Pro, and o3 on Pokémon Red” by Julian Bradshaw

2025/4/21
LessWrong (30+ Karma)

Disclaimer: this post was not written by me, but by a friend who wishes to remain anonymous. I did some editing, however.

So a recent post my friend wrote has made the point quite clearly (I hope) that LLM performance on the simple task of playing and winning a game of Pokémon Red is highly dependent on the scaffold and tooling provided. In a way, this is not surprising—the scaffold is there to address limitations in what the model can do, and paper over things like lack of long-term context, executive function, etc.

But the thing is, I thought I knew that, and then I actually tried to run Pokémon Red.

**A Casual Research Narrative**

The underlying code is the basic Claude scaffold provided by David Hershey of Anthropic.[1] I first simply let Claude 3.7 run on it for a bit, making observations about what I thought might generate [...]


Outline:

(01:04) A Casual Research Narrative

(09:10) An only somewhat sorted list of observations about all this

(09:22) Model Vision of Pokémon Red is Bad. Really Bad.

(12:58) Models Can't Remember

(14:34) Spatial Reasoning? What's that?

(15:22) A Grasp on Reality

(18:40) Costs

(19:13) Why do this at all?

(19:47) So which model is better?

(22:36) Miscellanea: ClaudePlaysPokemon Derp Anecdotes

The original text contained 6 footnotes which were omitted from this narration.


First published: April 21st, 2025

Source: https://www.lesswrong.com/posts/8aPyKyRrMAQatFSnG/untitled-draft-x7cc

---

Narrated by TYPE III AUDIO.


Images from the article:

Claude is currently trying to go down the stairs which he is [...]

Left: game in emulation. Right: game screenshot as Claude will see it.

Whoops don't look at this, I got it right on the first try.

Notice what is labeled as stairs here. Quote at the time: "kind of tempted to switch to gemini now lol"

"I told you not to write down a label until you actually used the door claude" "this shit's on you"

"If he one shots this area it will be one of my proudest things" Narrator: Claude did not one shot the area.

Would this help a human child?

"Oversight-o3 trying so hard right now"

ASCII map generated by Claude vs. corresponding game screen.

ASCII map generated by code vs. corresponding game screen. These are more effective, if kind of cheating, but LLMs don't fully understand them either.

Nurse Joy and the [...]

Starting room as seen in original game, no color mods.

Custom overlay from the Research Narrative.

The trees with two diverging branches are cuttable. The round trees are not.

A Pokemon center, according to o3, for 800 actions and like $100.

Pictured: a blue pokeball, a yellow pokeball, a red pokeball (Charmander, blech), and two irrelevant circles.

Pokemon trainer with Bulbasaur, Charmander, and Squirtle from the original games.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.