Background: After the release of Claude 3.7 Sonnet,[1] an Anthropic employee started livestreaming Claude trying to play through Pokémon Red. The livestream is still going right now. TL;DR: So, how's it doing? Well, pretty badly. Worse than a 6-year-old would, definitely not PhD-level.
**Digging in**
But wait! you say. Didn't Anthropic publish a benchmark showing Claude isn't half-bad at Pokémon? Why yes, they did, and the data shown is believable. Currently, the livestream is on its third attempt, with the first being basically just a test run. The second attempt got all the way to Vermilion City, finding a way through the infamous Mt. Moon maze and earning two badges, so pretty close to the benchmark. But look carefully at the x-axis in that graph. Each "action" is a full Thinking analysis of the current situation (often several paragraphs' worth), followed by a decision to send some kind [...]
Outline:
(00:29) Digging in
(01:50) What's going wrong?
(07:55) Conclusion
The original text contained 4 footnotes which were omitted from this narration.
The original text contained 1 image which was described by AI.
First published: March 7th, 2025
Source: https://www.lesswrong.com/posts/HyD3khBjnBhvsp8Gb/so-how-well-is-claude-playing-pokemon
---
Narrated by TYPE III AUDIO.
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts or another podcast app.