
Claude Plays Pokémon - A Conversation with the Creator // David Hershey // #294

2025/3/21

MLOps.community

People
David Hershey
Topics
David Hershey: I built an AI agent that uses Anthropic's Claude models to play Pokémon. The project started last June, initially just as a personal side project to learn how to build AI agents and to have fun doing it. At first the model's performance wasn't great, but as the models iterated and improved, its capabilities kept growing, until it could make real progress in the game and even beat gym leaders. The project also gives us a unique lens for evaluating models' long-horizon decision-making and information handling. I didn't use an off-the-shelf agent framework; I built a simple one myself with three main tools: button presses, a knowledge base, and a navigator. The knowledge base stores and manages information so the model can stay coherent over long stretches of time; the model updates it by periodically summarizing its own behavior. The project also deepened my understanding of Claude and made me realize that large language models aren't just chat tools; they can carry out real tasks. It happens to use Pokémon, but the techniques behind it apply to other domains.

On fine-tuning: for most tasks, prompt optimization is more effective than fine-tuning. Prompt optimization iterates quickly and cheaply, while fine-tuning is slow and expensive. You should push prompting as far as it will go before attempting fine-tuning. Fine-tuning is useful in certain specific cases, such as adjusting a model's output format or helping it understand a particular kind of input data. But advanced fine-tuning, aimed at raising a model's performance on a specific task or teaching it a specific kind of data, is very difficult and requires specialized skills and resources; unless you have extremely high performance requirements, most situations don't call for it.

On AI agents: I believe they represent the future and have application potential in many domains. Coding is one area where agent technology has made striking progress in recent years. Agents also have potential in fields like law and accounting. Agent breakthroughs tend to happen suddenly: a single model improvement can transform an entire category. Reliability is the key factor in an agent's success.

My main focus right now is new large language models and their applications. I believe AI will enable many more developers to use AI and will change how people work. The barrier to using AI is dropping, and managed AI platforms simplify deployment and use, letting far more people participate in building and applying AI.

Demetrios: (Demetrios mainly asked questions and steered the conversation rather than stating standalone positions, so his points are omitted here.)


Chapters
David Hershey from Anthropic's Applied AI team discusses Claude Plays Pokémon, his project in which an AI plays Pokémon. He explains his motivations, the development process, and the challenges involved in creating an AI agent capable of playing a complex game like Pokémon.
  • The project started as a personal playground for building AI agents.
  • Initial attempts with earlier models were unsuccessful.
  • The current model uses a combination of prompt optimization, an internal knowledge base, and simple tools to interact with the game.
  • The project highlights the model's ability to maintain coherence over long periods and make progress in complex tasks.

Transcript


I'm David Hershey. I am on the Applied AI team at Anthropic, and I like a latte in the morning. Welcome back to the MLOps Community Podcast. I'm your host, Demetrios, and today I'm talking to an old friend, David. He's been working at Anthropic, and it is a blast every time I catch up with him because I learn new stuff. I chat with him, and he opens my mind a little bit. Recently, he's gotten quite famous because of the...

Anthropic model that has come out and is playing Pokémon. That is cool. He made that and we talk all about it. Let's get into this conversation with Señor David. I had to put on my nice outfit because I knew that I was going to be talking to someone who is famous. I don't know about that, but I appreciate it, Dede.

Dude, you created the Pokémon thing. Claude Plays Pokémon. And I didn't even know it. You were showing me that. And I was like, yeah, this is great. I thought it was just Anthropic that did that. I didn't realize that was your thing. Yeah, that is my baby, my side project for a while. It has an interesting story of how it came to be, because it wasn't always like this.

The Pokémon I put out into the world this weekend was, for a long time, just sort of my little fun side hustle. But yeah, that is my baby. Somehow people are watching it. It's been stuck in one spot for two days and there are still 1,500 people watching it. So I'm sort of amazed that people care. They're still doing it. That is incredible. Tell me about it: how did you even get that idea? How did you go about executing on it? Yeah, no doubt.

So, honestly, the first time I tried this was in June last year. There were basically two things that were true. I work with customers at Anthropic, and I was working with a bunch of customers who were all working on agents. And I wanted someplace for me to be able to build out agents to some extent, right? Some of our more successful customers were building really cool agents, and I just needed some playground for myself to build on.

And so it's like, okay, I'm excited to try to build with agents. How do I want to do it? And it's like, well, I should probably do the thing that's going to be the literal most fun. If I'm going to really get into it, I'm going to have some fun along the way. And someone else, probably actually before me, had tried hooking Claude up to Pokémon, like an initial little test. And so I was just like, let's do it.

So, back in June last year, I kind of dove in and built out a handful of different agent frameworks to try to play Pokémon. And then, since then, every time we've released new models, I've been slowly iterating and improving and working on it. Nice. Cooking it up. That kind of thing. So...

But yeah, it just started out of, purely, I want to do this thing and I'm going to make sure I have a great time along the way. Which, on a side note, I then became deeply obsessed with. If you ask my wife, she's probably kind of upset at me about how obsessed I am with this thing. And you realized that it was good enough to release to the world now because... why? Yeah. Like, okay. So we tried this on Sonnet 3.5 when it came out in June. Yeah. And it like...

You could see it do some stuff. It got out of the house, it meandered around, but it struggled, right? We tried it with Sonnet 3.6, unofficially, a.k.a. Sonnet 3.5 (new), in October.

And it got a little bit further. It got a starter Pokémon, it got out, it did some stuff, right? And I actually thought about releasing it then. And the thing that was happening in the middle was, I would post these updates to our Slack about this. I have a Slack channel at Anthropic about this, all about Pokémon; that's the internal one, too.

People were following along, right? So I joked that I was like Claude's social media manager while it played Pokémon, kind of, for a while. I would just pull out GIFs and clips of it doing stuff and people would get kind of hyped. And it was fun enough, even back with the last model, that it was like, should we put this out? It's kind of fun. But it was pretty bad. It didn't get very far; it would have been like a 12-hour experiment. There really wouldn't have been much to it. And so then with this model, like,

it just kind of reached the breakthrough where you could see it meaningfully do stuff and make progress. And to be clear, in its current state, there's stuff it's good at and stuff it's bad at, but it's enough that, like,

it does move through the game, slowly. And there's also this give and take of it doing something really stupid for a while, and you're like, no, Claude, why? And then it solves it. It's got that tension that's good in content, a little bit. And I think people internally following along were just having a really good time. By the time Claude was beating gym leaders and stuff, you know, people were like,

freaking out, having a good time. So you had to release this? Oh yeah, a little bit of that; the release actually came late. But then the other side, honestly, is that we put it out in our research blog, the chart of the different models making different amounts of progress.

And there's literally just some amount of, it's an interesting way to see how models handle these long-horizon decisioning things. All of these evals that people are used to, like MMLU and GPQA, there are all these evals that exist out there. And most of them are, here's a prompt and I get a response, and does it get it right or not? You know? And there are fleetingly few that actually test the model's ability to, like,

take in new information, try another thing, make progress. And that's because it's pretty hard to actually measure those things. Doing 10 hours of work, it's pretty hard to measure how good that 10 hours of work was, you know, when a model does it. But Pokémon, it's like, I don't know, you beat a gym leader. That's a thing that happens after 10 hours, right? And so, to some extent, even internally for us, it became a measuring stick of how well this model can stay coherent over time,

hours and hours and hours, and prompts and prompts and prompts, of taking in new information, trying to learn, update, do stuff. It became interesting enough for us to understand what the models were still good and bad at, but it was also just a good thing to put out there to show people: what can this do? Why does this matter?

And I know Pokemon's kind of like a goofy way to do that to some extent. I think like it, it resonates with some people that like, oh, these models like aren't just like a chat thing. I type in a prompt, but they can kind of like go do stuff sometimes. Even if it's like not that good, like comparatively, it's better than it has been.

Do you feel like you're going to now start simulating a whole bunch of these to get some data and then maybe try and make it better for the next one? I think part of what's fun about this is that it's not trained on Pokemon. Part of the fun thing is it's exploring this thing for the first time and...

getting a feel for that. And so I think we're going to stay in the version of this where it's like just sort of like a good way to see how it experiences these new environments that it's never really been trained to do.

Well, take me through the internals. What does it actually look like? You mentioned you created an agent framework for it. You didn't want to grab anything off the shelf, or... what's the story? I'm a learn-by-doing kind of guy. You know, sometimes you don't get to the depth of how does this thing work, and why does it work, that you wanted to get to. So that was where it came from.

My favorite way I learned this, this is a complete tangent, but I took Karpathy's computer vision class in grad school. And he has this write-gradient-descent-from-scratch exercise, a homework exercise, where it's like, you know, TensorFlow had just come out at the time. You could do that, but it's like, no, you're going to write it. You're going to figure out how to implement a machine learning framework yourself. And the way that I understand machine learning now is like 30% because of that one assignment.

The value, the time to value that you got, you still remember that one. Yeah, yeah, yeah. It's incredible. Yeah. Part of what's so fun about like doing this and building in this way with these models is like you just like learn a lot about them when you stare at them, right? A million tokens, you know, like I've seen Claude write

I don't even want to know how many words about Pokémon, but you learn a lot about how it thinks by just reading it a lot, getting into the weeds, seeing how it reacts to different prompts and stuff like that. That's the core of why I decided to go my own way. I tried a few of the published papers, like I tried Voyager back when that was a thing, and a few other things. But at the end of the day, I just kind of wanted to hook something up myself.

And what does it actually do? Like, what is the way that it takes in world information and then like acts on that? Yeah, it's actually like pretty simple. Over time, I've like actually stripped out a lot of complexity from it. So I'll go over it quickly, but it's like it's not the craziest thing in the world. It has like a quick prompt to tell it like it's playing Pokemon. And I give it access to three tools, essentially.

It has the ability to press buttons. It can press A, B, start, select, up, down, left, right, you know, press buttons. It has this concept of a knowledge base that's actually stuck in its prompt, but it's just bits of information that it stores to keep track of things over long periods of time. So "electric is super effective against water" is the kind of thing it could decide to write there, potentially, but it fully controls that. So it can add sections to it, edit sections, that kind of thing.

And then, so Claude's still not that good at actually seeing the screen, so the last tool I have is what I call the navigator. And that lets it point to a place it wants to go on the screen, and it will just automatically be moved there if it's within reach on the current screen.

It doesn't have a great understanding still of the difference between where it is and where it wants to go. So if you just let it say where it wants to go, it does a little better. It's the only simplification we really make for it. And then, how it actually sees the world is, when it presses a button, it gets a screenshot back, to see where it is after the button press.

It also gets a little dump of some stuff that reads directly from the game, like its current location and a few other things, but it's pretty minor. But basically, it presses buttons, it sees what happens afterwards, and then it presses the next button, and it goes and goes and goes. Kind of like me with a new program. Yeah, yeah, yeah. More or less. It's not that different from how you play it.
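(To make that concrete, here is a rough sketch of what those three tools might look like as Anthropic tool-use definitions. This is a reconstruction from David's description, not his actual code: the tool names, schemas, system prompt, and model alias are all assumptions.)

```python
import anthropic

# A guess at the three tools David describes, written as Anthropic
# tool-use definitions. Names and schemas are reconstructions.
TOOLS = [
    {
        "name": "press_buttons",
        "description": "Press a sequence of Game Boy buttons.",
        "input_schema": {
            "type": "object",
            "properties": {
                "buttons": {
                    "type": "array",
                    "items": {
                        "type": "string",
                        "enum": ["a", "b", "start", "select",
                                 "up", "down", "left", "right"],
                    },
                }
            },
            "required": ["buttons"],
        },
    },
    {
        "name": "update_knowledge_base",
        # The knowledge base lives in the prompt; the model edits it itself.
        "description": "Add or rewrite a section of your long-term notes.",
        "input_schema": {
            "type": "object",
            "properties": {
                "section": {"type": "string"},
                "content": {"type": "string"},
            },
            "required": ["section", "content"],
        },
    },
    {
        "name": "navigator",
        # The harness pathfinds to the tile, since the model's spatial sense is weak.
        "description": "Walk to a tile that is visible on the current screen.",
        "input_schema": {
            "type": "object",
            "properties": {"x": {"type": "integer"}, "y": {"type": "integer"}},
            "required": ["x", "y"],
        },
    },
]

client = anthropic.Anthropic()
response = client.messages.create(
    model="claude-3-7-sonnet-latest",  # illustrative alias; any tool-use-capable model
    max_tokens=1024,
    system="You are playing Pokémon Red...",  # short prompt plus the knowledge base
    tools=TOOLS,
    messages=[{"role": "user", "content": "Current screenshot and location: ..."}],
)
print(response.content)  # tool_use blocks tell the harness what to do next
```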

But wait, there was one thing there that is interesting: the prompt. Claude itself has the ability to update things in its own prompt? Yeah.

So the key insight there is, you know, you're playing Pokémon. Let's see, I'm looking at the stream over here. In the three days since it's been up, it's taken 16,884 actions as of this exact moment in time. Wow. And so if you think about that, that roughly correlates to 16,000 screenshots, a whole bunch of stuff that it has seen over that time. And yeah,

that much information doesn't fit into the context window of a language model, right? So you need some way to condense, to get rid of old information.

And so the knowledge base, I'm literally not sure this is the most optimal way to manage this, but the knowledge base is one way that it can keep track of information over longer periods of time. So what ends up happening is it takes 30 actions and then it summarizes the conversation, the things it did for the last 30 actions, and chunks that down. And then it sort of accordions like that, right? It takes 30 actions, does a summary, takes 30 actions, does a summary, that kind of thing. But trying to keep everything in the summaries it writes kind of bloats them, so the knowledge base is a way for it to track longer-horizon things.
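(A minimal sketch of that accordion pattern, under assumptions: `run_one_action` is a hypothetical, stubbed step that David's harness would implement against the emulator; the summarization prompt, model alias, and interval handling are invented for illustration.)

```python
import anthropic

client = anthropic.Anthropic()
MODEL = "claude-3-7-sonnet-latest"  # illustrative alias
SUMMARIZE_EVERY = 30                # actions between summaries, per the episode

def run_one_action(messages):
    """Hypothetical step: send the history to the model, execute whatever
    tool it calls against the emulator, and append the results. Stubbed here."""
    response = client.messages.create(
        model=MODEL, max_tokens=1024, messages=messages)
    # ... execute tool calls against the emulator, capture a screenshot ...
    return messages + [
        {"role": "assistant", "content": response.content},
        {"role": "user", "content": "Screenshot after that action: ..."},
    ]

def compress_history(messages):
    """The 'accordion': replace the turn-by-turn history with one summary."""
    summary = client.messages.create(
        model=MODEL, max_tokens=1024,
        messages=messages + [{
            "role": "user",
            "content": "Summarize what you did and learned above, so you "
                       "could keep playing from the summary alone.",
        }],
    ).content[0].text
    return [{"role": "user",
             "content": f"Summary of your progress so far:\n{summary}"}]

messages = [{"role": "user", "content": "You are playing Pokémon. First screenshot: ..."}]
for action in range(1, 100_000):
    messages = run_one_action(messages)
    if action % SUMMARIZE_EVERY == 0:
        messages = compress_history(messages)
```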

Yeah, that makes sense. Well, dude, talk to me about some of the stuff that you're doing right now at Anthropic. I know you're leading a team that is engaged on fine-tuning and... Yep.

I had the big question of, hey, is fine-tuning really all it's cracked up to be? And I say that because we did a roundtable with NVIDIA probably a month ago. And one thing that was abundantly clear, because we did the whole roundtable on fine-tuning, was folks were like, I've seen way more lift for way less effort by just tuning my prompt, not fine-tuning the model.

Yeah, but what are your opinions on it? I think that's about right for the vast majority of things, honestly. Like, to some extent, that's the best thing about language models, right? Like, I come from the world of machine learning, you and me both, like...

When you have to make a change to a model by training it and getting new data and stuff like that, that is slow. There's a reason MLOps is such a hard thing: getting that all right is really freaking hard. And comparatively, prompting is a miracle, where you can iterate on this thing, even if you have a big eval suite, on a 10-minute iteration cycle, not a three-week-or-whatever iteration cycle. And so, from my experience, I frequently encourage anybody who's thinking about fine-tuning, one of the first things I go in and ask is, how far have you really pushed prompting? How far have you really gotten with it? And, you know, there are people that have really stretched it to the limit, but there are a lot of people who, maybe,

because prompting is still a little weird, people haven't quite figured out how to get it right. It can be finicky, and it's not easy to get the best prompt. And so I think some people get stuck partway and think they're at the top.

But until you really are confident that you've gotten the most out of a prompt, I think it's almost never a good idea to even consider fine-tuning for most use cases. And there's so much you can do to add knowledge with RAG, with whatever other way you want to put things into the prompt, that is way easier than trying to do fine-tuning. Yeah.

For someone whose job is to help people with fine-tuning broadly, I spend a lot more time telling people that they should steer away from it to start because I think you've got to be pretty precise. It's really expensive. It's a challenging... I don't mean expensive from the cost of even doing it. It's the cost of people's time working on it. It's just really hard to justify more often than not.

Yeah. Obviously I have a job, so I don't think that means it's always a bad idea. You have a job and a team that you lead. So when is it right to do? Yeah, great question. I think there are a handful of use cases and times where fine-tuning really makes sense. I work on and see a subset of them, but I'll try to give you an overview of my viewpoint on where it can be a good idea.

One that I guess I'll call out first is a thing that works but also feels a little bit to me like a trap, which is the pattern of trying to train a smaller model to do something that a bigger model can do well, particularly to save money; that's the thing I've seen occasionally. The reason I think that is a bit of a trap is, when you look at the last few years of model development,

every handful of months, a cheaper, smarter model comes out. And just by kind of doing nothing with your whole dev team, you can often get that same outcome: a cheaper, faster model that does the thing the last model did, fine.

Just with a little bit of patience. With just patience, right? And I think unless it's such a layup, because you already have data and you know exactly how it's going to work, the process of getting fine-tuning right is still pretty challenging. Building the right data set, making it work well, making it not degrade the other capabilities of a model: these are all pretty challenging things to get right. And you're going to end up sinking a lot of developer time into something that you could get by just doing nothing. Yeah.

That's one that I think does work. You can have success; you can do the thing. But I'm like,

I think people need to be a little bit more skeptical of whether they should do it, even if it's possible. Because that use case right there was one of the folks that was in that roundtable. They said, look, I cut my cost by like 90% because of that very thing that you're talking about. But that's once they get a model that is working, right? So the cost of the model spitting out information is probably 90% cheaper, but what was the cost of getting that model? I mean, there's an easy corollary to this within Anthropic, which is we put out Haiku 3.5 four months ago, right? I think it was November; I lose track of time here, but it was somewhere around there. And benchmarks-wise, it's about as good as the Opus model we put out in March of last year. So it's eight months or something like that between the two.

And it's like literally a tenth of the cost. Like there's your 90% cost savings there, right? So it's like this is like the actual thing that happens. So like maybe you care about that six months. Like maybe you get it done in two or three months and you saved money for six months and that makes a big difference for you. But like...

I don't know, the opportunity cost of people who could work on machine learning is pretty high. It's not clear to me that the trade-off is great for that, is all I would say. Which is, again, it works. If that math makes sense to you, I think it makes sense. If you're at that scale, then of course go for it, right? Yeah. Right. Like, if you save $5 million for a million dollars of work or something like that, then,

you made $4 million, I suppose. The flip side does still exist, though, that you could have maybe done something else that was more valuable with your time. Anyway, we're getting into the nuance of this, but I personally would approach that with a little bit of skepticism, is all. And the other side is for format, right? Yeah. And so then there's this other form of relatively simple fine-tuning that I think works, which is just,

there's a handful of cases where it's like, I need it to follow this output format a little bit better, or I need it to understand this input format of data I have. So maybe I've got some specific kind of document that I want the model to understand and be able to work with a little bit better. And there are a bunch of little examples like that: my data looks like this; this data is a little different from what the model's ever seen before. It's not that hard to reason about, but getting the model to actually get it might be really powerful. Yeah.

And I've seen all the things, like, you know, classifiers with language models. Language models have this pretty good understanding of text in general, and you can get them to do simple classification tasks that work really well. There's a handful of stuff like that that's kind of in the "this is an easy task, but I just need to get this model that doesn't know much about my data to understand it" bucket. Those are pretty reasonable; I've seen a decent number of people have success with those.

And then there's this third category, which frankly is where I spend most of my time, which is actually taking a really good model and trying to make it better at something. And that's really hard, to be clear. You know, there's a reason why the research labs make good models, and make better models than most other people can: they hire pretty smart people to work on this task of how you take a model and do the last-mile fine-tuning to make it really good.

It's sufficiently hard that most people really should not engage with it; I don't think that extra juice is worth the squeeze for a lot of people. In the vast majority of these cases, you could typically take a model and get good-enough performance off the shelf. But if you just happen to be in the place where you really care about the 5% past where a model can get right now, maybe that's why you're competitively differentiated, maybe it's

like the difference between your product kind of working and not is that last 5%. Then I think it can make sense to do research. And that gets pretty sophisticated; there are all sorts of different fine-tuning methods that can work there. And you have to treat it as a research project very much at that point. It is not an engineering project. It's, we need to do research to figure out what data, methods, and tooling are going to actually make a model better at a task.

But it's certainly possible. I mean, labs make models better at specific tasks all the time. You've seen it in a whole bunch of different ways with the new models that have come out, in particular over the last six or eight months. So it's obvious, if you squint, that labs have the capability to take a model and make it better at a thing. And if that's true for labs, then that should be true for anybody, effectively. But the cost is really high. So it just has to be so worth it to go down this sort of research endeavor. But it does work: you can take an arbitrary problem and think about how to make a model better at that problem. Well, it's funny, you mentioned before something that I wholeheartedly agree with: since we're not super clear on what is happening when you send in the prompt, and you don't know if

the output is not good enough because my prompt is not good enough, or because I need to make the actual model better, a lot of folks potentially default to fine-tuning as the next step. Because it's just like, oh well, I've got to make this output better, so I guess if I can't get there with my prompts that I bought from an AI influencer off of Twitter...

Then I think I should now start looking into fine-tuning. And that is hopefully what we are advising folks not to do. Yeah, I think that's exactly right. There's this kind of mysterious promise of fine-tuning that exists sometimes, I think, where it's like, if you put all of your data into the machine learning box, surely it will get better.

And as someone who has put a lot of data into the machine learning box, and watched the models get worse more often than they get better for the average set of data you get, I can tell you that's a very expensive path to take by default. I think it's not straightforward. It's not obvious that any single set of data is going to make the model better at what you care about.

Like that's why I describe some of this as like research to some extent, because like actually figuring out what's going to impact the behavior you care about just takes time and it's not easy. And so I think like there's something about like the pull of like the power of machine learning getting us there. It's gotten us a long ways. But I have certainly seen a lot more teams have a lot more success working on prompting longer than fine tuning.

especially in like raw count. I really want to talk to you about agents because that is the buzzword of the year. And it feels like everyone is trying to do things around it. I have recorded a ton of episodes with folks that are putting agents into production. I feel like you've probably seen some really cool use cases with agents. What are some just off the top of your head areas you want to jam on around agents? Yeah.

Obviously we have seen the same thing, which is, I think, a lot of the people that we've seen have the most success with Claude have been building agents. Yeah. And we've put a lot of work into making models that are better at doing agentic-shaped things. We are very much of the belief that that is what's to come. You know, it's funny, I think you think about agents and there's, um,

So part of the funny thing that I think has happened, and part of the funny thing about working in this industry, is it's kind of hard to know when, and what, model is going to come out that makes a specific agent work well.

I don't think we internally at the labs have a perfect grasp on forecasting, for some future model, whether it's going to be good enough to solve some agent task. So part of the funny thing is you kind of just sit around knowing that there's something there; you squint and something looks like it's going to work. And then maybe one of these models is good enough to hit the liftoff point where it's actually valuable.

So to give you an example, that clearly happened with coding last year, right? In the tail half of last year, and especially when we released the updated Sonnet 3.5 in October, you saw a lot of the coding agents really, really explode. And I think that's because it was the first time a model really got to that next tier, where it went from a cute thing that I'm happy to play with to, whoa,

this is really, really good. Yeah. It's useful. It saves me time. I can describe things. That's when you saw the YouTube videos of, like, my eight-year-old daughter just built a website. Exactly. And that's the thing that I think is kind of interesting, because what that implies a little bit is you just kind of have to

get a feel for what are we close to, like what could pop next a little bit. And kind of like maybe it's hard to know if it will, when it will, which release from which lab, like who knows what's going to make something really, really great. But I think you can like start to see different things take off. So coding is the one that like has obviously taken off. We released Claude Code recently.

Yeah, I saw that.

So coding has quite clearly exploded, and I have hunches about other things I think could. It's hard for me to say for sure what will, from the same perspective that you have, but I think there are some more sophisticated things; for example, legal workflows I could imagine getting a lot better over long horizons.

And we have people who work at Anthropic who have worked on things like accounting workflows. Not a product we're working on, to be clear, but I know people at Anthropic who have in the past worked on that kind of thing. I think there's a lot of stuff that is probably further out, but you could imagine a model that gets good enough at using a spreadsheet. There's just a ton of manipulation of spreadsheets; so much of the work that gets done is menial moving of things around spreadsheets and running formulas and stuff.

And you can imagine models getting better at that kind of thing, and that having a really big impact on what kinds of agents are actually useful for doing work. If you can go to claude.ai and say, here's the spreadsheet and here's this analysis I need to run. Can you go do it? And create a data visualization from it? Yeah, build the analysis, build a model, predict, forecast this thing, tell me what the answer is, and give me the sheet to show it.

I need a dashboard for my boss. Sure. Right. In 20 minutes, when my meeting is. Yeah. But I think part of the frustrating thing, on the flip side of this, is that I have about as much information as I could have, and it's still hard for me to predict what the next thing that's going to work is. But if you ask me what the pattern is for what's going to happen with agents, it's that,

and it could happen because of this model we just put out this week, it's hard to know to some extent. But models are going to come out. Some startup or someone is going to be building this agent in this category, and it's like, oh shit, it works now. It didn't work, and now, oh, it's good. And then they're just going to explode. It's going to be a month and they're going to explode, and it's going to be all you hear about, because it took off.

À la Cursor. A hundred million in whatever X amount of months, not years. Right. And I think you're going to see that pattern happen with a handful of different products, workflows, and tools people have, where,

around some model that gets the key set of small capabilities you need, it goes from "oh, it looks like it's about to do this, but then it tripped and stubbed its toe, and now I have to go figure out what went wrong, and it ruins the immersion of trying to use this agent" to "oh, I asked it to write this code and it just did it, and I didn't think about it." And that's a world of difference. It's funny you mention that, too, because the

the hackiness that you have to do when your agent doesn't work. You spend all this time trying to

just get it there so you don't have that experience or your end user doesn't have that experience of, oh, I thought it was going to work and then it didn't work. And so the reliability of the agents is something that I think... Well, that's like so much of why it's like the tipping point, you know? It's like as soon as you have to go in and figure out what didn't go well, like you waste all of it. Like you have to get the full context on why it went wrong and what happened in the middle. And it's like you are like nearly at the point where you may as well have done it yourself. Yeah.

And even if it's one little tiny mistake at the end that happens, right, if it happens enough, then it's like, oh, I'm trying to have it do this thing, but I have to actually go do the whole thing to figure out what went wrong. But as soon as it crosses that threshold where, more often than not, you don't need to check it, then it's like, oh, this is the coolest thing I've ever seen. It's incredible. Yeah. Which, I think, again, I'm just going to fall back on code because it's the thing we've seen so much, but it's like,

I've had AI write code for me for a long time. I've been at Anthropic for a year; I've used AI to write code a lot over the last year, obviously.

But there's this big difference between "I copy and paste it in, and I poke through it, and I kind of get it there" and "it just kind of happens." That is way different. You know, I wish I had some hot, spicy prediction of exactly what the next thing is that's going to take off. I really don't. But that's the thing that I expect to be true for agents: a model comes out, a month goes by.

The thing clicks for someone, it explodes. And then like, bam, there's another industry that's like really got a new way of doing it. Yeah, it takes the menial work and all of these tasks or whatever the people that are in that sector normally do. Now they're doing it a lot faster. And so they're able to produce a lot more from it.

And I think it's just it's exactly what you're saying. Like, it's just going to be more reliable. And you're going to see that reliability is going to then echo out into the world. Yeah. Yeah. No, I completely agree. I completely agree. And it's funny you mentioned how the model updates will cause this type of second or third order effect.

And also how folks almost need to reset when new models come out. Because I was talking to some friends who are building agents and they said that every time they upgrade to a new model, it's almost like they have to start from scratch with the prompts again. Because...

A lot of what you're saying is the models are better at doing things, so now, in the prompts, you don't need to specify all these little edge cases. You don't need to tell it to do that thing anymore, because it already does it; it's already in the training or the fine-tuning. Yeah, I mean, with Pokémon, to bring it back to that, every time I've tested one of the new models,

the most common change I make is deleting stuff. You know, it's like, oh, I had all of this prompt stuff in there to tell it not to make all of these stupid mistakes, to put band-aids all over it. And then if I just delete all the band-aids, it's just way better. I think part of why people have a hard time seeing new models come out sometimes, and I think it's true for almost every model that comes out, is you look at these benchmarks, but the thing that's holding the model back is typically not

some specific, big thing. It's not like it learned physics overnight and that's why it got better at filling out a spreadsheet. It's like, oh, it got just good enough to click on a cell in this spreadsheet; it clicks on the right cell now, and then it all works, right? And it's kind of imperceptible

if you're just chatting with claude.ai. The model could, for the most part, just sound maybe incrementally a little smarter or whatever, not much. But if it got a little better at clicking on a spreadsheet, and that means it can fill out spreadsheets now, it's like, oh, game over. And so I think we're in this funny era where models come out and, to some extent,

at first glance, for some people, it's like, what's the big deal? What's so different? It kind of looks the same; on claude.ai it feels the same, right? And people notice a little, but it's not this gargantuan thing for everything, until you find the thing that it's actually way better at. Like, oh, when I asked it for this thing, it's way better. And then, you know, that's where you end up seeing a lot of the change, from my perspective.

And in Pokémon, that's what happened, right? It wasn't obvious; some of these benchmarks looked kind of similar. And then suddenly we hook it up to Pokémon and it's night and day. This model can do it. It's like, whoa, that's crazy. That's so weird. We've got to show more people. Release this to the world. That's what we can expect as far as the next big breakthrough: agents on your video games. Yeah.

And actually, one thing that reminded me of: what was the quote from Neil Armstrong at the moon landing? One small step for man, one giant leap for mankind. It's like one small thing for the model, but one giant leap for mankind, in a way. Yeah.

Makes me think about it. So what other stuff have you been thinking about lately? And to let you know, whenever I chat with you, and I always love chatting with you, I come out of these conversations like, oh damn, I wasn't viewing it that way, but now you totally changed my mind. Case in point, one of the first times that we talked after

After the whole AI revolution started and got underway, you were saying, dude, there's like millions of developers that can now use AI as opposed to the...

one million, or hundreds of thousands, of machine learning engineers that use AI. Like, do you realize what is going to change now? And I was like, are you sure? I don't think so. It doesn't seem like it's that good. And after that conversation with you, I was like, oh, well, I guess maybe I'm going to be open-minded about it. So are there other areas, and I'm not saying that there's been such a big disruption, but things that you're thinking about now that you're excited about that don't involve Pokémon? What do you mean? What else is there? That's probably been your life for the past, whatever. At least for the last four days it's been a little bit of my life, but yeah,

I'm going to give you a series of boring answers, because I've just been so enveloped in trying to make better models and help people use them better. And so, honestly, I feel like I have blinders on a little, which is,

All I can think about these days is, like, what's going to happen with new language models and how are they going to change how people use them. And so I'm kind of, like, still stuck on the same thought that I gave you when we talked a while ago, which is... Has it changed? Has that, like, assumption of, hey, now everyone is going to start using AI, you still feel...

strongly about that? Are you doubling down? Are you backing off? Yeah, I mean, if you go talk to the world, if you talk to the people we talk to,

it is not just ML people that we talk to who build AI features, you know, it's not even close. The vast majority of the people that build with Claude are engineers. And then you talk about the people who are actually getting value out of it; it's people going to claude.ai and doing all sorts of other stuff, too, right? It's like,

and you can build workflows on top of Claude. I saw you are interested in MCP. You can build pretty significant workflows without really needing a huge engineering effort. You expose a few things via MCP, and you can start building out workflows that really do a lot of your job for you, just by gluing together some of the pieces that have been exposed in your organization.
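(For a sense of how little code that can take, here is a toy MCP server in the style of the official Python SDK's FastMCP quickstart. The server name, tool, and its data are invented for illustration; a real one would wrap an actual internal system.)

```python
from mcp.server.fastmcp import FastMCP

# Toy MCP server exposing one internal "thing" to Claude. The tool and its
# return value are made up; swap in a real query against your own systems.
mcp = FastMCP("internal-reports")

@mcp.tool()
def quarterly_revenue(quarter: str) -> str:
    """Return revenue for a quarter, e.g. '2024-Q4'."""
    return f"Revenue for {quarter}: $1.2M (stub data)"

if __name__ == "__main__":
    mcp.run()  # serves MCP over stdio; point a Claude client at this script
```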

And so, yeah, I'm much more radicalized than ever that part of what's happening here is that we are dramatically expanding who can use this stuff. I think there's a long way to go. The experience of going to a blank chatbot and trying to figure out how to use it is not good enough for some people, right? And so figuring out how to elevate people to feel like they understand what they can do with this thing

is an unsolved problem in general. Yeah, I heard it put this way, which I really like: how much cognitive load are you putting on the end user? Yeah. And when it is a blank chatbot and they have to create the question, or create the whole prompt, that's a lot of cognitive load, as opposed to scrolling TikTok, you know? And so... Yeah, it is. And it just, I don't know. It's...

AI, you just see a lot in the world. And I think it can be somewhat overwhelming if it's not grounded in "what does it do for me?" And I think people are slowly figuring that out. But, you know, maybe this ties back to software engineers: they quickly figured out what AI can do for them. It's been a while since I've run into someone who has not, at least to some extent, figured out

a way that using language models has a pretty big impact on how they do their job. Yeah, even just with the coding. That's true. I may be biased, obviously, but in my eyes there's almost this rabbit hole that every software engineer has to go down when they start trying to build AI features.

And that is, they start learning about AI more and more and more. Next thing you know, it's like, okay, well, you're almost this hybrid of someone who is a data engineer. They have to learn about pipelines; they have to learn about all this stuff that we've been doing. It's almost this gateway drug into the ML world, or AI world, I guess, is what they would call it.

It's funny, a lot of the core skill set that was relevant for machine learning people, once you can give up control of the gradient descent, is actually still a pretty relevant skill set. A lot of what you do to make

these systems better is you build the data sets you need to evaluate them. You do this sort of stochastic iteration process over prompts, or whatever it is, or over the various systems that build out an agent. There's a lot of pure experimentation you need to do to get it right, and it needs to be good: you need to track your experiments, you need to be thoughtful about what you did and how you did it. If you're imprecise with that, you end up in the same holes that we learned a lot about doing machine learning back in the day.

And so there are all of these sorts of crossover skill sets, I think, where there's a lot that

engineers building with language models can learn from the history of machine learning, and a lot of skills that people with machine learning backgrounds can use. I think it's just a question of figuring out where and how to let go and live in that hybrid world. Ideally you don't need to worry about inference anymore, you know? This is maybe one of the least talked-about things labs do, but,

they figure out how to serve machine learning models to people at incredible efficiency. In the past it was like, oh, I'm going to have to figure out a GPU cluster and serving and routing; there's just all of this annoying stuff you need to figure out to use ML. Now that's just

a fancy API where you get to pay as you go, on demand. It's crazy how convenient that is for using machine learning, and how quickly people got hooked on the drug that is token-based pricing, where you just pay for API calls instead of having to pay to host GPUs and deal with it.

So I guess like I use that as an example of like, hopefully some of this has just like gotten much easier. Like you don't need to worry and think as much about GPUs if you're just like going down the managed route that I think most people should be on. There's like all of this infrastructure hassle that you can kind of not have to worry as much about. Like the data problems tend to be like a little less painful. It's not like cluster scale data. A lot of the times that you're thinking about with language models, it tends to be like

You know, small amounts of information about people, maybe, but it's not a lot that you're really working with, compared to, you know, our background; we've both seen really gnarly data problems. Yeah. And in my time at Anthropic, I've seen nowhere near that kind of data problem. It's just getting a little simpler. So, yeah, one buddy of mine told me, and he works at a bank, so you can imagine the strictest of regulations, he was just like,

for the new gen AI stuff, because I was asking him, what are you doing? He's serving both gen AI use cases and traditional ML use cases, so all that fraud-detection fun stuff he's doing, and the gen AI stuff. Totally. He said, man, when you can get away with it, outsource everything; get rid of all the headache of the platform and just outsource it to these labs. I know.

Yeah, I think that one has just taken a little bit of convincing for people, because it happened so fast. A lot of people invested so much in figuring out all of this really complicated infrastructure to be able to participate in machine learning. And the idea of, oh, we can do it without all of that, you know, we can build on top of this place that's going to,

in some cases host training and inference and everything. And I just like submit data sets or submit inference calls and like it all happens and it auto scales and it's as big a scale as I want. Like everything's perfect.

And that's a thing that takes some adapting if you've spent all this muscle building out the infrastructure. But man, it's way easier. The people who let go and embrace the free infrastructure they're getting, the free infrastructure management they're getting to some extent, I think have been able to make a ton of progress on this stuff. Yeah. It's like the grandpas yelling at their kids: back in my day, we had to actually hook all this up and figure out our own gradient descent. Yeah.

This is what I said to you, though. This is why I was so excited to join, because I spent so much time helping people figure out how to hook up all of this stuff, and it's so hard and challenging. And now I just... you put some words in over here and you get some words out the other side, and we all just have a good time now. It's great. Yeah. And then, yeah, exactly. We can play Pokémon. Or at least watch some AI play Pokémon. Yeah, yeah, yeah. Exactly. I'm so much more...