We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

#116: AI Agents, MCP and the problems with AI benchmarks | ft. Matt Carey

2025/4/19

Real World Serverless with theburningmonk

AI Deep Dive AI Chapters Transcript

People

Matt Carey

Topics

我主要从事 AI 集成工作，最近发现 StackOne 的 API 对构建 B2B 代理非常有用，因为它解决了上下文窗口问题并使所有数据都非常干净整洁。 MCP 类似于 Chrome 扩展程序，为 AI 客户端提供了一个插件系统，允许用户扩展应用程序功能，将 AI 从“黑盒”中移除，让用户可以根据自己的需求添加功能。 MCP 的核心是“工具”API，允许用户通过几行 JavaScript 或 Python 代码与数据库等外部资源交互。MCP 通过标准化的协议，让 AI 应用开发者可以轻松地集成各种外部工具和资源，而无需重复构建集成。工具调用适用于构建自定义 UX 和应用程序的开发者，而 MCP 则更适合那些不需要构建自己应用程序的用户。远程 MCP 服务器是协议的真正力量所在，因为它允许用户无需安装任何东西即可直接使用各种服务。 MCP 的身份验证机制正在不断完善，目前主要使用 OAuth 协议。Anthropic 的 MCP 强调用户控制，通过“采样”机制让用户可以控制工具是否使用 AI 进行外部推理。Google 的 agent-to-agent 方法与 Anthropic 的方法不同，它更注重为每个特定任务使用一个单独的代理。如果一个系统包含循环，并且 AI 模型可以决定下一步做什么，那么它就是一个代理。是否构建 AI 代理取决于问题的复杂性和不确定性，可以通过尝试使用简单的 LLM 调用和状态机方法来判断。选择合适的模型非常困难，最好的方法是创建黄金标准的输入和输出，然后测试不同的模型。现有的 AI 基准测试通常不代表实际应用场景，因此开发者应该创建自己的基准测试。对于约束性强的任务，LLM 可以有效地解决问题，尤其是在有良好的类型提示和测试的情况下。大型语言模型的上下文窗口大小并不意味着模型能够有效地利用所有信息进行推理，开发者应该谨慎使用上下文窗口。现有的基准测试容易受到模型参数调整的影响，因此开发者应该创建自己的基准测试来评估模型的实际性能。开发者可以使用公司内部的数据创建长上下文基准测试，以评估模型在处理大量信息时的性能。获取可靠的 AI 信息的最佳方法是依靠一些了解该领域的专家进行筛选。博客、播客和研究论文是获取可靠 AI 信息的良好渠道。

Deep Dive

Shownotes Transcript

Translations:

中文

Hi, welcome back to the end of the episode of Real World Service. Today we're joined by Matt Carey, who's a returning guest. Matt is a founding AI engineer at StackOne. He's also the co-founder of AI Demo Days, which you might have seen on social media and it's been going around all over the place. I guess San Francisco, you were there recently. And he's also an advisor on the Open UK AI Advisory Board, which advises the UK government around all things AI.

Hey man, welcome back. Yeah, thank you for having me back. Super exciting to be back on. Yeah, since we last spoke, a lot of things have changed for you. So why don't you let us know what you've been up to recently? Yeah, so I think when we last spoke like a year and a half ago, maybe? Something like that, yeah. Yeah, I was working on a bunch of open source AI stuff on Quiver. We were talking about Quiver mostly.

Yeah, so everything changed. Quiver went to YC. I joined Stack One. Yeah, it's been kind of crazy since then. Still lots of AWS stuff, but now sprinkling in some more ML, which has been really good fun, getting my teeth stuck in there. So with Stack One, I never quite understood what it is exactly they do and how does AI fit into it. So maybe tell us about that. Yeah, we'll talk a little bit. So...

I guess the best way of understanding it is it's for, it's B2B, so enterprise focused. Uh, and we do integrations for any type of application. So, um, previously like massive compliance, they need to understand like what their users are up to, like whether they're, they're, they're customers, customers, like what, whether they're admins on like, um, any of their software stack. And so we, we'd give them the integrations for that. Um,

You can think about it like integrating with a whole industry at once. So we have 250 SaaS integrations and maybe like 100 HRIS, like HR systems. And so if you want to create an integration that goes from like an onboarding flow, say from an applicant tracking system to an HR system,

and you want to release that to your customers, all of your customers probably have a bunch of different HR systems that they use. And so instead of integrating with all of them, you integrate with one API, you get all of them for free.

It's a little bit of a confusing take because we sell to our customers, but then the people who actually use us is our customer's customer. And so it's all white labeled. Recently, we've found that some of the unification stuff we do is actually really fun for agents because we tackle problems like context window and making all the data really nice and clean. And we didn't really know it was going to be useful, but seems to be super useful for AI.

I joined about a year and a half ago. I can talk a little bit about that if you want. Yeah, and I joined a year and a half ago. Sort of the premise when I joined was that I was going to work on helping us build integrations faster. So building automations, creating like a workflow builder, essentially. So for our internal teams to be able to build integrations faster, for our customers to be able to build integrations on our platform, all of that fun stuff. And we did pretty well with that. We did over a year,

Really successful 90 integrations in three months is like current velocity, which is crazy. It's like super fast, so much faster than our competitors. Good fun. And then now I'm working more into like, how do we,

We have an API that you call to get employees from an HR system, from any HR system, to get the last job posting from an applicant tracking system, to send a message in a messaging system. Our API does all this stuff. But how do we make this API really useful for people building B2B agents? So how can we use it for MCP? How can we use it as more of a tool set? Things like that. That's what I'm spending a lot of time on at the moment.

Okay, well, seeing as you mentioned agents and MCP and you're wearing an agent's hat. So maybe let's go into that. Obviously, MCP has came out and it's everywhere. So on social media and everyone's kind of talking about it. So do you want to give us maybe like a high level? What is MCP and why do we even need it? Because there's already been previous attempts at integration through function calling or tools. So how is the MCP kind of different and why do we need it?

Yeah, so MCP is really fun because it's the first time that you've had a plug-in system to a bunch of applications, like not even AI applications. I think of it as like a...

like Chrome extensions. So a Chrome extension extends the functionality of your browser. And a lot of browsers that are based on Chromium support Chrome extensions. And so you have like some type of universal plugin system to do whatever you want to do with your browser, to customize it, to make it yours.

And I think of MCP as kind of like that, but for AI clients. So for things like Cursor, I'm assuming Notion will come out with something soon. Raycast also. What other apps will people have used? Canva, Figma, all of these like applications that people spend a lot of time in are really embracing AI. Yeah.

But they're embracing AI in just one way. And they want to allow... MCP allows them to put the user first and let the user add functionality they want to add when they want to add it. So...

forgetting about the technical specifications of MCP, it really just allows you to extend an application, any application that supports MCP, with whatever tools, whatever extensions that you want. So you take the AI out of the glass box, I think is how David, DSP explained it to me, the guy, the creator of MCP.

he demoed at one of our events. It was really good fun. Okay. But I guess that's quite abstract because from what I've seen so far, MCP is a protocol where you can create a definition, an API spec for, say,

one of your APIs or if you want to expose some databases that you've got running on the SQLite or whatever and you want to make it accessible to AR in a way so that you have a formal or standardized way of describing what resources and tools are available on your MCP server. And so once that's registered with, say, Cursor or CloudFly,

cloud or whatever AI model you want to use, essentially you're basically telling the AI model that here's my resources. You can use them to query data, to write data via tools. There are three APIs for MCP. You have tools, resources and prompts. Pretty much

None of the clients support resources and prompts yet. So tools is really all anyone is familiar with. So this is like your database tool. So your database tool is a few lines of JavaScript or Python that says, given these inputs, do something with them. That might be go get data from a database and return them to the client.

Yeah, you can do a crazy amount of stuff with that. So I think someone made like an Ableton server where they're making music. I've seen like some Minecraft ones. I saw a Blender one, which was really cool. Like I said, 3D models. But really the idea is just have some sort of universal system that allows you to have used the agent that someone spent a really long time creating and the agent UX that you like and the application that you like, but use it with all of your external tooling. So yeah.

it gives the client builders like a big headstart in enabling like external integration access, which is, which is really good fun. And obviously like very like poignant because I work in integrations. So yeah. Yeah. It's a,

It's a fun space. So I guess in that case, in your examples, the client is basically the people that are building the AI-powered apps. So they are the clients and they are using MCP servers to allow their application access to resources. I think you would use Blender's example to then basically allow the LLM to then create 3D renders by

connecting with Blender by sending it data so that Blender can then take that and generate a 3D rendering. Yeah, exactly. So yeah, I think it's really cool in that we haven't had, the standard like this hasn't really existed. So essentially what you would have to have done before if you try and consider what you do without MCP is you're just limited by what your client can build.

And so if you're in Cursor, for instance, you're limited by the agent that Cursor is building and the integration that Cursor can do, which is fine, which is fine. But say you want to create like a coding agent that has access to a browser and so wants to look at a browser, if we're using the coding example. To do that, you would have to, first of all, recreate Cursor, whether in a CLI or something like that, you have to recreate Cursor and then you'd give it a bunch of tools for your browser.

So it removes that whole side of it where you maintain a hacky, buggy UX that actually curses value to a certain amount of money. And they're going to put a lot of effort into the UX, the same with Figma and Canva and Notion and all of those others. They're going to put a huge amount of effort into the UX. And all you have to do is play around and give it the tools that you want when you want them. Yeah.

Right. Okay. And I guess the difference with, say, tooling or tools or function calling, whatever they call it, is that function calling would be something that's proprietary or implemented on the client side.

So someone like Cursor will implement their own function calling for tools to integrate with a specific target, whereas now you're talking about a protocol for the integration side or the server side, so that they basically say, here's how you can integrate with us. And any client, anyone building AI power apps can use the same MCP protocol to integrate with those clients.

services rather than everyone have to kind of build their own version to integrate with these different services. Yeah, so tool calling still has a place and we actually just published a tool set for a bunch of our API endpoints.

But the place is really for people who are building applications, not wanting to necessarily use applications. So if you're building your own customized UX and your own application that is your value proposition as a company, then you're going to want to use tools and you're going to want to use the code directly. And a tool call is just a way of interacting of a model doing things. And so you're going to be way closer to the model. So you're going to need to implement tool calls.

For people who aren't necessarily building, the MCP allows them to just plug in MCP servers. And we'll see lots of more. Previously, in the past few months, we've had mostly local servers. And for local servers, you kind of have to have some idea of what NPM does or what UV does for Python in order to run them because...

you have to have those things installed on your computer to run a local server. But we're going to have a lot more remote servers. And that's actually what I spent a lot of time doing is building remote MCP servers. And so for things like, for instance, Sentry, they released one very recently.

you can now just like copy and paste their config for the century remote MCP server. And it works in cursor. Like it authenticates with you. It opens up an authenticates and then you can use it and you don't have to like install anything. You don't have to do anything. It just plugs in directly. So that's like the real power of the protocol is this remote stuff.

maybe an unintended consequence, but I think it's going to be really, really cool. More and more companies are going to start hosting their own remote servers just alongside their API, I think. Okay. Actually, that brings up an interesting question as well, just in terms of authentication. How does that authentication work? So if, say...

say, Canva or whoever owns the user service and they have MCP servers or allows the AI clients to access them, how do they, I guess, authenticate against, say, my account? So how do I... What's that...

integration look like so there's a there is part of the spec for authentication it's oauth uh it's currently it's going to change a little bit for the longest time it wasn't in the spec uh but i think with the latest spec release about two weeks ago uh it was it was updated to include oauth so there is like a whole you can go into the mcp specification uh and it lays it out there's some like nice uh nice nice charts and stuff but essentially the server will implement oauth

At the moment, it's as if the server was an OAuth server itself. So the server will authenticate with the third party and will allow the user to input some credentials and things like that. But that doesn't necessarily work for enterprise who have separate OAuth servers. So they are going to update this.

And yeah, I think it's a work in progress that they're working really hard on this. There's like, there's buy-in from an obscene amount of 14, 500 companies. And so they will work this out. They will make this easy.

uh internally we've hacked it a bunch and we have like this uh api key service uh and then because it's a worth compliant now mcp it had there is there is in the spec like an authenticate an authorization header with a bearer token and so we just hack that by putting in an api key as that barrow token which works pretty well means we'd have to like host any other oauth to like

sites or anything like that but it's still a work in progress still a work in progress okay so uh i guess in that case uh what about uh google's agent to agent because that pretty much came out right after mcp and it um to the outside like myself it kind of looks like solving similar problems yeah is that right you're gonna get a lot of standards that develop around these things um

Yeah, I think it's going to be fun. It's going to be fun. It's going to be the Wild West for a summer length of time and then we'll...

Probably not even stand behind one standard, but there'll be like three or four or five. I don't know. Have you seen, you know, the meme where it's like 14 standards, one universal standard. Now we have 15 standards. Yeah. Yeah. The old, was it, was that the comic? Yeah. Yeah. So we're going to like, I think we're just going to have loads and loads of different ones. They're all going to target slightly different things. So yeah,

and different layers of the stack. So maybe they'll spread out. From what I understand, the agent-to-agent standard is much more about... Okay. Maybe let's talk a little bit about the idea of MCP. So Anthropic with MCP, the idea is that the user of their application, the user of the client, is in control the whole time. And they...

in their spec, they say to client builders that they should really allow, accept or deny on all tools and things like this. Anthropic really want the user to be in control and the user has control. And so they've built in a bunch of things to the MCP spec that really dictate this. So you can actually chain MCP servers

And when like, well, I'll explain. So there's this thing called sampling, which is a really bad name, but it basically, it says that the MCP server needs to do some AI inference. So the client has asked this called the tool. The tool is maybe get documentation from this place given this input string. And then the server goes, Oh my God, I have an input string. I need to get documentation, but I need to like classify this input string or something. So basically,

previously we'd have gone, oh, that's a sub-agent. But what MCP, and we'd have added an API key and we'd have called out to an LLM in the tool. But what MCP said was, actually, no, no, no, no, no, no, no. We give the user control. And if that tool goes and does stuff external with an AI, then the user's not in control. And so they introduced this thing called sampling, which no one uses yet, I don't think, but it's going to be, it might die. But at the moment, it's just a really interesting idea that, yeah,

that the server actually calls back to the client to say, can you do this piece of inference for me? And then the client can decide whether it wants to do this. So, for instance, in the classification task, it would call back to the client and say, can you do this AI classification to define maybe which external documentation to call? And then it goes back and then the tool goes off and does something. Not a great example, but hopefully you see where I'm coming from. The client...

The user of the client stays in control. The client can call the server. If the server needs to use AI, it can call back to the client and that's sampling. And Anthropic are like very big on this. Was that a Google take a little bit of a different approach. And this is why they say they're complimentary. Uh,

Whether you believe that or not, I don't know. But Google take a little bit of a different approach. And the approach for Google is that the tool would be a sub-agent. That's why they call it agent to agent, because the idea of sampling, the idea of the client staying in control is less important.

It's just not there. It's much more about one agent for each particular task. So in my case, in my HR tech world, we'd have one agent that knows about applicant tracking systems, and we'd have one agent that knows about HR systems, and we'd have one agent that knows about messaging systems. And if you needed something doing, you would have an orchestrator agent that just ties in and goes through to all of these different, just calls out to these individual sub-processes.

or a user that calls out to these individual sub-processors. So that's where they sit. They're just a different philosophy. But yeah, many, many standards. Okay.

Yeah, the whole AI agent, well, separate agent per task, that's kind of more similar. I guess that's more aligned with the idea of agents that I've seen online. That seems to be what people are doing. And that actually brings to a different question. They released a whole white paper on that. I really would recommend anyone listening to read the Google white paper because

It's really good. I think it's just called Agents. But if you read that one and then you read the Anthropic Building Effective Agents, you'll see the two sides of the same coin and the different avenues that people are going down. Okay. Yeah, because actually...

Actually, I asked a question on the LinkedIn recently that at what point something becomes an agentic system because there's a lot of examples I've seen, especially from the more AWS side of things. It's, okay, here's a state machine workflow with calls for different bedrock prompts, I guess, and they call it agents. But then again, there's no

you've got a static behavior, you've got a static flow. There's no sense of, okay, the AI is determining what happens next. And the same thing with, I tried out Postman's AI workflow builder. How's that? It's a really nice intuitive, I guess, workflow builder. But at the same time, the paths are quite static.

static, it's predetermined. You're using AI to make decisions. And then you can say, you've got a couple of steps, they generate some output, and then they become input for the next step, which is another AI prompt. And that can decide to say, okay, let's do X or Y. And then based on that, you can do some branching.

But it's not the same as say, okay, AI agent comes in. I think you use the example where you've got an orchestrator agent that decides what to do next, and the task is then passed along to one of the, I guess, specialized agents.

So looking at that, it still feels very much like a DAG, like a state machine, as opposed to something that I think you pointed to the anthropic paper, that an agent, a workflow is something that's kind of like a DAG, whereas the agents is more when the AI model has agency to determine what happens, when it's done, what to do next. If you can decide whether to call tools or not, and...

I don't know if anyone's coded that like whole process to decide whether to call a tool or not. You basically have a loop and you have a tool and then you have some method of executing the tool. So if it has that loop, I would say it's an agent. It's an agent. Okay. That's my personal definition, but like,

So even if you just have a function calling with, say, one LLM prompt, because you can do the same thing with, say, Bedrock. That's what we kind of did where you make a request with a prompt and the response says, oh, this is a tool calling and you have to implement those functions for tool calling. Even when that happens, you think that classifies as an agent? Yeah.

Yeah, if it's like single shot and so you just get the output out and you're using it to structure an output or you're using it as like the next stage in a process, then no. But if after, if at the end you go back to the model and say, now what would you like to do next? Then to me, it's an agent, yeah.

Okay, right. In that case, I guess the question then becomes, I've seen a lot of examples of agents that feel like it's over-engineering. I mean, at what point would you suggest that, okay, you should build an agent, a genetic system, as opposed to just something, maybe you have to make multiple prompts, multiple calls to the LLM, and it feels like a lot of problems probably can be solved with that, as opposed to having to actually

user function calling or what have you, at what point would you say, okay, this is something that you should implement with an agent as opposed to just maybe a sequence of LLM prompts? Yeah, I guess you just got to try. Okay. Yeah, you just got to play with it. You just got to play with it. So, I mean, the easiest thing is to start with like ChatGPT and just be, when you're building with LLMs, it really is the most straightforward way

And then just like, oh, can I solve this problem given some inputs and outputs on ChatGPT? And if you can, then it's like a single LLM call.

And then if you can go to the next stage, if you can think of it like a state machine, then try and model it like a state machine. But if at some point you get to a point where there's like, there are so many little heuristics that decide which path and which avenue to go down, you'll put it into production and well, not even in production, you'll

if you get 20 inputs and 20 outputs of what you're meant to have in your system, that's like, like a baseline for building any type of AI application is like 20 good inputs, 20 good outputs. And then you just write some tests, write some evals, um, evaluations, basically tests in AI land. Um, they're literally just tests. You can run them with like a test runner. Um,

But yeah, if you write some evals, you'll find out pretty quickly whether you can model it as a state machine, as a DAC, or whether the situation is so unbounded that you have to rely on the LLM. You can do some amazing stuff just by structuring the output of unstructured text, because that's what LLMs are really good at, right? You take...

Oh, a guy's going really fast outside my window. I don't know if you can hear that. LLMs are really good at taking some long, unstructured, very noisy text or a medium, actually. Now we have multimodal. And transforming them into some sort of structured output, whether it's a tool call or like...

or even json you know and so that that is hugely useful just by itself so maybe that's all your application needs um but i think as you push more and more and more and increasingly give uh

I don't want to call it agency, but complexity over to this model, you'll get to a point where you're like, I can't get any further on my evals. There's so many edge cases. We need to give the model agency. And I think we first saw that with maybe GPT-3, sorry, Sonnet 3.5 and a bit more so with 3.6 and then a little bit more so with 3.7.

That 3.6 is the one that they call it 3.5 new. I think it's just, we just colloquially called it 3.6. And you see it more with 3.7 that...

Yeah, the whole naming of models is a whole other clusterfuck. We can talk about that sometime as well. 3.7 was like the first model that I've used in an agentic context that actually really works. And then you can unlock some even more experiences. So it's like the first one is taking unstructured stuff to making it structured. The second one is like we take unstructured stuff, we make it structured, and we can also do things with that data.

So it's just like, yeah, how much do you want to hand over to the model? Start simple. Okay. And I guess when the behavior is less deterministic, I guess that's when you start looking at the, can I use the AI agent and just give the model more control over, I guess, how we do things, how we process things. I guess that kind of brings the second question is, okay, if you can't,

just one shot it and have the model do it in one go they need to do some other stuff um so that's where you know i guess i can see where the benefit comes in where you can use more specialized models for more specialized tasks um but then the the thing that i kind of i guess i've never quite reconciled is well how there's all these models out there how do i choose how

how can I easily choose the right model for each task? Do people just kind of end up using the most capable one for every single thing, but then they have different, you know, prompts, system prompt and user prompt, or maybe have different rack setup for each task, or is there some way where you say, okay, for this step, I'll use Cloud 3.5. The next step, I'll use...

Gemini 2.5 Pro, and next I will use DeepSeq. Good model that, the Gemini 2.5 Pro. Very good. Have you used it for coding?

Not for coding, but I've used it for just chat and I saw a lot of the benchmarks that puts it quite high. One of the ones I want to talk about later, maybe it's just about the performance with reasoning with large context window, but we can talk a bit more. We can touch that afterwards. Picking models is crazy hard. I mean, it's like...

Yeah, there's so many and they change very often and it feels like you're chasing your own tail a lot of the time. I guess the only way to make sure something works is to test it. Okay. And so, and this is so many people go wrong and like, I've gone wrong with this so many times. When you start a new AI project,

Just for the love of God, you need to have some idea of what good looks like. And so you do need the 20. I said, I know I said it like five minutes ago, but you, you need to create 20 gold standard inputs and 20 gold standard outputs and

and find some way to evaluate the output you get against the output that you have as gold standard. And like, whether that means, and this is really hard, like, especially for developers building with AI stuff, because suddenly they're like, oh my God, like I'm building in the legal tech domain. I don't know what this means. Well, you actually like have, unfortunately you have to go and learn what that means. What your outputs mean, what they like,

Whether something is good, either you have to go and learn or you have to find a stakeholder who knows and you have to chase them or you, and like you have to build this thing. We've done loads of stuff where just send a bunch of schedule messages, like one a day to someone who actually knows what's going on. And then get, just get them to mark up the answers, get them to correct it or to even just say good or bad and then build up like data sets as fast as possible.

for what good looks like. Because then once you have those tests, you can test against multiple models and you can just find the one that works. Start expensive, go cheap, and then see where you get to a level of success that you're happy with. Right. Okay.

Yeah, I guess that's where you're going to need more and more perhaps just domain experts involved in the process just so that you can even write the right prompt or come up with the right test for your, well, the right evals for your models as part of your process. Yeah, and that's super hard. Like even things like, there's some guys that are really good at things like this. So I would direct you to Hamil Hussain, Jason Liu. They have some really good

uh hamill has like some of the best blog posts about this type of stuff but and so i'm just gonna like regurgitate what he says but really like build um a nice a nice way of doing it is building little custom uis uh to allow people to mark up data and so you could do that with like claude artifacts or v0 or however you want to do it but

just something backed by an Excel spreadsheet. It doesn't have to be fancy that just allows people to see another input and then another output, and then they correct the output. And yeah,

Very quickly, you'll have some data and you need at least 20 questions. If you can't do 20 questions and you can't get that together, you probably shouldn't be building something non-deterministic. Okay. And also, I guess that's for each task. So depending on how many tasks you have, how many agents you have, you're going to need more and more and more of these test cases. It's why it can be easier to start thinking about it agentic because really you just need the...

the full integration test, as it were, like you just need the start and the end, was that if you break it up into a DAG, you really do need every step along the process. Yeah. So try and start with as minimal LLM calls as possible because you do need those 20 questions for each. And really you need 100. And in the future, if you're thinking about improving complexity, reducing cost, you're going to want to get 150, 200, 300 of those gold standard questions

And then you can start like many more doors open to you. You can start thinking about fine tuning that some stuff, maybe embeddings models you want to fine tune. If you're doing some sort of retrieval task, um, re-rankers also really good for fine tuning. Um,

But you just need to have some idea of how your system is performing. So tracking those metrics, tracking how you're doing with those gold standard inputs and output evals, that's the only way you can do it. Otherwise, you're just shooting in the dark. Benchmarks are so saturated, so bad, and guaranteed none of them look like your use case. Even the coding benchmarks, you know, like SweeBench? It's just Python.

Only Python? Yeah, it's mostly just fixing type hints in Python. They're really not representative questions of my day-to-day coding at all.

Yeah, not at all. That's surprising because people use that as almost like the gold standard and basically to back up argument that, oh, AIs can code as well as developers. Yeah, it's just Python. Okay. It's just one language. Okay. And it's like researcher Python. Like it's, yeah. You should have a look at it. Like these benchmarks, it's worth just having a look at some of the data.

and just having a look through, seeing what it looks like, seeing what they're actually evaluating against. I'm guaranteed you'll be shocked. I mean, that's consistent with a lot of the anecdotal experience I have and also other people have about just how...

good or how bad coding with AI is. I mean, it's useful up to a point, but then pretty quickly falls apart with anything a little bit more complicated than, I guess, working at the function level or module level. Once you start looking at, okay, the entire system level, you just start doing some really silly things from time to time. Yeah, I think it's like using models for what they're good at.

I was coding with a colleague recently and there was this really constrained problem. We were building a parser for a chat application. So this chat application outputs a long stream of text and it's delimited by XML tags.

And the tags define like where it should be placed in the front end. So some of it is just an answer. So it should just be streamed to the answer. But some of it is like citations. There is some like thinking section. There's like a tool calling section. And they're all, it's just, it's just one string really. Like the model just outputs a string delimited by XML.

It's not necessarily the best way of doing it, but we set this up ages ago and that was how it did it. And the front end was getting kind of messy, like passing those tags. And so we were just like, oh, we'll just move the passes to the back end. We'll make it a package. We'll make it like super nice and neat. And so all I did was I wrote like 10 or maybe not even 10, maybe like eight tests, like really descriptive tests of what I thought the end result of the passes should look like.

And then I just, then I just one-shotted it with Gemini 2.5 and it's,

genuinely after that like as long as it passed those eight i actually didn't really care what it did um it was like a cursory glance like it's worked for the last two weeks and it will probably carry on working because it passed my tests you know like those tests were um they were exactly what i needed and so i think a lot of people struggle with with model with lms because there's no there's no bounding there's no bounding area there's no forcing function

And so if you have things like really good typing in your code base, if you have things like tests that really do... If you have a constrained problem, models are really, really good, especially these new agentic models. Right. And I guess you just mentioned there about the Gemini 2.5 Pro again. So let's bring it back to the thing we were talking about earlier, whereby...

A lot of these models now have got really big context window. Let me see, Lama 4 came out, has got 10 million context window. But I remember when we last spoke at the serverless London meetup, you mentioned to me that, yeah, some of these models has got like a million token context window, but they can't reason with that. All the models that they publish for providers, they publish...

the Haystack test, basically testing how good the models are at recalling information. But it doesn't mean that they can actually reason with that much data in the context window. And since then, I've been keeping an eye out for other benchmarks and things like that. And I saw there was a Fiction Live Bench

which basically looks at how good the model is able to reason with different payload sizes. And apart from GemLight 2.5 Pro, which when it gets to 120,000 tokens in the context window, it can still get 90% accurate. But if you look at something like DeepSeq R1, it goes all the way down to 33% accuracy.

So even though you may have much bigger context window, you can't really use much of it at all. Most of these models just pretty much fall apart as soon as you hit about 40, 50 thousand tokens. So that's it. Lama 4, Skeleton, Maverick. By the time you hit about 60 thousand tokens, it's about 30% accuracy. I mean, I think that being nice, like

We have a lot of internal benchmarks and mostly you can't get over 20,000 tokens on any model. It doesn't matter which one it is. You've got to be really careful about stuffing the context window. I actually just updated my CodaView GPT project. I don't know if we talked about it last time, but

It's been stale for ages. I just updated it because at the time we were like, oh, my God, we were trying to optimize for the reduction in API calls. So we were filling the context window as much as possible with your code to say, review these pieces of code, fill the context window. And so we could make less calls because we were like, oh, my God, that'd be better. We just make less calls. We do them in parallel. We're less likely to get rate limited and all this stuff.

It's so naive because actually, um, we were filling like for a mini to like 90% of its context window. And I know for a fact it starts apps. It starts regurgitating rubbish after about 10 K tokens. Right. So I'd be really careful there. Like there have been some really good benchmarks going out recently. I've seen the fiction one. Um,

But yeah, I would use your own benchmarks and just be aware that if a benchmark came out before the last generation of the model, after the last generation of the model, 100% is saturated. There's so much money at stake and these benchmarks find their way into all sorts of training code. I'd be really worried.

like I remember the headline about, I've forgotten which model it was. I don't want to point fingers, but there was something about, about Harry Potter, like a whole Harry Potter, all the Harry Potter books being put through one model. And it noticed that they changed the name, they changed the color of like pizza or something, or they added like a text, like pizza is blue into the middle. And then they asked the model, what looks weird about this text? And it was like all the Harry Potter books and the model got it correct. And yeah,

I would counteract that with, of course it got it correct. It was trained on those books. And so, yeah.

It's going to know that there is a difference between the book it was trained on and the book that you're seeing right now. Right. So, yeah, I would find as much proprietary... There are some really cool things you can do with benchmarks. So first of all is find as much internal data as possible that you have. Long reports, long anything that just has never seen the internet. And...

Run your benchmarks using those. So if you have a bunch of reports, stack them up together and then ask the model about something in the middle. That's a really good long context benchmark that anyone can do pretty much who works at a company because they'll have loads of random data internally. Just stack your notion together. Your confluence, your Jira tickets. It doesn't really matter. Just stack some data together and try and reason about something in the middle.

So that's one thing you can do. The other thing you can do, which is even more funny, is you can use this idea of like the labs needing new benchmarks and then saturating benchmarks. If you publish your own benchmark about something that really matters to you, guaranteed the next model that comes out will be better. And so you basically have a bunch of PhDs working for you for free. And so I would also recommend companies to try and do that because that's very good fun.

Right, okay. Yeah, I've tried those experiments in the past where I tried to load a large document and then I see if I can actually get some reasonable answers and they never worked. I think I tried with ChatGPT, tried with a few other models where it just completely falls apart.

Of course, at the time, I didn't have the context where, okay, the context window is there, but you can't really use it much if you actually want to get reasonable answers from the model. Yeah.

And some releases are just hype, you know, like there was a Chinese lab release. Some of like, we use a bunch of the models from this provider internally. We use them self-hosted very, very good. I'm not gonna tell you the provider, but they released like a 1 million token version of their standard model, which was like 125, 125 K. And they released a 1 million token version. And yeah,

And we were like, oh my God, 1 million token. Yeah, we'll be able to see loads more stuff with it. It was actually just slightly worse at long context than the standard token length. It died at around 15K. And so like, it's pointless. The number at the end, it was just, it becomes just marketing at that point. And that maybe is a sell signal to stop using companies models where they actually put in resource into that because maybe they're not putting resource into actually getting better. I don't know. Google's an interesting one though, because Google have,

Google have a different, like their architectural style of their chips and everything is all quite well designed for long context. Like I'm sure someone will probably correct us in comments and things like this, but yeah, definitely go and have a, like it's worth a little research. I don't want to talk about it because I'm probably going to butcher it massively, but the way TPUs are designed, they, they have some preeminence for long context inference, at least if not training, I can't remember. Yeah.

But yeah, there's some fun stuff going on there with Google. I really can't, but yeah, I'd love to buy a TPU, just have one in my house, yeah. Be really fun. - Yeah, how much is that gonna put you back at 10,000, 20,000 K? - I don't know, I don't think they actually sell them. I think that's actually more the problem. - Okay. - It's just not a commercial business for Google selling TPUs.

or at least not like a singular rack level, like 10 TPUs. Yeah, I don't know. Okay. So yeah, Matt, thank you so much for joining us again. I guess before, last thing, how can people find you and also how can they keep up with all these things that's changing around AI? And I mean, personally, I struggle a lot just in terms of finding people

places and get a more educated take on all the hype because everything is game changing. Everything is, oh, this is going to change the world. And it's just exhausting. And it's marketing. So how can someone find sources that are more noise or at least have better signal to noise ratio? Yeah.

Oh, it's like super hard. Yeah, I was at the Cloudflare Connect conference a week or so ago and they were asking the same thing. It was like, how do we work out what's not just noise? I mean, it's easier for me because I work doing this daily. And so for me to spend an hour just to try a new model, to see what it's like to try a new technique, it's not the end of the world. But I really think like...

It's going to sound really bad. The best way to do it is to have a few filters, like a few friends who do that type of stuff who can filter it for you. That's like the best way. X is like Twitter is pretty good for more current stuff. So for research papers, but you really have to like curate your feed quite heavily. Yeah.

There's a couple of guys I follow which seem quite reasonable. There's Simon, I forgot his surname. I think he usually gives quite a good... Simon Willison? Yeah, he's a British guy, isn't he? Yeah, he is. Yeah, I watch podcasts with him. So podcasts are also amazing. So anywhere where there's a bit of a barrier for entry, it's kind of hard to do. So blogs, long-form blogs, podcasts...

to some extent, research papers, they're really good. What's really bad and which you should just block all the time is like five amazing things you never heard about MCP. Like, or like, you'll never guess what MCP will change. Like those people I tend to block because it just hurts my brain. And I don't have the mental energy for that. And all the stuff, forget chargeivity. This is three models you should use. Exactly. And like in-person, in-person stuff is really good as well because like the in-person talks, um,

they're really good because it's a barrier for entry, right? Someone's done some curation there, whether it's the talk organizer, even the speaker or the speaker's company. There is some curation there that you don't have to do.

And so, yeah, trying to leverage like other people's curation is, is, is really good. It's really good. There's a newsletter that I, um, that I really like by, uh, Thorsten Ball. Um, he, he made a bunch of books about building interpreters in Go. He worked at Zed for a bit. He works at Sourcegraph now. Um, what's it called? The newsletter? Uh,

It's okay. You can send it to me afterwards. Yeah, it's called Joy and Curiosity. Okay. It's just a Thorsten Bull. He only has one, so you'll find it. And he just posts some of the most, like he has some of the most interesting blogs that he links there. So I mostly read blogs from there, Twitter from friends, in person with people that I find on Twitter who I want to meet in person, I guess.

is the best thing i run ai demo days so if anyone's around in london uh we'll have another event in may sometime we have them every two or so months

They're publicized all over LinkedIn and Twitter and things like that. If you're into this world, you'll probably find them. Yeah, AI Demo Days. So it's demodays.ai. Okay, all right. I will put those links in the description below so anyone else can find it as well. And for your information, you are my friend that I go to to find and filter information about AI and what to actually pay attention to.

But yeah, thank you so much for coming on again and sharing with us what's going on with AI and what's actually happening with the MCP and why we should care. But yeah. Yeah, thanks for having me. It's been really good fun. Yeah, looking forward to seeing you again in London. Maybe next time I'm in London, we'll give you a shout and we can meet up again. Yeah, 100%. All right, take it easy, guys. And good to see you again. And I'll see you guys next time. See you in a bit.

So that's it for another episode of Real World Serverless. To access the show notes, please go to realworldserverless.com. If you want to learn how to build production-ready serverless applications, please check out my upcoming courses at productionreadyserverless.com. And I'll see you guys next time.

#116: AI Agents, MCP and the problems with AI benchmarks | ft. Matt Carey 48:08 Share

Real World Serverless with theburningmonk

Deep Dive

Shownotes Transcript

#116: AI Agents, MCP and the problems with AI benchmarks | ft. Matt Carey