Welcome, everyone, to a conversation between myself and my good friend Paul van der Boor, all about how the Prosus team is using AI agents across their companies, specifically at Prosus itself and sometimes in their portfolio companies. We get into all the nitty-gritty details on how they have been innovating and what some of the challenges have been, specifically around using AI agents. Let's get into this conversation. We're going to be talking all about agents and, more specifically, how you all in this global group of companies, including Delivery Hero and OLX, are using them. You've got over a thousand ML practitioners in this group, and you're bringing agents and AI use cases to the over 30,000 people that make up the global collective group.
And knowing that, you've had some hard-earned lessons, and that's really what I want to dive into: technical hard-earned lessons, user adoption, UX, UI, all of the fun stuff, because you've been doing this since late 2022, when ChatGPT first came onto the market, trying to figure out how to make it more useful. We should probably start with what we were talking about yesterday, when we had a bit of a Mexican standoff and we said, so what is an agent? We both looked at each other and we're like,
AI that can do stuff, right? You might have a better example of that. - Let's start with that. So what is an agent? So an agent, in my simplest description is essentially an LLM that interacts with the world. - I like that. - And we've obviously had LLMs or anybody in the space working with LLMs for years now, right? And the first versions of them, GPT-2 and so on, we've been playing around with it. We're sort of figuring out what's coming.
But at the end of the day, those LLMs were fairly isolated tools, right? It's a little reasoning engine in a box and you can give it a token and it gives you a token back. And that was, you know, of course, super impressive. And we saw that with the ChatGPT moment that sort of jumped in sort of everybody's lives. But one of the things that, you know, we saw as a fairly obvious next act of
generative AI is agents, where these LLMs would be able to interact with the world. And how they interact with the world, obviously, it's almost like a dial; you're dialing it up, right? So the first things we saw were: maybe they can access the web, maybe they can access computer environments, maybe they can access APIs, maybe they can start to interact with the browser. So that's sort of this idea of an agent. And to your point, we've been working on this for years, because at Prosus we're probably...
One of the world's largest tech investors, focused heavily on e-commerce, serving about 2 billion consumers through those various companies that you mentioned. - Two billion! - Two billion, that's a lot. Yeah, it's a lot, right? - I don't even know how big that is, yeah. - iFood in Brazil, and Swiggy in India, and Stack Overflow, and Delivery Hero, and OLX, and many other companies, about 100 companies in the group.
So there's a ton of different opportunities for AI in general. And then if you go to agents, it becomes incredibly interesting and exciting to see all the different things we can build. So that's why I've been investing in this space for a long time with our team here in Amsterdam. We organized this conference recently.
Yeah, that's kind of the inspiration for this whole series is because we did the conference together as a virtual conference. And then we realized we want to create...
more and go deeper, because what I saw in that conference was that you all are doing some very advanced stuff with agents, and the conference was all around agents. It was Agents in Production, a virtual conference. We saw the most cutting-edge things that are happening with agents out there in the world. And my conclusion was we need to have more conversations, because I want to hear what you've done.
This episode is going to be us breaking down the agent space, what it's comprised of, and then we'll bring on Floris, our good friend, who's going to talk about some hard-earned lessons that you all have had, what agents you've tried that died, so the whole graveyard of agents, and then what agents you actually were able to stick with and have been providing business value. How are you looking at that business value? How are you like
putting metrics around if the agent is useful or not. And so before we bring him on, we should talk a little bit more about what components make up an agent. But maybe taking a step back and thinking about these agents, then why is it so hard
to make them work, and why is the graveyard still, you know, populating itself so fast? It's because there are a lot of unknown pieces. So think of an analogy, right? I don't know if you've ever built things with a Raspberry Pi or, you know, Mars rovers. I have a young son, so we're building a Mars rover, which is essentially a Raspberry Pi that connects to a bunch of sensors, a microphone, a camera,
and a memory chip so you can see sort of what its path is. And for me, if I take this to the world of Gen-AI, what we basically have today with these powerful Gen-AI models, large language models,
is just the reasoning engine, is basically the Raspberry Pi without anything else, right? And that was the LLM. And now in the agentic world, we're trying to figure out how do you put this into a system that can interact with the world, right? That can actually...
understand, you know, what history of interactions it had. So memory, like this Mars rover needs to know what's its path, where does it need to go. It needs to maybe be able to take actions and, you know, decide to go left or right. In the agentic world, it's I need to access an API to fetch information or I need to store, you know, create a file and store it somewhere. It needs wheels. That's it. And the tools are part of it because that gives...
the LLM the ability to interact with data that's up to date, because of course the data it has seen during training has a cutoff. - Very different, yeah. - So you need other data, maybe that's proprietary data or data related to something that happened today. It may need to actually generate its own data, to not just read but also write, because it knows that you and I are interacting and you've asked me certain types of questions over time. So that needs to be stored in memory. Yeah. And when does that memory get accessed? Then, when it generates an answer, it may want to actually think and critique that answer. So it's not just a one-shot token prediction: it generates a plan based on what you're asking it to do. It can go through that plan, critique it, look at the end of the steps it's followed, whether that plan materialized, go back to a step and revise. So that entire system is what you need to have working not just once but reliably, especially if you're going to ship it to real customers in production. And that's been the journey we've been on, right? The journey has been moving from a small unit of reasoning, the LLM, yeah, the powerful Raspberry Pi, that now needs to be reliably connected to all these other pieces so it can help with much more sophisticated tasks. And in fact, that's the promise, right? Moving from just a simple Q&A device to something that can actually help with much more sophisticated, complex tasks, especially in the e-commerce journey. In our ecosystem at Prosus, there's a ton of opportunity to do that. The Raspberry Pi to the Mars rover: that's what we're doing right now. I like that metaphor.
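For readers who want to see the shape of what's being described, here is a minimal sketch of an LLM in a loop with tools and memory: plan, act, read back the observation, repeat. The fake_llm stub, the message format, and the single weather tool are illustrative assumptions, not Prosus's implementation.

```python
# Minimal sketch of the "LLM + tools + memory" loop described above.
# fake_llm stands in for a real chat-completion API; tools and message
# format are hypothetical, not any particular production system.
import json

TOOLS = {
    "fetch_weather": lambda city: {"city": city, "forecast": "sunny"},
}

def fake_llm(messages, tools):
    """Placeholder model: asks for a tool once, then answers from the result."""
    if not any(m["role"] == "tool" for m in messages):
        return {"tool_call": {"name": "fetch_weather", "arguments": {"city": "Amsterdam"}}}
    observation = messages[-1]["content"]
    return {"content": f"Based on the tool result {observation}, it looks sunny."}

def run_agent(task, llm=fake_llm, max_steps=5):
    memory = [{"role": "user", "content": task}]          # conversation history = short-term memory
    for _ in range(max_steps):
        reply = llm(memory, tools=list(TOOLS))            # reasoning step
        if "tool_call" in reply:                          # the model decided to act on the world
            name = reply["tool_call"]["name"]
            args = reply["tool_call"]["arguments"]
            result = TOOLS[name](**args)                  # execute the tool
            memory.append({"role": "tool", "content": json.dumps(result)})  # write observation back
            continue                                      # let the model plan/critique the next step
        return reply["content"]                           # no more actions: final answer
    return "Stopped: step budget exhausted."

print(run_agent("Do I need an umbrella today?"))
```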
And there are some specific difficulties that arise when you start doing that. I think two things that we wanted to call out, especially right now, are that evaluation can be really difficult, and then you're looking at latency requirements, and cost too. Yeah. Because, as you mentioned, if you're doing all of these
LLM calls, it can add up quickly. And we're going to go over all that fun stuff later on in different episodes and break down specifically how you can do evaluation and what you can look for. But now we should probably talk about the different ways that you can use agents or that agents manifest. I think we could call it because back to that conference, we saw lots of different ways that agents are being used and
In a broad sense, I kind of bucket agents into: you have agents like the computer-use agents that came out from Anthropic, which can use your whole computer and take over; you have web agents, which are a little bit more of an intermediary, they're not using your whole computer, but they're using your browser; and then you have agents that are interacting with the world through APIs, which I think is probably the most common design pattern these days.
And then you also can have voice AI agents. And so interacting with an agent that you're talking to on the phone or maybe on a Zoom call,
There's probably room for us to throw in there agents that you see in video games, NPCs maybe. I don't know if you want to call that a full-blown agent, but it feels like it could be one also. Yeah, for sure. I mean, you're describing the spectrum of complexity, and I think it also gives us a sense of where we're headed, some of it very near. So indeed, the natural first set of tools that you want to give these agents are APIs, because they're well documented, well structured. You know what needs to go in. You know what you expect back. You can test against that. It makes evaluation a lot easier. So the agents in production that we are working with are typically going to be using well-defined APIs that are fairly simple, compared to doing, let's say, more open-ended web browsing, for example. Of course, we're testing that too, and we can share what we've learned and why that's also very hard and more expensive and takes time. We're going to have a whole episode on web agents, and so Dave is going to come and be our resident expert. Because that's the fun thing, too, that we... I should mention, we get to pull from all the folks that are working at Prosus
who are doing deep dives on each one of these topics. And they get to come and tell us what they've learned over the past six months, just focusing only on that. Right. So that's exactly what we're going to be doing. And I think if you walk along the sort of levels of sophistication, we need a framework for this. So somebody, I don't know, maybe somebody's come up with what are these sort of levels of complexities for agents. But, you know, from APIs to browsing and just everything
in these two levels, if we should call them that way, there's so much opportunity still to make these things work. I mean, in the world of e-commerce, online marketplaces, platforms, you know, there's so many things that if you just can give an agent access to the web,
to the app, to the APIs, that they can help you with. All of a sudden, they can book trips, order food, help you pick products, and so on. That's after web browsing. The next step is giving them access to a computer, a desktop. And we're working with various companies, startups out there, that do exactly that. And you can see the progress on benchmarks like OSWorld: these agents can now basically create pivot tables for you with very brief instructions, or they can download files and process them, and so on. That is coming soon; 2025 will probably give us a whole bunch of new exciting things and products on that front. Then if you go one step further, they can start interacting with the real world, right? So robotics. Yeah, that's another kind of agent I didn't even mention. That's true. So those are the levels of sophistication that I think we expect to see maturing over the next months, years.
Sometimes things go faster. There's some prime real estate for some real thought leadership there with that map of the difficulties. Let's do it. They're volunteers. So one thing that we didn't really talk about is why use agents and why not just use traditional methods? Because
It seems like we add a lot of complexity. There is that benefit of, hey, I can just tell something to go do it and it'll do it for me.
But a lot of times you end up banging your head against the wall because it is so difficult. I think the simple answer is that you just add so many more possibilities, or tasks, that you can have these systems do for you once you move to an agentic world and give them access to tools. It's the obvious next thing to do, because we've kind of gone through the question-answer world, and these AI systems are now part of our lives. They do that reasonably well. Of course, there's a ton of room for improvement. But as they become agentic, there are many more things they can do. And again, at Prosus, we're a large ecosystem of companies that help our users do things easier, better, faster as they interact with our e-commerce platforms. We see that
The agentic capabilities allow us to do much more on that front. So...
And by the way, I will also say that one of the things you notice, and we'll talk about what we learned as we tried to apply this to current systems, is that the world is not ready; it hasn't been made for agentic systems. The interface is the API. Sure, APIs exist, but they're not made for agents to interact with. Well, they break all the time, too. It's really hard to get a very trustworthy API, even just for weather. You can go looking, and just getting the weather, which you would think is a solved, simple problem, that's hard. And then try to go a few steps deeper to more complex APIs; each API is different and it's constantly changing. Right. And if you're not up to date with those changes, or you don't have some way to keep your agents up to date, then you're looking at a whole world of hurt. That's right. Preaching to the choir right here. There is something I want to bring up too, around how Prosus works with the different companies, because I tend to funnel everyone I know that is starting a company to you. Because you're in such a unique position, and this position is that you have a ton of users, a ton of ML talent, and you know what problems need solving. There are users from the portfolio companies, but then there are also internal users, because the Prosus group is gigantic. Maybe we can touch on that a little bit more before we jump into it, because that gives more insight into how agents are valuable to you all and how you know what is actually worth doing versus not. Yeah, that's a great point. The setup we have at Prosus is that we're a large global tech investor, the largest in Europe,
with operations all over the world, right? We've got food delivery in India and in Brazil and classifieds in Eastern Europe and education technology in the US and media in South Africa and many other companies, about 100 companies in the group, all with a tech angle, with their own tech hubs and AI teams.
We are in a very unique position to be able to work with them closely on lots of topics. Our focus, of course, is AI and increasingly now agents to figure out how can we solve real user problems. Like I mentioned, we have about 2 billion consumers across the group.
Two billion. That's a big group and they're all over the world. Every time you say that, I'm going to react with... This year it's two billion, probably next year it's three billion. Oh my God. And we do believe that...
The agentic systems that we're building are going to be able to solve lots of our user problems, helping make bookings, make transactions easier, find the right products they're looking for, learn things faster. And so...
All of these, let's say, real user problems are things we're trying to solve for. Our team, the AI team at Prosus, is based in Amsterdam, and our job is to work very closely with the AI teams in the group companies to help them accelerate some of the cool use cases we think are going to be really valuable for the group, in the e-commerce space in particular. In doing that, we typically identify what the problems are. When you build agentic systems in production, and we talked about all the issues, you need to make it affordable, you need to make it safe, you need to make it scalable. So we identify those problems, and then we go out there to find out if anybody's offering a product or a solution for them. We'll typically talk to founders and startups in this space, and when we like what they're doing, we either partner with them as design partners or we can also invest in them; we've got tons of examples there. And that's cool. On the design partner front, it is so valuable to have a company that is so advanced in understanding what is important, and then to be able to plug in with you all. I know that a few companies I've introduced you to have come back to me and they're like, oh my God, thank you so much. Because again, the whole reason we're doing this is...
I think you all are doing some of the most advanced stuff when it comes to agents. When you get companies that become design partners, they get to see how advanced you really are. And so if they're on the cutting edge, they...
get to see the scale of what you're doing and then recognize whether their tech holds up to that scale. Yeah, I think that's a valid point. The problems we face today are probably challenges that many others will face soon as well, whether that's in months or years, as they also start to build these systems in production at scale. One of those things was cost, right? That's one. Yeah, that's a great example, where we've been continuously modeling what the impact of agentic systems in production is on the cost profile. The numbers we were looking at at the beginning were simply cost per token. But you realize that's not really representative, because as you use agents to answer questions or fulfill tasks, they use many more tokens and can do many more things. For our internal assistant Toqan, which you spoke about, we measured how much time they save per user, per question. These systems can do that much better.
They consume more tokens, so they become more expensive, but the value you get for that is higher. And so we model these things. I blatantly stole that, by the way. I took that from you guys from the Agents in Production conference and made a blog post on it in response to some other VC who was talking about, he was saying the same narrative that you see all over the internet when a new model comes out
or a new update from OpenAI or Anthropic comes out, and they say, "The cost is just plummeting per token." And so I took your insight, or maybe it was one of the other speakers' from the conference, which was:
Price per token is going down, but price per answer is actually going up because of these complex systems that we've got going on and how many LLM calls you're making. We all intuitively know the costs go down. Okay. But we measure this, right? Oh, nice. So to give you an idea, over last summer, we looked at how many tokens do we use to answer a given question in our internal assistant. It went up by 150%.
So more than double the number of tokens per question. At the same time, over that same period, it was about a three-and-a-half-month period, the cost per token went down by about 50%. But the questions we're answering and the tasks we're doing with Toqan also become more valuable.
We're saving people more time. They're using it more, right? They're also using it more. So then the tokens per user go up. So ultimately, the budget that we have, you know, in token budget, actually goes up. Right? And then we measure how much time we save per question, and that also goes up.
So we actually model this and we have real-time insights. We benchmark various models on quality, on cost, on a leaderboard, as you know, and so on. We'll talk about that. But one thing is, everybody knows the cost per token goes down.
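To make the arithmetic concrete, here is a back-of-the-envelope sketch of the trade-off being described, using the rough numbers from this conversation; it is illustrative only, not Prosus's actual cost model.

```python
# Back-of-the-envelope version of the modelling described here, using the rough
# numbers from the conversation (illustrative only, not actual figures).
tokens_per_question_growth = 2.5   # tokens used per question went up by ~150%
cost_per_token_change      = 0.5   # cost per token dropped by ~50% over the same period

cost_per_question_change = tokens_per_question_growth * cost_per_token_change
print(f"Cost per answered question changed by x{cost_per_question_change:.2f}")  # x1.25: ~25% more expensive

# The point: a falling cost per token does not imply a falling cost per answer.
# Whether that is worth it depends on the value side (time saved per question),
# which is measured alongside the token budget.
```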
But what's the ROI that you get on that, right? The cost per unit of intelligence goes down, but does the return you get on that intelligence change as you build these agentic systems, and so on? So we're in a position to measure this across various tools. Cost per unit of intelligence. Yeah, that's a way to describe a token, right? So you've got a token as just basically a part of a generation, and together it's a system that has some intelligence. And so anyway, we see the cost per unit of intelligence trending to zero over the long term. Zero? Yeah. I mean, we spoke about this at the marketplace: the cost of a token equivalent to GPT-3.5 from two years ago dropped by 98 percent. But of course, we've got more sophisticated models now, the reasoning models like o1 and so on. But you also asked me which companies we work with, right? So I'll give you one example. We saw that as you build systems that get more degrees of freedom, because they're agentic, they can do more things, they can, you know... - Just you saying that scares me. - Yeah, well, they can think of- - The way you're wording it, just is- - They're generating answers, they're going out to sources. - They've got my bank details. - If you've given them to Toqan, it will be a great partner for you. It'll be very safe. But we realized we need to make sure we understand what the risk is. And so we invested in a company called Prompt Armor. Their mission is basically to quantify the risk; think of it like pen testing for GenAI systems. So we work with them and, of course, we invested in them. So, trying to get my agent to buy stuff for somebody else? Well, this is more on the security side. It's like running a pen test on the infra, on the entire system. Think of it like you have a chatbot or a system that is GenAI-powered and can give you answers, and they go out there and try prompt injection attacks, try data exfiltration, and all the other new vectors. Basically, you open up a whole new risk surface area that you need to understand. And that's one example where, of course, we invest in them because we think it's a promising product, but, coming back to the Prosus ecosystem, if we believe it's useful they can now offer their product to everybody in the group, everybody that's building GenAI systems. And I think that comes back to how we work with founders. It isn't just about investing in them like a traditional fund would and hoping for fabulous returns; it's about how this offering, this group of founders, can join this global ecosystem where the sum of one plus one is, whatever, 11, right? Because we're now working together. What they do isn't just, in itself, an interesting proposition; it adds and is additive to everything else we do across the group. And we see that a lot, and we're doing it increasingly. Of course, our focus is e-commerce, right? But at the intersection with AI, there's a ton of new ideas, propositions, products, techniques, tools that we're looking at very closely to bring into the group. All right, now we've got Floris with us. And it is cool to have you here to talk about all the things you've been working on. All right, Floris, so what do you do? Okay, yeah, thanks. So I'm an AI engineer at Prosus.
And the last year, one and a half years, my main focus was on agents. You know, building agents, testing agents, verifying their use cases, and building them out in real products. So it was mostly like these cycles of like, we have an idea, you know, we want to build this POC or MVP, and we want to know if it works.
And in some cases, you know, I stuck around for a bit longer and I actually, you know, did some work to get it in production. But it's, yeah, it's a lot of experimenting. And I think you mentioned already early in the podcast the applied R&D. Hmm.
I think is a good way of positioning. So yeah, so very privileged to have this position. I'm always curious what, because we used to have a lot of talk around what an ML engineer was. And is that somebody who's modeling? Is it a data scientist who is specifically working on ML? And now there's the new term of AI engineer. So what is that? Like, what is the day in, day out? Are you building evals? You're working with agents, you're creating agents. And what's the...
Yeah, so I would say like an engineer is there to solve problems. Like an engineer will say like, okay, this is what we need to have, you know, build it. I don't care how we reach it. Now with AI nowadays, it's almost all software. So you're part software developer, but you're also part thinking like, you know, how can we position this product? How can users interact with it? You know, it's a bit more...
you make a few more decisions than a normal software engineer, because normal software engineers are working on a task-to-task basis, but as an AI engineer it's more like: I'm solving this task using AI, and how would we fill that in? Most of the time it's a blank piece of paper. Yeah.
or a Miro board, and we just start building. So I think that is kind of today's AI engineer. Over the last two years, you've been trying to play with agents. How many are we looking at? Yeah, well, it will be a larger number than I think many will expect. And I think using the phrase "playing with agents" is quite right. We've been exploring. It doesn't always need to be a good idea; we're just trying to work the muscle here.
But yeah, I think over 20 projects that were related to building an agent solving a specific use case that at some point we thought was a really good idea. And we'll get back to that, because I want to ask a lot of questions around why you ever thought this was a good idea. But the other thing that is worth noting is how many now exist, are actually still being used, or are real projects, I guess. They made it past that filter. Yeah, so there are two that actually made it, but there is a caveat: a few of them were merged into one. Because these were exploratory projects, we saw value in them, but as a standalone feature it was like, okay, it's not adding any value. But if we bundle it all, or we add it to our Toqan, then it adds value again. So Toqan, for those who don't know, what exactly is it? Yeah, so Toqan is our general assistant. The idea started as kind of having this extra coworker. It started on Slack; now it's also on the web. It's been evolving a lot, and still is. But it started as: you just send a Slack message to this agent and it will do part of your work. And of course it started with just simple summarizations, and now we're building it out into more complex systems where it can do full analyses, you can save stuff, and you can kind of build projects on top of Toqan, having this back-and-forth interaction like you would have with a real colleague. And the other one that still exists to this day is the SQL analyst. Yeah, the Toqan analyst. Yeah, it's mostly used for SQL.
And yeah, that one is really successful because we really saw it was adding value and we were saving people's time and money. - Yeah, nice. And we're gonna do a whole episode on that, like a deep dive case study. Now, what I wanna talk about, you've seen over 20 use cases. What are some green flags and red flags of an agent that is going to work versus fall flat on its face?
Yeah, so to come back to my earlier comment of like, you know, we bundled a few. I think when we were really trying like, okay, let's try many ideas. One of our experiments was, what if we did an agent that could do, you know, less, you know, so more specific. So we call this verticalized agents.
Will the accuracy be much better? Consistency be much better? So people trust it more and use it more. So the test we ran was we had an analyst agent which was making plots, Python, it was reading Excel sheets, doing statistical analysis, anything.
But we saw that sometimes, with cleaning data, it would make mistakes. So we're like, okay, let's make a cleaning agent separately. So you first go to the cleaning agent and it will clean, and then you can come back to the analyst and it will do the analysis. So you have that separation. But actually, people were not using the cleaning agent, because they said, yeah, but I'd rather have the 80% of the time where we just finish the task in one go; that is so much easier than me having to switch. An extra step. Yeah, so the extra step was not worth it. Makes sense. Yeah. What are other red flags? Yeah, I think every agent that was really hard to test, saying it's right or wrong. I had a colleague that was always saying, you know, we're measuring vibes. I think that was a really good measure of things you should avoid when building agents: if it's not binary, if it's not like, the code runs. And I've heard that a lot with coding agents and assistants, that one of the reasons they are such a strong use case is that it's like: the code runs or it doesn't. It compiles or it doesn't. And you know if the AI-generated code, or the agent that assisted you, worked or it didn't. Yeah, exactly. And you see it also in o1 now. It's really funny that you see the same things we see in agents in o1 and o3. Because OpenAI itself said, hey guys, if you want to do creative stuff, still just use 4o. Because 4o is still preferred by humans as being better at creative writing. And that is exactly due to the same issue, where o1 is being trained
on being right or wrong. And the moment there's no right or wrong, it cannot improve itself. So o1 is amazing at all these analytical tasks, but the moment you're getting into the creative stuff, it's... Anything subjective. Yeah. And that's something we saw in the agents as well. When you're more on this creative side, it's like...
you know, how do you know it's right? Yeah. So making sure that, whatever the task is, there's a clear way to evaluate whether the task was executed or it wasn't; that's another green flag, I would say. Any other red flags that come
to mind? Yeah, so it's actually quite funny, because now it seems the tables are turning, but a year ago we had this web search agent. And that is one of those agents that also got merged into Toqan, but at the beginning it was a separate agent. And I can still remember the feedback: yeah, this can never work because the latency is way too big. You know, it was doing deep research, and maybe the name sounds familiar; it's something that Gemini or Google is now doing, they released this Deep Research, and now people are fine with it: oh yeah, seven minutes, if I get a cool report with a lot of sources, that works. But we were doing something similar a year ago, and people were saying it takes too long, I don't know when it's done. Sometimes it took like 10 minutes. And I think the biggest difference is that we were on Slack, so we couldn't provide this multi-page document. In theory we could, but at that time that was not the way we were thinking. We just wanted to have a concise message on Slack. So that's why we said, for just this message, it's not worth the waiting time. So we didn't kill the project, but we didn't keep it as a separate agent. We just distilled it a bit and moved it into the more general one. That's funny, because I find that in my own workflow, I tend to ask AI a question and then go do something else. And so I'm in the camp of: I'm totally cool with just waiting, seeing what happens, coming back to it when I get to it. Sometimes I forget and then come back a day later and it's like, oh yeah. Yeah, but we were in an era where ChatGPT was the norm, and that responded immediately. You had the streaming, and with streaming, within 300 milliseconds the first word started and you started reading.
So if you then introduce a system that you need to wait five minutes for, people are like, no, we can't. Too much. So it's also like the public adapting to this view of these agents doing stuff. So the more people know that there is work being done, the more they appreciate that waiting time. And they actually are like, oh, yeah, but it's normal. And like you're saying, you're doing this asynchronous work.
I even have some like three tabs open and ask it three questions at the same time and you're really multiplying your asynchronous work. - Multitasking. - Yeah, yeah, yeah, yeah. Multitasking has a new definition now. It's quite funny. - What are the ones that died?
So we had this hackathon, since, you know, we're really in the mode of: there are no bad ideas, just develop. So the idea was, let's get the whole AI team for 24 hours, or a bit less, and make agents. And there was one of these ideas where I thought, yeah, this is going to be the agent from the hackathon. Because that was kind of the idea of having a hackathon. The home run. Yeah, exactly. Yeah. I would have invested. But it was the Jira agent. Of course. So it was doing the Jira tickets. Nobody likes Jira. That's so true. And the thing was, we saw it working. In this test setup, where they built the Jira board with the agents, they started adding tasks, changing tasks, and asking for summaries of tasks. It was working really nicely. Again, in Slack, it was super useful.
But the fun thing was, the moment we connected it to our Jira saying like, okay, we're going to be the first beta testers. Yeah. You know, it completely broke down. You know the reason why? What happened? It was the human text, you know, all like these acronyms and all these like really short sentences, like minimal information that was needed for a human to understand.
It was messing up the agent. It was saying, this is not a description; like, half of the tickets didn't even have a description of the task, but all the humans in the team were like, yeah, of course, we're building this project, so this must be that. All that context was not available to the Jira agent. So that's why it completely did not work, and that's why we were like, okay, you know,
let's not continue with this because we need to change too much. I wonder if today you would take a different stab at it and you add in some kind of a knowledge graph with Slack messages and with emails or with other context, do you think it would have been more successful? Yeah, I think
I think today I would make an interview agent as well, and first interview the team: provide me all the current information. Nice. Then convert that into documents and supply the Jira agent with those; like, if things are missing, look at this interview I had with your colleague. Yeah. Maybe that clarifies it. That is a more stable approach. But another approach, and that is, I think, something we'll be seeing much more of, is to go AI-first. Because what was the reason that board was messy? It
was because we needed to type every word ourselves, and then you're basically doing SMS language: you're trying to say as much as possible with the least amount of keystrokes. Like if you had a Nokia back in the day, when people were texting on flip phones. Yeah, instead of writing it out, you just add an emoji. Yeah. But so, if you go AI-first and say, I built this board with AI and then I maintain it with AI, then there's a chance. That was the reason why the test was really successful: the test board was built with AI and then questioned with AI, so it understood its own language. So that is also a route: you can say, okay, you just need to force people to remake the entire board. But would it be a whole separate tool? It wouldn't be using Jira. Or, a design decision: it can still use Jira, but then just a fresh board, or you can make your own UI. I think that's maybe something that we've
learned over and over again: as you bring these systems into existing workflows or ways of working, in particular when we go into the e-commerce world, people have existing expectations and patterns of use. Of course it's not surprising, but it's so important to get that right. And, you know, in our world, we gave Toqan access to our GitHub and it would sort of comment on code and so on. And we switched that thing off in no time because it was so noisy. Then we tried other products like CodeRabbit, and it was very similar, because at the end of the day it's very easy, it's cheap, to generate content and comments, but there's still cognitive load to go through it. And you want to spend that on high-value information. One of our missions is to become sort of the best AI-first team, right? So we have AI assistants everywhere. We've got our own AI statistician. We've got all these little AI layers. So we test everything. But very frequently, on some of these workflows, we let go of the tool because it doesn't make sense yet, right? It doesn't work. And I think part of it is our expectations and how we interact
with each other and the tools, but also the tools like Jira in Floris' example. It wasn't made for interacting with these agents as it is today. Maybe it will in the future, but not there yet. Linking that back to what you were talking about earlier on the cost per intelligent unit, or what did you call it? The cost per unit of intelligence, yeah.
And you think about how that is not a unit of intelligence. It's outputting something, but it's actually a unit of distraction. Yeah. In this case, it's costing us cognitive load and you want to do the opposite, right? Yeah.
Every right question saves time, but every wrong question spends it. Yeah. One thing to add here is that, to come back to this theme of cognitive load that we add sometimes, you know, without thinking about the current status, we've got a big...
platform called OLX, a classifieds platform, millions of listings being uploaded every day, and a natural place for us. A good example of the kind of work that we do is we try to see how agentic systems can help people transact goods. And I think it's one of the strange consequences of ChatGPT that everybody's tried to basically make a ChatGPT for X. For everything, right? So we also, naturally, when we started that journey, said, hey, we need a ChatGPT for OLX, right? For people that try to buy and sell stuff on classifieds. And we realized that, you know, people today, I mean, hindsight is obvious, but they go to a website, they see a ton of images already, they have a search bar, and that's how they discover. And then we said, you know what, we're going to introduce a conversational agent.
But what's the cognitive load you're now placing on this user, right? This user needs to come in and say, I'm looking for a piece of furniture for my home, which is such and such style, and it needs to be under... And people just wouldn't use it. And so there was a huge...
let's say, drop off because of that additional friction we'd introduced, the cognitive load for people to input things. And even when they did use it, they'd put in blue couch. Yeah. So it's the same as search. So it becomes, it's just a search bar. And then the agent would come with, you know, 500 tokens of questions and content. And then the user would say, yeah,
Cheaper. Yeah. There is another thread that I wanted to pull on where you're talking about different layers of if an agent project makes it into production. And one is you design it a certain way.
And you take as the creator of this agent certain design decisions. And then the other is later it's out in production and maybe it's increasing the cognitive load. But it could be that it's increasing the cognitive load on the user because they don't know how to properly use it. Or it could be because it's a shitty project.
So you have to decide later which one of them is it. Do we need to educate the users more or do we just need to kill the project or take a different design decision? Yeah, I think that blue couch example of Paul is amazing. Like, you know, 100% all the developers working on that project were nicely filling in the full prompt, you know, like typing like, hey, I want this couch that looks like this.
And the moment they indeed gave it to real users, they were like blue couch. Which seems so clear in hindsight. Of course, I don't want to have to fill out a form. I don't want to have to put more than I need to to get what I want. If I can do it in one click, that's better than typing out words. Now, there's another side to it right now. I think to the kind of things that we learned as we built these agents is that
these systems have a much better ability to understand complex queries. So as soon as you put in something like modern couch, most search engines today fail. But a Gen AI-based system can actually understand what modern is and what that may look or feel like. And so we've sort of leveraged that and said, okay, well, actually we need to represent our catalogs
in ways these agents can link to more complex queries. A modern couch is an example in the classifieds space, but in the food space you could say "something light and healthy." No search engine today in any food-ordering platform knows how to handle that, right? But we actually can. These LLMs can suggest something light and healthy, give me five suggestions of what it would look and feel like, but they need to match that to the
underlying catalog. So then you need to have a system that does this sort of what we've called magic or smart search to retrieve that. And then you have another layer which is like, well, if we can actually understand that and we want to overcome that friction, there are places in the world where we work, like Brazil and India, where people work with voice. And so if people can actually say through voice, hey, I'm looking for a quick meal tonight in my house for two people,
that's fairly frictionless, right? You can send it through a voice message and an agent can decipher that and say: oh, the house is there, this is where they live, these are meals that would satisfy two people. So you can actually take a much more unstructured, different modality of input from a user, give that to an agent, and they can process it and translate it into a set of items that can then be presented to the user. So there are
other opportunities that open themselves up because you've got stronger reasoning capabilities, multimodality and so on. Yeah, yeah, I do like that, how you don't have to think in the traditional way. And that's what's becoming clear is that
if you're trying to fit the agent into old workflows, it almost feels like a square-peg-in-a-round-hole situation. But when you start thinking outside the box and you think, okay, since the agent can do just about anything we throw at it, what new workflow can we try to make, one that the user isn't already trained on, how they're used to using the app or how they're used to interacting with this or that. And so you're going to get inevitable dead ends on that path, which I think you saw with the glorified search bar. But then, yeah, the voice note sounds incredible. If I could just send a voice note to an agent that would give me suggestions all the time, that is a really awesome use case.
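As a side note, here is a rough sketch of what that "smart search" flow might look like in code: an LLM-style parse step turns a free-form (possibly voice-transcribed) request into structured filters that an ordinary catalog query can use. The parse step is faked and the field names and tiny catalog are hypothetical, not any group company's actual system.

```python
# Rough sketch of the "smart search" idea: free-form request -> structured
# filters -> conventional catalog match. parse_request fakes the LLM call.
def parse_request(user_request: str) -> dict:
    # In a real system this would be one LLM call returning JSON, e.g.
    # "Turn this food request into JSON with keys dietary, servings, max_prep_minutes".
    # Faked output for the example request below:
    return {"dietary": ["light", "healthy"], "servings": 2, "max_prep_minutes": 30}

CATALOG = [
    {"name": "Quinoa salad bowl", "tags": ["light", "healthy", "vegetarian"], "servings": 2, "prep_minutes": 20},
    {"name": "Family lasagna",    "tags": ["comfort"],                        "servings": 4, "prep_minutes": 60},
]

def smart_search(user_request: str, catalog: list) -> list:
    f = parse_request(user_request)
    return [item["name"] for item in catalog
            if item["servings"] >= f["servings"]
            and item["prep_minutes"] <= f["max_prep_minutes"]
            and all(tag in item["tags"] for tag in f["dietary"])]

print(smart_search("quick meal tonight at my place for two, something light and healthy", CATALOG))
# -> ['Quinoa salad bowl']
```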
What are some other use cases that died, though? This is where I want to hear; we want to get back to the graveyard. Yeah, I want to hear more. The Halloween episodes. Because it's almost like that's where the best learnings are, right? You always see people, and especially companies of your size, writing blog posts about the successes, but you don't really hear companies talking about the graveyards and what they had to do to get to that success. I think I have one more that is quite interesting, because normally we were always so heavily focused on "can it scale," but here's one example where we kind of forgot that part, and it also ended up in the graveyard. We called it the UX researcher.
So we had real people coming to us with an issue. They're saying like, "Hey, I have all these open form questions that we get as a review or comments on our products, but there are too many for me to process. Can we build a system? Can we build an agent that goes through those comments?"
and kind of summarize them. Like, hey, name me the top three features people dislike about our site. You know, these are questions that we foresee. We're like, okay, this is indeed something agents can solve. And, yeah,
When we started with this, we built this whole tool that was analyzing row after row, doing map-reduce. So it was first checking what the subcategories are, then dividing the comments into subcategories, and then for each subcategory finding, given the user's objective, what the answer is. So it was combining all these techniques; it was super fancy. Wait, can I stop you right there real fast? Was it a clearly defined workflow each time, or did you ask the agent and the agent would figure out the workflow on its own? It could manipulate the workflow, but it was quite repetitive. Okay, so it almost chose its own workflow. Yeah. And that was where the agentic part came in. And we tested it with Excels of like 100 rows, 1,000 rows, and it was working fine. And then we went back. And they were asking for Excels of 100,000 rows. And it could not do that, you know? We knew it was going to be large, but we thought like 100 or 1,000 rows, because it could do 10 without any sophistication, so we'd already multiplied that by 100. But it was devastating. And the worst part was that we designed it row by row, and they had their answers verticalized, they had just transposed their whole table, and that also broke the entire thing. It was just a mess, because we were so enthusiastic. We were like, this is a great use case, and we saw all these ways we would solve it. But we completely forgot
how they would solve it; we didn't ask enough questions. The basic question. Yeah. How big of a file are we talking? Yeah, yeah. But that was a period where we were like, there are no stupid ideas, where we just needed to make agents, agents, agents to see what sticks. Yeah.
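For readers curious what that map-reduce pattern looks like in practice, here is a minimal, fully faked sketch; the llm() stub, the categorizer, and the comment data are placeholders, not the actual UX-researcher agent.

```python
# Map-reduce over free-form comments: assign each comment to a subcategory
# (map), answer the objective per subcategory, then combine partials (reduce).
from collections import defaultdict

def llm(prompt: str) -> str:
    """Stand-in for a chat-completion call."""
    return f"[summary of: {prompt[:40]}...]"

def map_to_subcategory(comment: str) -> str:
    # Real version: one LLM call per row ("pick a subcategory for this comment").
    return "checkout" if "checkout" in comment.lower() else "other"

def ux_research(objective: str, comments: list) -> str:
    buckets = defaultdict(list)
    for c in comments:                                     # map step: one call per row
        buckets[map_to_subcategory(c)].append(c)
    partials = [llm(f"{objective}: " + " | ".join(group))  # per-subcategory answer
                for group in buckets.values()]
    return llm(f"{objective}: " + " | ".join(partials))    # reduce step: combine partials

print(ux_research("top dislikes", ["Checkout is slow", "Love the app", "Checkout button hides"]))

# The failure mode in the story: at 100,000 rows, the per-row map step alone
# becomes tens of thousands of LLM calls, so cost and latency blow up unless
# you batch or pre-cluster rows before involving the model.
```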
Yeah. You did mention one thing about why you have that mentality, which I thought was pretty cool. And you all look at it like it's going to the gym and you're building the muscle of creating agents. And you're trying to figure out
How you can create these new workflows, these new products that are agentic first. Yeah, because if at some point, if you made enough agents, you know, it's like learning physics. You know, if you learn something new in physics, you walk around the real world and you see that formula taking form in real life.
And the same happens with agents. If you build a few, then when you enter a site, instead of seeing the UI, you start to see tools. Yeah. You know, you're like, hey, this can be a tool and this can be a tool, and then I have a chat window and I can just remove this entire UI. That's how you start thinking. Until a user asks for a blue couch. Yeah. But there's this whole new way of thinking and looking at things, and it's something you need to practice, because the first time I saw the agent, I remember I had worked a week at Prosus and I was sat in this room, and the only thing we knew was: we're going to test a new agent. Ahmed built this one; it was the analyst, doing all these pattern analyses. And they just gave us the chat window and said, good luck, go test it, we need to load test it, does it scale, blah, blah, blah. And that was amazing. It was like, I didn't know how it was doing it, but it was doing it. But it was also like, where are the limits? It was really, really hard to find those in the beginning, because you didn't know what the system was. At that time, you just tried some things, but you really saw that after three months of working with it and developing it, you were much better at testing it, at finding those edge cases, because: okay, this is how it works, so this is how I can annoy it, or this is how I can make sure it works. And
it was quite interesting to see this muscle grow. Let me add to that, because you asked why we do that, right? For us, we need to understand what makes these things work, and why it's important for us as Prosus is because we fundamentally believe that agents are going to be able to help us build better products for our users. And we've made predictions around this, right? You were at the marketplace. One of the predictions we made was that in a year's time, 10% of the actions done on our platforms will be done by agents on behalf of our users. That's a pretty bold prediction. And whether it's in 12 months' time or 36, it will happen. We're fairly confident about that, because why wouldn't you send out your agent to help you get whatever you need, whether that's food or other things, if it can do that reliably for you? But we can only build those things if we fully command the technology and have a very good intuition, and you can see how Floris works:
by basically trying a ton of things with the rest of the team, he has developed that intuition. Right. We can say, we're not ready for that yet, but this tool, sure, we can build that, it'll get us to 80% accuracy; we measure these things, we test the tools, and so on. So that's the larger picture of why. It's also because it's awesome. I just want to play with it. Yeah, it's a wow effect. I think
it will be a while before agents stop giving me that wow effect. And so maybe you can give us some tactical things: when you're putting agents into production and you want to make sure that you've covered and checked all of the boxes, what are some things that you've learned or done that have helped you make that jump?
On the B2C side, I think there are people who know much more than I do. But we did work with a few agents that were around data, so the data analyst. There we really saw that what improved things, in prompting and safety, is when we would recap: after we gave the answer, it's like, we gave this answer under these assumptions, right? Because we really tried to ask as many questions as needed to make sure it was not an ambiguous question. But that's just hard, and there were still a few questions that came through that defense mechanism. So at the end we're like, okay, then let's just recap: you asked this question, I did this, so that means I made these assumptions. Listing those at the bottom is also a safety mechanism, a way of saying, you know, maybe I made a mistake. Because you want to minimize mistakes, especially when you're doing data analysis. We want to position this tool as something you make decisions on; you want everyone to be able to make decisions based on data, so the better those decisions are, the better. One of those mechanisms is the assumptions list, and I think we haven't seen that in other tools yet. So you're just asking, as a final step in the prompt, tell us what you did, tell us what the prompt was? Yeah, but it's a separate mechanism, so it's not the agent itself; we really want to let the agent do its thing. There's a second LLM call, or agent call, that basically reviews the steps and says, okay, the user started with this question, but I've seen you also added this filter in the SQL query, I see you changed the date format, maybe that changes things. So it's like a proofreader, basically, right? A layer of checking and validation before it gets sent back to the user. Yeah, it's critiquing everything, but it's not saying it's wrong or right; it surfaces to the user: is it right that I made these assumptions? Because mostly those assumptions are made because there was no other way of calculating it, or because of some rule in a document that we added to the agent, you know? Okay.
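A minimal sketch of that "assumptions recap" pass: a second model call that reviews the steps taken and lists the assumptions under the answer, without judging right or wrong. The llm() stub, prompt wording, and example steps are illustrative, not Toqan's actual implementation.

```python
# Second-pass "proofreader": review the analyst agent's steps and surface
# the assumptions it made, appended under the answer for the user.
def llm(prompt: str) -> str:
    """Stand-in for a chat-completion call."""
    return "Assumptions: I interpreted 'last quarter' as Q3 and excluded cancelled orders."

def recap_assumptions(user_question: str, steps_taken: list, answer: str) -> str:
    prompt = (
        "You are reviewing an analyst agent's work. Do not judge right or wrong.\n"
        f"User question: {user_question}\n"
        "Steps taken:\n- " + "\n- ".join(steps_taken) + "\n"
        f"Answer given: {answer}\n"
        "List, in plain language, the assumptions the agent made "
        "(added filters, date formats, interpretations of ambiguous terms)."
    )
    return llm(prompt)

answer = "Revenue last quarter was 1.2M."
recap = recap_assumptions(
    "What was revenue last quarter?",
    ["Wrote SQL with WHERE status != 'cancelled'", "Parsed 'last quarter' as Q3"],
    answer,
)
print(answer + "\n" + recap)   # the recap is shown under the answer
```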
The other thing I want to finish on is your view of the evolution of prompting. It's come a long way, you know. The first time I was using large language models in some kind of assistant way, it was GPT-NeoX, an open-source model, 20 billion parameters. And I remember prompting it as if I was writing a paper and then stopping at some point, and it would finish it to answer some complex question, because it would write the paper that would answer that question. Which was insane trickery; we were basically tricking the LLMs. And then GPT-3, the davinci models, came, and still we needed these tricks, we needed examples, we needed to kind of massage it into this pattern. Yeah. And then came the era of instruct models, which is the beginning of ChatGPT, where you could just ask a question and it would understand that it's an instruction. But this development kept continuing. People thought, okay, the moment you can ask a question, it works. But we've seen in these system prompts, as you call them, that in the beginning we needed to tell them every single thing: this is how Python works, this is how you should be friendly, this is how you use emoticons in your message. The tiniest bits of correction that we wanted to see consistently, we needed to write down. So we had system prompts of like 3,000 tokens, and maybe even more for some agents. But over time, we struggled converting these prompts from model to model. What we actually saw is that if you just removed everything and started again with an empty prompt, then added back only the parts that were indeed failing, you ended up with a shorter list. Hmm. So what was actually happening is that OpenAI was training these models better and better, so many of the things people were forcing them into became part of the native behavior of the models. And that is a trend I really see. If I now build an agent, I literally start with three lines of system prompt. Wow.
Well, now you have the o1s, which basically do chain-of-thought reasoning before they actually start executing. But
because we're talking about agents, we also see that if you look at the prompts we use in our agentic systems, they're essentially like a piece of code where you start inserting all sorts of parameters, right? So it's basically dynamic prompt building, or composite prompt building, where you've got placeholders for all sorts of things.
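A small sketch of what that composite prompt building can look like; the template, field names, and tool entry below are made-up placeholders, not the actual Toqan prompt.

```python
# Composite prompt building: the system prompt is assembled at runtime from
# placeholders (assistant identity, date, user context, tool descriptions).
SYSTEM_TEMPLATE = """You are {assistant_name}, an assistant for {company}.
Today's date: {today}
User context: {user_context}
You may call the following tools:
{tool_descriptions}"""

def build_system_prompt(assistant_name, company, today, user_context, tools):
    tool_descriptions = "\n".join(
        f"- {t['name']}({', '.join(t['params'])}): {t['description']}" for t in tools
    )
    return SYSTEM_TEMPLATE.format(
        assistant_name=assistant_name, company=company, today=today,
        user_context=user_context, tool_descriptions=tool_descriptions,
    )

print(build_system_prompt(
    "Helper", "ExampleCo", "2025-01-15", "analyst, prefers SQL over spreadsheets",
    [{"name": "run_sql", "params": ["query"], "description": "Run a read-only SQL query"}],
))
```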
Yeah. And it can be information about the session or the context or the user or whatever. But also, of course, you've got the tools and the function calling that you need to describe, and the way you describe it, Floris, I think is absolutely right. Like, now you can't put in 2,000
functions and describe them; it doesn't work yet. You can do a couple; we know roughly where the sweet spot is, depending on the model. Where have you found it to be, is it like 10? It depends on the model, but no, it's more. It depends how complex the functions are and so on, and if you need to chain the functions, if they look alike. If they're super far apart, you can add as many as you want; it's when they look alike that it gets confused, and it's like, yeah, that's the same, isn't it?
So we build evals on, one, can it actually pick the right function at the right moment? But then the next step is you've picked the right function. Can it actually provide the right parameters for that function to be executed? Typically, if you do code execution, you need parameters. If you go to the web, you need parameters, right? Search queries and so on.
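A sketch of that two-stage evaluation: first check whether the right function was picked, then whether the arguments passed are usable. The stub agent and test case are hypothetical examples, not Prosus's eval suite.

```python
# Two-stage tool-use eval: (1) right tool picked? (2) usable arguments passed?
def stub_agent(question: str) -> dict:
    """Stand-in for the agent's tool-planning step."""
    return {"name": "fetch_weather", "arguments": {"city": "Lisbon"}}

TEST_CASES = [
    {"question": "What's the weather in Lisbon tomorrow?",
     "expected_tool": "fetch_weather",
     "check_args": lambda a: a.get("city") == "Lisbon"},
]

def eval_tool_use(agent, cases):
    picked_right, args_right = 0, 0
    for case in cases:
        call = agent(case["question"])
        if call["name"] == case["expected_tool"]:          # stage 1: right tool?
            picked_right += 1
            if case["check_args"](call["arguments"]):       # stage 2: usable parameters?
                args_right += 1
    n = len(cases)
    return {"tool_selection_acc": picked_right / n,
            "argument_acc_given_selection": args_right / max(picked_right, 1)}

print(eval_tool_use(stub_agent, TEST_CASES))   # {'tool_selection_acc': 1.0, 'argument_acc_given_selection': 1.0}
```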
And that's a second evaluation, right? Can you actually ensure that when you've identified the right function, you pass the right information to get back what you need? Now, all of that stuff comes out of that prompt, right? So, in fact, your question of how prompting has changed is super relevant for folks building agents, because the way you think about a prompt and the orchestration around it, what information you pull in, what information you get back if you do sequential chaining of tools in the agentic workflow, all that stuff is very stateful, right? It needs to be stored somewhere, it needs to be managed. So anyway, we've ended up in all sorts of cans of worms as we try to make these things work,
as you change the model, add a tool. Breaks everything. Well, yeah, you need to make sure you understand what breaks. When it breaks, you learn something. I think that is really important. I think that's almost the first question I ask people that made some agent or an agent system is like, what can't it do?
because that's really, really important. It's something we saw on one project: we constantly knew what it couldn't do, so we knew that that was our next target.
And then once we were able to do that, we'd spend a few minutes on, okay, where does it break now? Yeah. And then, okay, that's our next target. And you keep moving from target to target until you're like, okay, these remaining tasks are edge cases; we still know it doesn't work there, but that's off limits. Well, that goes back to that binary execution, right? Because you know: did it complete the task or not? Yeah, exactly. Binary; a lot of people say right or wrong, but really it's task completion. If I have eight steps to finish a task, I don't really care how it achieves the task; I just want it to achieve the task with a certain consistency. So that is also a binary thing. There's that, but then there's also the...
almost higher view of, are these the right tasks? If you ask an agent to do something, it may complete all the tasks, no problem. But...
the tasks aren't the relevant ones. And it's a good one where you also bring time back into it: if you waste more time trying to get that task to work, yeah, but it is then automated, is it then worth it? I think the cognitive load of checking it and making sure it works needs to be in proportion to the value it delivers.
And I think with what you mentioned earlier in the podcast, the computer use, the web use, we are really in a stage where there will be a period where we're saying, okay, it can do it, but maybe I will just make that pivot table myself, because typing the instructions will probably take longer. And I want to use my computer. Yeah, and I don't want to sit behind it. That's also one: does it save you time if you're not able to operate the computer at the same time? Or do you need a second computer just for your hands? I mean, for us, we generally think about: make it work first, then make it fast, because users don't like to wait too long, and then make it cheap. And we're typically always pushing the frontier of: does it work? And so it's perfectly fine to spin up 10 agents that will try to solve your task and take whichever one gets there first, because having a right answer is more valuable than the cost of nine or ten of these things running in parallel.
So we're always trying to push the boundaries of: can we make it work? It's also interesting for the process itself to know which tasks are solvable by AI, because then we know there's a time factor: within X months or years, this will be viable from a cost perspective. So we just need to know it could be solved; whether it's the right time is then another question. Yeah.
A huge shout-out to the Prosus team for their transparency, because it is rare that you get companies talking about their failures, especially companies that are this big in the AI sector, and really helping the rest of us learn what they had to go through, sometimes so painfully. And a mention that they are hiring. So if you want to do cool stuff with the team we just talked to, and even more, hit them up. We'll leave a link in the show notes. And if you're a founder looking for a great design partner on your journey, then I highly encourage you to get in touch. We'll leave all the links for all that good stuff in the show notes.