Today we're talking web agents. This episode is going to be all about how the Prosus team has been leveraging web agents and why you would even want to go down the path of trying to get web agents to work, because that was my biggest question. Isn't it just adding complexity when you could use these API agents?
Let's get into this episode. This is the MLOps Community and Prosus collab. We are doing this limited series all on agents in production. All right, man, web agents, here we go. We're getting into what they are, what they aren't. And you told me something yesterday or two days ago about how computer use is full use of the computer, like the name states. And then web agents is almost...
a hybrid of APIs and computer use? What exactly is a web agent? Why are they useful? Let's get into it. Yeah, so we talked about the fact that, you know, what are agents? And my simple definition going back to that is agents are LLMs that interact with the world. And one way to interact with the world is
And certainly one way that we as humans interact with the world is through... The internet. The web, right? And browsers more specifically. And because at Prosus, you know, we're a large tech company, we interface with our consumers, about 2 billion of them all over the world, through our companies and their products, which are almost entirely on the web or through apps. Right.
As we think about what's the next stage of agents, of course, we're also exploring how the web...
can be used by agents to help users navigate, find, discover, transact, buy, and so on. And that's why we have an active team, let's say, working on web use for agents. Why only web agents, and why not just go all-in with computer use? Yeah, I think...
The way we're starting to give these agents the ability to interact with the world is gradual. And that graduality, that incrementality, sits in the fact that you increase complexity step by step. Think about it: the first
access that we gave LLMs to the rest of the world was through simple APIs and function calling. And now we're saying, well, maybe they can actually interact with the world through a browser. And then, of course, after that, it will very likely be the computer, as we're already seeing, and then maybe after that, the physical world. And so web browsers are sort of the next step, the natural step that we wanted to take,
also because a lot of, as I said, our interfacing with our consumers, customers happens through the web.
I know that there is a ton of e-commerce stuff that you all are doing; OLX, for example, is a huge one. And if you have web browsing with agents, why would you use that instead of APIs? How would you build an e-commerce site that's optimized for web browsing by agents? How do you think about that? Yeah, let's start with the question of why we would want to do that. Well,
As I mentioned, I think the web is a very important way that we interact with the world today. The way we discover things, we learn, we may buy and shop and exchange goods in general. And so that's why we're trying to solve that. At Prosus, all of our companies have a web interface to the world. And...
then the question is, okay, well, how do you start going down that road? I mean, why not just through APIs? Well, the truth is that actually, you know, most of the things that we want to give agents access to aren't made for agents to access. They aren't API-ified yet. And so... Wait, what do you mean by that? Like...
a photo on a website, or what is not API-ified? Well, let's say I want... So one of the predictions we've made in the marketplace is that 10% of e-commerce transactions will be done by agents on behalf of consumers in the next year, right? Whether it's 10% or not, I think it's safe to assume that agents will start to take actions on behalf of us on the web in some near future.
But then the question is, okay, which types of action? Let's say you want to order a flight or you want to order a pizza to be delivered to your house. Today, you obviously go to the browser, you have a mouse, you click around. There's no API for all those actions yet, right? You go to a URL and the agent needs to then figure out, okay, where does it click, for example? Does it do that based on vision, on the site map? Yeah.
And so that's not a regular API. Obviously, there are a lot of APIs behind the website that you're interacting with, but there's no obvious place to plug the agent into in the structured way that the agent today interacts with all the other tools we're giving it access to. Yeah, and I want to call out, too, that I've heard the pros and cons for both API agents
and web browsing or computer use agents. And we're going to be having a full debate on this particular topic to hear different thoughts and viewpoints from engineers on why to use one and why to use the other and why it's the future.
So when I think about web agents, one of the clear value props is that you can build it once and it can go out and do what it needs to do. And you don't rely on APIs, which is a huge selling point because APIs are very finicky. And so I've said this before that you...
don't have to think about building a bunch of different API calls where the agent has to choose which tool it's going to use, which API it's going to go out and find. And you don't have to worry about, oh, this API now changed the way it works, so we have to update, and then it wrecks the agent. You just build the web agent. It goes out and explores. It interacts with the web the way that we
think about interacting with the web. And so it is more human-like. Now, let's take the second part of this, which is how do you think about building websites, particularly in e-commerce, that are agent-friendly, web browser agent-friendly? I mean, that's...
As agents are developing, it's even unclear what agent-friendly means. But to go back to your question, why not just use an API? APIs today, as you'll understand if you listen to this podcast, work like this: you've got some kind of parameter input, you send it somewhere else, and you expect a standard response. Even in the Gen AI world, it could be like, hey, I've got a prompt that goes to an API, that prompt goes to a text-to-image model, and it returns an image in a certain format. Yeah.
And you know what's expected in, and you know what you can expect back. There's no API like that, for example, for Amazon. Even LinkedIn doesn't have an API. You can't just send Amazon the query "a gift for my niece" and get something added to your shopping basket back. There's actually a set of steps that we do as humans to get there. You obviously search online,
use the query, you search, you look at certain items, you click around, you maybe read reviews. A lot of things go into the shopping experience. And the same is true, by the way, for the food ordering experience, which we do a lot of, as we talked about, or finding secondhand goods in the classifieds space, where you need to talk to sellers. And so all those steps
are things that we now use the web and the browser for, and we're exploring how well agents could do any of these steps in the value chain. And you can't do that with APIs. That's a very clear...
reason why you would want to do it with a web agent. Yeah, well, it's already designed for us. I mean, that's how billions of us interact with the web today, through the browser. If you can do that reliably, then that means you can very quickly start to give these agents a whole bunch of other capabilities, because you can say, hey, go check online and book me a restaurant for two
tonight at 7 p.m., and it will be able to look at what's available and so on, because it can just browse the web like you or I would. I will say, since we've been talking about agents so much in the last couple of days... This whole week I've had like an agent boot camp, coming to the Prosus offices and talking to everybody on the team about what they're working on, how they're dealing with agents, what are some of the challenges there.
When I open my calendar to find the location of this place that we're recording this podcast at, and then I have to copy and paste the location into Uber or into Google Maps, I'm sitting there and I'm like, this is so...
backwards, because agents will be taking over this. Or at least, how does the intelligence of my phone, it doesn't even need to be an agent, not know that I have a calendar invite for this time? Maybe I should already have an Uber being ordered so that I can get there on time. Yeah, I think you're already thinking a couple of steps ahead, in terms of navigating first across an operating system, multiple apps, and...
We're looking at frameworks that are made to do that, so really that sort of computer use across apps. Certain benchmarks, like OSWorld, measure against these kinds of tasks that require opening a directory, loading some files, reading them, putting them into Excel, and opening a pivot table. Those are sort of multi-app actions. Maybe copying the address like you mentioned and ordering an Uber through their app.
I think that's definitely coming. But even the step beforehand, where you're within one website, right? Let's say within an e-commerce website like OLX's, Glovo or iFood or eMAG or Takealot, all these websites that we have across the group. If you want to go and buy something, even that is...
difficult, because there are pop-ups coming in. If you're trying to order some food, it'll say, what side dishes do you want? What toppings do you want? Obviously, there will be a whole bunch of other filters. Sometimes we've designed things to ask, am I human? Are you a robot? CAPTCHAs and so on. Those are all things that...
make it, let's say, hard to solve the entire task of getting food or ordering whatever. Hard for humans, even. Well, hard for some of us, for sure. And even harder for agents at this moment. Yeah, it just reminds me of the whole conversation we had a few episodes ago about cognitive load and how we want certain things that we do
to take the least amount of cognitive load possible, so that we can use all that precious brain juice for something that actually needs it. But let's now skip to the part where we get to talk with Chiara, who has been working on web agents for the last six months. She spent a half year diving into it and getting real contact with what's been working, what hasn't been working,
playing around with web frameworks. Let's learn from her and get her insights. Great. Chiara, you're here. Thank you for joining us. And I would love to start out with a brief overview of the project that you've been working on, so that people can know the web agent journey that you've taken.
So the first project that I worked on when I joined the team was to build an agent that would help people order food. And it sounds like a simple task, but it's really complex. You need to understand the user: what would they like, what are their dietary restrictions, the context, also what time of day it is, where they are located, whether there are events in the area. And this agent should be able to
order food. So go to food platforms, maybe several of them, advise the user on what's available, where are the promotions, and ultimately being able to order food. And as Paul said before, not everything is available through an API. So we decided to delegate this task to a web agent. And web agents
I mean, there's a reason why they're popular right now. You talked about it already. And they're really powerful. And we saw that there were a lot available at the moment; new ones were coming out every month. So it was really a challenge to understand how to navigate this landscape of agents. So I immediately started to try out all these tools. So
There are a lot. MultiOn is probably the most famous at the moment, but there are a lot of open source projects as well, from companies and research groups. And these tools are all really nice, but we discovered pretty soon that they have a lot of limitations.
So one thing that is really a big problem for agents is that websites are built for humans. There is a lot of information that is not agent-friendly at all. You have dynamic content that loads in. The DOM of the page could be huge and could change. There are, of course, the standard things like CAPTCHAs. It sounded like something that...
would be really easy for us. We would just delegate execution of this task to an agent. But it was not at all. I mean, at this point, we'd already been building the data analysts and other things, which you'd imagine are much more complex to get right, because you have to get the database, you have to run queries, you have the code executor, you have to validate.
We're like, well, what we're going to do now is just send this agent to the web and basically navigate, Beautiful Soup style. Just go and, like... So simple. Yeah, yeah. And actually, you know, it took forever. It didn't complete any of the tasks, even MultiOn at that time. We looked at all the other frameworks. One good example is WebVoyager, one of the first web agents.
They also published a benchmark. It's public. Many agents benchmark against this. And
What they say is that the success rate can change a lot from one website to another and from one task to another. For instance, websites where you need to perform a lot of actions, like booking.com or Google Flights, are really hard for an agent to navigate. Especially when an action can trigger something else that the agent doesn't expect. Like when you book a flight, you select an airport,
and the destination airport options change depending on the starting airport, right? So this is all really complicated for an agent. So yeah, we started trying out all these tools, we built a lot of internal knowledge about them and all the techniques that they use, and at the end, we decided to build our own, basically. Framework? Build our own web agent, yeah. And how does the...
logic work, or what does the backbone of a web agent do? Because I know that API agents use tools and you have the potential to do function calling or whatever, but with web agents, what does it look like? It boils down to something really similar, actually. An agent is something that has access to information and can decide what to do next, right?
So the information in this case could be the screenshot of the page or the DOM, HTML, lots of things that you can get from a browser. And what it can do are the actions that a human can also do, like clicking, entering text, scrolling, and so on. Those are like the tools it has. Yeah. Those are nothing but tools that the agent can call. So...
The way it goes is that usually there is a planner that, once it gets the task, decides how to perform it. It could be several steps, for instance, involving several websites. And then there is an agent that chooses which tool to use.
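In code terms, that observation-plus-tools setup can be sketched as ordinary tool calls. Here's a minimal illustrative sketch in Python with a stubbed browser object standing in for a real Playwright or Selenium driver; all names are mine, not the team's actual framework:

```python
# Minimal sketch of a web agent's action space as LLM-callable tools,
# using a stub in place of a real browser driver.
from dataclasses import dataclass, field

@dataclass
class StubBrowser:
    """Stand-in for a real driver; records the actions taken."""
    log: list = field(default_factory=list)

    def click(self, selector: str):
        self.log.append(("click", selector))

    def type_text(self, selector: str, text: str):
        self.log.append(("type", selector, text))

    def scroll(self, pixels: int):
        self.log.append(("scroll", pixels))

def make_tools(browser: StubBrowser):
    # The tool schema the LLM sees: name -> (callable, description).
    return {
        "click": (browser.click, "Click the element matching a CSS selector"),
        "type_text": (browser.type_text, "Type text into an input field"),
        "scroll": (browser.scroll, "Scroll the page by N pixels"),
    }

browser = StubBrowser()
tools = make_tools(browser)

# In a real agent, an LLM would emit these tool calls after seeing the
# DOM or a screenshot; here we replay a plausible sequence by hand.
tools["type_text"][0]("#search", "margherita pizza")
tools["click"][0]("button.search-submit")
tools["scroll"][0](800)
print(browser.log)
```

The point is that "click, type, scroll" are nothing special to the LLM; they are tools like any API function, just backed by a browser.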
Like, if there is a cookie banner and you need to click accept, for instance, it's something that could be unexpected. So that's why you need an agent; you need to be able to react to an open world. By the way, this is a great example, because we were benchmarking on, and looking at others benchmarking on, WebArena,
which, it turns out, doesn't translate at all to the tests we were doing, right? Not only did the actual average results not compare, they were also super unpredictable. One time they worked, one time they didn't. So we'd have to devise simulations where we'd look at how many times out of 20 this thing would succeed on a task that we care about. Oh, wow. Yeah, and most of the time we saw agents getting stuck in loops, right?
And, yeah, just not knowing what to do next. It ended up stuck because maybe the task was not clear, or the task that the planner gave was not clear. Or the action space was not clear, because we also saw that, you know, you're looking at a website, and for us it's very obvious, right? You look at a website and there are all these UI patterns
that we're all familiar with. There's the search bar. Scrolling is one that was super hard for these agents to do. But basically, you've determined this is the website, you've got some reasoning through some LLM that tells you what you want to do next. But where on this website do you click to do that? Just understanding what is the coordinate that corresponds to the action I want to take, right? And that's not something these multimodal image models were good at. Like, just taking an image...
understanding what kinds of actions are there is fine, but then saying, "Okay, you need to click on such and such coordinate to execute that action." Or scroll down, because it's probably lower on the page and I don't see it. Or you don't see it because there's a privacy or cookie banner in the way. Yeah, one of the first things we worked on was scrolling, actually.
So we looked at the open source frameworks that were around and we tried to use similar strategies, but we chose only the specific strategies that applied to our use case. And I think this is really important, because the way to make an agent succeed is to limit the number of choices it has to make as much as possible. All these tools were optimized for a specific goal, which is being able to surf the web.
But our goal was different. So in our case, we couldn't use a tool like that. It wouldn't work for us. So we needed to build a web agent that could interact with platforms to order food. And that's a different task. And since the scope of this task is smaller, then we could optimize for that. So we built an agent that could get more information about the page.
Depending on the platform that we were working with, we could prompt the agent to behave in a certain way. Like first you search in the search bar, maybe you need to enter your postal address and so on. There are certain things that always go together.
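One way that per-platform prompting might be wired up is a simple lookup of page-specific hints merged into the system prompt. This is an illustrative sketch, with made-up platform keys and instructions, not the team's actual implementation:

```python
# Sketch of loading platform-specific instructions dynamically:
# hints keyed by page, injected into the agent's system prompt.
PLATFORM_HINTS = {
    "ifood_home": "First type the delivery address, then use the search bar.",
    "ifood_menu": "Scroll the full menu before choosing; toppings appear as pop-ups.",
}

def build_system_prompt(base: str, page_id: str) -> str:
    # Unknown pages simply get the base prompt with no extra hint.
    hint = PLATFORM_HINTS.get(page_id, "")
    return f"{base}\n{hint}".strip()

print(build_system_prompt("You are a food-ordering web agent.", "ifood_home"))
```

Because the hint is chosen by plain code based on the current page, the LLM never has to rediscover a flow it has no reason to reason about.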
And yeah, this is another thing we did: we took certain tools and merged them. If there are actions that you always do at the same time, why would you use two tools for that? For instance, when you search for something, you type and then you press Enter. These are two separate actions, but you can combine them into one tool, because
you basically never type without pressing Enter. Another thing was improving the scrolling. When you have long menus and lists of restaurants, you need to be able to fetch all the information. So we adapted these strategies to our use case, and we got a good success rate. So I think the lesson here is that if you want to build a web agent for a specific task,
Keep in mind the tasks that you have to do and be smart about it. If there are things that you don't need, don't add them to the agent. And so did you go and map out the trajectories and the user flow on these food ordering apps? And maybe it was like you would go to iFood and say...
hey, I want to order pizza, and then go through that flow yourself so that you could use it as a golden data set for the evals of the web agent? So one thing that we did was to do all these flows manually. Like, I think you cannot build something until you try to do it yourself and understand what are the pain points. So this was the first thing that we did.
And then we tried to prompt the agent to interact with the web page in a certain way. And these instructions were loaded dynamically based on the page it was on. This was one thing. So all these methods don't really change the speed at which the agent works.
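The tool-merging Chiara described earlier, typing a query and pressing Enter always going together, might look something like this; the browser is a stub and the names are illustrative:

```python
# Sketch of merging two atomic actions (type, then press Enter) into one
# tool, so the agent makes one decision instead of two.
class Browser:
    """Stub standing in for a real driver; records low-level events."""
    def __init__(self):
        self.events = []

    def type_text(self, selector, text):
        self.events.append(f"type:{selector}:{text}")

    def press(self, key):
        self.events.append(f"press:{key}")

def search(browser, selector, query):
    """Combined tool: typing into a search box and submitting always go together."""
    browser.type_text(selector, query)
    browser.press("Enter")

b = Browser()
search(b, "#q", "sushi near me")
print(b.events)  # two low-level actions, one tool call for the agent
```

Fewer tools means fewer chances for the LLM to pick the wrong one, which is exactly the "limit the choices" lesson.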
But what we did was also store all the trajectories that the agent had done. We defined three modes that the agent could operate in. One was the traditional mode, where it would grab a screenshot of the page, load the content, and then decide what action to take next.
The other was a faster mode that didn't involve a screenshot, and we would use that on pages that we knew. So if we would search for a certain food, and the food would be different but the task would be similar, we would not load the screenshot of the page, because that was not needed. The agent would know exactly where to click because it had seen that task before.
And there was a third mode, which we called reflex mode, in which we would automate the web actions directly, like sort of a macro. Let's say some parts of this can be automated; why would you have an agent do it, right? So yeah, we combined all these things, and the agent would try to do things in a fast way, and then if it would not succeed, it would do it in a slower way.
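The three-mode fallback just described can be sketched as a simple escalation loop. The mode implementations below are stubs and the function names are mine, not the team's actual code:

```python
# Sketch of the three-mode fallback: try the cheapest mode first and
# escalate on failure.
def reflex_mode(task):
    # Scripted macro: handles only steps that are fully automatable.
    return task in {"accept_cookies", "open_homepage"}

def fast_mode(task, known_tasks):
    # No screenshot: rely on cached trajectories for tasks seen before.
    return task in known_tasks

def full_mode(task):
    # Screenshot + DOM + LLM reasoning: slowest, handles the open world.
    return True  # assume the thorough path eventually succeeds

def run(task, known_tasks):
    if reflex_mode(task):
        return "reflex"
    if fast_mode(task, known_tasks):
        return "fast"
    full_mode(task)
    return "full"

print(run("accept_cookies", set()))           # handled by the macro
print(run("search_pizza", {"search_pizza"}))  # cached trajectory, no screenshot
print(run("order_new_dish", set()))           # falls back to the slow path
```

The ordering matters: the cheap deterministic paths run first, and the expensive LLM-driven path is only the safety net.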
So it was almost like the slow way was the plan B. In case it couldn't do it fast, it would do it slow, and you would give it a little bit more reasoning, or you would give it the screenshots, and it was more thorough. I think what's interesting, if you try to understand where we think these things go, is that we start with a set of tools and frameworks that aren't really made to interface with the web necessarily. But through those three modes that Chiara just explained,
we were able to have it access that learned action space for websites it was then sort of, quote unquote, familiar with, because of those trajectories that it knew were successful on specific sites for specific tasks. Yeah. Right.
And I think that's something that, if we think about our world, you would want to have. When we go to websites we know, we're familiar, we can navigate, right? We've all been to booking.com; you go there, you know exactly what to do. Or you've ordered your food many times, and you don't need to rediscover that page. And as these agents are able to build that persistent, learned intuition about a website, they become...
experts. Well, first familiar, then experts. And they can get you to your desired output much, much faster. And I know that I'm still trying to separate what's different and what's newer with the web agents. With that learned experience, something that it's seen, how are you saving it, and how are you making sure that the agent has access to it? Are you throwing it in a database? Are you caching it? What does that look like?
I think the storage itself doesn't matter that much as long as this is something that is privacy compliant and doesn't lead to leaking user information. But you're storing the path that it took?
Or you're storing the action? What are you... Because you're not storing the screenshot and then loading that up again, right? No, we store all the paths and the state that the page was in. So the agent knows what the DOM looks like, what the elements it can click on are. Another thing I didn't mention before is that we did some work to understand how to clean this DOM, because there is a lot of information there.
But the agent doesn't need all of it. It should get as little information as possible. So we only took the elements that were clickable, for instance, and we combined this with a screenshot. In fast mode, the agent could be a bit more blind, let's say, and still know where to click, because the task was really similar. Nice. Now, the other thing that I think we wanted to talk about was the...
differences between planning and execution and the models that you use for each of these. Because we know that there are the reasoning models that you probably are using for the planning, but then do you offload that onto a model that is smaller and just executes, or is it fine-tuned? What does it look like? So for this specific task, we used foundation models. We did experiments with
all the major foundation models. And we saw, of course, some differences. For the planner, of course, it helps to have a model that is good at planning, like o1, for instance. And for execution itself, you don't really need a model that is good at planning, because as long as it knows what to do, it's a very limited action space, right? It's an important pattern
that's emerging, this separation of the planning and the execution,
as you start to interface with the world, right? We're talking about the web now, of course, as one of those interfaces. Planning itself requires a lot of reasoning, understanding the intent of the user. Doing the execution basically means I need to know the action space really well, and be able to translate that plan into the action space of my world, which in the broader sense could be the web, or it could be a domain like iFood,
or OLX, or PayU, basically any of those websites that we know and understand well, which the execution agent then needs to navigate to get to that outcome successfully. Is that where the simulations were coming in? Tell me more about what the simulations were and how those helped.
In terms of the simulations, I think what is also really nice with the web, which is different from other places we've been applying LLMs, is that you can just send these agents out to go and explore the websites. Like a web crawler, right? Essentially. You can basically say, go and find me a blue couch in Warsaw on OLX. And it can then go and explore. And as long as we've defined what success looks like,
like, for example, found the couch, or added it to cart, or whatever it is, then we can do this a hundred times, and it learns which trajectories are most likely to get it to that state. And that's where the simulation comes in; it's more exploration to learn what these websites can and cannot do,
that allow you to get to a system that actually is really good at executing within your catalog or your e-commerce environment. It's basically like you're mapping out the space. Correct. And then once you have the map, you can traverse it easier.
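That "run it a hundred times against a success condition" loop might look roughly like this. Everything here is an illustrative stub, including the fake rollout and the 70% success probability, seeded for reproducibility:

```python
# Sketch of the exploration loop: replay a task many times against a
# (stubbed) environment and keep the trajectories that reached success.
import random

def attempt_task(rng):
    """Stub for one agent rollout; returns (succeeded, trajectory)."""
    trajectory = ["open_site", "search"]
    if rng.random() < 0.7:  # pretend the agent usually finds the item
        trajectory += ["open_item", "add_to_cart"]
        return True, trajectory
    return False, trajectory + ["stuck_in_loop"]

rng = random.Random(0)  # seeded so the run is reproducible
successes = [traj for ok, traj in (attempt_task(rng) for _ in range(100)) if ok]
print(f"success rate: {len(successes)}/100")
print("a learned trajectory:", successes[0])
```

The stored successful trajectories are exactly what the fast mode described earlier replays instead of reasoning from a screenshot.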
Well, no, I think we're seeing it in computer use similarly that people are starting to map out the applications, right? So if you were to know every button on Excel or on Word or the commonly used apps, and if I go to you and say, hey, you know, please make me a presentation in dark mode with such and such font background,
and you've done PowerPoint many times, you know where to go. And I think that kind of mapped action space is something you can simulate, essentially, because you just go and have your agents explore apps. Are you using hotkeys as tools? You could. Yeah. Tab, for instance, is a very useful hotkey, to know all the actions. You can do most of the things on a web page through the Tab key. Yeah.
This is also one of the reasons why it's good to separate planner and execution: execution only has limited tools and only needs to understand whether it has finished or not. So we would have a planner telling the executor a very specific task. For instance, if I would search for a t-shirt on OLX, the first thing would be: open the OLX page, search in the bar, and so on.
And the executor would take all these subtasks, execute them, stop when it was finished, and give the response to the main agent, which would then process it and decide what to do next. Oh, nice. So, yeah, you can map the space and give better information to the execution agent from both sides, basically, both the planner and execution side.
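The planner/executor split can be sketched as a tiny loop: the planner (a strong reasoning model in practice) emits subtasks, and the executor (a cheaper model or deterministic code) works through them. Both are stubbed here, with a canned plan for the example goal:

```python
# Sketch of the planner/executor split. In practice both functions would
# wrap LLM calls; here they are stubs so the control flow is visible.
def planner(goal):
    # A strong model would produce this plan; here it is hard-coded.
    return ["open OLX homepage", "type 't-shirt' in search bar",
            "press Enter", "open first result"]

def executor(subtask):
    # A small model (or plain code) with a limited tool set runs each step.
    return f"done: {subtask}"

def run(goal):
    results = [executor(s) for s in planner(goal)]
    return results  # handed back to the main agent to decide what's next

for line in run("buy a t-shirt on OLX"):
    print(line)
```

The design point is that the executor never needs to understand the user's intent, only whether its own small step finished.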
So I think there's a lot of room for improvement once we get more data. All right, so now let's talk about some of the frameworks that you used. You did mention WebVoyager. You also mentioned MultiOn. I imagine there were things that you really liked from some of these. And is there anything that stands out to you from one of these web frameworks that was interesting,
a particular, maybe novel way of doing things, or a good approach that you feel like you brought back into the framework that you created? So something I really liked was WebVoyager. It was the simplest of the frameworks; many of the others built on top of it. But the separation between planner and executor was really clear. The executor didn't have many
tools available, just basic web interaction through an SDK. In that case it was using Selenium, which is a testing tool; we decided to choose another one. But the strength there is its simplicity. So yeah, I think that's really powerful, and I also liked the visual approach,
because the DOM does not always take you in the right direction; it could have misleading information in there. But what the user sees is what is important at the end of the day. So, yeah, that approach I think is really...
is really useful. Other frameworks built on top of that and added more complexity in terms of the planner. For instance, we saw Agent-E, which added a more hierarchical type of planning, which increases the success rate. It does, but it also makes the task more difficult and slower to execute. We saw other approaches, like Monte Carlo Tree Search for planning; this is an open source project from MultiOn as well.
At the end, we decided to choose the simplest possibility, because our task was clear. We knew what we had to do, and we ended up using this agent sort of as an API. We created code that was very modular, so we could delegate things to a web agent
without knowing what it was doing. It was kind of a black box within our application; it would give us the response, and with that we could take action and interact with the user, because at the end that's what's important. Basically, we're six months into the future of your journey. Knowing what you know now, what would you tell yourself six months ago about this whole journey?
A lot of things. So let's start with, I think the most important one is to really understand what is the problem you're trying to solve. Dive in, try to do the things yourself because web agents are automating tasks. So try to do it yourself. Try to see what are the pain points. I would explore all possibilities that are around, but reminding myself that
these tools are not necessarily what I need. The other thing that helps is to approach this as a software engineering problem rather than data science. And I say that because I come from a data science background, so this was really, really big for me. What does that mean? What's the difference? Yeah, I'll come to that. So this is
not a data science project, but a software engineering project with some LLM steps in it. And this means you can adopt all the good practices of software engineering, like keeping things modular, separating responsibilities, and keeping things as simple as possible, trying to have more control.
I think you discussed this with the SQL agent; it's even more important there. But it's important to understand where you need the agent and where you don't, and to try to limit the number of LLM calls as much as possible.
Why do I say this? Because when you approach these agent projects, it's really tempting to use all these frameworks with a high level of abstraction and do everything through an agent. But this is not the right way to do it. I mean, it's nice for a proof of concept, to play around, but if you need to build something that works, you need to have control.
So it really helps to think of different modules. So I have a planner that needs to think very well, but the execution part doesn't need to be done necessarily by an agent. Like if there are things that can go through a deterministic approach, it's much better.
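For instance, a deterministic step might replace a whole database-facing agent: fetch the data in plain code and inject it into the prompt. A minimal sketch, where the user store and prompt template are made up for illustration:

```python
# Sketch of deterministic context injection: pull user data with plain
# code and put it in the prompt, instead of giving the LLM a DB tool.
USER_DB = {"u1": {"diet": "vegetarian", "city": "Amsterdam"}}

def fetch_user_context(user_id):
    # Plain code path: no agent, no tool-choice step, fully predictable.
    u = USER_DB.get(user_id, {})
    return (f"User is {u.get('diet', 'unrestricted')} "
            f"and lives in {u.get('city', 'an unknown city')}.")

def build_prompt(user_id, request):
    # The context is injected directly; the LLM never has to fetch it.
    return f"{fetch_user_context(user_id)}\nTask: {request}"

print(build_prompt("u1", "order dinner for tonight"))
```

One LLM call with the right context beats an extra agent deciding whether and how to query a database.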
An example: in this tool we had to pull user information because we had to understand user background, whether they had dietary restriction for instance, things like that. We didn't always need it, but most of the times we needed it. And I very naively built an agent that could interact with the database, retrieve the information,
But actually, we didn't need to do that. Why would you use an agent if you can just pull the data and add it to the context, to the prompt? So that was a big revelation for me, because it made things much simpler and gave us control. So make use of the frameworks where you need them, but keep in mind that
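The simplification Chiara describes, replacing a database agent with a plain lookup whose result is injected into the prompt, might look something like this sketch. The `get_user_profile` helper, the in-memory profile store, and the prompt shape are all hypothetical stand-ins, not the team's actual code:

```python
# Hypothetical sketch: fetch user context deterministically and put it in
# the prompt, instead of giving the LLM a database tool it must decide to call.

def get_user_profile(user_id: str) -> dict:
    # Stand-in for a plain database query -- no LLM involved.
    profiles = {"u42": {"address": "123 Main St", "dietary": ["vegetarian"]}}
    return profiles.get(user_id, {})

def build_prompt(user_id: str, request: str) -> str:
    profile = get_user_profile(user_id)  # deterministic, always runs
    return (
        f"User profile: {profile}\n"
        f"User request: {request}\n"
        "Plan the e-commerce task using the profile above."
    )

prompt = build_prompt("u42", "Order dinner for tonight")
print(prompt)
```

The point is that this step never needs a tool-calling loop: the data is always fetched, so there is no decision for the model to get wrong.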
You can go low level and have more control over the parts that are important. And test things, try to find edge cases, try to find what doesn't work and have fun. That's what you would have told yourself. You didn't have fun? I had a lot of fun. The idea of where to use an agent and when to use it is...
That's really a fascinating point because like you're saying, you can build an agent to do this thing, but if you can do it without an agent, it's going to be more predictable. I've heard the other side of the argument be,
I can prototype an agent or I can create an agent so fast, it's almost faster if I do it through an agent versus if I do it through traditional software development. Have you seen or do you have thoughts on that? I mean, I think in general for prototyping, that's definitely true. You can prototype a lot of stuff very quickly. But at the end of the day, our main job is to take something that we can see work and scale it.
So our next step, because of the size of our platforms, is always to scale it to tens, hundreds of millions of users. And there, deterministic workflows are much preferred, if you can have them, or at least narrowed-down ones. For so many reasons, right? So I think we tend to, yes, prototype in any way we can quickly. But then after that, to Chiara's point...
we need to distill the actual essence of the system into one that can be put into production and scaled. And often that may still include function calling and agentic components, but it's not as free-rein as maybe when we're just starting to explore. Yeah, yeah, yeah. Did you create any benchmarks or particular evals for this project?
Yeah, we chose some tasks that were representative of typical e-commerce interactions and tried to optimize for those. Then we added variations of those tasks, trying different possibilities, until we were happy with it.
Did you have a certain accuracy score that you needed to get above? So our target was 80%, yeah. Okay. But of course it heavily depends on the task and on the website and on the user itself because it really depends what the user wants and needs.
like how specific the request is as well. So of course there is a whole planning step where the agent talks with the user and tries to understand whether it has all the information. And then the other part, the execution needs to have all the right information to be able to perform the task. So this was very important.
For instance, you cannot order food if you don't have an address. The user has to be willing to provide the address. So this was a challenge. And to go back to the deterministic approach...
There is some deterministic component in here as well, because if I need to perform a certain task, I need to inform the web agent. I need to give it the right information to be able to do that. And that's deterministic. The planner needs to know that it needs to provide an address, a list of dishes, and so on. And yeah, that's really important, and it increases accuracy a lot.
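The deterministic gate described here, where the planner must hand the executor a complete set of required information before the web task starts, could be sketched like this. The task name and field names are illustrative, not a real schema:

```python
# Sketch: deterministic check that the planner has gathered everything the
# executor needs before a web task is attempted. Field names are illustrative.

REQUIRED_FIELDS = {"order_food": ["address", "dishes"]}

def missing_info(task: str, collected: dict) -> list:
    """Return the required fields the planner still has to ask the user for."""
    return [f for f in REQUIRED_FIELDS.get(task, []) if not collected.get(f)]

state = {"address": "123 Main St"}  # user hasn't picked dishes yet
gaps = missing_info("order_food", state)
print(gaps)  # ['dishes'] -- the planner goes back to the user for these
```

Because this check is plain code rather than an LLM judgment, the executor can never be launched without an address, which is exactly the failure mode being avoided.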
We tried both ways and that's definitely better. I think one way to kind of see what we concluded from doing all this work is that, you know, on the one hand, I think if we just take an agent and want to go to, you know, one of our existing websites or platforms, we can get pretty far.
But I think there's a sweet spot where we can do more, because there are a lot of things we don't display on websites that may be useful to help a user going through an e-commerce journey. To give you an example: when we were at OLX, which is a classifieds space, you know, buyers and sellers of secondhand goods, we as OLX know a lot.
We know what the reputations are of certain sellers. We know the location of people. We know what kind of things they've searched in the past. We know what the supply and demand are. We know what reasonable prices are for categories. Those are not things that necessarily an agent has access to if they just go to the website of a marketplace. But if we're building the agent system,
as the marketplace, and we've got access to all those rich marketplace dynamics, information, and customer reviews that are certainly relevant at the moment of going through a transaction, that's where I think we can create really useful agents. And that's certainly a conclusion we took away, because, of course, the experimenting we're doing here is purely from the outside in. But if we combine that and build
an agent that is integrated with a platform and is available to the user at the moment they want to find or exchange things, that's going to create a completely new AI-first e-commerce experience. And we'll hopefully be able to talk about some of that. Well, it goes back to building your website for humans or for agents, because you can also expose that data for other agents to see and
use, or you can choose not to expose it. And I know that's a debate topic that we're going to have too, because it's like, well, if this is useful for me, then it might be useful for other agents. And if we
expose it in a way that a human's not going to see it, but an agent using the website will, I don't know how that would look. Because if it's not exposed in the GUI and you're using a web agent, the web agent has access to the GUI, right? Or it also has access to the DOM. Yeah. So maybe you put it there and that exposes it. We saw some websites have started to add markdown, for instance, with a description of the page. Oh, nice. That's already really helpful.
So I see some progress in this direction. This helps especially with e-commerce, because you might have a lot of items on a page. Yeah. So it's just much faster if the agent can load them. Yeah. The lessons we learned: we had to be very specific with the instructions and break them down as much as possible. So limit the amount of thinking that the executor has to do and delegate that to the planning agent. So have the instructions as detailed as possible, break them down into steps, try to make use of all the tools you have available, but select them in a smart way, like
if you only need to do a certain interaction on the page, just make only those tools available to the agent. It doesn't need to have the whole tool space. I've learned so much in these six months. If I look back six months ago, I'm a totally different person now. So...
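That tool-selection advice, exposing only the tools a given step actually needs rather than the agent's full toolbox, could be sketched like this. The tool names and step categories are illustrative, not a real web-agent API:

```python
# Sketch: narrow the agent's action space per step. The planner decides which
# page interactions a step requires; the executor only sees those tools.

ALL_TOOLS = {
    "click": lambda sel: f"clicked {sel}",
    "type_text": lambda sel, text: f"typed '{text}' into {sel}",
    "scroll": lambda px: f"scrolled {px}px",
    "read_dom": lambda: "<html>...</html>",
}

def tools_for_step(step: str) -> dict:
    # Illustrative mapping from step type to the tools it needs.
    needed = {"fill_form": ["click", "type_text"], "inspect": ["read_dom"]}
    return {name: ALL_TOOLS[name] for name in needed.get(step, [])}

form_tools = tools_for_step("fill_form")
print(sorted(form_tools))  # ['click', 'type_text']
```

A smaller tool list gives the executor fewer ways to wander, which is the same "limit the thinking the executor has to do" idea in code form.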
Well, we're always looking for more smart people. So interns, others, if you want to come check out what we're doing, reach out to us. Nice. Yeah. Really smart people, except for some of them that sit at this table. That's awesome. You too, Dimitris. That's why we work together. Exactly. Yeah. Here we go. The Process AI team is hiring and you can find all the links to everything you need to know in the show notes below.