If 50% of traffic is already bots, it's already automated, and agents are only really just getting going. Most people are not using these computer-use agents because they're too slow right now. They're still previews, but it's clear that's where everything is going. Then we're going to see an explosion soon
in the traffic that's coming from these tools, and just blocking them because they're AI is the wrong answer. You've really got to understand why you want them, what they're doing, who they're coming from, and then you can create these granular rules. Thanks for listening to the A16Z AI podcast. If you've been listening for a while, or if you're at all plugged into the world of AI, you've no doubt heard of AI agents and all the amazing things they theoretically can do. But there's a catch.
When it comes to engaging with websites, agents are limited by what any given site allows them to do. If, for example, a site tries to limit all non-human interactions in an attempt to prevent unwanted bot activity, it might also prevent an AI agent from working on a customer's behalf, say, making a reservation, signing up for a service, or buying a product.
This broad strokes approach to site security is incompatible with the idea of what some call agent experience, an approach to web and product design that treats agents as first class users.
In this episode, A16Z Infra partner Joel De La Garza dives into this topic with David Mytton, the CEO of Arcjet, a startup building developer-native security for modern web frameworks, including attack detection, signup spam prevention, and bot detection. Their discussion is short, sweet, and very insightful. And you'll hear it after these disclosures.
As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. For more details, please see a16z.com slash disclosures.
It seems like what once was old is new again, and I'd love to get your thoughts on this new emergence of bots: how, while we know all the bad things that happen with them, there's actually a lot of good and really cool stuff happening, and how we can maybe work towards enabling that. Right. Well, things have changed, right? The DDoS problem is still there. Yeah.
but it's just almost handled as a commodity these days. The network provider, your cloud provider, they'll just deal with it. And so when you're deploying an application, most of the time you just don't have to think about it.
The challenge comes when you've got traffic that just doesn't fit those filters. It looks like it could be legitimate, or maybe it is legitimate, and you just have a different view about what kind of traffic you want to see. And so the challenge is really about how do you distinguish between the good bots and the bad bots? And then with AI changing things, it's bots that might even be acting on behalf of humans, right? It's no longer a binary decision.
And as the amount of traffic from bots increases, in some cases the majority of traffic that sites are receiving is from an automated source. And so the question for site owners is, well, what kind of traffic do you want to allow? And when it's automated, what kind of automated traffic should come to your site? And what are you getting in return for that? And in the old days, I mean, I guess the old providers, we'll say, the legacy providers in this space,
Like it was very much using a hammer, right? So they would say, hey, if this IP address is coming in, it's probably a bot. Or they would say, if this user agent is coming in, it's probably a bot. Very imprecise. And I think the downside of that is that you probably blocked a lot of legitimate traffic along with the illegitimate traffic. And now there are very real consequences, because some of these AI bots could be actual users. They're acting on behalf of people who are looking to purchase your products.
This is the challenge. So a volumetric DDoS attack, you just want to block that at the network. You never want to see that traffic. But everything else needs the context of the application. You need to know where in the application the traffic is coming to. You need to know who the user is, the session, and to understand in which case you want to allow or deny that.
And so this is the real issue for developers, for site owners, for security teams, is to make those really nuanced decisions to understand whether the traffic should be allowed or not. And the context of the application itself is so important because it depends on the site. If you're running an e-commerce operation, an online store,
The worst thing you can do is block a transaction because then you've lost the revenue. Usually you want to then flag that order for review. A human customer support person is going to come in and determine, based on various signals, whether to allow it. And if you just block that at the network, then your application will never see it. You never even know that that order failed in some way. There's been a lot of media releases about companies that have released
solutions in this space, but largely they were based on sort of those old kind of approaches using network telemetry.
Is that generally how they're working now? Or are there some other capabilities that they've released? Because they give them AI names and you just immediately assume that they're doing something fancy. That's right, yeah. So blocking on the network is basically how the majority of these old-school products work. They do analysis before the traffic reaches your application, and then you never know what the result of that was. And that just doesn't fly anymore. It's insufficient for building modern applications.
Particularly with AI coming in, where something like OpenAI has four or five different types of bots, and some of them you might want to make a more restrictive decision over, but then others are going to be taking actions on behalf of a user's search. And we're seeing...
Lots of different applications getting more signups, businesses actually getting higher conversions as a result of this AI traffic. And so just blocking anything that is called AI is too blunt of an instrument. You need much more nuance. And the only way you can do that is with the application context, understanding what's going on inside your code. I mean, I'd say we're seeing across the industry,
that AI is driving incredible amounts of new revenue to companies. And if you use an old-world tool to just block any of that traffic, you're probably hurting your business. That's right. Or you're putting it into some kind of maze where it's seeing irrelevant content. And then by doing that, you are kind of downranking your site, because the AI crawler is never going to come back.
It's kind of like blocking Google from visiting your site. Yeah, Google doesn't crawl you, but then you're no longer in Google's index, and so anyone searching is not going to find you as a result. Well, and I believe we had sort of standards in the old days that developed, or quasi-standards, like robots.txt.
Right, which would tell the crawlers, hey, don't crawl these directories. Are we doing something similar for this new, agentic world? So robots.txt is still the starting place. And it's kind of a voluntary standard. It emerged several decades ago now, so it's been around a long time. Bots have been a problem for a long time. And the idea is that you describe the areas of your application...
and tell any robot that's coming to your site whether you want to allow that robot to access that area of the site or not. And you could use that to control the rollout of new content, you could protect certain pages of your site that you just don't want to be indexed for whatever reason.
And you can also point the crawler to where you do want it to go. You can use the sitemap for that as well. But the robots.txt file format has evolved over time to provide these signals to crawlers like search engines from Google and so on.
The challenge with that is it's voluntary and there's no enforcement of it. And so you've got good bots like Googlebot that will follow the standard and you'll be able to have full control over what it does. But there are newer bots that are ignoring it or even sometimes using it as a way to find the parts of your site that you don't want it to access and they will just do that anyway.
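To make that concrete, here's a minimal sketch of the voluntary layer being described: a robots.txt with per-crawler rules, served from a small TypeScript handler. The crawler tokens follow OpenAI's published names, but the paths and rules are purely illustrative assumptions.

```typescript
// A minimal sketch: serving a robots.txt that treats different crawlers differently.
// The user-agent tokens (GPTBot, OAI-SearchBot, ChatGPT-User) follow OpenAI's
// published crawler names; check current documentation before relying on them.
import { createServer } from "node:http";

const robotsTxt = `# Keep the training crawler out entirely
User-agent: GPTBot
Disallow: /

# Allow the search-index crawler, but keep it away from account pages
User-agent: OAI-SearchBot
Disallow: /account/

# Allow on-demand fetches made on behalf of a user (empty Disallow = allow all)
User-agent: ChatGPT-User
Disallow:

# Default rule for everyone else
User-agent: *
Disallow: /admin/

Sitemap: https://example.com/sitemap.xml
`;

createServer((req, res) => {
  if (req.url === "/robots.txt") {
    res.writeHead(200, { "Content-Type": "text/plain" });
    res.end(robotsTxt);
    return;
  }
  res.writeHead(404);
  res.end();
}).listen(3000);
```

The catch, as just noted, is that none of this is enforced; a crawler that ignores the file sees no difference at all.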
And so this becomes a control problem for the site owner. And you really want to be able to understand not just what the list of rules are, but how they are enforced. Totally. Maybe it'd be great to walk through...
Right.
So if we think about OpenAI as an example, because they have four or five different crawlers, and they all have different names and will identify themselves in different ways. So one is actually crawling to train the OpenAI models on your site. And that's the one that probably everyone is thinking about when they think, I want to block AI training. And you have different philosophical approaches to how you want to be included in the training data.
The others are more nuanced and require more thought. So there's one that will go out when a user is typing something into the chat and asks a question, and OpenAI will go out and search. It's built up its own search index. And so that's the equivalent of Googlebot. You probably want to be in that index because, as we're seeing...
Sites are getting more signups, they're getting more traffic. The discovery process, being part of just another search index, is super important. Gotcha. So like when I ask OpenAI, when is John F. Kennedy's birthday? If it doesn't know the answer, it goes out and searches the web. Yeah, that's right. Or if it's trying to get opening hours for something, it might go to a website for a cafe or whatever and parse it and then return the results. So that's really just like a classic search engine crawler, except it's kind of happening behind the scenes.
The other one is something that's happening in real time. So you might give the agent a specific URL and go and ask it to summarize it or to look up a particular question in the docs for a developer tool or something like that. And then that's a separate agent that will go out, it will read the website, and then it will return and answer the query. For both of these two examples,
OpenAI and others are now starting to cite those sources. And you'll regularly see, and this is kind of the recommendation, that you get the result from the AI tool, but you shouldn't trust it 100%. You go and verify, and you look at the docs. Maybe it's like when you used to go to Wikipedia: you'd read the summary, then you'd look at the references, and you'd go to all the references and check to make sure what had been summarized was actually correct. But all three of those examples...
you clearly could see why you would want them accessing your site. Right. Like blocking all of OpenAI's crawlers is probably a very bad idea. Yeah, it's too blunt. It's too blunt an instrument. You need to be able to distinguish each one of these and determine which parts of your site you want them to get into. And this then comes to the fourth one, which is the actual agent.
This is the agent, the computer-use, operator-type feature. Headless web browsers. Headless web browsers, yeah. But even a full web browser operating inside a VM. And those are the ones that require more nuance, because maybe you're booking a ticket or doing some research and you do want the agent to take actions on your behalf. Maybe it's going through your email inbox and triaging things.
From the application builder's perspective, that's probably a good thing. You want more transactions, you want more usage of your application. But there are examples where it might be a bad action. So for example, if you're building a tool that is going to try and buy all of the concert tickets and
and then sell them on later, that becomes a problem for the concert seller because they don't want to do that. They want the true fans to be able to get access to those. And again, you need the nuance. Maybe you allow the bot to go to the homepage and sit in a queue. But then when you get to the front of the queue, you want the human to actually make the purchase and you want to rate limit that so that maybe the human can only purchase, let's say, five tickets. You don't want them to purchase 500 tickets. And so this gets into the real details of the context of each one about what you might want to allow and what you might want to restrict.
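As a hedged sketch of what those granular rules could look like in application code: the client categories, routes, and the five-ticket cap below echo the example in the conversation, but the structure is hypothetical, not any particular vendor's API.

```typescript
// Hypothetical per-route rules for automated clients. The categories, routes,
// and limits are illustrative only; a real system combines many more signals.
type ClientKind = "human" | "verified-agent" | "unverified-bot";

interface RequestContext {
  kind: ClientKind;
  path: string;
  ticketsRequested?: number;
}

function decide(ctx: RequestContext): "allow" | "deny" | "flag-for-review" {
  // Agents may browse the homepage and wait in the queue...
  if (ctx.path === "/" || ctx.path === "/queue") {
    return ctx.kind === "unverified-bot" ? "deny" : "allow";
  }

  // ...but checkout is reserved for humans, with a per-order cap.
  if (ctx.path === "/checkout") {
    if (ctx.kind !== "human") return "deny";
    if ((ctx.ticketsRequested ?? 0) > 5) return "flag-for-review";
    return "allow";
  }

  return "allow";
}

console.log(decide({ kind: "verified-agent", path: "/queue" }));                   // allow
console.log(decide({ kind: "verified-agent", path: "/checkout" }));                // deny
console.log(decide({ kind: "human", path: "/checkout", ticketsRequested: 6 }));    // flag-for-review
```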
That's incredibly complicated. I mean, if I remember back, why we made a lot of the decisions we made in blocking bots was strictly because of scale. So, you know, you've got 450,000 IP addresses sending you terabits of traffic through a link that can only do a gigabit, and you've got to just start dropping stuff, right? And, you know, it's battlefield triage of the wounded, right? It's like, some of you aren't going to make it, and it becomes a little brutal. That sounds incredibly sophisticated.
How do you do that sort of fine-grained control of traffic flow at internet scale? So this is about building up layers of protections. So you start with the robots.txt, just managing the good bots.
Then you look at IPs and start understanding where's the traffic coming from. In an ideal scenario, you have one user per IP address, but we all know that that doesn't happen. That never happens. And so you can start to build up databases of reputation around the IP address and you can access the underlying metadata about that address, knowing which country it's coming from or which network it belongs to.
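Here's a small sketch of what that lookup layer might return, with a hard-coded table standing in for whatever GeoIP/ASN database or reputation feed is actually used; the field names are assumptions for illustration.

```typescript
// Sketch of the IP metadata/reputation lookup described above. The table is a
// stand-in for a real GeoIP/ASN database or reputation feed.
interface IpMetadata {
  country: string;
  asn: number;                                       // which network it belongs to
  network: "residential" | "datacenter" | "unknown";
  reputation: number;                                // 0 (bad) .. 1 (good), built up over time
}

const demoDb = new Map<string, IpMetadata>([
  ["203.0.113.7", { country: "US", asn: 64500, network: "datacenter", reputation: 0.2 }],
  ["198.51.100.2", { country: "GB", asn: 64501, network: "residential", reputation: 0.9 }],
]);

function lookupIp(ip: string): IpMetadata {
  // Unknown IPs get neutral defaults rather than being trusted outright.
  return demoDb.get(ip) ?? { country: "??", asn: 0, network: "unknown", reputation: 0.5 };
}

console.log(lookupIp("203.0.113.7")); // data-center IP with poor reputation
```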
And then you can start building up these decisions thinking, well, we shouldn't really be getting traffic from a data center for our signup page. And so we could block that network. But it becomes more challenging if we have that agent example. The agent with a web browser or headless browser is going to be running on a server somewhere. It's probably in a data center. And then you have the compounding factor of,
The abusers will purchase access to proxies which run on residential IP addresses. So you can't easily rely on the fact that it's part of a home ISP block anymore. And so you have to build up these patterns, understanding the reputation of the IP address. Then you have the user agent string.
It's basically a free text field that you can fill in with whatever you like. There is kind of a standard there, but the good bots will tell you who they are. It's been surprising getting into the details of this, how many bots actually tell you who they are. And so you can block a lot of them just on that heuristic combined with the IP address. Or allow them.
Or allow them. Yeah, I'm the shopping bot from OpenAI. Right. Come on in, buy some stuff. Exactly. And Googlebot, OpenAI, they tell you who they are. And then you can verify that by doing a reverse DNS lookup on the IP address. So even though you might be able to pretend to be Googlebot, you can check to make sure that that's the case or not with very low latency lookups. So we can verify that, yes, this is Google. I want to allow them. Yes, this is the OpenAI bot that is doing the search indexing. I want to allow that.
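For the verification step, here's a minimal sketch of that double lookup using Node's built-in resolver: reverse-resolve the IP, check the hostname is on an expected domain, then forward-resolve and confirm it maps back to the same IP. Google documents this pattern for Googlebot; other crawlers may publish IP ranges instead, so the allowed suffixes here are just an example.

```typescript
// Minimal reverse-DNS verification sketch using Node's resolver. Lookup
// failures are treated as "not verified" rather than throwing.
import { reverse, resolve4 } from "node:dns/promises";

async function isVerifiedGooglebot(ip: string): Promise<boolean> {
  try {
    // 1. Reverse lookup: the PTR record should point at a Google-owned domain.
    const hostnames = await reverse(ip);
    const host = hostnames.find(
      (h) => h.endsWith(".googlebot.com") || h.endsWith(".google.com"),
    );
    if (!host) return false;

    // 2. Forward lookup: the hostname must resolve back to the same IP,
    //    otherwise anyone controlling their own PTR record could spoof it.
    const forward = await resolve4(host);
    return forward.includes(ip);
  } catch {
    return false;
  }
}

// Usage: cache the result per IP so the lookups stay off the hot path.
isVerifiedGooglebot("66.249.66.1").then((ok) => console.log(ok));
```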
The next level from that is building up fingerprints and fingerprinting the characteristics of the request. And this started with the JA3 hash, which was invented at Salesforce and has now been developed into JA4. Some of these algorithms are open source, some of them are not. So essentially, you take all of the metrics around a session, you create a hash of it, and then you stick it in a database. Exactly. And you look for matches to that hash. You look for matches. And then the idea is that the hash will change based on the client. So you can
allow or deny certain clients. But if you have a huge number of those clients all spamming you, then they're all the same. They all have the same fingerprint, and you can just block that fingerprint. So this is almost like, if you think of, you know, I always think of things in terms of the classic sort of network stack, like layer zero up to layer seven. Like this is almost like layer two
level identity for devices, right? Right. It's looking at the TLS handshake on the network level, and then you can go up the layers. There's one called JA4H, which looks at the HTTP headers. And the earlier versions of this would be working on the ordering of the headers, for instance. So an easy way to work around it was just to shift the headers. The hashing has improved over time, so that even changing the ordering of the headers doesn't change the hash. Yeah.
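To make the fingerprinting idea concrete, here's a deliberately simplified sketch in the spirit of JA3/JA4H: hash a normalized view of the request (here, just the sorted header names) and use the digest as a key for reputation counting. The real algorithms hash specific TLS ClientHello and HTTP fields; this only illustrates the "same client, same hash" property and the order-insensitivity mentioned above.

```typescript
// A deliberately simplified fingerprint in the spirit of JA3/JA4H. The real
// algorithms hash specific TLS ClientHello / HTTP fields; this just shows the
// idea: same client characteristics => same hash, regardless of header order.
import { createHash } from "node:crypto";

function headerFingerprint(headers: Record<string, string>): string {
  // Use the set of header names as the material, sorted so that simply
  // reordering headers does not change the hash.
  const material = Object.keys(headers)
    .map((name) => name.toLowerCase())
    .sort()
    .join(",");
  return createHash("sha256").update(material).digest("hex").slice(0, 16);
}

// Count how often each fingerprint shows up; a flood of requests sharing one
// fingerprint is a strong signal of a single automated client you can block.
const seen = new Map<string, number>();

function record(headers: Record<string, string>): number {
  const fp = headerFingerprint(headers);
  const count = (seen.get(fp) ?? 0) + 1;
  seen.set(fp, count);
  return count;
}

record({ Host: "example.com", "User-Agent": "curl/8.5.0", Accept: "*/*" });
console.log(record({ Accept: "*/*", Host: "example.com", "User-Agent": "curl/8.5.0" })); // 2: same fingerprint
```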
And the idea is that you can then combine all of these different signals to try and come to a decision about who it is, basically, that's making the request. And if it's malicious, you can block it based on that. And if it's someone that you want to allow, then you can do so. And this is before you even get into kind of the user level, what's actually happening in the application, right? That's right. Yeah. So this is the logic on top of that. Yeah.
Because you have to identify who it is first before you apply the rules about what you want them to do. Gotcha. So it's almost like you're adding an authentication layer or an identity layer to sort of the transport side. That's right. Yeah. And the application side, I guess I should say. Yeah, the application. Yeah. But it's throughout the whole stack, the whole OSI model. And the idea is you have this consistent fingerprint that you can then apply these rules to. And identity kind of layers on top of that.
And we've seen some interesting developments in fingerprinting and providing signatures based on who the request is coming from. So a couple of years ago, Apple announced Privacy Pass, which is a hash that is attached to every request you make. If you're in the Apple ecosystem and using Safari on iPhone or on Mac, then there is a way to authenticate that the request is coming from an individual who has a subscription to iCloud.
And Apple has their own fraud analysis to allow you to subscribe to iCloud. So it's an easy assumption to make: if you have a subscription and this signature is verified, then you're a real person.
There's a new one that Cloudflare recently published around doing the same thing for automated requests: having a fingerprint that's attached to a signature inside every single request, which you can then use public key cryptography to verify. These are all emerging as the problem of being able to identify automated clients increases.
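The general shape of that verification is plain public-key cryptography: the bot signs some request material with its private key and publishes the public key, and the site checks the signature before trusting the identity claim. The signed material and layout below are illustrative, not the actual wire format of Cloudflare's proposal or of HTTP Message Signatures.

```typescript
// The general shape of verifying a signed request: the bot signs request
// material with its private key and publishes the public key; the site verifies.
// The signed material here is illustrative, not a real wire format.
import { generateKeyPairSync, sign, verify } from "node:crypto";

// Stand-in for a key pair whose public half the bot operator would publish.
const { publicKey, privateKey } = generateKeyPairSync("ed25519");

// What the bot would send: the request line plus a detached signature.
const signedMaterial = Buffer.from("GET /docs/quickstart host=example.com");
const signature = sign(null, signedMaterial, privateKey);

// What the site does on receipt: recompute the material and verify the signature
// against the public key fetched from the bot operator's well-known location.
const isAuthentic = verify(null, signedMaterial, publicKey, signature);
console.log(isAuthentic); // true => this request really came from the key holder
```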
Because you want to be able to know who the good ones are, to allow them through whilst blocking all the attackers. Yeah, it's just like the old days with Kerberos, right? Every large vendor is going to have their flavor. Right. And if you're a shop and you're trying to sell to everybody, you've got to kind of work with all of them. That's right. And you just need to be able to understand: is this a human, and is our application built for humans? And then you allow them. Or is it that we're building an API, or do we want to be indexed and we want to allow this traffic? It's just
giving the site owner the control? Yeah, I mean, I think what's really interesting to me is that in my own use and in my own life, I interact with the internet less and less directly, like almost every day. And I'm going through some sort of AI-type thing. It could be an agent, it could be a large language model, it could be any number of things. But I generally don't query stuff directly as much as I used to. And it seems like we're moving to a world where almost...
The layer you describe, the agent-type activity you describe, will become the primary consumer of everything on the internet. With 50% of traffic already bots, it's already automated, and agents are only really just getting going. Most people are not using these computer-use agents because they're too slow right now. They're still like previews, but it's clear that's where everything is going. Then we're going to see an explosion soon
in the traffic that's coming from these tools, and just blocking them because they're AI is the wrong answer. You've really got to understand why you want them, what they're doing, who they're coming from, and then you can create these granular rules. I mean, I hate to use the analogy, but these things are almost like avatars, right? They're running around on someone's behalf. Right. And you need to figure out who that someone is and what the objectives are. Right. And control them very granularly. And the old-school methods of doing that assume malicious intent,
which isn't always the case. And increasingly, it's going to be not the case because you want the agents to be doing things. And the signals just no longer work when you're expecting traffic to come from a data center or you're expecting it to come from an automated Chrome instance. And being able to have the understanding of your application to dig into the
characteristics of the request is going to be increasingly important in the future for distinguishing how criminals are using AI. What we've seen so far is either training, and people have their opinion of whether they want to allow training or not, or it's bots that maybe have got something wrong. They're accessing the site too much because they haven't thought about throttling, or they're ignoring robots.txt, rather than looking at something like agents.txt, which distinguishes between an agent you want to access your site and some kind of crawler.
And the examples that we've seen are just bots coming to websites and downloading the content continuously. There's no world where that should be happening. And this is where the cost is being put on the site owner, because they currently have no easy way to control the traffic that's coming to their site. Directionally, things are improving.
Because if you look back 18 months, the bots had no rate limiting; they were just downloading content all the time. Today, we know that these bots can be verified. They are identifying themselves. They are much better citizens of the internet. They are starting to follow the rules.
And so over the next 18 months, I think we'll see more of that, more of the AI crawlers that we want, following the rules, doing things in the right way. And it will start to split into making it a lot easier to detect the bots with criminal intent. And those are the ones that we want to be blocking. So with the transition process,
of bots from being these entities on the internet that represent third parties and organizations to this new world where these AI agents could be representing organizations, they could be representing customers, they could be representing any number of people. And this is probably the wave of the future. It seems to me like detecting that it's AI or a person is going to be an incredibly difficult challenge. And I'm curious, like, how are you thinking about
proving humanness on the internet, right? Proofing is a tale as old as time. There's a NIST working group on proofing identity that's been running, I think, for 35 years and still hasn't really gotten to something that's implementable. There's 15 companies out there, right? The first wave of
of rideshare services and gig economy type companies needed to have proofing, right? Because you're hiring these people in remote places where you don't have an office. And it's still not a solved problem. I'm curious, like, it feels like maybe AI can help get us there, or maybe there's something that's happening in that space. Right. Well, the pure solution is digital signature, right? But we've been talking about that for so long. And
The UX around it is basically impossible for normal people to figure out. And it's why something like email encryption, no one encrypts their email. You have encrypted chat because it's built into the app and it can do all the difficult things like the key exchange behind the scenes.
So that solution isn't really going to work. But AI has been used in analyzing traffic for well over a decade. It's just that it was called machine learning. And so you start with machine learning. And the question is, well, what does the new generation of AI allow us to do?
The challenge with the LLM-type models is just the speed at which they do analysis. Because you often want to make a decision on the network or in the application within a couple of milliseconds. Otherwise, you're going to be blocking the traffic and the user is going to get annoyed. And so you can do that with kind of classic machine learning models and do the inference really quickly.
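Here's a sketch of what keeping that decision inside a millisecond-scale budget can look like in practice: score the request with cheap local signals, and when the score is ambiguous or the budget is blown, fail open and queue heavier analysis asynchronously. The signals, weights, and thresholds are all illustrative assumptions.

```typescript
// Sketch: make the blocking decision with cheap, local signals inside a tight
// latency budget; defer anything ambiguous to background analysis.
interface RequestSignals {
  ipReputation: number;    // 0 (bad) .. 1 (good), from a local cache
  fingerprintSeen: number; // how many times this fingerprint hit us recently
  hasSessionCookie: boolean;
}

function quickScore(s: RequestSignals): number {
  let score = s.ipReputation;
  if (s.fingerprintSeen > 1000) score -= 0.4; // one client hammering us
  if (s.hasSessionCookie) score += 0.2;       // some evidence of a real session
  return score;
}

function decideWithinBudget(s: RequestSignals, budgetMs = 2): "allow" | "deny" {
  const start = performance.now();
  const score = quickScore(s);
  const elapsed = performance.now() - start;

  if (elapsed > budgetMs || (score > 0.3 && score < 0.6)) {
    // Ambiguous or too slow: fail open and hand off to deeper async analysis.
    queueMicrotask(() => console.log("queued for deeper (async) analysis"));
    return "allow";
  }
  return score >= 0.6 ? "allow" : "deny";
}

console.log(decideWithinBudget({ ipReputation: 0.9, fingerprintSeen: 3, hasSessionCookie: true }));
```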
And where I think the interesting thing in the next few years is going to be is how we take this new generation of generative AI using LLMs or other types of LLM-like technology to do analysis on huge traffic patterns.
I think that can be done in the background initially, but we're already seeing new edge models designed to be deployed to mobile devices and IoT that use very low amounts of system memory and can provide inference responses within milliseconds. I think those are going to start to be deployed to applications over the next few years. I think you're exactly right. Like, I think
So much of what we're seeing now is just being restricted by the cost of inference. And that cost is dropping incredibly fast, right? We saw this with cloud, where S3 went from being the most expensive storage you could buy to being essentially free. Glacier is essentially free, right? Free as in beer, right? Whatever. And so we're seeing that at an even more accelerated rate for inference; the cost is just falling incredibly fast.
And then when you look at the capabilities of these new technologies: you drop a suspicious email into ChatGPT and ask if it's suspicious, and it's like 100% accurate, right? Like if you want to find sensitive information, you ask the LLM, is this sensitive information? And it's like 100% accurate. It's amazing. As you squint and look at the future, you can start to see these really incredible use cases, right? Like to your point of inference on the edge,
Do you think we all end up eventually with an LLM running locally that's basically going to be Clippy, but for CISOs? Like it pops up and says, hey, it looks like you're doing something stupid. Is that kind of where you think we land? That's what we're working on: getting this analysis into the process so that for every single request that comes through, you can have a sandbox that will analyze the full request and give you a response.
Whereas now you can wait maybe two to five seconds to delay an email and do the analysis and decide whether to flag it for review or send it to someone's inbox. Delaying an HTTP request for five seconds, that's not going to work. And so I think the...
The trend that we're seeing with the improvement in inference cost, but also the latency in getting the inference decision, that's going to be the key, so we can embed this into the application. You've got the full context window, so you can add everything you know about the user, everything about the session, everything about your application alongside the request, and then come to a decision entirely locally, on your web server, on the edge, wherever it happens to be running. As I listen to you say that and describe this process, all I can think is that advertisers are going to love this.
It just seems like the kind of technology built for sort of like, hey, he's looking at this product, show him this one, right? Yeah. Super fast inference on the edge, coming to a decision. And for advertisers, stopping click spam, that's a huge problem. And being able to come to that decision before it even goes through your ad model and the auction system. Who would have ever thought that non-deterministic, incredibly cheap compute would solve these use cases, right? We're in a weird world.
That's it for this episode. Thanks again for listening. And remember to keep listening for some more great episodes. As the AI space matures, we need to start thinking more practically about how the technology coexists with the systems and platforms we already use. That's what we try to do here. And we'll keep examining these questions in the weeks to come.