So my name is Erica Hughberg and I work as a community advocate at a company called Tetrate, which means that I listen to people out in the industry and ensure that the solutions being built actually address people's real needs. And how do I take my coffee? Unfortunately, I normally take it cold, not by choice, but I do take it black. So I often end up with a cold cup of coffee on my desk because I didn't drink it fast enough.
We are back in action for another MLOps Community Podcast. I am your host, Demetrios. And talking with Erica today, I learned a few things about the evolution of the Internet and how we do things on computers in general, but also her theory on how we've been
plowing towards one reality of the internet and because of large language models that's all been flipped on its head. We're not going to spoil anything. I just want to get right into the conversation. I thoroughly enjoyed talking with Erica. She has the best sense of humor and I appreciate her coming on the pod. Let's do it! I want to jump into the tofu and potatoes on this. It sounds horrible. We'll just go with it.
Yeah, I'm not a big fan of tofu. I'm very particular about how I eat tofu. Only in a pad thai when it's crispy. Yeah, but there's so many ways you can make tofu. I've made some really bad tofu in my life. I tried to make some scrambled tofu. That was not a good idea. Yeah, that's hard. Yeah, I failed. So I'm not doing that again. So anyways, let's get into the meat and potatoes of this conversation, which is...
The internet is changing a little bit. You've been spending a lot of time on gateways and thinking about different gateways, I think. Yeah. Can you break it down for us and then we'll dive in deeper on it? Yeah. So I do think it's helpful to sometimes take many steps back to how the networking of the internet and how our applications are communicating over the last...
since 2010, really. Or even before, no, before then. Since the early 2000s, I got the years wrong. Look, I'm getting old. I was like, oh, maybe it's not that long. So the last 25 years at least,
roughly, has been really interesting in the evolution of the internet. So if we go back to the early 2000s, say 2001, 2002, 2003: what was happening during that time? How this is relevant to Gen AI will hopefully become obvious in a little bit. But if you were around then... and I do appreciate that not everyone is older than 25 years old; there are people who are younger than 25.
But for those of us who were around, it was a very exciting time. We were going away from dial-up to broadband and more and more people were getting access to the internet and it was getting more popular.
common that companies had websites. We saw the rise of social media and forums. We were on forums back then, on web forums, talking to people we didn't know. Very exciting. And IRC channels. Very cool. If you know what that is, I hope you take your supplements. I am taking them. That's so...
It rings true. Oh my God. Well, so if you do remember these things, what was happening was that with more people getting devices and connecting to the internet, that also meant that we had a greater need of handling many, many concurrent connections to web servers. So imagine Facebook, right?
You couldn't really have a social media app with only like a few thousand people on it. That wouldn't be very fun. Like when I say a few thousand, we're talking single digit thousands, not hundreds of thousands. So as we were scaling up and having this more interactive and multi-connection on the internet, we were hitting a problem that the traditional thread based proxies
were struggling. So what is a thread-based proxy? It would mean that you had a single connection per thread. So imagine you went to a restaurant and you were ordering food. You come in and you get a waiter. You order your tofu or whatever you like; maybe you find your fancy tofu dish.
And at that point, instead of the waiter just going around serving other tables while you're waiting for your food to cook, the waiter would stand there by your table and wait for your food to be ready and then come back, go back to the kitchen and get your food and deliver it to you when it was ready. So that waiter would be busy the entire time your food was cooking.
But as the internet exploded, we hit a problem known as the C10K problem: handling 10,000 concurrent connections. And this just continues to explode as the internet grows, right? The problem isn't isolated to 10,000 concurrent connections; it keeps growing as the internet grows and more users are on there. But that was the beginning of that problem.
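To make the "one waiter per table" model concrete, here is a minimal, illustrative sketch of a thread-per-connection server in Python. It is not any real proxy's code; it just shows why 10,000 concurrent connections means 10,000 mostly idle threads.

```python
# A minimal sketch of the "one waiter per table" model: one thread per
# connection, blocking while the backend "cooks" the response.
# Illustrative only; not any particular proxy's code.
import socket
import threading
import time

def handle_connection(conn: socket.socket) -> None:
    with conn:
        conn.recv(4096)            # read the request (the "order")
        time.sleep(0.05)           # the thread sits idle while the "kitchen" works
        conn.sendall(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")

def serve(port: int = 8080) -> None:
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as srv:
        srv.setsockopt(socket.SOL_SOCKET, socket.SO_REUSEADDR, 1)
        srv.bind(("0.0.0.0", port))
        srv.listen()
        while True:
            conn, _ = srv.accept()
            # One dedicated thread ("waiter") per connection: at 10,000 concurrent
            # connections this is 10,000 mostly idle threads -- the C10K problem.
            threading.Thread(target=handle_connection, args=(conn,), daemon=True).start()

if __name__ == "__main__":
    serve()
```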
So then we got waiters that didn't have to stand by the table anymore and wait for your food to be ready. Instead of the waiter having to stand there with you while you sat waiting for the food, we got a better model where waiters could just put the order into a system and go and wait other tables. And even if your waiter was busy, there was a runner who could go and deliver your food when it was ready, because they knew what you ordered and what table you were at, because there was a system. This was the move to event-driven proxies.
So now we were able to get orders in. So that's the event. The order came in, the request came in and we sent it to the kitchen. That's your backend target server you're getting to. They're processing it, cooking, cooking, cooking. And then you are sharing threads. So many connections could share threads and different workers could pick up and deliver the response. You didn't have to have your specific waiter.
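And for contrast, a minimal sketch of the event-driven model the restaurant analogy describes, using Python's asyncio as a stand-in: one event loop multiplexes many connections, and nothing blocks a thread while the "kitchen" works.

```python
# A minimal sketch of the event-driven "waiter with an order system" model:
# a single event loop multiplexes many connections, and no task blocks a
# thread while it waits. Illustrative only, not Envoy's actual architecture.
import asyncio

async def handle_connection(reader: asyncio.StreamReader,
                            writer: asyncio.StreamWriter) -> None:
    await reader.read(4096)        # read the request (the "order")
    await asyncio.sleep(0.05)      # while the "kitchen" works, the loop serves other tables
    writer.write(b"HTTP/1.1 200 OK\r\nContent-Length: 2\r\n\r\nok")
    await writer.drain()
    writer.close()
    await writer.wait_closed()

async def main(port: int = 8080) -> None:
    server = await asyncio.start_server(handle_connection, "0.0.0.0", port)
    async with server:
        await server.serve_forever()

if __name__ == "__main__":
    asyncio.run(main())
```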
So that was the cool thing that happened during that time. And then what happens after that? So, cool, we solved the connection problem. We can handle lots and lots of connections. That was exciting, right? No one has to sit there and wait. Well, we still have to wait, but the waiter doesn't have to wait. And then what happens is this move to break down all of our apps. This happens around the early 2010s, and the movement really starts kicking in around 2015. And that is: we used to have big monoliths.
And what is a monolith, really, in software engineering? I like to think of it as: imagine you have a big box, and all the little components that make up your software are in this box. So how you log in is in the box; maybe imagine it like a little teddy bear you shoved into that box. And then you have a bunch of hair ties in the box, and lots of little functionalities in the box. You throw it all in one big box.
The problem with these big boxes was that as the number of users increased, we needed to start to scale horizontally. And this is where we come into the world of cloud engineering and scaling horizontally. But imagine that in this big box were many different features, from viewing images to logging in to, if you think about the social media part, loading your feed: get all the posts from contacts or connections or friends and show them. That is one feature of this big box, but this box did so many things.
And we're like, oh, if we're going to scale horizontally with more and more users, this is getting more and more expensive. Cloning this big box over and over again started to get computationally expensive, because we needed to reserve so much memory for this box. And there's a bunch of waste. Yeah. Yeah. So we're like, okay, well, what if we took the teddy bear out of the box, and instead of cloning the whole box, we just clone the teddy bear that we put in the box, right?
So that's when we start breaking down the boxes into smaller boxes. The teddy bear now has its own box. It doesn't have to be in the big box with all the other junk you threw in there; now it's got its own box, and it's a much smaller box. Now we can scale that up much more resource efficiently. So that's what drove the move to microservices. We went from a monolith, a big box with lots of stuff
shoved into it to lots of smaller boxes with just one or a few items in each box that could scale together. So it became much more resource efficient. But as we started breaking apart these boxes, we ran into a networking problem, of course. Uh-huh.
because when we had the monoliths, it was all very straightforward from a networking and proxy perspective. So yeah, we'd solved the multiple-connections-to-the-proxy problem. Well done, everybody. We've done that. But now, well, where are we going to send the traffic? So now we had a new problem, because as we tried to make all of our systems more resource efficient and we were scaling up all of these services, we had this fascinating problem of, wait a minute,
stuff's moving around, they're on different addresses now. And as you scale up and have more teddy bear boxes, the teddy bear boxes are moving around as well, because someone came up with this bright idea called Kubernetes. And do you know what Kubernetes really does? What it really does is move your teddy bear box and your other boxes around on servers, like actual computer hardware.
They're called nodes in Kubernetes, but what it really does is play box Tetris. That's what Kubernetes is really good at: stacking these boxes for maximum, most efficient resource utilization on your nodes. So it may be moving your teddy bear around as it sees fit, because maybe it fits better over there. Kubernetes is just about playing box Tetris with computing resources.
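As a toy illustration of the "box Tetris" idea (and nothing like the real Kubernetes scheduler, which weighs many more factors), here is a first-fit bin-packing sketch in Python:

```python
# A toy first-fit "box Tetris" sketch: pack workload "boxes" (CPU requests)
# onto nodes with fixed capacity. Real Kubernetes scheduling considers far
# more; this only illustrates the bin-packing intuition.
from dataclasses import dataclass, field

@dataclass
class Node:
    name: str
    capacity_mcpu: int
    used_mcpu: int = 0
    pods: list = field(default_factory=list)

def schedule(pods: dict[str, int], nodes: list[Node]) -> None:
    # Place the biggest boxes first (first-fit decreasing).
    for pod, request in sorted(pods.items(), key=lambda kv: -kv[1]):
        for node in nodes:
            if node.capacity_mcpu - node.used_mcpu >= request:
                node.used_mcpu += request
                node.pods.append(pod)
                break
        else:
            print(f"{pod}: unschedulable (no node has {request}m free)")

nodes = [Node("node-a", 4000), Node("node-b", 4000)]
pods = {"teddy-bear-api": 500, "login": 250, "feed": 1500, "image-resizer": 3000}
schedule(pods, nodes)
for n in nodes:
    print(n.name, n.used_mcpu, n.pods)
```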
So, okay, now stuff's moving around. The network engineers are pulling their hair out. They're like, we can't keep up with all of this moving around and changing addresses. This is also where it becomes interesting, because now we didn't just need to handle multiple connections. Now we have to be able to dynamically and quickly update the proxy on where it's sending traffic, from a logical perspective. Oh, you want to reach the teddy bear API? Okay, these are all the places that serve the teddy bear API right now. So this is again a big shift in how we handled proxies. Because up until this point,
proxies like NGINX were statically configured. They were not dynamically updated as your situation changed. That is actually what drove the introduction of proxies like Envoy Proxy that can dynamically reload configuration, so they don't need to restart at all. As your targets are moving around, your teddy bears are moving around in the different boxes, Envoy Proxy doesn't need to restart. It just gets dynamically updated. And I'd say Envoy Proxy was really created for that. It was created in the era of breaking things apart.
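A conceptual sketch of the difference being described, static versus dynamically updated targets. This is not Envoy's xDS protocol, just the idea behind it: a registry of endpoints that something like a control plane can swap out at runtime without restarting the proxy.

```python
# A conceptual sketch of dynamic upstream configuration: a static proxy bakes
# its target list in at startup; a dynamic one consults a registry that a
# control plane keeps up to date as pods move. Not Envoy's actual mechanism.
import itertools
import threading

class DynamicEndpoints:
    """Thread-safe, hot-swappable list of backend addresses."""
    def __init__(self, initial: list[str]):
        self._lock = threading.Lock()
        self._cycle = itertools.cycle(list(initial))

    def update(self, new_endpoints: list[str]) -> None:
        # Called by a watcher/control plane when the "teddy bear boxes" move.
        with self._lock:
            self._cycle = itertools.cycle(list(new_endpoints))

    def pick(self) -> str:
        # Called on every request; no restart needed after an update.
        with self._lock:
            return next(self._cycle)

endpoints = DynamicEndpoints(["10.0.1.5:8080", "10.0.1.9:8080"])
print(endpoints.pick())
endpoints.update(["10.0.2.3:8080"])   # pod rescheduled to another node
print(endpoints.pick())
```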
So now we've taken a long old step back. Another cool thing during this microservices era: as we started to optimize, we wanted loads of little requests that are light and fast. And these little services, like our teddy bear service... I imagine this little cute pink teddy bear in my head, by the way.
This teddy bear service had to be really fast and really lightweight, because if you want to do really good horizontal scaling, you need things that start fast and respond fast. I mean, responding fast kind of comes with it if you can build a really self-contained and thought-through service. But...
So we optimized how we think about networking in this microservices era to be fast. The process itself should respond within single-digit milliseconds, right? It shouldn't take longer. Keep this in mind, because with Gen AI, a model service is fast if it returns a response in 100 milliseconds. So if the teddy bear API service responded inside itself in the space of one or two milliseconds, or five maybe, we're now talking about anywhere between 10 and 100 times slower at its fastest when we're looking at LLM services, for example. And no, we're not talking about network latency. We're just talking about the process itself responding. So it's a lot slower at responding from the first byte. And then we have other fun stuff: we talk about time to first byte, because LLM responses also tend to be streaming, and then we're really focused on time to first byte.
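A small sketch of measuring time to first byte against a streaming endpoint, since total latency alone says little here. The gateway URL and payload are hypothetical.

```python
# A sketch of measuring time to first byte (TTFB) for a streaming response.
# The URL and payload are placeholders, not a real deployment.
import time
import requests

def measure_ttfb(url: str, payload: dict) -> float:
    start = time.monotonic()
    with requests.post(url, json=payload, stream=True, timeout=120) as resp:
        resp.raise_for_status()
        for chunk in resp.iter_content(chunk_size=None):
            if chunk:
                return time.monotonic() - start   # first byte of the body arrived
    return float("inf")  # stream ended without a body

ttfb = measure_ttfb(
    "http://gateway.example.com/v1/chat/completions",
    {"model": "example-model",
     "messages": [{"role": "user", "content": "Hi"}],
     "stream": True},
)
print(f"time to first byte: {ttfb:.3f}s")
```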
That was the best explain it like I'm five description of the evolution of software engineering I've ever heard. That is amazing. From the restaurant analogy to the teddy bear to Kubernetes just being Tetris. That is amazing. And I love each little piece of that. And so now in this world of everything being an API and our microservices. Yeah.
paradigm that we're living in. One thing that is fascinating is not only that, okay, these models are a bit slower, but they're gigantic, right? Yeah. That's a little different too. I think you were mentioning that we've been building for super small, super fast type of workloads and APIs. And now we're looking at
really big and really slow types of things. Yeah, so that is also really interesting: the workloads themselves are an entire topic of their own, like how to manage LLM workloads in your compute. I think there's also this very interesting aspect where people are looking at how small they can be.
Because again, we are very much gravitating to how can we make things smaller and more scalable. There's some really interesting ideas on how you can even maybe run really lightweight models towards your edge stack of your networking. For those who may not know, in the world of the internet,
we didn't just have to deal with lots and lots of connections. We have something called content delivery networks out on the internet. So when you go on a website and there are images on there, imagine you go to Facebook or Instagram and there are lots of photos, and every time you open your phone you are looking at photos on Instagram.
When I go and look at it on Instagram in the United States, those photos were most likely cached on an edge through a content delivery network system. Whereas when my mother back home in Sweden, because I'm from Sweden, when she opens Instagram and looks at photos, they don't come all the way from some data center in the US or in Germany, most likely.
A lot of them would be coming from a CDN from really close to her in Sweden so that it can have that perception of fast delivery. So there's a lot of interesting thoughts of how can we bring at least lightweight Gen AI services closer to those edges to be closer to people so things are faster. But I am not an expert on that. I just know that a lot of people are thinking about
How small can you make these actual workloads? How efficiently can you get them closer to people? People are even talking about what you can realistically run on a client device. Yeah, exactly. And you see that a little bit with what Apple Intelligence wanted to do, right? Yeah. Yeah. All of the...
simple queries are going to be done on your own device, and then if you need a bit more, we're going to bring it to the cloud and it will get done, but it's going to be all private and in our cloud, type of thing. Yeah. And I think my favorite problem is how big the requests and responses have become. The actual network traffic is getting heavier. I guess the interesting question there is, what happens now with the network traffic?
Yeah. So first of all, we were really optimizing for small requests. A lot of gateways out there, by the way, have request and response size limits, especially if you want to be able to interrogate the request or response body and the actual content of the request. And why would you want to do that? For security reasons? Yes, specifically for security. And in the world of
large language models, people are very interested in interrogating both request and response content, to either protect information from leaking out or to protect against malicious information coming into your system. But those are pretty standard web application firewall challenges. Another interesting thing people want to do as traffic comes in: imagine you are making an LLM request, a query. So you have an application developer who's like, I'm going to build a cool app that people can write cool stuff into. But the person building the application is no LLM expert. They're like, I want to have this cool thing, and I just want a simple LLM API to hook into. I don't want to care about picking a model or service provider; that's not my expertise. Imagine the developer just wants to send an LLM request to an API.
Then why would someone want to interrogate the actual content of this request? I've seen some really interesting things that people are doing out there, like semantic matching: literally using a lightweight LLM, almost, to pick a model based on the request. To be able to run an analysis on what is the most appropriate model for this type of request, you need to have access to the body to make that decision. Otherwise,
You can't, right? You need to be able to access the body. And this is obviously where, depending on how big the body is, you start running into problems. But
you can't assume that every body is going to be small, right? So you need architecture and infrastructure that allows you to do this for more unpredictable request body sizes. And then you have the response body, which can get really complicated, especially with streaming, and how you're going to try to protect that. But yeah, I find it very fascinating. So that's one problem: how the network traffic shapes up differently, with it being bigger, both requests and responses.
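A hedged sketch of the body-based model selection being described. Real setups might use a lightweight LLM or a classifier; this stand-in uses crude heuristics purely to show why the router needs access to the request body.

```python
# Inspect the request body and pick a model before forwarding. The model
# names and the "looks hard" heuristic are hypothetical placeholders.
import json

SMALL_MODEL = "small-general-model"
LARGE_MODEL = "large-reasoning-model"

def choose_model(raw_body: bytes) -> str:
    body = json.loads(raw_body)
    prompt = " ".join(m.get("content", "") for m in body.get("messages", []))
    looks_hard = len(prompt) > 2000 or any(
        kw in prompt.lower() for kw in ("prove", "step by step", "analyze this code")
    )
    # Without access to the body, this decision cannot be made at all.
    return LARGE_MODEL if looks_hard else SMALL_MODEL

request_body = json.dumps(
    {"messages": [{"role": "user", "content": "What's the capital of Sweden?"}]}
).encode()
print(choose_model(request_body))   # -> small-general-model
```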
Then there's the fact that the APIs we are dealing with are incredibly open-ended. They are not our old traditional microservices APIs. When I say they're old, that sounds sad because they're fairly new as well, but I think we can call them old now, because they were very much about being deterministic, having very clear goals: this type of request is always going to take X amount of milliseconds. And you wanted to be very clear on that. The slowest one is going to be 50 milliseconds, for example, and then maybe you have a set of faster ones, but you have a very clear idea. Whereas in the world of LLMs, this is incredibly unpredictable. And then if you go beyond LLMs, to
inference models for images, it gets even more mental. We start working with even bigger amounts of data. We go into images, then into video; now we're dealing with even larger data. So even though LLMs are pushing us, imagine where we're going with media that isn't just text. Yeah, and what kind of infrastructure will we need to build out for that? The future that we all want, and you hear
AI influencers on the internet talking about how, oh, well, we're just going to not have movies be made anymore. They'll be made for us and our individual wants and what we're looking for off of a prompt or whatever. If we do want a world where there's going to be a lot of these heavier files being shared around or heavier creations from the foundational models,
And that's more common. What kind of infrastructure do we need for that? Because it feels like we've been going in one direction for the past 15 years. And we've really been trying to optimize for this fast and light. Yeah. But now we're needing to take a bit of a right turn. Yeah. Yeah.
and think a bit differently around how we deal with long-running connections, long open connections. Also because there's a difference when you look at what we did with streaming, like when you're using Netflix, when you have signed into Netflix and you are streaming video.
When Netflix is streaming out their media to your TV or your device, they don't have any need at all to interrogate the output. So they can bypass a lot of the things that you do have to worry about when you look at output from LLMs or large visual models and things like that, where
you actually want to interrogate the output before giving it to a user, which is different from streaming Gilmore Girls on the internet, right? They don't need to check that the Gilmore Girls episode is actually what it is, right? Because at one point I thought, hmm, Erica, I said to myself, this streaming problem, and doing that at large scale, fast, that's been solved, hasn't it? So then I was like,
well, that has been solved in the scenario where the data you are sending is known and controlled. You know exactly what it is. You are not worried that it's going to send out something you don't recognize, right? So that kind of streaming is very different from what we're looking at with streaming responses from Gen AI services, because the security layers and control of what we're sending become different.
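A small sketch of that difference: with Gen AI streaming, each chunk may need to be inspected before it is forwarded, unlike known, pre-approved media. The redaction rule here is a placeholder.

```python
# A sketch of why streamed Gen AI output differs from streaming a known video
# file: each chunk may need to be checked (for leaks, policy, etc.) before it
# is forwarded. The blocklist and rule are hypothetical.
from typing import Iterable, Iterator

BLOCKLIST = ("internal-project-x",)   # hypothetical sensitive terms

def guarded_stream(chunks: Iterable[str]) -> Iterator[str]:
    for chunk in chunks:
        # With Netflix-style delivery the content is known ahead of time and
        # can pass through untouched; here every chunk gets inspected.
        # (A real filter would also handle terms split across chunk boundaries.)
        for term in BLOCKLIST:
            chunk = chunk.replace(term, "[redacted]")
        yield chunk

upstream = ["The plan for ", "internal-project-x", " ships next week."]
print("".join(guarded_stream(upstream)))
```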
Now, I was thinking about how we move towards an internet that supports more Gen AI use cases. What does that look like from the networking perspective, from the gateway perspective? I know you all are doing a ton of work with Envoy AI Gateway. I imagine it's not just, hey, let's throw an awesome gateway into the mix and that solves all our problems, right? Yeah.
Yeah, it doesn't. Because even with the existing foundation, if you look at how Envoy Proxy itself operated, we had to make enhancements to Envoy Proxy itself. And then we started the Envoy AI Gateway project to further expand on the Envoy Proxy capabilities and control planes. I'll explain a little bit what that is in a moment. But if we take one step back and look at what happened in the world of gateways, what I think is super fascinating is that with the rise of Gen AI, there were a lot of Python gateways, Python-written gateways, that came out. Because it seems natural, right? Like, oh, we wrote all of this other cool stuff in Python, and a lot of people who came from the machine learning and AI world are incredibly comfortable with Python. And if you want to do cool stuff like automatic model selection on an incoming request, you kind of have to do that in Python, because now you're back in the machine learning and AI space. So it becomes this weird thing: how do we do the smart stuff we want to do, but also not run into the single-waiter-per-table problem? Because
Python gateways fundamentally run into problems, and it's not about how smart the people writing these Python gateways are; it's not their fault. The problem comes back to the foundations of the Python language, where effectively there's an interpreter, because Python is an interpreted language, not a compiled one. So you have this process that is interpreting Python, and when we come to threads, we're back to the single-waiter-per-table problem. And therefore, even if you try to, you can start trying to be clever and simulate
what is referred to as event-driven architecture. Don't confuse that with Kafka. But when you look at the event-driven architecture of a proxy, it's when you can have one waiter serving multiple tables and just putting things into the system. So it's like this little event notification within the restaurant, within the proxy. So...
You can simulate a lot of that with Python, but you are eventually going to run up against the constraints. Well, we can call them issues, but Python itself isn't the issue. It's just that you will run into issues because of the constraints of the Python language if you want to handle loads and loads of connections at large scale.
So that's where we run into problems. And how do we solve this? How can we combine the restaurant where waiters can serve multiple tables with the smart stuff that Python can do? That's how we end up leaning into Envoy Proxy, which is very much the restaurant where waiters can serve many tables, right? That is what Envoy Proxy is really great at, with this event-driven architecture. Envoy Proxy, with Envoy Gateway and Envoy AI Gateway, allows us to create an extension mechanism where we can bring in the cool logic, like automatic model selection written in Python,
and still have the one-waiter-serving-multiple-tables proxy, and then be able to, only when we need to, go and get that special order from the Python extension to do the smart stuff. Only if we need to and only if we want to. So we get the network benefits as well as being able to bring some of the smart stuff in, and that is what makes it really interesting: how we can bring these worlds together.
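A simplified stand-in for that extension idea: a small Python callout service that does the "smart stuff" (here, model selection) while the proxy keeps handling connections. The real Envoy AI Gateway extension point is a gRPC external processor; this plain HTTP version only illustrates the shape of it.

```python
# A simplified, hypothetical callout service: the proxy consults it only when
# needed, and it scales independently of the proxy. Not the real ext_proc
# protocol, just the shape of the idea.
import json
from http.server import BaseHTTPRequestHandler, HTTPServer

class ModelSelector(BaseHTTPRequestHandler):
    def do_POST(self) -> None:
        length = int(self.headers.get("Content-Length", 0))
        body = json.loads(self.rfile.read(length) or b"{}")
        prompt = " ".join(m.get("content", "") for m in body.get("messages", []))
        chosen = "large-reasoning-model" if len(prompt) > 2000 else "small-general-model"
        response = json.dumps({"model": chosen}).encode()
        self.send_response(200)
        self.send_header("Content-Type", "application/json")
        self.send_header("Content-Length", str(len(response)))
        self.end_headers()
        self.wfile.write(response)

if __name__ == "__main__":
    # The proxy calls this out-of-band; it can scale up and down on its own.
    HTTPServer(("0.0.0.0", 9000), ModelSelector).serve_forever()
```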
But we are definitely going to see, I still believe, challenges with some of the response times, like time to first byte and complete streaming. I just imagine that's going to continue to grow as a need, and we're going to have to start looking further into the internals of our proxies.
I'd be surprised if we don't continue growing in that space. You know what it reminds me a lot of? The folks who build Streamlit apps.
It's like, yeah, I built this Streamlit app because I just code in Python and I like Python. It's like Streamlit is probably not the best for your front end, but it gets you somewhere. And then when you want to take it to really productionizing it and making it fancy and doing all that front end stuff, then you can go to React or whatever. You bust out your Vercel jobs or your Next.js and you actually make something that is...
nice for the end user or nice on the eye and has a bit of design to it. But sometimes you just are like, yeah, all right, cool. Well, Streamlit gets me this MVP and that's what I want. In a way, it's almost like the parallel here is that there's been proxies made with Python. They get you a certain...
way down the road, and you can validate ideas and see if it works. And then at a certain point, you're probably going to say to yourself, okay, now we want to productionize this, and Python is not the best way to do that. Yeah. I think it's interesting, because if you have a small user base, just building something internally for your company, you can probably be fine with your Python gateway. You've got a few hundred users just inside your own company; you're probably fine. But if you have...
even internally, if you have a lot of concurrent connections and you start needing to deal with this, you can't have one waiter per table just standing there waiting. So you have to figure out how you can make better use of all of the connections. It's sort of interesting: as we talked about what happened at the beginning of the 2000s with the
10,000 concurrent connections problem, the Python gateways, as we bring more Gen AI features to the mass market, ultimately run into that same problem. And the good news is the concurrent connection problem has already been solved. So then it becomes, how can we bring these things together to get the core features? But at least the good news is that the concurrent connection problem has been solved. Mm-hmm.
So there are almost two levels that you're talking about here: the traffic and the networking, but also then understanding the request and being able to leverage some of this cool stuff like, oh, well, this request probably doesn't need the biggest model, so let's route it to a smaller model. Maybe it's open source and we're running it on our own, maybe it's just the smallest OpenAI model, whatever it may be. And so...
the levels that you have to be playing at are different, right? Or at least in my mind, I separate the whole networking level from the actual LLM call and what is happening in that API. Yeah. But then you have the smart stuff that has to happen before you send it to the LLM service. And that's the exciting part: combining things like Envoy Proxy with a Python service, a filter, a Python filter, to be able to do the smart stuff and make those routing decisions. And you don't think that that's just kicking the can down the road? Like you're still going to run into problems
if you're passing it off to Python at any step along the way? Good question. So we don't have the connection problem in the same way at that point, because now we can have dedicated Python services that run and scale independently of the proxy.
So now we are not dealing with client to proxy or proxy to upstream connections. Now we have the proxy that is making a call out to another service. So like a Python service. And that service, you know how we talked about boxes in Kubernetes earlier? That little Python service box, we can scale it up and scale it down based on demand. So that thing now becomes horizontally and elastically scalable on its own. And it's not going to have any...
There won't be any noisy neighbor problems, because they're not going to block each other. And we only run into the waiter problem when we have a limited set of waiters in our restaurant
that are associated with one request response lifecycle. Because now we just go to this little Python service. When it's done, it's out of the picture. And also we weren't blocking other requests because of how the Envoy architecture works. We weren't blocking the other requests while we were consulting this Python process about what to do. I understand what's going on then and...
It's like we call them and ask, hey, little Python service, what do you think about this request? I'm going to hang up now, and when you have an answer, I'll pick up the phone and we'll continue this journey. Yeah. And that bottleneck is only potentially happening at the proxy layer. It's not happening once it goes through the proxy. Then it's like free range, and there's all kinds of stuff that you can be dipping your fingers into. It's that the request lifecycle inside of Envoy is the important part, and how you can include external filters so that you have this basically event-driven flow: you basically put stuff into the central system of Envoy and it knows what to do. Okay. So it's...
I hope this is an oversimplified explanation, by the way. So people who are really in the depths of Envoy proxy could probably be like, oh, Erica, that is not entirely exactly in detail true. But if you want the like 10,000 feet view, I hope that is enough. And if you really want to dive into it, like you can spend a long time.
learning about the internals of this. But the good news is that it's very clever in how it handles resources and manages connections. So that is really cool.
But then, Envoy Proxy is really hard to configure. So I'm going to go on a slight tangent, but an important tangent, about why Envoy AI Gateway is really interesting, because Envoy AI Gateway does two key things when it comes to helping you leverage Envoy Proxy to handle traffic. Every time you have a proxy, you have a two-sided problem.
You need to configure this proxy, and you want to be able to configure it in a way that works even as it scales up and down, so you may have many proxies, not just one. You need a control plane that can effectively and resource-efficiently configure this fleet of proxies. So Envoy AI Gateway brings you this control plane, which extends the Envoy Gateway control plane.
And then, it's really interesting, I know we don't have time to spend on that, but that control plane is very efficient in how it configures Envoy Proxy and helps you propagate configurations across all the proxies that are running. And then the other part that we've added into Envoy AI Gateway is an external process that helps with handling
specific Gen AI challenges like transforming requests. So one of the things we've done is to have a unified API so that if I'm an application developer, I don't have to learn all these different interfaces to connect to different providers. Because we don't have to put that cognitive load on people who want to build cool apps, let them build cool apps.
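A sketch of what that unified API means for the application developer: one request shape pointed at the gateway, whatever provider or model serves it behind the scenes. The gateway URL, token, and model name are placeholders, and the OpenAI-style schema is used here only as an example of such a unified interface.

```python
# One request shape, pointed at the gateway; the gateway decides which
# provider and model actually serve it. All names here are hypothetical.
import requests

GATEWAY_URL = "http://ai-gateway.example.com/v1/chat/completions"

resp = requests.post(
    GATEWAY_URL,
    headers={"Authorization": "Bearer <app-token>"},   # placeholder credential
    json={
        "model": "general-chat",   # a logical name; the gateway maps it to a real backend
        "messages": [{"role": "user", "content": "Summarize this release note."}],
    },
    timeout=60,
)
resp.raise_for_status()
print(resp.json()["choices"][0]["message"]["content"])
```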
And then we can worry about the pipes. Yeah. And do you need one to be able to leverage the other? I guess, do you need Envoy to be able to leverage Envoy AI Gateway? So when you install Envoy AI Gateway, it actually installs Envoy Gateway, which gives you Envoy proxy and the Envoy Gateway control plane. And so when you go through installation steps, you actually first install Envoy Gateway. And it would run on a Kubernetes cluster.
I call this a gateway cluster, by the way, for reference, if you ever look at any of my diagrams in blog posts. So you've deployed that. Then there's a Helm chart and you just install Envoy AI Gateway, and that deploys an external process and an extension of the control plane.
So it expands on the functionality of Envoy Gateway and Envoy Proxy. You don't really have to know that you are deploying all of those things, but that is what happens if you deploy it. It's all part of the Envoy CNCF project. It's not a separate thing; it's part of that ecosystem. So yeah.
It's not like a separate entity per se. It's part of the Envoy CNCF ecosystem. Have you noticed a lot of AI or ML engineers starting to come into the Envoy AI Gateway project and submit PRs? Do you look at that type of stuff? Because I wonder how much of the...
scope falls onto ML engineers, AI engineers, or if those folks just throw it over to the SREs and the DevOps folks? It's a good question. Where we really need the people who have a good understanding of machine learning and Gen AI is really in these intelligent, maybe Python-based, extensions. If there are people out there who have ideas on how to do, for example, semantic routing to decide on the model without doing it in Python, please let me know. Very interesting. This is outside of my area of expertise. So if there's someone who can write it in Rust, please let me know. That would be amazing.
But yes, we are seeing people who come in who are like, hey, let me show you this Python extension and how it fits into Envoy AI Gateway. And the smart stuff, that is definitely outside of my expertise. I find it really fascinating and interesting. So if people have that expertise, I would love people to come and help.
build those extensions into Envoy AI Gateway and bring those features to the community. Because fundamentally, when we look at the Envoy AI Gateway initiative within the Envoy project, it is really about
What we're facing right now and how this traffic is changing and shaping up, it is not a single company's problem. It is not a single user's problem. We are all running into this. So coming together and solving these things together in open source and maintaining it together, I think is really exciting and seeing both vendors and users coming into the space and collaborating.
And yeah, really sharing knowledge, as you said: sharing the networking knowledge alongside the knowledge of, well, what LLM functionality do we have out there? How can we actually bring some functionality into the gateway to make it even smarter? Having people who really understand the challenges of how they are running LLM workloads is, I think, a really, really interesting combination. And I've learned so much over the last
six, seven, eight months from collaborating with people. So it's been really cool and exciting. So I will say that what you are talking about here has been 100% validated by the community.
The way that I know it's been validated is when we do our AI in production surveys and we try and do them every time we do like a big conference or almost once a quarter. And we ask people, what's going on in your world and what are the biggest challenges? What are things that you're grappling with right now? A lot of people have written back and said some of the hardest things are that there's
this new way that we are working with software and working with models, almost like what you were talking about, where these models are so big. It makes things so different to handle and it makes everything a little bit more complex. And then on top of that, you don't really have anywhere that you can turn to that has
definitive design patterns and folks who have figured it out and are sharing that information with the greater developer community. And so I find it absolutely incredible that A, you're working on it in the open. It's great that like
the open source project is trying to do that. But then B, you've thought about this idea of, okay, the models... Like traffic to models. The traffic is so different. They're so different. They bring so many...
other ways of having to deal with software engineering, not just the fast and light way that we've been looking at and trying to get to; now there's a heavier one. And now we get these timeout problems, or the model doesn't fit, and there are payload constraints, all of these constraints or problems that people are running into because
you try to bring this new paradigm onto the old rails and you see where there are a few cracks. Yeah. So in my actual work, in my day-to-day work, I am a community advocate, which actually means a lot of what I do is understanding what's happening out in the technology community.
And so I work at a company called Tetrate and we are very invested in the Envoy project, for example, with
having engineers and people like myself being out there building in open source. But as a community advocate, a lot of what it is for me is like talking to people, listening to what's really going on and the challenges they're running into and advocating for the community when we are looking at how we are building things going forward.
So I think what could maybe be misinterpreted about being a community advocate is that I'm advocating for the solutions to the community. No, no, no.
I am advocating for the community so that the solutions that are being built are addressing needs in the community and maybe sometimes even needs the community haven't realized they've got yet. Maybe they're running into interesting problems. Have I heard people say they are running into challenges with their Python gateways? Like, why does it seem like the gateway itself is adding latency? Well, let's talk about event-driven proxies. But
This is where it's interesting: how can you advocate for and understand the challenges of the community, so that the solutions being built meet those needs?
So yeah, I find that really interesting for me specifically. So I find it really nice to hear what problems people are running into. So it's really exciting to hear like what you're seeing as well in your community and advocating for people's real needs out in the real world. So we don't build science projects. And I think that is what's so exciting. You know, you said you had Alexa on here earlier, right? Like collaborating with Alexa and the team.
In open source and really bouncing ideas together and really getting to real needs, I think that's really fun and nice. And, you know, we don't work at the same companies, but we get to collaborate and drive solutions together. Oh, it's really fun. The thing that I'm wondering about is, do you feel like you have to be technical to be that type of a community advocate?
That's a good question. I am very technical. I started coding when I was 12 and I've been in the gateway space for many years. I started being in the platform engineering and gateway space back in 2015, 16.
Around 2015, when we started breaking down those monoliths into microservices, the need for gateways became apparent. So I've been in that space since 2015, and I was in the fintech space. And I need to be honest, with this Gen AI and the type of
traffic patterns we're seeing now, I feel really validated because definitely between 2015 and about at least 2023,
I was in a situation where the financial analysis APIs I was dealing with were incredibly open-ended. You could be like, oh, I have this portfolio and I want to analyze it. But there's a big difference between analyzing a portfolio with 10 US equity holdings in it versus a multi-asset, multinational portfolio with 10,000 holdings in it.
So analyzing those, even if it's the same API endpoint you're hitting, I hope you can understand that one of them is going to be very fast and easy to respond to, and one is going to take a lot of time processing. And both the input and especially the output, we're talking very big outputs, were starting to hit the limits of
API gateways, you know, the response body limitations we talked about. We started to run into those problems because the responses were so big they were hitting the 10 megabyte limits of many, many gateways out there. So for years, I felt that people were telling me, Erica, you are doing your APIs wrong. Clearly those financial analysis APIs are
wrong. There's something wrong with your design; that's the problem, not the gateway. And to be honest, I truly believed that we must be doing something wrong, right? That we had these APIs that
were so unpredictable in the time to respond and the size of the response. But then Gen AI and LLM APIs came around, and I'm like, wait a minute, I've seen this before. An API where you can put stuff in and what comes out can be very different. It could be slow, it could be big. And now everyone seems to be on board with this being a real challenge.
But actually, a lot of what we are dealing with now with gateways in the world of LLMs, the problems we are solving, benefits those financial analysis APIs as well. Take the changes in how we're looking at limiting usage. In the LLM world, we're using token quotas. So maybe you're allowed to use 10,000 tokens in the space of
whatever timeframe you like, say a day or a week or a month. But those quotas are not a number of requests. You can make just a couple of requests and hit your quota, or you can make loads of small ones and then hit your quota. So that is different. So we actually had to change how we did rate limiting and
normal usage limiting in Envoy Proxy, to allow for this more dynamic way of measuring usage. So you don't just measure usage based on the number of requests; you can measure it on another data point. In this case, in the LLM world, it would be tokens, like word tokens. So that is fascinating to me.
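A sketch of that shift, usage limiting keyed on tokens rather than request count. The numbers and in-memory storage are illustrative; the gateway enforces this at the proxy layer, not in application code like this.

```python
# Usage limiting keyed on tokens rather than request count: one big request
# can consume a whole quota, many small ones add up to the same.
import time
from collections import defaultdict

WINDOW_SECONDS = 24 * 60 * 60      # e.g. a daily quota (illustrative)
TOKEN_QUOTA = 10_000

_usage: dict[str, list[tuple[float, int]]] = defaultdict(list)

def allow(user: str, tokens_used: int) -> bool:
    now = time.time()
    # Drop usage that has fallen out of the window.
    _usage[user] = [(t, n) for t, n in _usage[user] if now - t < WINDOW_SECONDS]
    spent = sum(n for _, n in _usage[user])
    if spent + tokens_used > TOKEN_QUOTA:
        return False               # over quota, regardless of request count
    _usage[user].append((now, tokens_used))
    return True

print(allow("alice", 9_500))   # True: one big request
print(allow("alice", 1_000))   # False: the quota is measured in tokens, not requests
```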
And even just our observability, how we measure usage and how fast we are responding, changes. If you think about an LLM where you're streaming a response, streaming tokens back, what you're actually interested in from a performance point of view is response tokens per second,
because then you know it's working; stuff's happening, responses are streaming through. If you correlate that to some of the challenges we saw in financial analysis APIs: we didn't have word tokens there, but we were very concerned about time to first byte, the first byte we started streaming in the response. So instead of looking at the time it takes to process the entire request in the financial analysis API, imagine we monitor bytes per second, because that will tell us stuff is moving, it's not standing still. That's the parallel to how we're interested in response tokens per second when we're streaming LLM responses.
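And a small sketch of that observability shift: for a streaming response, track time to first byte and throughput (tokens or bytes per second) rather than only total duration. The stream below is simulated; a real implementation would wrap the actual response iterator.

```python
# Track streaming health as throughput, not just total duration.
import time

def measure_stream(chunks, count_tokens=lambda c: len(c.split())):
    start = time.monotonic()
    first_byte_at = None
    tokens = bytes_seen = 0
    for chunk in chunks:
        if first_byte_at is None:
            first_byte_at = time.monotonic()   # time to first byte
        tokens += count_tokens(chunk)          # crude word-token count for the sketch
        bytes_seen += len(chunk.encode())
    elapsed = time.monotonic() - start
    return {
        "time_to_first_byte_s": (first_byte_at or start) - start,
        "tokens_per_s": tokens / elapsed if elapsed else 0.0,
        "bytes_per_s": bytes_seen / elapsed if elapsed else 0.0,
    }

def fake_stream():
    for piece in ["Quarterly ", "returns ", "look ", "stable."]:
        time.sleep(0.1)   # simulated generation delay
        yield piece

print(measure_stream(fake_stream()))
```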
Just shifting the way we're thinking about performance and observability is also important when we look at this infrastructure, to understand the health of our system. Yeah. I like that idea of getting more creative with the performance metrics. It reminds me of a conversation we had with Krishna, who works at Qualcomm.
And he was talking about how sometimes when you're putting AI onto edge devices, you want to optimize for battery life. And ways that you can do that are by
streaming fewer tokens, or making sure that the tokens aren't streaming at 300 tokens per second, because people can't really read 300 tokens per second. So why stream them that fast if people aren't going to read them that fast? And if it can mean less battery consumption, then you stream at 20 tokens per second or whatever it may be, whatever that happy medium is. Yeah, absolutely. That's a really good perspective. So yeah, they all add up.
I think it's fascinating that in this space so many things are related. I normally don't think about battery life, because I work in places where, you know,
power is plugged in. It's off the mains, right? There's no charged-up battery; we don't have battery-driven servers, right? But coming back to your question, do you have to be technical to be able to advocate for a community in this space? Yes, I do think it would have been very hard for me not to have my
technical software engineering and software engineering manager leadership background. Interesting. I think that that would be very hard because I'm in such a technical space, right? It's so deeply technical.
And I believe it would be very hard to not have experience in this space because sometimes I can hear problems people are describing. Like, oh, this Python gateway is somehow very unpredictable in the latency the gateway itself adds to my requests. Cool. I hear that.
But they can't necessarily express that the challenge they're running into is that they have run out of waiters in the restaurant, so their requests are waiting outside for an available table and waiter. And that is why sometimes the request takes 100 milliseconds and sometimes it takes 700 milliseconds. They can explain the problem they're observing, but you have to understand, okay,
are you using a Python gateway? What are the limitations of the Python language? That may be the actual foundational cause of the problem you are observing. So to be able to advocate for the user in that case, I need to understand the cause of their
pain, to then advocate for solutions to be built in the industry to address their need. Hopefully that makes sense. I think it'd be very hard if you don't. Often you are dealing with users who are observing challenges but aren't in a position to see the opportunities, or the causes of the challenges they're observing. It's almost like you're part product manager, part
like customer success or support in a way and in part community or externally facing. I don't know how you would qualify that. And so it's interesting to think about how these deep questions, deep technical questions or discussions are coming across your desk and you're seeing them and then you recognize, wow, there's a pattern here.
And maybe that means that we should try to build it into the product so that we can help these users with this pattern
because I keep seeing it come up. So let's get ahead of it. Let's create a feature or whatever to help the users so that they don't have to suffer from this problem anymore. Yeah, and I think on that, being ahead of a problem: I used to be in platform engineering, building internal API platforms, service platforms and such in my old roles. And...
The challenge always is that you need to try and build things before people are really in pain. Yeah. Because when people are in true pain, you are too late. So you need to be able to see the symptoms of something that's going to become really painful later,
early enough that you can address them, so you have the cure when people come around. But what has been really great is that, fortunately or unfortunately, however you want to see it, some people had already started running into these problems. If you can work to solve those problems for a small set of people early, and then showcase that, you can then help
the second wave of adopters not have to experience the growing pains that the first pioneers had to experience. But yeah, generally it's sometimes hard to even explain to the wider community the purpose of doing something, because they might be like,
this problem you're talking about, I don't have it. And we're like, that's great for you. That's amazing. That's awesome. I'm happy you don't have it. Yeah, talk to me in advance. So that's the other tricky part of advocating for a community because sometimes you have to advocate for people in the community where they don't feel like you need to advocate for them yet. They're like, well...
And also they don't need to know what I do, you know, when I talk to people. Behind closed doors. Exactly. But I do think it's important to read where the community and the industry are at, where they are now, and how long we think it is until the pains will start being felt. One thing that's got to be hard about your job is the shifting sands that you're building on.
I say that because a friend of mine, Floris, was saying how, in the beginning of the LLM boom, he did so many hacky things to allow for greater context windows on their requests.
And he spent so much time on that problem. And then next thing you know, context windows just got really big. And so he was sitting at his desk like, damn, all that time I put into this, I could have just waited three months or six months and it would have happened on its own. And so I think about in the world that you're living in, as you're thinking through some of these problems,
How do you ensure, if at all, that you're working on the right problem and you're not going to get into a situation where something that you've worked on for the past six months now doesn't matter because of another piece of the puzzle changing and totally making it
obsolete in a way. I think for me, when I started looking at features to aid and enable Gen AI traffic, I've got to be honest, the first time someone said AI gateway to me, I was like, really? What are you talking about? It's just network traffic. We've been dealing with network traffic for a long time. This is just someone sticking an AI label on a network component. Because they need to raise some funds. Yeah. And I was just like, this is ridiculous. And I even had that moment where I looked at these problems of
the big payloads in requests and responses and the unpredictable response times and the high compute utilization. I was like, yeah, I had that problem for years. And I looked at it, I was like, yeah, that's not a Gen AI problem. I've had that problem in FinTech for a long time. Congratulations that you joined the party. That was my initial reaction. I was like, okay, fun for you. Welcome to the club with these problems.
But that's almost what made me really excited: the problems being run into in the Gen AI space, I had experienced for over five years. And I was like, well, I've experienced this for five years, and everyone told me the problem I had was a self-inflicted problem. And now the number of people experiencing the problem has just grown.
That made me feel like, okay, even though we can drive the innovation at this point because of the explosion of Gen AI, these features don't just benefit Gen AI. They benefit beyond Gen AI. And therefore, I don't worry about this work becoming obsolete. These feel like fundamental problems in how we handle network
traffic connections and processing of request and response payloads. And so therefore in this particular space, I think we're just actually late. We should have solved this a few years ago. Yeah, I guess a lot of the energy too and the advancements and the money are going into the model layers and application layers, but not necessarily the
The nuts and bolts and the piping and tubing layers. It's almost like this is a bit of an afterthought when you do hit scale and when you do realize that
oh yeah, we want to make this production grade. How can we do that? And you start to think about that. Hopefully you don't think about it once you already have the product out and it's getting requests. Hopefully you're thinking about it before then. Yeah, like when people come out of their POC workshops, how do they get it industrialized? Exactly. It's a...
And I do think that people should start in their POC workshops, right? They should be playing around, seeing what's possible. I don't think people should scale their systems before they need to scale them. That is... Like, what if you have a bad idea? Like, leave it in the POC workshop. Don't take it out. It can stay there. On the shelf, you can be like, I built that once. That was cute. No one needed it. And that's okay. So...
So don't scale too early, but now at least I feel that there are enough people out there who need the scale. And I'm really excited about how we don't just bring these features into Envoy AI Gateway; this notion of having usage limits that aren't just a number of requests, that feature is available in Envoy Gateway now.
You don't need to use Envoy AI Gateway to leverage that; it is now part of Envoy Gateway as of the 1.3 release. So if you have a financial analysis API, you don't have to go and install Envoy AI Gateway to have a more intelligent, well, intelligent is the wrong word, dynamic is the right word, a more dynamic way of enforcing usage limits beyond the number of requests.