Hi, welcome back to another episode of Real World Serverless, a podcast where I speak to real-world practitioners and get their stories from the field. And today we are joined by Khawaja Shams again, from Momento. Hey, man. Welcome back. Hey, Yan. Always good to see you.
Yeah, we spoke a little while back about Momento, and I've been a big fan of what you guys are working on. And recently, I saw a post on LinkedIn from one of your engineers talking about how one of your customers suddenly just decided to do something like a million requests per second, and it had only a small impact. I think you were talking about a couple hundred microseconds of difference in your response times. Yeah.
And something interesting came out of that conversation, which was about cellular architecture, something that I know you guys were doing at AWS previously. So yeah, maybe before we get into it, tell us a little bit more about Momento and yourself again, and how you got from, you know, AWS to Momento. Yeah, happy to. My name is Khawaja. I started my career building cameras on board the Mars rovers.
I'm impatient, so I moved on to building software. Software is much more instant gratification than hardware. I built the image processing pipeline for the Mars rovers. Impatient again, because I didn't want to wait for servers, so around the 2008-2009 timeframe I moved my entire pipeline onto AWS, using a lot of the original serverless services, S3 and SQS, to really power those image processing spikes.
I joined Amazon, ran the DynamoDB team, and then worked at AWS Elemental to run product and engineering for the media services.
And today, I am a co-founder at Momento. We take a lot of the lessons that we've learned building high-performance, mission-critical, high-scale applications and make it easy for our customers to do all that without having to worry about the low-level infrastructure minutiae, specifically as it pertains to caching and messaging.
Yeah, I really love the stuff you guys have done around Topics. Topics just makes building real-time communication systems so much easier, whether it's building some kind of push model where your website's background processes tell the front end that something's done,
or something more interesting like real-time chat or games, where you need to communicate updates in real time from one place to another. Topics just works really, really well for those use cases, and it's so simple to use as well. I know AppSync has released a new Events API to try to make it a little bit more accessible for folks who want that
nice managed WebSocket messaging without having to use GraphQL. But compared to Momento Topics, it's still lacking some of that fine-grained access control, for one. And also, just in terms of ease of use, you guys have done a really good job with Topics.
Thank you. I think developer experience is key for us, and that includes how quickly people get started, but also how few distractions they have to deal with when things get big and spiky. Our main value prop is to get customers going, and to keep them going as their events become successful. And we constantly look for things that distract customers. Fighting with WebSockets tends to be a pretty frequent occurrence for anybody trying to build real-time notifications into end-user applications, and we saw that as a great opportunity for us to add value by simplifying how quickly people can add messaging to their front-end applications.
Talking about distractions, thinking about uptime and redundancy and all of that is also a big concern, especially as you start to build things for millions of concurrent users and really high-throughput systems. And that's one of the things that you guys are really good at: that kind of scaling and resilient architecture.
And a big part of that is something I want to talk about today, which is cellular architecture, something that I've been reading more and more about, and something that folks like Marc Brooker have talked about quite a lot in terms of multi-tenancy at AWS. And I know that you were involved with a lot of those discussions while you were at AWS.
And yeah, so maybe tell us a little bit about the incident that happened at Momento, and how you guys are using cellular architecture to, I guess, limit the blast radius of what that one noisy customer was doing. Yeah, I mean, let's go backwards in time and talk about why Amazon went all in on cellular architectures. When
AWS initially came out, the Virginia region was quite popular. I mean, it was the first region, and the default on the console. So a lot of customers ended up just defaulting to deploying their services in Virginia. James Hamilton always talks about the power of defaults, and I'm always surprised by how much people just pick whatever the default is. Yeah.
And what happened over time was that the Virginia region kept getting bigger and bigger, and you'd basically learn about all of the scaling limits there. If you were a service provider inside of AWS, all of the limits of your architecture and your software would get exposed first in the Virginia region, because it was the biggest. And
you know, if you look at some of the biggest outages that have happened in AWS, at least historically, most of them were in the Virginia region, and a lot of them were just due to the scale. You can do all kinds of staging and, you know, throttle your deployments and so forth, but by the time things get to Virginia, they've never been tested at that scale. And
for AWS, one of Marc Brooker's principles is to do constant work. And if you look at it, the region with the highest density of customers was the most unlike every other region.
So when that region messes up, you mess up a lot more people, but it's also the least tested. So the brilliant folks at AWS came up with this notion of, well, we can cellularize the regions and the services within the region. All that means is you divide up the region into small cells, right?
And this way, your cells start to look a lot more similar to each other. You know, IAD, the Virginia region, might have lots and lots of cells that are closer to the size of the second-biggest region, and so forth.
But it also gives more engineering constraints to the people who are designing the architecture. So circa 2015, right after AWS purchased Elemental and we were launching a bunch of media services, the edict inside of AWS was that all newly shipping services were to be built with cellular architecture from the ground up. So I got to experience building those services, and it's kind of mind-blowing how much just thinking about reducing the blast radius does to increase your overall availability and improve the design of your service.
Yeah, I remember back in the day, it wasn't just us-east-1 itself, but us-east-1a. Specific AZs were so much bigger than others, again, because of the default. And I think it was Corey Quinn who was joking at one point that if you want to run chaos experiments, just deploy your workload to us-east-1. Yeah.
Because essentially you get that for free: more of the failures are going to happen over there compared to the other regions. And I remember, what was it, about seven or eight years ago, they started to shuffle around the AZ names so that your us-east-1a is not my us-east-1a. That way, they were able to, I guess, move customers around to different AZs so that they're not all concentrated on us-east-1a, which is, again, the default AZ for everybody.
So in that case, a question that comes up a lot is: how do cellular architectures differ from something like microservices? I guess from what you described, cellular architecture is basically taking your one microservice, say, but deploying different copies of it, and then you put different customers into each of those cells, so that when one customer is doing something crazy, they may be a noisy neighbor, but they're going to be a noisy neighbor for ten other customers as opposed to all the other customers in the same region. Is that the idea: that you're dividing your customers into different groupings, as opposed to dividing your application into different services, as is the case with microservices?
I think that's one way to think about it. I think you hit the nail on the head with microservices. Usually when you think of microservices, there's some heterogeneity, right? There's a variety of different microservices. Cellularization is to take the same service and then break it down into lots and lots of cells. And
the core value prop here is just to reduce the blast radius of what happens when a cell goes down. Cells can go down for lots and lots of reasons. It could be a bad customer, or a customer having a bad day. It could be because you have a bad deployment. And all of this, again, comes back to reducing the blast radius, and then giving the team engineering constraints: this is how big you expect your cells to get.
So that's why people cellularize. And the idea here is you can cellularize and put customers in specific cells. And that's a good idea for many reasons. It helps you do better capacity planning and so forth. But if you really go all in on cellular architecture, you can get to a stage where you can put your largest customers within their own dedicated cells as well.
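To make the placement idea concrete, here's a minimal sketch of how a control plane might map customers to cells, with dedicated cells for the largest ones. The names and the hash here are illustrative, not Momento's actual implementation:

```rust
use std::collections::HashMap;

/// Illustrative cell placement: big customers get a dedicated cell,
/// everyone else is spread deterministically across shared cells.
struct CellRouter {
    dedicated: HashMap<String, String>, // customer id -> dedicated cell id
    shared_cells: Vec<String>,          // pool of multi-tenant cells
}

impl CellRouter {
    fn cell_for(&self, customer_id: &str) -> &str {
        if let Some(cell) = self.dedicated.get(customer_id) {
            return cell; // the largest customers live in their own cell
        }
        // Stable hash: the same customer always lands in the same shared cell,
        // so a noisy neighbor only affects the other tenants of that one cell.
        let h = customer_id
            .bytes()
            .fold(0u64, |acc, b| acc.wrapping_mul(31).wrapping_add(b as u64));
        &self.shared_cells[(h % self.shared_cells.len() as u64) as usize]
    }
}
```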
And if you do this from the ground up, it actually helps you in many ways. And this is one difference between the way Momento does cellular architecture and the way a typical AWS team does it. AWS teams will have lots and lots of customers within a given cell, and their job is to improve the utilization of each one of those cells. The way Momento does this is, for our largest customers, we give them their own private cell.
Both our cells and AWS's cells are multi-tenanted, and multi-tenancy helps you improve utilization and so forth. In our case, our customers benefit from the increased utilization by leveraging multi-tenancy within their own organization, and we can make trade-offs that are very customer-specific in terms of how hot to run a particular cell.
So having this cellular architecture gives you degrees of freedom across different customers as well, if you really go all in. And fundamentally, at the bottom of every cellular architecture is the question of how quickly you can deploy new cells as customers demand them. Having a really good infrastructure-as-code stack allows you to be nimble and to deploy a cell for every customer, without having to spend engineer-weeks building out a cell for each region or whatever it may be.
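As a rough illustration of that, a deploy driver can stamp the same stack definition out once per cell. This sketch assumes a CDK app that reads these context values; the cell descriptor and its fields are made up for illustration:

```rust
use std::process::Command;

/// Hypothetical cell descriptor; the fields are assumptions, not a real schema.
struct CellSpec<'a> {
    cell_id: &'a str,
    region: &'a str,
    instance_type: &'a str,
}

fn deploy_cell(cell: &CellSpec) -> std::io::Result<std::process::ExitStatus> {
    // One stack definition, deployed once per cell via CDK context values.
    Command::new("npx")
        .args([
            "cdk", "deploy", "--require-approval", "never",
            "--context", &format!("cellId={}", cell.cell_id),
            "--context", &format!("region={}", cell.region),
            "--context", &format!("instanceType={}", cell.instance_type),
        ])
        .status()
}
```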
Right. And that's one of the multi-tenancy strategies that I think AWS has often talked about, where essentially every tenant gets its own copy of the same application. And one of the trade-offs there has always been: okay, now you have to put a lot more work into automation, and make sure that when you push out a code update,
it's going to get deployed to all the customers quickly. And, okay, what do you do when one of the tenants, one of the cells, fails to update? What do you do at that point? What are some of the, I guess, real-world problems you guys have run into with these kinds of things, like deployments taking a very long time because there are so many different cells, and what do you do when there are problems and you don't know if they're down to the new code change? Do you just keep going? Do you roll back? Yeah, I think Amazon ran into this problem as well; we've run into it at a much more modest scale. In the early days Amazon just had a handful of regions, and the trade-offs we made on how we design deployments, how we do staging and so forth, the trade-offs that work when you have five regions don't work when you have five dozen regions. We're running into some of those problems now, because if you're putting a day of baking across each of your stages, now either your stages get very, very wide, or it's taking you weeks to roll a deployment forward. So cellularization gives you more cells, but it actually slows down your deployments a lot.
And that's a big problem. And the challenge here is that if you overcorrect and say, okay, now I'll just deploy to 50 cells at a time, well,
you defeated the purpose of reducing the blast radius as well, right? Then you're back to the same problem. So you've got to be very, very mindful about your deployment strategy with cellular architecture. The other AWS tenet that we're starting to question a little bit, as we evolve our own cellular stance: AWS puts every cell in its own account. That's the internal best practice, and it's a really good practice. But, you know,
if I have to log in, or if I have to track my resources, across lots and lots of accounts, that gets very challenging as you start nearing 100 cells. Now you have to find the account, you have to make sure you're in the right account, and
any diagnostics that you're doing, any capacity management across the fleets, becomes much, much more challenging. So the new feature where I can log into multiple accounts from my browser in the console, that's very useful to us. But even then,
it's really difficult to deal with the number of accounts. So we're now starting to question: do we really want each cell in its own AWS account, or do we want to get better at deploying units of change for our customers, but stick with fewer AWS accounts along the way?
Yeah, I've heard that as well about AWS internal best practices, in terms of the security team treating the account boundary as the only boundary that they fully trust. Is that the only reason why everything has to be deployed into its own account? It comes back to blast radius. So imagine if the account gets hijacked, or if the account gets deleted
for any reason, or somebody does a bad deployment where they reset the password of the account, or they delete all the Dynamo tables in the account, right? And hopefully you have more than one table, because the blast radius with single-table designs is pretty bad. But anyway, the point being that
It's a really nice isolation boundary for blast radius reduction. And one bad deployment doesn't mess up anything else. And now the account boundary serves as a namespace. So in each one of my cells, I can have a Dynamo table called
control-plane metadata or whatever, and that's totally fine. Whereas if you're keeping multiple cells in the same account boundary, then you've got to namespace the resource names. And then your access control becomes a little trickier, because now you've got to create IAM policies for each deployment so that it only has access to its own resources and so forth. So things get
a little more challenging to enforce, and the blast radius harder to reduce, if everybody shares the same account. That's where the trade-off becomes even more exciting and interesting. Okay. And was it true that even when you guys were doing multi-region applications, every single region had to be in a different account? So it's not just multi-region, it's multi-region and multi-account?
Yeah. Each region should have multiple cells and each cell should be its own account. And that's true in Momento today as well. Like we have lots and lots of AWS accounts in lots and lots of regions.
Another real-world thing that we run into is limits. We make a brand-new account, and, well, today I'm actually really proud of our infrastructure-as-code capabilities. We can pull up a cell in a matter of an hour, but it takes a little bit longer to get it productionized. Why? Because we're waiting on tickets to get our limits increased on a brand-new account.
Oh yeah, I remember those days. Every time you open a new account, there's a bunch of limits: you have rate limits around EC2 and VPCs and all this other stuff. Another thing I always wondered about cellular architecture is that, with AWS, it's already quite difficult to think about SLAs and availability, when a given service is never really quite down per se, because it's always available in some other region, isn't it?
But when you have multiple cells within the same region, how do you start to reason about SLAs and your uptime, especially when there are monetary refunds and credits involved when you have violated your SLA? Do you have to do that at a cell level or at a customer level? Or is it something that you only measure at the service level or region level?
We have to measure SLAs at the customer level. And I learned this the hard way; I used to do this very, very wrong. I remember I showed up to the yearly planning for Dynamo, and we very proudly said, hey, for the last year we met our fleet-wide SLAs at 100%.
The EBS team was in the room, and I think it was David Richardson who raised his hand and said, bullshit, that's not true. We know we've had impact because of you guys. And it was really eye-opening for me, because he was absolutely right. What was happening in the Dynamo team was that fleet-wide error rates
were fine. They were always within the 99.99% SLA that we held ourselves to internally. But from a customer perspective, if a single partition is down, that customer is meaningfully impacted, even though it's one of the hundreds of thousands of partitions in the region. So while, as the Dynamo team, we could come in and say, "Yeah, we won. The fleet-wide error rates are super low,"
that customer is clearly not represented by the internal SLAs that my team had. So we very quickly moved towards measuring SLAs on a per-customer and per-table basis. And that's a lesson that we've carried on at Momento. You want to measure your availability and your error rates on a per-customer basis, and you need to give them monetary compensation
for the SLAs that you agree upon, based on their experience as opposed to your fleet-wide experience. Okay. So in that case, now that we've laid out the lay of the land around cellular architectures, I want to talk about that particular incident you guys had with that crazy customer.
I wouldn't call them crazy. Crazy scale. Yeah, crazy scale. I mean, this is the power of multi-tenanted architectures. Look, if you deploy your services on Dynamo OnDemand and S3 and you hit a million requests per second, chances are that it will work. But if you had concocted your own EC2-based framework and you are confronted with an unanticipated spike,
Chances are you're screwed. These multi-tenanted serverless services are just way more capable of handling the scale. Why? Because any spike you can produce, we've probably seen it, right? So if you show up today and start running 10 million TPS on a Dynamo table, you wouldn't be the first. And the system as a whole has been architected and tested
repeatedly for it. So that's the key value prop of these large-scale multi-tenanted systems. In our case, we had a customer that had a good day. They're a gaming company. They had a great day, and their traffic went, I think, from 10,000 TPS to over a million TPS. And the funny part was that Prateek, who was on call, only realized when he woke up in the morning that this big spike had happened overnight, and we didn't even
notice it. And this was nice. It's a big change for a company like ours, because in the past, anytime a customer had spikes coming, we would love to know when they were anticipating them, and, you know, we'd work hard to make sure everything was there and ready, and we'd run our load tests and so forth. Now it's getting to a point where it's like, oh, cool, you did
a couple of million TPS, and we absorbed it and didn't even notice. There were times, right when we launched the service, when we would have alarms go off if usage went above a certain threshold, just so we could keep an eye on things even if nothing was wrong, because if a customer is spiking up that high, maybe it's a really crucial time for them, and we don't want to let them down. But this was a really welcome change. Now, the paranoid operator in all of us
isn't going to let that go either, right? Like, okay, you bumped up over a million TPS. How did the fleet do? What were the corner cases that we ran into? And we observed that, hey, we did have a 200 microsecond bump at our P99.9 latencies. So then we started digging into, well, why? It's not like we were capacity constrained, but why is it?
that these latencies went up. And, you know, that started the conversation around just diving into the metrics and asking the five whys until you get to the root cause of where your performance issues are coming from, even if there isn't a major observable customer impact because that's how you continuously improve your service.
Right. And in that case, with your architecture, how far does one cell go? I mean, with microservices, every service has got its own deployment, its own cluster, its own databases, but there's still that communication between different services.
So if you are taking one of your services and you deploy different cells of that service, you still have to talk to other services; there are still those interdependencies between services. How do you then, I guess, orchestrate that? Do cells have to talk to specific cells on the other side? Or is that all down to routing, where the other service has its own cellular network of different clusters and routes based on a customer ID or something to the correct place? Yeah. So today, Momento cells are self-contained. For reference, you can think of us as a caching fleet. It's a two-tier architecture: there's a routing fleet and there's a storage fleet. And
there's no cross-cell communication today. When we build cross-region replication and things like that, we might have it. But for now, a cell is fully contained and can only talk to resources within the cell itself. So requests come in, the routing fleet terminates the request, authenticates the customer, looks up the metadata of the cache they have access to, authorizes the request, and then sends it along to the right storage node.
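Here's a minimal sketch of that request flow. Every type and function below is an illustrative stand-in, not Momento's actual internals:

```rust
// Illustrative two-tier routing: authenticate, authorize, then pick a storage partition.
struct Request { token: String, cache_name: String, key: String }
struct Caller { account_id: String }

fn authenticate(token: &str) -> Result<Caller, String> {
    // Terminate the connection and validate the caller's token (stubbed).
    if token.is_empty() { Err("unauthenticated".into()) }
    else { Ok(Caller { account_id: "acct-123".into() }) }
}

fn authorize(caller: &Caller, cache_name: &str) -> Result<(), String> {
    // Look up the cache's metadata and check this caller can access it (stubbed).
    if cache_name.is_empty() { Err(format!("{} has no such cache", caller.account_id)) }
    else { Ok(()) }
}

fn storage_node_for(key: &str, n_partitions: usize) -> usize {
    // Route the request to a storage partition within the cell by hashing the key.
    key.bytes()
        .fold(0usize, |h, b| h.wrapping_mul(31).wrapping_add(b as usize))
        % n_partitions
}

fn route(req: &Request) -> Result<usize, String> {
    let caller = authenticate(&req.token)?;
    authorize(&caller, &req.cache_name)?;
    Ok(storage_node_for(&req.key, 100)) // the ~100-partition cell mentioned below
}
```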
But as you start thinking about scaling your cells, you know, we're a horizontal-scaling caching company, but vertical scale makes a big difference in how far up you can scale the horizontal aspects of it. And we gave our engineers a very specific engineering requirement, which is: be able to deal with
100 routers in your cell, and be able to deal with at least 100 partitions, where each partition will have some replicas. So that's how we started. And today, at just 16 cores, our routing fleet does a couple hundred thousand plus QPS per node. So
100 nodes gets you 20 million QPS, assuming perfect balance. So on the order of 20 million TPS on a single cell is the engineering constraint that we live with. Our storage nodes do the same; it's 200,000. And that's a number that we have come a long way on. When we were running in the Java world, the number was 20K per node, and we were using 4XLs for it. Now we're, you know,
working our way down to 2XLs and doing hundreds of thousands of TPS. But the idea is very simple. We put engineering constraints on what the size of a cell can be from a physical-node perspective, and then we work really hard to drive up what each node can do, so that we can handle bigger horizontal scale. If a customer shows up and wants to do more than 20 million TPS, chances are we're going to tell them to split between multiple cells.
Right, gotcha. Okay, and when you said QPS, that's queries per second, so basically TPS for the routing layer. Okay, so in that case, I forget which service I was using, I think it was the Timestream service, where, as a user of the SDK, you actually have to choose which cell you want to connect to. And is that what you meant when you say that when a customer wants to do 20 million TPS, they have to make a conscious decision
which request goes to which cell. Is that what you meant?
Yeah, I mean, we haven't had a customer show up above 20 million QPS, or TPS, yet. So if somebody does show up and wants to do 100 million, chances are we're going to have them split across cells. In our world, the cellular information is baked into the request token as well. So the JWT that we vend you has all of that information. Each cell has its own. Again, it's fully automated,
fully isolated, fully contained. So your tokens today don't work across the cells either. So you'll have different tokens for each cell and that's how you make the decision between which cell to go to. Okay, gotcha.
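For illustration, this is roughly what "cell information baked into the token" can look like from the client's side. The claim names here are assumptions, and a real client would verify the signature rather than just decoding the payload:

```rust
use base64::{engine::general_purpose::URL_SAFE_NO_PAD, Engine as _};
use serde::Deserialize;

/// Hypothetical claims; real tokens would differ, and must be
/// signature-verified before being trusted.
#[derive(Deserialize)]
struct Claims {
    cell_endpoint: String, // which cell this token is valid for
    cache: String,         // which cache it grants access to
}

fn cell_endpoint(jwt: &str) -> Result<String, Box<dyn std::error::Error>> {
    // A JWT is header.payload.signature; the payload is base64url-encoded JSON.
    let payload = jwt.split('.').nth(1).ok_or("malformed token")?;
    let claims: Claims = serde_json::from_slice(&URL_SAFE_NO_PAD.decode(payload)?)?;
    Ok(claims.cell_endpoint) // the SDK connects here, and only here
}
```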
So, bearing in mind all these extra complexities that you have when it comes to cellular architecture, where would you say the threshold is? When someone is building a service, when should they consider building it with cellular architecture, versus just, you know, trying to squeeze everyone into one cluster, essentially? I think
we got lucky, because we started cellularization in the very early days. There's a life principle that I live by, which is: you'll always be busier tomorrow. So the temptation is always to say, let me build this thing out, and then I will get back to cellularization later. But if you have multiple customers, and if you are going to deploy in multiple regions, it's worthwhile, even if you model your initial service as a single cell. There are just good habits that get built into your setup: having the right infrastructure as code, having the right deployment setup. All of that is incredibly helpful to invest in, and easier to invest in at first. And that keeps you focused. Then when you go from that single cell to the second cell, that's where you start to make
changes in your deployments, in your staging environments, and things like that as well. And in our case, our dev environment is a cell. Every developer at Momento has access to their own cell that they can whip up just by running their CDK scripts. Our staging environment is a cell. Our load-testing environment is a cell.
Every region has multiple cells, and that has been an incredible productivity boost for us. Whereas in Dynamo, at least when I was on the Dynamo team, we used to have dedicated regions or deployments, and developers would check out
and say, hey, everybody, I'm going to go work on this cluster and nobody else touch it. Whereas our investment in cellularization at Momento has paid off: every developer can just spin up their own cell at their whim.
Yeah, it's funny, because what you're talking about is basically what I've been preaching and teaching my students about ephemeral environments. I don't call them cells, but essentially it's just a copy of your entire application that you can bring up in its own stack, and you can have multiple stacks.
And so you will have your main dev, test, staging, and production environments in their own accounts. But then, say in your dev account, you can just bring up a copy of your application or your service, for individual developers, for a feature, or for a CI/CD pipeline run, so that at the start of the pipeline you create a new
temporary environment, run all of your tests against it, and when you're done, tear it down, so that you don't have to worry about polluting your main dev environment with test data, and every developer and every feature has got its own environment, or cell in this case. So it feels like more or less the same idea: if you've got your infrastructure as code right, you make sure you've always got something like a stage name, environment name, or cell name as part of the resource name, so that you can always tell that, okay, even though I've got multiple copies of the same DynamoDB table, they belong to different cells. And then within that, I can still do multi-tenancy, so that I can follow the same practices around using the tenant ID as the hash key and doing tenant-level IAM permissions and things like that (there's a sketch of this just after the next exchange). But then I can create different copies of my application as different cells, so that when I need to, I can bring all of it together and scale out in terms of cells, as opposed to just bigger and bigger DynamoDB tables and Cognito user pools and so on. That's such a really interesting way to think about it. I hadn't quite connected the two together, even though, now that you mention it, it's basically the same thing. It's absolutely the same thing. I'll just share a different perspective. You call them ephemeral environments. I think that is one attribute of those environments. The other attribute is that they're isolated environments,
right? They're isolation units, and a cell is an isolation unit. And sure, the ephemerality is nice because it's temporary, but you could have the same thing permanently: if your students had an ephemeral environment but wanted to keep it forever, that would be a cell. Right. Yeah, exactly. Because of the isolation. Essentially, it's a permanent environment, but it's using the same technology and techniques as building ephemeral environments.
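As a sketch of the multi-tenancy-within-a-cell point above: with the tenant ID as the partition key, every query is naturally scoped to one tenant, and IAM's dynamodb:LeadingKeys condition key can turn that into a hard boundary. This assumes a recent aws-sdk-dynamodb; the table and attribute names are made up:

```rust
use aws_sdk_dynamodb::{types::AttributeValue, Client};

async fn items_for_tenant(
    client: &Client,
    table: &str,
    tenant_id: &str,
) -> Result<usize, aws_sdk_dynamodb::Error> {
    // The tenant ID is the hash key, so this query can only ever touch one
    // tenant's partition; an IAM policy conditioned on dynamodb:LeadingKeys
    // makes that a hard boundary rather than just a convention.
    let resp = client
        .query()
        .table_name(table) // e.g. "{cell_name}-orders" in a cell-scoped stack
        .key_condition_expression("tenant_id = :t")
        .expression_attribute_values(":t", AttributeValue::S(tenant_id.to_owned()))
        .send()
        .await?;
    Ok(resp.items().len())
}
```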
That's right. And if you build these ephemeral environments, then your load testing becomes easier, your ability to reproduce issues becomes easier, your ability to develop becomes easier. It's a really worthwhile thing. And I keep coming back to this: the reason why you do cellularization is to reduce blast radius, and the way you do cellularization well is to invest in infrastructure as code.
Yeah, and again, with serverless and the pay-per-use pricing model, one of the benefits, and I think one of the reasons this whole practice became more prevalent with serverless, at least for me, is that it doesn't matter whether you've got ten copies of the same application or one copy; you only pay for what you use. If you're doing 100 requests, that's 100 requests, whether it's across one environment or across ten. It's the same cost.
There's no cost for extra uptime for different redundant copies of the same application.
Absolutely. The consumption-based environments work really, really well for us. In fact, a lot of customers end up saving money when they move over to us, because they don't have to have fully provisioned resources in both their staging and their dev environments. They only pay for those resources when they're actually running a test.
Okay. Yeah, that's great. This has been an eye-opening moment for me in terms of thinking about cellular architecture. I guess in that case, you guys have been handling some pretty heavy loads already. What have been some of the lessons you've learned so far, besides: you should always start with cellular architecture, so that at least you are ready for it when you need to bring out the second cell and the third cell and so on?
Yeah, I think make the investments early. And the best way to make investments towards cellular architecture is to invest in your infrastructure-as-code pipeline
and your observability pipeline, because the more cells you have, the more important the observability practice becomes, and it pays off later. The other part, in our real-world experience, a lot of it has to do with performance tuning and scaling up the environment. We've learned a lot about how to save money and how to be more efficient. And
sometimes paying attention to that actually improves your performance and the scale that you can handle. And there are all kinds of things that we're learning about which workloads are better on different EC2 instance types. Like, we run into customers that are very, very heavily network-oriented, but they don't consume any RAM. They just have, you know, a few keys, and they just want to hit them as hard as possible. So for those, there's the C7gn instance. And if you use an R7g instead, you'll end up wasting all this extra RAM that's just sitting there, because what they really need is network. The C7gn has a quarter of the RAM but a lot more network than an R7g. So be very, very thoughtful about instance types. And then, if you have the infrastructure as code right, which I think we do, we're not perfect, but we've made a lot of investments,
different cells can run different instance types based on the characteristics of those cells. So we run our routing fleet on C7gn in some regions and C7g in other regions, and it's totally fine. Okay. And what are some of the specific optimizations that you've learned, besides picking the right EC2 instance for each cell, for, I guess, the router layer as well?
Yeah, so we work a lot on reducing the cost of the extra hop from the routing fleet to the storage fleet. And we spent a bunch of time on it; we rewrote our entire service in Rust. One of the biggest lessons learned there was that when we moved from Java to Rust,
our ability to handle the load was neck and neck. We got maybe a 5% increase. But over time, once you start getting into the hundreds of thousands of transactions per second on a single node, that's where techniques like core pinning and flow routing start to actually help you drive the box hotter and hotter.
Other techniques, like AWS placement groups, are incredibly underutilized. Placement groups started as a high-performance computing construct, but basically, if you want nodes to be very tightly grouped together with low latency between each other, you can put them in a cluster placement group, and that will meaningfully drop your latencies, down to tens of micros between two nodes. That sounds pretty inconsequential, but it really comes into play when you're trying to drive a lot of throughput between two nodes over a very small number of connections, right? Because the round-trip delay actually limits how many TPS you can do on a single connection. So placement groups, core pinning,
and flow routing have all been helpful since we moved on from the Java world and rewrote our entire stack in Rust.
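For reference, creating a cluster placement group is a single API call; instances then launch with that group name in their placement settings. A sketch using the Rust AWS SDK (aws-sdk-ec2):

```rust
use aws_sdk_ec2::{types::PlacementStrategy, Client};

async fn create_cluster_group(client: &Client, name: &str) -> Result<(), aws_sdk_ec2::Error> {
    // "Cluster" packs instances onto the same low-latency network segment,
    // which is what drops node-to-node latency to tens of microseconds.
    client
        .create_placement_group()
        .group_name(name)
        .strategy(PlacementStrategy::Cluster)
        .send()
        .await?;
    Ok(())
}
```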
Okay. I've heard about core pinning. That is mostly about trying to maximize your cache affinity, so that essentially, when the same user hits you again, you go to the same core, which most likely already has some of the data in its L1 and L2 caches. So that's the idea of core pinning. I've not heard about flow routing before. What's that? Yeah. So for core pinning, look:
let's say you have an 8-core Graviton instance. What we would do is run our threads on six of those cores, and we would say, hey, thread, don't leave this core. But the really important part is to then also say, hey, kernel, your networking is only happening on these other two cores.
Because when you start processing too many packets on the same node, the packet interrupts can actually cause context switches on the cores that are doing useful work for you.
So real core pinning is not just pinning your threads, but also pinning the kernel, and specifically the networking interrupts, the RX queues, to specific cores as well. Flow routing just means that you can say, hey,
this core is going to continue processing the requests that come in for the same user. You can do that based on the source port and IP address, for instance, so that you're not handing the same user's work to a different thread over and over again as they send more data back and forth.
And this becomes more important in architectures where threads are just stealing work from each other and so forth. But flow routing is much harder. If you want to optimize, start by picking the right EC2 instance, check out placement groups, then core-pin. You can learn core pinning from scratch and deploy it within the same day.
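Here's a minimal sketch of both ideas, assuming the community core_affinity crate. Steering the NIC's RX-queue interrupts onto the reserved cores happens outside the process, for example via /proc/irq/*/smp_affinity:

```rust
use std::thread;

/// Flow routing: hash (source IP, source port) so the same user's connection
/// always lands on the same worker, instead of bouncing between threads.
fn worker_for(src_ip: u32, src_port: u16, n_workers: usize) -> usize {
    let key = ((src_ip as u64) << 16) | src_port as u64;
    (key.wrapping_mul(0x9E37_79B9_7F4A_7C15) >> 32) as usize % n_workers
}

fn main() {
    let cores = core_affinity::get_core_ids().expect("could not enumerate cores");
    // Leave the first two cores to the kernel's networking interrupts;
    // pin one worker thread to each remaining core so it never migrates.
    let workers: Vec<_> = cores
        .into_iter()
        .skip(2)
        .map(|core| {
            thread::spawn(move || {
                core_affinity::set_for_current(core);
                // ... run this worker's event loop here ...
            })
        })
        .collect();

    // Example: one user's connection always maps to the same worker index.
    println!("user 10.0.0.7:53211 -> worker {}", worker_for(0x0a00_0007, 53211, 6));
    for w in workers {
        w.join().unwrap();
    }
}
```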
But it's only useful if you're really driving your machine hot. If you're actually doing 100,000-plus transactions per second, core pinning goes a long way. Right. I actually got those two mixed up. What I was thinking of as core pinning was actually flow routing. And core pinning is the thing that I think the Erlang runtime does, because it only runs as many concurrent threads as there are cores in the CPU, precisely to minimize the percentage of CPU time that gets spent on context switching.
Exactly. I think some of the Java frameworks do the same thing as well, precisely for the same reason: to try to minimize the amount of context switching. Okay, that's really cool. What about in terms of
Rust itself? Because the techniques you talked about are also applicable to Java and other languages. If, like you said, you didn't get that much of a win out of just switching from Java to Rust, was there anything specific about Rust? I know everyone talks about how Rust is high-performance and all of that. But in your experience, what were some of the biggest wins from switching from Java to Rust?
You can write really slow code in Rust. That's just reality. So don't expect that just by rewriting your application in Rust, especially a complex application, suddenly everything is going to be much, much faster. Look, our team is filled with Java experts, and we have spent our entire lives tuning JVMs to do whatever they could do.
We learned Rust from scratch in the middle of running this company, right? So there were a whole lot of techniques that we didn't know. And, you know,
there are issues in Rust libraries as well. There are issues that we're actively debating on the HTTP libraries in Rust, where in certain scenarios they can actually be slower than Java. So...
At the end of the day, you're not going to get... There's no easy answer for performance. You still have to profile your application. You still have to find the sources of contention. You still have to deal with multi-threaded optimizations and so forth. So it's not an easy answer. You've got to keep tuning and...
Going from a well-tuned Java environment to Rust, you may not get something great out of the box because you also may not know all the tips and tricks inside of Rust yet. It's been a worthwhile journey. It just took actual investment to improve.
Yeah, I wish more people would talk about that. I think a lot of people talk about Rust as if it's the silver bullet, the magic dust that you can just sprinkle on your application and suddenly it's 10x faster. I mean, sure, if you're just doing a hello-world type of application, yeah, it's pretty fast. But things get complicated when the application is more complex.
Yeah, absolutely. Absolutely. Yeah, I think that's all the questions I had. Anything else you want to add before we go? Anything, I guess, any newer things happening at Momento? I know you guys were talking about object storage a little while back. How's that coming along?
It's coming along really well. We've been working on our favorite service, S3, and adding a cache on top of it. We're finding all kinds of new use cases, especially in the AI landscape, around temporary data that needs to be really, really fast when it's fresh, but needs to be persisted for the long term. And S3 is incredible for that:
low-cost durability at very, very high throughputs. And S3 plus Momento is just a really nice pairing. So we've been excited about that. We love talking to customers that have deep performance-optimization needs. And whether it has to do with Momento or not, we still want to talk to people, because then we learn from them, and our customers get the benefit of it. So if anybody wants to nerd out about deep distributed-systems performance issues, I would love to get time with you and learn from you.
Yeah. And I guess in that case, if someone wants to reach out and talk to you, what's the best way to find you? Is it going to be on X, or Twitter, or LinkedIn, or somewhere else? Yeah, I was about to say Twitter, but yeah, you can find me on Twitter at KSSHAMS, or on LinkedIn. I think my DMs are open on both. Just shoot me a note. And Momento is at gomomento.com.
You can also find me on the Believe in Serverless Discord. So believeinserverless.com, just join the Discord community. It's a really happening environment with over a thousand members that are just helping each other grow. And that's where I learn a lot and I'm happy to meet more people there too.
Yeah, I'll put the links in the description below. And I have to say the Believe in Serverless community is great. I really love jumping in there and occasionally just jumping onto a thread and talking about the problems people have because there are some really...
talented engineers on there, very experienced developers who are chipping in and helping each other out. It's great. It's a really good place to be if you want to learn about serverless, but also have fun talking about the more complicated, complex problems that people may have. Absolutely. And there's no zealots. Everybody's supportive. So even if you have questions about your Kubernetes cluster, bring them on over. We'll help you out.
All right. Okay. Yeah. Thanks for joining us again, and I will see you, hopefully, in the next couple of months, maybe in Seattle. And yeah, I will talk to you guys next time. Take care. Thanks, Yan.
So that's it for another episode of Real World Serverless. To access the show notes, please go to realworldserverless.com. If you want to learn how to build production-ready serverless applications, please check out my upcoming courses at productionreadyserverless.com. And I'll see you guys next time.