
Building Out GPU Clouds // Mohan Atreya // #317

2025/5/23

MLOps.community

People
Mohan Atreya
Topics
Mohan Atreya: I think GPUs are hard to get hold of and hard to use. Traditionally, companies either bought GPU hardware and built it out in their own data centers, or rented GPUs from cloud providers like AWS and Azure. Many tasks need a specific type of GPU to work optimally, and if you can't get the GPU you need, or it's too expensive, you're stuck. Enterprise IT departments often don't understand AI/ML needs because they operate in a standards-based way, so convincing IT to buy non-standard GPUs takes extra effort. A lot of AI/ML work is experimentation, and if the resources you need to experiment are hard to get, it holds the business back. Some cloud providers demand long-term commitments, or their reserved pricing is astronomical, which also gets in the way of doing business. In recent years a new crop of GPU cloud providers has emerged, such as Modal, Lambda Labs, and Baseten, and there is strong demand for this kind of service. These new GPU clouds are changing the market, and they are the kind of companies we work with. The challenges they face include finding data centers and power, and turning a pile of GPUs into a cloud service. We help GPU clouds get to market fast, because they need to start earning revenue as soon as possible. CoreWeave has done some interesting financial engineering to make its GPUs more valuable, and it wants to keep those GPUs saturated at all times. Part of CoreWeave's success comes from having Microsoft as a big customer funding the whole thing. The GPU clouds we work with mostly sell to enterprises or universities. Some universities want to launch AI/ML labs but lack GPUs and technical support. The new GPU clouds give them services like notebooks, Ray, and Kubeflow, helping them train the next generation of data scientists. By working with GPU clouds, students get hands-on experience that prepares them for careers in AI/ML. GPU clouds are enabling the next generation of AI/ML talent.


Chapters
GPUs are difficult to acquire, especially the specific types needed for optimal performance. Companies face challenges with IT departments not understanding specialized needs and the high costs and commitment required for cloud-based GPU access. Experimentation is crucial in AI/ML, but limited access hinders this process.
  • Difficulty in acquiring specific GPUs
  • Limited options: company-owned data centers or major cloud providers
  • IT departments may not understand specialized GPU needs
  • High costs and long-term commitments for cloud GPU access hinder experimentation

Transcript

- We're firing on all cylinders. GPUs are hard. You've been dealing with them a lot. Why are they so hard? - The first is they are hard to get hold of. - Number one. - Exactly. The ones you want, they may not be available. And usually, I mean, if you look back,

You historically only had two options. Your company probably would have purchased a bunch of GPUs in a data center, exactly, hardware, and set things up. And if the company is large, if the enterprise is large, they probably have a good number of GPUs that IT has set up and made available to you.

Or, in the last 15 years or so, when you had the folks like AWS, Azure, Google, and they all, you know, they provide GPUs for rent, essentially. These were the only two games in town. And the problem that I think people run into is for many tasks, you need the exact kind of GPU to do the thing optimally.

and you may not get hold of that or it might be too expensive, then you're kind of stuck between a rock and a hard place. Especially if your company has okayed

a lot of spend on AWS and AWS either doesn't have the capacity you're looking for or... So I see this a lot when I talk to some customers. Actually, two things they'll end up saying. One is, IT doesn't understand what exactly I need, because IT operates in a standards-based world.

So if it's outside the known bubble for them, it's a lot more work for you to convince them and say, hey, I need that, even on a cloud like AWS. The second problem they run into is...

They really don't know if that thing is going to work for them or not. A lot of stuff in AI/ML is experimentation because you learn about it and then you have to try it. And if it's too hard to get access to something before you can try it, that's a problem, because you may conclude this is not the thing I need.

And then some of these clouds also make it so hard for you to get access, like do a one-year commit. So true. If you want to do something with just reserve pricing, the pricing is astronomical. Exactly. Exactly. And that gets in the way of doing business, right? Now, none of that exists with general-purpose compute. Right.

But we struggle with that with AI infrastructure today. Yeah, well, you did mention there's kind of these two paths, which is you either buy the hardware and set up your own cloud or your own GPU cloud, we could say, or you go to the cloud providers. In the recent, I would say, two years, three years, we've had the explosion of the Modals and the Lambda Labs and the Basetens. All of those folks have come out of nowhere online.

or seemingly nowhere. And there's a lot of demand for that type of thing too. Yeah. That, we believe, will be a game changer for the market. And like you said, correctly, in the last two years things have cropped up, and there's a large number of providers

who are looking to set this up. In fact, that's the kind of entities we guys work with. Now, it's hard for everyone. And for them, the problem is they do the hard work of finding the real estate, the data center, the power, because these GPUs are power hungry. Then you have to buy the GPUs, rack them. And they do that. Then they start looking at, okay, I got a bunch of GPUs.

How do I convert this into a GPU cloud? And a lot of them have hardware chops, data center chops. Software chops, and running a service with an AWS kind of an experience, is not easy. So that gap is what we guys fill, because they need to go to market fast because they are probably paying interest on the GPUs.

And if the GPUs are idling and they have no customers, they're burning money all the time. Yeah, if you're a CoreWeave, they've done some very interesting financial engineering to make those GPUs worth the while. And they want to be saturating those GPUs 100% all the time. Correct. They are also a little lucky in my opinion because if you look at their 10K, I mean the S1 that they published before they went public,

About 60-70% of the revenue comes from one customer, Microsoft. So effectively they are lucky because they had one big customer who could fund the whole thing. But if you are a new GPU cloud, you may not be that lucky. There's only so many companies that do foundation models that need that kind of scale.

Microsoft, through OpenAI, I guess, was funding CoreWeave's investment in that. But what about the rest? So the kind of organizations, the GPU clouds we work with, they seem to sell to enterprises or universities. Let me just give an example.

This was a university that said, hey, we want to launch an AI ML lab. We want to train the next generation of data scientists. The thing is, to do that well, you need GPUs. The university doesn't have anything.

And it's too hard to set up. But they want to launch the course now. Right? So they are working with these new GPU clouds, the Neo clouds. And these guys are giving them experiences like, hey, I'll give you a notebook as a service. I'll give you Ray as a service. I'll give you Kubeflow as a service. Of course, underneath the covers, they all have GPUs.

And I think this is the way by which the next generation of data scientists, et cetera, gets trained, because when they graduate, they need to have practical experience with real things. Yeah, and a Colab where you get a free 300 bucks a month, or 300 bucks until you burn that 300 bucks, that's not enough. Correct, exactly. And then this all has to be structured, right? Like in a university setting, how do I give you a degree? How do I get you a lab where you can practice stuff?

So these are the kind of examples we see. And then the neoclouds, the GPU clouds, are enabling, in my opinion, the next generation, the next wave of people who will hit the market fully trained on how to do AI/ML at scale. So you were talking about how basically you've got from the very low end of the infrastructure stack to the frameworks and, like, the PyTorches of the world,

And all in between, you've been playing in that dimension. What are some things I imagine over the last two years that you've seen as far as gotchas? Yeah. Let's talk about a few things. Of course, it's a complex stack. A lot of stuff can go wrong. But let's talk about things that are interesting. So when you buy a computer, you kind of expect the CPU to last a long time. Now...

We're not talking about a single GPU here. We're talking about multiple GPUs connected together, working well together. And when you have that many moving parts in a complex system, failures are possible, right? In fact, Meta had published an interesting stat in their Llama white paper, I think when they launched Llama 3, I believe. It's a year old at this time.

I think they saw about 30% failure rate. You don't expect that on a computer. You never even see anything like that, right? This kind of talks to the complexity of the environment. Now, the question is, if you are in the middle of a training run and you have a GPU failure, what do you do now? Do you abandon the training run that has been running for two weeks? That would be bad. So how do you replace capacity, like literally a swap,

underneath the covers, so you can still deliver an SLA? That's amazing. So I think as a GPU cloud, as any provider, whether you're an enterprise or a wannabe GPU cloud, you've got to take care of these things, because your customers expect that, because they just want to do a training run.

Yeah, and trying to tell somebody, expect 30% problems. 30% of the time, shit's going to hit the fan. Exactly. That's a really hard sell. Yeah. And interestingly, you know,

So one of our advisors, he worked on Llama, right? So he kind of gave us some inside stories on how difficult it was as an end user. Remember, Meta is big. They have their enormous internal infrastructure. But all the teams don't get access to it at the same time. You get two weeks. And the two weeks, you do your stuff. You better do everything you can in those two weeks. Exactly.

And a lot of times these teams just have a hypothesis, right? They have a hypothesis that I want to run this. And if it doesn't work, then they had to wait again in line, right? This is pretty frustrating as a user because you don't know. You have a general sense of, yeah, I think I should try this. This may work. And I think that's a problem in general with AI/ML because you start with a hypothesis. It's not deterministic. And depending on what you find from that,

you tune your hypothesis a little more. This is difficult because the data is informing you what is possible and it's not like a formula that you can say, if I do this, this magical stuff will happen. So this is the day in the life of an end user. And if you bolt on a 30% failure in that and say that too bad, your two weeks are up, I mean, that's terrible, right?

And these failures are coming from just GPUs being finicky? That and the connectivity issues because I think the thing that maybe a lot of us forget is...

It's actually a connected GPU system, right? You know, you may have like 100 GPUs that have to work together, interconnected, and this connectivity is where kind of we guys play, right? How do I, if it's in the same node, if I have eight GPUs in the same node, how can they talk to each other fast? If I have two servers, two nodes, how does inter-node connectivity work?

What if something, what if a switch dies in between? What happens to that connection? How do you have resiliency built in there? How do you detect reroute without causing an issue upstream to the training job that's going on?

So these are all... And then the thermals are a crazy problem, right? Like heating. I remember reading one of the blogs from Meta again on how they would optimize their GPU clusters. And one thing they said was, what about updating? We have these 24,000 GPUs. Basically, at any given moment, we have to be updating the GPUs constantly. And so...

Making sure that you're updating in the most efficient way possible. Correct. And the people that are using those GPUs, if they had a two-week long job that they're running, and now it's like, oh, we need to update this. Yeah. You have to figure out how you're going to kick them off of those GPUs or just swap it out. Like you said, under the covers. Correct. They don't know that anything went differently. Correct.

It's just that you, as that engineer that's working on the clusters, you understand, okay, we swapped this out and now we're updating these GPUs. Or it could be something as simple as maybe I'm swapping out an SSD, swapping out memory. So not just the GPUs, but the stuff around GPUs, stuff happens. Getting back to the gotchas, what are things that...

In your eyes, when you've been playing with it, when you've been seeing it, it's like this is a very common one that will get people into trouble. Yeah. So there's issues at steady state. But let's talk about what happens before that. And I'll talk about what I've heard from some neoclouds when they service certain big customers.

And when some of them described how they onboard them, I literally fell off the chair. It is good to know because this is the state of the art in many neoclouds, right? And you can actually see this. Like if you go to some neo GPU cloud that exists and you say, hey, why can't I just simply sign up, log in, put in my card and use it?

Why are you asking me to submit a form and you'll call me in a day or two? And is it because you don't have enough capacity? Remember, the data center provider here, they have a bunch of GPUs sitting in one or two data centers.

What we found is they don't even have a good sense of inventory. They don't even know what's out there. Really? Right. So these are the curveballs we guys had to encounter and say, hey, man, we can help you solve these problems. When I say inventory, I'm not talking about just GPUs. Right.

Which server is it on? What kind of memory does it have? How's it connected? Is it an InfiniBand one or is it some random one? Yeah, what's its MAC address? And we were surprised. And things like firewall policies, because there's multiple tenants and all of that, right? These are all, many of them manage these in

Google Sheets or Excel spreadsheets. Right? So that's the state of the art at many of these companies, right? So no wonder when you submit a form and someone negotiates 500 GPUs that you're buying, this takes two weeks to do all of these things. Actually provision you those GPUs. Correct. And 10 teams have to do their stuff and update the Excel spreadsheet. I mean...

This is not good, right? So what we have done in our platform is we said, hey, why not solve this problem? What if there's a system, a source of truth, where you maintain this inventory, where we don't want to be simply a replacement for an Excel spreadsheet,

The inventory is important because it can dynamically then carve out capacity from the infrastructure and allocate it to a tenant. And so that is now done, right? Now, the next thing is, it's not just the GPU allocation.

You have to also now create what is called as a tenant access network, what the industry calls a TAN, a tenant access network. So what happens is, imagine, I'll take an analogy. When you go to a hotel, there's a room, right? Imagine that the room can auto-expand and shrink based on what you need, right? So how do you automatically put up the walls? How do you put the right things in the room that you need?

All that is done manually today, and what we do is automate that. If I say I need 100 GPUs, I set up the east-west network. East-west is the InfiniBand. Then the north-south, because I need a storage network, a high-speed storage network where I store my models, my data, et cetera. And then my east-west network cannot be interrupted, because if it is, then latency kills, right? How do I then let people access that network?

And how do I manage multiple tenants on my one single pool of infrastructure, fully knowing tomorrow this university is going to come and say, hey, I need 100 more GPUs. So that has to be elastic. And expand and contract over time. This is table stakes for a GPU cloud, right? Perfect.
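
To make the inventory "source of truth" and tenant carve-out idea concrete, here is a minimal sketch under assumed names. The classes and function (GpuNode, TenantAllocation, carve_tenant) are hypothetical illustrations, not the speaker's actual platform; a real system would also program the switches (VXLAN/EVPN segments, InfiniBand partitions) when it builds the tenant access network.

```python
# Hypothetical sketch: a GPU inventory as a source of truth, plus a greedy carve-out.
from dataclasses import dataclass, field

@dataclass
class GpuNode:
    hostname: str
    gpu_model: str               # e.g. "H100"
    gpu_count: int
    fabric: str                  # east-west fabric, e.g. "infiniband"
    storage_vlan: int            # north-south storage network attachment
    allocated_to: str | None = None   # tenant id, or None if free

@dataclass
class TenantAllocation:
    tenant_id: str
    nodes: list[GpuNode] = field(default_factory=list)

    @property
    def total_gpus(self) -> int:
        return sum(n.gpu_count for n in self.nodes)

def carve_tenant(inventory: list[GpuNode], tenant_id: str,
                 gpus_needed: int, fabric: str = "infiniband") -> TenantAllocation:
    """Reserve free nodes on the requested fabric until the ask is met.
    A real platform would then configure the tenant access network on the switches."""
    alloc = TenantAllocation(tenant_id=tenant_id)
    for node in inventory:
        if alloc.total_gpus >= gpus_needed:
            break
        if node.allocated_to is None and node.fabric == fabric:
            node.allocated_to = tenant_id
            alloc.nodes.append(node)
    if alloc.total_gpus < gpus_needed:
        raise RuntimeError("not enough free capacity for this tenant")
    return alloc

# Example: a university tenant asks for 16 GPUs out of a small pool of 8-GPU nodes.
pool = [GpuNode(f"node-{i}", "H100", 8, "infiniband", 100) for i in range(4)]
print(carve_tenant(pool, "university-a", 16).total_gpus)  # -> 16
```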

But really hard issues. When you say that, I think, oh my gosh, that is brutal to be able to be that elastic but also have all of these things that are in your mind with that inventory. If it was being done on a Google Sheet, by the way, always a great business idea. If you see somebody doing something on a Google Sheet, you can probably think of like 10 companies that have come because they saw someone doing something on a Google Sheet or just like Excel and they thought, you know what?

We could probably make a product around that. Totally. I mean, like even products like Airtable. Yeah. I was thinking of Airtable too. Airtable is a classic one where it is just such a better experience. But at the end of the day, it's an Excel sheet. Glorified Excel sheet. Yeah, exactly. So...

So you look at the bottom of the stack, this is kind of what we saw. And then what happens? Your inventory also is changing. Servers are dying. Servers are getting replaced. You have new switches. Your data center is not static. You're adding capacity, removing capacity, retiring GPUs, adding new GPUs. How do you keep up with that? So stuff around you is changing. It's like quicksand. Constantly. Stuff on top is changing. So we have a middle layer that is helping you make sense of what's happening on top, what's happening below.

I like this. You're mentioning that on the bottom, on that layer, it's constantly changing because of whatever reason. And then on the top, it's changing because of

The customer's wanting more capacity or wanting less capacity. Correct. Correct. And so being able to be that, in a way, the broker of the capacity, knowing what capacity there is, what's actually online, because the last thing you want to do is give a customer something that's not on... You say it's online, but they're trying to load up and it's like, why is this not loading? Something's broken here. And that would be such a bad experience, right? Like, think about...

We all expect, like when you go to AWS, press a button and say, I want this EC2 instance, it comes up like clockwork. You don't have to make a support call. That's the experience people expect. Exactly. Can you be that tall to have a shot in the market? And that's a pretty hard mountain to climb. The things that you get into as you're trying to provision GPUs are all these hard pieces. The...

ways that when working with GPUs, you've seen them fail. I know that when you gave the presentation at the conference the other day, someone asked about the noisy neighbor situation. Yes. And I wanted to get into that with you because I think that's one thing that people who have tried this, if they're at a company that has the luxury of buying hardware or just being the one who's provisioning this, right?

They get into those situations. I remember three, four years ago when we had the Run:ai team on here, they talked about how difficult it was too. And how do you go about solving for these noisy neighbor situations? Maybe break down for myself even. Give me a reminder of what that even means. Sure, absolutely. So this all kind of was what forced us to get into the inventory management space, right? So if you go talk to...

NVIDIA or AMD, et cetera, right? The technologies for this exist, right? You have ways by which you can isolate, like using EVPN technologies, VXLAN, all this stuff exists. I mean, this is how data centers work today, right? Orchestrating those dynamically

Making sure you can do this programmatically based on some higher level ask like, "Hey, I want 100 GPUs. Yesterday I was using 50 GPUs. Where am I going to allocate them? How am I going to connect them to the same VPC?" Essentially the same technologies, like the things you're familiar with like a VPC, like a VXLAN.

an EVPN, how do you do route leaking to my storage network, right? All these are known devils. The challenge is orchestrating them automatically and programmatically, and then making sure that I don't have any misses in the process. - But is this just like Terraform that we're talking about? - Terraform is a way to achieve that. But really, Terraform has no understanding of the inventory, right?

it has to talk to something. So the interfaces we support, Terraform is one of the interfaces we support because people want to automate their way as they want. But Terraform itself has no understanding of inventory. It doesn't know the current state, like how many switches do I have. And then some of these technologies that exist, some of them will support...

certain kinds of automation. Others will support other kinds of automation. So you have to effectively present a unified interface and say that I'm going to shield you from the vagaries of 20 different languages you got to speak. We become the universal translator of sorts. A unified orchestrator and a translator to talk to all these various technologies that exist. And why do they exist? If I'm an enterprise, I can go in and say,

I'm going to go with the gold standard because my universe is small. I'm going to pick vendor A, B, C, or if I want to reduce that even further, I'll say, you know what? I'll pay the extra money. I'll buy an NVIDIA DGX, which is a closed system. I don't want to deal with that, right? I want to avoid all these problems. I get a box, plug it in, but it's limited to eight GPUs. And if I have more money, then I'll go buy a SuperPod, but I'm paying through my nose at that point in time because nothing is modular anymore.

A GPU cloud cannot do that. They need modularity because their demands are different. The margins are where they make their money. So they'll have an Arista, a Juniper, a Cisco for the Ethernet network switches. They'll have a Mellanox for the InfiniBand. They'll have 20 different storage vendors. They're trying to save dollars here and there because they can't pass this on to the user. They have to remain competitive in the market.

And they have maybe 12 months to actually make money, after which the next generation GPU comes out, and then it's game over. Yeah, then they have to make another investment to bring on the new GPU that folks are going to be asking for. So that variety is what causes them grief. The variety gives them the margin, but because there is no consistency across all of these things, you need something that will glue it all together. Right? Yeah.
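
Here is a rough sketch of the "universal translator" pattern being described: one high-level ask, fanned out to vendor-specific drivers. The driver classes and method names below are hypothetical stand-ins, not real Arista or NVIDIA/Mellanox APIs.

```python
# Hypothetical sketch: one orchestration interface, vendor-specific drivers behind it.
from abc import ABC, abstractmethod

class FabricDriver(ABC):
    """What an orchestrator needs from any switch/fabric vendor."""
    @abstractmethod
    def create_tenant_segment(self, tenant_id: str, segment_id: int) -> None: ...
    @abstractmethod
    def attach_ports(self, tenant_id: str, ports: list[str]) -> None: ...

class AristaEvpnDriver(FabricDriver):
    def create_tenant_segment(self, tenant_id, segment_id):
        # a real driver would push VXLAN VNI + EVPN config via the vendor's API
        print(f"[arista] VNI {segment_id} for tenant {tenant_id}")
    def attach_ports(self, tenant_id, ports):
        print(f"[arista] attach {ports} to tenant {tenant_id}")

class MellanoxIbDriver(FabricDriver):
    def create_tenant_segment(self, tenant_id, segment_id):
        # a real driver would create an InfiniBand partition for the tenant
        print(f"[mellanox] partition 0x{segment_id:x} for tenant {tenant_id}")
    def attach_ports(self, tenant_id, ports):
        print(f"[mellanox] add {ports} to partition for tenant {tenant_id}")

def provision_tan(tenant_id: str, drivers: list[FabricDriver],
                  segment_id: int, ports: list[str]) -> None:
    """One high-level ask ('give this tenant an isolated network'), fanned out
    to whatever mix of vendors the GPU cloud happens to have bought."""
    for driver in drivers:
        driver.create_tenant_segment(tenant_id, segment_id)
        driver.attach_ports(tenant_id, ports)

provision_tan("university-a", [AristaEvpnDriver(), MellanoxIbDriver()],
              segment_id=10042, ports=["Ethernet1/1", "ib0"])
```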

Or, like we all use an iPhone or a Mac or something like that, man, it's impossible to change anything in those, right? It's a black box walled garden. Or you can buy a walled garden system, which is not affordable for the GPU clouds. So anyway, I guess we digressed a little more than we thought, but these are some of the challenges they face, which may not exist for an enterprise per se, because their scale is... Let's talk about scale, right?

The typical GPU cloud we work with has 5,000 GPUs, 10,000 GPUs, multi-generations of GPUs, right? Sometimes spanning four or five data centers. Sometimes not just data centers. It's Colo here, Colo there, stitched together with some weird network, right? And they have data center talent. They have no software talent, right? Enterprise, very different, right? They have some homogeneity,

They have 64 GPUs, 100 GPUs. Yeah, unless you're Meta, your 100 GPUs is great scale. Yeah, exactly. And a lot more software chops than hardware chops, I would imagine, too. Exactly. Now, even within that 64, let's talk enterprise for a second, right? The GPU cloud's a different scale.

The same problem also exists for the enterprise, which is if I have, let's say I have only 100 GPUs, how do I slice that and give it to a data scientist for X hours and reclaim it and give it to someone else when they need it? That two weeks the meta guy was talking about. Yeah. How do I do that efficiently? Right? In the old days...

you'd get in a queue. Like, you know, if you're familiar with Slurm, you know, a lot of universities still do it. This is a 20-year-old HPC tech where you'd say, hey, I'm going to submit a job. I'm going to describe what I want in my Slurm job. It'll wait in line.

And Slurm will say, when that capacity is available, your job will run. Now, how do you achieve that in the modern containerized world? That's the challenge. What's the replacement for Slurm? Yeah, a lot of people still just use Slurm because of that exact point. Exactly. And then you have others who say, all right, well, we're going to go about it with Kubernetes. Exactly. And try and figure it out that way.
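
For comparison, this is roughly what the containerized version of a "submit me a job that needs four GPUs" request looks like with the Kubernetes Python client. The image name and namespace are placeholders, and plain Kubernetes does not queue jobs the way Slurm does; projects such as Kueue or Volcano are typically layered on top for that.

```python
# Hypothetical sketch: a Kubernetes Job that asks the scheduler for GPUs.
from kubernetes import client, config

config.load_kube_config()  # or load_incluster_config() when running inside the cluster

container = client.V1Container(
    name="trainer",
    image="registry.example.com/team/train:latest",   # placeholder image
    command=["python", "train.py", "--epochs", "10"],
    resources=client.V1ResourceRequirements(
        limits={"nvidia.com/gpu": "4"}   # the GPU ask, enforced by the device plugin
    ),
)

job = client.V1Job(
    api_version="batch/v1",
    kind="Job",
    metadata=client.V1ObjectMeta(name="finetune-run-001"),
    spec=client.V1JobSpec(
        backoff_limit=2,  # retry a couple of times on node/GPU failure
        template=client.V1PodTemplateSpec(
            spec=client.V1PodSpec(restart_policy="Never", containers=[container])
        ),
    ),
)

client.BatchV1Api().create_namespaced_job(namespace="ml-team", body=job)
```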

So, and Slurm has its own problems, like, you know, all the nodes have to look exactly the same, all the GPUs have to look exactly the same. If I have to update an OS, I mean, I have a customer who's doing that. They took it offline for two weeks because they had to do an OS update across their Slurm cluster. That's what I was talking about with the...

with the meta paper when they were talking about these rolling updates that they have to do. That can cripple you and you don't even realize it until you're in that situation where you go, oh, we'll just update real fast. And then your whole cluster is offline for two weeks. And then you talk to the data scientists there. They say, what am I going to do? Force vacation? Yeah.

Yeah, they get to hang out. Well, that's not bad. It's not the worst thing in the world. But yeah, I imagine they want to be making an impact. They want to contribute. And then, yeah, forced vacation. Yeah, and it's frustrating for some of them, right? Because, you know, again, I'm friends with some of these people, these customers, and they come and say, look, I have a month to finish my analysis, submit a paper. Only then I get to go present at this event, some big bioinformatics event or something like that, right? Yeah.

They have their deadlines. What do they do now? This thing went offline because some security guy came and said he's got a patch. Yeah. Update. Yeah, that is a great one. So then we were talking about scale and we were saying enterprise is one type of scale, but then these neoclouds are a whole different level of scale. So what are some of these problems that you'll run into when you're at this type of scale? Yeah.

So we talked about two things. We talked about how do I onboard users? How do I handle them with some elasticity because they demand and require elasticity? In an environment where stuff under me is in flux, right? And how do you keep that going? Whether you're an enterprise or a neocloud, similar problems. The scale is different. And then the next problem people run into is,

You can't give the end user, a data scientist, a bare metal server and say, okay, bye-bye. What's that person going to do? So the problem then becomes, what is the end user experience that an MLOps or a data scientist expects? What do they need?

Now, if I go to a cloud, what do I get? SageMaker. Yeah. And you click a button, you get a Jupyter notebook. You click a button, somehow magically GPUs show up, you train stuff. Run some pipelines, you're good. Yeah, it is very managed in that regard if you want it to be. Exactly. Now, how can I bring that experience into my data center if I'm an enterprise? Because I may be regulated. I may not be able to move all the data to the cloud.

You know, there's so many constraints that people have. And at some point, we should talk about data gravity because that is one...

massive intractable challenge that I see organizations struggle with. And it's kind of outside our scope, but it's fun to talk about. Well, it is funny that you mentioned that because we've been going around and I've been asking folks, hey, we're putting together this GPU providers guide. What are some things that you would want to see in it? And during that process, I've been interviewing folks in the community and saying, hey, you bought a bunch of GPUs. You rented, really, I think bought is the wrong word for it. What

when you rented these GPUs, do you wish you would have known? What do you wish you would have asked yourself about? And one of them that came up was,

You know, I didn't realize how much I would have to pay in egress fees to get my data to these GPUs. Exactly. So that is definitely a challenge. And I think, let me kind of double click on that a little bit because that's a fun topic. And I'm a technologist. I love these crazy problems. Now, the problem there is you have this concept of cold data, hot data. And what I mean by that is, you know, let's say you have

one million photographs that you've taken over 20 years. You don't touch them all at the same time. In fact, your viewer, where you're viewing stuff, you're probably looking at a few of them, right? And let's say it's indexed and all of that and stored in some... So what would the provider do? If I'm Apple or Google, I'm not going to put everything in hot storage. You don't need that.

So I'm going to do some tricks there. I'm going to move some to cold storage, cheaper storage, and only keep certain things in memory. That's how most of these systems work. Now, when you overlay AI/ML on top, let me take an example: let's say I'm a clinical diagnosis company, and I'm a data scientist looking at, hey, this is a

new kind of treatment, but I want to compare against patient data that is 20 years old. Now, for me, cold data has no meaning, because I have to compare against all the things that exist and get my answer quickly. If I'm going to wait for the data to come back from tape onto hot storage, everything is slow. What do I do now? Now, I'm talking petabytes of data that I need instant access to.

And if you overlay that with the example I took, like that Slurm system going offline, what if I have to move that system to the cloud, like a burst to the cloud or shift to the cloud? How am I going to move my petabytes of data there? Yeah, what do you do? That seems to be... Open questions? Yeah, it's a question that is hard to solve. And you have constraints like the egress charges, et cetera, that people have to deal with. Yeah, I guess at the end of the day, you can always figure out a way to make it work, but...

Does it work within your budget? Yeah, exactly. See, when does this happen? When you have things like I got to do patching or something else, and now this is where Kubernetes comes in handy because it's kind of more flexible than things like Slurm. It doesn't have to deal with 20-year-old tech, right? So people are spinning up a second parallel system close enough to the data

The investment you'll have to make here is more hardware. And then you can deal with your upgrades and updates here on the system that needs to be touched. And then you roll over slowly, right? You don't have to do big bang.

So I've seen companies do that where they, instead of operating at 100% capacity, they shrink it down to 80%. At least people are able to do their jobs and slowly, gradually, they back away at the... So at any given time, you can expect 20% to be offline. So really...

You buy 100 million worth of GPUs and you only use 80 million. It's a practical approach. There's no other way out because you're updating something all the time. And then you want to minimize the time it's offline, right? Bring it down to hours if possible, right? And some of this you can do if...

If you do some intelligent things like store all the updates, et cetera, locally, you're not pulling down terabytes from the internet to update. So local cache systems, right? Like if it's a software update, why do you want to download Ubuntu updates from the internet for every system? So people do these intelligent things and we help with that to make sure that

First of all, can I reduce that from 20% to 10% and I keep that really, really quick so that I can sweep through my infrastructure as quickly as I can, like patching, etc. So we also were mentioning before I distracted you with the data gravity thing, how you don't want to just have bare metal. I guess for some NeoClouds, that is one of their

value props. They say, look, we're as close to the metal as you can get. And some folks want that. But for the most part,

You want to give people something, at least the minimal resources, so that they can do their jobs on there if they want. Yeah. It's actually a great point. So let me again use an analogy here because it'll, I think, help explain the user's needs, right? Let's say I'm hosting a party tonight and I'm going to give them pizzas. I have two options, right? I can buy a ready-made pizza

from the freezer at the local store. All I have to do is reheat it and give it. Or I can buy all the ingredients and make it myself. And different users may fall on the two ends of the spectrum. Right? Nothing right and wrong. It's just preference. Yeah, preference. Similarly here, what is the catalog? What's a menu of items that the enterprise or the cloud provider has to offer?

Some will say, hey, just give me bare metal. I'll do everything myself on top. This is akin to someone saying, just give me the tomatoes, flour. I'll make my own pizza base, etc.,

And they may have a genuine reason to do that. Some may say that, hey, I don't need all that. Just give me a notebook. If you only sell the raw material, I mean, you're not an interesting storefront, right? If you sell only the ready-made stuff, again, you're not an interesting storefront from a user perspective. So as an enterprise or a GPU cloud, you kind of want to have both. Yeah, and actually this brings up another point. I guess the metaphor may fall down here, but

A lot of these folks that are these neoclouds are now offering tokens. They're saying, don't even mess with the SageMakers. We've got the Bedrocks of this world, and just tell us what model you want and we'll give it to you in tokens. That's how we'll try. Exactly. It's actually fantastic you brought that up because the future with Gen AI probably might be a token cloud, right? Where, in a modern Gen AI environment,

workloads that run, they consume tokens. And they consume a lot of tokens. And people don't want to run a dedicated inference endpoint. It costs a lot of money to run Lama yourself. So then they'll end up saying, look, I just want to write my app. I want to write my Python code. Give me an API endpoint.

and a token, and let me consume tokens, and I'll pay you by the tokens I consume, maybe a million tokens a day. Now, why is it interesting for a GPU cloud? A lot of them have, like it or not, they have spare capacity lying around, right? How do they monetize it? Higher profit margins, right? Exactly. Instead of it idling, if I can be running a serverless inference...

and be somehow monetizing that GPU, but I'm selling something different to the user. The user here is expecting tokens. So I'm selling them tokens, not GPUs, but I'm not idling my GPUs. And can you also use the second or third generation, these older GPUs you now can get a bit more juice from? Exactly. So that's a beautiful point you made.
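
From the application developer's side, the "token cloud" experience described here is just an endpoint plus an API token, typically exposed through an OpenAI-compatible API. The base URL and model name below are placeholders for whatever a given GPU cloud exposes.

```python
# Hypothetical sketch: consuming tokens from a neocloud's inference endpoint.
import os
import requests

ENDPOINT = "https://inference.example-neocloud.com/v1/chat/completions"  # placeholder URL
TOKEN = os.environ["NEOCLOUD_API_TOKEN"]

resp = requests.post(
    ENDPOINT,
    headers={"Authorization": f"Bearer {TOKEN}"},
    json={
        "model": "llama-3-70b-instruct",   # placeholder model id
        "messages": [{"role": "user", "content": "Summarize our Q3 investor letter."}],
        "max_tokens": 300,
    },
    timeout=60,
)
resp.raise_for_status()
body = resp.json()
print(body["choices"][0]["message"]["content"])

# Most OpenAI-compatible servers also return a usage block, which is what
# per-token billing is metered on.
print(body.get("usage"))  # e.g. {"prompt_tokens": ..., "completion_tokens": ...}
```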

So like we talked about, there's a 12 to 18 month lifespan. And then people say, hey, I don't want to buy. Today, everyone wants an H100. In a few months, probably it'll be something else. Blackwell. Exactly, right? And the interest level on those

today's-generation systems is going to go down. What do you do with that now? You don't want to idle it. You don't want to throw it away. It's a wasted investment. So now you can extend the life of these systems. Well, exactly what you said. If I can run a token cloud on that,

and make sure that I can deliver them tokens. Now I'm protecting my revenue stream. I'm running a higher margin service. And whether you're an enterprise or a GPU cloud, the same problem, except in one case, it's the CFO of the GPU cloud worrying about margins. In the enterprise, it's the CFO of that enterprise

Talking to IT and saying, hey, what do you mean? You just spent $5 million on this H100. What do you mean it's not useful anymore? Same problems. But yeah, token clouds is where the market, I believe, will go. A lot of the Gen AI use cases...

Because a lot of us who've been in AI ML for a while, when we think about ML Ops, we think you're building your own model from scratch. You're dealing with the data, all those data pipelines. It made me think that it's a very different persona or you're almost attacking...

different folks who work on ML in different ways. Correct. I've heard a lot of people say, outsource everything you can when it comes to this. If you can just get tokens, then start there, because it makes your life way easier. Correct. If you don't have to deal with any of these GPUs and then set up the services on top of those, then great. If you do, that's one type of persona, and maybe you're

setting up the platform on that. You're the platform engineer. That is one persona that's worrying about that reliability. But then if you're just building the app, you're hitting that API and you're creating the product that uses that service. That's a whole different persona. And there's various personas that could be that consumer. Yes. And it's not just the developer, Dimitrios. I think what we see is an interesting pattern there.

I think this token cloud will be an enabler for what I would call as a citizen scientist. We talk about a data scientist. We're seeing a pattern where maybe someone from the CFO team who doesn't know how to write code, what they're expecting is saying, hey, how do I unlock value from my data? I have all these spreadsheets. I don't know what to do with it. It can either sleep there in storage or maybe I can unlock some value out of it.

If only I can train my LLM, an LLM, whichever is suitable, that can somehow understand this data and become a domain expert. As an example, literally earlier today I was talking to one company that said, we have 20 years of investor relations content. Wow.

There's one guy in the company who's been there with them for like six, seven years. Who knows everything? If that person is on vacation. Or gets hit by a bus. Exactly. Big problems. No one else knows because these documents are massive, right? But they're so proprietary to the company.

So they're basically saying, man, how can I build a machine with this knowledge so we can ask questions to it? And I think the default that most people go to is they say, oh, well, AWS has like the private cloud. And so we just do it on AWS in a Bedrock type of way. I'm not going to sit here and just talk a bunch of shit about AWS. But some people allegedly run into problems with AWS,

not getting good service, and paying a lot for it. Correct. They offer a good platform, but it's not for everybody, right? There are many organizations and geographies where this is not going to work. Like many of the partners and customers we work with,

they work with, like, a defense entity in country A, country B. They're not going to go to the public cloud, at least not in this geopolitical climate, right? They're very sensitive. Now, like, if I'm a defense agency or if I'm doing something sensitive in that country, I prefer to use a sovereign cloud that I have control over, where I have a guarantee my data will never leak. So some of them feel that way, right? Again, right or wrong, right?

AWS is not for everybody or any other public cloud. And yeah, I mean, like for example, at GTC, I heard something interesting. And since you're from Germany, I'll talk about this.

So this person came along and said, hey, we have some factories that have been shut down. And what the government wants to do is repurpose them into AI data centers because they have power. They have real estate. And apparently it's set up in a way where you can set up racks and convert them into a storage to burn. Exactly. I think people are thinking of very innovative ways to get started.

So I'm pretty excited about the situation because if access to AI infrastructure is available to everybody,

Think about all the stuff it's going to unlock. I think we are at that inflection point. Is it one year out, two years out? I don't know, but we are at a very interesting point in the market. Yeah, and it is fascinating when we go back to that CFO who, just in the basic iteration of, now I have my ChatGPT that I can talk to and I don't have to worry about the data leaking to whatever. And we've heard that for a while, right? That's the whole thing, the whole reason why open source folks are banging the drum.

But then if you take it a few steps further and you can say, well, now teams can build services that will be useful for our CFO. Products that are going to help make that persona's life easier. And I think a lot of

Different teams and product folks are doing that work right now inside of the organization. They're going around and saying, just show me your job. Show me what you're doing. Let's see if we can plug in some kind of AI service here. Correct. To make that person...

become more productive. Like the example I was talking about, the CFO's team gathering all the data, and they don't have a developer or, for that matter, anyone who understands what temperature is, or a token, or how to quantize a model. I mean, that is all Greek and Latin to them, right? They are finance people. So how do I get this person to

become more productive without having a massive development team that has to do some requirements gathering. I've seen, it's funny you mentioned that because I saw recently a team that has an embedded team

like data scientist ML persona on the finance side. So you've got that finance team of 40 people and they have one embedded engineer in there who is very deep in the AI world and is trying to help them streamline their processes and make it more productive. Exactly. That's one approach. The approach where I see the market shifting, and this is something that we guys are very interested in because we're getting dragged into it from this token cloud thing,

which is what I would call zero-code fine-tuning. Because in the end, it's just workflows. So: give me your data, select a model. You can choose A, B, C, whatever you want. Say, okay, somehow click a button called fine-tune. They don't need to know what it does behind the scenes.

Just optimize. Right. And then it fine-tunes it, and then it says, okay, here's your endpoint. Now here's a chatbot, which is pointing at that endpoint. Now test it. See if it's actually doing its job. And they'll just iterate very quickly, and these models are getting so intelligent that you can fine-tune them quickly. All the complexity that we guys deal with...

can be abstracted away. Yeah, I like that zero-code fine-tuning. I've heard it. My friends at Prem are doing autonomous fine-tuning is what they're calling it, but it's that same general idea. And recently we had Tanmay on here, and he was talking about the benefits of fine-tuning. And I personally have seen a lot of people dunking on fine-tuning. I probably have been guilty of this myself, where it's like fine-tuning is a lot of work, and it can potentially be for nothing. Yeah.

If you're not doing it right, you may get a worse result after you've fine-tuned the base model. And so you have to go into it with a certain understanding of what you're getting yourself into. I think we are at a point now where we can make this a zero-code experience to a certain extent. We can even go and say, hey, for the kind of data set you have, maybe these models are the right kind of models to pick.

The kind of budget you have, maybe you want to pick this because you only have access to XGPUs, right? Or these kind of tokens. Like, if you think about lawyers, if they have to do an analysis of 50 years of cases, oh man, that's a lot of work. What if something can just autonomously help you there and just give you an assist? Fine-tuning is a great tool for certain scenarios where the domain set is kind of reasonably static. It doesn't change all the time.
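
As a sketch of what that "click a button called fine-tune" workflow might orchestrate behind the scenes: upload data, pick a base model, run the fine-tune, deploy an endpoint. Every function below is a hypothetical stand-in for a platform step, not a specific vendor API.

```python
# Hypothetical sketch: the steps behind a zero-code fine-tuning button.
from dataclasses import dataclass

@dataclass
class FineTuneRequest:
    dataset_path: str       # e.g. the CFO team's exported documents, as CSV/JSONL
    base_model: str         # chosen from a curated menu, e.g. "llama-3-8b"
    gpu_budget: int         # how many GPUs the tenant is allowed to use

def validate_and_format(dataset_path: str) -> str:
    """Convert raw documents or spreadsheets into prompt/response training records."""
    return dataset_path.replace(".csv", ".jsonl")   # placeholder transformation

def launch_training(model: str, data: str, gpus: int) -> str:
    """Submit a fine-tuning job to the GPU pool; returns an artifact id."""
    print(f"training {model} on {data} with {gpus} GPUs")
    return f"{model}-ft-0001"

def deploy_endpoint(artifact_id: str) -> str:
    """Stand up a serverless inference endpoint for the tuned model."""
    return f"https://inference.example-neocloud.com/v1/models/{artifact_id}"

def zero_code_finetune(req: FineTuneRequest) -> str:
    data = validate_and_format(req.dataset_path)
    artifact = launch_training(req.base_model, data, req.gpu_budget)
    return deploy_endpoint(artifact)   # point a chatbot UI here and let the user test it

print(zero_code_finetune(FineTuneRequest("investor_relations.csv", "llama-3-8b", 8)))
```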

I don't know what kind of shelf life it has. There are people who say fine-tuning will take over the world. There are some people who say no, it won't. And I think the jury's out on that. We'll see. When you're doing fine-tuning, you're consuming AI infrastructure. To go back to that store analogy that I mentioned, right? You may be buying a ready-made pizza. You may be buying the ingredients or...

You may just buy tokens, right? Or you may say, you know what? I don't care what tokens are. I just want a workflow that is going to upload my data and fine tune. In the process, I may be using tokens and GPUs. I don't care as a user. So that's a spectrum of things I think the market needs.

And if I'm a cloud provider or an enterprise, these are the kind of services I need to offer my users. The spectrum. Otherwise, I'm an uninteresting storefront for the user.