GPU access is critical because startups face challenges in securing GPUs due to competition from large incumbents, long-term contracts, and high costs. Without GPU access, startups cannot train models efficiently, which is essential for their agility and competitiveness.
The Oxygen program provides AI startups with guaranteed GPU capacity at competitive prices, allowing them to train models on day one without the long-term financial commitments required by cloud providers. This gives startups an unfair advantage over larger competitors.
Startups face challenges such as high costs, long-term contracts, and being deprioritized by cloud providers in favor of larger customers. These issues force startups to overcommit financially and make suboptimal capacity planning decisions.
Startups struggle because they must plan for both training and inference needs upfront, often without knowing future demand. This leads to overcommitment to specific chipsets or capacity types that may not align with future needs.
Training workloads require significant GPU resources for extended periods, while inference workloads are more sporadic and demand-driven. Inference is cheaper but harder to predict, making it challenging for startups to optimize GPU usage.
The falling cost of inference benefits application developers by reducing their compute expenses, allowing them to reinvest savings into product development. However, it can be challenging for startups focused solely on inference infrastructure, as margins may shrink.
NVIDIA dominates the GPU market due to its ability to handle both training and inference workloads efficiently. Its flexibility allows startups to repurpose GPUs between training and inference, optimizing utilization and cost efficiency.
The H100 remains valuable for inference workloads, even as newer models like the Blackwell excel in training. Startups with strong inference demand can continue using H100s while investing in Blackwells for future training needs.
Compute thresholds in AI regulation are arbitrary and lack empirical evidence linking compute spend to model risk. They can unfairly penalize startups that fine-tune existing models, as the aggregate compute cost may trigger unnecessary regulatory burdens.
Open-source models reduce the need for startups to train their own models from scratch, lowering GPU demand for training. However, startups still require GPUs for fine-tuning and inference, making GPU access essential for their operations.
It was one of our founders who came up with the name Oxygen because they basically said, look, if I don't have that kind of compute on day one, I can't breathe. So on day one, we were able to then say to founders, look, you have guaranteed capacity at prices you just can't get anywhere else. While saying to our cloud compute partner, look, you get direct access to the world's best foundation model startups and AI startups.
They realize the value in that. For the founders, it's very clear what they get. They're able to raise less and take on less long-term risk while still being able to train really great models on day one. Our goal is always to try to give startups unfair advantages compared to big tech companies. Just by resetting compute to rational, sort of normal market rates, we were able to give these teams an unfair advantage. Hi, you're listening to the A16Z AI podcast.
I'm Derek Harris, and joining me once again this week is A16Z General Partner Anjney Midha, this time to discuss the economics of GPUs for AI workloads and a program A16Z is running to help companies in our portfolio access them at a reasonable price.
I promise it's not an advertisement for that program, which is called Oxygen, but more a discussion about how it came to be so difficult for startups and other small customers to acquire adequate infrastructure from cloud providers. One very interesting insight for folks who aren't entrenched in the world of AI and cloud economics is how we have, as Anjney puts it, "gone back to the future in terms of the capital expenditure required to launch a startup."
Whereas infrastructure as a service was supposed to let startups avoid the overhead of buying new servers and the high risk of over-provisioning, cloud provider requirements for long-term contracts, paired with bidding wars from deep-pocketed AI incumbents, have essentially put startups trying to train foundation models back in that same pre-cloud situation. If they must commit to three years and many millions of dollars, they need incredible customer demand in short order or they're left sitting on, and paying for, very expensive compute capacity that they don't need.
If you've heard of other investors supplying affordable GPUs as a value-add to startups, this is why. But you've probably never heard the rationale behind these efforts explained in such detail.
As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. For more details, please see a16z.com slash disclosures.
So Oxygen is our compute program at A16Z where we help startup founders and our companies navigate their compute challenges, whether that's helping them find the capacity they need for training or for inference. We have...
You know, a number of options now for startup founders, particularly those working on large scale AI infrastructure efforts who might have very capital intensive GPU hungry business plans to be able to access the kind of compute they need in a timely way with our help.
As the scaling laws in AI were becoming more and more mature, it just started to be hard to ignore just how much of my time I was spending helping founders navigate their compute needs. It started with Anthropic, I would say, in early 2021, when I got a call from Dario and Tom, two of the co-founders of Anthropic, who had been leading the GPT-3 efforts at OpenAI. And around the time they had decided to leave and start Anthropic, they gave me a call and said, hey, we'd love...
to get you involved as an early investor. And I said, sure, you know, what are you thinking of raising for your seed round? And they said, we need 500 million to get started.
And that was a bit of a shock. And soon after that, I started realizing that their needs weren't isolated. So I think this started from a working backwards kind of realization that a number of the customers we serve every day, which are founders, especially working at the frontier of AI infrastructure, all had a common problem, which was as a startup, they were being deprioritized by the large GPU clouds, the hyperscalers in favor of larger customers.
which is really tough. We were in the middle of a supply crunch at the time where H100 capacity was in short supply. And as a result, what was happening was the hyperscalers who run cloud businesses that have very sensitive margins tied to their occupancy rate or their utilization rate for their clusters were basically starting to prioritize long-term contracts over short-term contracts, which is totally the rational thing to do. But if you're a startup,
And now, to access the same hourly GPU price that you could have gotten just six months earlier on a six-month rental contract, you had to buy a three-year contract, which often meant the hyperscalers were asking you to commit more capital upfront than you'd raised, or even planned to raise in the next year, to get access to those rates. And so, illustratively, what was happening at the time was that the market rate for short-term GPU capacity went up three to four x over that period, I would say late 2020 to mid 2023. That was a real realization moment where
it wasn't like there was some mass conspiracy against startups, but the natural market forces just made it such that if you were a startup in foundation model land and you wanted to be able to get access to any significant number of GPUs on day one, it was extremely hard for you to do that at, call it, sane and rational prices without making a two-, three-, sometimes four-year commitment on those GPUs. Now, that's a really hard thing for you to do as a startup, for three reasons.
One is early on, you haven't raised that much capital. And so it's very daunting to commit more capital than you've even raised. The second thing it does is it makes it very difficult to do capacity planning
when you don't even know what your inference needs are going to be like, right? It forces you to have to make a bunch of suboptimal decisions about your capacity. In the normal scheme of things, you know, how this would work is you get started as a startup, you go buy some short-term capacity for, let's say, six months.
You then need to train your foundation model over six months. You then release the model, you start getting customers, and you have a pretty good sense at that point of your demand from your customers for inference. You know which days of the week it spikes. You understand which regions you're getting the most inference demand from. You understand what the queue times are like when you release new features. And then you use that to inform your purchasing for inference. Whereas if you're having to do all of that capacity planning up front, you're basically guessing in the dark.
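To make that concrete, here is a minimal sketch of the kind of inference capacity math a team can only do confidently once it has real usage data; every number and variable name below is hypothetical, invented for illustration rather than taken from the conversation.

```python
# Hypothetical back-of-envelope inference capacity estimate.
# Every input below is a post-launch observation that a pre-launch team
# would only be guessing at; the numbers are invented for illustration.

peak_requests_per_sec = 40        # observed peak demand after launch
gpu_seconds_per_request = 1.2     # average GPU time per generation
headroom = 1.5                    # buffer for feature launches / regional spikes

gpus_needed = peak_requests_per_sec * gpu_seconds_per_request * headroom
print(f"GPUs needed at peak (with headroom): {gpus_needed:.0f}")  # ~72
```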
And that makes you often overcommit to a chipset or a capacity type that might not actually be at all something you need later on. The third thing it did was really put a lot of pressure on these companies to try and raise at higher valuations than they should have because...
If you need to raise more to pay for these GPUs, then the only way to prevent yourself from getting diluted down is to then raise the valuation. We were in a bit of a lose-lose-lose situation, I would say, for the better part of three years, where founders were having to pay exorbitant prices to hyperscalers. They were having to do long-term planning when they should have just been, as a startup, focused on being agile and nimble about the short term.
And all of this, I would say, came to a fever pitch around last summer when a few hyperscalers decided to really start prioritizing the largest customers much more than startups. In one case, I had a portfolio company who had a signed contract for delivery of a number of GPUs from a cloud. And at the last minute, they were told, actually, hey, we're not going to be able to deliver that to you for three months.
And then when we asked why, it turned out a bigger customer had simply come in and offered 3x more than they could pay. You know, it came from wanting to solve that customer pain point for the founders. When infrastructure as a service became a thing, the story was you used to have to over-provision all these servers. And you had to buy all this stuff up front. And it cost you a lot of money. And now you can rent it on demand and right-size it and the whole story. And so here we are again. It's back to square one, essentially. It was a bit of a back-to-the-future moment, right?
And what's great about the way we process problems like that is we're quite used to building products like that for founders. When we noticed that marketing or recruiting were recurring needs, we were able to aggregate demand across a number of our companies, we have, you know, at this point, 550 portfolio companies, and pass on the economies of scale to each individual company much earlier in their life. I would say the primary goal was to try to allow a company, a startup founder and their team, to access the kind of prices, short-term duration, and flexibility on compute that only much later-stage companies, often big tech, could access without the help of somebody like us, who could step in and actually say, you know what, as an individual startup, you don't actually need to buy more than you need.
Since we've got 550 portfolio companies on our side, we can actually aggregate that demand in a much more efficient way than you can, and we can take on some of that economies-of-scale negotiation on your behalf. And I think, yes, it sounds like it brings AI startups back to the place where any startup would traditionally be starting from to begin with. Right. So what we were able to do then is to construct a win-win situation, actually for all three parties involved: the compute partner, often a hyperscaler or a large data center provider or cloud partner who we can work with to source that capacity. We were then able to say to the founders, hey, you can raise less, and here's guaranteed capacity for you on day one.
It was one of our founders who came up with the name Oxygen because they basically said, look, if I don't have that kind of compute on day one, I can't breathe. We literally don't have anything to do for our researchers yet, right? So on day one, we were able to then say to founders, look, you have guaranteed capacity at prices you just can't get anywhere else. While saying to our cloud compute partner, look, you get direct access to the world's best foundation model startups and AI startups online.
And the beauty of that is that the most sophisticated cloud partners we work with realize the value in that, which is: if you can build a relationship with the best foundation model companies early on as their training provider, you have a really good shot at becoming their compute supplier for inference as well. And for the best companies, it's so clear that over the long run, the bulk of their needs come from inference, right? Not training. It's a sort of "come for the training, stay for the inference" value proposition for the compute partner.
For the founders, it's very clear what they get. They're able to raise less and take on less long-term risk while still being able to train really great models on day one. Our goal is always to try to give startups unfair advantages compared to big tech companies. And I think that's what our goal was here as well, that just by resetting compute to rational, sort of normal market rates, we were able to give these teams an unfair advantage.
Even broader though, the thing with cloud providers is, well, they can buy at economies of scale. They get all this compute at a steeply discounted rate. But it seems like they were having a capacity issue at some point too. Is that like a COVID thing? Or is it just like NVIDIA can only produce so many of their top-end GPUs at any given time? What's the bigger thing? Why doesn't Amazon Web Services just have an unfathomable amount of capable GPUs on hand?
So there are three problems going on there that were a bit of a perfect storm. One is, look, there was just net new demand that nobody had done capacity planning for. And data centers just take time to build, right? So the average build for a data center at the time was, on the shorter side, about six months if you had all the components ready to go. And if you didn't and you needed new permitting and so on and needed to source components, it took up to a year to actually build a new data center.
And so as a result, most of the forecasts for data center demand were outdated by about a year to a year and a half.
And so what happened was there was this moment, I think GPT-3 came out in July, August 2020. When ChatGPT came out in December 2022, suddenly all capacity forecasts were wrong by a factor of five to 10, because OpenAI had basically introduced this research preview to the world with ChatGPT, which was free, with no expectation that it would become this incredibly popular consumer app.
That then demonstrated a really key piece that was missing, which is that there was consumer demand
for foundation models. I would say most data center providers had started planning for increased demand from training runs, but there wasn't this sort of explosive killer app driving inference demand from the consumer market, right? And that's when stuff really exploded. And so I would say it started in Jan 2023, with OpenAI trying to buy inference capacity wherever they could get it, and then every other foundation model lab going,
wow, we need to catch up. We need to build our own comparable frontier language model. That was Meta jumping into the game. Google had invested a bunch in TPUs for Gemini, but was still relying on a bunch of internal experiments on H100s. And so I'd say from 2022 to late 2023, there was an 18-month gap between all the demand forecasts that had been done prior to ChatGPT and the supply catching up. So that was number one.
The number two thing was a huge supply chain shortage around networking. So for these training runs, you need to network thousands of GPUs. And the H100 chip relies on a particular kind of interconnect. And there was a huge shortage at the time. And that delayed things by about six months. And so demand is spiking.
supply is fixed. It wasn't clear when some of the supply chain shortages would get resolved. And so the willingness to pay and the pricing was going bonkers because anytime you have increasing demand and no change in supply, you have surge pricing. And that's what was going on. And the third thing I would say, there was a significant amount of over-provisioning happening on the part of some of the incumbent labs.
who, rightly so in hindsight, did the right thing by basically paying absurd rates to buy out existing contracts that clouds had already committed to. So you had clouds who had already sold capacity for those 18 months essentially double-selling that same capacity at three to four times higher prices to incumbents, and then basically saying to their smaller customers, sorry, we don't have it for you anymore. And that was a bit of a perfect storm. But let me ask you about NVIDIA. I mean, did you get a sense of whether they are ramping up capacity?
In the grand scheme of things, I think they've been able to adapt to increasing demand pretty fast. And so the question for them was always, how much H100 capacity do we want to ramp up, given we have the next generation of chips coming shortly, right? So Jensen announced, early this year, the Blackwell line, the GB200s, the B200s. Again, we're yet to see live production benchmarks of these, but based on early tests, the B200, the Blackwell line, has about two and a half times more horsepower than the H100 line. So if you're NVIDIA, you have a bunch of customers coming to you and saying, look, I want to triple my order on H100s,
you have an interesting dilemma, right? Do you redirect your production from your planned next generation line to serve this new spiking existing line? Or do you stick with plan and say, sorry, we can't deliver on these new H100 orders because we're going to stay on track with our Blackwells. And I think they've done a pretty good job at balancing those. But as a result, I think what we're seeing is a number of customers who did have to do long-term commits
to the H100s, feeling really nervous now about when the Blackwells hit next year and saying, okay, we've committed all this money up front for a previous generation of chips that are now no longer best in class. Meanwhile, our competitors who didn't do these long-term commits, and may have paid higher prices for short-term deals, now get access to the Blackwell chips, and they're able to train things two and a half times faster. In the grand scheme of things, I think NVIDIA has the ability to scale up production pretty impressively. The question is always, which part of their product line do they want to scale up at the expense of the other? So when the new line comes out, do you have a sense of, like, what is the end of life or the secondary life of these H100s and, before that, the A100s? And just, like...
down the line? It comes down to whether the customer that's using them, the person who actually put the GPUs on, you know, these GPU contracts, has a use for them that can really optimize the efficiency of those chips. And so, as an example, if you happen to have been
an image model company or a video model company, and you put a long-term contract on H100s this year, and you trained and put out a really good model and a product that a lot of people want to use, then even though you're not training on the best and latest cluster next year, that's okay, because you can essentially swap out your training workloads for your inference workloads on those H100s. Because the H100s are actually incredibly powerful chips that you can run really good inference workloads on. So as long as you have customers who want to run inference of your model on your infrastructure, then you can just redirect that capacity to them and buy new Blackwells for your training runs. Who it becomes really tricky for is people who bought a bunch of H100s but
don't have demand from their customers for inference and therefore are stuck doing training runs on last-generation hardware. And that's a tough place to be. Yeah, I did want to ask about inference, because I read recently, I think Andrew had a long tweet about it, that the per-token cost of inference has dropped precipitously, at least if you look at OpenAI's pricing over the past year or whatever it is. How does that play into the economics of
a startup or any company really looking at AI from, again, from the CapEx point of view? Yeah. The cost structure that you're describing has different impacts on different kinds of companies, and you can kind of broadly bucket them into three different types. If you're a foundation model lab training models at or near the frontier, then the falling cost of tokens, in a sense, the inference,
benefits you in marginal ways, largely, I would say, in the synthetic data generation step, right? Because synthetic data generation and post-training are becoming an increasingly large part of the workloads of training foundation models.
And so, you know, what we would call inference-time or test-time compute scaling is directly a function of the cost of inference, right? And so cheaper inference certainly helps labs doing foundation model training that are doing test-time scaling, inference-time scaling, or generating large amounts of synthetic data to train their models on. But I would say the magnitude of that impact is not primary; it's largely marginal. I would put it somewhere in the 20 to 25% range at the moment.
Who really benefits, right, are application developers. Because when the cost of inference drops dramatically in the APIs that you're building on top of,
then your cost structure changes. You can afford to either pass on those cost savings to your customers, or you can afford to reinvest the proceeds of what you would have had to spend on compute in other parts of your product value proposition. You can invest in more features and hiring more engineers to build better product experiences on top. And so I would say application developers are probably benefiting the most.
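As a rough illustration of that cost-structure shift for an application developer, here is a toy sketch; the token prices, request size, and app pricing below are all invented for the example, not figures from the episode.

```python
# Toy unit-economics sketch: how a drop in per-token inference pricing
# changes an application's cost per request and gross margin.
# All numbers are hypothetical.

tokens_per_request = 2_000          # prompt + completion
revenue_per_request = 0.05          # what the app charges (hypothetical)

def cost_per_request(price_per_million_tokens: float) -> float:
    return tokens_per_request / 1_000_000 * price_per_million_tokens

for price in (10.00, 2.50):         # $/1M tokens before and after a price drop
    cost = cost_per_request(price)
    margin = (revenue_per_request - cost) / revenue_per_request
    print(f"${price:.2f}/M tokens -> ${cost:.4f}/request, gross margin {margin:.0%}")
# The savings can be passed on to customers or reinvested in the product.
```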
There's a third category, which is what I would call fine-tuning customers. These are folks who aren't necessarily building applications for the end consumer market. They might be enterprises who need to fine-tune a large model, whether that's a language model, an image model, a video model, and so on. And for them as well, the falling cost of inference basically allows them to fine-tune much more efficiently,
because often what you're doing in the fine-tuning step or the post-training step is using inference to generate sufficiently high-quality synthetic data for your tasks. Broadly speaking, everybody's benefiting from the falling cost of inference at this stage in the cycle. I think where it becomes tough is if you are an inference provider, if you sell inference,
and you are in a race to the bottom on tokenomics, and you don't have any sort of real stickiness above base inference, then it can be a pretty tough business to operate. And I do think there are a number of startups who are trying to figure that out right now who don't train their own models, and neither are they building end consumer applications, but they're somewhere in that middle layer where they provide inference infrastructure. And for them, falling market rates of inference typically correspond to drops in margin.
Are there meaningful advances happening on the hardware side right now? Whether it's the new generation of NVIDIAs or anything any other provider might be doing, are they going to alleviate some of this pent-up demand for NVIDIA, or this kind of monopoly demand for NVIDIA? The short answer is yes. I think there's tons of really great research going into making chips faster, cheaper, and more customizable for different types of inference workloads. I would say the most exciting thing is still just the sheer onslaught of Moore's Law. So when the Blackwells allow a lab to access two and a half times more flops per chip, that's pretty exciting, because that directly increases the speed at which your models can get trained by basically two and a half times at minimum. What I'm excited to see is when we figure out all the cooling and energy issues around stacking, you know, 20 or 30,000 Blackwells in a single data center. And then you can run a training run on those. What do we get? What are the capabilities that come out on the other side? There, frankly, aren't that many training runs of that scale yet that have succeeded.
On the inference side, what's most interesting is speed. When you have a chip that is custom designed for a particular architecture, for a particular model, we're seeing results from some companies where they can do a 200x speedup on inference. And when you unlock that, there's entirely new applications you can build. The challenge, of course, with ASICs, right, and these aren't GPUs because they're special purpose architectures, is that if the model architecture changes,
then you can't use that chip. You got to throw away that chip. And so I think what a lot of people are waiting to see right now is, are we going to have a stabilization of model architectures for different types of workloads?
Is an MoE model, a mixture-of-experts model, here to stay for long enough that you don't need to keep taping out new chips and the chip doesn't get obsolete really fast? Or is image diffusion, is a Flux 1.1, here to stay as a best-in-class architecture for image models, such that you can literally build a Flux 1.1 chip? And is it good, and does it have enough longevity over the next two and a half, three years, such that you can amortize the fabrication and the CapEx investment it took to tape out that chip over a long enough useful lifecycle?
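To make that amortization question concrete, here is a hypothetical break-even sketch; none of the figures come from the conversation, and real tape-out economics involve far more than this, but it shows why the useful lifetime of the target architecture is the whole game.

```python
# Hypothetical ASIC break-even sketch: the one-time tape-out CapEx only pays
# off if the per-hour savings over general-purpose GPUs accumulate before the
# target model architecture goes obsolete. All figures are invented.

tapeout_capex = 50_000_000        # one-time design + tape-out cost ($)
chips_deployed = 5_000
gpu_cost_per_hour = 2.00          # renting a general-purpose GPU ($/hr)
asic_cost_per_hour = 1.50         # all-in hourly cost of the custom chip ($/hr)
hours_per_month = 730

monthly_savings = (gpu_cost_per_hour - asic_cost_per_hour) * hours_per_month * chips_deployed
breakeven_months = tapeout_capex / monthly_savings
print(f"Break-even after ~{breakeven_months:.0f} months of full utilization")  # ~27 months
# If the architecture is obsolete before then, the CapEx is stranded.
```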
I'm fairly hopeful that base model generation is becoming pretty commoditized, or rather is stabilizing and converging on certain architectures. They're pretty inefficient architectures, I would still say, especially in language model world; transformers are pretty inefficient at learning things. And so I really do hope that we get a breakthrough in architectures like SSMs, or something similar to what Mistral has done with their Codestral family. But barring, I would say, three or four quarters of showing that model architectures have converged, I don't think those ASICs are going to make their way into production at scale anytime soon. On our last episode, we had Jeff and Bowen from Nous Research, and the primary topic of conversation was this DisTrO research they had done and this idea of training
like language models over the internet. Another part of it was like, what happens if this becomes feasible at some scale? And do we get to a point where people can start training on lower end GPUs, right? Because you don't need all the fast interconnects and whatever that are standard on the H100. Do you think that's like a realistic set of circumstances or a realistic thing that could happen at some point? Or is it like, listen, if you're training a state-of-the-art model, you're running state-of-the-art. I think...
It's going to be a crawl, walk, run in the following sense: a full SETI@home setup, where you're training a Llama 3 405B across, call it, a hundred thousand or a million gaming GPUs at home, is not impossible, but it's not on a default path to happen on its own anytime soon. What I do think is much more likely is that instead of having to run frontier model training runs in a single co-located cluster, what is happening is multi-cluster training.
So instead of needing one 10,000-GPU GB200 cluster to train Llama 4 or GPT-5, you can now split that up across four different 2,000- or 2,500-GPU GB200 clusters. And those are much easier to build. They're much easier to hook up to the power grid. They're much easier to handle cooling for. And then the question becomes,
architecturally, can you run distributed training runs in such a way that there's no material degradation in the performance of the model and there's no material slowdown in the speed at which you can do these training runs?
And it's a really complex systems problem, because these chips are not very reliable. The burnout rate on NVIDIA cards can be as high as 30% in a data center. And because these training runs are massively parallelized, they're not very fault tolerant. And so when you have a single chip go down, it nixes the entire training run. If we can actually figure out how to do fault-tolerant distributed training runs across, call it, four, five, six clusters of meaningful chip sizes, but not hyperscale,
I think that would be a massive unlock, because you can basically count on your hands the number of regions in the world today that have GB200 clusters of the kind that we see in hyper-centers, right? The 100K-H100 equivalents.
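As a rough illustration of why fault tolerance dominates at this scale, here is a toy calculation; it treats the roughly 30% burnout figure mentioned above as an annualized per-chip failure rate and assumes failures are independent, both of which are simplifying assumptions rather than measurements.

```python
# Toy reliability sketch for a synchronous training run where any single chip
# failure kills the job. Assumes independent failures and treats the ~30%
# burnout rate mentioned above as annualized; both are simplifying assumptions.

annual_failure_rate = 0.30
hours_per_year = 8_760
hourly_rate = annual_failure_rate / hours_per_year

def survival_probability(num_chips: int, run_hours: int) -> float:
    # Probability that no chip fails for the whole run.
    return ((1 - hourly_rate) ** run_hours) ** num_chips

for num_chips, days in ((256, 7), (2_000, 30), (10_000, 30)):
    hours = 24 * days
    expected_failures = num_chips * hourly_rate * hours
    print(f"{num_chips:>6} chips, {days:>2}-day run: "
          f"~{expected_failures:.1f} expected failures, "
          f"P(no failure) = {survival_probability(num_chips, hours):.2%}")
# At frontier scale, an uninterrupted run is essentially impossible, which is
# why checkpointing and fault-tolerant distributed training matter so much.
```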
On the other hand, there's a very fat tail in that distribution of regions and data centers that have somewhere between 2,000 to 10,000 H100 equivalents. And if you can find a way to network those for training runs, I think we end up having way more capacity to train models in a way that's not just centralized amongst two or three companies. And that presumably would help with some of the energy concerns that are starting to pile up too, right? In terms of like,
The more you can spread this, right? The more you don't need like a city's worth of power. - That's correct. I will say the co-location with energy will probably not go away for a while because even if you don't have all of the chips co-located in one data center, you don't want the different clusters to be too far from each other because power transmission is extremely lossy, right? You wanna have almost like a hyper-centered region
where you have a pretty good, high energy density power source, and then you have a number of clusters located around it. Because it's not that easy to...
transmit power over several miles, and it's also not that easy to transmit data in a fault-tolerant way for these training runs across long stretches of transmission lines and optic fiber. Location, I still think, matters, but the chips literally don't have to live on a single spine. So can you walk through just kind of the cost differences and the time differences when we're talking about training versus inference? The rough back-of-the-envelope math is:
there are about 730 hours in a month, and you get charged by a cloud provider, if you're running one of these models, by the hour for those 730 hours per chip, for somewhere between 12 months and three years, right? So...
Let's say the average price of a chip was $5 per hour for an H100 in August 2023, which is really not that far off from what it actually was, what the list prices were for short-term contracts. Now you're basically paying $5 times 730 hours per month per chip.
for how many ever chips you need to run a training run. And if you need the equivalent of 2000 H100s to train an image model for three months, then suddenly you have signed up to spend about $22 million. That's about how much it would have cost you at the peak of the supply crunch.
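Here is that same back-of-envelope math as a tiny script, using the illustrative numbers above (roughly $5 per H100-hour at the peak of the supply crunch, 730 hours a month, 2,000 chips, a three-month run).

```python
# Back-of-envelope training cost, using the illustrative numbers above.
price_per_gpu_hour = 5.00      # $/hour per H100, short-term rate at the peak
hours_per_month = 730
num_gpus = 2_000
training_months = 3

total = price_per_gpu_hour * hours_per_month * num_gpus * training_months
print(f"Training run cost: ${total:,.0f}")  # -> $21,900,000, roughly $22M
```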
Inference, on the other hand, is you running a prediction on that model when a customer wants to create an image, let's say, from your text-to-image model. Let's say it takes about a second to create that image. Training, while expensive, is quite predictable. What is much less predictable is inference, which is cheaper because each individual workload is so much smaller.
But it's much more unpredictable because it's a variable cost. You have to serve it as and when your customers want to do it. The hard part about inference is capacity planning around customer demand.
And if you haven't even launched your model yet, you don't even have product market fit, you're basically kind of guessing in the dark for your inference demand. And often what is happening is people will buy inference and then have it sit idle and that's dollars being wasted. For that reason, it's much, much more cost effective to have a single chip that can both do training and inference that you can then move back and forth between the two workloads. One of the biggest issues with non-NVIDIA chips is that they're just not very good at both training and inference.
Actually, in fact, the stated strategy of some of the chip providers like Amazon is to build different chipsets. They have one chipset called Trainium and another called Inferentia. In the current generation, those look mostly the same except for their networking and their interconnect. But TPU v5p chips look quite a bit different from TPU v5e chips: one is designed for training, the other for inference. NVIDIA's biggest value proposition is that because they're GPUs and they can handle both kinds of workloads,
Let's say you bought 2,000 H100s for inference because you assumed that much demand, and it turns out you don't have that much customer demand. Now you can move those chips over to your training cluster and vice versa. And that flexibility is quite powerful at getting your overall utilization rates high and your cost structure down.
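A toy sketch of that flexibility argument: if upfront demand guesses turn out wrong, two dedicated pools strand capacity, while one fungible pool of the same total size absorbs the shift. All the workload numbers below are invented for illustration.

```python
# Toy sketch: dedicated training/inference pools vs. one fungible GPU pool
# when upfront demand guesses are wrong. All numbers are hypothetical.

hours = 730  # one month

# Pools sized on upfront guesses (GPU-hours per month):
guess_train, guess_infer = 600_000, 600_000
train_pool = round(guess_train / hours)   # ~822 GPUs
infer_pool = round(guess_infer / hours)   # ~822 GPUs

# What actually happened: more training demand, far less inference demand.
actual_train, actual_infer = 900_000, 200_000

# Dedicated pools: training is capped by its own pool, inference GPUs sit idle.
served_dedicated = min(actual_train, train_pool * hours) + min(actual_infer, infer_pool * hours)
util_dedicated = served_dedicated / ((train_pool + infer_pool) * hours)

# One shared pool of the same total size: work flows to whichever side needs it.
shared_pool = train_pool + infer_pool
util_shared = min(actual_train + actual_infer, shared_pool * hours) / (shared_pool * hours)

print(f"dedicated pools utilization: {util_dedicated:.0%}")  # ~67%
print(f"shared pool utilization:     {util_shared:.0%}")     # ~92%
```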
This whole conversation does make it pretty clear that when someone is concerned about, say, the cost of a model, well, it's so arbitrary. You might conceivably be paying hundreds of millions of dollars at some point if model sizes keep going up, and that's technically the cost to train a model. But to your point, that's the build. That's what it costs to get the infrastructure. It's so disconnected, to some degree, from the model itself. It's just: what's the price of GPUs? What's the contract you're forced to sign up for?
We've seen a lot of policy proposals over the last, I would say, year as we've been dealing with more and more AI regulation. Some make sense, others don't, but compute thresholds in particular make the least sense and have gotten the furthest when they shouldn't have, because they don't have any direct linkage to any kind of capabilities. The most common stated goal for AI regulation is to manage risk. But saying, "Oh, well, let's regulate, because a model will be more harmful if somebody spent $100 million on compute versus somebody who spent $50 million," just has no empirical evidence behind it, right? Because in reality, what matters more is what capabilities the model was trained to enact and how it was used. You can easily spend $100 million at the peak of a supply crunch and train a tiny model because there was a 5x surge price in the market. There are also, by the way, tons of ways to just waste training cycles on experimentation that don't result in any emergent capabilities. And then,
Depending on what your definition is, if you're fine-tuning prior models and suddenly the prior model developer spent $95 million in a really inefficient way and you're spending $7 million to fine-tune it,
as a startup, now you're subject to that 100 million threshold, because the aggregate cost triggers it. Now you're stuck with having to comply with a bunch of regulations that don't really make sense or don't really do anything for safety at all. It really is akin to saying a badly made dish can cause food poisoning, and because a chef spent 10 times more money buying fresh tomatoes to make pasta that could go bad and poison someone, we're going to now regulate anybody who spends more than $10 on tomatoes. What do tomatoes have to do with how badly or not the dish was eventually made, whether it was left out in the open for three days, or whether somebody added poison to it? It's just so absurd to try to link those two. One is an ingredient,
and the thing you care about is the outcome. I think it was mostly a function of press and attention. And regulations that throw in thresholds like $100 million have more, I think, memetic potential than more precise regulation on outcomes in context. So this Oxygen program is one thing. But then the other thing that I know we're investing a lot in is open-source models and open-source foundation models.
The more these become available and capable, and the more you can fine-tune them and kind of work on them, the less money companies need to spend on training.
- Oh, certainly. No, I think the availability of high-quality open-source models that have permissive licenses is massive deflation for downstream developers, right? Because then you can piggyback off of all the millions of dollars that Meta has spent investing in the Llama family, or Mistral has spent on the Mistral family, or Black Forest Labs has spent on their Flux Schnell model. Those are all compute cycles, or flops, that are being given away for free to the developer community. And then they get to piggyback off of that for sure.
The fact that Linux is open source means that the cost to create an application is dramatically lower; if you didn't have an open-source ecosystem, you'd need to either go purchase a license for that part of the software stack or build it yourself. As a result, the more open source there is, the more the flops are amortized across the entire community, rather than having to be re-spent or repurchased by every individual developer who's building on top. Yeah, it seems like open source helps mitigate some of that in the sense that with AI, the hardware is part and parcel with actually doing the thing. The more you have, the better it is, at least on training.
So if you can avoid having to train your own model, you just kind of... So there are some caveats to that in the following sense. More is not always better, because you could just keep burning compute, honestly, on a model that never gets better if you don't scale up the number of high-quality data tokens alongside it. I think at this point, we've had maybe at least four different
waves of scaling laws showing the right combinations in which you have to combine compute with data, and what sort of algorithmic efficiency you get, right? There was the Kaplan wave of scaling laws with GPT-3 in 2020. Then there were the Chinchilla compute-optimal scaling laws, which said, hey, for a given compute budget, the optimal way to scale is different from the way the Kaplan laws demonstrated, and that actually the Kaplan laws were quite inefficient: they were overspending on compute by almost 60% and weren't scaling the amount of data in the training run appropriately.
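For reference, the Chinchilla-style heuristic being described is usually summarized as training compute of roughly 6 times N times D flops (N parameters, D tokens), with a compute-optimal ratio of about 20 tokens per parameter. The sketch below applies that published rule of thumb to a hypothetical budget; it is not a calculation from the episode.

```python
# Chinchilla-style compute-optimal sketch: C ≈ 6 * N * D training flops,
# with roughly 20 tokens per parameter at the compute-optimal point.
# The budget below is hypothetical; the heuristic comes from the published papers.

TOKENS_PER_PARAM = 20

def training_flops(params: float, tokens: float) -> float:
    return 6 * params * tokens

def compute_optimal(budget_flops: float) -> tuple[float, float]:
    # Solve 6 * N * (20 * N) = C for N, then D = 20 * N.
    params = (budget_flops / (6 * TOKENS_PER_PARAM)) ** 0.5
    return params, TOKENS_PER_PARAM * params

n, d = compute_optimal(1e24)  # a hypothetical 1e24-flop training budget
print(f"~{n / 1e9:.0f}B params on ~{d / 1e12:.1f}T tokens "
      f"(~{training_flops(n, d):.1e} flops)")
# The Llama 3 405B result discussed next deliberately goes far past this
# ratio, "over-training" a fixed-size model on much more data.
```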
Then we had the Llama 3 405B scaling laws, which showed that actually most models are dramatically under-trained and you should just give them way more data than people expect. So don't throw more flops at a model if you're not actually giving it way more tokens than a one-to-one increase in the ratio; it's not a linear increase in tokens. By over-training a smaller model on a larger corpus of tokens, you can get a better-performing model than a bigger model that's trained on less data. The fourth wave of scaling laws is probably the most meaningful in my mind on the inference side, which is the scaling laws shown around o1, which is test-time compute scaling. It says that if a task you have requires the model to just think more, then you can actually throw more compute at it at the inference step.
And it'll get better at doing that, because there's good reason to believe that if you let the model think for longer, and thinking requires using more flops (I'm using "think" as a very, very loose analogy), let's say you give the model a bigger budget to process the problem before it gives you a final output, then it's able to improve its precision on a number of benchmarks.
And that's interesting because up until now, most companies and most customers have not been able to say, I'm not talking about now foundation model developers, I'm not talking about people who are training models, I'm talking about people who are consuming the model. They haven't had a way to say, let's throw more money at the problem to get a more accurate answer. That hasn't really been an option so far. But what test time scaling laws show is that you can actually now throw more compute at the inference step to improve the accuracy of the model.
And that will probably increase demand for inference chips. Do you see a horizon where the demand lets up and a program like Oxygen is less necessary for startups? Or, for the foreseeable future, as we're still training foundation models and obviously running AI inference, is this still going to be the status quo? The short answer is yes. I see Oxygen being a pretty core value proposition to our companies for as long as AI is growing and an important part of our world. I think that...
What has changed is it's become more and more clear, I think, to the world's clouds and data center providers that AI is here to stay. They're doing really good and much better forecasting than they were when this whole wave got started. What hasn't changed is that on day one as a startup, as a little guy, you still need help to be treated like a big guy.
And that's what A16Z will always help you do, right? Because we have a few things that are quite unique. We have a portfolio of 550 companies. And so we get to aggregate demand in a way that a smaller startup just can't.
And that allows us to work with large compute partners and negotiate both great prices, great terms, to be much more flexible on timing, on duration, in a way that is very difficult for a smaller startup to do on their own. The reality of market forces is that as long as there are bigger customers who get better treatment because they're buying in bulk, there will always be a way for us to help and give our companies an unfair advantage of being treated that way, even if they're a tiny seed stage startup on day one.
And that's really what the Oxygen program was designed all along to be, was to help the little folks get the same treatment as big tech on compute pricing, on compute terms, on duration. We have different types of chips we offer them, so it's pretty flexible in terms of what we can get them on hardware.
And eventually our goal is that they grow up and they graduate to becoming one of the big buyers themselves. And then at that point, we then reallocate that capacity to the next generation of founders and startups that come into our program.
While the details may change and while in one year we might need to help them more with inference scaling than training, the rough shape of the program, which is to use the scale of our portfolio to provide individual companies with the kind of access and benefits they only could if they were a much larger later stage company, will probably be an evergreen offering for as long as compute is an essential part of building AI infrastructure businesses.
That's it for this episode. Thanks for listening this far. As always, we hope the discussion was informative and, in this case, helped flesh out your understanding of why GPU access is such a big topic in the world of artificial intelligence. If you enjoyed it, please do share it far and wide and rate the podcast on your preferred listening platform.