How GPU Access Helps AI Startups Be Agile

2024/10/23

AI + a16z

People

Anjney Midha

Derrick Harris

Topics

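Anjney Midha: a16z's Oxygen program aims to address the challenges AI startups face in obtaining GPU resources, including GPU shortages, price spikes, and cloud providers' preference for long-term contracts. By aggregating demand across a16z portfolio companies, Oxygen negotiates more favorable GPU pricing and usage terms with cloud computing partners, helping startups lower costs, gain flexibility, and compete against large technology companies. The program also accounts for the differing needs of training and inference workloads, and helps companies adjust their resource allocation to actual demand. Anjney Midha also analyzes the causes of the GPU shortage, including a surge in overall demand for AI compute, long data center construction cycles, supply chain problems, and competition among large technology companies for GPU resources. He notes that when GPU supply is tight, short-term GPU capacity is priced far above long-term contract rates, which creates heavy financial pressure and planning difficulties for startups. He also discusses how falling inference costs affect different types of companies (foundation model labs, application developers, and fine-tuning customers), and how new GPUs (such as Nvidia's Blackwell series) and ASIC chips will affect the future GPU market. He believes the rise of open-source models will also lower the cost of model training to some extent.

Derrick Harris: This episode explores the challenges AI startups face in obtaining GPU resources and how a16z helps its portfolio companies address them through the Oxygen program. The episode notes that cloud providers' preference for long-term contracts, together with large AI companies' competition for GPU resources, makes it hard for startups to secure enough GPUs, which in some ways returns them to the era of buying servers.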

Deep Dive

Key Insights

Why is GPU access critical for AI startups?

GPU access is critical because startups cannot train models efficiently without it, and efficient training underpins their agility and competitiveness. Securing GPUs is hard for startups because of competition from large incumbents, long-term contract requirements, and high costs.

How does the Oxygen program help AI startups?

The Oxygen program provides AI startups with guaranteed GPU capacity at competitive prices, allowing them to train models on day one without the long-term financial commitments required by cloud providers. This gives startups an unfair advantage over larger competitors.

What are the main challenges startups face in accessing GPUs?

Startups face challenges such as high costs, long-term contracts, and being deprioritized by cloud providers in favor of larger customers. These issues force startups to overcommit financially and make suboptimal capacity planning decisions.

Why do startups struggle with GPU capacity planning?

Startups struggle because they must plan for both training and inference needs upfront, often without knowing future demand. This leads to overcommitment to specific chipsets or capacity types that may not align with future needs.
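As a rough illustration of this planning trade-off, the sketch below compares a long-term reservation (paid for every hour, used or not) with short-term capacity (paid only for hours used). The function names and the hourly rates are hypothetical, chosen only for illustration; they are not figures from the episode.

```python
def break_even_utilization(reserved_rate: float, on_demand_rate: float) -> float:
    """Fraction of reserved GPU-hours that must actually be used
    before the reservation costs less than short-term capacity."""
    if reserved_rate <= 0 or on_demand_rate <= 0:
        raise ValueError("rates must be positive")
    # The reservation is billed for every hour; short-term capacity
    # is billed only for the hours that are actually used.
    return min(1.0, reserved_rate / on_demand_rate)


def cheaper_option(reserved_rate: float, on_demand_rate: float,
                   expected_utilization: float) -> str:
    """Return which option costs less at the expected utilization (0.0 to 1.0)."""
    if not 0.0 <= expected_utilization <= 1.0:
        raise ValueError("expected_utilization must be between 0 and 1")
    threshold = break_even_utilization(reserved_rate, on_demand_rate)
    return "reserved" if expected_utilization >= threshold else "on_demand"


# Hypothetical rates in dollars per GPU-hour (assumptions, not real prices).
print(break_even_utilization(2.0, 4.0))
print(cheaper_option(2.0, 4.0, expected_utilization=0.3))
```

Under these assumed rates, a startup that expects to keep its GPUs busy less than half the time would pay more under the reservation, which is the overcommitment risk the answer describes.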

What is the difference between training and inference workloads in terms of GPU usage?

Training workloads require significant GPU resources for extended periods, while inference workloads are more sporadic and demand-driven. Inference is cheaper but harder to predict, making it challenging for startups to optimize GPU usage.

How does the falling cost of inference impact AI startups?

The falling cost of inference benefits application developers by reducing their compute expenses, allowing them to reinvest savings into product development. However, it can be challenging for startups focused solely on inference infrastructure, as margins may shrink.

What role does NVIDIA play in the GPU market?

NVIDIA dominates the GPU market due to its ability to handle both training and inference workloads efficiently. Its flexibility allows startups to repurpose GPUs between training and inference, optimizing utilization and cost efficiency.

Why is the H100 GPU still valuable despite newer models like the Blackwell?

The H100 remains valuable for inference workloads, even as newer models like the Blackwell excel in training. Startups with strong inference demand can continue using H100s while investing in Blackwells for future training needs.
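The repurposing idea can be sketched as a simple allocation: older GPUs serve inference demand first, and whatever is left over remains for training on last-generation hardware. This is a minimal sketch; the `FleetPlan` type, the `plan_older_fleet` function, and the GPU counts are hypothetical and not taken from the episode.

```python
from dataclasses import dataclass


@dataclass
class FleetPlan:
    inference_gpus: int  # older GPUs redirected to customer inference
    leftover_gpus: int   # older GPUs still left for training runs


def plan_older_fleet(total_gpus: int, inference_demand_gpus: int) -> FleetPlan:
    """Assign older-generation GPUs to inference first; the rest are leftover."""
    if total_gpus < 0 or inference_demand_gpus < 0:
        raise ValueError("GPU counts must be non-negative")
    inference = min(total_gpus, inference_demand_gpus)
    return FleetPlan(inference_gpus=inference, leftover_gpus=total_gpus - inference)


# Hypothetical fleet sizes (assumptions for illustration only).
strong_demand = plan_older_fleet(total_gpus=512, inference_demand_gpus=600)
weak_demand = plan_older_fleet(total_gpus=512, inference_demand_gpus=64)
print(strong_demand)
print(weak_demand)
```

In the strong-demand case the whole older fleet is absorbed by inference; in the weak-demand case most of it is left for training on last-generation hardware, which is the difficult position described in the episode.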

What are the implications of compute thresholds in AI regulation?

Compute thresholds in AI regulation are arbitrary and lack empirical evidence linking compute spend to model risk. They can unfairly penalize startups that fine-tune existing models, as the aggregate compute cost may trigger unnecessary regulatory burdens.
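To make the fine-tuning concern concrete, the sketch below checks a compute total against a threshold in two ways: counting only the fine-tuning run, or counting the base model's training compute as well. It assumes that "aggregate" means base-model compute plus fine-tuning compute, which is one reading of the answer above. The function name and all FLOP figures, including the threshold, are hypothetical.

```python
def exceeds_threshold(base_model_flops: float, fine_tune_flops: float,
                      threshold_flops: float, count_base_model: bool) -> bool:
    """Return True if the counted compute meets or exceeds the threshold."""
    if min(base_model_flops, fine_tune_flops, threshold_flops) < 0:
        raise ValueError("compute values must be non-negative")
    counted = fine_tune_flops
    if count_base_model:
        # Aggregate view: the fine-tuner inherits the base model's training compute.
        counted += base_model_flops
    return counted >= threshold_flops


# Hypothetical figures (assumptions for illustration only).
BASE = 9.9e25       # compute used to train an existing base model
FINE_TUNE = 2e24    # compute a startup spends fine-tuning it
THRESHOLD = 1e26    # regulatory compute threshold

print(exceeds_threshold(BASE, FINE_TUNE, THRESHOLD, count_base_model=False))
print(exceeds_threshold(BASE, FINE_TUNE, THRESHOLD, count_base_model=True))
```

With these assumed numbers, the fine-tuning run alone stays far below the threshold, but the aggregate crosses it even though the startup contributed only a small share of the total compute.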

How do open-source AI models impact GPU demand?

Open-source models reduce the need for startups to train their own models from scratch, lowering GPU demand for training. However, startups still require GPUs for fine-tuning and inference, making GPU access essential for their operations.

Shownotes

In this episode of AI + a16z, General Partner Anjney Midha explains the forces that lead to GPU shortages and price spikes, and how the firm mitigates these concerns for portfolio companies by supplying them with the GPUs they need through a program called Oxygen. The TL;DR version of the problem is that competition for GPU access favors large incumbents who can afford to outbid startups and commit to long contracts; when startups do buy or rent in bulk, they can be stuck with lots of GPUs and — absent training runs or ample customer demand for inference workloads — nothing to do with them.

Here is an excerpt of Anjney explaining how training versus inference workloads affect what level of resources a company needs at any given time:

"It comes down to whether the customer that's using them . . . has a use that can really optimize the efficiency of those chips. As an example, if you happen to be an image model company or a video model company and you put a long-term contract on H100s this year, and you trained and put out a really good model and a product that a lot of people want to use, even though you're not training on the best and latest cluster next year, that's OK. Because you can essentially swap out your training workloads for your inference workloads on those H100s.

"The H100s are actually incredibly powerful chips that you can run really good inference workloads on. So as long as you have customers who want to run inference of your model on your infrastructure, then you can just redirect that capacity to them and then buy new [Nvidia] Blackwells for your training runs.

"Who it becomes really tricky for is people who bought a bunch, don't have demand from their customers for inference, and therefore are stuck doing training runs on that last-generation hardware. That's a tough place to be."

Learn more:

Navigating the High Cost of GPU Compute

Chasing Silicon: The Race for GPUs

Remaking the UI for AI

Follow on X:

Anjney Midha

Derrick Harris

Check out everything a16z is doing with artificial intelligence, including articles, projects, and more podcasts.