Building Out GPU Clouds // Mohan Atreya // #317

2025/5/23

MLOps.community

People

Mohan Atreya

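Mohan Atreya: I think GPUs are hard to obtain and hard to use. Traditionally, companies either buy GPU hardware and build their own data centers, or rent GPUs from cloud providers such as AWS and Azure. Many tasks need a specific type of GPU to get the best results, so when that GPU is unavailable or too expensive, users are stuck. Enterprise IT departments often do not understand AI/ML needs because they operate in a standardized way, and persuading IT to buy non-standard GPUs takes extra effort. Much of the work in AI/ML is experimental, and when the resources for experiments are hard to get, the business is held back. Some cloud providers also require long-term commitments or charge very high reserved pricing, which holds the business back further.

In recent years, new GPU cloud providers such as Modal, Lambda Labs, and Baseten have emerged, and there is strong market demand for this kind of service. These providers are changing the market landscape, and we work with companies of this kind. The challenges they face include finding data centers, securing power, and turning GPUs into a cloud service. We help GPU cloud providers get to market quickly, because they need to start generating revenue as soon as possible. CoreWeave has used some interesting financial engineering to make its GPUs more valuable, and it wants to keep those GPUs saturated at all times. CoreWeave's success is partly due to the funding that comes from having Microsoft as a large customer.

The GPU cloud providers we work with mainly serve enterprises or universities. Some universities want to set up AI/ML labs but lack GPU resources and technical support. New GPU cloud providers offer universities services such as notebooks, Ray, and Kubeflow, helping them train the next generation of data scientists. By working with GPU cloud providers, students gain hands-on experience and are prepared to enter the AI/ML field. GPU cloud services are empowering the next generation of AI/ML talent.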

Chapters

GPUs are difficult to acquire, especially the specific types needed for optimal performance. Companies face challenges with IT departments not understanding specialized needs and the high costs and commitment required for cloud-based GPU access. Experimentation is crucial in AI/ML, but limited access hinders this process.

Difficulty in acquiring specific GPUs
Limited options: company-owned data centers or major cloud providers
IT departments may not understand specialized GPU needs
High costs and long-term commitments for cloud GPU access hinder experimentation

Shownotes

Demetrios and Mohan Atreya break down the GPU madness behind AI — from supply headaches and sky-high prices to the rise of nimble GPU clouds trying to outsmart the giants. They cover power-hungry hardware, failed experiments, and how new cloud models are shaking things up with smarter provisioning, tokenized access, and a whole lotta hustle. It's a wild ride through the guts of AI infrastructure — fun, fast, and full of sparks!

Big thanks to the folks at Rafay for backing this episode. We appreciate the support in making these conversations happen!

// Bio

Mohan is a seasoned and innovative product leader currently serving as the Chief Product Officer at Rafay Systems. He has led multi-site teams and driven product strategy at companies like Okta, Neustar, and McAfee.

// Related Links

Website: https://rafay.co/


Catch all episodes, blogs, newsletters, and more: https://go.mlops.community/TYExplore

MLOps Swag/Merch: https://shop.mlops.community/

Connect with Demetrios on LinkedIn: /dpbrinkm

Connect with Mohan on LinkedIn: /mohanatreya



Timestamps:

[00:00] AI/ML Customer Challenges

[04:21] Dependency on Microsoft for Revenue

[09:08] Challenges of Hypothesis in AI/ML

[12:17] Neo Cloud Onboarding Challenges

[15:02] Elastic GPU Cloud Automation

[19:11] Dynamic GPU Inventory Management

[20:25] Terraform Lacks Inventory Awareness

[26:42] Onboarding and End-User Experience Strategies

[29:30] Optimizing Storage for Data Efficiency

[33:38] Pizza Analogy: User Preferences

[35:18] Token-Based GPU Cloud Monetization

[39:01] Empowering Citizen Scientists with AI

[42:31] Innovative CFO Chatbot Solutions

[47:09] Cloud Services Need Spectrum
