Handling Multi-Terabyte LLM Checkpoints // Simon Karasik // #228

2024/4/30
MLOps.community

Join us at our first in-person conference on June 25 all about AI Quality: https://www.aiqualityconference.com

Simon Karasik is a proactive and curious ML Engineer with 5 years of experience. He has developed and deployed ML models at web and large scale for Ads and Tax.

Huge thank you to Nebius AI for sponsoring this episode. Nebius AI - https://nebius.ai/

MLOps podcast #228 with Simon Karasik, Machine Learning Engineer at Nebius AI, Handling Multi-Terabyte LLM Checkpoints.

// Abstract The talk provides a gentle introduction to LLM checkpointing: why it is hard and how big the checkpoints are. It covers tips and tricks for saving and loading multi-terabyte checkpoints, as well as how to select cloud storage options for checkpointing.

// Bio Full-stack Machine Learning Engineer, currently working on infrastructure for LLM training, with previous experience in ML for Ads, Speech, and Tax.

// MLOps Jobs board https://mlops.pallet.xyz/jobs

// MLOps Swag/Merch https://mlops-community.myshopify.com/

// Related Links

--------------- ✌️Connect With Us ✌️ -------------
Join our Slack community: https://go.mlops.community/slack
Follow us on Twitter: @mlopscommunity
Sign up for the next meetup: https://go.mlops.community/register
Catch all episodes, blogs, newsletters, and more: https://mlops.community/

Connect with Demetrios on LinkedIn: https://www.linkedin.com/in/dpbrinkm/
Connect with Simon on LinkedIn: https://www.linkedin.com/in/simon-karasik/

Timestamps:
[00:00] Simon's preferred beverage
[01:23] Takeaways
[04:22] Simon's tech background
[08:42] Zombie models garbage collection
[10:52] The road to LLMs
[15:09] Trained models Simon worked on
[16:26] LLM Checkpoints
[20:36] Confidence in AI Training
[22:07] Different Checkpoints
[25:06] Checkpoint parts
[29:05] Slurm vs Kubernetes
[30:43] Storage choices lessons
[36:02] Paramount components for setup
[37:13] Argo workflows
[39:49] Kubernetes node troubleshooting
[42:35] Cloud virtual machines have pre-installed mentoring
[45:41] Fine-tuning
[48:16] Storage, networking, and complexity in network design
[50:56] Start simple before advanced; consider model needs.
[53:58] Join us at our first in-person conference on June 25 all about AI Quality