
Efficient GPU infrastructure at LinkedIn // Animesh Singh // MLOps Podcast #299

2025/3/28

MLOps.community

People
Animesh Singh
Topics
Animesh Singh: I currently lead GPU infrastructure and the training platform at LinkedIn, along with some inference-engine optimization. The arrival of ChatGPT made people realize the potential of large language models, and it changed the focus of my work at LinkedIn. We have used LLMs for features such as profile summarization, a learning-course assistant, and personalized recruiter emails, with notable results. We also launched an LLM-based Hiring Assistant, an agent that helps recruiters find candidates and summarize their work experience.

The cost and ROI of LLM applications are the main challenges. Open-source models, fine-tuning, and few-shot learning have brought training costs down, but inference is still expensive, especially in latency-sensitive, real-time scenarios. We need to optimize model architectures and inference efficiency, for example by exploring how to adapt the Transformer architecture to reduce inference latency.

Applying LLMs to traditional machine-learning problems like recommendation ranking is not simply a matter of 'plugging in a new tool'; it requires weighing model performance, the degree of personalization, and cost-effectiveness. One advantage of LLMs is that they have already learned most of the relevant patterns, so they may need only a small amount of real-time updating. LLMs can also simplify the architecture by reducing the number of models, which in turn simplifies compliance management and talent acquisition.

Maximizing GPU utilization is an important cost-optimization goal. We have to consider workload elasticity, server-side architecture, and GPU reliability. GPU reliability problems hurt efficiency and must be addressed through better checkpointing and stronger fault tolerance. We developed the Liger framework, which uses techniques such as kernel fusion to significantly improve GPU training efficiency, and we have open-sourced it (a sketch of the kernel-fusion usage follows this summary).

Memory is the main bottleneck for today's LLM applications; we need to optimize memory utilization and consider new hardware architectures. By improving checkpointing, for example with two-phase transactions and a hierarchical checkpointing strategy, together with block-based storage, we made checkpointing significantly faster (also sketched below). We have also invested in mechanisms such as priority queues to keep training jobs running smoothly through planned maintenance windows.

To build a platform that serves both LLM and traditional ML use cases, we redesigned our machine-learning training pipelines to be more flexible and easier to experiment with. We adopted Flyte as the orchestration engine and introduced an interactive development environment for debugging and tracing (see the pipeline sketch below). We also implemented robust version control to better manage experiments and model versions.

Looking ahead, LLM architectures themselves may replace parts of traditional recommendation and ranking systems. Tools like LangChain and LangGraph may come to play a bigger role in traditional machine learning. For now, our focus is to keep standardizing the lower layers of the stack and gradually expand upward.
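To make the kernel-fusion point concrete, here is a minimal sketch of enabling Liger's fused Triton kernels on a Hugging Face Llama model. The patch function comes from the open-source Liger-Kernel project's public README; the model checkpoint name is an illustrative assumption.

```python
from transformers import AutoModelForCausalLM
from liger_kernel.transformers import apply_liger_kernel_to_llama

# Monkey-patch the Llama module classes with fused Triton kernels
# (RMSNorm, RoPE, SwiGLU, fused linear + cross-entropy) before the
# model is instantiated, cutting memory traffic and launch overhead.
apply_liger_kernel_to_llama()

# Illustrative checkpoint name; any Llama-family model works the same way.
model = AutoModelForCausalLM.from_pretrained("meta-llama/Llama-3.1-8B")

# ...train as usual; no other code changes are required.
```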
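The two-phase and hierarchical checkpointing ideas can be sketched roughly as follows. This is a hypothetical illustration, not LinkedIn's implementation: phase one stages the state on fast local storage, phase two commits it atomically, and replication to durable storage happens off the critical path.

```python
# Hypothetical sketch of two-phase, hierarchical checkpointing.
import os
import shutil
import threading

import torch

def save_checkpoint(model, step, local_dir="/mnt/local-ssd", remote_dir="/mnt/durable"):
    tmp_path = os.path.join(local_dir, f"ckpt-{step}.pt.tmp")
    final_path = os.path.join(local_dir, f"ckpt-{step}.pt")

    # Phase 1: stage the full state on fast local (block) storage.
    torch.save(model.state_dict(), tmp_path)

    # Phase 2: an atomic rename commits the checkpoint; a crash mid-write
    # can never leave a partially written file posing as a valid checkpoint.
    os.replace(tmp_path, final_path)

    # Second tier: replicate to durable remote storage asynchronously,
    # so the GPUs resume training without waiting on the slow upload.
    threading.Thread(
        target=shutil.copy2, args=(final_path, remote_dir), daemon=True
    ).start()
    return final_path
```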
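Finally, a minimal sketch of a Flyte workflow, to show the shape of the declarative, versioned pipelines described above. The task names, bodies, and parameters are hypothetical; only the flytekit decorators are real API.

```python
from flytekit import task, workflow

@task
def prepare_features(dataset: str) -> str:
    # ...materialize features; return their storage path.
    return f"{dataset}/features"

@task
def train_model(features: str, learning_rate: float) -> str:
    # ...launch a (possibly distributed) training job; return a model URI.
    return f"{features}/model"

@workflow
def training_pipeline(dataset: str, learning_rate: float = 1e-4) -> str:
    # Flyte versions every registered workflow and task, which supports
    # the experiment and model-version tracking described above.
    return train_model(
        features=prepare_features(dataset=dataset),
        learning_rate=learning_rate,
    )
```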


Chapters
This chapter explores the successful integration of LLMs into LinkedIn's services. It details the use of LLMs for profile summarization, hiring assistants, and personalized recruiter emails, showcasing their effectiveness and impact on user experience and productivity. Future plans regarding agent infrastructure and its potential are also discussed.
  • LLMs power features like profile summarization and hiring assistants.
  • Personalized recruiter emails see increased candidate response rates.
  • LinkedIn is developing agent infrastructure for various applications.

Shownotes

Building Trust Through Technology: Responsible AI in Practice // MLOps Podcast #299 with Animesh Singh, Executive Director, AI Platform and Infrastructure at LinkedIn.

Join the Community: https://go.mlops.community/YTJoinIn
Get the newsletter: https://go.mlops.community/YTNewsletter

// Abstract
Animesh discusses LLMs at scale, GPU infrastructure, and optimization strategies. He highlights LinkedIn's use of LLMs for features like profile summarization and hiring assistants, the rising cost of GPUs, and the trade-offs in model deployment. Animesh also touches on real-time training, inference efficiency, and balancing infrastructure costs with AI advancements. The conversation explores the evolving AI landscape, compliance challenges, and simplifying architecture to enhance scalability and talent acquisition.

// Bio
Executive Director, AI and ML Platform at LinkedIn | Ex-IBM Senior Director and Distinguished Engineer, Watson AI and Data | Founder at Kubeflow | Ex LFAI Trusted AI NA Chair

Animesh is the Executive Director leading the next-generation AI and ML platform at LinkedIn, enabling the creation of the AI Foundation Models Platform serving the needs of 930+ million LinkedIn members. His teams build distributed training platforms, machine learning pipelines, feature pipelines, metadata engines, and more, and he leads the creation of the LinkedIn GAI platform for fine-tuning, experimentation, and inference needs. Animesh has more than 20 patents and 50+ publications.

Previously IBM Watson AI and Data Open Tech CTO, Senior Director, and Distinguished Engineer, with 20+ years of experience in the software industry and 15+ years in AI, data, and cloud platforms. He has led globally dispersed teams, managed globally distributed projects, and served as a trusted adviser to Fortune 500 firms. He played a leadership role in creating, designing, and implementing data and AI engines for AI and ML platforms, led Trusted AI efforts, and drove strategy and execution for Kubeflow and OpenDataHub, as well as for products like Watson OpenScale and Watson Machine Learning.

// Related Links