
Efficient GPU infrastructure at LinkedIn // Animesh Singh // MLOps Podcast #299

2025/3/28

MLOps.community

People
Animesh Singh
Topics
Animesh Singh: I currently lead GPU infrastructure and the training platform at LinkedIn, along with optimization of some inference engines. The arrival of ChatGPT made people realize the potential of large language models and changed my focus at LinkedIn. We used LLMs to build features such as profile summarization, a learning-course assistant, and personalized recruiter emails, with notable results. We also launched an LLM-based hiring assistant, an agent that helps recruiters find candidates and summarize their work experience.

The cost and ROI of LLM applications are the main challenges. While open-source models, fine-tuning techniques, and few-shot learning have lowered training costs, inference costs remain high, especially in latency-sensitive, real-time scenarios. We need to optimize model architectures and inference efficiency, for example exploring how the large language model (transformer) architecture can be optimized to reduce inference latency.

Applying LLMs to traditional machine learning problems such as recommendation ranking is not simply a matter of reaching for the shiny new tool; it requires weighing model performance, personalization, and cost-effectiveness. The advantage of LLMs is that they have already learned most of the patterns, so only a small amount of real-time updating may be needed. Using LLMs can also simplify the architecture and reduce the number of models, which in turn simplifies compliance management and hiring.

Maximizing GPU utilization is an important cost-optimization goal. We need to consider workload elasticity, serverless architecture, and GPU reliability. GPU reliability issues hurt efficiency and have to be addressed by improving checkpointing and fault tolerance. We developed the Liger framework, which significantly improves GPU training efficiency through techniques such as kernel fusion, and we have open-sourced it.

Memory is the main bottleneck for current LLM applications; we need to optimize memory utilization and consider new hardware architectures. We significantly sped up checkpointing by improving the mechanism, for example adopting two-phase transactions and a hierarchical checkpointing strategy, and by moving to block-based storage. We also invested in mechanisms such as priority queues to keep training jobs running smoothly through planned maintenance.

To build a platform that supports both LLM and traditional ML use cases, we redesigned our machine learning training pipelines to be more flexible and experiment-friendly. We adopted Flyte as the orchestration engine and introduced an interactive development environment for debugging and tracing. We also implemented a robust versioning mechanism to better manage experiments and model versions.

In the future, the LLM architecture itself may replace parts of traditional recommendation ranking systems. Tools such as LangChain and LangGraph may play a bigger role in traditional ML as well. For now, our focus is on continuing to standardize the lower layers of the stack and gradually moving up.


Chapters
This chapter explores the successful integration of LLMs into LinkedIn's services. It details the use of LLMs for profile summarization, hiring assistants, and personalized recruiter emails, showcasing their effectiveness and impact on user experience and productivity. Future plans regarding agent infrastructure and its potential are also discussed.
  • LLMs power features like profile summarization and hiring assistants.
  • Personalized recruiter emails see increased candidate response rates.
  • LinkedIn is developing agent infrastructure for various applications.

Transcript

Animesh Singh, currently at LinkedIn, director of our GPU infrastructure and training platform and optimizing some inferencing engines. Coffee, tall mocha, extra hot. Woo, I'm bubbling from this conversation. So many gems when it comes to working with LLMs at scale,

GPU infrastructure, what you want to be optimizing for, how you can think about optimizations. And I had an incredible question at the end, like, how does the platform and the GPU infrastructure that you're dealing with differ when it comes to working on LLMs versus traditional ML? Did not disappoint. Let's get into the conversation.

Well, let's start with this, man. And I'm so happy that we got to do this, because we've been juggling our calendars to make this work for probably about six months. I think we were going to do this when I was last in San Francisco in June, but one thing led to another. And here we are in 2025, having the conversation finally. Persistence paid off.

I think this is great. And I've been following your work throughout as well. You're doing excellent work in terms of, you know, bringing communities together.

And disseminating that knowledge, right? Like what's all happening in the AI space, right? And what use cases are springing up? What are the industries they are targeting? There is some excellent work you are driving in that. And I'm so glad it's happening in 2025. We now have some better experience of, you know, what is working, what is not working.

What may be a little bit of hype, right? What is realistic? What are going to be the trends in 2025? So I think the timing is working out, yeah. What is working? One thing which has definitely proven that it's here to stay is LLMs, right? I feel throughout, you know, 2021-2022,

there was a lot of, you know, discussion about how effective LLMs are going to be, right, in the industry, in the space. There are a bunch of, you know, modeling architectures, like recommendation ranking models, graph neural networks, GNNs, right. But the efficacy of LLMs and the use cases being powered by LLMs

was quite a bit of a question mark, right? Like, yes, there was a promise. Then we saw that magic moment with ChatGPT coming in, and that literally, you know, woke everyone up, right? Hey, it does seem seamless. It does seem that it's not yet another chatbot you are talking to.

And that sprung up the industry, right? When I joined LinkedIn, at that time, you know, the ChatGPT moment hadn't happened. And as soon as I joined, a month later, ChatGPT came in, and what I came in here to do, a lot of that changed within a period of a month.

And I think through the course of that period, multiple companies, multiple industries have identified, right, different use cases which are working well with this, right, and people are being productive, be it, you know,

generating code or being able to do certain automation by leveraging this; the interface does seem very human-like. And a lot of the generative AI use cases which we launched on LinkedIn, for example, profile summarization, right? So based on what you have: create a headline for me, create a summary for me.

Use cases like, you know, an assistant for LinkedIn Learning courses; use cases like targeted recruiter emails for candidates, because a lot of what we had seen is that, you know, if you're getting cold-call

emails from recruiters, they are not hyper-personalized, right? At times they sound like a template. Generative AI and LLMs helped us immensely in that, right, where they take into account the candidate's profile, the company in which the candidate is working, the company the recruiter is from, and create these very personalized emails, which, you know, we are seeing success with: candidates actually responding to them,

and opening those emails much more. So I think, you know, a bunch of use cases we have seen working really well with LLMs, and we are obviously doubling down, right? Like if you see the talk of 2025, and even before that, last year there was quite a bit of discussion on agents, what they can do, what they cannot do.

We did our own experiments, right? Like, we invested a lot internally in terms of building agent infrastructure. First of all, what does it take to create agents? How is it different than the traditional generative AI applications or use cases we were building? What are the nuances, right? What makes it different? And then finally we launched, you know,

toward the last quarter of last year, the LinkedIn Hiring Assistant, right? Which is essentially an agent for recruiters, which, based on certain criteria they define, will actually go work behind the scenes, find relevant candidates for them, and summarize their work

experience and profile for the recruiters. And then based on that, they can say, okay, you know, reach out to these particular candidates and, you know, let's start having a discussion. And there is much more we are doubling down on with that whole LinkedIn Hiring Assistant. And we have seen some great, great

enthusiasm from our partners and customers, right? So it's seeing some very good results. So there is much more we are doing now on other areas of LinkedIn, which will be powered by agents. So they are here to stay. I think it's how and which use cases really benefit from it. That will be a nuanced discussion. You cannot throw it at every single thing. There are so many things which can be just powered by them, which then frees you up.

to do the things which are probably the more creative aspects of the work you are doing, right? And we are seeing quite a bit of that happening. So, what's not working? I wouldn't say what's not working, but a thing which probably needs a lot of improvement moving forward is the cost and the ROI of launching these LLM-based, either agentic use cases or traditional, you know,

prescriptive, RAG-based generative AI apps, or even leveraging LLMs for use cases beyond generative AI. The cost is a big hindrance. I think, one, if you see, training itself used to be the biggest barrier to entry, and that got solved to some extent.

Because, first of all, you know, even within LinkedIn, we invested quite heavily in building out our scale-out training infrastructure, which can power LLMs. I think when I came in, you know, we were working off a V100 fleet. And since then, we have scaled our fleet by 7x.

We have A100s, we have H100s, we have H200s, and the fleet is as modern as it could be. And we have scaled our training tremendously, right? Like it's 150x increase in the size of the models we are training, right? Our data processing has increased many fold. We actually, you know, completed 1 million training runs on the platform and we're training big foundation models. Now,

investment in the infrastructure went in, and I think it was well understood for companies at the scale of LinkedIn, etc., right? Where a lot of content-based data is being consumed and produced, you will have tons of data, right? And when you have these tons of data, you need to make sure that you have the infrastructure to train models on that data, right?

And then the other thing which actually helped a lot of the training landscape is the emergence of open-source models. I think, you know, Meta led the way, followed by many others in the industry, right? Where for smaller companies, or for companies where there is not a need to train a model on world data, right? Like you have your own specific data, which is probably, you know, not as big as the world data.

Getting these open-source models and starting on top of them, because these models already know what exists in the world. They can answer your questions; they've crawled through Wikipedia, they've crawled through, you know, public libraries, all the articles, and they've been trained on that. Then you can bring them in and do more fine-tuning instead of, you know, training on huge amounts of data.
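As a rough illustration of that "start from an open-source model and fine-tune" pattern, here is a minimal sketch assuming the Hugging Face transformers + peft stack; the model name is just a placeholder, not necessarily what LinkedIn uses:

```python
# Parameter-efficient fine-tuning (LoRA) on top of an open-source base model.
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, get_peft_model

base = "meta-llama/Llama-3.1-8B"            # placeholder open-source base model
tokenizer = AutoTokenizer.from_pretrained(base)
model = AutoModelForCausalLM.from_pretrained(base, torch_dtype="auto")

# Train small low-rank adapters instead of all weights: far less compute, and
# the base model's "world knowledge" stays frozen.
lora = LoraConfig(r=16, lora_alpha=32,
                  target_modules=["q_proj", "v_proj"],
                  task_type="CAUSAL_LM")
model = get_peft_model(model, lora)
model.print_trainable_parameters()           # typically well under 1% of all parameters

# ...then train as usual (Trainer, TRL, custom loop) on the company-specific corpus.
```

The point is that only the small adapter matrices are trained, which is a big part of why fine-tuning on top of open-source models brought the training cost down.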

So then the infrastructure cost can go down further, right? So fine-tuning became a big mechanism. Plus, you know, a lot of techniques emerged in the industry around supervised fine-tuning, zero-shot and few-shot learning, and prompt optimization, which essentially brought the cost down heavily on the training side. What is now much more in the picture is the cost of inferencing, right?

Right. And I think it's humongous, it's big, and there are a lot of efforts, right, which we are doing, which the industry is also doing, right, overall, how to bring down the cost of inferencing these models. Now, if you take a look at generative AI use cases specifically as well,

There is some thinking which is built in. Like when you are interacting with a model, right, where you are asking certain queries, asking it to analyze, you are prepared mentally: it will take some time, it will think through it. Right. Even with the emergence of, you know, the latest OpenAI models, right, where there is a lot of reasoning going on. So a lot of it is actually analyzing its own output, right,

then, you know, refining its own output. Then the second output is further analyzed. There is a lot of back and forth reasoning. So there are multiple inferencing calls happening. And I think as a consumer, we are prepared if we are going into a scenario in this particular context, the model may have some thinking time, specifically if you are interacting with the latest OpenAI models, et cetera, then you know that, you know, you're asking complex queries which need that analysis. So the latency you are willing to tolerate.

Now, even to get to that latency, right, there is tons of infrastructure investment which has been made, right? The general realization is that, you know, even with all these investments, we are not able to get our GPUs to perform at maximum utilization, right? Inferencing is becoming very costly because you are optimizing a lot for performance,

latency, throughput, and you have a lot of failover mechanisms which you need to build. Almost any company needs to account for, hey, I have two or three data centers. If one data center goes down, so there is a lot of redundancy you need to build for applications which are user-facing.

That means, you know, the cost of GPUs is ballooning, right? And that essentially is an infrastructure problem which needs to be solved. Specifically now, when we do take LLMs into use cases where the consumer appetite for latency might not be there at all, right? There are efforts happening across the industry, right? Like, hey,

the traditional recsys, right? Recommendation ranking. So, for example, you go to social media sites, you get recommendations, you get your feed, you get people you want to connect with, all these things. Like, you know, as soon as you go to the site, this should be there, right? As you're scrolling through the feed, the feed should be, you know, updated and customized for you in real time. There is no appetite for latency in those scenarios, right?

Now, if you need to see whether LLMs can be effective in that world, that means, you know, you really need to optimize for latency. And if you're really doing that, you are potentially throwing more money at the problem. And so that problem, I think, you know, how do we take this large language model architecture, the transformer architecture, and make it really, really optimized for low-latency

inferencing, right? That is becoming a big thing which needs to be solved for at scale, right. Do you feel like it's a bit of trying to fit a round peg in a square hole? Because when you throw LLMs at, like, a recsys problem, just because

we've been doing recsys for a while and we figured out how to make it real time, why do we need to add an LLM on top of it? It almost, in my eyes, is overcomplicating things just to try and use a shiny new tool. But maybe you've seen there's better performance, there's better personalization there, or something that I haven't. I think I would, um,

more than speaking on my behalf, right, like in general, there are, you know, research papers and emerging companies trying that. Now, why would you try something like that, right? I mean, and that's a fair question, right? Like recsys is already well established, right? And as an architectural pattern, right, like the traditional recommendation ranking models, retrieval models, right,

including, you know, something like graph neural networks, GNNs, et cetera, right? They do a very, very solid job at this. And you have seen companies like...

And it's fast. It's fast. And like you take a look at TikTok, oftentimes, you know, the recommendation algorithm is talked about. So you're fairly right. Like, hey, is why? Is that why? I think there are a couple of things in here to take a look at. So the way we have solved recommendation ranking problems in the industry is obviously you have created models. A lot of the companies,

have smaller models, right? These are not traditional foundation models which have been trained on world data, etc., right? They have potentially not seen a lot of user interactions and patterns. So then, you know, you add things like real-time training. There is a lot of data being ingested in real time, online training is happening, there is a lot of feature ingestion happening in real time: what is the user interacting with?

So there is this new paradigm, okay, which is these models, the LLM models, these are foundation models. They have potentially seen maybe 95% of the patterns. So maybe what you need to do in real time

to update these models is probably a lesser investment, right? The models have seen the majority of the patterns, and if you feed in what the user has done, right, they would be able to predict much more comprehensively. You don't need to do a lot of online training, etc., in real time. That's one of the, you know, lines of thinking. The other thing is the simplification of the architecture.

For something like GNNs, right, what we need to do is, you know, when you are doing the training with GNNs, all the data is in the graph format, graph structure format, right? You need to traverse the nodes and the edges in real time, right? Because you don't know beforehand, right, how much data you will be processing, which will be the right node, which edge you need to traverse, right? So the data pre-processing is not happening ahead of time, right?

And there is a different architecture and it's inherently hard to scale GNNs beyond a certain limit because there is live data processing happening while the training is going on or while inferencing is going on, right?

So an architecture, and the GNN is one example, there are different recommendation ranking architectures. And then companies would have a proliferation of these recommendation ranking models, right? Like every team, for each of their use cases, would start from scratch and create a model which is potentially a small model. It does a very targeted job, and it does a really good job at it, right? And then they build all these things. So you have bespoke models with different architectures.

There are a huge number of them. If you take the LLM route, right, like the trend which is emerging in the industry, right, hey, if I do create a giant foundation model, right, and if you're obviously very present in what's happening, distillation as a technique is becoming

very prominent in the industry, right? Like I will create smaller models for inferencing, but I will distill it from this giant foundation model. So what you have done is, you know, you have sort of centralized the creation of models in potentially a central place. So you can think of a scenario in the future, right? Like,

There is one central team, instead of every different use case and every different vertical within your org creating their own models for their own use cases, which are very targeted. One central team, which is, you know, the holder of your organization's data, which is curating that whole data and creating, you know, maybe one or two or three. Okay, very simplistic scenario: you're creating two giant foundation models, one for generative use cases, one for non-generative use cases.
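To make the distillation step he keeps referring to concrete, here is a minimal sketch of a standard knowledge-distillation loss in PyTorch; this is the textbook soft-target formulation, not LinkedIn's specific recipe:

```python
import torch
import torch.nn.functional as F

def distillation_loss(student_logits, teacher_logits, labels, T=2.0, alpha=0.5):
    """Blend a soft KL term against the teacher with the usual hard-label loss."""
    # Soft targets: KL divergence between temperature-softened distributions.
    soft = F.kl_div(
        F.log_softmax(student_logits / T, dim=-1),
        F.softmax(teacher_logits / T, dim=-1),
        reduction="batchmean",
    ) * (T * T)
    # Hard targets: standard cross-entropy on the ground-truth labels.
    hard = F.cross_entropy(student_logits, labels)
    return alpha * soft + (1 - alpha) * hard
```

The teacher here is the centrally trained foundation model; each use case trains a much smaller student against it, which is what lets the smaller models stay cheap to serve.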

And then, you know, the models for all the particular use cases are being distilled from this. So you have simplified the overall architecture. You are essentially worrying about compliance, majorly. Compliance is a big thing as well. As you are seeing, right, there was the TMA Act, there is the AI Act, there will be other acts which will be coming. If you curtail the surface area: how many models do you have,

what data were they trained on? Then you're not worrying about, you know, the many hundred other models which your organization has created, and making sure every one of them is compliant: what data was it trained on, was that data compliant? You have sort of centralized and simplified that problem. Hiring becomes easier, right?

Right now, when I'm looking at certain use cases like, you know, GNNs, you do tend to hire people who can run GNNs at scale, right? You need a very targeted skill set, people who understand, you know, how to handle graph data. And then you need to go into the depths of the GPU architecture, because a lot of data transfer is happening during that training. So, okay, is NVLink good enough? How much HBM memory do I have? How much on-disk memory do I have on the GPU nodes?

Then, you know, what is the network bandwidth? There is a lot of in-depth GPU knowledge you will require, right, just to go there and start solving. Plus, you know, the graph traversal, what algorithms do I need to introduce? So your sourcing of skills and talent also becomes simplified. So there is that overall value proposition which could be achieved, provided LLMs do prove themselves. I think...

Many are looking at this problem space and figuring that out. So it's yet not solved. And as I said, unlike generative AI as well, this is a problem of scale. Like generative AI use cases, you have to explicitly, as a user, go and invoke that.

You will go like I am giving you a LinkedIn example. Right. You will go and say, hey, summarize my profile or a recruiter will say launch this particular agent or you, you know, LinkedIn learning user will go and say summarize this course for me or explain to me this nuance. So these are, you know, discrete transactions.

Going to the feed, logging in, browsing through the content: this is happening all the time, it is happening at scale, right? So that's a much bigger scale problem which needs to be solved with LLMs overall, right, in a cost-effective manner, and a very, very, you know,

latency-sensitive manner. So that's the thing, right, which needs to happen. And we'll see whether they do prove efficacy. Like, you know, based on all the results you see, they're able to pass a lot of bar exams, PhD exams, math exams, right? So you'll see, hey, they are intelligent, right? Like, can they be very intelligent for this specific set of targeted problems?

I hadn't thought about the simplicity in the architecture and also the simplicity in being able to attract talent that understands this architecture because it is more simplified. And you were talking about, A, the...

cost of having GPUs at scale. And when you have this many GPUs and you're trying to utilize them, you don't want to have any percentage going idle. I imagine you think about that a lot. You're thinking, wow, we're burning money just letting this GPU sit around and we're not utilizing it to its maximum capacity.

Is that what you're trying to do, or is that what Liger kernels are trying to help out with? Can you explain that? For any infra and platform team, which is an ML infra team at this point, if you would talk to them, this is a burning thing, right? It keeps you up at night. On one hand, you know, you cannot run these LLMs, for example, or these modern recommendation ranking models, without GPUs, right? Yeah.

Training, definitely, right? And even with inferencing now, with LLMs, you have to go on GPUs, right? The general trend in the industry, if you rewind two years, was: hey, we will train on GPUs, we will serve on CPUs, right? That's how most of the companies had architected it.

And CPUs were not that expensive. The modern architecture doesn't lend itself to that, right? So you need to go to GPUs. So the whole investment in GPU efficiency becomes very, very paramount. And it starts at every single layer. Like in our case, okay, there is the general thing, right, which also happens in companies: hey, you need to allocate a maintenance budget.

Now, you can be much more generous with CPUs, right, in terms of maintenance budgets. How much spare capacity do you allocate? If 20% of your fleet is being maintained on a regular basis, that's the delta you need. If you need high availability, you need to spread it across three data centers.

Okay, let's go ahead and do it, right? Those decisions were much easier. They're becoming much harder when you have GPUs in-house. Because, yes, there are certain companies which have potentially invested a lot, right, and they have deep coffers, but a lot of companies want to be cost-conscious about it, right? So every decision, hey, even to the point of looking at certain use cases: what would be the right maintenance budget, right? Like,

If you have, let's say, you know, in one maintenance zone, 1000 GPUs,

Can I allocate 100 just for maintenance, right? Then you do the cost and it's like, hey, there's a lot of wastage going on. You start sweating. You start thinking about how much money that is. Yeah. And so how do you, first of all, start looking at the workloads so that they become more resilient to maintenance? The previous approach used to be: once I know

20% of my fleet is going to be maintained, I will not schedule anything, right? We don't need to worry too much, you know, for the next 24 hours, even if they're empty. Now you're worrying about that: I cannot leave 200 GPUs lying empty for 24 hours just because, you know, maintenance is going to happen. So then you look at your workloads: okay, how can I make the workloads more resilient? Now, if you look at distributed training workloads, right, the majority in the industry are gang-scheduled, right?

So you do gang scheduling: you know beforehand this distributed training workload will take X amount of GPUs, right? And there is a lot of synchronization happening as the distributed training is going on. And if a few of the nodes fail, you will have to put it back in the queue. You obviously need to have checkpoint restore, put it back in the queue, do gang scheduling again so that it can resume.

So then you look at this workload, okay, maybe in this world we have to rethink the whole notion of gang scheduling, right? Do we need to build more elasticity, right? Like if a workload has been launched on 100 GPUs and 30 GPUs are...

going to be taken away, can it shrink? Can it expand on its own, right, and continue going? Same for inferencing. I think the need for scale-down, scale-up is becoming big. Serverless architecture has to become a prominent scheme of things, because you cannot optimize for max traffic. And a lot of us are doing that, a lot of the industry is doing that: you look at what is the peak traffic you get, and that's your capacity, right?

But if out of 24 hours, two hours is your peak traffic, for the rest of the 22 hours, these GPUs are sitting idle because they have been provisioned to handle that peak traffic. Again, a lot of wastage. So, so many of these decisions are now becoming very prominent. So, I feel the emergence of elastic architecture, serverless architecture to handle workloads which, you know,

should be able to scale down and scale up as the capacity is, you know, shrinking and expanding is, you know, a big area and something which we are doing. And I'm assuming, you know, a lot of people are solving this problem through different means and mechanisms, right? So yeah, that's essentially the key characteristic there.
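One common way to get some of that elasticity for training today is PyTorch's elastic launcher, where a job is allowed to restart with a different number of nodes between a declared minimum and maximum. A minimal sketch, with the launch command shown as a comment; the flags and topology are illustrative, not LinkedIn's setup:

```python
# Launch elastically so the job can shrink/grow across restarts, e.g.:
#   torchrun --nnodes=1:4 --nproc_per_node=8 --max_restarts=3 \
#            --rdzv_backend=c10d --rdzv_endpoint=$HEAD_NODE:29500 train.py
import os
import torch
import torch.distributed as dist
from torch.nn.parallel import DistributedDataParallel as DDP

def main():
    dist.init_process_group("nccl")          # rank/world size come from torchrun
    local_rank = int(os.environ["LOCAL_RANK"])
    torch.cuda.set_device(local_rank)

    model = DDP(torch.nn.Linear(1024, 1024).cuda(), device_ids=[local_rank])
    opt = torch.optim.AdamW(model.parameters(), lr=1e-4)

    # On every (re)start, reload the latest checkpoint so a shrink/expand event
    # only costs the work done since the last checkpoint.
    # load_checkpoint(model, opt)            # hypothetical helper
    for step in range(1000):
        x = torch.randn(32, 1024, device=f"cuda:{local_rank}")
        loss = model(x).square().mean()
        loss.backward()
        opt.step()
        opt.zero_grad()

    dist.destroy_process_group()

if __name__ == "__main__":
    main()
```

The design point is that the training script itself stays a normal DDP program; elasticity comes from the launcher restarting it with whatever nodes are currently available, plus fast checkpoint restore.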

Well, and you haven't even mentioned the reliability of the GPUs because they're notoriously unreliable. They are. Like there are, I mean, all these NVIDIA GPUs you get, for example, right? And one of the decisions we took early on, and I think for the simplicity part of it, right? Like, hey, let's start at least on the training side, right? We'll go with NVIDIA.

And so far we have standardized on NVIDIA throughout. But then, you know, you have a lot of things which can go wrong. Just take a distributed training example:

When you are running at scale, you are touching your, in our case, you know, we have Kubernetes as the compute infra layer, right? So you have Kubernetes, you have the network, you have the hardware, you have the storage. Training is, you know, reading and pumping a lot of data throughout. Any of these points going down is a weak link in addition to the GPUs itself, right?

And add to that the planned maintenance. And sometimes, you know, for foundation models, your training can run for one week, two weeks, three weeks at a stretch, right? When you are actually running training for these large foundation models, you're guaranteed to have some part of the infrastructure having some problem. Sometimes it will be the messaging queue, sometimes it will be the storage system, sometimes it's

something in the Kubernetes layer, or sometimes it could be just the GPU node itself: failing health checks, ECC errors, NCCL problems, the InfiniBand connecting all these GPUs, right. So this poses a huge challenge, right, in terms of ensuring, because these are such long-running jobs, how do you make sure that you are being effective, right? And that's combined with the problem of gang scheduling, because, hey,

they need to go all together at once, or otherwise they will need to wait. If a training job needs 500 GPUs, it will be in the queue until those 500 GPUs are free, right? You cannot launch it before that, because, you know, that's how the architectures around gang scheduling are built. So this is a big thing, right, which we have invested a lot in, in terms of building, you know,

very fast checkpointing and restore, because you assume failure is happening, maintenance is happening. So a lot of investment has gone in from our side in terms of ensuring that, you know, there is

an automated checkpointing and restore mechanism, and it's, you know, hierarchical. So we first checkpoint to memory, then distribute those checkpoints to some async block storage, and then you have to read it back, right? And then, you know, there are also discussions around whether these checkpoints can reside on the GPU SSDs themselves, because there is a cost of, you know,

reading: the training can continue once you have checkpointed in memory, but once you have sent it to some remote storage, a block storage or a file-based storage, then when you're reading it back, if something has gone wrong, that takes time. You have to bring it back through the network, and bigger models will have, you know, many gigabytes of checkpoints, right? Bringing them through the network takes time. So then, okay, can we leverage the GPU SSDs under the covers themselves, right? GPU SSDs obviously have some limited capacity.

How much can you store? So a bunch of those architectural elements need to be crafted. So we are investing quite a lot and quite heavily into that. Liger, you talked about, was the other part of the efficiency problem, which is essentially, as we went on this journey,

we saw, beyond the reliability and the scalability, that across a lot of the use cases the models started getting more complex, bigger. And we had an X amount of GPU fleet. We started seeing a lot of pull from our customers: hey,

I need GPUs, I'm not getting GPUs, or my training is running for, you know, so long, right, and I really want to get it done faster. Multiple of these use cases, and literally people were...

Because, you know, GPUs were a scarce entity; at least, you know, a year ago we had quite a bit of that, right? We were scaling the fleet, and we were ordering; NVIDIA has its own supply chain and timeline, right? The use cases were just springing up. So we looked at the problem, okay, and you follow the traditional methodology: data parallelism, model parallelism, let's introduce tensor parallelism. We invested in technologies like ZeRO and ZeRO++,

because our A100 fleet had a very constrained network bandwidth. So how can you scale up training even with that constrained bandwidth? ZeRO++ helped in that. So after we had exhausted every single option, then, okay, what can we do? So as we started going deeper, one of the things which we used to do infrequently for our customers was, you know, rewriting CUDA kernels.

Right, like a lot of the use cases which are very sensitive to the training completing within X number of hours, etc. When we looked at the model training code which our modelers used to produce, there was a lot of room for improvement. So at times we would go and rewrite the CUDA kernels for certain operations, right? Now you bring this to the LLM world. The thought which we had was, hey, we can do this on a model-by-model basis, right?

But this is not scalable, right? There are multiple users having different models. So the thought was, okay, how can we use what exists to actually solve this problem in a way which can be a little bit more scalable? So that's where, you know, fortunately at that point, OpenAI had launched Triton, which is essentially a Pythonic programming interface, right, where you can do CUDA kernel programming, right,

at a much more abstracted layer. Most of our users would be, you know, familiar with Python, so we used that. Now, the second thing was, you know, GPU memory is hierarchical, right? You have the DRAM, then you have the HBM memory, then you have SRAM. The streaming multiprocessors, the SMs in GPUs, they interact mostly with SRAM, right, which is very little.

And so all these multiple kernels you have and multiple operators you have in your training code, there's a lot of data transfer happening, right, from CPU to HBM to SRAM. Like that is your biggest bottleneck, right? Even though the GPUs can massively parallelize, the amount of IO, data IO, which is happening between different hierarchies of memory, that's becoming a big bottleneck for the training time.

So that's when we thought, okay, let's combine what Triton has brought to the table, take this problem, and create, first of all, custom kernels for our distributed training workloads. And we started seeing huge gains, right? In one of the cases, just a 300% increase in overall efficiency. In the majority of cases, more than a 50% decrease in the memory being used. What we did was we started fusing kernels: certain operations

can be just combined, fused together. Operators can be fused together; you don't need five kernels to do these different operations, right? And there is some human judgment involved in making that decision. We have also seen FlashAttention in the industry do that quite a bit, right? It became very popular and it was built on the principles of kernel fusion. So we took that approach and combined it.
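For a sense of what "fusing kernels in Triton" looks like in practice, here is a toy fused elementwise kernel in the style of the Triton tutorials; Liger's actual kernels fuse much heavier ops (RMSNorm, RoPE, SwiGLU, fused cross-entropy), but the memory-traffic argument is the same:

```python
import torch
import triton
import triton.language as tl

@triton.jit
def fused_add_relu_kernel(x_ptr, y_ptr, out_ptr, n_elements, BLOCK_SIZE: tl.constexpr):
    pid = tl.program_id(axis=0)
    offsets = pid * BLOCK_SIZE + tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_elements
    x = tl.load(x_ptr + offsets, mask=mask)
    y = tl.load(y_ptr + offsets, mask=mask)
    # One load of x and y, one store of the result: the add and the ReLU never
    # round-trip through HBM as two separate kernels would.
    tl.store(out_ptr + offsets, tl.maximum(x + y, 0.0), mask=mask)

def fused_add_relu(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    out = torch.empty_like(x)
    n = x.numel()
    grid = lambda meta: (triton.cdiv(n, meta["BLOCK_SIZE"]),)
    fused_add_relu_kernel[grid](x, y, out, n, BLOCK_SIZE=1024)
    return out
```

Instead of launching one kernel for the add and another for the activation, with the intermediate written out to HBM and read back, the fused version touches HBM once per input and once for the output, which is exactly the IO bottleneck he describes.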

It helped immensely: a lot of our internal developers saw a huge decrease in training time, because memory efficiency got much better. And once we realized, hey, this looks good, we said, let's just open-source it. There was not a lot of planning, there was not a lot of thought. It was also like, hey,

a lot of open-source models are coming in, and maybe, you know, if the community likes it, they will potentially create kernels for those open-source models. And if we end up consuming those, it eventually also benefits us, right? Yeah. It just took off, right? I think multiple companies were in the same position when you put the problem out there: they were all, you know, getting this big load of generative AI use cases, a lot of LLMs, and the GPU crunch was very much present, right?

And almost everybody, right,

needed to solve this problem for distributed training at scale, and we got a solid reception in the community, right, from Andre, from Hugging Face. One thing which we did spend some time on was making it very easy to use, right? So we integrated with the Hugging Face ecosystem right off the bat, etc. And just last week it crossed

more than 1 million downloads, right? Incredible. This is amazing. Like, we never thought, you know, it would go all the way there, right? And we have gotten lots of good feedback from the users who are benefiting from it. Yeah, it's funny that you mention the memory aspect, because we had a guest on here a few months ago talking about how

His whole thesis was that we are memory constrained. We're not GPU constrained. GPU is almost like not the way to look at it. It should be really looking at memory. And that is the bottleneck right now. And I think very rightly so, right? Like, see, ultimately, ML is a data processing problem, right? Yeah.

Now, so that does mean that, you know, there's a lot of data moving in between these GPUs, right? Now, the fastest you can process data is if the data is in the memory, right? Now you go through the hierarchical thing, okay, GPU, HBM, SRAM. The second thing is, okay, the CPU on system, memory of the CPU, right?

on the GPU nodes, then you go, okay, I will go outside the GPU nodes, right? Maybe the GPU SSDs, right, itself. But the more you can keep in memory, the faster you can process. Yeah. More so for...

LLM-oriented workloads and generative AI workloads. Because if you look at it, you know, people want to have bigger context lengths, right? Even when you are training the model, if you increase the context length, you potentially can train bigger models. At inferencing time, right, there's the amount of information users are sending for the model to process, the amount of information the models are generating. Like, see, recsys output used to be

very straightforward, right? Recsys recommends whether this is good content to show or not. In the case of generative AI, the output is also huge, right? So a lot of that data is going back and forth, and the more you can, you know, leverage the memory effectively, the bigger the gain you can get. So, rightly so, there's a lot of focus there. Even during inferencing time, the KV cache is becoming

a big technique, right? Where you can reuse the KV cache as the inferencing calls are coming in, right, for sequences of tokens, et cetera, which we have seen before. Yeah, already calculated, so you can save a lot of cost too. Yeah, yeah. You've already calculated attention scores for all these different, you know, tokens, the sequences you are seeing, right? So why do this again? So let's keep it in memory. But how much can you keep in memory? The other part is, you know, if you start leveraging GPU memory for KV caching, then

the model itself needs GPU memory for its own processing. So what is the right boundary? And most of the H100s, right, which is what the industry probably has the most of in bulk right now, H100s, or to some extent H200s, 80 gigs of memory per GPU is turning out to be low,

Right, based on the amount of data you're processing. And I think part of it is also as the use cases emerge, that starts becoming clearer. Because the way maybe whatever went into the decision-making was, hey, this memory is for GPU processing, whatever GPU is computing. When people start leveraging this memory as a cache, like KV cache, like in case of GNNs also, we are keeping some part of the graph structure in memory.

When you start using it as a data storage mechanism, in addition to it being used for the computing needs of your model itself, then it's less; then you need more. So I'm assuming that realization is already there, like if you see the Grace Hopper architecture and the Grace Blackwell architecture, where they're combining ARM CPUs

with the GPU nodes, and then they're creating a very high-bandwidth data transfer link between the CPU and the GPU, so that you can then leverage the larger memory on the CPU side, because there is a very high-bandwidth data transfer. That's a realization from this particular use case which is emerging from generative AI, right? Yeah. Yeah.
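A toy single-head decode step makes the KV-cache trade-off he's describing concrete; the memory arithmetic in the comment is approximate and assumes a dense multi-head model without GQA/MQA:

```python
import torch

def attend(q, k_new, v_new, cache=None):
    """One decode step with a growing KV cache (toy single-head version)."""
    if cache is None:
        k, v = k_new, v_new
    else:
        # Reuse keys/values already computed for earlier tokens instead of
        # recomputing them; cache memory grows linearly with sequence length.
        k = torch.cat([cache[0], k_new], dim=1)
        v = torch.cat([cache[1], v_new], dim=1)
    scores = (q @ k.transpose(-1, -2)) / k.shape[-1] ** 0.5
    out = torch.softmax(scores, dim=-1) @ v
    return out, (k, v)

# Rough math on why 80 GB/GPU gets tight: per token and per layer the cache
# stores 2 * hidden_size values. For a dense 70B-class model (~80 layers,
# hidden 8192, fp16) that is roughly 2.5 MB per token, so a 32k-token context
# is on the order of 80 GB of KV cache alone, before model weights.
# Grouped-query / multi-query attention shrinks this considerably.
```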

I want to give a shout-out to the person who said that to me, because I was trying to remember their name. It was Bernie Wu who said that we are memory constrained. And you just went very deep on that, and everything that you're saying makes a lot of sense. You did mention one thing that you worked on quite heavily too, which I've also heard is an absolute pain: checkpoints, and speeding up the checkpoints. Because, um,

it's not clear when things are going to fail. So you don't know if you need to checkpoint every second or every five seconds or every five minutes or every day. And if you over-optimize for checkpointing all the time, then you're potentially transferring around a ton of data when you don't need to be, because, as you mentioned, these checkpoints are huge.

And so especially when you're training LLMs that are very, very big, the checkpoints can be in the terabytes. And that's just like, if you're doing that every five seconds, that's a whole lot of data that's going around. So how did you specifically speed up the checkpoint and make that process better? I think we decided, like, so initially the very naive implementation of checkpointing was,

that the majority of our data is on HDFS. The mechanism was: you checkpoint, the checkpoint goes to HDFS, and when you are reading, you read from there. So the realization, specifically with LLMs: first of all, these checkpoints are big, and the training pauses while you are checkpointing. So there is a pause in the training workload

until this gets done; there's a transaction which is happening, right? So the first thing which we did, and, you know, changed in that architecture as LLMs became more prominent, is: hey, let's make it a two-phase transaction, right? In essence, we will checkpoint to memory, and from there onwards, any copy of that checkpoint is async. So two points, right: the checkpointing to memory is very fast, right,

and the second thing is you're not waiting for that checkpoint to be transferred to a remote storage, right? So it's a hierarchical checkpointing strategy which we developed, and now we are, you know, streaming that checkpoint. We also changed our backend for checkpoint storage from HDFS to a block-based storage. Now we are investing in figuring out how to optimize it even further.
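A minimal sketch of that two-phase idea in plain PyTorch (not LinkedIn's implementation): phase one is a fast, blocking snapshot into host memory; phase two streams it to durable storage off the training critical path:

```python
import threading
import torch

def _to_cpu(obj):
    """Recursively move tensors in a (nested) state dict to host memory."""
    if torch.is_tensor(obj):
        return obj.detach().to("cpu")
    if isinstance(obj, dict):
        return {k: _to_cpu(v) for k, v in obj.items()}
    if isinstance(obj, (list, tuple)):
        return type(obj)(_to_cpu(v) for v in obj)
    return obj

def snapshot_to_cpu(model, optimizer):
    # Phase 1: the only part the training loop waits for.
    return {"model": _to_cpu(model.state_dict()),
            "optimizer": _to_cpu(optimizer.state_dict())}

def persist_async(snapshot, path):
    # Phase 2: write to durable storage (block store, HDFS, GPU-local SSD
    # cache, ...) in the background while training continues.
    t = threading.Thread(target=lambda: torch.save(snapshot, path), daemon=True)
    t.start()
    return t

# In the training loop:
#   snap = snapshot_to_cpu(model, opt)              # brief pause
#   persist_async(snap, "/mnt/ckpt/step_1000.pt")   # training resumes immediately
```

Recent PyTorch releases also ship asynchronous distributed-checkpoint save APIs that package up a similar pattern, so a sketch like this mostly shows where the time goes rather than something you would need to build from scratch.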

So, with GPUs, when you are ordering NVIDIA GPUs, you can request SSD storage within the GPU nodes, right? And there are technologies, which I'm assuming, you know, some companies may have in-house, but even in open source, where you can then start looking at creating a cache or a distributed cache on the GPU SSDs. So maybe one GPU's SSD storage is not enough

for your needs; then you can combine the GPU SSD storage across all your GPUs and create a distributed cache to store this, right? So that's the second phase we are on. And the advantage of that will be in the restore. I think checkpointing will still be fast because it's in memory; it then will go to the GPU SSDs. But when you are reading back,

if you don't have to take it outside of the GPU network and bring it back into the GPU network, you save tons while restoring and reading. The other thing, to your point: one of the questions was how often do you checkpoint, right? I mean, that's a big question. And we have mostly, you know, left it to the modelers to take that common-sense decision: how often should you checkpoint? For example, for jobs which take more than a week, right,

they can ask, hey, how much am I willing to lose? Am I willing to lose five hours' worth of work, or, if something goes wrong and I have to restart, can I restart from the previous day's checkpoint? So it's a decision-making heuristic, right. In certain use cases, like, you know,

very deterministic use cases, when you're doing incremental training, like you need to train every few hours, then it is very clear: hey, it's driven by checkpoints, right? So you have to do a checkpoint, you know, every two hours, etc. So it depends on the use case. The other thing which we have done is disruption readiness. So in cases of planned maintenance, right,

where we know that the nodes are going to go down, we trigger a checkpoint, right? So it's a triggered checkpoint. And if the modelers have implemented checkpointing and restore, that trigger will go and invoke a checkpoint. So this is not on a schedule; this is what happens before we take the workload off those nodes, right? So essentially, this is a signal from the underlying infra that, hey, these nodes are going away.

We trigger a notification, and this is all automated right now; it goes to the running model, the model gets checkpointed and moved, and then, you know, we put it back. And then we have invested in things like priority queuing: jobs which are being disrupted should be moved to the front of the queue, like a Disney front-of-the-line pass, and rescheduled as fast as possible, right? So you need to invest in those mechanisms to make sure that goes on smoothly.
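A minimal sketch of what disruption readiness can look like from the job's side, assuming the infra delivers the maintenance signal as SIGTERM (the actual signaling mechanism at LinkedIn isn't specified here):

```python
import signal

class DisruptionHandler:
    """Set a flag when the infra signals that this node is being drained."""
    def __init__(self):
        self.should_checkpoint = False
        signal.signal(signal.SIGTERM, self._on_term)

    def _on_term(self, signum, frame):
        # Don't checkpoint inside the handler; just flag it so the training
        # loop can react at a safe point (end of step).
        self.should_checkpoint = True

handler = DisruptionHandler()
# In the training loop:
#   for step, batch in enumerate(loader):
#       train_step(batch)
#       if handler.should_checkpoint:
#           save_checkpoint(...)   # hypothetical helper
#           break                  # exit cleanly so the job can be requeued with priority
```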

Yeah, I like the idea of, first of all, just getting the checkpoint offloaded as quickly as possible so you can get back to training, and then do what you want with the checkpoint. But don't have everything stopped because you're trying to offload the checkpoint; make that as quick as possible. I think I've heard that before as a design pattern. And then

also make sure that there are no surprises for the model that is training, so it's not like, hey, what happened? I thought I had all these resources and they all went offline. So you make sure that these things... They feel kind of standard, but unless you think through it, or you have a bad experience... I imagine you probably had an experience where you realized, wow,

we should probably do something about that because there was a whole training job that just kind of went to nothing and we could probably provision or at least...

put a bit more anticipation in place for that. One thing that I was thinking about is how we've been centering this conversation a lot on the new paradigm of LLMs and agents, but LinkedIn still has a ton of traditional ML use cases, right? So how do you think about

Bridging the gap or creating a platform that can service both the LLM, the agent use cases, and then also the traditional ML use cases. I think it's happening. So certain things are common, right? Like obviously all the investments we are doing in the GPU infrastructure, GPU monitoring, observability, resiliency, etc.

and, you know, building things like distributed checkpointing and restore, automated disruption readiness; all of that carries over. Then you start going further up. So one of the things which I did once I came here was, you know, we took a major decision to rehaul our machine learning training pipelines. The pipelines, or the orchestration engine, were built

very much in a solid, prescriptive way, right? There were multiple components which we had in our training pipeline. It was based on TensorFlow, and it followed, you know, a similar paradigm to TFX, TensorFlow Extended, from Google. And then, you know, every single step is sending metadata to the subsystem. So, very rigid. Now, I think that realization came very early on.

And part of this was also that, you know, there is a lot of traditional feature engineering which happens with traditional recsys models, as opposed to LLMs, which don't have the notion of feature engineering: you are feeding in, you know, corpora of text. So feature engineering as a discipline is sort of disappearing in this new world.

And there are a lot of new, you know, RAG-based architectures coming up, right? You are doing more in real time. So we rewrote the machine learning pipelines engine and redesigned it on an open-source orchestration engine called Flyte, because the notion was that experimentation

will be very heavy, right, in this new world. And so, you know, there is a lot of quick changing of code. I want to do very short fine-tuning jobs or sometimes very long-running jobs, but the user experience has to be, you know, very nifty. We launched something called Interactive Dev where you can, you know, have VS Code connect to a remote running job, right, and put debug pointers and trace through, right?

What is going wrong? Because otherwise, the previous experience used to be, you know, you don't know what has gone wrong, and once it has gone wrong you need to rebuild, recompile, re-upload. So this whole new modern pipeline training architecture is very much focused on very quick changes, very much on experiments: you don't need to re-upload, you go directly, debug there, see what values are being passed in real time through your debug pointers, go one by one.
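For context, this is roughly what a pipeline looks like when expressed with Flyte's flytekit API; the task names and bodies here are hypothetical, just to show the shape:

```python
from flytekit import task, workflow

@task
def prepare_data(path: str) -> str:
    # e.g. tokenize / shard the training corpus and return its location
    return f"{path}/processed"

@task
def fine_tune(data: str, base_model: str, epochs: int = 1) -> str:
    # launch the (possibly distributed) fine-tuning job, return a model URI
    return f"models/{base_model}-ft"

@workflow
def training_pipeline(path: str, base_model: str) -> str:
    data = prepare_data(path=path)
    return fine_tune(data=data, base_model=base_model)
```

Because tasks are just typed Python functions, swapping steps, rerunning a single stage, or launching a short fine-tuning experiment is much lighter-weight than in a rigid, prescriptive pipeline definition.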

And then, you know, once we standardized LLMs on it, we started migrating a lot of our traditional recsys models onto this new platform as well. And the feedback from users is really great. They're all liking this less rigid,

very, you know, experimentation-oriented architecture, where the back-and-forth changes they need to make to change their model and train their model are easier, and there is a very robust built-in versioning mechanism. One thing which was missing in our previous versions of the machine learning pipelines was: you have a config, you have a model, you have a dataset, and all these are versioned as different things, right?

And when you do an experiment, you then have to go and look at every different entity: what was correlated with this particular experiment or training run? So in this new machine learning pipeline, there is very, very robust versioning, right? In terms of all the entities, your configuration parameters, your pipeline, your model, as well as, you know, which data version was used: these are all bundled together, right? So every experiment is associated with the right lineage. Like, okay, these were all the things which were used to create this

particular model, right? So versioning is very robust, because it's sort of built with the assumption that you will experiment a lot, right? So that's one change. I think one thing which potentially, and again, it's early days of what will change in recsys vis-a-vis, you know, the LLM architecture. One hypothesis is, hey,

the LLM architecture itself will take hold in recsys, right? So then, okay, whatever you have done at the upper layers, or whatever you're doing for LLMs, like vector DBs, embeddings, they become the prominent mechanism. Overall, RAG becomes a prominent mechanism, right? So your traditional feature engineering pipelines, the way you were constructing them, they will change; the real-time feature pipelines, not the offline ones, right, they will change to accommodate this new paradigm. Tools like LangChain, LangGraph,

which are very much dominant in the LLM space, they may become prominent in that space as well. So far, we haven't gone that far. I don't think we have looked at, hey, how LangChain or LangGraph is being used for LLMs, whether prescriptive orchestration graphs when the models are inferencing in real time, or non-prescriptive agentic graphs which are formed, where the agent decides what path to traverse, right?

Can they be used in the traditional space, where, you know, there is a lot of feature ingestion happening, where there are feature engineering pipelines which you have built? I think the wait is on to see. The thing which will simplify this is: can LLMs solve

that part of the problem. Then you can take the LLM infrastructure as is, even at the upper layers, and start replicating it for recsys. Until then, the majority of our workloads are still recsys workloads not powered by LLMs, right? And rightly so: they are very targeted, very effective at what they do.

And so the first hypothesis has to be proven at the model layer itself. But at the lower layers, we are definitely ensuring that there is standardization across. It's now the same machine learning pipelines engine which is being used all across LinkedIn, whether it's for recsys or for LLM workloads, right? So we have standardized all of that layer. The topmost layer, when you, you know, move towards inferencing, some of those changes, right,

you know, we'll figure out as we go. Yeah. Where exactly does it diverge? Because it is fascinating to think about how, at the lower levels, you've recognized you can share resources, or you can have the same type of stuff going on, but at some point it's diverging right now. So where is that point, if you know? Or it might be a case-by-case basis.

I think, see, the first thing is, you can look at every single layer, right? So for example, when you are going to inference: people have either written their own inferencing engines, right, in different languages, or you have standardized on something like TF Serving from open source, or TensorRT from NVIDIA, which were very heavily optimized for models in the recsys space.

When you look at the LLM space, there is vLLM, there is TGI, there is SGLang, right? The inference engine itself is different: there are custom inference engines being written for LLMs.
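For reference, this is the kind of interface those LLM-specific engines expose; a minimal vLLM sketch (the model name is a placeholder, and this is not a statement about LinkedIn's serving stack):

```python
from vllm import LLM, SamplingParams

# vLLM handles batching, paged-attention KV-cache management, etc. internally.
llm = LLM(model="meta-llama/Llama-3.1-8B-Instruct")   # assumes weights are accessible
params = SamplingParams(temperature=0.7, max_tokens=128)

outputs = llm.generate(["Summarize this candidate's experience: ..."], params)
print(outputs[0].outputs[0].text)
```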

And then just look at the hardware architecture, which we didn't talk about: there are a lot of custom chips coming. Like you see many startups, Groq is another example, which are just developing custom hardware. Cerebras. Yeah, yeah, for sure. So for people who are investing and taking a bet, hey, it's much simpler for them, because they can just standardize on one thing. One of the reasons, right, where you may feel, hey,

NVIDIA GPUs could be better for LLMs, versus, you know, people who are investing in custom ASICs, custom chips which are potentially better for LLMs in terms of inferencing, etc., is the bet on the architecture. If the transformer architecture is going to replace every single thing, then you can have custom chips. So, right, there is some divergence. Some companies are just going with very custom chips

for LLMs. So the engine is different, the chips are different, right? If you build something general purpose, right now we are all, you know, standardized: at the hardware layer there is no LLM-specific thing, apart from the kernels work, which we have been doing, and which has been pretty specific. Then you also look at the feature processing infra, right? A lot of real-time feature processing, feature ingestion,

versus the kind of tooling which has emerged in the LLM world: you are talking, you know, LangChain, LangGraph; vector DBs are prominent, right? It's not the traditional feature processing, and embeddings have become the dominant landscape. And I think we have been on a journey where, you know, we are using embeddings all across as well,

both for recsys as well as LLMs, but some of the tool chain there is different, right? In the past you would use something like Beam, for example, you know, for feature processing. Now a lot of that is happening

in the LLM world through LangChain and LangGraph, like, you know, when you are actually orchestrating the data. So those are the divergences which are happening, and they are there for the right reasons. I think, you know, over a period of time it will reconcile. But yes, that layer is different

when you are handling traditional recsys versus this. The inferencing engines are different, and in some cases people are even taking different chips, hardware chips, for these layers. So I think it's bottoms-up: we have to push, we have to make sure, you know, we continue standardizing on the bottom layers, right, and go up the stack and see, as we become more mature, what else can be standardized and can be the same. If LLMs

start solving recsys, then potentially, maybe, you know, that itself solves a lot of this problem.