This is episode number 886, our In Case You Missed It in April episode. Welcome back to the Super Data Science Podcast. I am your host, Jon Krohn. This is an In Case You Missed It episode that highlights the best parts of conversations we had on the show over the past month.
Our ICYMI, in case you missed it this month, starts with Sama Bali and Logan Lawler. Sama is from NVIDIA and Logan is from Dell. And in episode 883, I asked them about libraries like CUDA that make up the AI software stack on NVIDIA GPUs. I loved the scenic route that Sama took me on to get us there as it knocked so many novel concepts in AI and emerging tech into place.
Here she talks about NVIDIA's new software and services and how they interconnect. I want to get back to the NVIDIA story from around that time, and kind of this visionary nature of what NVIDIA has done, reflected in their share price,
is this idea that, okay, deep learning is going to be gigantic, or let's assume that deep learning is going to be gigantic. And so let's build a software ecosystem, going back to your point earlier, Sama, that supports that. So yeah, so tell us about things like CUDA, TensorRT, maybe a bit of the history and why those are so important in this GPU ecosystem and in this AI era.
Yep. I'm actually going to start first with NVIDIA AI Enterprise, right? Just completing the story of how we're doing things, especially with Dell Pro Max AI PCs. So think of NVIDIA AI Enterprise as our version of, you know, end-to-end AI
software development platform, which is helping you not just accelerate your data science pipelines, but also really helping you build next-generation applications. It can be generative AI applications. It could be computer vision applications. It can be speech AI applications. And it has a lot of components. We've got NIM microservices. This is
how we are delivering all kinds of AI models as containerized microservices. So literally think of any AI model in the world. We work with open-source partners, proprietary partners. We have our own NVIDIA AI models as well. We're taking each of these AI models, putting them into a container, and then adding our, you know,
I won't say secret sauce, because everybody knows about TensorRT-LLM and all kinds of services which are really helping you get the best inference possible on NVIDIA GPUs. And we're offering them as microservices. And the reason, and you'll soon start seeing this from an NVIDIA perspective, that we are providing almost all of our AI software as microservices, is because
things are changing quickly. I'm a developer today who built an application with Llama 3, and guess what? In two months, Llama 3.1 comes out, and then another two months later, 3.2 comes out. So we want to make it really, really easy for people to just swap in the model as quickly as possible without really disrupting that entire pipeline.
So that's NIM microservices. We've got all kinds of models: from if you want to build a digital human, to actually building speech-related applications, to now we also have NIM microservices for our reasoning AI models as well. So that's the first component of NVIDIA AI Enterprise. Really quickly before...
It's going to be obvious for sure to you, to both of you, as well as to many of our listeners, exactly what a microservice is. But could you define that for our listeners that don't know, just so that they understand what it is and why it's important, why it's helpful? I actually don't have a definition of microservice to hand. I'm going to give you not, like, a textbook definition, but a practical definition, right? Cool. Let's say you're a data scientist and you have created...
Let's just pretend it's a chatbot with Llama 3, and you create that without a microservice, without, you know, an NVIDIA NIM. Like Sama said, every time that model updates, or if there's a security patch, all this stuff, you're doing a ton of, I hate to say it, tedious background work to get that to a point where you can deploy it,
and where, when things change, for example... That's the whole point of a microservice with NIM: you basically can load that with literally one line of code, and the LLM part of it is really done for you. It is containerized, it's packaged, it's ready to go. So a data scientist can focus on, well, how am I going to customize it, or building whatever application wrapper around it, versus, like, ooh, I need to update the code here to get this to connect. That's really the point of a NIM:
how quickly can I leverage the power of an LLM, a vision model, whatever, with one line of code. That's the power of a NIM.
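As a quick editorial illustration of that "one line of code" point, here is a minimal sketch of calling a NIM from Python. NIMs expose an OpenAI-compatible API, so the snippet assumes a NIM container is already running locally on port 8000; the endpoint and model name are illustrative, and swapping Llama 3 for a newer release is just a matter of changing that model string.

```python
# Minimal sketch: call a locally running NIM through its OpenAI-compatible API.
# Assumes a NIM container is serving on localhost:8000; model name is illustrative.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v1", api_key="not-used-locally")

response = client.chat.completions.create(
    model="meta/llama3-8b-instruct",  # swap in a newer model by changing this string
    messages=[{"role": "user", "content": "Summarize what a NIM microservice is."}],
    max_tokens=128,
)
print(response.choices[0].message.content)
```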
And it runs on a workstation too. It runs on Dell Pro Max servers. It runs pretty much everywhere. Yeah, that was going to be my point, the key point being that with these NIM microservices, you don't have to make sure that the AI model is tuned to the GPU, right? We've done all of that work for you. So as soon as you're downloading this locally on your Dell Pro Max PC, it already understands the kind of GPU it's running on. The only thing you have to make sure is,
you know, that the model you're downloading fits into your GPU memory, but with 96 gigs of memory, you've got the entire world for you here. Nice. And so, as you've been speaking, I've been trying to quickly look up online what NIM stands for. It doesn't seem to stand for anything that I can find easily. It just sounds... Oh, I'm going to let the secret out. It actually stands for NVIDIA Inference Microservice, but then we also say NIM microservice. It's like a chai tea kind of a thing.
They mean the same thing. Potato, potahto. Yeah. Potato, potahto. A potato-brand potato. Exactly. Cheese queso. That's what I would say. I go to a restaurant, I'll say, I want cheese queso. And then my wife always gives me a hard time. But yeah, cheese queso. Nice. Yeah. Now I understand perfectly. Thank you for giving us that insight. It's interesting. It isn't something that's very public, so people really are getting the inside scoop on NIM. And yeah, it's just spelled N-I-M, for our listeners
who are wondering what word we're saying. It's exactly like it sounds, and NIM is in all caps. And I'll have a link to that in the show notes, of course. Anyway, so I interrupted you. Oh, go ahead. Oh, I was just on the same topic of NIM microservices. I was going to say, we've got a website called build.nvidia.com. That's where we host all of these NIM microservices.
It's a good website to not just go try out these different kinds of AI models. You have the ability to prototype on the website itself. There are no charges for it at all. You can see models, again, by all kinds of partners that you work with, including NVIDIA models as well. They're segregated by the industry you work with or the use case you're trying to build. So it's easy to kind of maneuver around, find the exact model you want to work with.
And then once you want to download this, we've made it easier. So if you really sign up for our NVIDIA developer program, we actually let you download these models and then continue to do your testing, experimentation, free of cost. There are no charges at all. So you can continue. As a developer, I would want to go try out different kinds of models, see what's working with my...
So we like to do that as well. Fantastic. That was a great rundown. What I was going to say, and I'm glad that you had more to say on NIM microservices, because my transition was going to be that the last time I interrupted you, you were about to, I think, start talking about other aspects of NVIDIA AI Enterprise. So now I'll let you go on that.
So outside of the microservices, we've got NeMo, which really helps you build, train, and fine-tune your own models, but also gives you the ability to add guardrails to your model, so that whenever you're deploying your application, you are making sure that the application gets used exactly the way that you intend. We've got AI Blueprints. Think of these as reference AI workflows. So
We give you the ability to build different kinds of AI applications. So we give you, think of this as a recipe. You've got the step-by-step process to actually build an application. There's a reference architecture process.
But we also give you the ability to add your own data to it. And that's what gives every company its own edge, right? You want to add your data, which is your differentiation at this point in time. So you have the ability to build different kinds of applications. What else do we have? Oh, we've got different kinds of frameworks and tools. So we actually do support different kinds of AI frameworks, like PyTorch and TensorFlow.
We also have our CUDA library. So I think this is a good time to kind of talk about CUDA as well, which really stands for Compute Unified Device Architecture. I didn't know that. I didn't know that. I've been using that word for like a decade now. Thank you.
So this really has been playing a crucial role in AI development by enabling efficient parallel computing on NVIDIA GPUs, right? So the idea was its entire architecture really helps you train different kinds of models significantly faster, which means that you can, in some scenarios, actually reduce your training times from weeks to days, right?
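As an editorial aside on what "CUDA under the hood" looks like from a data scientist's seat: frameworks like PyTorch dispatch the same code to CUDA kernels whenever a GPU is present. This is a minimal sketch under that assumption, not an NVIDIA example.

```python
# Minimal sketch: the same PyTorch code runs on CPU or GPU depending on the device;
# when device == "cuda", CUDA does the parallel work underneath.
import torch

device = "cuda" if torch.cuda.is_available() else "cpu"

model = torch.nn.Linear(1024, 10).to(device)
batch = torch.randn(64, 1024, device=device)
logits = model(batch)  # executes as parallel CUDA kernels on a GPU
print(logits.shape, "on", device)
```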
It is also helping you get better and better inference. So you see higher inference performance on NVIDIA GPUs because of this architecture of parallel processing, if you're comparing it to just CPU-only platforms. We now have, and I'll have to look up the right number of how many CUDA libraries we have, but we've got...
tons and tons of these CUDA libraries, and these are GPU-accelerated libraries. So a good example I'll give you is RAPIDS cuDF, right? So the idea, and Logan touched on this earlier as well, is
that RAPIDS cuDF mimics the APIs of dataframe libraries like pandas and Polars. So if you are in that process of pre-processing your data in your data science workflow, it can actually accelerate that entire process by 100x on our RTX 6000 GPUs
without any code change. That's the beauty of it: as a data scientist, all I'm doing is adding that one line of code, and then it actually
accelerates the entire process by 100x. So that's a massive time saving from a data scientist's perspective. At GTC, we announced cuML, which is, again, one of our CUDA libraries as well. This is helping you accelerate your machine learning tasks. So if you're using scikit-learn, you have the ability to go up to 50x acceleration for your ML tasks as well. So each one of these libraries, and as I said, we've got tons of these right now.
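For readers who want to see what that zero-code-change pattern looks like, here is a minimal sketch of cuDF's pandas accelerator mode. It assumes a CUDA-capable GPU and the RAPIDS cuDF package are installed, and it is illustrative rather than an official NVIDIA example; cuML's accelerator for scikit-learn announced at GTC works along similar lines (check the RAPIDS docs for the current invocation).

```python
# Minimal sketch of zero-code-change acceleration with cuDF's pandas accelerator.
# Assumes a CUDA-capable GPU and the RAPIDS cuDF package; purely illustrative.
import cudf.pandas
cudf.pandas.install()  # after this, plain pandas code runs on the GPU where supported

import pandas as pd

df = pd.DataFrame({"group": ["a", "b", "a", "b"], "value": [1.0, 2.0, 3.0, 4.0]})
print(df.groupby("group")["value"].mean())  # runs on the GPU, falls back to CPU if needed
```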
But depending on the data science task that you're doing, these are all designed to offload that work to the GPU so that you can see that massive acceleration. From NVIDIA AI Enterprise, we turn now to AWS's Graviton and Trainium 2 chips. I'm taking this clip from my conversation in episode 881 with Emily Webber, a principal machine learning specialist in AWS's elite Annapurna Labs division.
In the clip, Emily explained why one might elect to use a specialized AI accelerator over a GPU. So let's start there. You can tell us about this Graviton chip, the Trainium 2 chip, and maybe this kind of relates to a general question that I've been meaning to ask you this whole episode and have just continued to forget with each wonderful explanation that you give after another, which is: why should somebody, why should a listener, for example,
consider using an accelerator like Trainium and Inferentia instead of a GPU? Maybe that's a great question to start with. And then I'll remind you of the series of questions that led me to that question. Sounds good. Thank you. Yeah. So fundamentally, at AWS we really believe in customer choice. We believe in a cloud service
provider that enables customers to have choice about datasets, have choice about models, and have choice about accelerated hardware. We think it's good for customers to have that ability and to have real options; that is ultimately what's best for consumers and best for customers. So fundamentally, that's the direction.
Annapurna Labs is an awesome company. Annapurna Labs has been building infrastructure for AWS for many years. So Annapurna Labs is a startup that Amazon acquired in 2015, primarily to develop the hypervisor, actually. So they developed what's called the Nitro system. Yeah, we'll talk it through. It's like the coolest story in tech that is the least told. So here's the scoop.
So in 2015, the way people were doing cloud 10 years ago is you had this thing called the hypervisor. And the hypervisor essentially was this giant monolithic software system that managed the entire host of all servers. And the challenge with the hypervisor systems is that
it made it really hard to innovate for the cloud because all of the control, the communication, the data at the server level was implemented in this giant monolithic thing called the hypervisor.
So Annapurna had this crazy idea of decoupling the parts of the hypervisor that you need to scale at the cloud at the physical level. So they developed what's called the Nitro system today, which provides physical separation for things like the data that's running on the instance from the communication that's controlling the instance.
And so this is both how AWS scales and how AWS provides such strong security guarantees: because physically there are two different controls. There's one physical chip, or one physical component of the hardware system, that is managing the data,
the customer's data, and there's a different physical control that's managing the governance of the instance. And so every modern EC2 instance today is built on the Nitro system. So the first major development from Annapurna Labs was Nitro. So that's Nitro, like nitroglycerin, N-I-T-R-O. N-I-T-R-O, yeah. Explosive. Yes, yes.
So after the Nitro system, Annapurna started developing their second sort of main product line, which is Graviton. So Graviton are custom CPUs, custom ARM-based CPUs developed by Annapurna Labs.
And if you watched re:Invent, one of the highlights that you saw is that today, more than half of new compute that comes onto AWS is actually Graviton CPU.
Oh, yes. So when you're looking at instances on AWS, when you see that little g at the end of a family, so like a C6g, or even a G5g, that second g means it's a Graviton CPU. And so that means you're going to get much better performance at a very, you know, competitive price.
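As an editorial aside: if you want to see which instance types in your region are ARM-based (which, on AWS, means Graviton), here is a minimal sketch using boto3. The EC2 filter shown is a standard one, but treat the snippet as illustrative rather than an AWS-endorsed recipe; it assumes your AWS credentials are already configured.

```python
# Minimal sketch: list a few ARM64 (Graviton) instance types with boto3.
# Assumes AWS credentials are configured; purely illustrative.
import boto3

ec2 = boto3.client("ec2", region_name="us-east-1")
resp = ec2.describe_instance_types(
    Filters=[{"Name": "processor-info.supported-architecture", "Values": ["arm64"]}],
    MaxResults=5,  # the API allows 5-100 results per page
)
for itype in resp["InstanceTypes"]:
    print(itype["InstanceType"])  # e.g. instance families ending in "g"
```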
And so the Graviton CPU is our second main product line. And then Trainium and Inferentia are the third main product category from Annapurna Labs, which is: now let's take this large,
awesome ability that we've created in developing infrastructure and scaling infrastructure across AWS, and let's focus that on AI/ML. And so Inferentia, of course, was developed and, you know, came out a number of years ago. Trainium 2 is our third-generation chip.
So it's the third-generation accelerator for AI/ML. And that is why it's such an exciting moment, right? Because you see the scope
and the incredible results that Annapurna has delivered over the years, and now a large focus is AI/ML. And so when, you know, customers are taking advantage of this, fundamentally they're interested because they get the benefits of price performance. More than anything, it's this, you know, benefit of
highly optimized compute that is scarily energy efficient. Annapurna is so good at identifying improvement areas to just take cost out of the equation and reduce complexity, and pass performance and cost savings back to customers
while, you know, meeting performance and in many cases exceeding performance. So Trn2 is actually the most powerful EC2 instance on AWS for AI/ML, like, full stop, when you look at the, you know, performance metrics that we're seeing. It's a very exciting moment. It's an exciting moment for customers, an exciting moment for the whole group. Trainium 2 is the most powerful on AWS?
Correct. Performance and power are certainly crucial for measuring chip efficacy, but what good is AWS's chip if it's not being used to deploy AI models? Deployment can become a real problem for teams where data scientists outnumber software engineers. In episode 879, I talked about the issues of model deployment with Dr. Greg Michelson, co-founder and chief product officer of Zerve.
Nice. So another kind of tricky thing that data scientists, maybe even myself, have difficulty with is deploying AI models. So something that's been intuitive for me for literally decades is opening up
some kind of IDE, a Jupyter notebook, something like that, and getting going on inputting some data, doing some EDA, and building a model. But the thing that hasn't been intuitive to me, and it's probably just because I haven't been doing it as much, having had the luxury of working at companies where machine learning engineers or software developers, backend engineers, then take the model weights that I've created and put them into a production system. So
on a smaller team, or on a team where there's huge demand for software engineers, which is often the case, you can end up having more data scientists creating models than there are software engineers to deploy them, which in a lot of companies creates a bottleneck. So how does Zerve's built-in API builder and GPU manager remove those kinds of barriers?
Yeah, it's not just a bottleneck. It's also kind of a problematic dependency, because at the end of the day, the software developers that are deploying these things probably aren't data scientists. So it's not obvious that they are going to understand what is supposed to be done. And, you know, there's a lot of subtlety to this sort of thing, so mistakes can get introduced really easily here as well.
So, yeah. So if you think about the deployment process, you know, there are a lot of hurdles to overcome. If you've ever been Slacked or emailed a Jupyter notebook and tried to run it, you know what some of them are, right? Like, you have the wrong version of this package installed. Oh, you've got to pip install a whole bunch of other stuff to make that work. And so you might spend an hour
trying to even get the code to run, assuming that you have the data and that all the folders and file paths are the same and all that sort of stuff. So, you know, at the end of the day, what data scientists spend most of their time doing today is building prototypes. And then those prototypes get handed off to another team to kind of recode
in another environment, you know, Dockerized and deployed, with managed servers and stuff like that. But it's not obvious to me that data scientists know how to do that. And it's really not obvious that they have the privileges to do those kinds of things, in terms of just the infrastructure and all that kind of stuff. So Zerve kind of
handles all of those problems. So every canvas inside Zerve has a Docker container that's supporting it. So anybody that logs into that canvas doesn't have to worry about dependencies, because it's all saved in that project. And so those environments are reusable and shareable and so on. So if I wanted to start a new project using the same Docker container that
another project was in, it's really easy to do that. And so, you know, when you have a new data scientist join your team, they don't have to spend their first week getting Python installed and making sure everything... oh, we use NumPy 0.19 and you've got 0.23 installed. Like, none of those conversations have to really happen anymore, because we manage all of that.
And then let's say that I did train, like, a random forest. I mean, you mentioned model weights. If I train a linear model or a logistic regression or something, then maybe it's just a vector of weights that needs to be handed off. But if it's a more complicated model, like a random forest or an XGBoost model or a neural network or something like that, it's not as simple as just, here are some weights to put into a formula.
It's a more complex thing. And so then you've got to figure out, okay, I'm going to serialize this model, pickle it, and then dump all the dependencies out and Dockerize it and then hand that thing off. And that's also beyond the skill set of a lot of data scientists too. So Zerve handles all of that. So every block inside of Zerve, when you execute it, creates serialized versions of all of the variables that you've worked through.
So if I train a random forest in a block, then it's there and it's accessible. So I can access it from outside Zerve using, like, an API. I can reference it in other layers. So when it comes time to, say, make an API, maybe I want to make a POST route where I send in a payload of predictor columns,
and then I want a prediction back from that random forest. Well, then I just say, hey, remember that random forest? And I just point at it, instead of having to figure out, you know, how to package that thing up so that it could be deployed as an API.
So we handle all of that stuff. And then when you deploy in Zerve, you also don't have to worry about the infrastructure stuff, because all of our APIs utilize Lambdas, like serverless technology again, so you don't have long-running services that are out there. It's just there. So a lot of the infrastructure stuff and the DevOps stuff and the kind of picky engineering stuff that can trip you up is stuff that we've just sort of handled so that it's easy for the user.
And that means that data scientists can start to deploy their own stuff. But in some organizations, they still might not be allowed. So then we have, like, a handoff system where it's really easy to take something that data scientists have done, who, by the way, aren't building prototypes anymore. Now they're building software that can actually be deployed in Zerve. And we can hand that off to other teams to actually do
deployments.
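To make concrete the manual handoff work Greg describes, here is a rough sketch of what that boilerplate looks like if you do it by hand: pickle a trained model, then expose it behind a POST route that accepts predictor columns and returns a prediction. This is a generic illustration, not Zerve's actual API; the file and route names are made up for the example.

```python
# Generic sketch of the by-hand deployment work described above (not Zerve's API;
# file and route names are illustrative). Assumes scikit-learn, fastapi, and uvicorn.
import pickle

from fastapi import FastAPI
from pydantic import BaseModel
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier

# 1. Train and serialize the model (the "pickle it" step a data scientist would do).
X, y = make_classification(n_samples=200, n_features=4, random_state=0)
with open("model.pkl", "wb") as f:
    pickle.dump(RandomForestClassifier(random_state=0).fit(X, y), f)

# 2. Expose it behind a POST route (the wrapper a platform would generate for you).
with open("model.pkl", "rb") as f:
    model = pickle.load(f)

app = FastAPI()

class Payload(BaseModel):
    features: list[float]  # one row of predictor columns

@app.post("/predict")
def predict(payload: Payload) -> dict:
    return {"prediction": int(model.predict([payload.features])[0])}

# Run with: uvicorn this_module:app
# ...and that is before Dockerizing it, pinning dependencies, and provisioning servers.
```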
And I wanted to know, how does this capability power up AI chips? When we were doing research about you, we uncovered something called heterogeneous integration in AI chips. So what is heterogeneous integration? And how does it impact performance and the packaging density of AI chips? This density thing being critical to building more and more powerful chips, because obviously the more transistors you can get in a smaller space,
the more powerful a chip can be. Yeah, that's an important area, and I touched on it earlier in our conversation. So this is the "more than Moore" question: what dimension drives performance, or allows you to scale performance, beyond just making smaller transistors on a chip? This is the additional dimension driven by heterogeneous integration. And maybe let me just
quickly, in a sentence, come back to the AI for AI. We have branded it materials intelligence: the use of artificial intelligence to drive the development of novel materials for applications in electronics. We call that materials intelligence. And this is how our team works as a global R&D team: not just
in the traditional way, sequentially improving properties of materials, but using AI to replace experiments in order to
avoid unnecessary experiments and go straight to where it really matters. Where can you really make a difference for the customer's technology? How can you anticipate how a material works in a customer's setup, and how does it drive the solution of their problems, and not just
chemical properties at first glance? So this is how we drive the development of novel materials. We talk about millions of different options that need to be optimized in order to drive the performance of materials. So this is just to give you an idea of how that blends into AI for AI. Second is then driving the different aspects of how our customers improve the
performance of their devices. And besides
shrinking the transistor, building more integrated systems through heterogeneous integration is the important area here. You know, it started traditionally with what is called kind of a front-end process, making a transistor, and the back end was then: you wire it somehow so that at the end the signal gets to the outside, which is then called, in a broader scheme, packaging.
Now, there's something between these two extremes. It's called heterogeneous integration: when, at the end, the chip is not just one die, one single chip anymore, and you combine different chips into a system. And I refer to it in this specific example as CoWoS. These structures are being built in the examples I've used. I could use
different customer examples here as well; I just wanted to use one nomenclature, which is pretty common in the current conversation. This is when you glue dies on top of one another in order to build memory stacks, for example. Or you build a memory stack and you kind of almost glue it next to a GPU in order to shorten the transfer of data and make it more efficient to get the data to the GPU.
And that is what heterogeneous integration makes possible. And it requires, of course, technologies well advanced from what was used in packaging historically: much smaller structure sizes, much more complicated efforts to get the heat out of the system, as one example, or to optimize power consumption. The precision required then needs different, more front-end-like
technologies, which makes it, of course, an area for materials innovations and for metrology innovation, as per what our company is focused on. Widening the capabilities of real-world applications is a must for new AI product developers,
and AI product manager Shirish Gupta has come up with the easy-to-remember mnemonic AI PC to help you determine whether your particular application might be ideally suited to local inference with an AI PC, an artificial intelligence personal computer, as opposed to relying on cloud compute. Here's my final In Case You Missed It clip, coming from episode 877.
So we're talking about taking capabilities that today might require you to have an internet connection and depend upon some cloud service in order to get some kind of, say, large language model or other foundation model capability. But instead, with an NPU, you could potentially have the operations, the inference-time calls, instead of going out over the internet
and using cloud compute, you can have them running locally on device. So you're also probably going to get lower latency. You have fewer dependencies. Yeah, talk us through some of the other advantages of being able to now do things on the edge instead of having cloud reliance. Yeah, I think this is a perfect segue. In fact, this is a mnemonic that I came up with myself.
The term that is being thrown around for these devices with NPUs is an AI PC. I'm sure you've heard of it, right? So to think about the benefits of an AI PC, I've created a mnemonic with those four letters. So A is accelerated: you basically now have a local hardware accelerator that gives you that low-latency, real-time performance for things like
translation, transcription, captioning, and other use cases where latency is super important for persistent workloads. That's A. I is individualized. Again, this is great because if you have an AI that is on your box, it has the ability to learn your styles. Let's say if you're creating emails, if you're using it to generate emails, it's learning your style, it starts writing in your style.
It's great for... you know, we had a healthcare customer that we've been working with on a use case where there are two parts to it. I'll talk about the second part. The first part is even more interesting, but I think it's related to a different example that we'll come back to later. But the second part of the AI solution is that they were taking
information from a physician's diagnosis of a patient in the ER, and they were using that information to auto-generate the physician's report. You know, mundane stuff; physicians don't like spending time on that. They'd rather go to the next patient, have that interaction, you know, increase their ability to spend time with patients. The feedback they gave was: you know, with this solution, now that it's started seeing the way that I'm changing
and editing its initial draft, it's starting to take on my style. And now it just sounds like me. And I love it, because I don't have to do this report generation. It does it for me, and I've got more time for my patients. So that's the individualized value. The third is P: it's private. Like you said, the data doesn't have to leave your device and its immediate ecosystem.
You don't have to send it back and forth to a public tenant, or even a private tenant for that matter. You may have confidential information with PII that you have access to but don't want to merge with even a private tenant. There is sensitive information like that, or unclassified information, depending on your vantage point. So that inherent privacy of data
and the inherent security of running the model locally on your device gives you that assurance that this is more private than it would otherwise be. So that's P. And C, this is really important, because I hear this from customers: it is an important cost paradigm shift. And I'm starting to hear this from some of our early, maybe earliest, adopters of on-device AI,
which, by the way, is not ubiquitous today, right? In terms of enterprises building out their own AI capabilities and using on-device accelerators for offloading that, we're at the tip of the spear with Dell Pro AI Studio, and we'll come back to that later. But the early adopters, what they say... and I had a FinServ, or financial services, customer tell me, "Shirish, my developers are using code gen,
and they're using our data center compute. 15% of my data center compute is going to these developers that are using it for code gen or code completion or writing test cases, unit tests, what have you. They all have PCs. I want to get them to an AI PC with a performant NPU
so I can offload that compute from my data center, because they don't need H100s to do that code completion. I think I can do that with the NPUs on your Dell devices." So that's a real opportunity as well. Just because you have the compute doesn't mean you should use it. It's like the right compute, or the right engine, for the right workload at the right time, right? So there's plenty
of use cases where offloading from even your private data center to an on-device capability makes a ton of sense. And then, if you're actually using the cloud, you're paying for every inference, right? It's tokens and API access. So now that you've got an AI PC, it's no cost to you. You built your solution, you're using it on the device,
that's it. So cost is a big factor. Now, you'll argue that the cost of inferencing in the cloud is coming down. It's scaling very fast. But again, I get back to the point that it's the right engine for the right use case at the right time.
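As an editorial aside for readers who want to try the on-device pattern Shirish describes, here is a minimal sketch using ONNX Runtime execution providers. It is not part of Dell Pro AI Studio; the provider names vary by silicon vendor and ONNX Runtime build (QNNExecutionProvider is the one commonly used for Qualcomm NPUs), and model.onnx is a placeholder for whatever model you have exported.

```python
# Minimal sketch: run a local ONNX model, preferring an NPU execution provider
# and falling back to CPU. Provider availability depends on your hardware and the
# ONNX Runtime build you installed; treat the names below as assumptions.
import numpy as np
import onnxruntime as ort

preferred = ["QNNExecutionProvider", "DmlExecutionProvider", "CPUExecutionProvider"]
providers = [p for p in preferred if p in ort.get_available_providers()]

session = ort.InferenceSession("model.onnx", providers=providers)  # placeholder path
print("Running on:", session.get_providers()[0])

# Build a dummy input matching the model's first input (assumes float32 inputs).
meta = session.get_inputs()[0]
shape = [d if isinstance(d, int) else 1 for d in meta.shape]  # resolve dynamic dims
outputs = session.run(None, {meta.name: np.random.rand(*shape).astype(np.float32)})
print("Output shape:", outputs[0].shape)
```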
All right, that's it for today's In Case You Missed It episode. Be sure not to miss any of our exciting upcoming episodes. Subscribe to this podcast if you haven't already. But most importantly, I hope you'll just keep on listening. Until next time, keep on rocking it out there. And I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.