This is episode number 881 with Emily Webber, Principal Solutions Architect at AWS. Today's episode is brought to you by ODSC, the Open Data Science Conference.
Welcome to the Super Data Science Podcast, the most listened to podcast in the data science industry. Each week, we bring you fun and inspiring people and ideas exploring the cutting edge of machine learning, AI, and related technologies that are transforming our world for the better. I'm your host, Jon Krohn. Thanks for joining me today. And now let's make the complex simple. Welcome back to the Super Data Science Podcast. Today, I'm delighted to have the amusing, brilliant,
and Zen, Emily Webber, as our guest on the show. Emily is a principal solutions architect in the elite Annapurna Labs machine learning service team that's part of Amazon Web Services. She works directly on the Trainium and Inferentia hardware accelerators for, respectively, training and making inferences with AI models.
She also works on NKI, or "Nicky," the Neuron Kernel Interface, which acts as a bare-metal language and compiler for programming AWS instances that use Trainium and Inferentia chips. She wrote a book on pre-training foundation models. She spent six years developing distributed systems for customers on Amazon's cloud-based machine learning platform, SageMaker.
And she leads the Neuron data science community as well as technical aspects of the Build on Trainium program, a $110 million credit investment program for academic researchers.
Today's episode is on the technical side and will appeal most to anyone who's keen to understand the relationship between today's gigantic AI models and the hardware that they run on. In today's episode, Emily details the little-known story of how Annapurna Labs revolutionized cloud computing, what it takes to design hardware that can efficiently train and deploy models with billions of parameters...
how Trainium2 became the most powerful AI chip on AWS, why AWS is investing $110 million worth of compute credits in academic AI research, and how meditation and Buddhist practice can enhance your focus and problem-solving abilities in tech. All right, you ready for this fabulous episode? Let's go. Emily, welcome to the Super Data Science Podcast. I'm so excited to have you on the show. Where are you calling in from today?
Hi, Jon. Excited to be here. I'm calling in from Washington, D.C. Nice. It's interesting times in that part of the world.
lots of things happening, but we're not here. We've never been a political show. We won't get into it. At the time of recording, I'm excited. I'm looking forward to being at the Data and AI Summit in Richmond, Virginia, which is not crazy far from D.C., or at least Virginia isn't. And that part of the world, Virginia near D.C., I've always really enjoyed everything about it, except the traffic.
Yeah, no, the traffic is tough. Actually, this time of year, it's lovely because the cherry blossoms are just beginning to bloom. So peak season for cherry blossoms is coming up at the end of March. But one of the primary reasons I'm in D.C. is because it's the HQ2 area for Amazon.
So it's our second headquarters. You may remember a number of years ago, we did this sort of HQ2 search and Crystal City, Virginia was awarded HQ2. And so I moved out here a number of years ago to be a part of all of the activities and everything that's going on there.
That's very cool. Now that also, it reminds me though, wasn't it initially supposed to be Manhattan? It was supposed to be New York City and then there was like an uprising against it and so they had to pick somewhere else because it was like, there was this concern, I can't remember exactly, but that like it would change Manhattan
too much, like too fast, like this huge influx of people into an already busy place or something like that. Yeah, no, there were, there were a lot of great cities, obviously a lot of great choices. I think the original spec was, was spread across three cities. I think when they first announced it, it was like New York, DC. And then I want to say somewhere in Tennessee, if I'm not mistaken. And,
And then that's sort of boiled down into definitely the D.C. area and some other places as well. But yeah, primarily D.C. Nice. Well, I'm glad it's working out there. It sounds like a great environment to work in. Certainly, AWS is doing a lot of exciting things. I thought that we might start, we almost never start with going with somebody's career path.
But in your case, we're going to do that because you have a unique career trajectory that I think provides some good context for the rest of the episode. So you started with a degree in international finance.
And now you've been a hands-on practitioner at Amazon for some time working on AI and machine learning. So tell us about that transition to what you're doing today, your draw to AI and ML. Yeah, totally. So I would say I got into computer science a little bit later, definitely. I lived in Arizona, actually; that's where I got that degree, from a school called Prescott College.
And I studied definitely finance. I was actually interested in Buddhism as well. So I lived at a retreat center for many years and studied. Yeah, I studied meditation and all sorts of things. You do seem super Zen, super empathetic.
Our listeners wouldn't know this, but we were talking for a while before starting recording. And I was like, wow, Emily is just such an engaging, empathetic person. And all that time in the monastery, I think, paid off. Yes, I find myself coming back to this grounding moment
Many times, actually, because when we're in computer science, right, when we're trying to solve an algorithmic problem, trying to solve a compute problem, a development problem, you know, many times what we really need is focus. Actually, we need the ability to just bring our mind back to what the goal is, what the details are, what the challenge is.
and not be overwhelmed by getting too fixated on something or being afraid of something. And so just developing this sort of mental ability to like calmly abide and calmly focus has honestly been really helpful in my computer science degree.
So I studied at the University of Chicago after that and did a joint degree that was a master's of public policy with computational analysis, actually. So studying, like, public policy projects through the lens of computer science. And so that was where I developed a love of data science. I interned at what's called the Data Science for Social Good summer fellowship,
where we analyzed public policy problems and worked with organizations who were nonprofits or NGOs, analyzed their data, and then delivered projects to them. And so that's sort of where I got very interested, obviously, in technology, technology development, and trying to make a positive impact in the world. And that has led me to AWS. Very interesting.
Very nice. And yeah, you've worked extensively with SageMaker, which a lot of our data science listeners would be familiar with. Maybe you can give, because you do an even better job than I could at explaining what SageMaker is. So you can let us know about SageMaker and other AWS AI services that you've worked with. But now you're working on the Trainium and Inferentia team. So it's hardware, compute hardware that you would use instead of a GPU. You'd use a Trainium or Inferentia chip
to be doing a lot of the heavy lifting in training in the case of Trainium or at inference time with the Inferentia chip. And yeah, so fill us in on SageMaker, other AWS AI services that you worked on in the past, and why hardware, Trainium and Inferentia, took your fancy recently.
Yeah, absolutely. So I joined Amazon actually as one of our first SageMaker solutions architects, SAs. So I, you know, got to work with some of our earliest customers in the SageMaker days and figure out... What's an SA?
Cool. So what is an SA? So a solutions architect at AWS, fundamentally, we work with customers. So that means your fingers are on the heartbeat. They're on the pulse of the business or on the pulse of the service because you're explaining what the service does to customers every day.
You're in the weeds with developers, with data scientists, leadership on both the customer and the service team about what feature A is doing, how
how well it's doing and what it needs to do in the future. So I love being a solutions architect. I've always profoundly enjoyed this role, because you have visibility into the whole picture. You get to be a part of the whole life cycle. And so I was one of our first solution architects for SageMaker.
So SageMaker is managed ML infrastructure at AWS. Essentially, you can use SageMaker to spin up a notebook server, use SageMaker to spin up what we call training jobs, which is where you're training your model in the context of a job, and use SageMaker to spin up ML hosting infrastructure. We have prepackaged models available in SageMaker that you can pull down for training and hosting. And we have a really cool development environment, SageMaker Studio and the Unified Studio, for data scientists. So actually what it does is it decouples the UI that's hosting your development environment from all of the compute that's, like, running your notebook and running your analytics job, and we package it up really, really nicely. So SageMaker Studio is a great, like, data science workbench, for example, where an enterprise data science team can just get onboarded and have all the tools that they need to go analyze some data and train some models.
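To make that concrete, here is a minimal sketch of launching a SageMaker training job on a Trainium-backed instance with the SageMaker Python SDK. The script name, IAM role, and version strings are placeholders, and in practice you would usually point the estimator at a Neuron deep learning container; treat this as a rough outline rather than a tested recipe.

```python
# Minimal sketch, not a tested recipe: values below are placeholders.
from sagemaker.pytorch import PyTorch

estimator = PyTorch(
    entry_point="train.py",  # hypothetical training script
    role="arn:aws:iam::123456789012:role/SageMakerExecutionRole",  # placeholder IAM role
    instance_type="ml.trn1.32xlarge",  # a Trainium-backed SageMaker training instance
    instance_count=1,
    framework_version="2.1",  # placeholder; match the container/SDK you actually use
    py_version="py310",
)

# Kicks off a managed training job: SageMaker provisions the instance, runs
# train.py against the "train" data channel, then tears the infrastructure down.
estimator.fit({"train": "s3://my-bucket/training-data/"})
```

The same estimator pattern is what SageMaker Studio notebooks generate under the hood; the only Trainium-specific choice here is the instance type.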
Very nice. Yeah. So then what was the transition? Like, why did you go from software to hardware in the AI space? Yeah, absolutely. So through many years on SageMaker,
like many people, I saw how important foundation models were. It was obvious that customers were increasingly going to foundation models for, you know, their ability to unlock a variety of use cases, but also the size of the models just kept getting larger and larger and they were just consuming so many resources. And so I
set up many of our distributed training capabilities. So we were running distributed training workshops with customers. We were doing accelerator health checks. We were developing managed clusters, and that led to a service called SageMaker HyperPod, which is a fully managed
parallel environment for establishing clusters, essentially. So when you want to train and host large language models and large foundation models on AWS, SageMaker HyperPod is a really easy way to have a managed Slurm environment that you can hop into and take advantage of optimized libraries and have a variety of health checks and cluster management tools
already available for you without needing to develop that. And so through this journey, I became convinced that obviously foundation models were the future of AI. But I also saw increasingly how infrastructure was just the make or break. Like really everything came down to from a customer perspective, how
many accelerators can I get? What is the size of those accelerators? How healthy are they? And how efficiently can I train and host my models on top of that? Once I realized that that was the game, that that was the primary focus for customers, I
I just wanted to dive in and figure out what does it take to actually develop a new accelerator? How do you develop a software stack on top of that? And then how do you expose that through the rest of the cloud? So fundamentally, I love the business opportunity. It's just really exciting to think about obviously developing new accelerators and bringing those to customers. But also the technical problems are just so interesting. Like it is
absolutely a joy to sit down and think about, okay, how do I write a kernel for this algorithm? How do we design like communication collectives for this whole host of workloads, like reinventing many of the foundations of the ML technology stack as a whole on the cloud is just the absolute biggest draw in my mind.
Excited to announce, my friends, that the 10th annual ODSC East, the Open Data Science Conference East, the one conference you don't want to miss in 2025, is returning to Boston from May 13th to 15th. And I'll be there leading a four-hour hands-on workshop on designing and deploying AI agents in Python.
ODSC East is three days packed with hands-on sessions and deep dives into cutting-edge AI topics, all taught by world-class AI experts. Plus, there are many great networking opportunities. ODSC East is seriously my favorite conference in the world. No matter your skill level, ODSC East will help you gain the AI expertise to take your career to the next level. Don't miss out; the online special discount ends soon. Learn more at odsc.com/boston.
Wow. Yeah, your genuine excitement for it really shines through. Absolutely. So you've said the word accelerator a few times. I just want to disambiguate. That is, so earlier when I said you would use a Trainium or Inferentia chip in lieu of a GPU, that's what the
term accelerator would apply to; it's the broader umbrella that includes Trainium, Inferentia, and GPUs. These are all different kinds of hardware accelerators, specialized in the case of Trainium and Inferentia specifically for neural networks, for deep learning models like the large language models that have taken the world by storm, the foundation models that got you excited about moving into this space. And it is, it's hardware driven. It's such an interesting phenomenon to watch
from a distance where the scientific advances, the kind of new ideas in terms of how we should model things,
they're not necessarily super fast moving. Like the transformer idea, many years later, is still the dominant paradigm. And at some point that may be replaced, and that builds upon deep learning, which seems like an even more entrenched paradigm that will be difficult to shake. Maybe that's a nice thing when you're designing accelerators, because it means you have some kinds of linear algebra, some kinds of matrix multiplication operations, where you can be like, we're probably still going to be doing that in five years.
But yeah, it sounds like a really exciting space to be working in. There's a term that you mentioned as you were describing what excites you most about your work that I've got to admit I don't understand very well. And so I bet a lot of our audience doesn't as well, which is this idea of a kernel. So when you talk about an algorithm kernel, what does that mean?
Absolutely. So fundamentally a kernel is a function that's defined by the user. And when you're thinking about programming up at the Python level, we don't really think about it that way. Everything we define is user defined. So we're like, well, everything I write is a user-defined function.
This thinking breaks down the further down the compute stack you go. So if you want to run a program on Trainium or Inferentia, for example, the way that happens is you write your program in Python, you write your program in PyTorch.
And then you're going to compile that through something that's called PyTorch XLA. So accelerated linear algebra. What PyTorch XLA is going to do, it's going to take the model that you defined and it's going to represent that as a graph.
essentially. So the structure of your model is represented as a graph. We call that graph an HLO, high level operations. So you get this HLO graph and then essentially we do a handshake between that HLO graph that's generated from XLA and we feed that into the top of our compiler. And so we maintain a compiler that takes the graph
that you produced from PyTorch and from PyTorch XLA and we convert that through a variety of algorithms and processes to ultimately generate the instruction set that actually gets executed on the hardware directly. So what's a kernel? A kernel is where you override the compiler
and you get to define the operations on the chip yourself using our kernel library. And our kernel library is called NKI, the Neuron Kernel Interface. So fundamentally, a kernel
is a function that's defined by the user and not as generated through the compiler. Now, there's a huge variety of sizes of kernels, right? So you can have a kernel that's really just a hello world function. It's like, hey, I did a matmul or I did like tensor add, right? And that's like to get the software working and make sure you have the environment ready.
But then what most people will do is build on top of that to define a full operation. So you'll define like a full forward pass for your model or a full backward pass or even just a part of it, like maybe just the MLP, you know, up projection or down projection.
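For a sense of what that "hello world" level looks like, here is a rough sketch of an element-wise tensor-add kernel in the spirit of the public NKI tutorials. The module paths, decorator, and tile-size limits follow the NKI documentation at the time of writing and may shift between Neuron SDK releases, so check the current docs before relying on the exact names.

```python
import torch
import torch_xla.core.xla_model as xm

import neuronxcc.nki as nki
import neuronxcc.nki.language as nl


@nki.jit
def nki_tensor_add(a_input, b_input):
    # Output tensor lives in device HBM, visible to the calling framework.
    c_output = nl.ndarray(a_input.shape, dtype=a_input.dtype, buffer=nl.shared_hbm)

    # Move one tile of each input from HBM into on-chip memory.
    # (NKI tiles are limited to 128 partitions, hence the small shapes below.)
    a_tile = nl.load(a_input)
    b_tile = nl.load(b_input)

    # The actual compute: an element-wise add executed on the NeuronCore.
    c_tile = nl.add(a_tile, b_tile)

    # Write the result back out to HBM and hand it to the framework.
    nl.store(c_output, value=c_tile)
    return c_output


# Calling it looks like calling any other function on XLA-device tensors.
device = xm.xla_device()
a = torch.rand((128, 512), device=device)
b = torch.rand((128, 512), device=device)
c = nki_tensor_add(a, b)
```

The point is the shape of the workflow: you bypass the compiler's generated code for just this operation, while the rest of the model still goes through the normal PyTorch/XLA path.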
And then what you're doing is you're studying the compute optimization of that kernel. Like, you want to look at the data movement. You want to look at the, you know, utilization. You want to look at your memory utilization, your compute utilization. And so, I know you've had Ron on the show in the past, and everything Ron, you know, teaches us and teaches the world about compute optimization, we try to apply that when we're
developing our kernels. So we study, like, the compute requirements of a workload and we try to improve it. Like, that is the heart of writing a kernel:
implementing this algorithm that you have that's trying to improve something for large language models; we implement that as a kernel in order to improve the performance for it. An excellent explanation. You are an outstanding teacher, and we will actually get to some of the educational stuff that you've been doing later in this episode. You're kind of an inspiration there, but you're naturally such an amazing explainer. That was, like,
99th percentile of explanations of technical concepts that I've ever heard. So thank you for that introduction to kernels. And if people are interested in that Ron Diamant episode, it's number 691 of this podcast. Also an amazing explainer of technical concepts. So if you want to understand a lot about designing accelerators, we talked a ton about that in that episode, and I learned so much. It was amazing. In fact,
Ron is such a luminary in this space that at NeurIPS, Neural Information Processing Systems, arguably the most prestigious academic AI conference in the world. I was there in Vancouver in December and I met somebody new at lunch or dinner or something, I can't remember exactly the context, but they worked on Trainium and Inferentia chips and I said, oh, we had someone from the show. He was like, was it Ron? Yeah.
So, yes. Absolutely. He's an iconic person in this space. Yeah, so fantastic. There was a lot in there that some people might want to go back over to learn again about kernels. There was one term in there
that I might define quickly for the audience. So you said MLP, kind of casually, in there as one of the things that you could be implementing in a kernel. And so multilayer perceptron, I'm guessing, is what that is there. And so it's kind of like one of the fundamental building blocks, it's like
When you're thinking about building a deep learning network, a multi-layered perceptron is kind of like a, it was an early deep learning network, but then you can also think about it now as something that you can scale up into a bigger architecture. Yes. And it's really interesting to think about how we represent data for kernels, actually. So the MLP itself is,
When you're designing, say, a baby MLP or a tiny MLP, in PyTorch, it's crazy easy to do that. It's so easy to just define your tensor, define the operations you want to do, and call it. And from the PyTorch perspective, that's it. Your job is done. You've created an MLP.
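For anyone who wants to see just how little PyTorch code that "baby MLP" takes, here is a minimal sketch:

```python
import torch
import torch.nn as nn

# A "baby" MLP in exactly the spirit Emily describes: an up-projection,
# a nonlinearity, and a down-projection. Defining it is the easy part.
mlp = nn.Sequential(
    nn.Linear(1024, 4096),  # up-projection
    nn.GELU(),
    nn.Linear(4096, 1024),  # down-projection
)

x = torch.randn(8, 1024)  # a toy batch of eight token embeddings
y = mlp(x)                # forward pass; y.shape == (8, 1024)
```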
But it becomes so interesting when you think about the size of that, like when you want to scale it up, when you want to shrink it, but also when you want to actually process it, when you want to run the computations on that to execute the operations you've defined. And it quickly becomes very challenging to do, actually. And so...
when we're defining our kernels and when we're defining our programs on Trainium, part of what we want to do is think about how we're representing the data, how we're structuring the data from the PyTorch perspective. And then actually the trick, the game, is to try to optimize the data representation and optimize the program for the hardware, actually. What we want to try and do is pick
like designs within the data structure and within the algorithm that leverage some of the lower level capabilities of the hardware to like ultimately get the best utilization and the best performance that we can. And then once you have sort of like hardware and software programs that are like well synced and like running together and like using the same assumptions, like
that's when you can really scale and get excellent utilization and then excellent price performance. And that's really where we want to help customers go. Nice. And so speaking of the connection between very popular deep learning libraries like PyTorch and the interaction of those libraries with your hardware, with Trainium and Inferentia accelerators,
there's something called the AWS Neuron SDK, software development kit, which is the SDK for these AI chips, for Trainium and Inferentia. Can you tell us how AWS Neuron enables builders to use the frameworks of their choice, like PyTorch or JAX, without having to worry about the underlying chip architecture?
Yeah, absolutely. So the Neuron SDK is a term that we use to cover a very, very large variety of tools. And the tools essentially are capabilities that we offer to developers to easily take advantage of Trainium and Inferentia. Some of the tools are really low-level things like the runtime,
the driver, the compiler that pulls everything together. Some of them are much higher level. So something like Torch NeuronX, or especially NeuronX Distributed, NxD. So NxD is really the primary modeling library that's really useful for customers, where when you want to go train a model and you want to go host a model on Trainium and Inferentia, NxD packages up
many of the lower-level complexities and makes them easily available for customers to access. So compiling your model, for example, is handled by NxD, and sharding your model, actually. So,
taking a model checkpoint, say, like, a Llama or, you know, a Pixtral model, and then sharding that across the accelerators that are available on your instance, NxD actually handles the model sharding for you, both from a data perspective, so taking the checkpoint itself and just splitting the checkpoint into, you know, N number of shards, but then also the communication
and the optimizer updates and the forward pass. So NxD is a very, very comprehensive modeling library. And so NxD is useful for, of course, implementing your own model, but also just pulling down a model and running it. So when you want to just
get something that's prepackaged and test it for, say, something like alignment or supervised fine-tuning or hosting, you can pull down the model packages that are prebuilt and preset with NxD and just run them with your experiments and with your changes. And there should be very little complexity that's exposed to the customer in those cases.
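As a small illustration of that "very little complexity exposed" idea, here is a hedged sketch of the Torch NeuronX path for inference, compiling an ordinary PyTorch module ahead of time. The toy model and shapes are placeholders; a real LLM would normally go through NeuronX Distributed so that sharding and collectives are handled for you.

```python
import torch
import torch_neuronx  # part of the Neuron SDK's PyTorch support

# A stand-in model; in practice this would be your real network.
model = torch.nn.Sequential(
    torch.nn.Linear(256, 256),
    torch.nn.ReLU(),
).eval()

example = torch.rand(1, 256)

# Ahead-of-time traces and compiles the model for Neuron devices.
neuron_model = torch_neuronx.trace(model, example)

# The compiled artifact behaves like a regular TorchScript module.
torch.jit.save(neuron_model, "model_neuron.pt")
print(neuron_model(example).shape)
```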
Very cool. We'll have a link to the Neuron SDK in the show notes. People can check that out more. But yeah, as usual, another great example of your ability to explain technical things really well. Thank you.
With your experience previously on the SageMaker side, which we talked about earlier, do Trainium and Inferentia work with SageMaker as well? Just as the SDK allows you to take your framework of choice, is it easy to have SageMaker blend on the hardware side with Trainium and Inferentia?
Yes, absolutely. I mean, you can run SageMaker notebook instances, you can run SageMaker Studio on Trainium. So if you want to, say, develop a new kernel or test, you know, NxD, you can do that very easily on SageMaker as a development environment. We also have many models
that we've already supported on NxD that we'll make available through what's called SageMaker JumpStart, where SageMaker JumpStart is sort of a marketplace for machine learning models and LLMs that are pre-packaged and available. And so when SageMaker customers are, say, browsing in SageMaker Studio, they can click a button to download the model.
But they're not actually downloading the model. What's happening is they're accessing the model through the marketplace, the training and hosting infrastructure, and a lot of the software is fully managed by SageMaker. And then customers can bring their own data sets. They can fine-tune the model. They can host the model, all through SageMaker JumpStart. And so that absolutely is well integrated with Trainium and Inferentia.
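A rough sketch of that JumpStart flow from the SageMaker Python SDK follows; the model ID and endpoint instance type are placeholders you would swap for an entry from the JumpStart catalog that is packaged for Neuron instances.

```python
# Hedged sketch: model_id and instance_type are placeholders, not a recommendation.
from sagemaker.jumpstart.model import JumpStartModel

model = JumpStartModel(model_id="meta-textgeneration-llama-3-1-8b")  # placeholder ID

predictor = model.deploy(
    instance_type="ml.inf2.xlarge",  # an Inferentia2-backed endpoint (placeholder size)
    accept_eula=True,                # some JumpStart models require accepting a license
)

print(predictor.predict({"inputs": "Explain what a NeuronCore is in one sentence."}))
```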
Hey, this is your host, Jon Krohn. I'm excited to announce that this Northern Hemisphere spring, I'm launching my own data science consultancy, a firm called Y Carrot. If you're an ML practitioner who's familiar with Y-Hat, you may get our name.
But regardless of who you are, if you're looking for a team that combines decades of commercial experience in software development and machine learning with internationally recognized expertise in all the cutting-edge approaches, including Gen AI, multi-agent systems and RAG, well, now you've found us. We have rich experience across the entire project lifecycle, from problem scoping and proof of concept
through to high-volume production deployments. If you'd like to be one of our first clients, head to ycarrot.com and click Partner With Us to tell us how we can help. Again, that's Y Carrot, Y-C-A-double-R-O-T, dot com.
Very nice. And yeah, you've been working closely with customers on adopting Trainium. So it's something that's picking up a lot of speed, probably because Ron Diamant was on this show a couple of years ago. No doubt, no doubt. And so in fact, huge companies like Apple, Apple joined your re:Invent CEO keynote last year to talk about their use of Inferentia and another chip called Graviton, which you'll need to explain to us in a minute. Yeah.
Because we haven't talked about that on air ever. But yeah, they talked about their use of Inferentia and Graviton and why they're excited about Trainium2, another thing that we haven't talked about yet
in this episode. So what are some of the most interesting technical challenges that customers like Apple are trying to solve? So let's start there. You can tell us about this Graviton chip, the Trainium2 chip, and maybe this kind of relates to a general question that I've been meaning to ask you this whole episode and have just continued to forget with each wonderful explanation that you give after another, which is, why should somebody, why should a listener, for example,
consider using an accelerator like Trainium or Inferentia instead of a GPU? Maybe that's a great question to start with. And then I'll remind you of the series of questions that led me to that question. Sounds good. Thank you. Thank you. Yeah. So, I mean, fundamentally at AWS, you know, we really believe in customer choice.
We believe in a cloud service provider that enables customers to have choice about data sets, have choice about models, and have choice about accelerated hardware. We think it's good for customers to have that ability and to have real options that is ultimately best for consumers and that's best for customers. So fundamentally, that's the direction.
Annapurna Labs is an awesome company. Annapurna Labs has been building infrastructure for AWS for many years. So Annapurna Labs is a startup that Amazon acquired in 2015, primarily to develop the hypervisor actually. So they developed what's called the Nitro system. Yeah, we'll talk it through. So they developed, yeah, it's like the coolest story in tech that is the least told. So here's the scoop.
So in 2015, the way people were doing cloud 10 years ago is you had this thing called the hypervisor. And the hypervisor essentially was this giant monolithic software system that managed the entire host of all servers. And the challenge with the hypervisor systems is that
it made it really hard to innovate for the cloud because all of the control, the communication, the data at the server level was implemented in this giant monolithic thing called the hypervisor.
So Annapurna had this crazy idea of decoupling the parts of the hypervisor that you need to scale at the cloud at the physical level. So they developed what's called the Nitro system today, which provides physical separation for things like the data that's running on the instance from the communication that's controlling the instance.
And so this is both how AWS scales and how AWS provides such strong security guarantees: because physically there are two different controls. There's one physical chip, or one physical component of the hardware system, that is managing the
customer's data, and there's a different physical control that's managing the governance of the instance. And so every modern EC2 instance today is built on the Nitro system. So that was the first major development from Annapurna Labs: Nitro. So that's Nitro, like nitroglycerin, N-I-T-R-O. N-I-T-R-O, yeah. Explosive. Yes, yes.
So after the Nitro system, Annapurna started developing their second sort of main product line, which is Graviton.
So Graviton is a line of custom CPUs, custom ARM-based CPUs, developed by Annapurna Labs. And if you watched re:Invent, one of the highlights that you saw is that today more than half of new compute that comes onto AWS is actually on Graviton CPUs.
Oh, yes. So when you're looking at instances on AWS, when you see that little G at the end of a family, so like a C6G or even a G5G, that second G means it's a Graviton CPU. Uh, so that means you're going to get much better performance at a very, you know, competitive price. Uh,
And so the Graviton CPU is our second main product line. And then Trainium and Inferentia is the third main product category from Annapurna Labs, which is: now let's take this whole
awesome ability that we've created in developing infrastructure and scaling infrastructure across AWS, and let's focus that on AI/ML. And so Inferentia, of course, was developed and, you know, came out a number of years ago. Trainium2 is our third-generation chip.
So it's the third-generation accelerator for AI/ML. And that is why it's such an exciting moment, right? Because you see the breadth and the scope and the incredible results that Annapurna has delivered over the years. And now this is totally focused, and now a large focus is AI/ML.
And so when customers are taking advantage of this, fundamentally, they're interested because they get the benefits of price performance. More than anything, it's this benefit of highly optimized compute that is scarily energy efficient. Annapurna is so good at identifying
improvement areas to just take cost out of the equation and reduce complexity and pass performance and cost savings back to customers,
while meeting performance and in many cases exceeding it. So Trn2 is actually the most powerful EC2 instance on AWS for AI/ML, full stop, when you look at the performance metrics that we're seeing. It's a very exciting moment. It's an exciting moment for customers, an exciting moment for the whole group. Trainium2 is the most powerful on AWS.
Correct. Wow, that's super cool. And so what are the key differences between the first-generation Trainium chip and Trainium2? You know, this is all stuff that's new since Ron's episode on the show, since episode 691 a couple of years ago. And so...
Is it like one or more kind of big conceptual changes that lead to this leap from Trainium1 to Trainium2? Or is it kind of a bunch of incremental changes that together combine to have all this power in Trainium2 and such cost effectiveness?
Yeah, sure. So we try to keep it easy for you. And the way we keep it easy for you is that the core compute engine design isn't that different, actually. The neuron core itself, particularly between Trn1 and Trn2, is pretty much the same.
So what's nice about that is it means the kernels that you write for Trn1, and the development and modeling code with, say, NxD, are really, really easy to just move up from Trn1 to Trn2. The big difference... Can I interrupt you for one quick second? It sounds like, so are you saying Trn1 and Trn2? Is that like an abbreviation of Trainium? Yes. Okay. Okay. Gotcha. That's also the, that's the name of, yeah. Yeah.
An abbreviation of the name, and it's the instance name directly. Right, right, right. So yeah, gotcha. Nice. Yeah, Trn1 and Trn2. My apologies for interrupting. Carry on. No worries. Yeah, so the key difference between Trn1 and Trn2 is that on Trn2, you have 4x the compute. Yeah, that's a big number. That is a big number. Now, the reason why that happens is because you have four times more neuron cores
per card. So in Trn1, you have two neuron cores that are packaged up together in a single card, and then you have two HBM banks. And the combination of those is the accelerator.
In Trn2, you have eight neuron cores. So just multiply by four. You have eight neuron cores, you have four HBM banks, and each card itself has 96 gigs of HBM capacity. On the instance as a whole, you have 16 of those cards.
So at the instance as a whole, you have 1.5 terabytes of HBM capacity. And then we gave you an UltraServer. So the UltraServer is where you take four Trn2 instances
and these are all combined in one giant server, actually. So the reason we say that: it's two racks, four servers, and then 64 cards that are all connected by NeuronLink, which is our chip-to-chip interconnect,
such that you need at most two hops to get from any one card to any other card. When you do, like, a neuron-top or a neuron-ls on a single Trn2 instance, it's going to show you 128 trainable accelerators, because you have 128 neuron cores on your single instance.
We actually have a way of grouping those. You can group them by what we call like a logical neuron core, which is kind of cool because then you can change the size of the accelerator that you want based on the workload, which I think is very fun.
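As a hedged sketch only: that logical NeuronCore grouping is typically selected through an environment variable before your process launches, but the exact variable name below is an assumption, so confirm it against the current Neuron SDK documentation for your release.

```python
import os
import subprocess

# ASSUMPTION: the environment variable name for the logical NeuronCore (LNC)
# grouping; verify it in the Neuron SDK docs for your release.
os.environ["NEURON_LOGICAL_NC_CONFIG"] = "2"  # "1" = per-core devices, "2" = paired cores

# neuron-ls is the Neuron SDK's device-listing CLI; running it (with this
# environment) is a quick way to inspect the Neuron devices on the instance.
subprocess.run(["neuron-ls"], check=False)
```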
and then, yeah, those are all packaged up into this giant UltraServer. If you watched re:Invent, actually, Peter DeSantis wheeled out an UltraServer on stage and spent much of his keynote just talking about it. It's such an awesome moment. But so UltraServers are
unambiguously the best way to train and host the largest language models on AWS. And you have the most powerful instances combined in a really compelling, you know, innovative way that connects all of the cores and makes them very easy to train and then to host while minimizing the number of hops that you need to do between hosts because they're all logically one server. So UltraServer is pretty cool.
Ultra server does sound pretty cool. I might be putting you on the spot with this question, but how many model parameters of a large language model say, can you fit on an ultra server? Yeah. So it's kind of a weird question to answer, to be honest, because you can fit a lot of
But no, but what I mean is that realistically, you don't actually want to max out the memory. Like realistically, you want to give yourself space like for your batch size, for the optimizer, for the adapters. If you're training it, you're going to want to have multiple copies of it for something that's really large. If you're hosting it,
you're also going to want multiple copies of it because you're responding to many different users at a point in time. So it's actually a pretty complex question to answer, and it's highly use-case dependent. The rule of thumb we use, though, and it's not what will fit, but again, what is actually good for a normal use case: for a normal use case, language models that are in the 70-billion-parameter range,
we recommend those for Trn1. Trn1 is a good candidate for language models that aren't gigantic, but that are still sizable, and Trn1 gives you competitive and powerful compute for training and hosting those models. Language models that are significantly larger than that
go to Trn2. By all means, go play with an UltraServer and get all those neuron cores and see what you can do with it. And again, what's nice about it, what I love about the stack, is that NxD gives you both the connection into the compiler. So when you implement your modeling code in NxD,
by default, you get a nice sync with the Neuron compiler and all of the lower-level XLA benefits. But we also shard the model for you. So when you want to play with different TP degrees, like say you want to try a TP degree of 8 on Trn1, but then on Trn2 you want to try TP 32 and TP 64 and TP 128, because why not, and sort of see what happens,
NxD makes it super easy to do that, because you're just changing a parameter right at the top of the program to then shard your checkpoint itself and redefine your distribution method. And so, yeah, NxD handles all of that for you, which I just absolutely love.
Do you ever feel isolated, surrounded by people who don't share your enthusiasm for data science and technology? Do you wish to connect with more like-minded individuals? Well, look no further. Super Data Science Community is the perfect place to connect, interact, and exchange ideas with over 600 professionals in data science, machine learning, and AI. In addition to networking, you can get direct support for your career through the mentoring program where experienced members help beginners navigate.
Whether you're looking to learn, collaborate, or advance your career, our community is here to help you succeed. Join Kirill, Hadelin, and myself and hundreds of other members who connect daily. Start your free 14-day trial today at superdatascience.com and become a part of the community. Nice. And so, to define for our listeners that idea of TP 8, TP 32, TP 64, it's the precision of the digits of these model parameters, right? Yeah.
It is not. Oh, it's not? Yeah, no, it's not. What I meant by TP was, like, tensor parallel degree. Oh. So how many cores, or how many, yeah, how many neuron cores you'll use to host one
copy of your tensor, for example. So if you are doing a TP of eight, that means you're going to consume eight neuron cores to do operation X with your tensor. If you're doing a TP of 32, that means you're going to shard your model over those 32 neuron cores.
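To make the TP idea concrete without assuming any particular library, here is a framework-agnostic sketch of what a tensor-parallel degree means for a single weight matrix; real stacks such as NxD or Megatron wrap this in proper distributed layers and collectives.

```python
import torch

hidden, ffn = 4096, 16384
weight = torch.randn(hidden, ffn)  # the full up-projection weight
x = torch.randn(2, hidden)         # a toy batch of activations

for tp in (8, 32):
    shards = torch.chunk(weight, tp, dim=1)  # one column-wise shard per rank
    frac = shards[0].numel() / weight.numel()
    print(f"TP={tp}: each rank holds a {tuple(shards[0].shape)} shard ({frac:.1%} of the weight)")

    # Each rank computes its slice of the output; concatenating the slices
    # (an all-gather in a real system) reproduces the full result.
    partials = [x @ s for s in shards]
    assert torch.allclose(torch.cat(partials, dim=1), x @ weight, atol=1e-3)
```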
In a data type world, you would be thinking like FP32 and BF16 and, like, INT8. I know they're similar, but very different meaning. I said defined for our audience, but I ended up meaning defined for me. And so tell me about this. So this TP 8, TP 32 that now you've just explained, why would I make those changes and what impact does making those changes have?
Yeah, sure. So it is pretty impactful. It impacts the collectives a lot, actually. It impacts how much time your workload will spend in an all-reduce, for example, or in a reduce-scatter or an all-gather. So those are these collective operations that
support the program that you define and support your model. And they communicate across all of the cores and they collect information. And so you use collectives when you're running distributed training and hosting very regularly. And it's important to understand like the impact those collectives can have on your compute when you're profiling your workload and trying to improve it. And so, yeah,
When you experiment with different TP degrees, it can improve performance and it can also degrade performance because of the impact of the collectives, the impact on memory, it'll impact how large of a batch size you can hold, it'll impact your overall step time, etc.
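As a back-of-envelope illustration of why the collectives matter, here is a tiny calculator built on the standard ring all-reduce cost model; the bandwidth and latency numbers are hypothetical placeholders, not measurements of any AWS interconnect.

```python
def ring_all_reduce_cost(message_bytes, ranks, bandwidth_bytes_per_s, latency_s):
    """Standard ring all-reduce model: 2*(ranks-1) steps, and roughly
    2*(ranks-1)/ranks of the message moved per rank. Returns seconds."""
    volume = 2 * (ranks - 1) / ranks * message_bytes
    steps = 2 * (ranks - 1)
    return volume / bandwidth_bytes_per_s + steps * latency_s

MB = 1024 ** 2
for tp in (8, 32, 64, 128):
    # Hypothetical numbers purely for illustration: a 64 MB reduction over a
    # 100 GB/s link with 10 microseconds of per-step latency.
    t = ring_all_reduce_cost(64 * MB, tp, 100e9, 10e-6)
    print(f"TP={tp:>3}: ~{t * 1e3:.2f} ms per all-reduce (the latency term grows with TP)")
```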
And so that's why it's helpful to have this ability to easily test different TP degrees. Also on Trn2, because you have this
LNC feature, the logical neuron core feature, that lets you actually change the size of the accelerator logically based on grouping cores in sets of one, which is LNC 1, or in sets of two, which is LNC 2. And so what that does is it actually changes the total number of accelerators available to your program.
So on Trn2, an LNC of one shows you, like, 128 trainable devices, or available devices, to your program. But when you set LNC to two, that shrinks the number. So instead of 128, you see 64,
and it makes, you know, slightly more HBM bandwidth available per core. I mean, the banks stay the same physically; the hardware doesn't change at all between those two settings, but it changes how much is available per core to your program. So it's, you know, changes and modifications like this that let you
find, like, the optimal balance in your program and in your workload while easily experimenting with them through NxD. I got you. So these configuration parameters like TP degrees,
when we are dealing with a large language model that's huge, so it's distributed across many different accelerators, many different compute nodes, these kinds of configuration parameters like TP degrees need to be configured to figure out for exactly your model in the situation you're using it, what is the optimal config. Totally.
How many tokens per second can you get? What's your time to first token? How can you reduce your overall cost by, you know, having fewer resources, but still, you know, being able to respond to a number of requests at a time? So all of these questions we need to consider when we're trying to find the right instance and trying to find the right instance settings. Very cool. All right. So now we have a great understanding of why
a Trainium chip or a Trainium2 chip might be the obvious choice for a listener when they're thinking about training a large language model, or Inferentia might be when deploying a large language model. Give us some real-world examples of customers that you've had that have been able to take advantage of these chips to great effect.
Yeah, sure. So our flagship customer example, of course, is Anthropic. So Anthropic has been a very active developer and customer with Trainium and Inferentia for quite some time. And so the partnership has been phenomenal. Anthropic is a great team. It's an absolute privilege to support them as a customer.
And we are developing some big projects together. So I don't know if you heard about Project Rainier. But Rainier is an absolutely gigantic cluster that we are developing in collaboration with Anthropic, with, obviously, state-of-the-art Trainium cards and instances. And so it's just...
And it's just a pleasure to innovate with them. So Anthropic is a great example.
Fantastic, yeah. They certainly are one of the leaders at the forefront of AI. For me personally, you wouldn't know this, Emily, regular listeners probably would, that my go-to large language model for most everyday use cases is Claude. And it has been for some time. So yeah, love Anthropic. And I'm not surprised to hear that there's amazing, intelligent people to work with there on big mountains of a problem like Project Rainier. Totally.
For our listeners around the world, Mount Rainier is a big mountain in Washington State, not Washington, D.C. It is. That is correct. But yeah, no. And then so obviously we have customers across the spectrum. So Anthropic, you know, is such an important customer. We also work with startups. So we work with startups like Arcee
or Ninja Tech who are training and hosting small language models. And in the small language model space, it's exciting for customers because our price performance and our overall availability is just really compelling. They love the benefits that they get. They love the price. They love the performance. They love the models. They love the software stack. So we definitely see some great movement there.
We also see customers like Databricks. We are doing some big projects with Databricks. Not a small startup. Not a small startup. Yeah, yeah. No, we're doing some great work with Databricks. And then now we're expanding into the academic sector with Build on Trainium. Cool. What's the Build on Trainium program?
Yeah. So Build on Trainium is a credit program that we are running, which is $110 million in credits that we are offering to academics who are working on the future of AI.
So fundamentally, this is a way for universities, academics, PIs, principal investigators to submit their research ideas to us, their big ideas. We want to know sort of what they've already tested on Trainium, what their early modeling and early kernel results are. And then we are working to scale those results with them
on a cluster that is up to 40,000 Trn1 cards. So we have a very significant cluster that is available for researchers, for the best AI projects in the world. And so,
yeah, this is a big project, of course. We've been working on it for quite some time. Sounds really cool indeed. We'll have a link to the Build on Trainium program in the show notes for those academic listeners out there who would like to take advantage of this $110 million program from AWS.
I'd also like to highlight another client of Trainium and Inferentia chips that I'm aware of, which is Poolside. I'm aware of that because back in episode 754, we had Jason Warner, the CEO of Poolside, on the show. It's a really cool startup. It isn't Databricks size yet.
But Poolside, they're trying to tackle artificial general intelligence from the perspective of software, of code generation. And there's compelling arguments that Jason makes in episode 754 about how that might be feasible. So cool episode to highlight there, and another Trainium and Inferentia customer.
Absolutely. We're very excited about the Poolside partnership. When you're trying to figure out what the right instance is, we've talked about Trn1, Trn2,
Trainium, sorry, Inferentia chips as well. There are other kinds of instances that are available on AWS. How do you pick the right kind of instance type for a particular machine learning task? Yeah, sure. So of course, when you're, let's assume we're in the Trainium and Inferentia space for the moment. So when you're in that space, I mean, really, you have a couple of questions. Obviously,
we have two product lines, Trainium and Inferentia. The neuron core itself, though, the fundamental acceleration unit, is the same. Actually, the neuron core is the same. The software stack is also the same. So you can mix and match, go back and forth. Good compatibility.
What's different between the two is that the instance topology is just configured differently. So with Trn1, we assume that you're going to be training. So we connect the cards in what's called a torus topology, or a 4D torus topology, which means that the cards are connected to each other
in a way that you can easily do a backward pass. You can easily gather the results from all of the cards and then update the optimizer state. So the connectivity between the cards is much more suited for complex backward pass.
Whereas in the Inferentia line, again, the same neuron core, but the topology is more aligned for just a forward pass. So when you study the architecture, you'll see that
you have just one row of the cards, for example. It's not this 4D topology. It's sort of more aligned for just taking a large tensor, sharding a large tensor on the fleet, and then doing a forward pass. So that's some of the difference. Another difference is that in Inferentia,
you have more choices. You have many different options for instance size between how many accelerators you want, your HBM capacity as a result of that,
Whereas in Trainium, it's sort of really small and really large. So that's why we see, you know, a good benefit on Trainium where you're doing your small development with, like, a single, you know, Trn1, and then you're scaling it up to, you know, one large instance, and then as many instances as you can get. And you don't really need that flexibility. Whereas with Inferentia,
you might want to host your 7-billion or your 11-billion-parameter model that isn't going to have the same compute requirements. Nice, that was a great explanation, as they have been throughout this episode. And actually, speaking of your great explanations, you do have a history
of education. I mentioned earlier in the episode that we would talk about some of the educational stuff you did. So for example, you wrote a big 15-chapter book called Pretrain Vision and Large Language Models in Python. It's got a good subtitle too: End-to-End Techniques for Building and Deploying Foundation Models on AWS. So, short-form title, Pretrain Vision and LLMs in Python.
And so that's a big 15-chapter book. And you have also been an adjunct professor and a startup mentor. You created a course called Generative AI Foundations on AWS. So we'll have that in the show notes as well. And when I put all of that together,
into context for our listeners, it's probably totally unsurprising that you're involved in the Build on Trainium academic program that we were talking about earlier, because that involves amazing research universities like UC Berkeley, Carnegie Mellon, the University of Texas at Austin, and Oxford University. So,
very, very cool. I have, and I think this comes from our researcher, Serg Masís, who's always pulling in really interesting pieces from our guests' backgrounds, I have this quote that at a Swiss machine learning conference called AMLD, you told the story of Francesco Petrarch, a poet from the Italian Renaissance,
and how this story relates to the AI development project. So could you elaborate on this story and how it influences your approach not only to AI development, but also your efforts in AI education?
Sure. Yeah. So there's, there's a lot to unpack there. Let's, let's try and take it step by step. Um, so again, because I don't have an undergrad in computer science, uh, and I don't have a PhD in computer science, I didn't have that opportunity. I feel like I've had to teach myself a lot. Uh, and obviously I've had, you know, phenomenal mentors and have worked on phenomenal teams that push me, um, worked with phenomenal customers who pushed me. Um,
What I love about technology, the reason why I love software so much is because in software, if you build it, you can understand it. At least that's how I feel. It doesn't matter how complicated something is.
Does it matter if I didn't take that class? Does it matter if I didn't have a PhD in whatever it is? If I can code it, I can convince myself that I can probably understand what's going on. And so from that perspective, that is the perspective by which I teach because I understand that we live in a world where not everyone has every opportunity that maybe they wish they had. But nonetheless, here we are and we're doing our best.
And so I love teaching because I love taking things that were hard for me to understand, that were hard for me to, you know, explain to myself. But because it was challenging, somehow I was able to find a way to simplify it to myself. And
And I love sharing that with other people because I know it simplifies their journey and it simplifies their path, certainly simplifies their experience on AWS with my own technology stack. And so that ethos, I guess, I just love. I've always loved. And so it's education is a part of why I enjoy that. And it's a way to scale that and help others grow. So that I just really enjoy. Yeah.
I want to address the Petrarch point. I love that that came up. It's sometimes surprising how, you know, things you post on the internet show up later. So that's beautiful.
I'm also just a humanist. Like, I love so many things in this world. I love art. I love art history. I love philosophy. I love thinking about things in ways that I didn't previously, you know, consider. So the reason why I did that, I was prepping
to do an invited, you know, talk, like an invited keynote, at this conference in Switzerland. And this was at the time when LLMs were just becoming popular and foundation models were just becoming popular. So I wanted, like, a nice quote about intelligence that would feel culturally relevant. So I thought Petrarch would be a good quote. So I got some nice, you know, phrase about
human intelligence and the impact of human intelligence. And it's, it's funny that like, we live in a world where like, we need to talk about human intelligence as like an important thing that matters. Like, I don't know, I see, you know, so much going on in the LLM space and the AGI space. I'm like, don't get me wrong. Obviously I'm all about scaling out, you know, computers and like developing AI, but I also care a lot about human intelligence. Um,
I find it super valuable in my own life to maintain my own intelligence as a goal. I find that valuable in the life of my team and people that I work with. It's like we need to continue to grow our own intelligence while we're obviously growing the intelligence of the machines.
But that balance between those two, I really enjoy and I just find so fun to consider. Well, that idea of AI being for humans and supporting our intelligence and, you know, our...
as individuals and as a society, that's actually a theme of two recent episodes of this show. So we have two episodes largely dedicated to that kind of idea, which is episodes 869 and 873 with Varun Godbole and Natalie Monbiot, respectively. So yeah, there seems to be, it's something that kind of has just started to come onto my mind as well. And at the time of recording, I'm preparing a keynote kind of along
those themes as well. So yeah, I think there's something special there. My final technical question for you before we get into our kind of wrap-up questions, Emily, is just kind of your insights into what's going to happen next. Obviously, it's a fast-moving field, but you're right at the heartbeat of it there, working on hardware like Trainium and Inferentia. So,
the field of AI is moving incredibly fast. What emerging technical challenges most excite you? You know, what do you think is coming next? Yeah, sure. So I think where, you know, five years ago this was a question, unambiguously large language models are here to stay. Like, this is just clear. How
these continue to be integrated into applications, the nature of them, the fine-tuning of them, the agentic systems that are built on top of them, the pre-training of them, the data set selection for them, the evaluation of them: all of those will change. All of those are in flux. All of those will evolve. For a while, I've seen, particularly in my SageMaker days, how
over time, it just makes so much sense to push knowledge into the model as much as possible. Like it simplifies the lift for development teams, simplifies the lift on data management, simplifies the lift on the application management. So I think you'll continue to see like a variety of ways that people try to push knowledge into LLMs, like push knowledge into an LLM in the pre-training stage, right? When you're creating the foundation model from scratch.
You do it when you're doing supervised fine-tuning to teach it how to follow commands. You do it when you're aligning the language model to perform complex reasoning. You do it when you're designing your RAG system. You do it when you're designing your agentic system. But really, all of those are just fluff compared to what's actually in the neural network itself.
And so what I think you'll continue to see is this synergy between people solving problems at the agentic system, at the agentic level, that then are absorbed by their RAG, that are absorbed at pre-training, that are absorbed by the data set itself. And then obviously hardware is going to keep cruising. Like, we have a lot
in store. Trainium3, you know, was pre-announced at re:Invent; Trainium3 is on the way. We are very much just getting started, and what you're going to see from Trainium and Inferentia, with NKI, with Build on Trainium... so stay tuned. But in terms of the LLMs themselves, like,
yeah, there's a lot that's going to continue to be the case. But it's also kind of encouraging that the core problem is the same. Everyone's still trying to train their model the best they can, figure out how to host it, and figure out how to do inference on it as well as they can. That has not changed, and I don't expect that to ever change. But now the focus, obviously, is on language models and doing all of that most efficiently, with the best mix of results. So I think there's a lot that you'll continue to see in that space. Fantastic answer. Thank you. Lots to look forward to, and of course, driven by hardware; that's what's happening right now.
Fantastic. Emily, this has been a sensational episode. I've learned a ton. I marked down for our show notes maybe a record number of links and interesting terms that people will be able to dig into after the episode. So clearly a huge amount of concrete, useful content was conveyed in this episode. Thank you so much. Before I let you go, I always ask my guests for a book recommendation. Yes! So...
I got this book today, actually, delivered by Amazon. I don't know if this is coming through. It's probably not coming through; most of our listeners are listeners only, though there are some YouTube viewers. Emily was holding up the book, delivered by Amazon, to the video camera.
What's the title of the book? Yeah. So the book is called Voice for the Voiceless, and it's a book by His Holiness the Dalai Lama.
So I mentioned to you that I love to meditate and that I'm a Buddhist practitioner, so of course, just personally, I love to read the words of His Holiness the Dalai Lama. I'm very much looking forward to reading this book, to sympathizing with his struggles but also with his wisdom. I find His Holiness to do a really remarkable job of combining wisdom with compassion in the modern time while also holding on to the strength of his lineage. So I'm very much looking forward to reading this book, and I would tentatively offer my recommendation for it on that basis. Nice recommendation. I'm sure it's an exceptional book. I have read books by His Holiness the Dalai Lama in the past. I read An Open Heart some years ago, which I thought was great. It includes some introductory tips on meditation. I had already been meditating for a few years at that point, but there were some great pointers in there and just some great life advice. He seems to be quite sage. He's a sage maker, if you read his books. Indeed, indeed.
He will, yeah; he makes his readers sage. So there you go, a nice AWS joke. All right, Emily, how can we follow you after this episode? Yeah, sure. So you're welcome to follow me on LinkedIn. I will warn you that I am just not that active on social media. Well, how could you be Buddhist-centered and active on social media? Those are incompatible. Yeah, I'm not saying they're incompatible; I'm just not that active on social media. I bet it makes it harder. But yeah, you're actually the first podcast that I've ever done. No. It's true, it's true. So I'm excited to burst my podcast bubble. And yeah, follow me on LinkedIn, but especially on GitHub; I'm actually super active on GitHub. There you go. Oh, I forgot to mention this:
for Build on Tranium, we just wrapped a competition, actually, where we offered $25,000 in cash to the top team that could develop the fastest NKI Llama, that is, the fastest Llama implementation using NKI. So that competition is over, but I am totally expecting projects like this to pop up again, so definitely stay tuned for more.
But yeah, I'm really active on GitHub. When you're cutting issues in the Neuron SDK or the NKI SDK, feel free to tag me; I will respond. Shoot me your kernels. I'd love to see the work that people are doing. And yeah, let's go build.
In the words of Werner Vogels. And a reminder that we talked about NICI earlier in the episode; it's spelled N-K-I, for Neuron Kernel Interface, and I'll have a link to that in the show notes too. Beautiful. Thank you, John. Thank you, Emily, for taking the time. I'm so delighted to have had you on for your first podcast appearance. You are a natural, and every podcast should have you on, whether they're in data science or not, to explain something wonderful about the world. Thank you, Emily.
Thank you, John. What a sensationally interesting episode with Emily Weber. In it, she covered how her background in meditation and Buddhist practice provided mental tools that helped her excel in computer science by developing focus and calm problem-solving abilities.
She talked about the Nitro system, developed by Annapurna Labs (acquired by Amazon in 2015), which physically separates data and instance control in cloud infrastructure, creating better security and scalability. She talked about how the Build on Tranium program is AWS's $110 million investment program providing cloud credits to academic researchers working on cutting-edge AI at institutions like Berkeley, Carnegie Mellon, UT Austin, and Oxford.
She talked about how Tranium 2 offers four times the compute power of Tranium 1, with 8 neuron cores per card instead of 2, and a whopping 1.5 terabytes of high-bandwidth memory capacity per instance.
She talked about the AWS Neuron SDK, which helps developers easily optimize and deploy models on Tranium and Inferentia chips through tools like NXD, which handles model sharding across accelerators. And she talked about hardware design decisions, like TP (tensor parallelism) degrees, that significantly impact model training and inference efficiency and require careful optimization for specific workloads.
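To give a rough sense of what a tensor parallelism degree means in practice, here is a small, self-contained NumPy sketch of column-parallel sharding. It is a conceptual illustration only, not the Neuron SDK or NXD API, and the function and variable names are hypothetical; the TP degree is simply the number of shards a layer's weight matrix is split across, which is why it has to divide the layer's dimensions and why it interacts so strongly with the hardware topology.

```python
# Toy illustration of tensor parallelism (TP) degree using NumPy.
# Not AWS Neuron / NxD code: a real setup would place each shard on its own core
# and gather results over the interconnect instead of a Python list.
import numpy as np

def column_parallel_linear(x, weight, tp_degree):
    """Simulate a column-parallel linear layer: each 'rank' owns 1/tp_degree of the output columns."""
    shards = np.split(weight, tp_degree, axis=1)       # one weight shard per simulated accelerator core
    partial_outputs = [x @ shard for shard in shards]  # each rank computes only its slice of the output
    return np.concatenate(partial_outputs, axis=-1)    # the equivalent of an all-gather across ranks

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.standard_normal((4, 1024))          # a small batch of activations
    weight = rng.standard_normal((1024, 4096))  # the full, unsharded weight matrix
    for tp in (2, 8):                           # the hidden dimension must divide evenly by the TP degree
        y = column_parallel_linear(x, weight, tp)
        assert np.allclose(y, x @ weight)       # sharded computation matches the dense matmul
        print(f"TP={tp}: output shape {y.shape} matches the unsharded result")
```

The sketch also hints at the tuning problem mentioned above: a higher TP degree spreads memory and compute across more cores but adds more communication at the gather step, so the best degree depends on the model size and the workload.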
As always, you can get all the show notes, including the transcript for this episode, the video recording, any materials mentioned on the show, the URLs for Emily's social media profiles, as well as my own at superdatascience.com slash 881.
And if you'd like to engage with me in person as opposed to just through social media, I'd love to meet you in real life at the Open Data Science Conference ODSC East running from May 13th to 15th in Boston. I'll be hosting the keynote sessions and along with my longtime friend and colleague, the extraordinary Ed Donner, I'll be delivering a four-hour hands-on training in Python to demonstrate how you can design, train, and deploy cutting-edge multi-agent AI systems for real-life applications.
Yeah, and we could also just meet for a beer or whatever there. Thanks, of course, to everyone on the Super Data Science Podcast team: our podcast manager, Sonia Breivich; media editor, Mario Pombo; partnerships manager, Natalie Zheisky; researcher, Serge Massis; writer, Dr. Zahra Karche; and our founder, Kirill Eremenko. Thanks to all of them for producing another fabulous episode for us today.
For enabling that super team to create this free podcast for you, we are deeply grateful to our sponsors. You can support the show by checking out our sponsors' links, which are in the show notes. And if you yourself are interested in sponsoring an episode, you can get the details on how by heading to johnkrohn.com slash podcast.
All right. Share this episode with people who might like to have it shared with them. Review the episode on your favorite podcasting app. I think that helps us get the word out about our show. Subscribe if you're not a subscriber. Feel free to edit our videos into shorts or whatever you like. Just refer to us. But most importantly...
Just keep on tuning in. I'm so grateful to have you listening and hope I can continue to make episodes you love for years and years to come. Until next time, keep on rocking it out there and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.