Welcome to Practical AI, the podcast that makes artificial intelligence practical, productive, and accessible to all. If you like this show, you will love The Changelog. It's news on Mondays, deep technical interviews on Wednesdays, and on Fridays, an awesome talk show for your weekend enjoyment. Find us by searching for The Changelog wherever you get your podcasts.
Welcome to another episode of the Practical AI Podcast. This is Daniel Whitenack. I am CEO at Prediction Guard, and I'm joined as always by my co-host, Chris Benson, who is a principal AI research engineer at Lockheed Martin. Hi, Daniel.
How you doing, Chris? Doing very well. How's it going today, Daniel? It's going great. Yeah, it's been a fun, productive week in the AI world over here at Prediction Guard, so no complaints. But I'm excited about this episode, because it's one I've been wanting to make happen for quite a while. Today we'll be talking about
both AI hardware and software with DJ Singh, who is a staff machine learning engineer at Groq. How are you doing, DJ?
Hey, Daniel. Thanks for having me. It's been going well. Yeah. Good, good. And I guess we should specify for our audience, this is Groq as in G-R-O-Q. I imagine maybe some people get confused these days with that. But yeah, this is one that I've been really excited about, DJ, because
I've been observing what Groq has been doing for some time and, of course, you're innovating in a lot of different ways, like I mentioned, on the hardware side and on the software side. So could you maybe just set the stage for us a little bit in terms of
the overall ecosystem as you see it, in terms of what may be a bloated term, like AI accelerators or hardware, and also the software that goes along with that, and kind of where Groq fits into that ecosystem. Right. So I think I'll first start with a quick brief about Groq. So Groq is, of course, a company which provides fast AI inference solutions. So whether it's
text, image, or audio, we are delivering AI responses at blistering speeds, an order of magnitude faster than traditional providers, right? Now, you spoke of AI accelerators, and traditionally, training and inference have been done on GPUs, but I think in the last few years, we've seen...
All sorts of AI accelerators come into play. So there are those more mobile device-oriented ones that phone companies like Samsung and Apple come up with, right? And then there's more happening on the server side, part of which is what Groq is also leading towards.
Yeah, that's great. And on the server or hardware side, am I correct that Groq does have its own hardware that they've developed over time? Is that right? What's been the progression of that, and the current state?
Absolutely. So Groq developed this technology which we call the Groq LPU. It's essentially a software and hardware platform which comes together to deliver that breakthrough performance of low latency and high throughput. But how Groq got into it was first to develop that software. So we developed the software compiler first, before moving on to the hardware side,
kind of a shift in how traditional development was done previously. And, I mean, that does seem very unique to me. So what was, I guess, the motivation or the thought process behind taking that non-standard approach, compiler first and then hardware? Yeah, no, absolutely. So traditionally, as I mentioned, development is done hardware first: a new accelerator is developed, somebody makes the hardware first, and then the software has to deal with the inefficiencies of the hardware.
Whereas when Groq decided... and this company was founded by Jonathan Ross, our CEO, who was a co-founder of Google's TPU program, the Tensor Processing Unit program. And based on his learnings from there, one of the key decisions was: let's develop the software first.
So we have developed this software compiler, which helps convert these AI models into code which runs on the Groq LPU. But specifically, the compiler is responsible for scheduling each and every operation of that AI model. So you can think of an AI model, in terms of what the computer sees, as being made up of additions and multiplications.
And the software compiler decides where and when to schedule each of those operations. And that goes into our, you know, various design principles, one of which, of course, as I mentioned, is to be software first, right? Now, you might ask, why do we do this, right?
So one key consideration is that not only does the software have to deal with hardware inefficiencies, right? But there are other aspects of the hardware which can add delays, whereas Groq prefers to have a deterministic system in place. So determinism, I would say, means deterministic compute and networking. So kind of having an understanding
of where and when to schedule an operation. So to understand this, we can consider an analogy. Imagine a car driving along a road with several stop signs. Stopping at every sign is essential for safety, but it does add some delays, right? Now, what if the world was perfectly scheduled and we knew when to start the car
and drive at maximum speed so that there are no collisions, right? Then there would be no need for these stop signs, no delays as such. And it also makes a more efficient use of the road, since you can then have more cars and everybody's going at maximum or near-maximum speed.
So to reflect this analogy back to the hardware space, Groq chose to remove components which can add delays. So it could be, let's say, network switches, or even algorithmic delays, some sort of algorithms which control packet switching. All these things add non-determinism into the system.
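To make the stop-sign analogy a little more concrete, here is a minimal, purely illustrative sketch of what compile-time scheduling buys you: if every operation's duration and every dependency is known ahead of time, the start time of each operation can be fixed statically, so nothing has to stop and negotiate at run time. The operation names and cycle counts below are invented for illustration; they are not Groq's actual instruction set or compiler.

```python
# Toy "compile-time" scheduler in the spirit of the stop-sign analogy.
# Durations and dependencies are made up; Groq's real compiler and
# hardware model are far more involved.
ops = {
    # name: (duration_in_cycles, list_of_dependencies)
    "load_a": (4, []),
    "load_b": (4, []),
    "matmul": (10, ["load_a", "load_b"]),
    "bias_add": (2, ["matmul"]),
}

start = {}
for name, (dur, deps) in ops.items():  # assumes ops are listed in dependency order
    # Because every duration is known up front, each op's start cycle can be
    # fixed at compile time -- no run-time arbitration ("stop signs").
    start[name] = max((start[d] + ops[d][0] for d in deps), default=0)

for name, (dur, _) in ops.items():
    print(f"{name:9s} starts at cycle {start[name]:2d}, runs for {dur} cycles")
```

At run time there is nothing left to decide, which is the "no stop signs" property DJ describes; a conventional accelerator would instead resolve those waits dynamically, paying synchronization overhead each time.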
I did want to, maybe for some of our listeners out there... you've been talking about this compiler level, which, you know, I think of a compiler, similar to what you said, as: hey, I'm writing some higher-level software code.
That's compiled to these instructions that run under the hood on the actual hardware components doing, as you said, additions or whatever those sorts of numerical operations are.
But people might also be confused in terms of the software stack. They may be familiar with something like CUDA, which, you know, provides drivers to run on certain hardware like NVIDIA GPUs. Or, I know we've worked a little bit with Intel Gaudi processors, and there's the driver package Synapse, which is similar in that sort of way; it helps translate
kind of your higher-level code to run on these hardware components. Could you help us map out that software stack, like where this compiler fits in? And are there other components, like these drivers, that would have a parallel in the Groq world? Yeah.
Yeah. So traditionally, as you've mentioned, on, let's say, the NVIDIA ecosystem, there are tons of engineers who go and create these kernels, which are invoked when you have some sort of model operations. So there would be...
Maybe even thousands of engineers in the company working towards developing these very specialized kernels to go and execute things. However, due to the structure of the GPU itself, architecturally, this is not the best design philosophy.
I'm sure the audience is familiar with GPUs. I remember playing games on them growing up and editing videos. And they grew to be more powerful over recent decades. But GPUs started in the '90s, and the design hasn't changed all that much. We've had the addition of high-bandwidth memory and other hardware components, but
all of it essentially still originates from the original design. It makes the system, again, less deterministic. So that goes back to the compiler system here. And let's talk about the NVIDIA GPU kernels here, right? They have to deal with the different hierarchies of memory, as an example. So for those listeners who are familiar with
the different memory systems in a computer: you might be familiar with an L1 cache, which has an access time of about one nanosecond. But you then have these bigger memories, the high-bandwidth memories, which are closer to 50 to 100 nanoseconds.
For a task to be processed performantly, data needs to be fetched between these different memories and onto the compute that's there. That transfer of data adds more delays.
And since this is a conservative system, right? Let's say you have two operations and one depends on the other. It's waiting on that operation to complete, so it adds further delays, you know? So one operation is stuck waiting on the data, and the other operation is stuck waiting on the first, right?
So that kind of incrementally adds more and more delays into this. So that's an example of how the traditional compiler, or the traditional kernel-based system, doesn't scale as well. What Groq chooses to do, of course, is not have any kernels whatsoever, but have a compiler which controls this at a fine-grained level.
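As a rough back-of-the-envelope on the stall math DJ is describing, the sketch below uses the latency figures from the conversation (roughly 1 nanosecond for L1 cache, 50 to 100 nanoseconds for high-bandwidth memory); the length of the dependent chain is an invented number, purely for illustration.

```python
# Back-of-the-envelope sketch of the stall math described above. The latency
# numbers echo the conversation (~1 ns L1 cache, ~50-100 ns HBM); the chain
# length is invented for illustration.
L1_NS = 1
HBM_NS = 75           # midpoint of the 50-100 ns range mentioned
DEPENDENT_OPS = 1000  # a chain where each op waits on the previous one's data

cache_resident = DEPENDENT_OPS * L1_NS
hbm_bound = DEPENDENT_OPS * HBM_NS

print(f"all data in L1 : {cache_resident / 1e3:.1f} microseconds of waiting")
print(f"all data in HBM: {hbm_bound / 1e3:.1f} microseconds of waiting")
# The gap compounds when op B waits on op A, which is itself waiting on a
# fetch -- the "delays add incrementally" point in the discussion.
```

The point is not the absolute numbers but the compounding: every dependent operation inherits the wait of the one before it.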
A typical system will have multiple chips. I'm sure people are familiar with models like Llama 70 billion. These models tend to be split across multiple GPUs, and likewise across multiple Groq chips. This compiler controls how the model is precisely split across these different chips and how it's executed,
to get the best performance out of it, down to the level of the chipset and the networking. So as I mentioned before, we've removed a lot of the hardware which adds delay, and this sort of scheduling is done
by Groq's compiler, with some assistance, of course, from the firmware. And I appreciate that. As we talk, I'm trying to get a good sense of how the whole stack looks as you're starting to dive into it. And you've talked a bit about the compiler versus having a kernel at the model layer there. But with you guys covering both the hardware and the software component,
would you say Groq is... I'm trying to understand that whole business model that you're approaching it with. Is it more of an integrator that's full stack, all the way from the hardware up through the OS and into the model layers, like from an integration layer? Or
are you writing most of the software stack that's touching the hardware? Like, how do you choose whether to go pick, and I'm just pulling things out of the air, not attributing this to you, but going and picking Linux and picking CUDA and picking this and picking that, versus what you're writing to create your own full stack? I'm trying to get a sense of how those decisions are distributed
from a design standpoint. Yeah, that's a great question. So all the way from our starting stack, right? Let's start at the top. Most folks, when they think about using AI models in production, end up using some sort of API.
So our cloud organization designed a REST-compatible API. It's compatible with the OpenAI spec, which makes it very easy for developers to integrate with it.
And then that ties all the way into the rest of our stack. And to answer your question directly, yes, most of the stack has been custom written. We are, of course, using some Linux-based primitives which are there underneath the
system. And there are, of course, some components such as, for the compiler, this MLIR system which is being used. MLIR is a compiler term. I don't want to go super deep into it, but it's a multi-level intermediate representation which kind of helps to transform things in between.
So overall, I would say this entire design pattern has been thought through from scratch and it's taken the company a couple of iterations to get to that point.
Well, friends, I am here with a new friend of mine, Scott Dietzen, CEO of Augment Code. I'm excited about this. Augment taps into your team's collective knowledge, your code base, your documentation, your dependencies. It is the most context-aware developer AI, so you won't just code faster, you'll also build smarter. It's an ask-me-anything for your code. It's your deep-thinking buddy. It's your Stack Overflow antidote. Okay, Scott, so for the foreseeable future, AI-assisted is here to stay. It's just a matter of getting the AI to be a better assistant, and in particular, I want help on the thinking part, not necessarily the coding part. Can you speak to the thinking problem versus the coding problem, and the potential false dichotomy there?
A couple of different points to make. You know, AIs have gotten good at making incremental changes, at least when they understand the customer's software. So first, the biggest limitation that these AIs have today is that they really don't understand anything about your code base. If you take GitHub Copilot, for example, it's like a fresh college graduate: it understands some programming languages and algorithms,
but doesn't understand what you're trying to do. And as a result of that, something like two-thirds of the community on average drops off of the product, especially the expert developers. Augment is different. We use retrieval-augmented generation to deeply mine the knowledge that's inherent inside your code base. So we are a copilot that is an expert in your software,
and it can help you navigate the code base, help you find issues and fix them and resolve them over time much more quickly than you can trying to tutor up a novice on your software. So you're often compared to GitHub Copilot. I've got to imagine that you have a hot take.
What's your hot take on GitHub Copilot? I think it was a great 1.0 product, and I think they've done a huge service in promoting AI, but I think the game has changed. We have moved from AIs that are new college graduates to in effect AIs that are now among the best developers in your code base. And that difference is a profound one for software engineering in particular.
You know, if you're writing a new application from scratch, you want a web page that'll play tic-tac-toe, piece of cake to crank that out. But if you're looking at, you know, a tens-of-millions-of-lines code base, like many of our customers, Lemonade is one of them, I mean, a 10-million-line monorepo. As they move engineers inside and around that code base and hire new engineers,
just the workload on senior developers to mentor people into areas of the code base they're not familiar with is hugely painful. An AI that knows the answer, that is available seven by 24, where you don't have to interrupt anybody, and that can help coach you through whatever you're trying to work on, is hugely empowering to an engineer working in unfamiliar code.
Very cool. Well, friends, Augment Code is developer AI that uses deep understanding of your large code base and how you build software to deliver personalized code suggestions and insights. A good next step is to go to AugmentCode.com. That's A-U-G-M-E-N-T-C-O-D-E.com. Request a free trial, contact sales, or if you're an open source project, Augment is free to you to use.
Learn more at AugmentCode.com. That's A-U-G-M-E-N-T-C-O-D-E.com. AugmentCode.com. So DJ, you mentioned that a lot of the focus around, you know, that design from the hardware layer up through those software layers, and digging into all of those, was to achieve this
fast inference. Could you tell us a little bit about the kinds of models that you've run on Groq, and some highlights in terms of what fast performance means in practice? Now, I've seen some pretty impressive numbers on your website, so I won't steal your thunder, but yeah, just talk a little bit about what is achievable, with what kinds of models, on the Groq platform. Yeah. So first of all, you know, I'll share some numbers, but we are just getting started, so these numbers are only going to get better with time. But let's take Llama 3 70 billion as an example. It tends to be one of those industry standards for comparing performance. So we've had numbers all the way from around 300 tokens per second to multiple thousands of tokens per second, depending on the use case. And yeah, we've had some smaller models which go up to several thousand tokens per second.
We've had one of our speech-to-text models, Whisper, which is again an OpenAI model, running on Groq. And with this model, I think we've gotten around a 200x real-time speed factor, as they describe it in the audio world.
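For listeners trying to picture what those rates mean, here is some quick arithmetic; the response length and the words-per-token ratio are rough, commonly used assumptions rather than measurements from the episode.

```python
# Quick arithmetic on what "thousands of tokens per second" means in practice.
# The token count and the rough words-per-token ratio are illustrative
# assumptions, not measurements.
RESPONSE_TOKENS = 500    # a medium-length chat answer
WORDS_PER_TOKEN = 0.75   # common rule of thumb for English text

for tokens_per_second in (30, 300, 3000):
    seconds = RESPONSE_TOKENS / tokens_per_second
    print(f"{tokens_per_second:>5} tok/s -> {RESPONSE_TOKENS} tokens "
          f"(~{int(RESPONSE_TOKENS * WORDS_PER_TOKEN)} words) in {seconds:.2f}s")
# At thousands of tokens per second the whole answer lands in a fraction of a
# second -- the "wall of text all at once" effect discussed next.
```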
Yeah. And maybe talk a little bit about, for those out there that are trying to process what these thousands of tokens per second imply... I would say, you know, if you're using a chat interface, for example, and something is responding at thousands of tokens a second, it's, you know, potentially a wall of text that appears almost all at once, as far as our human eyes see it. Could you talk a little bit about the implications of that? So I mentioned the chat interface, and certainly some people are using chat interfaces, right? But at the enterprise level, for true enterprise AI use cases, why is fast inference important for these kinds of models? Why is that important? Because, like, in a chat interface, I can only read so much text so fast, right, with my own human mind as it comes back to me. I certainly have my own thoughts on this, but I'm wondering if you could speak to why that speed matters in enterprise use cases, and why it matters to push it maybe, you know, further than our own speed of reading, for example? - No, great question.
So I think if you were to start with the Google studies from a decade ago, right? People's perception of, like, search results is that if it takes longer than, I think it's about 200 milliseconds or so, somebody loses interest.
So speed is critical, whether it's for the enterprise or for everyday people. I mean, we've demonstrated this several times, and you can try it out for yourself. You can, let's say, open ChatGPT with something like o1, and have Groq on the side with one of our reasoning models, and try comparing them side by side.
So what becomes more critical, as I'm coming to, is that everybody thinks of speed as being important for real-time applications, yes, but then there is the aspect of accuracy, right? So if you could reason for longer... let's say in the case of our reasoning models, we've had DeepSeek R1, for example, right? And these models, they generate a lot of tokens.
And if you can reason for longer, you can get higher-quality results as a consequence. So while not making the system too slow for the user, whether, again, it's enterprise or everyday users, speed can translate to quality as well.
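A small sketch of the "speed buys reasoning" trade-off DJ is pointing at: for a fixed amount of user patience, a faster backend lets a reasoning model spend more tokens thinking before it has to answer. The five-second patience budget is an arbitrary assumption, and the throughput figures simply echo the rough numbers mentioned earlier in the episode.

```python
# Rough illustration of the "speed buys reasoning" point. The 300 and 3000
# tokens/sec figures echo numbers mentioned earlier; the 5-second patience
# budget is an arbitrary stand-in for how long a user will wait.
PATIENCE_SECONDS = 5
for tokens_per_second in (300, 3000):
    budget = PATIENCE_SECONDS * tokens_per_second
    print(f"At {tokens_per_second:>4} tok/s, a reasoning model can 'think' for "
          f"~{budget:,} tokens before the user has waited {PATIENCE_SECONDS}s.")
```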
So to extend that just a little bit: we've been talking directly about inference speed and such, more from the practitioner standpoint. If you're maybe a business manager or a business owner out there,
and you're looking at Groq and comparing it against more traditional inference options that are already out there. When you're talking in terms of speed and, for instance, being able to have the time to do the research and stuff, what are some of the use cases from a business standpoint where they need to go, "it's time for us to reassess the more traditional routes that we've taken on inference and look at Groq for these solutions"? Could you talk a little bit about what some of those business cases would be? Yeah, I mean, if you care about accuracy, speed, or cost, you should consider Groq.
So not only are we fast, the Groq LPU architecture allows us to offer really low cost. I would say our costs per token are really low, and we pass on those savings to all of our customers.
So if you are concerned about any of these cases and you want to work with different modalities, if you care about image, text, or audio, if you care about RAG, if you care about reasoning, we are there for you. Yeah. And just to tie into that as well, some people might be listening to this and thinking in their mind,
"Oh, Groq has this whole platform that they've designed, hardware and software. I don't have a data center. It's going to be expensive for me to spin up racks of these things." Could you talk a little bit about... I mean, I could be mistaken, so please correct me, but I think that is something that can happen. I mean, there are physical systems that people can
access and use and potentially bring into their infrastructure. But I know I also, you know, see a login, I see an API, as you mentioned the REST API in your previous answer about the developer experience. So maybe just talk through some of those access patterns, and also how you as a company have thought about
which of those you provide? Because certainly there are advantages on the hardware side of maybe a fixed cost, but then there's the burden to support that. So just talk us through a little bit about the strategy that you all have taken because you are deploying this whole platform. How have you thought about providing that to users and what sort of access patterns, I guess?
Right. So I'd say, to start with, one can go to our website, groq.com, and just experience the speed themselves. It's a chat interface, and it's trivial to sign up for an account over there. And on the free tier, we offer tons of tokens for free. You can sign up and get access to our APIs.
Once you get access to our APIs, and let's say you've already been using an existing API, let's say you're using OpenAI, it's pretty easy for you to switch to Groq. It's maybe a single or two-line change. Just try it out for yourself. We firmly believe in letting people experience the magic themselves, rather than us talking about it. I think actions just speak louder.
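To ground the "single or two-line change" DJ mentions: because Groq exposes an OpenAI-compatible endpoint, the standard OpenAI Python client can be pointed at it by swapping the base URL and API key. A minimal sketch; the model ID below is one example and may change, so check Groq's current model list.

```python
# Minimal sketch of the "one or two line change": Groq's endpoint follows the
# OpenAI spec, so the standard OpenAI Python client works once it is pointed
# at Groq's base URL and key. The model ID is illustrative.
import os
from openai import OpenAI

client = OpenAI(
    base_url="https://api.groq.com/openai/v1",  # the line that changes
    api_key=os.environ["GROQ_API_KEY"],         # ...and the key that changes
)

response = client.chat.completions.create(
    model="llama-3.3-70b-versatile",
    messages=[{"role": "user", "content": "In one sentence, what is an LPU?"}],
)
print(response.choices[0].message.content)
```

The rest of an existing OpenAI-style integration typically stays structurally the same, which is what makes the switch a near drop-in.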
For, of course, our deep enterprise customers, we do offer other services on that side, right? So you're talking about single-tenant, and then there are, of course, multi-tenant-based architectures
over there. So we do offer dedicated instances where there's a real need for that, and we do manage that. Groq deploys its own data centers, and we offer those all over an API, so it's very easy for our customers to go and sign up and use them.
I'm going to ask if you could talk a little bit about it, just because as folks are listening, they will go try that out afterward. And I know that we'll have links in the show notes to the site so they can do that. But could you talk a little bit about, and you could pick your example, but, you know, you mentioned
the OpenAI experience, you know, something that they've probably had experience with. It's one of those things that kind of everybody has at least touched at some point out there, and you're providing a better experience here. Could you talk a little bit about what that is? When you say "go experience this yourself and you're going to see how amazing it is", could you talk through what you've seen your customers experience
in that way, just so that listeners will get a sense, or maybe a preview, of what they should expect? Having messed around with OpenAI for a while, now they're going over to Groq and they're doing that and they're going, whoa, this is amazing. What is that amazing that you're expecting them to see?
Well, first of all, people are just amazed by the speed that they get, like the speed of the output that comes up, you know, whether it's text or audio; you just get the output right away, right there. It's really, really fast. And I think it really makes people think of new ways of doing things.
So, you know, one example from our developer community, and our developer community has grown to over a million developers now... one recent example from a hackathon was that somebody developed this snowboarding navigation system based on Groq, taking images and trying to guide people while snowboarding. And my mind was blown by these creative geniuses out there.
Just amazing. So all sorts of new applications out there, enabled by the speed. Well, DJ, I do want to follow up on some of what you talked about there, on the developer community. Could you maybe clarify one thing for me? So there are the Groq systems that you have deployed, and models that you have deployed in those systems, which, it sounds like, if I'm
interpreting things right, people can just use, I'm assuming, your programming language clients or REST API to access that API and build off of those models that are in that environment. So in that case, it's sort of accessing models, like you say, in a similar way to how they would access OpenAI models and that sort of thing.
Is there another side of the developer community that is saying, "Hey, we actually have our own custom models," whatever those might be? What is the process? I guess my question is, what is the process of getting a model supported on Groq? You've talked mainly about the gen AI-level models, you know, LLM or vision or transcription. How wide is the support for models? In terms of, hey, if I have this model... you know, I'm thinking in my mind of a manufacturing scenario: if I have a very specific model that needs to run at extremely fast speeds to, like, classify the quality of products coming off of a manufacturing line, right? But it's a custom model. And I say, okay, Groq has the fastest inference. What should I expect in terms of model support as of now, in terms of architectures, and then your vision for that in the future, and also how maybe people could contribute there, if there is an opportunity? Yeah. I think right now, one can just reach out to our sales team and we can figure it out. Based on the workload and the size of the model and things like that, we can figure out the best path going forward.
Now, going to the future, we have some very exciting developments, but I don't want to spoil that right now since it's still a work in progress. So I guess we'll disclose that whenever we can. And maybe kind of along with that, I know we have...
you know, even my team, we've tried out running models on a variety of GPU alternatives. Sometimes what happens there is, you know, the latest model comes out on the market, right, and it's maybe supported in certain driver ecosystems very quickly, and then maybe on some of these alternatives there needs to be a kind of longer pathway for support in custom software stacks that aren't, you know, GPU-based. How do you all navigate that right now? I know, of course, our team is small and it's hard for us to navigate that, and maybe you have people thinking about those things every day, but yeah, how do you navigate that challenge as an engineering team, to support all of these different models as they're coming out, given that you have a completely different software stack than, you know, others are working with in the ecosystem?
Yeah, if you think about it, we don't have to write kernels at a per-model level. So when a new model comes out, generally in the GPU world, and even with other custom accelerators, typically people spend a lot of time writing more optimal versions of it. So you might hear about
new CUDA kernels being launched. Let's say, after the original attention, there was flash attention. So that's a more optimal way of running some of these models on the GPU. But we don't have to do this at a per-model level. What ends up happening is, as we enhance our compiler over time, all these enhancements just reflect onto all of the models that we end up supporting.
And the process to support different models on Groq is kind of similar. We end up spending some time removing vendor-specific hard codings, right? So there tends to be a lot of GPU-specific code which we end up removing.
And then we run our compiler to translate this, you know, finally into the Groq hardware. But there are a lot of knobs we tweak and turn to give you the best possible performance out of that. And as the compiler improves with time, we just end up passing on these improvements to all the models right away. So our effort per model is not as high, you know.
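As one concrete flavor of the "vendor-specific hard codings" DJ describes removing, model repos often pin tensors and modules to CUDA explicitly. The snippet below is a hypothetical before-and-after of that kind of cleanup, not Groq's actual porting workflow.

```python
# Hypothetical before/after of removing a GPU-specific hard coding from a
# model repo -- the flavor of cleanup described, not Groq's actual process.
import torch

# Before: the code assumes an NVIDIA GPU is present.
# x = torch.randn(8, 1024).cuda()
# model = MyModel().cuda()

# After: the device becomes a parameter, so another backend or compiler can
# decide where tensors actually live.
def run(model: torch.nn.Module, x: torch.Tensor, device: str = "cpu"):
    model = model.to(device)
    x = x.to(device)
    with torch.no_grad():
        return model(x)
```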
So just to clarify on that point: these models would kind of roll out, and you would build into the compiler less vendor-specific, you know, more general functionality over time, which would expand your ability to support certain types of operations,
but you wouldn't necessarily be able to say, hey, I've got this random model, some research team created their own architecture, right, of this crazy thing. It may take some effort to map that into the Groq software stack. But maybe, if I'm hearing right, there's less burden over time as the ecosystem develops. Is that the right way to interpret that?
Partially, yes. But I would add that if you think about what the Groq system is at the heart of it, it's matrix multiplications and vector-matrix multiplications, and that's what most machine learning models are. Yes, when we have a generational shift like transformers, one might want to go and look at
what's the new model type and how well it maps to our hardware. We might want to have some strategies to address some of that. But fundamentally, models haven't changed all that much since transformers were introduced.
Now, you kind of hear about diffusion models, even in the text world most recently. But as long as these fundamentals don't change frequently, I think our core belief of
just supporting this wide ecosystem of models continues to hold up. If you look at other AI accelerators, some of them have gone and hard-coded, let's say, to the transformer architecture itself.
And their bet is that super-specialization is the way to go. But our belief is that we would like to support a wider range of models. And that's pretty much what our compiler system does, mapping between this high-level, let's say, PyTorch model and the Groq platform, converting it to, let's say,
an intermediate layer where the compiler can work independently of what model it is. So there's no hard coupling, let's say, to a particular model, or even to an architecture type. It's very low coupling.
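To make the "at heart it's matrix multiplication" point concrete, here is a tiny NumPy sketch of a transformer-style feed-forward block: two matrix multiplications with an elementwise nonlinearity in between. Shapes and weights are arbitrary placeholders; this illustrates the claim about model structure, not how Groq actually executes it.

```python
# A transformer-style feed-forward block is, at its core, two matrix
# multiplications with an elementwise nonlinearity in between. Shapes and
# weights here are random placeholders, purely to illustrate the point.
import numpy as np

tokens, d_model, d_ff = 16, 512, 2048
x = np.random.randn(tokens, d_model)
W1 = np.random.randn(d_model, d_ff)
W2 = np.random.randn(d_ff, d_model)

hidden = np.maximum(x @ W1, 0.0)  # matmul + ReLU
out = hidden @ W2                 # matmul
print(out.shape)                  # (16, 512)
```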
I'm curious... I've been really kind of spinning on the speed of what you're talking about in terms of inference, and some of the capabilities that your stack offers. In general, the model ecosystem has been developing through the second half of last year and raging into this year, kind of agentic AI, and then that's kind of evolving into physical AI. And so you're dealing with robotics
and autonomy and things like that that you're supporting, to where we're expecting an explosion of devices out there in the world that these systems are supporting. What is your strategy and approach going forward for thinking about the kind of physical AI that we're evolving into, where you have
agents that are interacting with physical devices that are interacting with us in the real world? So it's not all in the data center, but the data center is supporting that. How does that fit into your overall view going forward? - Yeah, I think the AI industry evolves very rapidly. Personally, I don't think there can be any long-term strategy which will not need adjustments based on developments.
But our belief is still that edge-based deployments calling into models over APIs will be the preferred interface going forward, for a long time.
So sure, your mobile chip, let's say, might be able to perform some basic-level tasks over there, but if you need really high-accuracy, high-quality model inference, doing this over an API, I think, would get you there, compared to the model size you can actually deploy on a mobile phone, which is just another example of an edge device.
I have one question for you, just as an engineer that has been working at the kind of forefront of this inference technology. What have been some of the challenges that, I guess, you faced as you really dug into these problems, maybe ones that were unexpected, or maybe they were expected for you? What would have been some of the biggest challenges, and maybe some learnings, that looking back on your time working on this system, you can share with the audience? Yeah, no, great question.
As I said, I think the AI industry moves really fast and sometimes there are these shifts, right? So we saw this shift to large language models, and that's when the company itself kind of pivoted to focus on this. So Meta releasing Llama and the Llama 2 series of models was really what got our company to focus on this side and really push on this, right?
So similarly, I think, we are a startup. We are always pushing on all fronts, always trying to improve on things. So whenever there's some new architectural change, we look to see how we could best adapt our system to maximize throughput. So sometimes there are these kinds of changes, and
this is something which actually excites me about Groq and working at such a talent-dense company. My colleagues come up with really great, exciting new ways of doing things, to really push the bar on some of these things. So maybe it's, like, a mixture of experts, or reasoning models. Whenever something new comes up, right,
I think getting the maximum performance out of that is something we care about. We deeply care about it. And yeah, I think that's been one of the key areas. Awesome. Well, as we get close to an end point here, this has been fascinating. I'm wondering, DJ, if you could just close this out by sharing
some of the things that you think about personally going into this next year. As you mentioned, things are moving so fast; there are shifts that are happening. What are some of the things that are most exciting for you as you head into this next year of development and work?
So as a developer and, like, amateur data scientist, I would say that for me, the push on the coding side of the AI world has been very exciting. It helps me think about how I can have more impact, whether it's at Groq or in the world in general. So the push of AI on the coding side, reasoning models, multiple modalities,
And the fusion of all of this, right? I think that's what I really want to look forward to for the next couple of years. There's of course the robotics bit which we touched upon, but that I feel is probably a couple of years down the line.
Awesome. Well, thank you, DJ, for representing Groq, and congratulations on what you and the team have achieved, which is really amazing and monumental work. So great work, keep it going. We'll be excited to follow the story and hope to get an update again on the podcast sometime soon. Thanks. Sounds great, guys. Thanks for having me.
All right, that is our show for this week. If you haven't checked out our Changelog newsletter, head to changelog.com/news. There you'll find 29 reasons, yes, 29 reasons why you should subscribe.
I'll tell you reason number 17: you might actually start looking forward to Mondays. Sounds like somebody's got a case of the Mondays. 28 more reasons are waiting for you at changelog.com/news. Thanks again to our partners at Fly.io, to Breakmaster Cylinder for the beats, and to you for listening. That is all for now, but we'll talk to you again next time.