If you've ever wondered how generative AI works and where the technology is heading, this episode is for you. We're going to explain the basics of the technology and then catch up with modern-day advances like reasoning, to help you understand exactly how it does what it does and where it might advance in the future. That's coming up with SemiAnalysis founder and chief analyst Dylan Patel, right after this.
From LinkedIn News, I'm Leah Smart, host of Every Day Better, an award-winning podcast dedicated to personal development. Join me every week for captivating stories and research to find more fulfillment in your work and personal life. Listen to Every Day Better on the LinkedIn Podcast Network, Apple Podcasts, or wherever you get your podcasts. Did you know that small and medium businesses make up 98% of the global economy, but most B2B marketers still treat them with a one-size-fits-all approach?
LinkedIn's Meet the SMB report reveals why that's a missed opportunity and how you can reach these fast-moving decision makers effectively. Learn more at linkedin.com/meet-the-smb.
Welcome to Big Technology Podcast, a show for cool-headed and nuanced conversation of the tech world and beyond. We're joined today by SemiAnalysis founder and chief analyst Dylan Patel, a leading expert in semiconductor and generative AI research, and someone I've been looking forward to speaking with for a long time.
Now, I want this to be an episode that, A, helps people learn how generative AI works, and B, is an episode that people will send to their friends to explain to them how generative AI works. I've had a couple of those that I've been sending to my friends, colleagues, and counterparts about what is going on within generative AI. One is the three-and-a-half-hour video from Andrej Karpathy explaining everything about training large language models.
And the second is a great episode that Dylan and Nathan Lambert from the Allen Institute for AI did with Lex Fridman.
Both of those are three hours plus, so I want to do ours in an hour. And I'm very excited to begin. So Dylan, it's great to see you, and welcome to the show. Thank you for having me. Great to have you here. Let's just start with tokens. Can you explain how AI researchers take words, and parts of words, and give them numerical representations? So what are tokens? Tokens are...
in fact, chunks of words, right? In human terms, you can think of syllables: syllables are chunks of words, they carry some meaning, and they're the base unit of speech. For models, tokens are the base unit of output. They're all about compressing language into its most efficient representation. From my understanding, AI models are very good at predicting patterns. So if you give one 1, 3, 5, 7, 9, it might know the next number is going to be 11. And so what it's doing with tokens is taking words, breaking them down into their component parts, assigning them a numerical value,
And then basically, in its own language, learning to predict what number comes next, because computers are better at numbers, and then converting that number back to text. And that's what we see come out. Is that accurate? Yeah, that's right.
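To make the words-to-numbers step concrete, here's a minimal sketch in Python of what a tokenizer does. The vocabulary here is tiny and hand-picked purely for illustration; real tokenizers (byte-pair encoding and similar) learn their sub-word chunks from data and have vocabularies of tens of thousands of entries.

```python
# Toy tokenizer with a made-up vocabulary, just to illustrate the idea.
# Real tokenizers (e.g. byte-pair encoding) learn these chunks from data.
vocab = {"the": 0, " sky": 1, " is": 2, " blue": 3, " red": 4}
id_to_token = {i: t for t, i in vocab.items()}

def encode(text):
    """Greedily match the longest known chunk at each position."""
    ids, i = [], 0
    while i < len(text):
        best = None
        for token, token_id in vocab.items():
            if text.startswith(token, i) and (best is None or len(token) > len(best[0])):
                best = (token, token_id)
        if best is None:
            raise ValueError(f"no token covers position {i}")
        ids.append(best[1])
        i += len(best[0])
    return ids

def decode(ids):
    return "".join(id_to_token[i] for i in ids)

print(encode("the sky is blue"))  # [0, 1, 2, 3]
print(decode([0, 1, 2, 4]))       # "the sky is red"
```

The model itself only ever sees the ID sequences; the text you read back is just the decode step at the end.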
And each individual token isn't just one number, right? It gets mapped to a vector with many dimensions. You can think of it like this: the model needs to learn that king and queen are actually extremely similar across most of the English language, except there's a dimension along which they're very different, because a king is male and a queen is female. And then beyond that, in language, kings are often written about as conquerors, and all these other historical associations. So while they're both royal, regal, monarchy, et cetera, there are many dimensions along which they differ. So it's not just converting a word into one number; it's converting it into a vector, and the model learns what each of those dimensions means. You don't initialize the model with, hey, king means male monarch and is associated with war and conquering because that's what most of the writing about kings in history is about; people don't write much about the daily lives of kings, they mostly write about their wars and conquests. Each of these numbers in this embedding space gets assigned over time: as the model reads the internet's text and trains on it, it starts to realize, oh, king and queen are very similar along these dimensions but very different along those. And you don't explicitly tell the model, hey, this is what this dimension is for. One dimension could end up meaning something like, is it a building or not? You don't know that ahead of time; it just happens in the latent space. And then all of these dimensions relate to each other. But yeah, these numbers
are an efficient representation of words, because you can do math on them, right? You can multiply them, you can divide them, you can run them through an entire model. And your brain does something similar: when it hears something, it converts it into frequencies in your ear, and those get converted into signals that travel through your brain. A tokenizer is playing the same role, although it's obviously a very different medium of compute: ones and zeros and matrix multiplication are what's efficient for computers, whereas human brains are more analog in nature and think more in waves and patterns. So while they're very different, the idea is the same: language is not actually how our brain thinks, it's just a representation for it to reason over.
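As a rough illustration of the king/queen point, here's a sketch with made-up four-dimensional embeddings. Real models learn these values and use hundreds or thousands of dimensions, and nobody assigns the dimensions meanings by hand; the numbers and the "royalty/gender/war/building" labels below are invented purely for illustration.

```python
import numpy as np

# Hypothetical embeddings; pretend the dimensions loosely ended up meaning
# [royalty, gender, war/conquest, building].
king   = np.array([0.9,  0.3, 0.7, 0.0])
queen  = np.array([0.9, -0.3, 0.6, 0.0])
castle = np.array([0.3,  0.0, 0.1, 0.9])

def cosine_similarity(a, b):
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

print(cosine_similarity(king, queen))   # ~0.86: similar along most dimensions
print(cosine_similarity(king, castle))  # ~0.30: a different kind of thing
print(king - queen)                     # the difference sits mostly in the "gender" dimension
```

Because words are now points in a space, "similar meaning" becomes "small distance," which is exactly the kind of thing you can do math on.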
Yeah, so that's crazy. So the tokens are this efficient representation of words, but more than that, the models are also learning the way all these words are connected. And that brings us to pre-training. From my understanding, pre-training is when you take basically the entire internet's worth of text and use it to teach the model these relationships between tokens. So, like we talked about, if you gave a model "the sky is," and the next word is typically blue, then through pre-training, which covers basically all of the language on the internet, it should learn that the next token is blue. So what you do is you want to make sure that when the model is outputting information, it's closely tied to what that next value should be. Is that a proper description of what happens in pre-training? Yeah, I think that's the objective function, which is just to reduce loss, i.e., how often is the next token predicted incorrectly versus correctly. Right, so if you said the sky is red...
that's not the most probable outcome, so that would be wrong. But that text is on the internet, right? Because the Martian sky is red, and there are all these books about Mars and sci-fi. Right. So how does the model learn to figure this out, and in what context it's accurate to say blue versus red? Right. So, first of all, the model doesn't just output one token; it outputs a distribution. And the way most people decode it is to take the top of that distribution, the highest-probability tokens. So yes, blue is obviously the right answer if you ask anyone on this planet. But there are situations and contexts where "the sky is red" is the appropriate sentence, and that's not in isolation, right? It's when the prior passage is all about Mars, and there's a quote from a Martian settler, and it says "the sky is," and then the correct token, the correct word, is actually red. And the model has to know this through the attention mechanism. If it were just "the sky is blue" always, you're going to output blue, because blue is, let's say, 80 percent, 90 percent, 99 percent likely to be the right option. But as you start to add context about Mars, or any other planet, right, other planets have differently colored atmospheres, I presume, the distribution starts to shift. If I add "we're on Mars" before "the sky is," then in that context window, the text you sent to the model, the attention realizes that "the sky is" is preceded by the stuff about Mars, and blue rockets down to, let's call it, 20 percent probability while red rockets up to 80 percent. The model outputs that distribution, and then most people just end up taking the top probability and outputting it to the user.
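Here's a toy sketch of that shift, with made-up scores for just three candidate tokens; a real model assigns a score to every token in a vocabulary of tens of thousands, but the softmax-then-pick step is the same idea.

```python
import numpy as np

def softmax(logits):
    e = np.exp(logits - np.max(logits))
    return e / e.sum()

tokens = [" blue", " red", " clear"]

# Hypothetical raw scores (logits) for the next token after two different prompts.
after_plain = np.array([4.0, 0.5, 1.0])   # "The sky is"
after_mars  = np.array([1.0, 3.5, 0.5])   # "We are standing on Mars. The sky is"

print(dict(zip(tokens, softmax(after_plain).round(2))))  # blue dominates (~0.93)
print(dict(zip(tokens, softmax(after_mars).round(2))))   # red dominates (~0.88)

# Greedy decoding just takes the single highest-probability token:
print(tokens[int(np.argmax(softmax(after_mars)))])       # " red"
```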
And how does the model learn that? That's the attention mechanism, right? And this is sort of the beauty. What is that? Yeah. The attention mechanism is the beauty of modern large language models. It takes the relational value, in this vector space, between every single token. So, "the sky is blue": yes, blue is the next token after "the sky is." In a lot of older-style models, you would just predict the exact next word, and after "sky," obviously, it could be many things. It could be "blue," but it could also be "scraper," right? Skyscrapers, yeah, that makes sense. But what attention does is take all of these various values, the query, the key, and the value, which represent what you're looking for, where you're looking, and what that value is. And you're calculating mathematically what the relationship is between all of these tokens. So going back to the king and queen representation, the way those two words interact is now calculated, and the way every word in the entire passage you sent relates to every other word is calculated and tied together. Which is why models have challenges with how many documents you can send them, right? Because if you're sending them
just a question like, what color is the sky? Okay, it only has to calculate the attention between those few words. But if you're sending it 30 books of insurance claims and all these other things, and asking, okay, figure out what's going on here, is this a valid claim or not? Then all of a sudden it's, okay, I've got to calculate the attention of not just the last five words to each other, but every one of 50,000 words to every other word, which ends up being a ton of math. Back in the day, the best language models were actually a different architecture entirely. But at some point transformers, which modern large language models are primarily based on, rocketed past them in capability, because they were able to scale and because the hardware got
there. And then we were able to scale them so much that we could put not just some text in them, not just a lot of text or a lot of books, but the entire internet, which one could view as a microcosm of all human culture and learning and knowledge, to a large extent, because most books are on the internet and most papers are on the internet. Obviously there's a lot missing from the internet too.
But this is the modern magic: it was sort of three different things coming together at once, right? An efficient way for models to relate every word to every other word, the compute necessary to scale to enough data, and then someone actually pulling the trigger to do it at a scale where it became useful, which was roughly the GPT-3.5 or GPT-4 level, where it became extremely useful for normal humans to use as chat models.
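For anyone who wants to see the mechanism itself, here's a minimal single-head scaled-dot-product attention sketch in NumPy. The weight matrices are random stand-ins; in a trained model they're learned, and real transformers stack many heads and many layers, with masking so a token can only attend to earlier tokens.

```python
import numpy as np

def attention(X, Wq, Wk, Wv):
    """Single-head scaled dot-product attention over a sequence of token vectors X."""
    Q = X @ Wq                              # queries: what each token is looking for
    K = X @ Wk                              # keys:    what each token can be found by
    V = X @ Wv                              # values:  the information each token carries
    d = Q.shape[-1]
    scores = Q @ K.T / np.sqrt(d)           # every token scored against every other token
    weights = np.exp(scores - scores.max(axis=-1, keepdims=True))
    weights = weights / weights.sum(axis=-1, keepdims=True)   # softmax per row
    return weights @ V                      # each output mixes information from the whole sequence

rng = np.random.default_rng(0)
n_tokens, d_model = 6, 8                    # a short made-up sequence of 6 token vectors
X = rng.normal(size=(n_tokens, d_model))
Wq, Wk, Wv = (rng.normal(size=(d_model, d_model)) for _ in range(3))

print(attention(X, Wq, Wk, Wv).shape)       # (6, 8): one updated vector per token

# The scores matrix is n_tokens x n_tokens. That all-pairs comparison is why attention
# over 50,000 words of insurance documents costs vastly more than over a five-word question.
```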
Okay. And so why is it called pre-training? So pre-training is called that because it's what happens before the rest of the training of the model, right? The objective function in pre-training is just to predict the next token, but predicting the next token is not what humans want to use AIs for. I want to ask it a question and have it answer.
But in most cases, asking a question does not necessarily mean that the next most likely token is the answer. Oftentimes it's another question. For example, if the model ingested the entire SAT and I asked it a question, the most likely next tokens would be the answer choices: "A is this, B is this, C is this, D is this." And no, I just want the answer.
And so the reason it's called pre-training is because you're ingesting humongous volumes of text no matter the use case, and you're learning the general patterns across all of language. The model doesn't start out knowing that king and queen relate to each other in these ways, or that they're opposites in those ways. So it's called pre-training because you must first get a broad, general understanding of the entire world of text
before you're able to then do post-training, or fine-tuning, which is: let me train it on more specific data that is specifically useful for what I want it to do. Whether that's, in chat-style applications, when I ask a question, give me the answer. Or handling requests like, teach me how to build a bomb: well, obviously, no, I don't want the model to teach people how to build a bomb, so it has to learn not to do that. And it's not that, when you're doing pre-training, you filter out all of this data. In fact, there's a lot of good, useful data in that area, because there's a lot of useful information about, say, C4 chemistry, and people want to use the model for chemistry, right? So you don't want to filter out everything so that the model knows nothing about it. But at the same time, you don't want it to output instructions for how to build a bomb. So there's a fine balance here, and that's partly why pre-training is defined as "pre": you're still letting it learn things, and inputting things into the model, that are theoretically quite bad. For example, books about killing, or war tactics, or whatever, things where you could plausibly say, oh, well, maybe that's not okay, or wild descriptions of really grotesque things all over the internet. But you want the model to learn these things.
Right. Because first you build the general understanding, before you say, okay, now that you've got a general framework of the world, let's align you, so that you, with this general understanding of the world, can figure out what is useful for people and what is not useful for people. What should I respond to? What should I not respond to?
So what happens in the training process? Is it that the model is attempting to make the next prediction and then just trying to minimize loss as it goes? Right, right. I mean, basically,
loss, in the simplest terms, is how often you're wrong versus right. You'll run passages through the model and see how often the model got it right. When it got it right, great, reinforce that. When it got it wrong, figure out which "neurons" in the model, quote unquote, you can tweak to fix the answer, so that when you run it through again, it actually outputs the correct one. And then you move the model slightly in that direction. Now, obviously, the challenge with this is that you can come up with a simplistic configuration where the neurons just output "blue" every single time they see "the sky is." But then when it gets to
a sentence like, the color blue is commonly used on walls because it's soothing, and it's, oh, what's the next word? Soothing. That's a completely different context. And understanding that blue is soothing, and that the sky is blue, and that those two facts aren't related to each other but are both related to blue, is very important. And so oftentimes you'll run through the training data set multiple times, right? Because the first time through, maybe you just memorized that the sky is blue, and the wall is blue, and that when people describe art they often use the color blue. And over time, as you go through all this text in pre-training, yes, you're minimizing loss initially by just memorizing, but over time, because you're constantly overwriting the model, it starts to learn the generalization, i.e., blue is a soothing color, it also describes the sky, and it's also used in art for either of those motifs.
Right. And so the goal of pre-training is that you don't want it to memorize. In school you memorize all the time, and that's not that useful, because you forget everything you memorize. But if you get tested on something, and then tested again six months later, and again six months after that, eventually you don't really memorize it anymore; you just know it innately, you've generalized. And that's the real goal you want out of the model. But that's not something you can directly measure, whereas loss is something you can measure for a given batch of text. You train the model in steps: every step you're inputting a bunch of text, you see where the model predicted the right token and where it didn't, and you adjust the neurons. Okay, on to the next batch of text. And you do these batches over and over and over again across trillions of words of text, right?
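Here's a compressed sketch of one of those training steps in PyTorch. The "model" is a deliberate stand-in, just an embedding layer and a linear layer rather than a real transformer stack, and the token IDs are random, but the shape of the loop is the real one: predict every next token, measure the cross-entropy loss, nudge the weights, repeat.

```python
import torch
import torch.nn as nn

vocab_size, d_model = 1000, 64

# Stand-in model: embed each token, project back to scores over the vocabulary.
# A real LLM puts a deep stack of transformer layers between these two steps.
model = nn.Sequential(nn.Embedding(vocab_size, d_model), nn.Linear(d_model, vocab_size))
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-3)
loss_fn = nn.CrossEntropyLoss()

batch = torch.randint(0, vocab_size, (8, 128))   # 8 fake "passages" of 128 token IDs each
inputs, targets = batch[:, :-1], batch[:, 1:]    # inputs are tokens 0..n-1, targets are tokens 1..n

logits = model(inputs)                           # (8, 127, vocab_size) scores for each position
loss = loss_fn(logits.reshape(-1, vocab_size), targets.reshape(-1))
loss.backward()                                  # which weights contributed to the wrong guesses?
optimizer.step()                                 # nudge them slightly in the direction that reduces loss
optimizer.zero_grad()

print(loss.item())                               # the "loss": average wrongness of the next-token guesses
```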
And as you step through, you might think, oh, well, I'm done. But if I go back to that first group of text, which was all about the sky being blue, it might get the answer wrong, because maybe later in training it saw some passages about sci-fi and how the Martian sky is red, so it overwrote that. But over time, as you go through the data multiple times, as you see it on the internet multiple times, in different books multiple times, whether it's scientific or sci-fi or whatever, the model starts to learn the representation: oh, on Mars the sky is red, because the atmospheric makeup is one way, whereas the atmospheric makeup on Earth is a different way. And so that's the whole point of pre-training: to minimize loss. But the nice side effect is that the model initially memorizes, then it stops memorizing and it generalizes, and that's the useful pattern we want. Okay, that's fascinating. We've touched on post-training a bit, but just to recap,
post-training is, so you have a model that's good at predicting the next word, and in post-training you sort of give it a personality by feeding it sample conversations, so the model emulates the values you want it to take on. Yeah, so post-training can be a number of different things. The simplest way of doing it is, yeah, basically what you described:
Pay for humans to label a bunch of data, take a bunch of example conversations, et cetera, and input that data and train on that at the end, right? And so that example data is useful, but this is not scalable, right? Like using humans to train models is just so expensive, right? So then there's the magic of sort of reinforcement learning and other synthetic data technologies, right, where the model is helping teach the model.
right? So you'll have many models involved in post-training, where, yes, you have some example human data, but human data doesn't scale that fast. The internet has trillions and trillions of words out there, whereas even if Alex and I wrote words all day long for our whole lives, we'd have written millions, maybe hundreds of millions of words. It's nothing; it's orders of magnitude off from the number of words required. So then you have the model take some of this example data, and you have various models surrounding the main model that you're training. These can be policy models, teaching it, hey, is this what you want or is that what you want; reward models, is that a good response or a bad response; value models, hey, grade this output. And you have all these different models working in conjunction, and
different companies have different objective functions, right? In the case of Anthropic, they want their model to be helpful, harmless, and safe. So be helpful, but also don't harm anyone or anything, and be safe. In other cases, like Grok, Elon's model from xAI, it mostly just wants to be helpful, and maybe it has a little bit of a right lean to it. And for other folks, I mean, most AI models are made in the Bay Area, so they tend to lean a bit left, and the internet in general leans a little left because it skews younger. So all these things affect models. But it's not just about politics. Post-training is also just about teaching the model. If I say, the movie where, you know,
the princess has a slipper and it doesn't fit. If I put that into a base model that had only been pre-trained, the answer wouldn't be, oh, the movie you're looking for is Cinderella. It only learns to do that once it goes through post-training. Because a lot of the time people just throw garbage into the model, and the model still figures out what you want, right? That's part of what post-training is. You can do stream of consciousness into models, and oftentimes it'll figure out what you want: whether it's a movie you're looking for, or help answering a question, or you throw a bunch of unstructured data at it and ask it to turn it into a table, it does this. And that's because of all these different aspects of post-training: example data, but also generating a bunch of data and grading it and seeing whether it's good or not,
and whether it matches the various policies you want. Is it helpful? A lot of times grading is based on multiple factors: there could be a model that asks, hey, is this helpful? Is this safe? And what counts as safe? So that safety model needs to be tuned on human data too. So it's quite a complex thing, but the end goal is to get the model to output in a certain way. Models aren't always just about humans chatting with them, either. There can be models that are focused on, say, code: yes, it was trained on the whole internet, because the person is going to talk to the model using text, but if it doesn't output code, penalize it. Now, all of a sudden, the model will basically never output plain text again; it'll only output code.
And so these sorts of models exist too. So post-training isn't just a single-variable thing; it's about which variables you want to target. And that's why models from different companies have different personalities, why they target different use cases, and why it's not just one model that rules them all, but actually many.
That's fascinating. So that's why we've seen so many different models with different personalities: it all happens in the post-training phase. And when you talk about giving the models examples and feedback to follow, that's what reinforcement learning with human feedback is: humans give examples and feedback, and the model learns to emulate what the human trainer wants it to embody. Is that right? Yeah, exactly.
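To make one piece of that concrete: a common way to use the human feedback is to train a separate reward model on pairs of responses where a human marked one as better. Here's a sketch; the network is a stand-in (a real reward model is a full language model with a scalar scoring head), but the pairwise loss is the standard one used for this.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

d_model = 64

# Stand-in reward model: scores a response summary vector with a single number.
reward_model = nn.Sequential(nn.Linear(d_model, d_model), nn.ReLU(), nn.Linear(d_model, 1))
optimizer = torch.optim.AdamW(reward_model.parameters(), lr=1e-4)

# Fake features for a batch of (chosen, rejected) response pairs labeled by humans.
chosen, rejected = torch.randn(16, d_model), torch.randn(16, d_model)

r_chosen = reward_model(chosen)        # score for the response the human preferred
r_rejected = reward_model(rejected)    # score for the response the human passed on

# Pairwise preference loss: push the preferred response's score above the other one's.
loss = -F.logsigmoid(r_chosen - r_rejected).mean()
loss.backward()
optimizer.step()
optimizer.zero_grad()
```

Once trained, the reward model stands in for the human: the main model generates lots of responses, the reward model grades them, and reinforcement learning pushes the main model toward the highly graded ones. That's how a limited amount of expensive human labeling gets stretched across a huge amount of model-generated data.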
Okay, great. All right. So in the first half we've covered what training is, what tokens are, what loss is, and what post-training is. Post-training, by the way, is also called fine-tuning. We've also covered reinforcement learning with human feedback. We're going to take a quick break, and then we're going to talk about reasoning. We'll be back right after this.
And we're back here on Big Technology Podcast with Dylan Patel. He's the founder and chief analyst at SemiAnalysis. He actually has great analysis of NVIDIA's recent GTC conference, which we covered on a recent episode. You can find SemiAnalysis at semianalysis.com. It's both content and consulting, so definitely check in with Dylan for all those needs. And now we're going to talk a little bit about reasoning.
Because a couple of months ago, and Dylan, this is really where I entered the picture, watching your conversation with Lex and Nathan Lambert about the difference between reasoning models and traditional LLMs, large language models. If I gathered it right from your conversation, what reasoning is, is that instead of the model just predicting the next word based off of its training, it uses tokens to spend more time figuring out what the right answer is and then comes out with a new prediction. I think Karpathy does a very interesting job in that YouTube video talking about how models think with tokens: the more tokens there are, the more compute they use, because they're running those predictions through the transformer model we discussed, and therefore they can come to better answers. Is that the right way to think about reasoning?
So I think that humans are also fantastic at pattern matching, right? We're really good at recognizing things. But for a lot of tasks it's not an immediate response; we're thinking, whether that's thinking through words out loud, thinking through words in an inner monologue in our head, or just processing somehow, and then we know the answer. And it's the same for models. Models are, historically, horrendous at math. You could ask one, is 9.11 bigger than 9.9? And it would say yes, it's bigger, even though everyone knows that 9.11 is smaller than 9.9. And that happened because the models didn't think or reason. And it's the same for you, Alex, or for me: if someone asked me,
17 times 34, I'd be like, I don't know, not right off the top of my head. But give me a little bit of time, I can do some long-form multiplication and I can get the answer, right? And that's because I'm thinking about it.
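That "little bit of time" is just a chain of easy intermediate steps, for example:

17 × 34 = 17 × 30 + 17 × 4 = 510 + 68 = 578

A reasoning model is trained to write intermediate steps like these out as tokens before it commits to a final answer, rather than blurting out a single guess.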
And it's the same thing with reasoning for models. When you look at a transformer, every token it outputs has the same amount of compute behind it. I.e., in "the sky is blue," the "the" and the "blue" take the same amount of compute to generate. And that's not exactly what you want: you want to spend more compute on the hard parts and not on the easy parts.
And so reasoning models are effectively teaching large pre-trained models to do this, right? Hey, think through the problem. Hey, output a lot of tokens. Think about it, generate all this text. And then when you're done, start answering the question. But now you have all of this stuff you generated in your context, right?
And that stuff you generated is helpful, right? It can be all sorts of things, just like any human's thought patterns are. And this is the new paradigm we entered maybe six months ago, where models now think for some time before they answer. And this enables much better performance on all sorts of tasks, whether it's coding or math or understanding science or understanding complex social dilemmas, all sorts of different topics they're much, much better at. And this is done through post-training, similar to the reinforcement learning from human feedback we mentioned earlier, but there are also other forms of post-training, and that's what makes these reasoning models. Before we head out, I want to hit on a couple of things. First of all, the growing efficiency of these models. I think one of the things people focused on with DeepSeek was that it was just able to be much more efficient in the way it generates answers. And there was obviously this big reaction in Nvidia stock, which fell about 17% on the Monday after DeepSeek weekend, because people thought we wouldn't need as much compute. So can you talk a little bit about how models are becoming more efficient and how they're doing it? Yeah, so there's a variety of things here. The beauty of AI is not just that we continue to build new capabilities, right?
Because those new capabilities are going to benefit the world in many ways, and there's a lot of focus on those. But there's also a lot of focus on, well, getting to that next level of capabilities via the scaling laws, i.e., the more compute and data I spend, the better the model gets. And then the other vector is, well, can I get to the same level with less compute and data? And those two things go hand in hand, because if I can get to the same level with less compute and data, then I can spend more compute and data and get to a new level. And so AI researchers are constantly looking for ways to make models more efficient, whether through algorithmic tweaks, data tweaks, tweaks to how you do reinforcement learning, and so on.
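For reference, the published scaling-law results (the Kaplan et al. and Chinchilla papers) find that pre-training loss falls roughly as a power law in compute, something like

L(C) ≈ (C₀ / C)^α, where C is training compute, C₀ a constant, and α a small positive exponent,

so each fixed improvement in loss costs a roughly constant multiple of extra compute. That's the "spend more, get better, but with diminishing returns" curve, and efficiency work effectively shifts the curve so the same loss is reachable with less compute.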
And so when we look at models across history, they've constantly gotten cheaper and cheaper at a stupendous rate. One easy example is the GPT-3 class, because there's GPT-3, GPT-3.5 Turbo, Llama 2 7B, Llama 3, Llama 3.1, Llama 3.2, right? As that lineage has progressed, we've gone from, hey, it costs $60 per million tokens, to it costs about 5 cents now for the same quality of model. And the models have shrunk dramatically in size as well. And that's because of better algorithms, better data, et cetera. And what happened with DeepSeek was similar.
You know, OpenAI had GPT-4, then they had 4 Turbo, which was half the cost. Then they had 4o, which was again half the cost. And then Meta released Llama 405B, open source, so the open-source community was able to run that, and that was again cheaper, roughly half the cost, or even 5x lower, than 4o, which was lower than 4 Turbo and 4. But DeepSeek came out with another tier, right? So when we look at GPT-3, the cost fell about 1200x, from GPT-3's initial cost to what you can get with Llama 3.2 3B today. And likewise, when we look at GPT-4 to DeepSeek V3, we're not quite at that 1200x, but it has fallen from $60 per million tokens to about a dollar, or less than a dollar, so roughly 60x.
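As a quick check on those figures, using the dollar amounts quoted:

$60 per million tokens ÷ 1200 = $0.05 per million tokens (the "5 cents" figure for GPT-3-class quality)
$60 per million tokens ÷ 60 = $1 per million tokens (the roughly-a-dollar figure for GPT-4-class quality with DeepSeek V3)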
And so you've got this massive cost decrease, but it's not necessarily out of bounds, right? We've already seen it. I think what was really surprising
was that it was a Chinese company, for the first time. Because Google and OpenAI and Anthropic and Meta have all traded blows, whether it's OpenAI or Anthropic being on the leading edge, or Google and Meta being close followers, sometimes with a new feature and sometimes just being much cheaper. We had not seen this from any Chinese company, and now we have a Chinese company releasing a model that's cheap. It's not unexpected, right? This is actually within the trend line: what happened with GPT-3 is happening to GPT-4-level quality with DeepSeek. It's more surprising that it's a Chinese company, and that's, I think, why everyone freaked out, and then a lot of other narratives grew from there. If Meta had done this, I don't think people would have freaked out. And Meta is going to release their new Llama soon enough, and that one is going to have a similar level of cost decrease, probably in a similar range to DeepSeek V3. It's just that people aren't going to freak out, because it's an American company and it was sort of expected.
All right, Dylan, let me ask you the last question. You mentioned, I think, the bitter lesson, which is basically, and I'm going to be kind of facetious in summing it up, that the answer to all questions in machine learning is just to make bigger models, and scale solves almost all problems. So it's interesting that we have this moment where models are becoming way more efficient, but we also have massive, massive data center build-outs.
I think it would be great to hear you kind of recap the size of these data center build outs and then answer this question. If we are getting more efficient, why are these data centers getting so much bigger? And what might that added scale get in the world of generative AI for the companies building them?
Yeah, so when we look across the ecosystem at data center build-outs, we track all the build-outs and server purchases and supply chains here, and the pace of construction is incredible. You can pick a state and see new data centers going up, all across the US and around the world. And so you see the capacity of the largest-scale training supercomputers go from, hey, years ago it wasn't even a few hundred million dollars, then for GPT-4 it was a few hundred million dollars and one building full of GPUs, then GPT-4.5 and the reasoning models like o1 and o3 were done in three buildings on the same site for billions of dollars, to, hey, these next-generation things people are building are tens of billions of dollars, like OpenAI's data center in Texas called Stargate, right?
with Crusoe and Oracle, et cetera, right? And likewise Elon Musk, who is building his data centers in an old factory with a bunch of gas generation outside, doing all these crazy things to get the data center up as fast as possible. You can go to basically every company and they have these humongous build-outs. And this is because of the scaling laws, right? Roughly 10x more compute for a linear improvement in capability; it's logarithmic. But then you end up with this very confusing thing, which is: hey, models keep getting better as we spend more, but also the model that we had a year ago can now be done way, way cheaper, oftentimes 10x cheaper or more, just a year later.
So then the question is, why are we spending all this money to scale? And there are a few things here. One, you can't actually make that cheaper model without first making the bigger model, so you can generate data to help you make the cheaper model. That's part of it. But another part of it is that
if we were to freeze AI capabilities where they were in, what was it, March 2023, two years ago when GPT-4 was released, and only made the models cheaper, like DeepSeek is much cheaper and much more efficient but roughly the same capability as GPT-4, that would not pay for all of these build-outs. AI is useful today, but it's not capable of doing a lot of things. But if we make the models way more efficient and then continue to scale, and we have this stair-step, where we increase capabilities massively, make them way more efficient, increase capabilities massively, make them way more efficient, then you end up creating all these new capabilities that could, in fact, pay for these massive AI build-outs. So with these $10 billion data centers, no one is trying to make chat
models, right? They're not trying to make models that people chat with, just to be clear. They're trying to solve things like software engineering and make it automated, which is a trillion-dollar-plus industry. So these are very different use cases and targets. And so it's the bitter lesson, because yes, you can spend a lot of time and effort making clever, specialized methods based on intuition, and you should, but these things should also just have a lot more compute thrown behind them, because if you make them more efficient and follow the scaling laws up, they'll also just get better and you can unlock new capabilities. And so today, a lot of AI models, the best ones from Anthropic, are now useful for coding as an assistant alongside you, where you're going back and forth. As time goes on, as you make them more efficient and continue to scale them, the possibility is that, hey, it can code for ten minutes at a time and I just review the work, and it makes me five times more productive, and so on and so forth. And this is where the reasoning models and the scaling argument come in. Yes,
we can make it more efficient, but that alone isn't going to solve the problems we have today, right? The earth is still going to run out of resources. We're going to run out of nickel, because we can't make good enough batteries and we can't make enough batteries, so with current technology we can't replace all of the gas and coal with renewables. All of these things are going to happen unless you continue to improve AI and invent, or just generally research, new things, and AI helps us research new things.
Okay, this is really the last one. Where is GPT-5? So OpenAI released GPT-4.5 recently, from the training run they called Orion. There were hopes that Orion could be used for GPT-5, but its improvement wasn't enough to really be a GPT-5. Furthermore, it was trained with the classical method, which is a ton of pre-training and then some reinforcement learning from human feedback and other post-training techniques like PPO and DPO. But along the way, this model was trained last year, another team at OpenAI made the big breakthrough of reasoning, the "Strawberry" training, and they released o1 and then o3. And these models are rapidly getting better with reinforcement learning with verifiable rewards.
And so now GPT-5, as Sam calls it, is going to be a model that has huge pre-training scale, like GPT-4.5, but also huge post-training scale like o1 and o3, and continuing to scale that up. And this would be the first time we see a model that's a step up in both at the same time. And so that's what OpenAI says is coming.
They say it's coming this year, hopefully in the next three to six months, maybe sooner. I've heard sooner, but we'll see. But this path of massively scaling both pre-training and post-training with reinforcement learning with verifiable rewards should yield much better models that are capable of many more things. And we'll see what those things are.
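To give a flavor of "verifiable rewards": unlike human preference grading, the reward here comes from something you can check mechanically, like whether a math answer is exactly right or whether code passes its tests. Here's a heavily simplified toy loop; the "policy" is just a weighted choice over three candidate answers, standing in for a full language model, and the weight bump stands in for a proper RL update such as PPO.

```python
import random

# Toy "policy": unnormalized preferences over candidate final answers to 17 * 34.
# In a real system this is a language model producing a whole reasoning trace plus an answer.
policy = {576: 1.0, 578: 1.0, 588: 1.0}

def sample_answer():
    answers, weights = zip(*policy.items())
    return random.choices(answers, weights=weights)[0]

def verifiable_reward(answer):
    """1 if the answer checks out mechanically, else 0 -- no human grader involved."""
    return 1.0 if answer == 17 * 34 else 0.0

for _ in range(200):
    answer = sample_answer()
    if verifiable_reward(answer) > 0:
        policy[answer] *= 1.05   # crude stand-in for reinforcing the trace that got it right

print(max(policy, key=policy.get))   # 578: the verified-correct answer comes to dominate
```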
Very cool. All right, Dylan, do you want to give a quick shout-out to those who are interested in potentially working with SemiAnalysis, who you work with and where they can learn more? Sure. So at semianalysis.com we have the public stuff, which is all these reports that are pseudo-free, but most of our work is done directly for clients. There are datasets we sell covering every data center in the world, servers, all the compute, where it's manufactured, how many, where, what the cost is and who's doing it. And then we also do a lot of consulting. We've got people who have worked everywhere from ASML, which makes lithography tools, all the way up to Microsoft and Nvidia, making models and running infrastructure. So we've got this whole gamut of folks; there are roughly 30 of us across the world, in the US, Taiwan, Singapore, Japan, France, Germany,
Canada, so there are a lot of engagement points. If you want to reach out, just go to the website, go to one of the specialized pages for the models or sales, and reach out; that'd be the best way to interact and engage with us. But for most people, just read the blog. Unless you have specialized needs, unless you're a company or an investor in the space, if you just want to be informed, just read the blog, and it's free. I think that's the best option for most people.
Yeah, well, I will attest the blog is magnificent. And Dylan, I'm really thrilled to have gotten the chance to meet you and talk through these topics with you. So thanks so much for coming on the show. Thank you so much, Alex. All right, everybody, thanks for listening. We'll be back on Friday to break down the week's news. Until then, we'll see you next time on Big Technology Podcast.