One of the really cool things about this job is that when something like this happens, I get to talk to everyone, and everyone wants to talk. I feel like I've talked to, maybe not everyone, but most of the top people in AI. And there are definitely takes all over the map on DeepSeek, but I feel like I've started to put together a synthesis based on hearing from the top people in the field. It was a bit of a freakout. I mean, it's rare that a model release becomes a global news story or causes a trillion dollars of market cap decline in one day. So it is interesting to think about why this was such a potent news story. I think it's because there are two things about this company that are different. One is that it's obviously a Chinese company rather than an American company, so you have the whole China versus US competition.
And then the other is it's an open source company, or at least it open sourced the R1 model. And so you've kind of got the whole open source versus closed source debate.
And if you take either one of those things out, it probably wouldn't have been such a big story. But I think the combination of these things got a lot of people's attention. A huge part of TikTok's audience, for example, is international. Some of them like the idea that the US may not win the AI race, that the US is getting a comeuppance here, and I think that fueled some of the early attention on TikTok. Similarly, there are a lot of people who are rooting for open source, or who have animosity towards OpenAI.
And so they were rooting for this idea that, oh, there's this open source model that's going to give away what OpenAI has done at one-twentieth the cost. So I think all of these things provided fuel for the story. Now, the question is, OK, what should we make of this? I think there are things that are true about the story, and then things that are not true or should be debunked. The, let's call it, true thing here is this:
If you had said to people a few weeks ago that the second company to release a reasoning model
along the lines of O1 would be a Chinese company, I think people would have been surprised by that. So there was a genuine surprise. And just to back up for people: there are two major kinds of AI models now. There's the base LLM, like GPT-4o, or the DeepSeek equivalent, V3, which they launched a month ago. That's basically like a smart PhD: you ask it a question, it gives you an answer. Then there are the new reasoning models, which are based on reinforcement learning, sort of a separate
process as opposed to pre-training. O1 was the first model released along those lines. You can think of a reasoning model as a smart PhD who doesn't give you a snap answer but actually goes off and does the work. You can give it a much more complicated question, and it'll break that problem into a set of smaller problems and then go step by step to solve them. That's called chain of thought.
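To make that pattern concrete, here's a minimal sketch, assuming the OpenAI Python SDK; the model name and the decomposition prompt are illustrative placeholders, not how any lab's reasoning model actually works internally.

```python
# Minimal sketch of the chain-of-thought pattern, assuming the OpenAI
# Python SDK. The model name and prompts are illustrative placeholders.
from openai import OpenAI

client = OpenAI()

question = (
    "A train leaves at 2pm going 60 mph; a second leaves at 3pm on the "
    "same track going 80 mph. At what time does the second catch the first?"
)

# Base-model style: ask for a snap answer in one shot.
snap = client.chat.completions.create(
    model="gpt-4o",  # stand-in for a base LLM
    messages=[{"role": "user", "content": question}],
)

# Reasoning-model style: have the model decompose the problem into
# smaller sub-problems and work through them before answering.
reasoned = client.chat.completions.create(
    model="gpt-4o",
    messages=[{
        "role": "user",
        "content": "Break this into smaller sub-problems, solve each one "
                   "step by step, then state the final answer:\n" + question,
    }],
)

print(snap.choices[0].message.content)
print(reasoned.choices[0].message.content)
```

The difference is that a true reasoning model like O1 or R1 learns to do this decomposition internally via reinforcement learning, rather than needing it spelled out in the prompt.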
The new generation of agents that are coming is based on this idea of chain of thought: an AI model can sequentially perform tasks and work out much more complicated problems. So OpenAI was the first to release this type of reasoning model. Google has a similar model they're working on called Gemini 2.0 Flash Thinking, and they've released kind of an early prototype of this called Deep Research 1.5. Anthropic has something, but I don't think they've released it yet. So other companies have
similar models to O1, either in the works or in some sort of private beta. But DeepSeek was really the next one after OpenAI to release a full public version. And moreover, they open sourced it. So this created a pretty big splash, and I think it was legitimately surprising to people
that the next big company to put out a reasoning model like this would be a Chinese company, and moreover, that they would open source it and give it away for free. And I think the API access is something like one-twentieth the cost. So all of these things really did drive the news cycle, and I think for good reason, because if you had asked most people in the industry a few weeks ago how far behind China is on AI models, they would have said six to 12 months.
And now I think they might say something more like three to six months, right? Because O1 was released about four months ago and R1 is comparable to it. So it's definitely moved up people's timeframes for how close China is on AI. Now, we should take on the claim that they only did this for $6 million. On this one,
I'm with Palmer Luckey and Brad Gerstner and others, and I think this has been pretty much corroborated by everyone I've talked to that that number should be debunked. So first of all, it's very hard to validate
a claim about how much money went into the training of this model. It's not something that we can empirically discover. But even if you accepted it at face value, that $6 million was for the final training run. So when the media is hyping up these stories saying that this Chinese company did it for $6 million and these dumb American companies did it for a billion,
it's not an apples-to-apples comparison, right? I mean, to make the apples-to-apples comparison, you would need to compare DeepSeek's final training run cost to OpenAI's or Anthropic's. And what the founder of Anthropic has said, and what I think Brad has said, being an investor in OpenAI and having talked to them, is that the final training run cost was more in the tens of millions of dollars
about nine or 10 months ago. And so, you know, it's not $6 million versus a billion. Okay, so the billion-dollar number might include all the hardware they bought, the years they put into it, a holistic number as opposed to just the training number. Yeah, it's not fair to compare, let's call it, a soup-to-nuts, fully loaded number from the American AI companies to the final training run by the Chinese company. But real quick, Sax,
you've got an open source model, and the white paper they put out is very specific about what they did to make it and the results they got out of it. I don't think they give the training data, but you could start to stress test what they've already put out there and see if you can do it cheaply, essentially. Like I said, I think it is hard to validate the number. Let's just assume we give them credit for the $6 million number. My point is less that they couldn't have done it, but just that
we need to be comparing like with like. So if, for example, you're going to look at the fully loaded cost of what it took DeepSeek to get to this point, then you would need to look at the R&D cost to date of all the models, all the experiments, and all the training runs they've done, right? And the compute cluster that they surely have.
So Dylan Patel, who's a leading semiconductor analyst, has estimated that DeepSeek has about 50,000 Hopper-generation GPUs. Specifically, he said they have about 10,000 H100s, 10,000 H800s, and 30,000 H20s.
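As a rough sanity check on the "over a billion dollars" figure that comes up below, here's a back-of-the-envelope sketch. The per-GPU prices and the overhead multiplier are my assumptions based on commonly cited street prices, not figures from this discussion.

```python
# Back-of-the-envelope cost of a cluster like the one Dylan Patel
# describes. Unit prices are assumed street-price estimates (not
# figures from this discussion); overhead covers networking, storage,
# and datacenter buildout.
gpus = {
    "H100": (10_000, 30_000),  # (count, assumed unit price in USD)
    "H800": (10_000, 30_000),
    "H20":  (30_000, 13_000),
}

gpu_cost = sum(count * price for count, price in gpus.values())
overhead = 0.5 * gpu_cost  # assume ~50% on top of GPU spend

total = gpu_cost + overhead
print(f"GPUs alone: ${gpu_cost / 1e9:.2f}B")   # ~$0.99B
print(f"With overhead: ~${total / 1e9:.2f}B")  # ~$1.49B
```

Under those assumptions, the GPUs alone land near a billion dollars before any networking or buildout, which is the point being made here.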
Now, the cost of that... Sax, sorry, is that DeepSeek, or is it DeepSeek plus the hedge fund? DeepSeek plus the hedge fund. But it's the same founder, right? And by the way, that doesn't mean they did anything illegal, right? Because the H100s were banned under export controls in 2022, and then they banned the H800s in 2023. But this founder was very farsighted, very ahead of the curve, and through his hedge fund, he was using AI to basically do
algorithmic trading. So he bought these chips a while ago. In any event, you add up the cost of a compute cluster with 50,000-plus Hoppers, and it's going to be over a billion dollars. So this idea that you've got a scrappy company that did it for only $6 million is just not true. They have a substantial compute cluster that they use to train their models. And frankly, that doesn't count any chips they might have beyond
the 50,000, obtained in violation of export restrictions, that obviously they're not going to admit to. We just don't know; we don't really know the full extent of what they have. I just think it's worth pointing out that that part of the story got overhyped. It's hard to know what's fact and what's fiction. Everybody who's on the outside guessing has their own incentive. If you're a semiconductor analyst who is effectively
massively bullish on Nvidia, you want it to be true that it wasn't possible to train this for $6 million. Obviously, if you're the person making an alternative that's that disruptive, you want it to be true that it was trained for $6 million. All of that, I think, is speculation. The thing that struck me was how
different their approach was. TK just mentioned this, but if you dig into not just DeepSeek's original white paper but also the subsequent papers they've published that refine some of the details, I do think this is a case, and Sax, you can tell me if you disagree, where necessity was the mother of invention. I'll give you two examples where I read these things and thought, man, these guys are really clever. The first is, as you said,
let's put a pin in whether they distilled O1, which we can talk about in a second. But at the end of the day, these guys asked, well, how are we going to do this reinforcement learning thing? And they invented a totally different algorithm. There was an orthodoxy, right? This thing called PPO that everybody used. And they said, no, we're going to use something else, I think it's called GRPO, that uses a lot less memory and is highly performant.
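For the curious, the core of GRPO as described in DeepSeek's papers is replacing PPO's separate learned value network (the critic) with a baseline computed from a group of sampled answers, which is where the memory savings come from. A simplified sketch of that advantage calculation, not DeepSeek's actual code:

```python
import numpy as np

# Simplified sketch of GRPO's group-relative advantage (not DeepSeek's
# actual code). For each prompt, sample a group of G responses, score
# them with a reward model, and normalize each reward against the
# group's own mean and std. That group baseline replaces PPO's separate
# learned value network, which is where the memory savings come from.
def grpo_advantages(rewards: np.ndarray) -> np.ndarray:
    """rewards: shape (G,), reward-model scores for G responses to one prompt."""
    return (rewards - rewards.mean()) / (rewards.std() + 1e-8)

rewards = np.array([0.2, 0.9, 0.4, 0.7])  # toy scores for 4 sampled answers
print(grpo_advantages(rewards))  # above-average answers get positive weight
```

Each response's log-probability update is then weighted by its advantage (with a PPO-style clipped ratio), so the model reinforces answers that beat the group average.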
So maybe they were constrained, Sax, practically speaking, by some amount of compute, and that caused them to find this, which you might not have found if you had a total surplus of compute. And then the second thing that was crazy: everybody is used to building models and compiling through CUDA, which is NVIDIA's proprietary language,
which I've said a couple of times is their biggest moat, but it's also the biggest threat vector for lock-in. These guys worked around CUDA entirely and used something called PTX, which goes right to the bare metal. It's controllable, and it's effectively like writing assembly. Now, the only reason I'm bringing these up is that we, meaning the West, with all the money we've had, didn't come up with these ideas. And I think part of why we didn't is not that we're not smart enough,
but that we weren't forced to, because the constraints didn't exist. So I just wonder how we make sure we learn this principle, meaning when an AI company wakes up, rolls out of bed, and some VC gives them $200 million, maybe that's not the right answer for a Series A or a seed. Maybe the right answer is $2 million, so that they're forced into these DeepSeek-like innovations. Constraint makes for great art. What do you think, Friedberg, when you're looking at this?
Well, I think it also enables a new class of investment opportunity. Given the low cost and the speed, it really highlights that maybe the opportunity to create value doesn't sit at that level in the value chain, but further downstream. Someone made a comment on Twitter today that was pretty funny, or...
I think it reflects this. About the wrapper? Yeah, he's like, turns out the wrapper may be the... The moat. The moat. Which is true. At the end of the day, if model performance continues to improve and get cheaper, and it's so competitive that it commoditizes much faster than anyone even thought,
then the value is going to be created somewhere else in the value chain. Maybe it's not the wrapper. Maybe it's with the user. And maybe, by the way, here's an important point: maybe it's further out in the economy. You know, when electricity production took off in the United States, it's not like the companies making all the electricity made a lot of money. It's the rest of the economy that accrued a lot of the value.