
860: DeepSeek R1: SOTA Reasoning at 1% of the Cost

2025/2/7

Super Data Science: ML & AI Podcast with Jon Krohn

Topics
Jon Krohn: The DeepSeek R1 reasoning model matches the performance of OpenAI's GPT-4o and Google's Gemini 2.0 Flash while costing dramatically less to train, roughly 1% as much. DeepSeek is a Chinese company, and R1's success has had global economic repercussions and called US technology sanctions into question. The model combines existing concepts such as mixture-of-experts with innovations like the GPU communications accelerator DualPipe to achieve highly efficient training. DeepSeek has also open-sourced the source code and model weights for its V3 and R1 models, a major contribution to the AI community. Although DeepSeek's iOS app raises privacy concerns, users can run DeepSeek models privately via platforms like Ollama. DeepSeek R1 makes developing, training, and running AI models more economical, reduces AI-related environmental concerns, and allows AI applications to be used, and to deliver benefits, far more broadly.


Transcript


This is episode number 860 on DeepSeek R1.

Welcome back to the Super Data Science Podcast. I am your host, Jon Krohn. Let's start off with a couple of recent reviews of the show, like we sometimes do on Fridays. The first one's from Rem Nassa, whose Apple Podcasts review says that they listen often, that they've been listening to the Super Data Science Podcast for a couple of years now, and that they always find the content fascinating. They say that sometimes the content is a bit over their head, but you can bet that they look up the information and learn from every episode. Very cool. Nice to hear that, Rem Nassa.

And we had a second Apple Podcasts review as well. This one is from SailATX. It says the Super Data Science Podcast is a fantastic way to keep up with the world of AI and the people who work in the industry. They say that the guests we have on the show are spot on and always interesting. They also have nice things to say about my YouTube calculus course, and that it's helping them brush up on their math for a data science course they're taking. Cool. Good luck with that course, SailATX. And I hope both of you, Rem Nassa and SailATX, continue to enjoy the show. Thanks for all the recent ratings and feedback on Apple Podcasts, Spotify, and all the other podcasting platforms out there, as well as for the likes and comments on YouTube videos.

As a bit of friendly competition, which I mentioned for the first time a couple of weeks ago: regular listeners may know that I've guest co-hosted the excellent Last Week in AI podcast half a dozen times, and both regular hosts of that show, Andre and Jeremy, have been my guests on the Super Data Science podcast. Well, despite their show being many years younger than the Super Data Science podcast, they are closing in on us in terms of number of Apple Podcasts ratings. At the time of recording, we have 286 and they're at 255, so we're staying ahead. And since I last mentioned this a couple of weeks ago, both podcasts have picked up about five ratings each, so we're staying neck and neck. But it does seem like I need you to keep going at it and press towards 300 ratings there on Apple Podcasts. So help me stay ahead of Andre and Jeremy by heading to your podcasting app and rating the Super Data Science Podcast there. Bonus points if you leave written feedback; I'll be sure to read it on air like I did today.

All right, into the meat of today's episode now. In recent weeks, I'm sure you've noticed there's been a ton of excitement over DeepSeek, a Chinese AI company that was spun out of a Chinese hedge fund just two years ago.

DeepSeek's V3 stream-of-consciousness, chatbot-style model caught the world's attention because it was able to perform near state-of-the-art models like OpenAI's GPT-4o and Google's Gemini 2.0 Flash. But it was DeepSeek's reasoning model, which is kind of like OpenAI's o1 reasoning model, that really caused the stir. Instead of producing a stream-of-consciousness response straight away, these reasoning models review what they've been quote-unquote thinking before necessarily pumping something out. This kind of reasoning has turned out to be great for the same kinds of tasks that you might ponder and reason over with a pencil and paper: math problems, computer science problems, those kinds of things. You can hear more about these kinds of reasoning models in episode 820 of this show.

But suffice it to say that DeepSeek's reasoning model, called R1, caused a huge economic disruption, with both Nvidia's share price falling by 17% and the NASDAQ falling several percent last Monday. At the time of writing, DeepSeek's R1 reasoning model is statistically tied, within a 95% confidence interval, for first place on the overall LM Arena leaderboard with the top models. It's literally in first place, statistically speaking, alongside GPT-4o and Gemini 2.0 Flash from Google. The LM Arena leaderboard is one of many kinds of leaderboards that you could use to compare LLM performance.

But the LM Arena leaderboard is particularly interesting because it involves humans blindly rating the output of one model versus another.
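To make that concrete, here's a deliberately tiny, illustrative sketch of how blind pairwise human votes can be turned into a ranking. To be clear, this is not LM Arena's actual pipeline (they fit a Bradley-Terry model over very large numbers of votes and report confidence intervals around each model's score); it's just a toy Elo-style update, with made-up model names and votes, to show the core idea.

# Toy illustration of an Elo-style rating update from blind pairwise votes.
# NOT LM Arena's real methodology; it only shows how pairwise human
# preferences can be aggregated into a leaderboard ranking.

def expected_score(rating_a: float, rating_b: float) -> float:
    """Probability that model A beats model B under an Elo model."""
    return 1.0 / (1.0 + 10 ** ((rating_b - rating_a) / 400))

def elo_update(rating_a: float, rating_b: float, a_won: bool, k: float = 32.0):
    """Return updated (rating_a, rating_b) after one blind human vote."""
    e_a = expected_score(rating_a, rating_b)
    score_a = 1.0 if a_won else 0.0
    rating_a += k * (score_a - e_a)
    rating_b += k * ((1.0 - score_a) - (1.0 - e_a))
    return rating_a, rating_b

# Hypothetical votes: (model_a, model_b, did_a_win) -- purely made up.
votes = [
    ("deepseek-r1", "gpt-4o", True),
    ("gpt-4o", "deepseek-r1", True),
    ("deepseek-r1", "gemini-2.0-flash", False),
]

ratings = {"deepseek-r1": 1000.0, "gpt-4o": 1000.0, "gemini-2.0-flash": 1000.0}
for a, b, a_won in votes:
    ratings[a], ratings[b] = elo_update(ratings[a], ratings[b], a_won)

print(sorted(ratings.items(), key=lambda kv: -kv[1]))

The statistical-tie language from a moment ago falls out of this kind of setup: when two models' estimated ratings have overlapping confidence intervals, the leaderboard can't distinguish them, so they effectively share first place.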

It's an interesting leaderboard, and you can actually hear a ton about it in episode number 707 of this podcast if you'd like to. Anyway, this great performance, being on top of the LM Arena leaderboard and other kinds of leaderboards out there, caught global attention first because DeepSeek is an obscure Chinese company, while all the previous top models were devised by Americans, specifically by Bay Area tech giants.

More consequentially than even great-power politics, however, DeepSeek's R1 caused a global economic tsunami because it is comparable in performance to the best OpenAI, Google, and Anthropic models while costing a fraction as much to train. There are all kinds of complexities, externalities, and estimates to take into account when trying to compare the cost of two different LLMs at two different companies. For example, what about the cost of training runs that didn't pan out?

But speaking in rough approximations, training a single DeepSeek V3 or DeepSeek R1 model appears to cost on the order of millions of dollars, while training a state-of-the-art Bay Area model like o1, Gemini, or Claude 3.5 Sonnet reportedly costs on the order of hundreds of millions of dollars, so about 100 times more.

As I've stated on this show several times, even without conceptual scientific breakthroughs, we can simply scale up the transformer architecture that underlies o1, Gemini, or Claude, such as by increasing training dataset size, increasing the number of model parameters, increasing training-time compute, or, in the case of reasoning models like o1, increasing inference-time compute.

Doing any of that kind of scaling will lead to impressive LLM improvements that overtake more and more humans on more and more cognitive tasks and bring machines in the direction of artificial general intelligence. If you don't know what AGI is, you can check out episodes number 748 and 820 for more on all of what I just said in the last sentence.
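To put a concrete, if simplified, shape on that scaling claim, empirical scaling-law work (most famously the Chinchilla paper by Hoffmann et al., 2022, which is my reference here rather than anything in this episode) models pretraining loss roughly as

L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}}

where N is the number of model parameters, D is the number of training tokens, E is an irreducible loss floor, and A, B, alpha, and beta are fitted constants. The point is simply that, under this kind of empirical law, pushing N and D (and the compute to match) reliably pushes loss down, which is exactly what the scale-everything strategy banks on.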

Implicit in this scaling statement, however, is that if researchers can devise major conceptual scientific breakthroughs with respect to how machines learn (that is, actually making scientific breakthroughs instead of just scaling things up), we could accelerate toward AGI even more rapidly.

If conceptual breakthroughs on AI model development can allow machines to improve their cognitive capabilities while also learning more efficiently, this would reduce server farm energy consumption, loss of fresh water through server cooling, and of course, it would just save plain old financial costs associated with running AI models.

DeepSeek has achieved such a conceptual breakthrough by combining a number of existing ideas, like mixture-of-experts models (you can learn more about those in episode 778), with brand-new efficiency innovations, such as a GPU communications accelerator called DualPipe

that schedules how data passes between the couple of thousand GPUs DeepSeek appears to have trained R1 with to get the breathtaking results that they did. Now, 2,000 GPUs might sound like a lot, but it's again about 1% of the number of chips Meta's Mark Zuckerberg and xAI's Elon Musk brag about procuring in a given year for potentially training a single, ever-larger, next large language model.
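To give a feel for what a mixture-of-experts layer does, here's a deliberately small sketch of top-k expert routing in Python. This is not DeepSeek's implementation (DeepSeek-V3 uses a far more elaborate, fine-grained MoE with its own load-balancing scheme); the random "experts", sizes, and router below are made up purely to illustrate the idea that only a couple of experts are activated per token, which is where the compute savings come from.

# Tiny illustration of top-k mixture-of-experts (MoE) routing.
# NOT DeepSeek's architecture; just the core idea that a router picks a small
# subset of expert sub-networks per token, so most parameters sit idle on any
# given forward pass.
import numpy as np

rng = np.random.default_rng(0)
d_model, n_experts, top_k = 16, 8, 2

# Each "expert" here is just a random linear layer.
experts = [rng.normal(size=(d_model, d_model)) for _ in range(n_experts)]
router = rng.normal(size=(d_model, n_experts))  # router weights

def moe_forward(x: np.ndarray) -> np.ndarray:
    """Route one token vector x through its top_k experts and mix the outputs."""
    logits = x @ router                    # score every expert for this token
    top = np.argsort(logits)[-top_k:]      # indices of the top_k highest-scoring experts
    weights = np.exp(logits[top])
    weights /= weights.sum()               # softmax over only the selected experts
    return sum(w * (x @ experts[i]) for w, i in zip(weights, top))

token = rng.normal(size=d_model)
out = moe_forward(token)
print(out.shape)  # (16,) -- same dimensionality, but only 2 of the 8 experts ran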

I'm not going to go further into the technical details of the DeepSeek models in this episode, but if you'd like to dig into the technical aspects more deeply, I have provided a link to DeepSeek's full R1 paper, as well as an exceptionally detailed, well-written blog post on an online tech news site called Next Platform that breaks down that paper.

Moving beyond technical aspects to geopolitics, DeepSeek's success demonstrates that American sanctions that prevent Chinese firms from accessing the latest, most powerful Nvidia chips have been ineffective. These sanctions were explicitly designed to prevent China from being able to overtake the US on the road to AGI, particularly given the military implications of having access to a machine that could far exceed human cognitive capabilities.

But now a Chinese firm has figured out how to approach U.S. firms' AI capabilities with about 1% of the quantity of chips, at about 1% of the cost, and using less capable Nvidia chips than American firms have access to. On a side note, in a separate quandary for the Chinese Communist Party, for geopolitical reasons they'd probably prefer that DeepSeek's intellectual property be kept proprietary, and yet...

DeepSeek graciously open-sourced their work for the world to leverage to advance AI research as well as AI application development. All of the DeepSeek V3 and R1 source code and model weights are available on GitHub (I've got a link to that in the show notes), and all of that source code and those model weights are available for use under a highly permissive MIT license.

Models like those from OpenAI, Google, Anthropic, and xAI are, on the other hand, proprietary in every respect. So that's another big positive for the AI community from the folks at DeepSeek.

This level of openness from DeepSeek goes far beyond even what so-called open LLMs like Meta's Llama family offer, because Meta provides model weights but not source code, and Meta's unusual license includes constraints such as limiting Llama model usage to companies with fewer than 700 million monthly active users.

Beyond open-sourcing their models, DeepSeek also created an iOS app, which was number one in the Apple App Store at the time of recording this episode. But I would caution you against using the DeepSeek app, because per the app's privacy policy, anything you input into DeepSeek's app is collected by the company and stored on servers in China.

If you'd like to use a DeepSeek model privately but don't want to spend your time or money wrangling the raw model weights and serving infrastructure yourself, you can use a tool like Ollama to run the model locally on your own hardware. I've got a link in the show notes to the R1 model from DeepSeek provided by Ollama, so you can do just that.
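As a rough sketch of what that looks like in practice: assuming you've installed Ollama and pulled an R1 variant (for example by running ollama pull deepseek-r1 at the command line; the exact model tag and size you choose are assumptions that depend on your hardware), you can query the model entirely on your own machine through Ollama's local REST API. The snippet below is illustrative rather than official DeepSeek or Ollama documentation.

# Query a locally running Ollama server (default: http://localhost:11434)
# for a DeepSeek-R1 model. Assumes the Ollama service is up and the model tag
# below has already been pulled; nothing here leaves your machine.
import json
import urllib.request

payload = {
    "model": "deepseek-r1",   # assumed model tag; use whatever tag you pulled
    "prompt": "Briefly: why might a reasoning model beat a chat model on math?",
    "stream": False,          # return one JSON object instead of a token stream
}

req = urllib.request.Request(
    "http://localhost:11434/api/generate",
    data=json.dumps(payload).encode("utf-8"),
    headers={"Content-Type": "application/json"},
)

with urllib.request.urlopen(req) as resp:
    print(json.loads(resp.read())["response"])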

Okay, so hopefully you're excited that you now have untethered access to state-of-the-art AI capabilities, but that should only be the beginning of your excitement. Markedly more efficient LLM training does mean that the recent $6 billion raises by OpenAI, xAI, and Anthropic, much of which would have been earmarked for training ever-larger transformer architectures and for ever-longer inference-time compute, may no longer look like very well-allocated capital.

And the DeepSeek release ended up being coincidentally but nevertheless comically timed with the announcement of the $500 billion Stargate AI infrastructure project that included the CEOs of OpenAI, Oracle, and SoftBank alongside Donald Trump. That enormous $500 billion Stargate figure probably only made sense when bean counters assumed LLMs would keep growing and growing by orders of magnitude in the coming years.

And yeah, correspondingly, Nvidia's share price took a 17% hit in one day, although at the time of writing and recording this podcast episode, some of that hit had recovered. The share price fell because shareholders realized the LLM size increases they'd baked into future GPU orders may no longer come to fruition. But for most of us, certainly for me, and probably for most listeners,

markedly more efficient LLM training, and a return to the open-source model that dominated AI research until just a few years ago, is fabulous news. Increased LLM efficiency in particular means fewer environmental issues associated with AI, and it means that developing, training, and running AI models is more economical, so practical AI applications become cheaper to build and more widely available for people around the world to use and benefit from.

These are exciting times indeed. Dream up something big and make it happen. There's never been an opportunity to make an impact like there is today.

All right, that's it for today's episode. If you enjoyed it or know someone who might, consider sharing this episode with them, leave a review of the show on your favorite podcasting platform, tag me in a LinkedIn or Twitter post with your thoughts, and if you aren't already, obviously subscribe to the show. Most importantly, however, I just hope you'll keep on listening. Until next time, keep on rocking it out there, and I'm looking forward to enjoying another round of the Super Data Science Podcast with you very soon.