
Robots, Small Models, and RL with DeepSeek Alumnus Zihan Wang — #86

2025/5/22

Manifold

People: Steve Hsu, Zihan Wang

Topics
Zihan Wang: I think robotics will improve a great deal in the next few years, because today's large language models have vision capabilities and can do semantic understanding. I studied at the Gaoling School of Artificial Intelligence at Renmin University, which is an underrated school, but it has published many influential papers. Chinese universities are changing very quickly; cutting-edge research can now be done anywhere, and open-source infrastructure has lowered the barrier to research. Chinese people are always quick to spot promising algorithms and scale them up. I think open source is like a game-theory problem: when everyone cooperates, the benefit to society is maximized. I hope AI can help accelerate humanity's return to the moon, and I look forward to AI applications in code debugging and robotics.

Steve Hsu: I discovered that you had posted some interesting things on Twitter about AI research and your DeepSeek internship. The pace of change at Chinese universities is shocking to Americans; the young professors and students now are excellent. Do you think uniquely Chinese or creative innovations will surpass those of the US? Even without a huge compute budget, you can do impactful research at a university. The situation is much better than a few years ago; even investors and venture capitalists who follow AI closely usually don't know much about China, and they are surprised by the quality of Chinese models.


Transcript


I think robotics could definitely be one of the biggest areas of improvement in the next few years, because previously robots simply couldn't do any semantic understanding. But with current large language models that have vision capabilities, I think this is going to change. Welcome to Manifold. My guest today is Zihan Wang. He is a PhD student at Northwestern University.

Zihan, welcome to the show.

Hi, nice to meet you, Steve. I'm very happy to be here. Yeah, it's great to have you on the show. I discovered you on Twitter because you wrote some very interesting posts about AI research and the internship that you did at DeepSeek. And I think you also translated some interviews with the DeepSeek founder, for example. I want to start by talking about your background. So you're quite young, right? You only graduated from college recently. Is that right?

Yeah, I actually just graduated last year, 2024, from Renmin University. And now I'm a first-year PhD student. And so in China, is there a name for someone like you? Are you called the 2000s generation or something like this? Yeah, definitely, there is a name. If I translate it directly, it would be something like "post-00." It's ling ling hou. Yes, ling ling hou.

You're not from Beijing, but you attended university in Beijing. Is that correct? Yeah. I grew up in Wuhan, and I got most of my education in Wuhan. I went to the No. 1 Middle School affiliated with CCNU, Central China Normal University. And I think all of my high school classmates are just fantastic. And actually, I just

came to know that many of them are also working on artificial intelligence, and they got very good places to study in the U.S., for example CMU and Berkeley. And I am also very fortunate to be studying in the U.S. for a CS PhD. And...

In terms of my college study, I studied at Renmin University. Actually, I think it's a bit of an underrated school. I talked to some AI models this morning to see whether they knew about my undergrad school. And

I found that their knowledge is pretty old. They rated my school as somewhere in the top 1000 universities in the world. But actually our university has been publishing a lot of influential papers recently. For example,

professors at Renmin University just published LLaDA, a large language diffusion model, which is very popular on X. And some of my fellow alumni launched OpenManus, which is an open-source reproduction of a very trending agent. I think it's a general computer-use agent

called Manus, and they reproduced it within several hours of its release, so I think that's fantastic. And I think

the school and the university are growing. I did my undergrad at the Gaoling School of AI, which is very new; I think it was established only five years ago. So I think there has been a trend in China that a lot of people are trying to study artificial intelligence, not only casually using the web, but also as serious research in

undergrad and graduate studies. So I think it's just so exciting to be studying, or just living, in this era. So let me dig into that a little bit. I've been to Renmin University. If I recall, it's right next to Beida. Is that right? Very close. Yeah, we collaborate a lot with Beida.

Yeah. But for people my age, I think Renmin is more famous for humanities and social science and not as well known for technical subjects, though maybe that's changing now. It's interesting that in China there are many programs. Was your undergraduate major actually specifically AI, and not just computer science? Yeah.

Yes, the school is targeted mainly at AI. I think it's mainly because the Gaoling School of AI was launched only five years ago, funded by Gaoling. I'm not sure if its name in English is Gaoling.

Hillhouse, something like that. I'm not sure about the English name, but the CEO is Zhang Lei. He's a very great investor who has invested in a lot of great companies, and I think it's one of the top investment firms in China. I'm not an expert in investing, but I think he

really invested a lot into the school, especially targeting artificial intelligence. And the professors at that school have so much expertise in artificial intelligence. For example, some of them were among the very early researchers in information retrieval, and this is why Renmin University has a very high information

retrieval ranking worldwide; I think it's around top five in the world. And they have extended this to general artificial intelligence. For example, they also have many professors with

strong expertise in machine learning and also computer vision. So I think it's been growing, and I think the main reason is that so much money has been invested into it. And as for what you said about the humanities side of Renmin University, I think that's actually something unique about it,

because it doesn't only do well in artificial intelligence these days, it also has deep connections with other subjects, for example the humanities. I think the motto of the Gaoling School is something like

"to create warm AI," something like that. I'm not sure how to translate it directly. So I think the school is trying to do fundamental AI research and also give it some humanity, to have interdisciplinary research.

Now, not everybody in my audience knows that much about China. Can you translate what Renmin means in English? Oh, yeah. It just means "people." People's University. Yes. It was originally translated as People's University, but I'm not sure why they changed the name. Maybe they just wanted some alignment between how it is spoken in Chinese and how it is spoken in English. For example, if

we just call it People's University, then when English-speaking visitors come to China, they won't know that Renmin University is People's University. So they changed it back to the original name. So again, for my listeners who have not been to Beijing, there's a part of Beijing with very big, very famous universities. Beida, Tsinghua and Renmin are all roughly in a triangle there. It's one of the main concentrations of brainpower

in China. Yeah. I think the center of the triangle is called Zhongguancun. Zhongguancun is the place where, for example, Google China had its office before they closed it, and Microsoft Research, and all kinds of venture capital firms and startups. So it's a high-tech area of Beijing. I think I was just there recently. So, Zihan,

for most people who are older, like professors in the U.S., at the time you were born the Chinese universities were not as strong. I think you said that in the old rankings, Renmin was, you know, ranked around 1000 in the world. What's difficult for Americans to appreciate, because America has been a rich, top place in the world basically since World War II or even before that,

is the rate of change in China; it's one of the most shocking things for Americans. When I talk to Americans, I tell them something like: maybe the older professors, even at Renmin, are not necessarily that great, but the young ones are really sharp and the students are really sharp. So maybe you could comment a little bit on that. I would guess the students you went to college with are as good as any undergraduates in the United States. Do you think that's fair? Yeah.

I think this is mainly because the scale was smaller before. There's a trend that people's education level has been rising a lot in China. Previously, the Gaokao acceptance rate was very low, but now almost everyone can go to a university or college. There's even something we call inflation in education level. But I think it's actually a good thing, because once people get more educated, they know more about the world. They just

have more ability to do something fascinating. So I think this is mainly because of that trend. The earlier professors at schools like Renmin University are great, they do good research. Actually, Renmin University is one of the first schools that started database research in China, so its database ranking has always been near the top

among Chinese schools. But at that time the scale was so small, so there was not much visibility, either in China or globally, and people didn't know about it. But I think all of the

people who matter, for example those who are really in charge of database systems in China, all know about the school and how it contributes to the country. So I think this is really a change in scale rather than in expertise. But maybe I'm wrong, because I work mostly with the young professors,

since my topic is mainly related to the young professors' research interests. But I do believe the expertise was not built just recently; it goes back a long way. What is changing is the scale, the number of people, and how much money is being put into the school. It's more and more today.

So I looked at your CV before the interview, and it's very impressive, because you've already been involved in very cutting-edge research projects, both at the university and presumably in your internship at DeepSeek. And you only just finished your undergraduate degree, right? So already as an undergraduate, maybe as a junior or senior, you were involved in pretty cutting-edge frontier research projects.

Can you talk a little bit about your decision to come to the United States for graduate school? For most talented CS or AI students in China, do they all want to come to the U.S., or would they rather stay in China and do their PhD at Tsinghua? How does that thinking go for a kid? Oh, yeah. I think these are really two questions: my choice, and other Chinese students'

choices. For my choice, I think it's very much case by case, because I knew about my current advisor. She's a really great advisor, and I think she's one of the greatest advisors I could meet over my whole research career. She's very supportive of students and she has a very strong vision. She has great connections to a lot of

cutting-edge professors working in different directions, for example computer vision, robotics, large language models, foundation models, agents. So I think I mainly chose her, not the United States versus China, or Northwestern versus other schools. I mainly chose her. So this is my case. And actually a lot of people also ask why I'm not staying with my

previous affiliations, for example either Renmin University or DeepSeek. I think what I'm working on today is a little bit different from what my previous affiliations focus on. For example, we are focusing a lot on vision-language models and robotics, whereas Renmin University is good at information retrieval and DeepSeek is good at foundation models. But

I think it would be a little bit hard for me to get exposure or research experience in robotics today if I continued studying in China.

That's not because of the different countries or something; it's just because of my background. But I know that my advisor is currently working on agents and robotics, and I really hope to learn more and do more research on that. Actually, I have an ongoing project about robotics, but it has only just started. So I think this was mainly a choice of direction. Can you describe the situation, though, for a typical...

Let's say a kid who went to one of the better CS programs in China and is thinking about getting a PhD. What is their thinking about whether to stay in China or try to come to the United States or some other Western country for their PhD? Yeah. I think in China it's more like half and half for now.

For the students who have very strong records, for example several first-author peer-reviewed papers during their undergrad and a lot of research experience at top labs like Stanford or Berkeley, I think it's really half and half. Some of them go to the United States, and some of them stay in China for their PhD studies. I think

this is also case by case, and I'm not confident enough to give you a statistical conclusion. But for example, one of my lab mates chose to stay in China for his graduate studies. He's from Renmin University and he's now

at Tsinghua University for his PhD. That's because Tsinghua is one of the best schools in information retrieval, even better than Renmin University. So he's already studying at the most cutting-edge institute, and the most cutting-edge institute is in China — one of them, right? So he chose that for the research and also for his connections, because he had been interning in that group for some time. So he chose to do his PhD at Tsinghua.

And some of my other friends came to the United States. I think this is also because of the professors, both for how they

lead and advise their students and for their research directions. For example, one of my friends went to Berkeley, to an efficiency group, because he believed that efficiency could be very important for current foundation model research, especially efficiency for attention and for MoE. And that group is doing very well in those directions.

So he went there for his PhD. Yeah. So a few years ago, it seemed like you really had to be at a frontier lab, because only they had the compute budget to do the pre-training, and only they had access to the models, because there were no really good open-source models. I was a little worried at that point that the frontier of this whole field would shift into closed private labs, right?

But now the situation is maybe a little bit different, and maybe you can do really impactful research even though you don't have a huge compute budget and you're at a university. So I'm wondering if you can comment on that situation. I have a lot of comments on that. Actually, I can share a little bit about my two recently released public projects,

RAGEN and CoE. In RAGEN we're trying to enable agents to learn from self-evolution with reasoning, and CoE is Chain-of-Experts, where we change mixture-of-experts a little bit to enable the experts to communicate with each other. Actually, both of these projects cost us less than $1,000 USD.

Yeah, so... Was the base model something like Qwen 32B? Or what model were you using? We are basically using smaller versions of the Qwen models, for example 0.5 billion parameters, or...

For the CoE project, we used the architecture of the DeepSeek V2 public release on Hugging Face. But since we're doing pre-training, the checkpoint is not that important; we initialize from scratch. So we just change the hyperparameters to make the model smaller, to fit our budget.
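
For a sense of what "changing the hyperparameters to make the model smaller" can look like in practice, here is a minimal sketch using the Hugging Face transformers library. The model id and field names follow the public DeepSeek-V2 config, but the values and the choice of fields are illustrative assumptions, not the CoE project's actual settings.

```python
# Hedged sketch: build a scaled-down DeepSeek-V2-style model from a randomly
# initialized, shrunken config rather than loading the pre-trained checkpoint.
from transformers import AutoConfig, AutoModelForCausalLM

config = AutoConfig.from_pretrained("deepseek-ai/DeepSeek-V2-Lite", trust_remote_code=True)

# Shrink the architecture to fit a small budget (illustrative values only).
config.hidden_size = 512
config.num_hidden_layers = 8
config.num_attention_heads = 8
config.intermediate_size = 1024
config.moe_intermediate_size = 256
config.n_routed_experts = 8

# Random initialization: no pre-trained weights are loaded.
model = AutoModelForCausalLM.from_config(config, trust_remote_code=True)
print(sum(p.numel() for p in model.parameters()) / 1e6, "M parameters")
```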

So I think that is the key. You can definitely try to verify your idea with a small budget first. After that, if you want to scale it up, you can find money from other funding sources. And I definitely believe that once this kind of idea shows some potential, there could be a lot of money willing to finance it.

So I think this is one of the cases showing that cutting-edge research can now be done anywhere. There are also undergrad students from top-100 universities who find me and say, hey, I have an idea, I'm not sure if you are interested, but maybe we can discuss it. And I just ask them to try it on their own cloud resources. Their scale is even smaller than

mine. For example, I tried Qwen 0.5 billion; they can try an even smaller model. They're just using Colab or some other cloud resources, which cost them less than $100 per month, but they can also get it to work. So I think current cloud infrastructure makes it

easier and cheaper for us to verify the correctness of an idea. And once we have verified it initially, we can try to publish it a little — well, maybe we shouldn't say publish; rather open it up, release it a little with a blog or code. And people will see the post and

judge it, and then you will see whether your idea is accepted by the public. Yeah. I want to talk a little bit more about this: I think there's another factor, which is the open-source infra.

Just one year ago, when I was trying to implement something for online learning — meaning the model generates some trajectories, gets some feedback, and based on the feedback it learns and improves itself — it was so hard to implement, because most training frameworks at that time supported supervised fine-tuning but not online learning, because

the model that generates the trajectories usually had to be fixed; when you want to change the parameters of the model, you actually need very delicate management of the memory. So at that time, if you wanted to

make the model generate some trajectories and then use them to update the model, that took real effort and we were not able to do it. But now there are different infra projects, for example the ones we're using, veRL and also OpenRLHF. And recently there is a lot of infra like Open-Reasoner-Zero

and so on, which has really enabled a lot of people to take that open-source infra and build their own things on top of it. It's like standing on the shoulders of giants. So all of these things lower the barrier for current research, if someone just wants to do some research.

The infra that you're talking about — were those projects originally built on top of Llama? Was it because Llama was available that people started building that infra? Or did it actually require things like DeepSeek and Qwen to exist to drive that infra development? Yeah. So infra basically means that you have a model you want to train, and you want to train it correctly. Training it correctly is not just about data and

running it, calculating the loss, doing the backward pass and then optimizing. When you train a small model, that is okay. But when you train a large model, you need to run experiments, and you need to build tooling for the experiments. For example, you need a lot of metrics,

and you need to monitor those metrics. Previous trainers maybe did not support functions that let you easily monitor these important metrics. But now, all of the infra automatically submits

the experimental metrics to a platform called wandb (Weights & Biases). It helps you organize and view the different metrics, so you can easily tell whether the model is training well or not.

It's like, you train a model, but you want to know more about it. You don't want to know only the loss, but also other things, like the different components of the loss — the loss may be a sum of different terms. You also want to know whether your GPUs are being utilized well. For example, some

people have strong GPUs, but the utilization rate is low, so they're just wasting their GPUs. With all those metrics, people can learn whether they are really training well or not. Previous infra was mostly just about making sure you could get it to run; how well it runs is something people have been building a lot of tooling for recently.
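
As a concrete illustration of the kind of metric logging he is describing, here is a minimal sketch using Weights & Biases (wandb). The project name and loss breakdown are invented for the example; in a real run the values would come from the training loop rather than being simulated.

```python
import random
import wandb

# Hedged sketch: log several loss components per step so they can be inspected
# separately in the wandb dashboard. Values below are simulated placeholders.
wandb.init(project="moe-training-demo", config={"model_size": "0.5B", "lr": 3e-4})

for step in range(100):
    lm_loss = 3.0 * (0.99 ** step) + random.random() * 0.05    # simulated language-modeling loss
    aux_loss = 0.1 + random.random() * 0.01                    # simulated MoE load-balancing loss
    wandb.log({
        "loss/total": lm_loss + 0.01 * aux_loss,
        "loss/lm": lm_loss,
        "loss/aux_balance": aux_loss,
    }, step=step)
```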

And what's the split between academic labs and other entities in producing this open-source infra that you're talking about? I'm not sure — are you asking about their ability or their willingness? No, who's actually building it and releasing it? Is it academics at universities, or is it, you know, DeepSeek that's releasing it? Who's actually building the tools that you use the most?

I think it depends. For example, veRL is developed by ByteDance. And Open-Reasoner-Zero is built by — let me check — yes, by StepFun. That's another Chinese company, right? Yeah. Okay, but so are you mostly using tools produced by Chinese companies in open source? Yeah.

I think this is mainly because my friends are using them, so whenever I have questions I can ask them. Got it. Okay. So one of the things — this is maybe totally irrelevant now, but a year or two ago, if you look at my tweets on X, I was complaining that if everything is dominated by

for-profit closed companies that don't release these tools to the academic community, then the overall progress of AI research is going to be slower, because other people can't get involved. But it looks like the situation is much better now than it was a couple of years ago.

Yeah, definitely. I think open sourcing is like — there are some game-theory results where, when everyone cooperates, the benefit to society is maximized, but when everyone defects, each one's own benefit is maximized while the whole society's benefit is minimized. I think there may start out being a small group of people who open source.

And then more and more people open source, because people always give praise to those who open source. And once it hits a certain threshold, the remaining people will also choose to open source, because not open sourcing means that maybe they make money, but they won't be praised, and their scale will be limited. So I think these days the machine learning community has just hit

that threshold, and beyond it more people will open source. Yeah, I think you're right. I think the trend is very strong, at least right now. Even in one of the earlier interviews that you translated, Liang Wenfeng makes a big point about this. Do you think they're sincere — that DeepSeek will keep open sourcing its models for many years to come? I think so. Yeah, I think so. Because I think,

well, I'm not sure, but in my naive opinion, I think he doesn't want to make money. Maybe he already has. Yeah, because he has made enough. Yeah, maybe he has enough money. Okay, but let me ask you this. Next week I'm going to be in Silicon Valley, and I've known people in that industry for many, many years now. And

even the people who follow AI very closely — investors, venture capitalists, even CTOs — are generally not very aware of what's happening in China. So the DeepSeek thing was a bit of a surprise to them. They don't know what Kimi is. They don't really know what Qwen is. So I think they're kind of sleeping on the general quality of the models coming out of China, which I think is actually quite high. I'm curious what you think about this.

Yeah. I do believe that Chinese companies and schools are building very fast. I think speed has always been one of the characteristics of China, because there is a motto we have been taught since primary school: you must be very diligent.

So this is something that is almost driven into our DNA. Chinese people always make things very fast. And

as for innovation — for example, the optimizers that drive large language model training, and some important algorithms — those come from around the globe; they may be produced or first discovered in the US or in Europe, something like that. And I think

Chinese people can always detect which algorithm is more promising and try to scale it up, something like that. So, yeah, that's one of the things I've noticed. When it comes to more fundamental improvements — like, say, actually going beyond the transformer architecture, or I think you already mentioned diffusion models, right?

Can you see a point in the future, maybe the near future, where the really unique or creative innovations are actually coming more from China than the U.S.? I think this is a good question. I have no confidence in any conclusion, because things are happening too fast and changing too fast. Actually,

I don't think the me of three months ago could ever have predicted where I am right now. I don't mean anything grand; I just mean that three months ago I was working on my own project about

structured reasoning for agents. That was a very small project, and we had been building a very delicate algorithm for it. But after DeepSeek released R1, we just removed about 90% of the algorithm and found it still works for agents. So I don't think anything is predictable.

I agree. My priors from, say, three or six months ago are totally different from my priors now. Everything is changing so fast it's almost hard to keep track. Yeah, and I think nobody can predict what will happen three months from now. So maybe I can get into some details about research. You mentioned R1 and RL.

For the audience: I think one of the lessons from the R1 paper from DeepSeek was that you can get very far with a kind of reinforcement learning where you give the model very well-defined problems for which there is definitely a correct and incorrect answer. In a largely automated way, the model attempts to solve these problems, and you feed back what it does into adjustments of its internal parameters.

And amazingly, it's able to learn how to reason well —

very fast and very effectively, from that kind of somewhat automated process. I think that was a surprise to a lot of people. I know from personal knowledge that a lot of the US labs were paying a lot of money for humans to solve problems and using that as fine-tuning training data, et cetera. But this RL method is more elegant and doesn't require as much human effort in the process.
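
To make the idea concrete, here is a toy sketch of the verifiable-reward setup Steve describes: sample several answers per problem, score each one automatically against a known answer, and turn the scores into advantages (here with a simple group-mean baseline, in the spirit of the GRPO method used for R1). The functions and numbers are illustrative, not DeepSeek's actual pipeline.

```python
from statistics import mean

def reward(answer: str, ground_truth: str) -> float:
    # For math-style problems the check can be an exact string or numeric match.
    return 1.0 if answer.strip() == ground_truth.strip() else 0.0

def advantages_for_group(answers, ground_truth):
    rewards = [reward(a, ground_truth) for a in answers]
    baseline = mean(rewards)                  # group mean as baseline (GRPO-style)
    return [r - baseline for r in rewards]    # positive advantage reinforces that sample

# Example: four sampled answers to "What is 7 * 8?"
answers = ["56", "54", "56", "49"]
print(advantages_for_group(answers, "56"))    # correct samples get positive advantage
```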

I have a few questions, since you're a real expert on this. One of my hypotheses is that a given initial model — the pre-trained model — has a certain strength, and then you put it through the RL process. In all the curves I've seen, it shows rapid improvement, but then there's some kind of asymptotic behavior where, unless you make the original model stronger, you're going to hit some maximum performance from RL.

Is that a plausible interpretation of the data that you see? I'm not sure if I understand you correctly — you mean that for RL we have an upper bound on the performance? Yes. But not for supervised fine-tuning? Yeah — as a function of the strength of the base model, there is some upper bound, and maybe asymptotically you approach that upper bound.

No matter how clever you are about the RL, you're probably still limited by some upper bound based on the base model. This is definitely — I think this is fairly obvious. I don't think it's only constrained by data or model architecture; let's just talk about model size. For example, say you want to predict the weather, and you have an accuracy threshold.

A larger model can definitely contain more information about every instance — I mean, the weather conditions in every part

of the region, and calculate it more efficiently. So the performance of the model is definitely constrained by a lot of things. But I'm not sure whether it's constrained by RL or by the model size itself. Imagine we have an infinite-size or near-infinite-size model; I'm not sure whether RL would still be the constraint. Because with scaling laws, people always say you

should always be clear about what the binding constraint is right now. But I'm not sure whether the upper bound we see in current RL training is because of RL or because of other factors. Okay. The reason this is kind of a crucial question is that there is some feeling that for the pre-trained models

there might be a data bottleneck or something that prevents people from making the pre-trained model much better than GPT-4. For example, GPT-4.5 is not really better than 4, right? Claude 3.7 is maybe only a little bit better than 3.5. So the question is: if there's some bottleneck for the pre-trained model, then no matter how much RL you do, you're still going to be limited. You can't get all the way

to AGI or ASI without also getting past that bottleneck in the pre-trained part of the model. Does that make sense? Yeah, I understand your point. But actually, we need to know whether it's a

model problem or a data problem. Yes. Because I am quite sure that GPT has read more than any of us will in our whole lives. If that amount of data cannot make it understand the world, I'm not sure what kind of data could. Right. So, well,

let me be more precise. Let's assume we're sticking with the transformer architecture. Obviously there could be some innovation where we make it literally like our brain or something, but let's suppose we stay within something fixed, like a transformer architecture, and that maybe those original scaling laws were true, so you do need 3x or 10x more data to go to a 10x larger model, right? So...

In that scenario, there seems to be some bottleneck. Reasoning by itself is not going to get us all the way to where we want to go, right? I'm just curious what you think about that. Yeah. So I'm just talking about data, and I think the current data is definitely sufficient, but I'm not sure whether the current model size is sufficient. For example, we could train

on the same data but with a different model. For example, we could use a 10-times-larger model and might find that, partway through pre-training, the 10-times-larger model's validation loss is higher than the smaller model's. Actually, this happens all the time: if I am training a larger model, sometimes it's just a

little bit slower to converge to the final loss trend. But in that case, I'm not sure whether RL would help even more. Because

I think there's a theory that a larger model tends to process the data more smoothly. For example, when you have a very small model training on some strange data, that can definitely lead to some overfitting. But if you just increase the model size, still using the

same strange data, the model will find smooth transitions between different data points by itself; you can just increase the model size and the overfitting issue will be lessened. So yeah, I'm also curious, but I think someone will help answer this question, maybe some of the researchers will help you answer it. For example, we know that

the data is just that much; we don't have more data. But if the base model could be larger, could that raise the upper bound of RL compared to the current model?

Yeah, I think this is definitely worth exploring, but it definitely burns money. So there's another trend in research, which is how to make your model more efficient — for example, within the same compute budget,

or the same money budget, how to make your model better. For example, MoE, or MLA — that sort of research is moving in this direction, because we know we are far from maximum efficiency. And in order to

find out whether the current model size or anything else is the limit, without paying too much, we can scale up the model while doing a lot of research on efficiency.

So I think these are just two different directions. For example, suppose we don't train a larger model, but we experiment on efficiency for a long time, say five years, and after five years we find that efficiency has gotten a 1000x boost. At that point, today's larger models are so cheap to train

that we can then try training a larger model to answer all of these questions. I still remember very vividly the first time I saw BERT — I was so impressed by it, but also so surprised that, okay, in order to train a model we need to spend millions of dollars. But now I think

any lab can pre-train a BERT-style model, thanks to all the improvements people have made in efficiency over the years — and it hasn't even been that many years, about seven, something like that. Maybe not everyone, but every major lab can pre-train a BERT model without too much cost. Right. I mean, but you're still talking about millions of dollars for the pre-training, right? Yeah.

So for now, pre-training a BERT is, I think, $10K to $100K. Okay, but BERT's maybe not... I mean, for a state-of-the-art model, something as good as V3 or GPT-4,

it would take at least millions of dollars, right, to pre-train that model. Yeah. I think this is because people have been trying to scale up models and also trying to scale up efficiency, so there is some balance. A lot of people feel scaling up models works, so they keep scaling the models up.

Other people feel efficiency is the better thing to work on, and they work on efficiency. And in the end, the budget across the world reaches a balance between scaling-up research and efficiency research. I mean, at one extreme you have xAI, Elon's company, which has, you know, whatever it is, 100,000 H100 GPUs, and they can just throw money at the problem.

And they get a model which is good, but not necessarily better than the model that DeepSeek trained for about six million dollars of pre-training cost. So there's a pretty wide range of strategies that people are executing. Yeah. One of the arguments I've had, both with people at the big labs and with investors who invest in this space, is: let's suppose we are not able to make

a pre-trained model which is significantly better than GPT-4 or V3, can we still get to our goal of AGI or ASI just by being better and better at RL and reasoning? And so this is a very important question because nobody knows how to

push the pre-training an order of magnitude further. But people feel like, oh, we're still seeing these gains in reasoning, so maybe we don't have to worry about the pre-training bottleneck; reasoning is enough to get us where we want to go. I personally am skeptical about that, but I'm curious what you think. Yeah. I think this is just like...

when you are playing a game, you have different attributes. You can enhance your attack, you can enhance your defense, and you can also enhance your dodge or something. When you hit a bottleneck on one attribute, you can try to focus on another. So yeah, I think current RL is far from its bottleneck.

I think it's obvious why a lot of people are working on RL instead of scaling up these days: RL seems more workable than scaling up right now. But when RL hits a bottleneck, people will find a lot of other new things to work on, for example efficiency. And also, I'm not sure what kind of

models we'll have at that time. But I believe that

if at that time AI can help people do research, then people will definitely have a lot of new things to do. Oh, absolutely. So, coming back to RL now — I don't know if you know this paper, the acronym is LIMO, from some researchers at Shanghai Jiao Tong University. They claim they were able to develop very high-level math capability, I think using Qwen 32B,

but giving it only something like 900 or 1,000 examples. These were handcrafted examples, made in collaboration between humans and big models. But only those 1,000 were used, and they were able to bring this relatively small model to pretty much state-of-the-art math capability. Are you familiar with this paper?

I'm not familiar with the paper itself, but I think its idea could be similar to DeepSeek R1, where they use some cold-start data. That is also very small scale, but after that they train the model with RL, and then it can develop very good math capabilities. Yeah, but in this case the number of examples was so small — only about 1,000. Yeah.

The bigger hypothesis these researchers had as a consequence of this result is that

the ability to do each particular step in the reasoning is already inherent, even in a fairly small model like Qwen 32B. It's just a matter of giving it the right examples so that it knows how to proceed in the reasoning process, and a surprisingly small number of examples is enough to get it to fully utilize the capabilities that were already in the pre-trained model.

To me, this hypothesis is very plausible, actually, but it has a lot of implications. Because it means even a very small group with very little budget can produce models that are really at the cutting edge of a narrow capability. I think it's also very reasonable. And actually...

I'm not sure whether your audience knows much about the Chinese Gaokao, where for the math problems we just try to remember some basic knowledge, but

the final exam can be very difficult. It's just that we have trained a lot on medium-level and difficult-level tasks in real life. For example, we learn how to organize different

pieces of thought together into connected thoughts, then summarize from the connected thoughts and build conclusions on top of them — a first layer, second layer, third layer. And finally the Gaokao problem can be very difficult, but we still have a chance to solve it. And recently there is another paper called Atom of Thoughts, which is

also very popular on X. I haven't checked it in detail, but the core authors are also schoolmates from my university. They claim that using this sort of atom of thoughts, any model can enhance its performance beyond what

people previously got using CoT. CoT, I think, is more of a natural flow rather than very structured thinking. But they use structured thinking: they somehow develop thought atoms, then connect them together to build more elaborate conclusions, and finally try to solve these problems. I'm not sure whether this connects directly, but I definitely believe that some of the

basic thinking patterns — for example a very simple one like reflection — can be learned very easily, but it depends on how you use them. You can use them within very delicate thinking patterns, like inserting them as a function inside a more elaborate reasoning process. The set of very basic patterns, I think, is limited, and there's a chance it can be covered by a thousand or so examples. You know, it's funny:

usually when people talk about the Gaokao, they complain about how many years students have to prepare for it and how they don't get to enjoy being teenagers because the Gaokao is looming over them the whole time. But you're the first person who's actually said, hey, there's a really good aspect of the Gaokao, because you do learn to layer all those strategies together. I think people complain because they don't get a chance to go to the university they dreamed of.

If everyone could go to the university they wanted, they'd think, oh wow, that's good, I can choose my university, I'm happy about that, and the Gaokao would never feel like pressure. Okay. But there's also the stereotype that kids in South Korea and Japan and China, preparing for these exams, miss out on parts of their childhood, right? They don't have as much free time. Yeah. I think this is not solely a problem with

the Gaokao itself. I know the Gaokao has some shortcomings — for example, you take it only once, but so much of your later life can depend on it. That's definitely one of its shortcomings. But I also think this is more about the scarcity of educational resources. For example, in one province there may be 100K people taking the Gaokao each year, but only

the top 100 or top 200 of them will go to Peking or Tsinghua University. So people are very worried about it and focus too much on it. I'm not sure how this could be solved, because although I went to Renmin University, I'm not an expert in social studies or anything like that. But

the Gaokao test problems themselves, I think, are fun. But maybe that's just because I could enjoy them; some people just can't. I have to say I learned a lot from it, although my high school years were a little bit frustrating because I had to do a lot of drills every day. So that's my small criticism of it.

So we've been talking a lot about reasoning, and I think you and I both agree there's still a lot of untapped potential in reasoning and in using RL to make models better at reasoning. Obviously everybody in the world is working on this right now. I want to switch and talk a little bit about your paper on Chain-of-Experts. Yeah. So for the audience, one architecture for these models which has turned out to be very efficient is

a mixture of experts. Instead of one giant dense model, you have different sub-models, the experts, that are slightly different in nature.

There's a gating function, or some kind of allocation function, where for a particular type of query the query is routed to a particular expert or subset of experts who try to answer it. So not all of the connections are activated; not all of the parameters are used in the, quote, thinking of the model. This is a more efficient way to run a large language model.
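
For readers who want to see the gating idea in code, here is a hedged, minimal top-k mixture-of-experts layer in PyTorch: a gate scores the experts per token, only the top-k experts run, and their outputs are mixed with the gate weights g_i. The sizes, the use of plain linear layers as experts, and the routing details are illustrative, not how any production MoE model is implemented.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyMoE(nn.Module):
    def __init__(self, d_model=64, n_experts=8, k=2):
        super().__init__()
        self.gate = nn.Linear(d_model, n_experts)    # scores each expert per token
        self.experts = nn.ModuleList(nn.Linear(d_model, d_model) for _ in range(n_experts))
        self.k = k

    def forward(self, x):                            # x: (num_tokens, d_model)
        scores = self.gate(x)                        # (num_tokens, n_experts)
        topk = scores.topk(self.k, dim=-1)
        g = F.softmax(topk.values, dim=-1)           # gating weights g_i for the chosen experts
        out = torch.zeros_like(x)
        for slot in range(self.k):                   # each token's slot-th chosen expert
            idx = topk.indices[:, slot]
            for e in idx.unique():
                mask = idx == e                      # tokens routed to expert e in this slot
                out[mask] += g[mask, slot:slot + 1] * self.experts[int(e)](x[mask])
        return out

x = torch.randn(4, 64)                               # 4 tokens
print(TinyMoE()(x).shape)                            # torch.Size([4, 64])
```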

And Zihan and his collaborators recently wrote a paper where they did something interesting. I think the way a physicist might say it is that you're making a kind of superposition of experts, right? There's a coefficient in your formulas, g_i, I think, and you're mixing the experts at every step of the inference. So maybe just talk about what you guys did. Yeah, definitely. I can start from the intuition behind the project.

One day when I was thinking about mixture of experts, I thought: this is kind of like customer service. There is a token here, it's like a support ticket, and you would like it to be passed to some of the experts, and when they solve it, the ticket can be closed. But I kept thinking that in the real world, people don't just have separate experts each handle the ticket and then close it.

Instead, they set up a chat group with the different experts and let them communicate with each other. So at the beginning of my research, I believed that if I could choose the experts first,

and then have those experts process the token multiple times — each time the token might be processed a little bit differently, but the experts stay the same, with different experts handling different parts of it — that could work. I was fairly sure it could work, and I did some experiments. But I found that the code

was a little bit hard to write, because you would like to lock in some experts so that only they are used to process this token. And at that time I thought, what if I

don't limit which experts are used to process the token, and just let the machine learn freely? We know that sometimes we humans push too many constraints onto the machine, and if we relax the constraints, maybe the machine can do better. So I just removed the constraints, and found that the model learns very well, even better than without the chained passes. So I think I can now formulate it as

a combination of two takeaways. One of them is that

the experts can process the token sequentially. Previously, people found that the experts could process the token in parallel and handle it very well. But now, a token can be passed to one group of experts, say group A, in the first iteration and then to group B in the next iteration. And

this can be effective because it enhances the effective depth of the MoE layer. In previous MoE research, the MoE layer is just one layer. But we think that if we make it more of a sequential process, we are actually turning this one MoE layer into several layers: the first iteration is

expert group A processing the token, the next iteration is expert group B, and the layers stack. So we believe that such communication can increase the effective depth of the model. I think some earlier chain-of-thought research also pointed out that chain of thought tries to increase the language model's effective depth by making the tokens be predicted sequentially.
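
Here is a hedged sketch of the sequential intuition just described: instead of a single parallel pass, the same MoE layer is applied for several iterations, so a token can be routed to different expert groups at each step and the effective depth grows. It reuses the TinyMoE toy layer from the earlier sketch and is only an illustration, not the paper's implementation.

```python
import torch

def chain_of_experts_pass(moe_layer, x, iterations=2):
    """Apply one MoE layer several times; each pass may route to a different expert group."""
    h = x
    for _ in range(iterations):
        h = h + moe_layer(h)      # residual connection keeps repeated passes stable
    return h

# Example with the TinyMoE sketch from above (assumed to be defined):
# out = chain_of_experts_pass(TinyMoE(), torch.randn(4, 64), iterations=2)
```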

I think there are a lot of relevant papers trying to prove theoretically that CoT is effective, and I think something similar is happening in CoE — that's also why we use the "chain of" name. And the other thing is that I feel in the CoE paradigm we can enhance expert specialization.

For example, if there is an expert who is really good at handling a certain token, it has the chance to process that token multiple times in different iterations. The same expert keeps handling the token, but each time it processes a different state of the token. It's like a support ticket: it gets half solved first, and then the expert sees how it and its

colleagues have tried to solve it; then the token is passed to that expert again, and now it can see the token is half processed and it can handle the second half of it. But this is all based on the assumption that this expert is really good at handling this token. We don't have much experimental evidence on that yet, but I think it can definitely be

verified by checking whether the experts chosen in two different iterations are the same or not; metrics like that help us interpret the experimental results. So basically there are just the two hypotheses I described, but I think we definitely need more

experiments to verify them before our next, more comprehensive release. There's also a side point worth mentioning. People have been doing a small release first and a comprehensive release after that. Publishing has been moving from journals to conferences, then to arXiv, and now there's Twitter. Yes. I have been adopting the practice of

making a small but relatively complete release first. For the second release, I'm still preparing a lot. I hope to resolve all of the questions people have been asking since my first release, because they are genuine feedback that can help me improve the paper, or just help the project by showing what people care about. For example, some Twitter users left

a lot of comments in the comment section, and I really learned a lot from them. So I will try to resolve every comment before my next release. That's great — that's a great way to do it. It's like real-time science, where you're giving the seminar on X and you're getting a lot of good questions back. Yeah, that's a good metaphor, I think. So, this is kind of a dumb question, because I'm sure you said this in your paper, but when you were explaining it just now I wasn't sure what the answer is. You have these extra layers — are you

actually re-pretraining the whole model? Once you set up the Chain-of-Experts architecture, do you need to basically retrain the whole model? Because the layer connections probably depend on it, right — or probably should be changed so that the experts do the right thing.

Yeah, we just train all the models from scratch. Okay. That's why we are choosing 0.5-billion-parameter models, and it's even an MoE model, so the activated parameters are even fewer. Right. So I think this is the limitation of the current CoE work. With this style of research, you can show that there's a delta, an improvement, for the small model using this different architecture. But a skeptic would say, yeah, but what happens when it's

you know, a hundred-billion-parameter model? We want to know: is it the same qualitative improvement, or is it bigger or smaller than what you saw? So obviously there's no substitute for eventually trying things at scale. Yeah. So,

that's an important topic that we will work on next. I know some of your audience also works in tech, so if any of them have similar ideas, they should feel free to let me know — I'm happy to learn that way. One of the important topics we will work on next is transferring knowledge from current MoE models into a CoE counterpart.

Then we wouldn't need to pre-train anymore; we could just leverage the knowledge in already pre-trained models. That could definitely be hard, because current MoE models are trained to maximize efficiency for parallel processing — for example, each expert handles a token only once,

so it will try to extract the maximum information from that single pass. But in CoE we want the experts to process the token multiple times, and each time they can communicate with each other. So the objective is a little bit different, and I'm not sure how much knowledge we can transfer from

current MoE models to a CoE counterpart. But I definitely think this is worth doing, because first, we don't have much money to pre-train a model from scratch, and second, people always want your method to work on any kind of thing with as few assumptions as possible. The current assumption is that you need to initialize the model from scratch and train it.

But if we can relax this assumption, I mean make it more widely applicable, the question is whether we can take an already pre-trained model and transfer it to CoE. We have shown that training CoE from scratch is useful, but what about training CoE starting from an MoE model?

Yeah, I mean, outside of the big labs — something that for them would not be such a big run, maybe a few hundred thousand dollars or a million dollars — it's still a lot of money for an academic group to come up with, right? To actually do a full pre-training run. Yeah, it's harsh. I just calculated — it's

like the annual salary of all our PhD students. Yes. Yeah. I have a former colleague who was in theoretical physics, but now he does AI, and he's at the Allen Institute in Seattle, which is funded by the foundation of Paul Allen, who was a co-founder of Microsoft a long time ago and has since passed away.

But, you know, they're kind of in the middle, where I think they have some resources that maybe a university group wouldn't have. And they're actually trying to create pretty much competitive models, but fully open source, with even the training data open sourced. So what they're trying to do is very admirable. Yeah, I learned a lot from their OLMoE training.

Actually, we got our cost estimate because they open sourced everything, including GPU hours, so we could estimate based on that. I think they're fantastic because they try to open source everything they can — even the wandb logs, the experimental logs with a lot of metrics, which —

I'm not sure if they are the first to do this, but I think they are the first to open source a pre-trained model's experimental logs. Yeah, I know I could be wrong, but this is the first case that I've seen. It's the only case I'm aware of. I don't think any of the others, even DeepSeek or Meta, give you that much, right? So maybe only the Allen Institute does.
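[Editor's note: as a rough illustration of the kind of cost estimate Zihan mentions deriving from OLMoE's published GPU hours, here is a back-of-the-envelope calculation; both numbers below are placeholders, not the actual figures from any release.]

```python
# Back-of-the-envelope pre-training cost from published accelerator hours.
# Both values are hypothetical placeholders, not OLMoE's reported numbers.
gpu_hours = 70_000          # total GPU-hours, taken from a hypothetical training report
usd_per_gpu_hour = 2.50     # hypothetical cloud rental rate
cost = gpu_hours * usd_per_gpu_hour
print(f"Estimated pre-training cost: ${cost:,.0f}")   # -> $175,000
```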

Yeah, yeah, yeah. I think this is just a huge benefit for researchers, because researchers may know roughly what kind of parameters could work, but they want to learn more from the detailed metrics you used in each of your experiments. So actually, we are not currently open sourcing our 1B logs because they're a mess for now, but we are trying to open source them later for all of our releases, because we know that once we release them, it might not be

that helpful for most of the audience, but it could be helpful to those who really want to do research. Good. Well, I told you we'd talk for about an hour, and we're just about at an hour now, so let me start winding up a little bit. Let me ask if you have any thoughts about how things are going to play out over the next few years. Are there any

non-obvious predictions that you want to make, or things that you think are going to happen for sure? Anything that might be surprising? Anything you want to say about the future? I'm not sure if AI could help accelerate getting people to the moon.

You're not sure. But I hope so. Yeah, I hope so. I think robotics could definitely be one of the biggest improvements over the next few years, because previously robots just couldn't do any semantic understanding. Yes. But current large language models

with vision capabilities, I think, are going to change that. Yeah, I think this raises the probability a lot that we will see AI in the real world. For example, humanoid robots that help you do a lot of things, like household chores, because they really understand your language.

Previous AI pretended to understand your language, but it was really a fixed function: for example, you could ask it to do function A or function B, and those were predefined when the robot was produced. But now you can ask it to do a lot of things for you.

They have that possibility now. I know that for current robotics AI there is still a generalization problem: they are trained on task A, and they can do somewhat well on task B, but not that well. So if people can solve this problem, I think within several years we will see robotics really incorporated into our lives.

And after that, it's about the acceleration of research. I've been posting privately on my WeChat that I couldn't have imagined having two first-authored releases, blog releases, to be honest, in just one month.

So I think research has already been accelerated today, basically because you can get the necessary information for your project much faster. Previously, when people wanted to learn something, they could only read predefined documents. But now people can ask any of the AIs and

say, okay, I already know A and B, please explain C to me. And everyone has different A's and B's, but the AI can give the right explanation of C to different audiences with different A's and B's. So obtaining information is really getting faster.

But I think that's the acceleration from current-stage AI. And I think definitely in 2025, within this year, AI could help you debug your code. I think it's very plausible, because current AI can already help me debug my code, but at the file level, so I can get help from AI within a single file. For repository-level debugging, it's not doing that well now, though in some cases it does well. But if,

at the repository level, the AI can understand my research progress, so that if I ask it a query a second time I don't need to re-input everything from the first time, and I can just assume that it's aware of

any progress I've made on my project, then I can really have a great assistant that helps me do my work. Whenever I have a bug, I don't need to ask someone else for help or spend a whole afternoon on it. I can just ask the AI to find where the bug is and what kind of code I need to write. It helps me write 90% of the code, and for the 10% it's not sure about, it asks me what

approach I want to take, and I just do that 10% of the work. So I think this would be the most significant boost, because, as someone has said, I think it was Andrej Karpathy, we may only need to write code in natural language. We just need to tell the models about our ideas, and we don't actually need to write the code ourselves. We just need to understand the code they write for us.

And I think that's another big part of the acceleration of research. And when both of these parts merge together, I'm not really sure what kind of research progress we'll have at that point. Yeah. Very good. Yeah, I agree with everything you just said. I think we're right on the edge of being able to

have the AI understand our repositories quite well, and then be able to say in natural language, or in very natural pseudocode, what we want, and it builds it, knowing what tools it has available in the repository. I think we're very close to that in some settings. And one thing I was just saying to my research group earlier today is, if you want a review article,

like there's some new area you're trying to understand and you want someone to write a review article on it, the AI will do it. And you can even, as you said, say, I already know A and B, C is what I'm trying to learn about, please use these articles as context and write me an introductory review article so I can understand it quickly. That is a thing which, you know, I would not have imagined would be possible a few years ago, but it's totally possible now.

I would never have imagined that even one year ago. Yeah. It's insane. Yeah. Great. Well, Zihan, I really appreciate your time. I'm sure my listeners will enjoy this conversation. Thanks so much for joining me. Yeah, yeah, thank you so much. Actually, I really enjoyed chatting with you, because your questions

got me thinking about a lot of things I haven't been able to think about in my day-to-day research life, because, you know, research life is sometimes very inspiring, but most of the time it's just so boring, and I need to write a lot of things that, honestly, just aren't fun. So I'm very happy to have chatted with you today, and I really learned a lot of new perspectives.