The current training paradigm assumes that all GPUs must communicate very quickly, which is only feasible in a centralized data center setup. That assumption dates back decades and has persisted largely because it is convenient to keep all the GPUs in one place.
The bandwidth on the internet is much smaller than the bandwidth between GPUs in a centralized data center, making it difficult to synchronize training across distributed systems.
DisTrO allows GPUs to train independently and only share the most important insights, reducing the need for high-speed interconnects and enabling training over standard internet connections.
DisTrO reduces bandwidth requirements by 857 times compared to traditional methods, making it possible for small teams and individuals to train models using peer-to-peer networks, democratizing AI innovation.
The fear that major open-source AI providers might stop releasing models like Llama 4 prompted the question: 'Is there a way to make Llama 4 ourselves without 20,000 H100s?' This led to the development of DisTrO.
DisTrO requires 857 times less bandwidth and can perform equivalently to traditional methods, making it possible to train models over standard internet connections instead of high-speed interconnects.
DisTrO could enable a global community to train AI models collaboratively, breaking the monopoly of large organizations with massive compute resources and high-speed interconnects.
While DisTrO reduces the need for high-speed interconnects, NVIDIA's CUDA stack and GPU hardware remain essential. The shift could lead to a redesign of chips, focusing more on VRAM and processing power rather than interconnects.
Traditional methods require all GPUs to synchronize after each training step, while DisTrO allows GPUs to train independently and only share key insights, reducing the need for high-speed communication (a rough back-of-the-envelope sketch of the difference follows this list).
The community's willingness to contribute their GPUs and computational power is crucial. DisTrO's success depends on activating this willingness into actual action, enabling decentralized training on a global scale.
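To make the bandwidth numbers concrete, here is a rough back-of-the-envelope sketch in Python, not the DisTrO algorithm itself. The 1.2B-parameter model size and the 10 Mbps home uplink are illustrative assumptions, not figures from the Nous report; the 857x factor is the reduction the team cites.

```python
# Back-of-the-envelope sketch only; model size and uplink speed are assumptions.
params = 1.2e9                  # hypothetical model size (parameters)
bytes_per_grad = 2              # fp16/bf16 gradient element
naive_bytes = params * bytes_per_grad       # a full per-step gradient exchange
distro_bytes = naive_bytes / 857            # applying the cited reduction factor

uplink_bytes_per_s = 10e6 / 8   # ~10 Mbps consumer uplink

print(f"naive:   {naive_bytes / 1e9:.1f} GB/step, "
      f"~{naive_bytes / uplink_bytes_per_s / 60:.0f} min of upload per step")
print(f"reduced: {distro_bytes / 1e6:.1f} MB/step, "
      f"~{distro_bytes / uplink_bytes_per_s:.1f} s of upload per step")
```

Even at this toy scale, the difference is minutes of upload per step versus a couple of seconds, which is roughly the line between infeasible and workable on a home connection.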
So suppose tomorrow every company said, we can't release open source models anymore, I'm sorry, it's just bad business. Where would we be in the open source AI space? What are the technical problems that would keep us from being able to replicate that ourselves? And it turns out the real big problem is that when it comes to training models, the current paradigm requires that all of the GPUs that train the model have to be, like, in the same room.
The way the technology stack has grown assumes that all of the little brains that are training the model can talk to each other very fast. And if you are a single entity who can just put it into one data center, it's all good.
But we, and sort of the collective we of the open source AI movement, aren't one entity. So how can we cooperate to actually train a state-of-the-art AI that we all own? It turns out that most assumptions in the AI space right now are a product of that's just how things had been done. So someone made an assumption maybe in the early 90s that everyone just kind of has continued to go with, and we now can revisit those assumptions and see that there's actually a lot of room for growth there.
Once again, thank you for listening to the A16Z AI podcast. I'm Derrick. You're about to hear a very interesting discussion between A16Z General Partner Anjney Midha and Bowen Peng and Jeffrey Quesnelle from Nous. If you want to Google it, that's N-O-U-S, Nous Research.
They discuss it in more detail during the interview, but if you're not familiar with Nous, the short version is that it's a small team of researchers dedicated to making cutting-edge AI more accessible via open-source projects. Their guardrail-free Hermes models, for example, are quite popular in the AI builder community. However, the catalyst for this discussion is a recent paper the team released about a project called DisTrO. And DisTrO is an algorithm for training AI models on distributed infrastructure, utilizing the public internet. 100 megs down, 10 megs up.
The Nous team claims DisTrO required 857 times less bandwidth than the standard approach to distributed training and could perform even better with optimal parameter tuning. Of course, anybody who follows AI closely knows that the kind of setup typically required to train even a reasonably sized model can be cost and skill prohibitive for all but the largest shops.
And a big part of that is achieving the fastest possible communication among the GPUs in a cluster, which is why networking plays such a big role in the architecture of AI systems. Although DisTrO is very early research, the promise should be clear: allowing small teams to train models using peer-to-peer networks, a la what projects like SETI@home or Folding@home did in their respective fields, could help break open AI innovation for the 99% of builders who don't have access to massive compute resources or don't want to be fenced in by the limitations of large AI labs.
So, with that background, here are Anjney, Jeff, and Bowen discussing Nous, DisTrO, and more. It kicks off with Jeff explaining what Nous is trying to do, the unorthodox origin story of the team, and how they met on Reddit.
As a reminder, please note that the content here is for informational purposes only, should not be taken as legal, business, tax, or investment advice, or be used to evaluate any investment or security, and is not directed at any investors or potential investors in any A16Z fund. For more details, please see a16z.com slash disclosures.
Nous Research is an open source AI accelerator effort. So our goal is to accelerate artificial intelligence and to do it in a way that brings it to everybody. And when we say bringing it to everybody, it doesn't just mean that everybody can use it in the sense that like it's a product that we bring, but it's that everyone can also touch the technology underneath it.
So if you just want to use the AI, you can just use the AI. But if you want to open up the code, see the code, and use it, you can do that too; we believe bringing the underlying technology to everyone is as important as bringing access to actually using it. We've seen that open source innovation is a multiplier in the technology stack. Every kid who wants to learn to do something starts with the open source, free thing that they can get on the internet when they're 10 or 12: what can I do, what can I touch?
So our goal is to make sure that the building blocks of this transformative technology are kept open and that we're able to do the research to bring state-of-the-art AI to everyone.
What are you guys trying to do in terms of the roadmap or the set of milestones you guys have decided to focus on to get there? I think for my purpose at Nous, we are doing fundamental research. We try to push the boundary of what we can do with as little compute as possible, contrary to what others usually do. We are really, really split, like,
all kinds of people, right? It's like, it's not a centralized, really cohesive group where we're doing a lot of stuff at the same time. So we explore really a lot of alternatives. We try to do a lot of cool stuff, I think. It's interesting because the AI space right now, if you do it from like a scientific perspective,
is unlike a lot of other areas of science, which are highly ossified. So if you want to be in biology or chemistry, you have to go through the academic process and maybe you get your PhD and then maybe if you study for a long time, you can make a tiny contribution to the store of knowledge.
We're lucky right now because the state of AI as a science is incredibly new and it truly has a wide green field. And unlike lots of other areas of science where if you look at something and you think, "Why hasn't somebody done X, Y, and Z?" Unfortunately, often the reason is someone did try it and there's like a reason it doesn't work. That really isn't the state of play right now with AI. Like really, you can shake a stick at almost anything and make
groundbreaking novel research in this area. And so because we have that, we're in that time and in that space right now, we bring together a lot of people who have a lot of different views of the world and maybe for whatever reason didn't go through the traditional academic sort of, you know, not to disparage it, sort of your typical Stanford, MIT kind of like path. We bring other like divergent people together and we say there's this incredible thing that's happening where we can do research and make fundamental breakthroughs for everyone
and you can do it as who you are, not just as this one monolithic piece of a machine. So we consider ourselves very highly individualistic and personal versus like we are this one organization. But that's what we're doing. We sit in a unique time and space right now, and so we're taking advantage of that. - And how did you guys get here? What are your backgrounds? - I actually, I grew up in Detroit and worked in automotive for like 15 years, working on autonomous driving and stuff like that. And I actually, I did that for 15 years,
I enjoyed it, but I'm actually not like a car guy. And at one point I actually got like crypto pilled. I was working on my master's research on Zcash, which is a cryptocurrency, but it was kind of like static. And I eventually found Ethereum, which is this like programmable blockchain. I'm like, this is awesome.
And so I started writing all this smart contract code at night and while I was still doing my job and my wife was just like, this is what you love. You know what I mean? Like, this is what you love. And she really gave me the push to be like, why don't you just go do this? So I spent, I left behind like the nine to five, the healthcare, like all of that, left it all behind to sort of strike out on my own and just follow my interests and see what
where that took me. And through that process, I discovered I'd always been sort of AI adjacent, because our customers at my previous company were doing autonomous driving. And then I was just, you know, on the Internet and I saw Stable Diffusion when the first Stable Diffusion came out.
And I was like, okay, there's something happening here and I need to find out what's happening. So through that curiosity and wonder of being like, I need to really understand how this works, I spent maybe just a year going through the scientific process of learning everything there is to know about modern AI, the math behind it, everything behind it.
And I had that freedom now to be able to do that. And through that curiosity and wonder, I'm one of these people who wants to be able to touch the technology that I use. I wanna be able to step through the code because that's the way I interact with the world by stepping through the code.
And what I discovered at the time was that ChatGPT had come out and I'm looking, I'm like, the state of open source AI at the time was way behind the closed competitors. And I was just like, how are we ever going to, like how's open source people like me, how are we ever gonna get to that pinnacle on the hill of like what's happening at OpenAI and some of these other places?
And I just said, well, what are some of the things that are stopping that? Is there a reason that we can't have this from like a scientific perspective? And it turns out it was just that some people needed to do the work. And we started by, I was just an anonymous person on Discord doing some research, and I started sharing with other people who had a like-minded approach. And then someone invited me to this Discord called Nous Research. And it was just,
other anonymous people hanging out exchanging ideas, and sort of out of that came what we are today. So that's how it started from here to there. - How did you guys meet, and what's your story? How did you fall into it? - Yeah, for me, like that's a long story, but at first, you know, when I was a child,
I really love tinkering around stuff. I love toys, electronic toys. There was some kind of like electronic Lego which you could do like create a circuit like create an AM radio with like just Lego blocks and those type of toys was like so interesting to me because I love tinkering. So when I went into university, college, I wanted to work with computers because computers are so like
you can do everything. That's the promise of technology, is that using computers you can basically do everything a human can and that was like my dream basically.
Surprisingly, when I was in university, my first, actually it was one of my first classes, it was one of the programming classes. The teacher was Aaron Courville, and in the first class he showed us what machine learning can do. And he showed us slides on like generative models. That was 10 years ago.
And that was amazing. I'd never seen such a thing, like how computers can generate images. That was back in like 2014. That was GANs at that time. GANs were the newest thing. Like nobody knew about GANs, but he did. And he showed us these really crazy slides about machine learning generative models. And I was like, I have to study this. Then I just went into machine learning. I also did my master's in computer graphics and machine learning because I always loved
computer games. I wanted to make a game or some kind of thing. But as AI progressed and as ChatGPT came out, and those Stable Diffusion models, I was like, yes, this is the future.
Generative models are going to change the world and I really want to be part of it. And then I did some freelance research work and I met Jeff. We met actually just on Reddit. We were working on sort of the same area of research; I made a post on r/LocalLlama and he made a post, and I still remember I was going to pick up a pizza and I just got an email, and I opened it up and he had just cold emailed me like, "Hey, I saw all this stuff you wrote. I have these new results too; let's get together."
And sort of out of that became, you know, what we are today. Maybe it would be helpful to just first talk through some of your earlier work. You know, what kind of projects have you guys released so far leading up to DisTrO? Probably we're best known for our Hermes series of AI models, which are models that we train that are sort of
neutrally aligned. It's not that we're making uncensored models because we think, oh, you need to be doing all these terrible things with them; rather, it's a series of AI models where whatever the user directs the model to do, the model will do.
And so this is maybe in contrast to some other closed providers who have to put these guardrails around it. And those guardrails exist for maybe for good reasons, especially as a centralized sort of US-based company, maybe, or something like that. But what's interesting is that most AI models that people interact with today take on the persona of what we call the helpful assistant.
If you've used a chatbot, you know it: the helpless, harmless assistant, sort of the front door secretary of AI. And that's a very neutral approach and it will sort of go, "Oh, that's a good idea. Have you thought of this? That's nice." But it has adopted this specific persona. What we try to do is make models that you can instruct to take on any persona that you desire.
So we say, instead of training the model to be like, you are an assistant under these constraints, help them the best you can. We make models where we say, take what the user says and adopt that worldview. Really adopt the worldview of whatever the user wants you to take on
Take that as the a priori truth, and then play out the scenario from there. Now, obviously, the helpless, harmless assistant is one of those personas that an AI model can take on, but it's not all of them. Right. And certainly we try to create that expressiveness, too. And it comes from this individualistic approach that we take, which is that everyone
should be able to interact with the system, not in a way where it's moralizing down to you, but in a way that's empowering you as a person to be a better person. I actually got excited about language models because I love sci-fi and fantasy stories, but I'm a terrible writer. And these helped me to write my own stories and come up with my own plot lines.
And right now, if I tried to do that with some other models, they might say, well, I can't, that's copyrighted material. You know what I mean? But these models that we make help me express myself. And so we try to make models that, wherever you are, will help you express yourself and be an extension of you. And then we've also done other research, fundamental research,
just looking at specific technical problems. So Bowen here is the lead author of a method we developed called YaRN, which is a context window extension method that we released and did the research on. It is now used by basically every model you use.
Everything: ChatGPT, Llama, DeepSeek, all of them, they all use the methods that we pioneered in the YaRN paper. And that came out of the fact that open source AI at the time only had these very, very narrow context windows. They could only deal with a very little bit of text that you could give them; then the model gets amnesia after that.
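As background on what a "context window extension method" actually touches: most of these models encode token positions with rotary embeddings (RoPE), and the simplest extension trick is to rescale positions so a longer sequence fits inside the position range seen during pretraining. The sketch below shows that simpler precursor idea with made-up context lengths; it is not YaRN itself, which refines the approach by treating different frequency bands differently and adjusting attention scaling.

```python
# Simplified position-interpolation sketch, a precursor idea to YaRN, not YaRN itself.
import torch

def rope_angles(seq_len: int, head_dim: int = 128, base: float = 10000.0,
                scale: float = 1.0) -> torch.Tensor:
    """Rotary-embedding angles; scale < 1 squeezes a longer sequence back into
    the position range the model saw during pretraining."""
    inv_freq = 1.0 / (base ** (torch.arange(0, head_dim, 2).float() / head_dim))
    positions = torch.arange(seq_len).float() * scale
    return torch.outer(positions, inv_freq)   # shape: (seq_len, head_dim // 2)

orig_ctx, new_ctx = 4096, 16384                # illustrative context lengths
angles = rope_angles(new_ctx, scale=orig_ctx / new_ctx)
print(angles.shape)                            # torch.Size([16384, 64])
```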
And we were looking at what OpenAI had with ChatGPT, and they could take like 4,000 words, and at the time the open source models could only take like 500. And we're like, how are we gonna bring this to everyone? We wanna bring this to everyone. And so we just did
the technical research to sort of knock down those barriers that kept us from offering the current state-of-the-art to everyone. So we do research on that side. We also have our models, and that's kind of where we've been previously. But when it came to Distro, we had the same sort of attitude, which was what keeps the open source community from being able to create their own state-of-the-art AI from the ground up?
We are in the open source world, at least, very beholden to the goodwill of several organizations. And number one of them probably is Meta. So what they do with the Llama models is phenomenal. And I love Mark. But we looked at the world and we said, what happens if he can't do that anymore?
And that may not even be up to him, you know, in particular. There's lots of things happening in the legal space, all around that. So we said, well, suppose tomorrow every company said we can't release open source models anymore. I'm sorry, it's just bad business. You know, it's not an anti-open-source thing, it's just bad business. Where would we be in the open source AI space?
And so what are the technical problems that would keep us from being able to replicate that ourselves? And it turns out the real big problem is that when it comes to training models, the current paradigm for training models requires that all of the GPUs that train the model, these computers that do the training, they all have to be like in the same room.
And that seems very counterintuitive, but that is just a cold hard fact. And that cold hard fact is because the way the technology stack has grown, it assumes that all of the little brains that are training the model can talk to each other very fast. That's the simplest way to do it: they can just all talk to each other very fast and they're able to train it. And that's okay if you put it all into a warehouse and if you are a single entity who can just
put it into one data center, of course, you're going to take all your GPUs, put them in one data center, here's your data center, it's all good. But we, in sort of the collective we of the open source AI movement, aren't one entity. So how can we cooperate to actually train a state-of-the-art AI that we all own?
And there were a bunch of technical problems that ultimately led down to the fact that the bandwidth on the internet is much smaller than the bandwidth between these GPUs that they train in these data centers. And that may or may not have been like something that was insurmountable. And in fact, for a lot of time, people thought it was insurmountable, that there really wasn't a way to do this. But we sort of, and Bowen can talk about this too, it turns out that most assumptions in the AI space right now are a product of
That's just how things had been done when there was not nearly as much energy and attention on it. So someone made an assumption maybe in the early 90s that everyone just kind of has continued to go with. And we now can revisit those assumptions and see that there's actually a lot of room for growth there.
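To make the "all the little brains have to talk to each other very fast" assumption concrete, here is a simplified sketch of the conventional data-parallel step in PyTorch: every worker computes gradients on its own slice of data, then all workers average the full gradient before anyone can take the next step. This is a generic illustration of the standard pattern, not Nous's code; real frameworks bucket these all-reduces rather than issuing them per parameter.

```python
# Generic data-parallel step that leans on a fast interconnect (illustration only).
# Assumes torch.distributed.init_process_group(...) was already called by a launcher.
import torch
import torch.distributed as dist

def training_step(model, batch, loss_fn, optimizer):
    optimizer.zero_grad()
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()

    # The expensive part: averaging every parameter's gradient across all
    # workers, once per step. On NVLink/InfiniBand this is fast; over a home
    # internet connection it would take minutes for a multi-billion-parameter model.
    world_size = dist.get_world_size()
    for p in model.parameters():
        if p.grad is not None:
            dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
            p.grad /= world_size

    optimizer.step()
    return loss.item()
```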
There's so many different problems that a research team like you guys could work on. What is your criteria for deciding what to work on? I think the biggest criteria is that this research should be as fundamental as possible, because when you get into the engineering, the actual training or the actual data collection, that needs really big scale. Right. And as a small group, you cannot go to that scale.
So if we look at everything that's very mathematical, that you can tweak, a lot of hyperparameters can change, then that's really good because we can do smaller experiments and iterate from there.
And it's really those things that are kind of like the 10x power-ups. Yes. Like if you look at all of the sort of blockers, you realize there's this one little load-bearing piece that, if we could just solve it, all these other things on top of it would collapse down. And those things where we don't have the scale to reach the mountain, because we don't have 20,000 of the most expensive computers, it turns out that if we just knock this one little brick out at the bottom of the Jenga set, this whole thing collapses down.
and sort of looking for those like kind of important pieces where if we just pulled this one piece out and solved it, it would enable a 10x or 100x multiplier for people in the open source space. And in the case of Hermes, what was the bottleneck that you feel like that effort was trying to solve? That particularly was with how the data is collected.
So the early models, especially ChatGPT, GPT-3.5, were trained through tons and tons of human data, human data collection. Human data collection is very slow. It's very expensive. And if you wanted to make a model at the time, the idea was you had to go spend all this money to get this human curated data, which is very slow. So Hermes was very early to the idea that you could have synthetic data,
which is that you could make a better model by taking an AI model, having it generate words and text, and then training a new AI model on that output. This is now like fully accepted, everyone's doing it. But two and a half, three years ago, there was this giant open question of like, is that actually
not going to just be this reductionist thing that collapses. Like how could the student ever eclipse the master kind of question? And we said, but we don't have access to the resources that would enable us to go get all the human data. I like to tell our team, we're like the astronauts in, you've seen Apollo 13, where they dump all the stuff out on the table, right? And they're like, we got to make this fit into this using only this.
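A minimal sketch of the synthetic-data loop Jeff describes: a teacher model generates text that then becomes supervised fine-tuning data for a student. The small stand-in model and the prompts are placeholders for illustration; this is not the actual Hermes data pipeline.

```python
# Minimal sketch of the synthetic-data idea described above: a "teacher" model
# generates text, and that text becomes training data for a "student" model.
# The model name and prompts are placeholders, not the actual Hermes pipeline.
from transformers import pipeline

teacher = pipeline("text-generation", model="gpt2")  # small stand-in teacher

prompts = ["Explain photosynthesis simply.", "Write a haiku about the sea."]
synthetic_dataset = []
for p in prompts:
    out = teacher(p, max_new_tokens=64, do_sample=True)[0]["generated_text"]
    synthetic_dataset.append({"prompt": p, "response": out})

# In practice the responses would be filtered and curated, then used as
# supervised fine-tuning data for the student model.
print(len(synthetic_dataset), "synthetic examples generated")
```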
And that's kind of how we look at the world. Suppose all we have is on the table. How can we make all these pieces fit together to like, you know, get us to the moon and back? Right. When was it that the conversation about going from working on the data bottleneck to the training bottleneck, which is what Distro is, in a sense, in my head, there's sort of, if you think about the AI production pipeline of models as like
first mile, middle mile, last mile, you guys kind of started with the last mile. - Yes, yes. - Right? - And then it's the first mile. - Yeah, yeah. - DisTrO is kind of a first mile effort. When was it that you guys started to shift your focus to this? Why, and then what is the core idea behind DisTrO? - I think we haven't shifted per se. It's more like we have so many people working on different things and then we prioritize those that we find that are the most promising. So everyone's looking kind of like
at everything. And then we had this pipeline that showed a lot of promise, and we looked into it, and now it's DisTrO. So the data collection parts are also still being worked on, right? It's still being improved, and Hermes 4 is probably going to be really great. And now we just have another team, like my team, which works on DisTrO.
It's in parallel, right? So then other ideas could pop up and we would have another team working on it. Yeah, but the start of it really was this idea of what if we don't get Llama 4? Exactly. That was a challenge we were appreciating. What if we don't get Llama 4? That's like an existential threat, right? That's an actual existential threat, because the closed providers will continue to get better and we would be
like dead in the water in a lot of senses. So we sort of asked, is there any real reason we can't make Llama 4 ourselves? And there is a real reason, which is that we don't have 20,000 H100s. I think Elon's got 100,000 H100s now. We don't have that right now. God willing and the creek don't rise, maybe we will one day, but we don't have that right now. So we said, but what do we have? We have a giant activated community
who's passionate about wanting to do this and would be willing to contribute their GPUs, their power to it if only they could. If only they could. So we have the community who's willing, but we don't have the ability to activate that willingness into actual action. So what are, let's just look at it. Why can't it work? And it turns out really people have been trying to do this for a long time, but there were these very specific technical problems that just made it intractable.
There was a group called BigScience who made Hivemind, who worked on trying to do this, but they just had technical limitations because they couldn't
ship all the information over the internet. The only way people are connected is over the internet, and so anything that isn't sharing over the internet is not going to work. So that was the initial premise it came out of: what if we don't get Llama 4? And then, what do we have that we could use to create Llama 4? And if we can't, what are the technical problems where, if only we slayed that one technical problem, the dam would break and our community could flow and actually solve the problem. So to summarize, DisTrO is research
showing that it's possible to train highly capable models with nothing more than a standard internet connection, right? As opposed to the kind of status quo, which is high-speed interconnects without any dramatic reduction in performance of the model. Is that roughly, would you say, a fair summary? And we would say actually the performance is equivalent.
So we've tried to make sure that that was the case before claiming it. We ran a lot of experiments just to make sure that it is actually not worse; and for the same bandwidth it is better, because now you need 1,000 times less bandwidth. Right.
No one's interested in something with a trade-off where, eh, you can do it, but it's not as good. It's got to be A1. It needs to be as good as the centralized way of doing it, no asterisks to it, and then it doesn't have to be in a data center. So the benchmarks are pretty impressive. But before we get to that, could you just explain, maybe for folks who are listening who may not be super familiar with AI infrastructure, why does it matter that you guys...
have demonstrated a way for models to be trained with everyday internet connections. Yeah, so Llama 405B was trained on a huge data center of like 40,000 H100s. Those are highly interconnected and use a lot of power, because every GPU needs to be connected to every other GPU. So if you have 40,000 GPUs, then that's like a quadratic scaling, right? Every GPU you add, you have to connect to even more and more.
That requires a lot of power, a lot of cooling, a lot of everything. And basically, to train those models, you need those high-speed interconnects between the GPUs. But what we found is that when we train bigger and bigger models,
the data that you actually need to transmit between the GPUs become relatively less and less. So it doesn't become smaller, but for the size of the model, as you grow the model, the size of the communication grows slower. It doesn't grow as fast. So now you don't need those big, big interconnects. You probably can have maybe four data centers or like 10 data centers separated by the internet and train
the same equivalent network using those data centers. And if you push that boundary to the extreme, then you could have your home computer, your 4090 or some 4080, any consumer GPU, or even your Apple device, your phone, your laptop; everyone could potentially be connected into this huge web over the internet that trains this single network, right? So just to give people a sense of how dramatic of a change
that is. How many organizations do you think in the world today have the capability to train a co-located model of the scale of a Llama? You mentioned about 40,000 H100s, right? I mean, I could probably count them on one hand and probably wouldn't use all my fingers. You know? Yeah. I mean, you basically have OpenAI, Anthropic, Meta, X, Google, and then you have a few, Mistral, and then DeepSeek, and a couple of other places where maybe a specific
country is backing them. But certainly, we can name them all and not run out of appendages to get there. Yeah. Right. And so the second and third order effects, I guess, of needing this high-speed interconnect between GPUs in a single location has basically restricted the number of people who can train those models. Correct. Yes. Exactly. Yes. I see. And so the insight was you guys found a way to decouple the scaling of the, you could say, performance of the models from the scaling of the interconnect required.
You guys have essentially tried to decouple those two things. And at one point, that almost becomes like a human coordination problem. If you're a single entity, you are aligned with your own coordination. You're just like, "Aye, we're gonna do what we want." But even if you
could, how would you bring all these people together to work together on one thing? And that's really like, we think that has an unlock to still be discovered and researched what that looks like as like a realization. But the idea that the entire world could work together to create one AI that's representative of the whole world that everyone's contributing to, I mean, there's probably a lot of power in that. And so we're interested in seeing sort of where that could go given this.
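Circling back to the interconnect point above: if every GPU really had to talk to every other GPU, the number of pairwise links would grow roughly with the square of the cluster size. Real clusters use switched network topologies rather than a literal full mesh, but a two-line count still makes the scaling pressure clear.

```python
# Pairwise GPU-to-GPU relationships in a fully connected cluster: n * (n - 1) / 2.
for n in (8, 1_000, 40_000):
    print(f"{n:>6} GPUs -> {n * (n - 1) // 2:,} pairwise links")
# 8 -> 28;  1,000 -> 499,500;  40,000 -> 799,980,000
```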
How did you guys embark on the first step? What was the first sort of step, from a research and systems architecture perspective? What was the first thing you guys decided to do to then try to figure out if this was even possible? I think I thought about it for like a couple of weeks, because there are a lot of implications on the mathematical side, actually. So we met in New York. I think we had a long discussion on, like:
Is this even possible? Right. And we came to the conclusion that this should be possible, because there are some training dynamics in neural networks that can be taken advantage of which allow you to do this type of thing. But that was all purely theoretical. So then we just wrote the code and tried it. That was a really big leap of faith. It took months before we got our first data point
of whether this was really going to be true. And there actually were several false starts along the way that turned out to have been wrong. But the core mathematical properties, actually, it's funny: a lot of people have discovered that everything we have in AI was invented in the 90s and no one noticed it, or it was invented in the 80s. There were just a few math insights, insights from maybe 15 or 20 years ago,
that seemed innocuous at the time, turned out to like be extremely powerful when applied to what we know. And then there was also that, just the fact that we had many false starts that didn't work. And we said, okay, like let's go to the next, we gotta keep going. Like, you know, like to have that sort of faith. And a lot of that comes from having the freedom from,
a business perspective, kind of like where we are in the space, to be able to explore that, you know, and to be able to be comfortable with not having to deliver something right away and to just work it out till it's done. - I think with DisTrO, we hit not a wall, but like a bulletproof glass. We saw hints of it working on the other side. Like we saw the other side, but there were at first a lot of problems that we couldn't solve, and we worked on it for months and months. And one of them was actually scale.
So one of the biggest breakthroughs we had was just Jeff training on a bigger model. And that was the last step. That was just like the amazing result we had. At first, we didn't even believe it. We thought this was false, like we made some mistake or used the wrong data. It was too good to be true, right? But it did work in the end, and we did more benchmarks. And that was, I think, where the work paid off in the end. Yeah.
In that phase when you're trying to go from theory to hypothesis testing, what was the first major hypothesis that you guys tried to test, or metric that you were trying to measure, or indication you were looking for that from a practical perspective this would work? So at the beginning we tried fine-tuning, because fine-tuning is really fast. You can take a really small model and just fine-tune it on a small amount of data. And we found out that with fine-tuning we could actually decrease the bandwidth.
It was kind of iffy because it's hard to judge the quality of a fine tune. There's a lot of benchmarks, but then when you have like a tenth of a percent, is that meaningful? You don't have a lot of indicators in that kind of scenario. But as we scaled up and we started doing pre-training when you had more and more compute, we got access to more compute, then we could definitely say that this is better than the previous one. - Right, got it. And the results you guys ended up publishing,
were sort of staggering. There's an 857x reduction, and that's the worst-case scenario reduction in bandwidth requirements; it's a conservative estimate. It's a really conservative estimate. What would be the optimistic side of the spectrum of bandwidth reduction? I think we're seeing hints of it being potentially like 2,000, 3,000. So even a little bit more with this method, and potentially using quantization, you know, other things.
There are so many things people could try with this method that we haven't tried, right? That could unlock another 10x. So that could grow. But we can't really say that yet; we try to be conservative. We want to be sure that it doesn't degrade the model in some other way, because the benchmarks we have are great, but they only measure in the ways that we measure them.
Could you talk a little bit about what you guys benchmarked? So there's, for example, cross-entropy loss, which is the difference between the output of the model and the ground truth, the real data, versus what it predicted. There's also perplexity, which is the same measure expressed in a different way. And there are benchmarks like HellaSwag, MMLU, and those types of question-answering benchmarks.
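As a quick aside on how those two metrics relate: perplexity is just the exponential of the cross-entropy when the loss is measured in nats, so they are two views of the same number. A minimal PyTorch illustration with toy values:

```python
# Perplexity is exp(cross-entropy) when cross-entropy is measured in nats.
import torch
import torch.nn.functional as F

logits = torch.randn(4, 32000)           # 4 token positions, 32k-word vocab (toy values)
targets = torch.randint(0, 32000, (4,))  # the "ground truth" next tokens

ce = F.cross_entropy(logits, targets)    # average negative log-likelihood per token
ppl = torch.exp(ce)                      # perplexity: an effective branching factor

print(f"cross-entropy: {ce.item():.3f} nats, perplexity: {ppl.item():.1f}")
```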
But those are fairly limited. They don't benchmark everything. So let's say in the future LLMs are being used in robotics; then these benchmarks won't really matter. So it's really hard to guess whether the DisTrO method would degrade any of the future potential benchmarks. We really want this to be fundamental and not limit the network to a subset
of potential applications. We want this to be a general optimizer that can do everything. But what's interesting is what it tells us about what's actually happening in the learning, what we call the interpretability problem, or just knowing what's actually happening when the models are being trained.
It says that in the actual training procedure that's happening, perhaps it's not so much about the whole model learning, but that there are these key insights being learned that are actually almost narrower in scope. The real things that matter: there are only a few highly important signals within the learning that's actually going on. And that itself tells us something that we didn't know before.
We know that what needs to be communicated between the different nodes are just these few key pieces of information. And that is a necessary condition, or rather a sufficient condition, to get behavior equivalent to sharing everything.
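One classical way to act on that "few key signals" observation is gradient sparsification, for example top-k compression, in which each node transmits only the largest-magnitude entries of its update. The sketch below illustrates that general family of ideas; it is not the actual DisTrO compression scheme.

```python
# Generic top-k gradient sparsification, shown only to illustrate the "share a
# few key signals" idea discussed above; this is NOT the actual DisTrO scheme.
import torch

def top_k_sparsify(grad: torch.Tensor, k_fraction: float = 0.001):
    """Keep only the largest-magnitude entries of a gradient tensor.

    Returns (indices, values): the handful of numbers a node would actually
    need to transmit, instead of the full dense gradient.
    """
    flat = grad.flatten()
    k = max(1, int(flat.numel() * k_fraction))
    _, indices = torch.topk(flat.abs(), k)
    return indices, flat[indices]

grad = torch.randn(1_000_000)          # a dense 1M-element gradient
idx, vals = top_k_sparsify(grad)       # -> 1,000 indices + 1,000 values
print(f"transmitting {idx.numel():,} of {grad.numel():,} entries")
```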
And that is an insight that is worthwhile in helping us continue to understand what's actually happening when these models learn. Because, for as much research as we've done, it's still very much a
giant question mark, like what's actually happening inside these models. Yeah, so that's the double-edged sword of generative modeling, right? It's a fundamentally empirical space. Yes, it is. The DisTrO results that you guys published were so dramatically shocking that, of course, I would say the primary reaction from the community was one of disbelief. Yes, yes. And we expected that. What would you say are the biggest criticisms? If you had to role-play being a critic and a disbeliever in what Nous just pulled off, what would be your biggest objections today?
So one of the first could be your baseline was wrong. So that is really easy to claim. Training large language models is not easy. It's not really like you just do one thing and you get the best model, right? So you have all of these hyperparameters you have to tweak. And if we didn't do the reference right,
then it might be just by chance that DisTrO looks better. It could be that the baseline is actually worse than usual. Right. And then DisTrO is just better than a worse-than-usual baseline, so...
So it's inconclusive, right? Yeah. I think another piece would be someone would say it doesn't scale. Okay, it works at smaller sizes. If you scale it up to a trillion, do you get the same thing? And this is actually, I will say it is a valid criticism. There are lots of false starts that we have just in space where people come up with an idea. It works kind of at a small thing. But when you try to like make it big, it's more a byproduct of some weird dynamics at a small scale that isn't
true for all instances and all sizes and even all network types. Like AdamW, it just works for all networks. And transformers, they just work all the time. And so what you don't want is sort of a fragile situation, like Bowen said, where it kind of works for LLMs of these types, but doesn't work elsewhere. So that would perhaps be a fair criticism that people
would have. And just to ground this in numbers, what was the scale at which you guys ran the experiments? We've gone up through 7B now, like 7B models. So we still have lots of work to do. And it could be that it doesn't scale infinitely. But what we have seen empirically is that as we make it bigger, the differential between DisTrO and AdamW actually gets wider.
And so this is very encouraging, which is that if you saw it starting to narrow as the scale went up, you would fear that it would eventually equate and then maybe be worse at the end. But that isn't the case empirically with what we see so far. So we're quite hopeful that it will continue to scale up. And that's why just from a practical perspective, we want to find out now actually bring it to the community and say, OK, well, let's run the giant one together because we don't have the 10,000 H100s to just
do it all at once and prove and know for sure. I really think the important part of our research right now is to prove that DisTrO can reduce the amount of communication, and not really the loss differential, because the loss differential of DisTrO right now is like
It's better and it's unexplained. This is probably because of some side effect of this, right? That we haven't explored. It's probably not because of the compression that we're doing. So, or maybe it is, we don't know. But then the focus should really not be on the loss differential, but on the compression part. Because as we scale up the models more and more, you can compress more.
But the loss differential might change, right? It could be wider, it could be narrower. It doesn't matter that much in the end, as long as the models are trained as equivalently as possible to AdamW, the state-of-the-art optimizer right now.
And, you know, the critique about the baseline maybe not being replicable. It's been about a month and a half now, I think, since you put out DisTrO. Has the community been able to replicate the baseline? So what we actually did was we threw away everything we did and we went to an entirely new pre-training run. We started from scratch a second time.
So we had been using our own one; we had implemented it twice, once on our own and once inside of Hugging Face's Nanotron framework. And we threw everything away, and I said, okay, let's do it exactly all over again using OLMo from AI2, the Allen Institute for AI, which is very highly reproducible: they publish every data index,
they have every token, so you can 100% reproduce exactly what they did, down to everything. And we've re-implemented it now a third time in their framework, and we were able to reproduce their training run exactly, and then did it again with DisTrO and got the exact same results we got with Nanotron.
It's kind of one of these things where you have to want it to fail. You have to be willing to do the scary thing and see if it fails. But we now have done that for the third time. And in the DisTrO paper that we're going to be publishing next month, all the data is coming from the OLMo framework that we used. Oh, great. And we will have the code. And it's also already, the baseline is not trained by us. We took the baseline from the OLMo group.
So that argument can be settled, right? Because it's not something that we came up with. We assume they are training using the best hyperparameters, which they did, because they did a huge ablation on smaller models and trained a 1B, and then we just took
the same code and replaced only the optimizer with DisTrO. And then we saw the exact same curve. And then I was actually convinced, because I'm the biggest skeptic of DisTrO, actually. From the beginning, I just didn't believe that this was possible, but we saw more and more hints.
As I said, it was like bulletproof glass, right? We were hitting that glass. And then this was the real breakthrough. Then we were like, yeah, DisTrO is real. We should just announce it, right? And yeah, we should just finish the paper and hope for the best.
AI is just moving so fast. There's so much uncertainty about who wins and who loses, et cetera, that I've seen a natural tendency for people to start sharing less openly with the community when they have a breakthrough or results. Things go closed source, people stop sharing. But ironically, I think for the kind of effort that DisTrO is, being as public as you guys have been, with the open source release, the paper, publishing the first few ablation results, which then enables somebody like the OLMo team, the Allen Institute team, to adversarially red-team you guys and try to prove you wrong...
It makes DisTrO's accomplishments that much more impressive, which then I think is kind of the dream, right? It's kind of what happened with synth data. For so long, there was so much disbelief. But the minute people started reproducing the Hermes results, suddenly the entire space switched over
to the synth data approach you guys were using. And so my hope is actually that some of the people listening to this podcast will go try to replicate it, try to prove you guys wrong, and come out on the other side. We encourage it. And we want people to try to use DisTrO, because it is so hard for us to think about everything. We're such a small group. We only have so many ideas. And with this DisTrO thing, I really hope that people start seeing that we've really pulled the key piece out of the Jenga tower. And now we have to start over and think about
this new way of training things, which is much cheaper and much more efficient. Because you don't want to have huge, power-hungry data centers in rich countries while everyone else doesn't have anything, right? And this would allow everyone to participate in training. Right. It's a different mindset, really. Yeah. If DisTrO works.
A knee-jerk response would be, "Oh, wow, that's terrible for NVIDIA." Because one of the largest drivers of value in NVIDIA's enterprise value and market cap and revenue has been these massive contracts for co-located data centers. People buying 10,000, 20,000, soon, you mentioned Elon buying 100,000 H100s co-located in a single place.
Is that true, or do you think there's more to the story, that actually the second and third order effects of something like DisTrO working are actually net positive for somebody like Nvidia and the ecosystem? It's not immediately that bad for Nvidia, because there are still years of work to make DisTrO actually scale to those types of really, really large training runs.
I think there are a lot of things that Nvidia still has, like the CUDA stack and all the GPU hardware, right? Those non-interconnect pieces can also work with DisTrO, right? Right. So it's hard to tell whether this would really affect Nvidia or not. I think the bigger effects will be maybe at the social scale or the business scale, of what it means to not have a single entity have to do it,
even if everyone's still using NVIDIA's chips to do it. It's that second middle layer that may see a more pronounced effect because ultimately, you still have to do back prop. You still have to load memory in. What might happen sooner would be a redesign of the types of chips that NVIDIA or someone would make. Okay, under this model, we can dedicate more VRAM versus... There's this question of how much VRAM versus how much processing power is on a die. And that could change sort of the...
dynamics of what that optimal looks like. And I think that is probably more like, it's a new meta for them to then have a product space for as well. As I hear you talk about the implications of DisTrO working, there are so many analogies that could be drawn to these earlier distributed at-home computing efforts. If you remember, if you guys ever did Folding@home or... SETI. SETI, yeah. SETI@home. Was that ever part of the inspiration for what you guys are doing?
Yeah, it certainly was. But actually, what's interesting is that we didn't know if that would activate with a lot of people. It was something we were excited about, but internally we've been our own critics on this. Like, does anyone really care about training a model at their home, something like this? And it may not have the biggest impact right now if you were to go talk to your grandma
about like what we're doing, but there is certainly a large community of people who, who feel that and want to be a part of working and contributing to AI and that team effort of like the whole world. And it's, and it's certainly something we were excited about internally, but we didn't know if it would actually like hit or stick.
And the second we came out with it, it's like everyone being like, oh, immediately: SETI@home. I want to do that. I want to be a part of that. And there's something aspirational about it, about reaching for the stars, being part of this, you as your one person being part of this giant big thing, that we hoped would be there, but we didn't know until we sort of put it out there. Right. The current DisTrO experiments still use H100s. Market prices are coming down a little bit, but still in that $30,000 to $40,000 per card range.
And so it's expensive and hard to get. Let's say we can mitigate all the need for all the specialized and highly tuned bandwidth infrastructure, the interconnects in particular. Is access to high-end GPUs, you think, still going to be a requirement for training in a distributed fashion?
Like H100s? Well, we've been, right now, I think people don't actually realize that like a 4090 and like an H100 are in a lot of ways the same card. Can you just explain, you know, for the non-gamers in the room, explain the 4090. The chip that's inside of them is almost identical. The chip, the actual compute chip is actually almost identical to an H100.
And what you're actually buying is the memory around it. That's actually the expensive piece, the HBM3 memory that they put around the die. And so there may be an enterprise markup that you're getting in there, that NVIDIA is charging for the H100s knowing that you need to use them for training, versus someone who just wants to use a card for gaming. And hopefully that dynamic doesn't shift now that we've introduced DisTrO. But because you're able to distribute it so widely,
I think the gaming GPU angle is really going to be like the sweet spot. As long as there's continued to be sort of like higher end gaming GPUs and those are on comparison with the high end training GPUs, even if they're half as fast or a third as fast,
we can marshal a lot of those together and sort of make up for it in scale. So it's actually like, okay, if we are just using the gamer GPUs, I think for the short term. So we certainly do not want the minimum viable entity to be someone with an H100.
That cuts nearly everyone off. And also what Apple's doing with their MLX platform and their Apple Silicon, like that's amazing. And if we could activate just all that latent compute towards training, that is a lot to it too. And so we're making sure that also like the code we're writing to help to actually do this training is agnostic to the hardware, is able to communicate and operate. You can have an Apple device and an NVIDIA device training together. And this is actually for just
practical reasons when training happens right now, it sort of assumes that all the GPUs are the same, they're all in the same sort of organization. So we're actually just writing this like fault tolerant training code where one of the GPUs can go down and that's okay and it'll continue on training even if some of them like fall off and are different. And you just sort of didn't do that before if you were just one entity who owned everything and controlled everything. Yeah, there was no use for it, right? It's like my point of this is that
as you have this technology, DisTrO, being unveiled, it allows people to think about the possibilities of new methods of training, right? Which is different from the status quo. And those chips right now, the H100s are suitable for training and the 4090s are not. But as people
become aware of better architectures, better training code, and different ways of training LLMs, or even different architectures of LLMs, they can start fitting those models into smaller VRAM, or gamer GPUs, right? Or maybe there's also pressure from demand. So if Nvidia sees a bigger demand for less-interconnected GPUs, such as gaming GPUs, they could potentially sell more of those, produce more of them. So it really depends; it's really a balancing act, I think.
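A toy sketch of the fault-tolerance idea Jeff described a moment ago: rather than assuming every peer answers on every round, the aggregation step averages whatever updates actually arrive and carries on. The `peers` list and `fetch_update` function are hypothetical placeholders, not Nous's training code.

```python
# Toy sketch of fault-tolerant aggregation: average whichever peer updates
# actually arrive, and keep training even if some peers drop off this round.
# `peers` and `fetch_update` are hypothetical placeholders, not Nous's code.
import torch

def gather_surviving_updates(peers, fetch_update, timeout_s: float = 30.0):
    updates = []
    for peer in peers:
        try:
            updates.append(fetch_update(peer, timeout=timeout_s))
        except TimeoutError:
            # Peer went offline this round; skip it rather than stalling everyone.
            continue
    if not updates:
        return None                      # nothing arrived; fall back to the local update
    return torch.stack(updates).mean(dim=0)
```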
And as we push DisTrO more and more, people will start to develop an ecosystem of training around it. Otherwise, right now, we have to use H100s; but if you don't try something new, you would always stay on H100s. This was your first release. Can you just explain what comes next? What are your biggest priorities? What do you guys think you need to do next to actually continue building more momentum? So we will release the paper and then the source code of DisTrO.
So this would allow people to just start iterating on it immediately. That's for October. So the ICLR conference, we would want to also try to publish. Yeah, and then the question after that is actually building something that can use this together. So we're starting to work on code to be like, what will it actually look like
to practically use DisTrO, so that everyone can come together and train models. That's still very much in the researchy phase. But if we sort of rewind back to what I said before, which was that we looked at what we had, which is our community
of people who are interested in this, a very large community. That was what we had. What we didn't have was the huge stack of H100s. So now that we've actually gone down the path and solved it, we want to bring it back and say, OK, here's actual full-stack tooling that you can use to actually do the thing. Because what we're going to release first is going to be very academic.
It's going to be the paper with the proofs for the ablations and sort of the reference PyTorch source code. That's like, here's how it actually works here, all this stuff. But there's still a long way to go from an academic paper and reference PyTorch code to...
everyone in the world training an AI model together, right? The next phase is building that second piece of it. I see. And can you talk a little bit about the parts of productizing the research, so to speak, and turning it into an actual optimization library that can be dropped in and used really easily in a training run or a training pipeline?
Is that the shape of the product? - I wouldn't think that it's going to be a product that you buy off the shelf and drop into an existing thing. I think it's really more about what it means to have everyone working together on a model. How do you reward them? How do you give them actual access to it? And then what does community ownership of that even mean? So this is still very much in the ideation phase of productizing it. We're lucky to be in a position where we feel comfortable
releasing the sort of secret sauce before we maybe have all of the value capture accumulated to us, because ultimately it's science, it's math, it's math equations. And it's like the line to Grand Moff Tarkin: the more star systems you try to hold in your hand, the more will slip through your fingers. So we have a good idea about how we can, you know, have a tool chain that can execute and help everyone
train together and work on it. And we're going to release that after we release the source code to everyone. And then it will just be a process of seeing what works and what doesn't with the community, because again, another stupid line, but everyone's got a plan until it gets out there and is real and we see what people actually want to build with this.
So we've also been really open source with everything because the open source community gives back a lot. So everything is built upon open source. So we are giving back to the open source community. And this will benefit us in the long term because LLMs will be trained better and AI would advance quicker. And one thing I think this will allow in the short term is more experimentation for experimental architectures. If we have a substrate upon which
Like it took us having to go get our own H100s to actually do this. We actually went out and bought like 64 of our own H100s just to do the DisTrO testing. But the next group of people who have an idea about a new architecture or a new novel idea, now they can use this DisTrO network to actually try that out
and have the ability for experiments because if you're inside of a large organization, you have sometimes maybe a fear of trying something new because we have to get something out next quarter. But now sort of creating an environment where, like I said before, there's so much area for innovation. You shake a stick and you'll make an innovation in the AI space. Having a place where other people can actually
get access to the compute to do these studies. If a bunch of people say, hey, I want to try this weird Transfusion model (a paper from Meta we're really excited about), or a model using BitNet, something with all these weird things, then people can say, yeah, I'll click this button, here's my GPU, go ahead and use it. Now you've solved the coordination problem, and we can try out all these new things and see the results. So it sounds like, if you look at what we called sort of the first, middle, and last mile of frontier model production,
right? The first mile being pre-training, the middle mile being all of the stuff that happens in the post-training and alignment phase, and the last mile being inference optimization, actually hosting and so on. It sounds like this paper was a breakthrough in the first mile: you demonstrated that you could dramatically reduce the bandwidth required in the pre-training step. One of the things we talked about when you guys put out DisTrO was the impact this is going to have, if it works, on regulation.
And I know the grand challenge for DisTrO was inspired by the question of, hey, what happens if we never get an open source Llama 4? In a not-so-distant future where Meta decides the regulatory risk is just too high, the argument would be, well, that's okay.
Thank you, Meta, for everything you've done so far; now it's time for the community to step up and run a system like DisTrO to allow massively decentralized training, right? Let's say, based on the results you guys are seeing, Llama 4 is not open sourced. How far are we from a model that's at least as good as Llama 3 405B being trained by the community? I think 7B would be possible immediately. A 7B model is not too far-fetched with the first iteration of the code we put out.
It could be possible to train a 7B to like 4 trillion tokens with maybe 1,000 H100s, just rented all over RunPod, right? Somebody could do that. For 405B, what do you think? I think that would still be, you know, a next-year sort of thing that we would have to get to. There are some
scaling problems, or not scaling problems exactly, but technical things about how you shard the model. Because at that point you hit the problem where you have to put the model on more than one GPU, and when you get to that threshold, there's a communication requirement between those GPUs just to shard it. We're looking at how the activations work, how you could shard those pieces to make it work. So there's still technical work to be done to scale it up to those super large model sizes, but
it's not intractable by any means. It's truly just an engineering question that we need to tackle. And by bringing it out to everyone, I'm highly confident, because there's a ton of amazing, smart people who work at all these closed labs, right? And they can contribute how they want to contribute, you know, not anonymously, but at night: here's my idea, I love what you guys are doing, here's a PR that
implements tensor parallelism in this. There's a lot that we hope to activate on that side. But in the open source space we're always playing catch-up, a year, a year and a half behind the closed providers. If I had to ballpark it, it would be the end of next year before we'd maybe be at that scale.
And, you know, I spend far too much time on LocalLLaMA, but let's pretend you're talking to the LocalLLaMA crowd, who aren't there 24/7 like I am but are broadly interested, as developers, in making sure that open source keeps marching forward. What are the problems they could help with to accelerate the progress you're talking about?
I think right now there are going to be a lot of these open engineering questions when we release it, and we're going to need help. For example, NVIDIA has this library called NCCL (pronounced "nickel"), which is used internally to orchestrate even a centralized data center run. They create these rings and trees to efficiently move the data around the data center, and they also use things like GPUDirect to copy
the information in the gradients out of the GPU and stream it to the next GPU. We're going to have to throw all that away. We can't use it because it was designed for centralized infrastructure. So it may be that DisTrO is two or three times slower than doing it traditionally, but that's not inherent; it's just because we had to throw out a lot of the crutches, not crutches, all the things that NVIDIA has built up in their tech stack. And we're going to need to replicate that on our side.
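For a sense of what that conventional NCCL path looks like, here's a minimal sketch under assumed setup (a standard torch.distributed data-parallel step, not the DisTrO code): every rank averages its full gradients over the interconnect on every single step, which is exactly the part that demands datacenter-grade networking.

```python
# Minimal sketch (assumed setup, not the DisTrO code): a conventional
# data-parallel training step that leans on NCCL for a full gradient
# all-reduce every step.
import torch.distributed as dist

def conventional_sync_step(model, optimizer, loss_fn, batch):
    loss = loss_fn(model(batch["x"]), batch["y"])
    loss.backward()
    world = dist.get_world_size()
    for p in model.parameters():
        if p.grad is None:
            continue
        # Every gradient tensor crosses the network, every step, over NCCL's
        # ring/tree collectives (process group initialized elsewhere with
        # dist.init_process_group(backend="nccl")).
        dist.all_reduce(p.grad, op=dist.ReduceOp.SUM)
        p.grad /= world
    optimizer.step()
    optimizer.zero_grad()
```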
That's really just engineering work, but engineering work is still work, and it'll need to be done. Yeah. Why don't we take a few minutes to talk about how it actually works and what the weirdest insights were that you guys discovered from an engineering and technical perspective during the project?
So basically, think about the usual training with AdamW: you have all of these GPUs that each have a copy of the model; let's take the simplest scenario. They all have a copy of the model, and when you train them on different data, you want to synchronize them at the end. So you give a different book to every GPU, say, and then you train one step. Now all the weights are different; everyone has trained differently because they've seen a different book.
And now you want to synchronize all of these GPUs into a single model, so you have to copy the whole model from one to the other. That's why it's so slow and needs those high-speed interconnects. With DisTrO, what we found is that we can actually let each of the GPUs train by themselves.
They don't need to be synchronized, so to speak. You don't need to copy the model over and over so that everyone stays in the same state. You can just let each model train on its own books, just let it train, and then
every step, instead of synchronizing, you just transmit what you've learned, the most important thing you learned, to everyone else. So it's kind of like you have this cloud of points where everyone trains and everyone goes in a different direction, and then you have this thing that tries to pull them together, right? Tries to get them to a single point. But it will never get there, because your bandwidth is really limited. It's like one megabyte.
The model is like two gigabytes, and you're transmitting one megabyte every time. But with that one megabyte, you can try to keep them as close as possible in this cloud. And as you train more and more, you start to see that every model actually trains, in some sense, the same way. Everyone trains
as if they were at the same point, and you could take any one of those models and it would have similar performance. So it's kind of like the whole cloud is moving together through space. That's the intuition of how DisTrO works.
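As a rough illustration of that intuition, and only that — this is a hedged sketch, not the actual DisTrO algorithm, and the top-k budget and helper names are invented — each replica takes its own optimizer step on its own data, then "phones home" with only the largest-magnitude pieces of what it just learned:

```python
# Hedged illustration of the intuition described above, NOT the DisTrO
# algorithm itself. Each replica trains locally, then shares only a tiny
# top-k slice of its update instead of full gradients or weights.
import torch
import torch.distributed as dist

K = 250_000  # ~1 MB of float32 values shared per step (illustrative budget)

def flat_params(model):
    return torch.cat([p.detach().flatten() for p in model.parameters()])

def distro_like_step(model, optimizer, loss_fn, batch):
    world, me = dist.get_world_size(), dist.get_rank()

    # 1. Train locally on this replica's own data ("its own book").
    before = flat_params(model)
    loss_fn(model(batch["x"]), batch["y"]).backward()
    optimizer.step()
    optimizer.zero_grad()
    delta = flat_params(model) - before        # what this replica just learned

    # 2. Keep only the most important components of that local update.
    idx = torch.topk(delta.abs(), K).indices
    vals = delta[idx]

    # 3. "Phone home": exchange just these small (index, value) pairs.
    all_idx = [torch.empty_like(idx) for _ in range(world)]
    all_vals = [torch.empty_like(vals) for _ in range(world)]
    dist.all_gather(all_idx, idx)
    dist.all_gather(all_vals, vals)

    # 4. Nudge the local weights toward everyone else's best insights, keeping
    #    the replicas loosely bounded without ever fully synchronizing.
    correction = torch.zeros_like(delta)
    for rank, (i, v) in enumerate(zip(all_idx, all_vals)):
        if rank == me:
            continue  # our own update is already baked into the weights
        correction.index_add_(0, i, v / world)
    offset = 0
    with torch.no_grad():
        for p in model.parameters():
            n = p.numel()
            p.add_(correction[offset:offset + n].view_as(p))
            offset += n
```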
But that wasn't how we originally thought it worked; this came after extensive testing. We looked at the weights, at the actual distance between every GPU's copy, and we saw that at some point in training they become bounded. They stop diverging. That's why we think that, as you train more and more, the compression can get higher and higher, because they diverge less. And this concept of there being the one model actually comes from an engineering detail: when people started writing code for neural network training, they just had the one computer you were running it on, and you were just training the one model.
There's just the one model, the one set of weights. And then they said, well, we want to train faster, we want to train bigger, but the way this was done was to abstract that detail away, so the fact that you were on multiple GPUs was hidden. Even now, it's amazing: you write your PyTorch training code, you write this one line that says, do this, and
it's actually happening on 40,000 different computers all at once. But you as the developer are writing code as if you're training this one little model. So that abstraction, that there's just the one model being trained, has been maintained, right? And the implementation of maintaining it is that as each of these different GPUs trains on different data, you do this all-reduce operation where it's basically: you go learn, you go learn, and then literally we're going to take everything everyone learned and we're going to
average it all together and start back from the same point. So everyone goes off in their own direction, they all come back home to mom, they all merge back into one node, and then the next step we go out and do it again. With DisTrO, what we found is that rather than bringing everyone back home and averaging it all together, what you want to do is give each of those little nodes that are searching for the lowest point in the loss landscape the freedom to move around.
They aren't actually coming home and all synchronizing; they each have the freedom to move around. What you don't want is for them to just go off on a tangent. But that diversity of search space, where we're breaking the paradigm of there being one model being trained: there are actually n models being trained, each getting to do its own little exploration, but within a bounded space,
so that they're all looking around, and instead of all coming home, they all phone home. They just say, "Here are the best insights from what I've learned," versus "Let's all merge back together into one." Those are the dynamics at play. But it was interesting, because this was actually a post-hoc realization.
This was not the thesis we went in with. But when we looked at it and said, OK, it's better, we're getting better loss, we had to know why. So we went through the instrumentation, saw what was actually happening, and found this bounded behavior, where the diversity of exploration is actually the contributing factor to why it works.
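A minimal sketch of that kind of instrumentation (the helper below is ours, not theirs): gather every replica's flattened weights and check that the pairwise distances plateau, i.e. stay bounded, instead of growing without limit.

```python
# Sketch of divergence instrumentation, assumed helper for illustration:
# gather all replicas' weights and report the worst pairwise L2 distance.
import itertools
import torch
import torch.distributed as dist

def max_replica_distance(model):
    flat = torch.cat([p.detach().flatten() for p in model.parameters()])
    gathered = [torch.empty_like(flat) for _ in range(dist.get_world_size())]
    dist.all_gather(gathered, flat)
    # Largest pairwise distance between any two replicas at this step.
    return max(
        (torch.linalg.vector_norm(a - b).item()
         for a, b in itertools.combinations(gathered, 2)),
        default=0.0,
    )
```

Logged every so many steps, this number should flatten out rather than keep climbing as the replicas explore.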
We're actually taking advantage of having multiple GPUs instead of assuming that everyone should be doing the same thing. Now we're saying every GPU has a copy of the weights, so let's explore differently, let's try other things that might be possible. That's why DisTrO works so well, I think; it's one of the contributing factors. It's very counterintuitive. Could you just explain
why it's so counterintuitive, for folks who might not be as familiar with training models? Why is it such a surprise that you can actually have bounded search? Normally, if you did this, each one of them would go off in its own direction, and that's not what you want, because as the models become different they lose the ability to communicate what they've learned to each other. It's like they're going off into their own countries and starting to develop their own languages, and if they get too far apart, they can't phone home and talk to each other. So the simplest, easiest thing is to just
not let that happen at all: bring everyone together and everyone's just one. But the diversity of being able to go off and do a year in London, each one of them getting to go off and do its own thing and then come back and say, here's what I've learned, somehow that's better than the one monolithic thing. And the one monolithic thing was really just a technical detail.
It came from the fact that we had PyTorch, or Keras, or any of the other frameworks, saying, well, if you want, you can train on multiple GPUs and you don't even have to change your code. So you needed to maintain this paradigm of there being the one model. That's what I was talking about earlier: there are all these things we've just always done, and now we're at a scale where we can go back and rethink whether they're the right thing. As you guys were describing the post-hoc realization that having them all phone home
is much, much more efficient than having them all come home. To complete the analogy: as you're making improvements to the system, how are you approaching who phones home? Who is the orchestration node, the reward function, the objective function? Who is the conductor of the orchestra? Simply put, when you use the all-reduce operation, everyone just communicates with everyone else.
Here, the operation is just smaller. Everyone still communicates with everyone else, and you try to come to an agreement about what home is. So you don't actually have a home; everyone's off in their own country, but as they phone each other with this tiny amount of data, they come to understand where home is.
And everyone can stay together in some sense. But this might not even be the optimal configuration. We're thinking about an asynchronous version where they're not even phoning each other every step. And what does it mean if some phone home to others and others don't? This is what we're excited about, because those are experiments we can't do all by ourselves.
So we're going to bring it out. We were actually very conservative in this first one: everyone talks to everyone. But that might not even be the optimal configuration, and there will be latency problems if you try to do this all over the world, right? So would it be possible to have an interconnected hub on
every continent, where nodes communicate more frequently within the hub and there's less communication between the more distant regions, say from America to Europe? You'd have an American side and a European side, each communicating more internally, and the nodes in between communicate some summation of it. These are all different configurations that open up when you free yourself from the paradigm of one monolithic organization doing one task; you find a whole new sphere of optimization.
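Purely as a hypothetical sketch of that hub idea (none of these groups, sizes, or intervals come from the paper), hierarchical process groups could express it: chatty regional hubs that agree often, plus an occasional cross-region exchange of the regional summaries.

```python
# Hypothetical topology sketch, not from the paper: regional hubs that
# synchronize a summary frequently, with rarer cross-region averaging.
import torch.distributed as dist

EUROPE, AMERICA = [0, 1, 2, 3], [4, 5, 6, 7]

def build_region_group():
    # Every rank must create both groups, even the one it doesn't belong to.
    eu, us = dist.new_group(EUROPE), dist.new_group(AMERICA)
    return (eu, EUROPE) if dist.get_rank() in EUROPE else (us, AMERICA)

def hub_step(step, summary, region_group, region_ranks, cross_every=10):
    # Frequent, cheap agreement inside the regional hub...
    dist.all_reduce(summary, op=dist.ReduceOp.SUM, group=region_group)
    summary /= len(region_ranks)
    # ...and only occasionally average across regions. Summing the per-rank
    # regional averages over all ranks and dividing by world size gives the
    # mean of the regional means.
    if step % cross_every == 0:
        dist.all_reduce(summary, op=dist.ReduceOp.SUM)
        summary /= dist.get_world_size()
```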
Right. And between the two ends of the spectrum, command and control
and complete anarchy, how are you approaching the ideal system design for the next iteration, such that there's sufficient coordination, let's call it coherence? Well, obviously it's bounded by the realities of the physical world, and I think that will be the thing that drives it: you have asymmetric network connections from people at home. Home connections, for example, quite often have a lot of down bandwidth but very little up bandwidth.
If we want people to be doing this at home, that is a realistic consideration we need to think about. So it'll be driven by the actual facts on the ground, the network topology we inherit from the world, and then we'll have to optimize toward that topology being what it is. Right.
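A quick back-of-the-envelope, using the assumed numbers from this conversation (roughly a 10 Mbps home uplink, a ~1 MB per-step share, and a ~2 GB model), shows why that asymmetry is the binding constraint:

```python
# Back-of-the-envelope with assumed numbers from the conversation.
UPLINK_MBPS = 10            # typical asymmetric home connection, upload side
MODEL_GB = 2.0              # full model weights
SHARE_MB = 1.0              # compressed per-step "insight"

full_sync_s = MODEL_GB * 8 * 1000 / UPLINK_MBPS   # ~1600 s (~27 min) per step
share_s = SHARE_MB * 8 / UPLINK_MBPS              # ~0.8 s per step
print(f"full-model sync: {full_sync_s:.0f}s/step, compressed share: {share_s:.1f}s/step")
```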
We're also not staying mostly on the centralized side; we're moving toward the anarchy side. A lot of optimization algorithms start with everything being asynchronous, but that's really hard to get right, because you sacrifice a lot: you sacrifice speed, you sacrifice convergence,
the actual efficiency of the algorithm. So by starting on the other side, we first break through the key blocker, which is the communication, and then we can look at every other possibility. For example, as you said, having different bubbles training
independently and then occasionally trying to merge them back together. That could be an interesting thought. Yeah, one observation of how a bunch of decentralized protocols have developed over the last few years is that you often have a protocol design where the North Star, of course, is fully decentralized, right? No single node is able to influence the outcome of consensus any more than the others. In practical terms, what ends up happening is that you have a few validators who are large contributors of validation,
one or two organizations or three, call it five, only around five to ten organizations, right? They run clusters of validator nodes, and when they vote together, they can steer the direction of the protocol. If you chart the data centers in the world by the number of co-located chips they have, there's only a handful with, call it, 20K and above, but the tail is quite fat. Exactly. And so before we get to the longest part of that tail, which is individuals,
there's this fat middle of people who run data centers that might have, you know, 2K H100s. Do you expect that where we go next, a year from today, is a model sharded across, call it, a hundred 2K-H100 clusters? Is that the most likely state we'll be in?
I think what you'll see immediately is the ability for even centralized actors who have multiple data centers to use them in a more efficient way. Just have n equals two, where each one is a whole data center acting as a node on the network. That alone is an unlock, because a lot of these data centers have maybe 100-gigabit interconnect
between their own data centers, right? Just the fact that you could now treat the two as one is, from a practical standpoint, probably the biggest unlock at the start. That's the thing about the method: it can be decentralized, but it doesn't have to be. Centralized actors can still reap the benefit, because they don't have to buy InfiniBand anymore.
They can just use 100-gigabit or 10-gigabit Ethernet when they're laying out their data centers. So I think there might be a lot of simple effects that happen quickly, and then
the dream of the fully decentralized one may progress more slowly but will ultimately eclipse it. I also think we should have multiple networks: one that can take advantage of the H100 clusters and a smaller one that takes advantage of the end of the tail. You have different rectangles; you could fit them into three rectangles instead of one square somewhere in the middle. Let's keep going on technical details that the community is eager to hear about from you guys.
What are some of the other weird discoveries, decisions, and trade-offs you guys made? That you still need backprop. Backprop is still king. Yeah, we actually started this by researching something called zeroth-order optimization, which is where you don't do backprop,
where you train a model only through forward passes. Because we wanted to target a 4090: I have a 4090, so what would it mean for me to be able to do this? Okay, we can't train these huge models, but what if we could just do it with forward passes? So our very first iteration was a zeroth-order optimization mechanism, and that's a very interesting field.
But what we discovered is that backprop is still king. You really do still need backpropagation to find the optimal point of the loss. Zeroth order worked, but you needed something like a hundred or a thousand times more passes. Do you remember what it was? Yeah, it was about 1,000, and that was for fine-tuning.
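For listeners unfamiliar with zeroth-order methods, here is a hedged, generic SPSA-style sketch of the idea (not the team's exact mechanism): estimate the gradient from two forward passes along a random direction, with no backprop at all.

```python
# Generic zeroth-order (SPSA-style) sketch for illustration only.
import torch

def zeroth_order_step(model, loss_fn, batch, lr=1e-4, eps=1e-3):
    params = list(model.parameters())
    z = [torch.randn_like(p) for p in params]   # random probe direction

    def perturb(scale):
        for p, zi in zip(params, z):
            p.add_(scale * eps * zi)

    with torch.no_grad():
        perturb(+1)
        loss_plus = loss_fn(model(batch["x"]), batch["y"])
        perturb(-2)
        loss_minus = loss_fn(model(batch["x"]), batch["y"])
        perturb(+1)                              # restore the original weights

        # One probe gives a very noisy directional estimate, which is why you
        # end up needing on the order of 100-1000x more forward passes.
        g = (loss_plus - loss_minus) / (2 * eps)
        for p, zi in zip(params, z):
            p.add_(-lr * g * zi)
```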
The problem with that is that right now, on unspecialized, general hardware like an NVIDIA H100, the inference time and the backprop time are almost the same. Backprop is slower, but it's like 2.8 times slower, not 1,000 times slower. So this inference-only zeroth-order optimizer would only pay off if you had specific hardware for it, hardware that could do inference 10,000 times faster.
Then it would actually be useful: you could train neural networks just with forward passes. You estimate the gradients, and it's a really rough estimate, so you do a lot of forward passes, and then you can train those networks. What's interesting about zeroth-order optimization is that, in some sense, in the future you might not need general hardware. You might only need
specialized hardware for inference. - And then actually train on that as well. - Yeah, actually train the network. So you would have your cell phone that does inference and then just trains once in a while, right? Not all the time. - Yeah, that is pretty wild, because you're basically saying you can unleash the market of ASICs
on training, right? And ASICs don't take 18 months to tape out like GPUs, because they don't have to be as general-purpose, so they're much faster to tape out, you can change their architecture much faster, and they're much cheaper to make. What are the caveats on that statement? Is it basically that, as long as the transformer mechanism turns out to be the most efficient workload, the future, if DisTrO continues down this path, is that most training runs actually run on ASICs, not GPUs? That could be possible. I wouldn't assume it right away.
But it's possible that in the future you would have neural networks training just using inference forward passes, and we've seen chip makers, people who are making inference-only hardware. I also think there's a lot here: the core operations are still floating-point matrix multiplications, but there are things like BitNet, which hint at being able to avoid floating-point multiplication,
and the amazing thing about this method called BitNet is that because all of the weights are either minus one, zero, or one, the multiplication disappears and it becomes just addition, because all of the multiplier coefficients are just ones. It's addition, and that's a lot faster. So there are possibilities where, if you used that and built an ASIC around it, you could make an ASIC that does inference enough times faster to actually get you that kind of multiple.
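A toy example of why ternary weights kill the multiplies (the shapes and the sanity check are invented for illustration): a matrix-vector product with weights in {-1, 0, +1} reduces to adds and subtracts of activations.

```python
# Toy illustration of the BitNet-style trick: ternary weights mean no multiplies.
import torch

def ternary_matvec(w_ternary: torch.Tensor, x: torch.Tensor) -> torch.Tensor:
    # w_ternary: (out, in) with entries -1, 0, or +1; x: (in,)
    out = torch.zeros(w_ternary.shape[0])
    for i, row in enumerate(w_ternary):
        # Add activations where the weight is +1, subtract where it's -1,
        # skip where it's 0 -- no multiplications needed.
        out[i] = x[row == 1].sum() - x[row == -1].sum()
    return out

# Sanity check against an ordinary floating-point matmul:
w = torch.randint(-1, 2, (4, 8)).float()
x = torch.randn(8)
assert torch.allclose(ternary_matvec(w, x), w @ x, atol=1e-5)
```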
And then that same chip could be used to do inference on the device and to do the training just as fast. So there's a possible world there, but from a practical standpoint we couldn't say we're going to come up with a method that's only perfect if the whole world makes exactly the chips we tell them to make. We had to work with the pieces we have. But what's nice is that in the future, if we're able to build up this network that actually has the participants,
you can just shift it over to doing that. It's really that community activation that's the key piece. I would also add that training on inference hardware could unlock more capabilities, because it doesn't need to be faster than backprop. It just needs to be fast enough that it's worthwhile to train on it. For example, your smartphone, you're using it all day, right? Right.
So while you're using it, while you're doing inference, you could save a tiny bit of additional information and send it back, and that could be used to train. If this inference path is really, really fast, then as more and more people use it, you could start to see some companies or entities use that additional data to train. So training would be like a byproduct.
Yeah, a byproduct of inference, exactly. Essentially, inference eats the world, right? That's sort of the future. Yeah, because usually people train once and then they deploy the model and everyone uses it. Like ChatGPT: it's not training all the time. They could be training another model all the time, but the model you're using isn't training. With inference-time training, it could be possible to train the model at the same time you're using it.
Well, I can't wait for the release. Any final things you want to say to folks who are listening? Just come along for the ride. We're trying to make AI that is representative of the whole world and that is personal to you. So if that excites you... Let's make SETI@home, but for AI. Let's make SETI@home but for AI, yeah.
Thanks guys. Thank you. Thank you. Thank you. Congratulations on making it to the end. I wish I had a great Easter egg for you, but let's be real. The best reward is knowing that you're now a little bit smarter about where the AI field might be headed. And if you're feeling gracious, please do rate the podcast on your platform of choice and share it far and wide. Thanks again for listening.