Post-transformer architectures address the quadratic scaling problem of traditional transformers, offering more efficient compute and memory usage. RWKV, for instance, combines the training efficiency of transformers with the inference efficiency of RNNs, making it accessible for low-compute environments like Raspberry Pis.
RWKV is an open-source, community-driven project that builds on RNNs and linear attention, focusing on efficiency and accessibility. State space models, on the other hand, leverage principles from signal processing and dynamical systems to improve sequence modeling quality and efficiency.
RWKV uses a combination of time mix and channel mix blocks. Time mix handles long-term memory states, while channel mix focuses on short-term attention, looking at adjacent tokens. This allows RWKV to process sequences efficiently without the quadratic scaling of traditional attention mechanisms.
QRWKV6 is a 32B model created by converting a Qwen 32B model: its QKV attention layers are replaced with RWKV linear attention layers. This conversion lets RWKV reuse pre-trained transformer weights, reaching performance on par with the original model after just a few hours of training on two nodes.
Hardware and kernel support ensure that new architectures are not only theoretically efficient but also achieve practical wall-clock speed improvements. Without optimized kernels, even efficient models can be slower in practice, making them uncompetitive in real-world applications.
The 'Just Read Twice' approach involves repeating the input document multiple times before querying the model. This method leverages the efficiency of recurrent models, allowing them to better recall information from long documents by processing the same content multiple times, which improves recall-intensive tasks.
Long-context models can be applied to time series data, such as weather forecasting, where the model needs to process and predict based on an extended sequence of data without the need to recall specific past events. This makes them suitable for continuous monitoring tasks where future predictions are more critical than detailed historical recall.
Jamba is a hybrid model that combines Transformer and Mamba layers, using a mixture-of-experts (MoE) approach to increase capacity while maintaining efficiency. It achieves state-of-the-art performance on long-context tasks and can handle up to 256K tokens, making it one of the leading models in non-transformer architectures.
While long-context models are impressive, most enterprise workloads operate within shorter context lengths (e.g., 32k tokens). Efficient models like RWKV and Mamba can handle these tasks with less compute, making them more practical for most use cases. The need for truly long contexts (e.g., millions of tokens) is still limited to niche applications.
ThunderKittens is a CUDA library designed to simplify the development of efficient models by providing optimized matrix operations tailored to modern GPUs like the H100. It allows researchers to focus on model design rather than low-level CUDA optimizations, speeding up the development of new architectures.
We're back at Latent Space Live, our first mini-conference held at NeurIPS 2024 in Vancouver. This is Charlie, your AI co-host. As a special treat this week, we're recapping the best of 2024 going domain by domain. We sent out a survey to the over 900 of you who told us what you wanted, and then invited the best speakers in the Latent Space Network to cover each field.
200 of you joined us in person throughout the day, with over 2,200 watching live online. Our next keynote covers the state of post-transformer architectures, a special joint presentation by Dan Fu of Together AI and Eugene Cheah of Recursal AI and Featherless AI.
We've featured both Together and Recursal on the pod before, with CEO Vipul Ved Prakash and CTO Ce Zhang joining us to talk about how they are building Together as a quote-unquote full-stack AI startup, from the lowest-level kernel and systems programming to the highest-level mathematical abstractions, driving new model architectures and inference algorithms.
with notable industry contributions from RedPajama V2, FlashAttention 3, Mamba 2, Mixture of Agents, BASED, Sequoia, Evo, Dragonfly, Dan Fu's ThunderKittens, and many more research projects this year.
As for Recursal and Featherless, we were the first podcast to feature RWKV last year, and this year the team has shipped RWKV v5, codenamed Eagle, to 1.5 billion Windows 10 and Windows 11 machines worldwide to support Microsoft's on-device, energy-usage-sensitive Windows Copilot use cases, and has launched the first updates on RWKV v6, codenamed Finch and GoldFinch.
On the morning of Latent Space Live, they also announced QRWKV6, a Qwen 32B model modified with RWKV linear attention layers. Eugene has also written the single most popular guest post on the Latent Space blog this year (yes, we do take guest posts) on what he has discovered about the H100 GPU inference neocloud market since the successful launch of Featherless AI this year.
As always, don't forget to check the show notes for the YouTube link to their talk as well as their slides. Watch out and take care.
Yeah, so thanks so much for having us. This is going to be a little bit of a two-part presentation. My name is Dan. I'm at Together AI and I'll be joining UCSD as faculty in about a year. And Eugene, you want to introduce yourself? I'm Eugene. I lead the RWKV team and I'm CEO and co-founder of Featherless, and we both work on this new post-transformer architecture space.
Yeah, so today we're really excited to talk to you a little bit about that. First, I'm going to give a broad overview of the last few years of progress in post-transformer architectures, and then afterwards Eugene will tell us a little bit about the latest and greatest frontier models in this space.
So the story starts with scaling. This is probably a figure, or something like this, that you've seen very recently. Over the last five to six years, we've seen models really scale up in parameter size, and that's brought with it a bunch of new capabilities, like the ability to talk to you and tell you sometimes how to use your Colab and your AWS screens.
But another place where we've seen scaling, especially recently, is scaling in context length. So this can mean just having more text inputs for your models, but it can also mean things like taking a lot of visual token inputs, image inputs to your models, or generating lots of outputs.
And one thing that's been really exciting over the last few months or so is that we're seeing scaling not only during training time, but also during test time. So this is the iconic image from the OpenAI o1 release. Not only are we starting to scale train-time compute, but we're also starting to scale test-time compute. Now, if you're familiar with attention and the transformer architectures we use today, this graph on the right might look a little bit scary.
And one of the reasons is that the implications are a little bit interesting. What does it mean if we want to continue having smarter and smarter models? Do we just need to keep building bigger data centers and spending more FLOPs? Is this little DALL-E 3 "we need more FLOPs, guys" image going to be the future of all of AI? Or is there a better way, another path forward? Maybe we can get the same capabilities that we've gotten used to, but for a lot less compute, a lot fewer FLOPs.
And one of the things that we're going to talk about today is specifically looking at that core attention operator in some of these models. The reason is that, as some basic scaling curves show, attention has compute that scales quadratically in the context length. That means that if you're doing something like test-time compute and you want to spend a bunch of tokens thinking about what comes next, the longer that goes and the more tokens you spend on it, the more that compute grows quadratically.
One of the questions that we're interested in is, can we take that basic sequence model, the basic sequence primitive at the bottom, and get it to scale better? Can we scale in, let's say, n to the 3 halves or n log n?
And so, in the first part of the talk — we just went over the introduction — what I'm going to do over the next few slides is talk about some of the key advances and ideas that have emerged over the past few years, from maybe early 2020 to now, that show promise that this might actually be possible: that you can actually get potentially the same quality that we want while scaling better.
And basically, the story that we're going to look at starts with this basic graph of the past couple of years of progress in perplexity, where that dotted blue line is attention — your basic transformer, full dense attention — and the dots coming down are some of the methods that you'll see in this presentation today.
We're going to turn the clock back all the way to 2020. This question of whether we can make attention sub-quadratic — basically as soon as we said attention is all you need, people started asking it. So we have this quadratic attention operator: can we do better? I'll briefly talk about why attention is quadratic. The basic thing that happens, if you're not familiar, is that you have these inputs, these keys and queries,
and what you do in this attention matrix, this S matrix over here, is compare every token in your input to every other token. So when I try to do something like upload a whole book to Gemini — or maybe not Gemini, because we don't necessarily know what its architecture is, so let's say we upload it to Llama — what happens behind the scenes is that it's going to take every single word in that book and compare it to every other word.
This has led to some pretty impressive things, but it's kind of a brute-force way to try to interpret something.
What attention does in particular is that instead of always operating in this quadratic thing, it takes a row-wise softmax over this matrix and then multiplies it by this values matrix. So one of the key points to notice is that the output size is always going to be the same as the inputs, at least in standard self-attention.
So one of the first things that folks tried to do around 2020 is this thing called linear attention, which comes from noticing that if we take out this softmax — this nonlinearity in the middle of the attention operation — and then compute the keys-and-values product first, you actually never hit the quadratic bottleneck.
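To make that concrete, here is a minimal sketch of the associativity trick (illustrative only; this is the standard non-causal, ELU-feature-map formulation of linear attention, not the exact variant discussed in the talk):

```python
import torch

def quadratic_attention(q, k, v):
    # Standard attention: materializes the (n x n) score matrix
    # S = softmax(Q K^T / sqrt(d)), so compute and memory grow quadratically in n.
    s = torch.softmax(q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5, dim=-1)
    return s @ v

def linear_attention(q, k, v):
    # Linear attention: drop the softmax, map Q and K through a feature map
    # phi (here elu(x) + 1, as in Katharopoulos et al.), and use associativity
    # to compute phi(K)^T V first -- a (d x d) matrix -- so cost is linear in n.
    # Non-causal sketch; a causal version keeps a running sum over the sequence.
    phi_q = torch.nn.functional.elu(q) + 1
    phi_k = torch.nn.functional.elu(k) + 1
    kv = phi_k.transpose(-2, -1) @ v                 # (d, d): independent of n
    z = phi_k.sum(dim=-2, keepdim=True)              # normalizer, shape (1, d)
    return (phi_q @ kv) / (phi_q @ z.transpose(-2, -1) + 1e-6)

n, d = 1024, 64
q, k, v = (torch.randn(n, d) for _ in range(3))
out = linear_attention(q, k, v)                      # no (n x n) matrix is ever formed
```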
That's potentially a way to get a lot more computationally efficient, and there are various ways to do this, basically using feature maps to try to approximate the overall attention computation. But some of this work started to hit a wall in 2020, and the basic challenges were two. One was quality: back then, it was kind of hard to get good quality with these linear attention operators.
The other was hardware efficiency. This feature map that was just shown up here actually ends up being quite computationally expensive if you implement it naively. So you started having operators where not only were you not sure if they had the same quality, they were also actually just wall-clock slower. You kind of ended up getting the worst of both worlds.
So that kind of sets the stage four years ago. Keep this in mind, because linear attention is actually going to come back in a few years once we have a better understanding.
But one of the works that started kicking off this mini revolution in post-transformer architectures was this idea called state space models. Here the seminal work is the S4 line of work from Albert Gu and collaborators around 2022, and this piece of work really brought together a few ideas from some long-running lines of research.
The first one — and this is really one of the keys to closing the gap in quality — was just using things that an electrical engineer off the street might know off the back of their hand: taking some of those properties of how we model dynamical systems in signal processing, and then using those ideas to model the inputs, the text tokens, in, for example, a transformer-like next-token prediction architecture.
So some of those early state space model papers were looking at this relatively simple recurrent update model that comes from maybe chapter one of a signal processing class, but then using some principled theory about how you should do that recurrent update in order to really get the most that you can out of your hidden state, out of your sequence.
So that was one key idea for quality. And when this was eventually realized, you started to see a bunch of benchmarks that had been pretty sticky for a few years — things like Long Range Arena, some long-sequence evaluation benchmarks, stuff in time series analysis — where the quality started to tick up in meaningful ways.
But the other key thing that was so influential about these state space models is that they also had a key idea about how you can compute these things efficiently. If you go back to your machine learning 101 class where you learned about RNNs, one thing you may have learned is that they don't parallelize as well as attention, because if you run them naively, you have to do a sequential update to process new tokens, whereas in attention you can process all the tokens in parallel at one time.
One of the key insights behind the S4 paper was that these recurrent models could also be formulated as a convolution. And in particular, instead of using a PyTorch Conv1D operation, you can compute that convolution with the FFT. That gives you n log n compute in the sequence length n, with an operator that is relatively well optimized for modern hardware.
So those are really, I'd say, the two key ideas in 2022 that started allowing these breakthroughs to happen in these non-transformer architectures: the idea of modeling the recurrent updates of a sequence in a principled way, and the key idea of computing it efficiently by turning it into a convolution and scaling it up with the FFT.
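Here is a toy sketch of that convolutional view, assuming a tiny dense state space model (S4 itself uses a structured A matrix and a much more careful kernel computation; this is only meant to show the unroll-then-FFT idea):

```python
import torch

def ssm_kernel(A, B, C, length):
    # Unroll a linear SSM x_t = A x_{t-1} + B u_t, y_t = C x_t into a
    # convolution kernel K = [CB, CAB, CA^2B, ...] (naive, for illustration).
    kernel, x = [], B
    for _ in range(length):
        kernel.append((C @ x).squeeze())
        x = A @ x
    return torch.stack(kernel)

def fft_conv(u, kernel):
    # Causal convolution of the input with the SSM kernel via FFT: O(n log n)
    # instead of stepping through the recurrence token by token.
    n = u.shape[-1]
    u_f = torch.fft.rfft(u, n=2 * n)
    k_f = torch.fft.rfft(kernel, n=2 * n)
    return torch.fft.irfft(u_f * k_f, n=2 * n)[..., :n]

d_state, n = 4, 1024
A = torch.randn(d_state, d_state) * 0.1   # toy state matrix (S4 uses a structured A)
B = torch.randn(d_state, 1)
C = torch.randn(1, d_state)
u = torch.randn(n)
y = fft_conv(u, ssm_kernel(A, B, C, n))
```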
Along those same lines, afterwards, we started putting out some work on specialized kernels. So just like we have FlashAttention for transformers, we also have works like FlashFFTConv. And if you look at these lines of work, oftentimes whenever you see a new architecture, a new primitive, one of the table stakes now is: do you have an efficient kernel so that you can actually get wall-clock speedup?
So by 2022, 2023, we were starting to have models with promising quality and also promising wall-clock performance, and you could actually see regimes where they were better than transformers in meaningful ways.
That being said, there was still sometimes a quality gap, particularly for language modeling. And because language is so core to what we do in sequence modeling these days, the next key idea that I'm going to talk about is selection mechanisms. This is basically the idea that you have this recurrent state that you're keeping around, which just summarizes everything that came before.
And to get a good sequence model, one of the things that you really need to be able to do is have the model learn what's the best way to pick out pieces from that recurrent state.
One of the major ideas here, in a line of work called H3 (Hungry Hungry Hippos) and also the Hyena models, is that one way you can do this is by just adding some simple element-wise gates. Versions of these ideas have been around for decades — if you squint at the LSTM paper, you can probably find this gating mechanism. But it turns out you can take those old ideas, add them into these new state space models, and then you can see quality start to pick up.
If you've heard of the Mamba model, it takes selection to the next level by actually making changes in the fundamental recurrent state space itself. So it's not only the gating that happens around the SSM layer: you can also make the A, B, C, D matrices of your state space model data-dependent, which allows you to even better select different pieces from your hidden state depending on what you're seeing.
I'll also point out, if you look at the bottom right of this figure, there's this little triangle with the GPU SRAM and GPU HBM, and this is just continuing that trend of: when you have a new architecture, you also release it with a kernel to show that it can be hardware efficient on modern hardware.
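A heavily simplified, purely illustrative sketch of those two ingredients together — element-wise gating around the block, and input-dependent (selective) B, C, and step size inside the recurrence. The shapes, names, and update rule here are made up for clarity; real Mamba uses a structured A, a proper discretization, and a fused scan kernel:

```python
import torch
import torch.nn as nn

class SelectiveSSM(nn.Module):
    # Toy selective recurrence: B, C, and the gate are data-dependent,
    # so the model can choose per token what to store and retrieve.
    def __init__(self, d_model, d_state):
        super().__init__()
        self.A = nn.Parameter(-torch.rand(d_state))           # per-state decay rates
        self.to_b = nn.Linear(d_model, d_state)
        self.to_c = nn.Linear(d_model, d_state)
        self.to_dt = nn.Linear(d_model, 1)
        self.gate = nn.Linear(d_model, d_model)
        self.out = nn.Linear(d_state, d_model)

    def forward(self, u):                                      # u: (seq, d_model)
        state = torch.zeros(self.A.shape[0])
        ys = []
        for u_t in u:
            dt = torch.nn.functional.softplus(self.to_dt(u_t))      # input-dependent step size
            a_bar = torch.exp(dt * self.A)                           # discretized decay
            state = a_bar * state + dt * self.to_b(u_t) * u_t.mean() # selective write
            y = self.out(self.to_c(u_t) * state)                     # selective read
            ys.append(y * torch.sigmoid(self.gate(u_t)))             # element-wise gate
        return torch.stack(ys)

x = torch.randn(16, 32)
print(SelectiveSSM(32, 8)(x).shape)   # torch.Size([16, 32])
```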
One of the next cool things that happened is that once we had this understanding of the basic pieces and the basic principles behind some of these sequence models, linear attention actually started to come back. So earlier this year, there was a model called Based, from Simran Arora and some other folks, that combined a more principled version of linear attention — the two-second summary is that it uses a Taylor approximation of the softmax attention — with a simple sliding window attention, and it started to be able to expand the Pareto frontier of how much data you can recall from your sequence versus how small your recurrent state size is. Those orange dots at the top there are showing models that can recall more from the sequence.
And the last major idea that I think has been influential in this line of work, and is relatively late-breaking, just a few months ago, is the basic idea that when you have these models that are fundamentally more efficient in the sequence length, you maybe don't want to prompt them or use them in exactly the same way.
So this was a really cool paper called Just Read Twice, also from Simran, that basically said: hey, all these efficient models can process tokens so much more efficiently than transformers that they can sometimes have unfair advantages compared to a simple transformer model. Take, for example, the standard use case where you have some long document, you pass it in as input, and then you ask some question about it.
One problem you might imagine for a recurrent model where you have a fixed state size is, let's say that your article is very long and you're trying to ask about some really niche thing. You can imagine it might be hard for the model to know ahead of time what information to put into the hidden state.
But these models are so much more efficient that you can do something really stupid: you can just write down the document, write down the question, write down the document again, and then write down the question again. The second time you go over that document, you know exactly what to look for. And the cool thing is that this results in better quality, especially on these recall-intensive tasks — and the other interesting thing is that it really takes advantage of the more efficient architectures that we have here.
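A minimal sketch of what that prompting trick looks like in practice (the prompt template and example document here are hypothetical, not the paper's exact format):

```python
def just_read_twice_prompt(document: str, question: str) -> str:
    # Repeat the document (and the question) so that on the second pass the
    # recurrent model already knows what to keep in its fixed-size state.
    return (
        f"Document:\n{document}\n\n"
        f"Question: {question}\n\n"
        f"Document (again):\n{document}\n\n"
        f"Question: {question}\nAnswer:"
    )

doc = "...a very long report about Q3 revenue, hiring, and infrastructure..."
prompt = just_read_twice_prompt(doc, "What was the Q3 revenue figure?")
# For a linear-time recurrent model, doubling the input roughly doubles cost;
# for quadratic attention, the same trick is far more expensive.
```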
So one of the other, I think, influential ideas in this line of work is if you change the fundamental compute capabilities of your model and the way that it scales, you can actually start to query it at test time differently. And this actually, of course, goes back to those slides on test time compute. So while everybody's looking at, say, test time compute for big transformer models, I think potentially a really interesting research question is how can you take those and how does it change with this new next generation of models?
So I'll just briefly summarize what some of those key ideas were, and then show you what the state of the art is today. The four key ideas are: first, instead of just doing a simple linear attention approximation, take ideas that we know from other fields, like signal processing, and do a more principled approach to modeling the sequence.
Another key idea throughout all these lines of work is that you really want hardware and kernel support from day one. Even if your model is theoretically more efficient, if somebody goes and runs it and it's two times slower, one of the things that we've learned is that it's just going to be dead on arrival. So you want to be designing your architectures with the hardware in mind. One of the key machine learning ideas that has been important for quality is just making sure that you encode different ways that you can select from your hidden state, and really focus on that as a key determinant of quality. And finally, I think one of the emerging questions for this line of work, and something that's quite interesting, is: what are the right test-time paradigms for these models? How do they change relative to what you might do for a standard transformer?
I'll briefly end this section. I've labeled this slide "where we were yesterday," because Eugene is going to talk about some new models that he released literally this morning. But as of yesterday, some of the really cool results out of these efficient alternative models were: AI21 trained this hybrid MoE called Jamba, which is currently the state of the art for these non-transformer architectures.
And MIT put out this new diffusion model called SANA recently, where one of the key observations is that you can take a standard transformer diffusion model, replace the layers with linear attention, and that lets you scale to much larger images and much longer sequences more efficiently.
And one thing that I don't think anybody would have called a few years ago is that one of those gated state space models ended up on the cover of Science, because a great group of folks went and trained some DNA models — that's Michael Poli, Eric Nguyen from Stanford, and the Arc Institute. So we're really at an exciting time in 2024, where these post-transformer architectures are showing promise across a wide range of modalities, applications, and tasks. And with that, I'll pass it on to Eugene, who can tell you a little bit about the latest and greatest with RWKV. - Yeah, so —
am I talking into this one? Oh, I'm talking into this one. Okay. So yeah, two streams. I think one common question that we tend to get asked is: what's the difference between RWKV and state space models? One of the key things to understand about the difference between the two groups is that we are actually more of an open-source, random-internet-meets-academia kind of situation. Most of us never wrote any paper. We basically looked at RNNs and linear attention when "Attention Is All You Need" came out, and then we decided, hey, there is a quadratic scaling problem, why don't we try fixing that instead? So we ended up developing our own branch, but we ended up sharing ideas back and forth, and we do all this actively on Discord, GitHub, etc.
This was so bad for a few years that the group's average h-index was close to zero. EleutherAI actually came in and helped us write our first paper. Great, now our h-index is three, apparently. But the thing is, a lot of these experiments led to results, and essentially we took the same ideas from linear attention and built on them.
So, to take a step back: how does RWKV handle its own attention mechanism and achieve the same goal of O(N) compute? And the focus of our overall goal is to make AI accessible to everyone, regardless of language, nation, or compute.
We actually train our models primarily on over 100 languages, which is another topic altogether, and our goal is to train on even 200 languages to cover all languages in the world. But at the same time, we work on this architecture to lower the compute cost so that people can run it on Raspberry Pis and on anything.
So how did RWKV break the dependency of the LSTM token flow? Because I think it's easier to understand the architecture through the RNN lens — that's what we built on — whereas state space models kind of tried to start anew and took lessons from that, so there's a little bit of divergence there. AKA, this is our version of linear attention.
To take a step back: all foundation models, be they transformers or non-transformers, at a very high level take in a token, turn that into embeddings, go through a lot of layers, generate a lot of internal states — whether that's a QKV cache or RNN states or RWKV states — and output an embedding, layer norm, and sampling. And we just add more layers and more embeddings, and somehow that magically works.
So if you remember your ancient RNN lessons — which we call blessed learning these days — the general idea is that you have the embedding information flowing all the way up, and you take that information, flow it back down, and then process it as part of your LSTM layers. This is how it generally works. Karpathy is quoted as saying that RNNs are unreasonably effective. The problem is that this is not scalable: to start doing work on the second token, you need to wait for the first token, and likewise for the third token and fourth token, yada, yada, yada.
That is CPU land, not GPU land. So you can have an H100 and you can't even use 1% of it. That's kind of why RNNs didn't really take off in the direction we wanted, of billions of parameters in training. So what did RWKV version 0 do? We just did the dumbest, lamest thing. So, this is the bottleneck for RNNs — we did the dumb thing of removing that line.
And it kind of worked. It trained — it sucked, but it kind of worked. Then no one cared, because the loss was crap, but —
how do we improve that? And that's essentially where we moved forward, because if you look at this kind of flow, you can get your GPU saturated quickly — it essentially cascades. (I'm just waiting for this animation to loop again.) Once your first token finishes its first layer, you start to cascade your compute all the way until, hey, I'm using 100% of the GPU.
So we worked on it, and we went along the principle that as long as we keep this general architecture, where we can cascade and be highly efficient, nothing is sacred in our architecture. And we have done some crazy ideas. In fact, if you ask me to explain some things in the paper, officially in the paper I'll say we had this idea and we wrote it this way. The reality is someone came with the code, we tested it, it worked, and then we rationalized it.
So the general idea behind RWKV is that we have two major blocks: we call them time mix and channel mix. Time mix generally handles long-term memory states, where essentially we apply matrix multiplications and activation functions to process an input embedding into an output embedding — I'm oversimplifying it, because this calculation has changed every version, and we have version 7 right now. Channel mix is similar to Based in the sense that it does shorter-term attention: it just looks at the sister token, the token right before it, because there's a shift in the token shift matrix.
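As a rough illustration of those two block types, here is a heavily simplified, hypothetical sketch — the real RWKV formulation (token shift mixing, decay parameterization, and so on) changes every version and is considerably more involved:

```python
import torch
import torch.nn as nn

class TimeMix(nn.Module):
    # Long-range block: a decaying recurrent state stands in for attention.
    def __init__(self, d):
        super().__init__()
        self.decay = nn.Parameter(torch.zeros(d))
        self.key = nn.Linear(d, d)
        self.value = nn.Linear(d, d)
        self.receptance = nn.Linear(d, d)

    def forward(self, x):                                     # x: (seq, d)
        state = torch.zeros(x.shape[-1])
        w = torch.exp(-torch.exp(self.decay))                 # per-channel decay in (0, 1)
        out = []
        for x_t in x:
            state = w * state + self.key(x_t) * self.value(x_t)
            out.append(torch.sigmoid(self.receptance(x_t)) * state)
        return torch.stack(out)

class ChannelMix(nn.Module):
    # Short-range block: mixes each token with the token before it
    # (the "token shift"), then applies a gated feed-forward.
    def __init__(self, d):
        super().__init__()
        self.key = nn.Linear(d, 4 * d)
        self.value = nn.Linear(4 * d, d)
        self.receptance = nn.Linear(d, d)
        self.mix = nn.Parameter(torch.full((d,), 0.5))

    def forward(self, x):                                     # x: (seq, d)
        prev = torch.cat([torch.zeros_like(x[:1]), x[:-1]])   # token shift: previous token
        xm = x * self.mix + prev * (1 - self.mix)
        k = torch.relu(self.key(xm)) ** 2
        return torch.sigmoid(self.receptance(xm)) * self.value(k)

x = torch.randn(16, 64)
y = ChannelMix(64)(TimeMix(64)(x))                            # (16, 64)
```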
I don't really want to go too much into the papers themselves, because we have three papers on this. Basically: "RWKV: Reinventing RNNs for the Transformer Era"; "Eagle and Finch," the RWKV paper with matrix-valued states, which covers the updated version 5 and version 6; and GoldFinch, our hybrid model, respectively. We are already writing the paper for v7, RWKV-7, codenamed Goose —
all our architectures are accompanied by a bird. And I'm going to cover QRWKV and the related conversions as well. So where did that lead? Okay — because we were all GPU poor. And to be clear, most of this research is done on only a handful of H100s, which one Google researcher told me was basically his experiment budget as a single researcher. So our entire organization has less compute than a single researcher at Google.
One of the things that we explored was: how do we convert transformer models instead? Because someone already paid that million dollars for training, so why don't we take advantage of those weights? I believe Together AI worked on a similar conversion on the Mamba side of things, and we took some ideas from there as well and essentially did that for RWKV.
And that led to QRWKV6, which we just dropped today: a 32B model where we took the Qwen 32B model, froze the feedforward layers, removed the QKV attention layers, and replaced them with RWKV linear layers. To be clear, this means we do not have the RWKV channel mix layer; we only have the time mix layer.
Once we do that, we train the RWKV layers. The important part is that the feedforward layers need to be frozen so the new attention can be learned. Then we unfreeze the feedforward layers and train all the layers together with a custom learning rate schedule, so that they can learn how to work together. The end result, surprisingly — and to be honest, to the frustration of the RWKV MoE team, which ended up releasing their model on the same day — was that
with just a few hours of training on two nodes, we managed to get it on par with the original Qwen 32B model. In fact, the first run completely confused us. I was telling Daniel Goldstein, who leads most of our research coordination:
when you pitched me this idea, you told me at best it would get the same level of performance — you didn't tell me the challenge scores would shoot up. I don't know what's happening there, but it did. The MMLU score dropping, that was expected, because if you think about it, when we were training all the layers we were essentially Frankensteining this thing, and we did some brain damage to the feedforward network layers too with the new RWKV layers. But 76% — hey, some of it is retained, and we can probably further train this. We didn't even spend more than three days training this, so there's a lot more that can be done, hence the preview.
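A rough, toy-scale sketch of that staged recipe — swap the attention for an RWKV-style block, freeze everything else, then unfreeze. The module names and the stand-in block here are hypothetical; the actual QRWKV6 conversion operates on the real Qwen 32B weights with real RWKV-6 time mix layers:

```python
import torch
import torch.nn as nn

class ToyRWKVTimeMix(nn.Module):
    # Stand-in for an RWKV linear-attention (time mix) block; hypothetical.
    def __init__(self, d):
        super().__init__()
        self.proj = nn.Linear(d, d)
    def forward(self, x):
        return self.proj(x)

class ToyBlock(nn.Module):
    # Stand-in for one transformer layer: softmax attention + feed-forward.
    def __init__(self, d):
        super().__init__()
        self.self_attn = nn.MultiheadAttention(d, num_heads=4, batch_first=True)
        self.ffn = nn.Sequential(nn.Linear(d, 4 * d), nn.GELU(), nn.Linear(4 * d, d))

def stage1_swap_and_freeze(model):
    # Stage 1: replace each softmax-attention module with an RWKV-style block,
    # and freeze everything else (including the FFN) so that only the new
    # attention learns to stand in for the old one.
    for block in model:
        block.self_attn = ToyRWKVTimeMix(block.ffn[0].in_features)
    for name, p in model.named_parameters():
        p.requires_grad = "self_attn" in name

def stage2_unfreeze_all(model):
    # Stage 2: unfreeze everything and train jointly (with a lower LR schedule)
    # so the old feed-forward layers and the new attention learn to cooperate.
    for p in model.parameters():
        p.requires_grad = True

model = nn.ModuleList([ToyBlock(64) for _ in range(4)])
stage1_swap_and_freeze(model)
print([n for n, p in model.named_parameters() if p.requires_grad])  # only new self_attn params
```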
This opens up a big question, because we are now in the process of converting the 70B. This is actually an extremely compute-efficient way to test our attention mechanism — it becomes a shortcut. We are already planning to do our version 7 and our hybrid architecture for it, because we're doing the training from scratch and we get a really good model out of it. And the other thing that is uncomfortable to say, now that we are doing the 70B, is that if this scales correctly to 128K context length — I'm not even talking about a million, just 128K — well, the majority of enterprise workloads today are just on 70B at under 32K context length. That means if this works and the benchmarks match, we can replace the vast majority of current AI workloads, unless you want super long context. And then — sorry, can someone give us more GPUs? Because we do need the VRAM for super long context, sadly.
So yeah, that's what we are working on, and essentially we are excited about pushing it further. And this conversion process, to be clear, I don't think is going to be exclusive to RWKV. It probably will work for Mamba as well; I don't see why not. And we'll probably see more ideas, or more experiments, or more hybrids. One thing that I want to say outright — and I confirmed this with the Black Mamba team and the Jamba team, because we did the GoldFinch hybrid model — is that none of us understands why a hybrid of a state space model (RWKV or state space) and a transformer performs better than the baseline of either. When you train one and then you replace layers, you expect the same results — that's our pitch, that's our claim. But somehow when we jam both together, it outperforms both. And that's one area where, of course, we only have four experiments across four teams, so a lot more needs to be done. But these are things that excite me, because that is potentially how we can move ahead. Which brings us to what comes next.
So this part is where we'll talk a little bit about stuff that we're excited about, and maybe have some wild speculation on what's coming next. And of course, this is also the part where we'll be more open to questions. So a couple of things that I'm excited about: continued hardware-model co-design for these models.
One of the things that we've put out recently is this library called ThunderKittens. It's a CUDA library. And one of the things that we found frustrating is that every time we built one of these new architectures — and I'm sure you had the exact same experience — we'd have to go and spend two months in CUDA land writing these new efficient kernels. And if we decided to change one thing in PyTorch, one line of PyTorch code is like a week of CUDA code at least.
So one of our goals with a library like ThunderKittens was to break down what the key principles are, what the key hardware features are, what the key compute pieces are that you get from the hardware. For example, on H100, everything really revolves around a warpgroup matrix multiply operation. So you really want your operation to be able to split into relatively small matrix-matrix multiplies — multiplying two 64-by-64 matrices, for example. And if you know that ahead of time, when you're designing your model, that probably gives you some information about how you set the state sizes and how you set the update function.
So with ThunderKittens, we basically built a whole library around this basic idea that your basic compute primitive should not be a float, it should be a matrix, and everything should just be matrix compute. And we've been using that to try to both reimplement some existing architectures and also start to design some new ones that are really designed with this tensor core primitive in mind.
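This is not ThunderKittens code (which lives in CUDA), but a plain-PyTorch toy that illustrates the underlying idea: treat a 64-by-64 tile, rather than a single float, as the unit of compute, and express the whole computation as small tile-level matmuls that map naturally onto tensor-core instructions:

```python
import torch

TILE = 64   # matches the warpgroup MMA tile size on H100 (illustrative)

def tiled_matmul(a, b, tile=TILE):
    # Compute C = A @ B one (tile x tile) block at a time. In ThunderKittens the
    # analogous inner step would be a tensor-core MMA on a register tile; here it
    # is just a small torch.matmul, purely to illustrate the structure.
    m, k = a.shape
    k2, n = b.shape
    assert k == k2 and m % tile == 0 and n % tile == 0 and k % tile == 0
    c = torch.zeros(m, n, dtype=a.dtype)
    for i in range(0, m, tile):
        for j in range(0, n, tile):
            acc = torch.zeros(tile, tile, dtype=a.dtype)
            for p in range(0, k, tile):
                acc += a[i:i + tile, p:p + tile] @ b[p:p + tile, j:j + tile]
            c[i:i + tile, j:j + tile] = acc
    return c

a, b = torch.randn(128, 256), torch.randn(256, 192)
assert torch.allclose(tiled_matmul(a, b), a @ b, atol=1e-3)
```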
Another thing that we're — at least I'm — excited about is that over the last four or five years, we've really been looking at language models as the next thing. But if you've been paying attention to Twitter, there's a bunch of new next-generation models coming out. There are video generation models that can run in real time, driven by your mouse and your keyboard,
and I'm told if you play with them, they only have a few seconds of memory. Can we take a model like that and give it a very long context length, so that you could actually generate an entire game state at a time? What does that look like for the model? You're certainly not going to do a giant quadratic attention computation to try to run that. Or take some of these new video generation models that came out — Sora came out, I don't know, about two days ago now, but with super long queue times and super long generation times. That's probably a quadratic attention operation at the bottom of it. What if we could remove that and get the same quality, but a lot faster generation time? Or some of the demos that we saw from Paige earlier today: if I have a super long conversation with my Gemini bot,
what if I wanted it to remember everything that it's seen in the last week? I mean, maybe you don't, for personal reasons, but what if I did? What does that mean for the architecture? I think that's certainly something I'm pretty excited about, and I'm sure you're excited about it too. I think we were supposed to have some hot takes, but I honestly don't remember what our hot takes were. - Yeah, hot takes, yes, these are our hot takes. I think the big one that we saw and shared on Twitter was the question: is RAG still relevant in the future of state space models? - Let's see. I haven't played too much with RAG, but when I have, I'll say I found it a little bit challenging to do research on, because we had this experience over and over again where you could have
an embedding model of any quality. So you could have a really, really bad embedding model or you could have a really, really good one by any measure of good. And for the final RAG application, it kind of didn't matter. That's what I'll say about RAG while I'm being recorded.
I know it doesn't actually answer the question, but... - Yeah. So I think a lot of folks are extremely excited by the idea of RWKV or state space models potentially having infinite context. But I think the reality is that when we say infinite context, we just mean a different kind of infinite context — or, as was covered previously, you need to prompt the model differently. So think of it more along the lines of a human: I don't remember what I ate for breakfast yesterday.
Yeah, that's the statement I'll make. And we humans are not quadratic transformers. If we were — if, let's say, our brain size increased for every second we live — we would have exploded by the time we were five years old or something like that. And I think, fundamentally for us, regardless of whether it's RWKV, state space, xLSTM, etc.,
our general idea is that instead of that expanding state, with its increasing computational cost, what if you have a fixed state size? Information theory dictates that a fixed state size will have a limit; just how big that limit is, is the question. RWKV is running at about 40 megabytes for a state; future versions might run at 400 megabytes. Mathematically, that is millions of tokens of maximum possible capacity — it's just that we are all still inefficient about it, so maybe we hit a hundred thousand. And that's kind of the work we are doing, trying to push it and maximize it.
And that's where the models will start to differ, because they will choose to forget things and choose to remember things. That's why I think there might still be some element of RAG, but it may not be the same RAG. Maybe the model learned things and it's like, hmm, I can't remember that article, let me do a database search — just like us humans, when we can't remember an article at the company, we do a search on Notion. - Yeah. I think something that would be really interesting is if you could have —
so, right now, one intuition about language models is that a lot of those parameters are there just to store random facts about the world. And this intuition comes from the observation that if you take a really small language model, it can do things like talk to you, it kind of has the style of conversation, it can learn that — but where it will usually fall over compared to a much larger one is that it'll just be a lot less factual about the things it knows or can do.
But that points to the fact that a lot of those weights, and a lot of the SGD that we're spending to train these models, are just being used to store facts.
And we have things like databases that are pretty good at storing facts. So I think one thing that would be really interesting is if we could actually have some sort of outside data store that a language model can look at — maybe with some sort of gradient descent into it — that would be quite interesting. And then maybe you could edit it, delete facts, change who's president so that it doesn't get lost. - Can we open up Q&A and hot takes for the audience? - Sure.
- I have a hot-take Q&A. Do these scale? Will a 405B state space model exist? With RAG, does anyone actually do long context — who's throwing in 2-million-token questions? Hot takes?
The "who's throwing in 2-million-token questions" one, I think, is a really good question. Actually, I was going to offer that as a hot take. My hot take was going to be that long context doesn't matter — I know I just gave a whole talk about it, but what's the point of doing research if you can't play both sides? I think for both of us, the reason we first got into this was just the first-principles question: there's this quadratic thing.
Clearly intelligence doesn't need to be quadratic. What is going on? Can we understand it better? Since then it's kind of turned into a race, which has been exciting to watch — how much context you can take in. But I think it's right: nobody is actually putting a 2-million-token prompt into these models, and if they are, maybe we can go design a better model to do that particular thing. What do you think about that? You've also been working on this. Do you think long context matters?
So I'm going to burn a bit. How many of you remember the news of Google Gemini supporting 3 million context? Raise your hand. - 2 million. - Oh, it's 2 million. Yeah. How many of you actually tried that? - I use it a lot. - You use it a lot, all right. So for some people it is used, and I think this is where my opinion starts to differ, because I think the big labs may have a bigger role in this, because
even for RWKV, when we train on long context, the reason I say VRAM is a problem is that we need to backprop against the states, so we actually need to maintain the states in between the tokens for the whole token length. That means we need to roll out the whole 1 million context if we are actually training on 1 million — which is the same for transformers, actually — it just means we don't magically reduce the VRAM consumption at training time. So that is the VRAM bottleneck, and
I'm neither OpenAI nor Google, so donate GPUs if you have too many of them. But putting it back to another paradigm: I think o1-style reasoning might actually be pushing that direction downwards. In my opinion — this is my partial hot take — if, let's say, you have a super big 400B model, and you have a 70B model that may take double the tokens but gets the same result, then strictly speaking the 70B — and this holds for transformer or non-transformer — will take fewer resources than that 400B model, even if it did double the amount of thinking. And if that's the case, and we're still all trying to figure this out, maybe the direction for us is really getting the sub-200B models to be as fast and efficient as possible, with a very efficient architecture that some folks happen to be working on, to just reason it out over larger and larger context lengths.
- One thing I'm super interested in is models that can watch forever. Obviously, you cannot train something on infinite context length. How are y'all thinking about that, where you run on a much longer context length than is possible to train on? - Yeah, it's a great question.
I think you guys probably had tweets along these lines too. When we first started doing these things — because these are all recurrent models, in theory you could just run them forever, and at the very least they won't error out or crash. There's another question of whether they can actually use what they've seen in that infinite context. And I think there, one place where the research on architectures probably ran ahead of the rest of the research is the benchmarks for long context.
If you're going to turn it on forever, what is it that you actually want it to do — do everything, watch everything? Can we actually build some benchmarks for that, measure what's happening, and then ask whether the models can do it, and whether there's something else they need? Yeah, I think if I were to turn back the clock to 2022, that's probably one of the things I would have done differently: actually get some long-context benchmarks out at the same time as we started pushing context length in all these models.
I will also say, on the use case: I think we both agree that there's no infinite memory and the model needs to be able to learn and decide. What we have observed — and I think this holds for state space models too — is that one of the key advantages of this alternate attention mechanism, which is not based on token position, is that the model doesn't suddenly go crazy when you go past the 8K training context length, or a million context length. It's actually still stable, it's still able to run, it's still able to rationalize. It just starts forgetting things. But
Some of these things are still there in latent memory. Some of these things are still somewhat there. That's the whole point of why reading twice works, things like that.
And one of the biggest pushes in this direction is that I think both state space and RWKV have separate papers by other researchers where they use these architectures for time series data, like weather modeling. So you're not asking what the weather was five days ago; you're asking what the weather will be tomorrow, based on a stream that just keeps running as long as the earth and the computer keep running. And they found that it is possible to do better than a transformer or existing architectures at modeling this weather data, controlled for parameter size and so on — and I'm quite sure there are people with larger models. So there are future applications here, if your question is what's next and not what was ten years ago. Thanks so much for having us.