I think reasoning is just very ill-defined. A computer program will generalize arbitrarily well if you write it correctly and we can prove that it's perfect and so on. Is it reasoning? I'm not sure. These models are probably not implementing algorithms, right? They're implementing heuristics that, you know, at least when they're training are good enough to fit, right? But then as soon as they go out of distribution,
they generalize horribly. I like our results because we give a quantity that you can directly measure and like say this is happening in a transformer. Well, these other results are more like constructive kind of proof that you can represent this class of languages or so on. So they're kind of like very related, but very different approaches to similar issues in some way.
Federico, welcome to MLST. It's so nice to have you here. Thanks for having me. So, why do transformers need glasses? I probably should not be allowed to pick titles anymore, because I think I come up with these wacky titles. But the key idea, well, why this phrase is kind of nice to me, is this:
They seem to be very bad, and this is kind of the point of the paper. They seem to be very bad at detecting... Like, if you care about a single token or whatever, at some point this breaks, right? And it's something about, like, you know, they can't be, at least in the limit, as your context size grows, they can't be good at caring about a single token. This is kind of the...
especially if it comes at the end. So, Tufalabs is a new AI research lab I'm starting in Zurich. It is funded from Paz Ventures, involved in AI as well. And so we are a Swiss version of DeepSeek: a small group of people, very, very motivated, very hardworking, and we try to do some AI research, starting with LLMs and o1-style models. What we're looking for now is a chief scientist and also research engineers. You can check out positions at tufalabs.ai.
So talk us through that. There's a figure, I think it's on page two, where you talk about the representations getting kind of squashed towards the end. Can you explain that figure? So imagine like a family of growing sequences. You repeat its last token. In the paper, for example, we have like, you know, can you count the number of ones in this sequence? And now you add an extra one, right? So now the two sequences have...
two different answers, right? Essentially, the one in which you repeat an extra one now has as an answer, like, n plus one, right, or k plus one. There's actually an issue with this construction, and it's the fact that at some point the influence of this final one
gets lost, right? And this is the fundamental idea, right? As these sequences grow, you really care about that final one, that one that kind of gives you the right answer. But at some point, if you start measuring inside the model at the representations, and you can also, I guess we showed this mathematically as well, at some point, these representations get closer and closer, right? And then the issue is computers have precision, right? And if
If now this representation is too close, it's under the precision of the machine,
Then you're forced to map these two sequences to the same thing, essentially. So one of them has to have a mistake in it. So let's sketch this out. I'm not sure if I've remembered the experiments exactly, but it's something along the lines of you generate a sequence. So it could be 1, 0, 1, 0, 1, 0. Maybe you probabilistically generate the ones and zeros. So 70% of the time it's like a 1, 30% of the time it's a 0. And then
at the end you ask a question, right? So you say, how many ones are there, or what was the last thing? And you're kind of saying that as the sequence grows longer, the network has a bit of a blind spot towards the most recent thing. To me, what was surprising is the way you construct this sequence, right? Let's say, let's stick to copying, because maybe that's
kind of a more simple example, right? It's like, can you copy the last element of the sequence? This is kind of like a completely trivial operation for a human. A human will never make a mistake in this. Irrespective of... I think this is the nice thing, right? If you're counting a massive sequence,
a human will very likely make a mistake. But if you're just asking what's the last element, you'll never make a mistake in this, essentially, right? Because it's like a completely trivially generalizable, you don't need to do any computation, essentially, right? It's just like, just look at the last thing. But then somehow the machine or the transformer, you know, we tried this with Gemini, you know, like,
at the time, like, the most powerful thing. At some point, if you give it, for example, a very long sequence of ones with a zero at the end, it just starts outputting one. I think this very nicely illustrates that at some point the fact that there's a zero in the sequence gets lost in the representation. And we measured this, right, in a smaller model, of course, because we need to have the weights and everything to do it, and
But this happens in... You can really view the representations kind of converge to each other. And I think this is a nice thing because it's kind of like saying, "Look, we can really find a quantity which
explains why we're failing in this task, right? Yes. And we'll get to your intuition and theory in a minute because I think it's really cool. It's like an intuition pump for people. But the reason why people at home will find it interesting is I think most of us feel that there's a kind of recency bias. And if anything, it's more likely to know about stuff that you've just said rather than stuff a long time ago. And certainly when we look at
some of these large models with large contexts, we see a kind of U-shaped curve, don't we? Right, yes, yes. So how does that kind of fit in with your new model? Here we have to talk a bit about one of the main intuitions, right, of this paper, which is like kind of related to this representational collapse idea, but it's like, I think in our paper, we're kind of developing this as its own idea, right? And it's the fact that like somehow the way information flows in these transformers,
actually has an inherent mechanistic bias, right? And it's actually towards the start of the sequence. But when you train these models, right, like you're training them to predict the next token, very likely this next token depends on the very nearest tokens, right? So in some sense, the training dynamics push it to care about the most recent things, but the...
mechanics of the information instead push it to care about, or at least retain, information towards the start of the sequence. There was this kind of nice phenomenon which was observed, which is that if you ask language models to do information retrieval, they tend to find it harder towards the middle, right? And to us, this kind of explains this phenomenon, in the sense that mechanically, transformers are good at the start, but they learn to care about
The end, and the middle is kind of lost somehow. MLST is sponsored by Sentinel, which is the compute platform specifically optimized for AI workloads. They support all of the latest open-source language models out of the box, like Llama, for example. You can pay on consumption, essentially, or you can have a model which is always working, or it can be freeze-dried when you're not using it.
All of the models that they deploy support the OpenAI API specification out of the box, which means it's just a one-line change in your application to switch over to Sentinel and start saving money and make your application go faster. We should linger on this for a bit. Maybe we should show a figure on the screen now, which is to do with this kind of topological information transfer through the network.
The really interesting thing is that the earlier on you are, the more topological pathways there are to the prediction token. So if you think about it, this self-attention transformer, now, first of all, it has...
causal masking, which you should introduce, but that's just a thing so that when we're training it, it can't cheat by looking into the future. But subject to that causal masking, the further back you go, the more kind of pathways through this self-attention matrix it can take before it does the prediction. And what you're kind of claiming is that the more pathways there are,
the less likely the information is to become squashed. These kind of ideas, I think, are very popular in graph neural network literature, actually. So it's not really out of the blue, this stuff. But as you mentioned correctly, it's really the fact that
you have this causal mechanism in place, right? Which essentially means that attention now looks like a lower triangular matrix, right? Attention is only allowing you to look backwards. And this is really a training trick, right? Of course you can have transformers like BERT and so on that don't do this, right? But then if we want to efficiently train autoregressively, you can't have your transformer look forward, right? Because then this makes it super inefficient to do things. And
And so what we point out is that actually this choice of causal attention has an implication on how information propagates. And it's exactly the number of paths you care about. Intuitively, if you're the last token, the only way for your information to kind of survive through the attention mechanism is if your attention on yourself,
because this is the only kind of attention that preserves you, is quite high. But now this means, because there's this kind of sum-to-one constraint, that everything else is forced to be low. So there's this kind of trade-off, right? If you want yourself to survive and you're the last token, you can't essentially do attention, right? So I found this to actually explain a lot of interesting phenomena.
This is why I quite like this kind of intuition. And there's a really interesting literature here because, of course, your supervisor, by the way, was Michael Bronstein. Right, or still is, I guess. I love Michael. He's amazing. But, of course, Michael has done lots of work on this idea of squashing in GNNs. And even before that, people were talking about vanishing gradients and stuff in RNNs.
And I think I read in your paper, you were saying that in the GNN literature and squashing, there's this notion of, I can't remember the word you use, but it's something to do with like the taxi distance or something. So like the more steps you have to travel, like the more squashed you get. This paper, right? It's like...
taking ideas from this kind of work in graph neural networks and building a bridge, right? And a very common way you think about... And actually, this is the work I used to do. I used to study information propagation in graph neural networks. So it's not really a surprise, right, that I kind of got into this. But essentially, the way people study propagation in graph neural networks is by relating it a lot to spectral graph theory. And...
You relate this a lot to how kind of these... Even like Markov chains, I guess. And now you get some natural quantities. Like the taxi kind of distance you refer to is usually referred to as commute time. Yes. And it's a quantity that measures... Like imagine you're on a graph and you...
you define a random walk which says I'm a node with degree 5, so I have 5 neighbors. I can jump to any of them with probability 1/5, right? And so on. So this defines kind of this process which, you know, you start at a point or the node and you sample and you can kind of create your little journey, right? And this commute time
is essentially asking the question: "Let me pick two nodes on the graph and let me start at one of them. What's the expected number of steps I have to do to reach the other node and come back?" So it's commuting because you want your random walk to hit that node and come back. And somehow this is super related to heat equations on graphs and so on, maybe not surprisingly.
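To make the commute-time idea concrete, here is a minimal sketch (my own toy example, not code from any paper): it estimates the commute time between two nodes by simulating exactly the random walk described above.

```python
import random

# Toy undirected graph as an adjacency list (an arbitrary example).
graph = {
    0: [1, 2],
    1: [0, 2, 3],
    2: [0, 1, 3],
    3: [1, 2, 4],
    4: [3],
}

def commute_time(u, v, n_walks=20_000, seed=0):
    """Monte Carlo estimate of the expected number of steps to go u -> v -> u
    under the simple random walk (jump to a uniformly random neighbour)."""
    rng = random.Random(seed)
    total = 0
    for _ in range(n_walks):
        node, steps, target = u, 0, v
        while True:
            node = rng.choice(graph[node])
            steps += 1
            if node == target:
                if target == u:      # round trip u -> v -> u completed
                    break
                target = u           # reached v, now walk back to u
        total += steps
    return total / n_walks

# Poorly connected pairs (here, anything involving the pendant node 4) have a larger
# commute time than well-connected ones; this is the quantity over-squashing analyses use.
print(commute_time(0, 1))   # smaller: 0 and 1 sit in a densely connected part of the graph
print(commute_time(0, 4))   # larger: every round trip has to squeeze through node 3
```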
There's this kind of relationship between how heat spreads and how random particles move, right? So there's this kind of nice, very well-studied behavior. And it turns out that graph neural networks, depending on how you choose them, like for example graph convolutional networks, are very similar to heat equations. They're actually...
you can view them as a discretization of a heat equation over a graph, and this is actually very well defined. So it's not surprising that how sensitive they will be to two nodes talking to each other will be
somehow related to this commute time, right? And this is how people have been studying graph neural networks recently and it's proven like to... We actually had a paper on this where like you can kind of show that like on some tasks just changing the way the graph is connected heavily affects how easily the graph neural network can kind of make two things
communicate or how it can kind of transfer information between the two. And just explain more about the heat equations. Are you talking about things like second law of thermodynamics and Navier-Stokes and stuff, but in the context of a graph where you have fewer degrees of freedom because you can speak to your neighbors, obviously, but it's a little bit different. Right, right. So you can literally define a heat equation over a graph and you can define a Laplacian operator and evolve it, right? And this is like
studying the spectrum of this operator is something that's... There's just super strong analogies between continuous heat equations, you know? You can really view graphs as discrete types of surfaces, and then
you can draw many, many parallels and for some reason, magically, maybe not magically, it's super consistent. Like you can develop inequalities on graphs and heat equations on graphs that are consistent to continuous spaces, right? And then you can use these inequalities or whatever to tell, to bound how a graph neural network can kind of
spread information or like how it's kind of information spreading operates. You cited Muse, the lyrics of one of his songs talking about how energy dissipates over time. Right, right. Yeah, which was which of course was a reference to the second law of thermodynamics. I think this this citation is more due to Petar, so I can't take credit. But yes. So Petar, of course, is famous because he invented graph attentional
networks and he's got loads and loads of citations on that that that's amazing but peta for years has been talking about the um i guess the the potential limitations with with transformers and he's been looking into the domain of graph networks to to improve that do you think that there's an opportunity here to build a better architecture yeah i guess that that's the hope right like uh somehow i mean maybe this is a
This is coming from the opinion of someone that started his PhD on graphical networks and then decided to kind of jump ship and go to like language models, whatever. And I think there's actually a lot of value in studying graphical networks because we can really have, I think we have a pretty deep understanding of how they work. At least some models are very kind of
you can really study them quite nicely because of these kind of relationships with physics and so on. And the hope is that you can take these approaches and then build a bridge to transformers. And transformers, if language models were fully connected, this would not be as interesting. But the fact that they have this very specific topology
kind of gives you opportunity to exploit these ideas. If you're fully connected, essentially you have arbitrary behavior. And so what can you say, right? But if you're not fully connected now, you can maybe say something more interesting. And this is why I like causal attention mechanisms, because they're actually super interesting to study.
And I think in our paper we also have some efforts, maybe quite small, but I think they're an interesting direction, I mean, they're mostly in the appendix, but if you can develop a spectral theory for these kinds of triangular attention matrices, you can come up with cool results, or kind of cute results. Where, for example, we have this result that, in the limit, you only care about the initial token, right? And so on. And these are kind of
really ideas from spectral graph theory that arise kind of naturally. And these are not, maybe I wouldn't say they're super deep results, but I think they provide a lot of intuition for what is happening, right? There are so many things you said there, I don't know which one to take first, but maybe let's just touch on the last thing you said, which is one of your
you know, kind of ending theorems in the paper is what you just said. I think it wasn't a strong theorem actually. You said that you believe that in the limit as the token length, you know, increases, increasingly it's only going to pay attention to the first token. Right, right. We had to make some assumptions there, right? And it's like, because everything is super non-linear. But like the main intuition is that like
what happens if you add more and more layers, right? Like this is kind of what you're trying to study. And if you assume that like you can kind of ignore the contribution. So of course it's a pretty big limitation, but like in general, I think it's pretty hard to study super nonlinear systems from this perspective, right? Of linear algebra. But like, I guess the point is like,
if you keep applying these attention matrices, and maybe this kind of leads nicely into it, but attention matrices can never really be sharp. By sharp I mean, for now, let's just say you care essentially about how much this process mixes things, right? And because you can only go backwards, this mixing goes only in one direction.
And this direction is only towards the beginning of the sequence. And so the more layers you apply, the more you just care about the beginning of the sequence. And this is kind of what happens, right? And actually, there are very nice results here, and I was really happy to see this. There's a very nice paper by Carlini, and these are very good folks that do security, right? And...
And there's this very nice paper where they actually found that as you repeat tokens and so on, it seems like your final output just becomes the beginning-of-sequence token. This was nice to see because these results, to me, point to why this is happening. It seems kind of arbitrary, like, why do you... But actually, mechanically, this makes a lot of sense from a spectral perspective. So this is why I'm kind of very excited about this direction, like perhaps...
even if we make these kind of broad simplifications, there's something there that you can really... It really helps your understanding. And so I hope that this kind of work could even kind of spill over to
Somehow better understanding leads to better security or better attacks or better defenses. I kind of view this as understanding comes first and then application comes later. I guess the obvious question is we do have models now that have a 2 million token context.
I mean, they don't work perfectly, but they seem to work quite well. So why don't we see a massive mode collapse with those? I guess someone has to open up these models and figure this out, essentially. Because I think, at least spectrally, this should not work, right? And the issue is that you can view attention as a type of contraction.
But then you have all these kind of different components like MLPs coming later and so on that then act potentially as expansions, right? So there's something there about like the balance that needs to be struck. Like you want to counter the contractive effect of attention and you have residual connections, for example, that help with this or you have MLPs that help with this, right? So is this kind of like some kind of self-stabilizing process that you hope for?
I imagine that, well, first of all, I think a lot of models which go to much longer contexts alternate between windowed attention and full attention, so they kind of have a lot of these mechanisms, right? And maybe these are discovered by accident and this helps, you know, you do architecture research and you discover that this helps, but, whatever, like, why? I don't know. And then I also imagine that these models will learn a lot of things
to kind of preserve the mixing. So for example, if you look into Llama or Gemma, a lot of the heads actually don't do anything. For example, they implement diagonal heads, right? These are quadratic no-ops, essentially. These are expensive no-ops, right? And why is the model learning this? I think it's because it's trying to preserve itself. It's trying to not mix everything all at once. It's trying to still...
I don't know, maybe it wants to, like, it's trying to, yeah, just essentially not overmix in some way, and it can find ways to do that. For example, a lot of heads will just attend to the beginning-of-sequence token, and this beginning-of-sequence token will have norm very close to zero. Again, this is a no-op.
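To make these "expensive no-op" heads concrete, here is a small NumPy sketch (my own illustration, not code from the paper), treating the matrices as stand-ins for post-softmax attention patterns: a diagonal pattern mixes nothing, and a head that only attends to a near-zero-norm beginning-of-sequence token contributes almost nothing.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d = 6, 8
V = rng.normal(size=(n, d))          # per-token value vectors
V[0] = 1e-6 * rng.normal(size=d)     # token 0 plays the role of a near-zero-norm BOS token

# "Diagonal" head: each token attends only to itself.
A_diag = np.eye(n)
print(np.allclose(A_diag @ V, V))    # True: a quadratic-cost operation that mixes nothing

# BOS-sink head: every token attends only to the tiny-norm BOS token.
A_bos = np.zeros((n, n))
A_bos[:, 0] = 1.0
print(np.linalg.norm(A_bos @ V))     # ~0: the head's output is effectively nothing
```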
So there's also, I think, a huge opportunity here, because this is a huge waste, right? And model compression is a huge topic, and so maybe looking at these issues can kind of help with that, right? Like figuring out why is it learning no-ops, you know? This is a huge waste. Yeah, a few thoughts on that. I mean, first of all,
This kind of theory can help us a lot understanding training dynamics and coming up with principled ways to design architectures. But there's an element of sometimes, I mean, even now, if you take Gemini and you trained it a hundred times, presumably sometimes it would work better than other times because of like, you know, whatever the conditions are. And,
maybe sometimes you need to sort of give it a bit more capacity and it might not use that capacity. So when it trains, it will almost kind of like say, no, I'm not going to bother using the stuff over there. And even then, I don't think it makes sense to think about training dynamics in a global way because it's quite input sensitive. So there are certain inputs that might light up more of the network and perhaps in some cases and not in other cases. So it's really complicated.
Yes, I guess that's a good way to summarize it. I think if you ask Razvan Pascanu, who's also on this paper, he will tell you that he thinks you need a lot of heads because, at the start, it's kind of a way to give the model a lot of options. So maybe then it learns that some heads explore in this direction, some heads explore in this other one, and then at some point, heads that it doesn't need shut off.
And they just preserve the ones which, for some reason, found the best kind of setup. So I guess in some sense, you have this super over-parameterized setup, you have maybe 500 heads or even many more than that. And yeah, as you're saying, it's just a kind of diversity strategy at the start, to then try to
Just pick the ones you care about. You were saying that we design these neural network architectures and it's kind of a little bit hacky. So there are multiple opportunities, multiple heads, multiple pathways, residual networks and so on. And I loved how you described it, that there's kind of like expansion and contraction in successive stages. What's the intuition there? Yeah, I guess this is really like...
this spectral intuition that... if you look at the eigenvalues of these triangular matrices, right, there will be only a single eigenvalue equal to one, if you make some slight assumptions, and everything else will be less than one, right? And kind of the effect of this is that if you take powers of this, then
essentially you're just left with the one eigenvalue whose value is one. And so this is kind of the eigenspace that survives. And so this kind of acts like a contraction. So it's still not necessarily... I guess it depends how... I'm being kind of loose with the idea of a contraction. But to me this is contracting in the sense that it's destroying information unless you're in this one specific starting token. And then...
Instead, the MLPs can essentially have whatever Lipschitz constant they want. So MLPs, they can expand you as much as you want. So in this sense, imagine you... And I'm really viewing this as kind of read and write operations. And so imagine your attention now cares about a hundred different things. And so...
each coefficient is roughly 1/100, and now you're copying this information, maybe to a different hidden dimension, within the value vector. So it's just kind of a copying operation. Now everything in there will be scaled by 1/100, right? So this is kind of, you know... Norm, to me, is very important in building very sharp operations. So
If now everything is 1/100, this actually won't have that big of an effect. But now your MLP can take this and multiply everything by 100, and now everything is kind of strong again. And of course now you're still fighting with layer norms and so on, but in some way this gives you a way to... The attention can copy information, but the more it copies, the weaker the effect of this kind of copying operation is. So the MLP can kind of counteract this.
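Here is a small NumPy sketch of both halves of that picture (my own toy illustration, under the big simplification that we just multiply attention matrices together and ignore values, MLPs and residual streams): stacking causal softmax attention concentrates mass on the first position, and attending uniformly over k items scales a copied value by 1/k, which a later gain, playing the role of the MLP, can undo.

```python
import numpy as np

rng = np.random.default_rng(0)
n, n_layers = 8, 50

def causal_softmax(scores):
    """Row-wise softmax under a causal (lower-triangular) mask."""
    masked = np.where(np.tril(np.ones((n, n))) == 1, scores, -np.inf)
    e = np.exp(masked - masked.max(axis=1, keepdims=True))
    return e / e.sum(axis=1, keepdims=True)

# Product of per-layer causal attention matrices.
P = np.eye(n)
for _ in range(n_layers):
    P = causal_softmax(rng.normal(size=(n, n))) @ P

# Each row of P is a distribution over positions; after many layers almost all of the
# mass ends up on position 0, the "in the limit you only care about the first token" effect.
print(np.round(P[-1], 3))

# Contraction vs. expansion: a uniform read over k items scales a copied value by 1/k,
# and a sufficiently large gain later on (the MLP's job in the real model) can restore it.
k, value = 100, 5.0
copied = value / k
restored = copied * k
print(copied, restored)
```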
And this is kind of the way I view these two operations. It's like, if you want to copy many things, you probably want some component that's able to kind of crank up the knob and try to make this copying more strong in some way. When you cited Carlini, and certainly in most of the experiments you did, it was quite a low entropy sequence, which is to say it was ones or zeros or something like that.
How much of an effect does the number of different types of tokens, or the amount of entropy in the sequence, how much does that matter? This matters. And I feel like it's also due to how tokenization works. I think you have to be very careful when you're doing these experiments. For example, you usually want to put a space between tokens so that the tokenizer, most of them, will kind of
see a space and then give you a new token, right? So you really have to be careful about this. And then also the kind of entropy of the sequence matters. And I think this just makes it very complicated, right? So I believe we do have experiments that check that this still holds for, like,
essentially random or arbitrary sequences, but you just want to be kind of... They definitely do have an effect. So for some reason, I found that digits such as 9, or larger digits will have usually larger magnitudes for some reason. It kind of makes sense, it's what it's trying to do. But all of these different things can really have an effect on how quickly you converge to...
how quickly these two sequences converge to each other. And can we explain what we mean by that? So if I understand correctly, you've got two sequences and then you do the L1 like difference of the softmax and you're saying that that kind of converges towards like, you know, epsilon, some arbitrary small number. I guess what we really study is like
you care about the representation of the final token at the last layer, because this is what you slap your linear projection on and you extract your next token, right? So the key idea here is that if you take two sequences, and the representation of the last token at the last layer for these two sequences gets arbitrarily close,
then you run into issues because now they get mapped to the same thing. And...
Essentially what you care about is the norm between these two things. You want to show that, for this family of sequences, if you keep growing and growing, at some point this norm goes below an arbitrary epsilon, essentially. That makes sense. So we want to have self-attention transformers that have the representational fidelity to distinguish differences in these situations. Right.
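As a sanity check on that intuition, here is a minimal NumPy sketch (a single toy attention layer with made-up embeddings, not the paper's experimental setup): the final-token representations of two sequences whose counts of ones differ by one get closer as the length grows, and once the gap falls below float16 resolution the two inputs can become literally indistinguishable.

```python
import numpy as np

rng = np.random.default_rng(0)
d = 64
e_one, e_zero = rng.normal(size=d), rng.normal(size=d)          # toy token embeddings
Wq, Wk, Wv = (rng.normal(size=(d, d)) / np.sqrt(d) for _ in range(3))

def last_token_repr(n_ones):
    """Final-token output of one softmax attention layer for '1' * n_ones + '0'."""
    L = n_ones + 1
    X = np.vstack([np.tile(e_one, (n_ones, 1)), e_zero[None]])
    X = X + 0.1 * (np.arange(L) / L)[:, None]                   # crude positional signal
    q = X[-1] @ Wq
    K, V = X @ Wk, X @ Wv
    s = q @ K.T / np.sqrt(d)                                    # the last token attends over everything
    w = np.exp(s - s.max())
    w /= w.sum()
    return w @ V

# Two sequences whose "how many ones?" answers differ by one, as in the construction above.
for n in [10, 100, 1_000, 10_000, 30_000]:
    a, b = last_token_repr(n), last_token_repr(n + 1)
    gap = np.linalg.norm(a - b)
    same_fp16 = np.array_equal(a.astype(np.float16), b.astype(np.float16))
    print(f"n={n:6d}  ||h_a - h_b|| = {gap:.2e}  identical after float16 cast: {same_fp16}")
```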
Here's another interesting thing. So you bring in this idea of numerical precision. So you say we've got a problem, right? So we have these sequences and they get longer and perhaps the sequences are quite low entropy to start with. But eventually, depending on the numerical precision of the neural network, so it might be FP16, it might be quantized,
at some point we kind of dip below the threshold where the network can distinguish between the two things. Yeah, yeah, yeah, that's exactly it. The fact that we have really heavy quantization in a lot of these models at the moment, I mean, they go to much less than 16 bits. This really, to me, points towards... I mean, I think it's understandable that when you quantize models you lose performance. I think anyone would probably expect this, but
I think this points to very kind of mechanical issues, like, you know, if you really quantize a lot, at some point you even have sequences that maybe were distinguishable at higher precision, but now they're indistinguishable and you make a mistake by force, right? So there's this kind of very mechanical complaint, in some way, about quantization.
Yeah, because doesn't this become catastrophic quite quickly because your research shows, and by the way, Liron, he's the Doom Debate guy. He keeps saying, give me examples of things that LLMs can't do. Copying and counting. Yeah.
They can't do more than 100 elements. Frontier models can't do more than that. And this is like before you quantize them. So now we quantize them down to 4-bit or whatever, and they just collapse straight away. This is a huge problem. And Liron would say, this isn't a problem. We just need to use tools, man. Yeah, sure. What's the problem with that? I guess this depends who you are and what you're kind of,
interested in. I mean, I'm interested in, well, first of all, understanding what's going on, but also like, I think a model should, you know, if we claim that, you know, we have AGI or whatever, you know, like I think a model should
should be able to do simple things that a human can do, right? And so there's this kind of argument, like, okay, it would be nice. And then of course you can equip models with tools, and this is very successful. But I guess one of the main points we make in our paper, to kind of argue against this directly without just saying "it would be nice", is that you often just want to copy things into the tools, right? And so if you can't even do this,
then you probably have a problem, right? So somehow counting, sure, could be solved with tools, but copying is something that's super fundamental and you want to be very robust at doing this. So you also did two very interesting experiments. One was with chain of thought and one was where
And this actually has a bit of a history. Do you remember, GPT-3 had a problem with numbers because of the byte pair encoding or whatever, and you put spaces between the numbers and it works better. So one of the experiments you did was you just kind of interleaved the sequences. Yeah. And that helped, weirdly. Why did that help? Actually, it was not weird. This was us trying to debug our theory in some sense. And it's like,
The intuition there with over-squashing, for example, is that if you're trying to copy the zero at the end, and you know now that the zero is... Being at the end means that it's kind of just very problematic now because you lose this idea that there's a zero. Now, if you add more zeros in the sequence,
this fact that there's a zero is not lost as quickly. In some sense, it's a way to counter the ones dominating the representation. So now adding these zeros perhaps unsurprisingly helps not lose the zero in the representation.
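A quick back-of-the-envelope version of that argument, using uniform attention mass as a crude stand-in for the real mechanism (toy numbers, my own illustration):

```python
# Share of (uniform) attention mass carried by the zero token(s): a rough proxy for
# how much of "there is a zero" survives in the final representation.
for n in [10, 100, 1_000, 10_000]:
    single_zero = 1 / (n + 1)     # n ones with a single zero at the end: the share decays to nothing
    m = max(n // 10, 1)           # instead, interleave a zero after every ten ones
    interleaved = m / (n + m)     # the share stays at roughly 1/11 however long the sequence gets
    print(f"n={n:6d}  single zero: {single_zero:.4%}   interleaved zeros: {interleaved:.2%}")
```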
What I think was very nice to see was that copying the first token seemed to be much easier than the last one because then this is really kind of showing this idea of the paths, right? The representation of the zero when you look at these partial derivatives is much more influential when it's at the start rather than at the end. This to me is like really like, okay, this is like we have built some kind of understanding here, right? Because I think prior to...
without thinking of over-squashing these paths, this seems absurd, right? It's super random behavior. Why is this happening? So I kind of like the intuition from these experiments. And folks should definitely look at the paper. I mean, just kind of verbalizing it, as I remember, there was kind of a task where you told it to do something at the beginning and at the end, and
you found that as you increased the sequence length, it was still successfully attending to the beginning for quite a long time, but at the end, it dropped off very quickly. And one of the things as well you did was, I think, zero-shot and then chain-of-thought prompting, because all of your experiments were basically showing that
when the sequence length got to even a fairly small number, I think even past 10 or something, it would just kind of go crazy. And if you were counting, you would expect to see a monotonic, increasing set of results, but it just collapsed very quickly. How did chain of thought help? The experiments we did were, like, maybe the most simple one is just asking it to sum one plus one plus one, K times, right? And,
I guess, okay, I think it's kind of notorious that these models suck at this kind of task, which to me is quite disturbing, but anyway.
With Chain of Thought, the idea was we were trying to ask it to break it down, right? So maybe do it in groups of five, right? And so on. But then you very quickly see that it essentially starts diverging and just coming up again with nonsensical answers. And this was not surprising to me. It was just like...
It was quite expected to me actually because there's something about this repeated nature of the task, right? Where it's like I think the model just gets very kind of quickly mixed up with like what it's keeping track of and it's just like there's so many one plus ones, right? And it just becomes just extremely hard for it. So there was this amazing plot and I think it was a counting task. Right, yeah. And
What it showed was this big mode smack in the middle. Could you explain that? So what we did is just very simple. We essentially fed the model, can you sum 1 plus 1, so on, 5, 10, 15, up to 200 times.
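A rough sketch of that sweep (not the authors' actual harness; ask_model below is a hypothetical placeholder for whichever LLM API you would call):

```python
def make_prompt(k: int) -> str:
    # "1 + 1 + ... + 1 = " with k ones; the ground-truth answer is simply k.
    return " + ".join(["1"] * k) + " = "

def ask_model(prompt: str) -> int:
    # Hypothetical placeholder: call your LLM of choice here and parse an integer from its reply.
    raise NotImplementedError

for k in range(5, 201, 5):
    prompt = make_prompt(k)
    print(k, repr(prompt[:24] + "..."))
    # answer = ask_model(prompt)
    # print(k, answer, answer == k)   # per the discussion above, answers stop tracking k past a point
```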
And what you see is that it's okay around 20-ish, and then it starts outputting numbers that are not even multiples of 5 and so on. And then once you go past a certain size, it just likes to output 100. And this to us was kind of pointing towards the fact that it's
probably not mechanically counting. And I think there's been like actually some very recent preprints out on like this kind of behavior that these models are not like probably not implementing algorithms, right? They're implementing like, I think in this paper they call it bags of heuristics, right? Like they're implementing like kind of
heuristics that, you know, at least when they're training, are good enough to fit, right? But then as soon as they go out of distribution, they generalize horribly, and they always output 100, because 100 seems to me like a pretty plausible common answer for a sequence that's a big sum of ones, right? Somehow 100 makes sense, right? At least from a training set point of view. So this is,
I think it's a pretty surprising but also not surprising result that this occurs. Yes, and that paper you mentioned is "Arithmetic without algorithms: language models solve math with a bag of heuristics." Right, yeah.
When I read your paper, there was this beautiful word that you used. I had to look it up because I've not seen it before. I think it was subitizing. Right, yeah. And children apparently, they learn to sort of like, you know, that's five apples. And they learn a pattern for counting. So they're not actually counting. We can just quickly kind of do that. And the suggestion is that language models are doing the same thing. Like it makes sense that you care about rough estimates more than like precise numbers. I mean...
I'm not sure I really understand the difference between 100 and 103 but I can understand the difference between 10 and 100, right? Like at least, you know. So I think it makes sense. And I guess this kind of goes to the question like I'm not even sure how I count, right? Like I don't really know how robust my idea of counting is in my head, right? It's like, okay, I can mechanically do it but...
Yeah, but it makes sense that raw estimates are probably quite important for children and probably for language models. So in the introduction, you did a bit of a literature review on some of the LLM skepticism papers, and you cited Vinyals talking about how under certain conditions,
self-attention transformers can be Turing complete. Right, under which conditions can they be Turing complete? Right, so I think that paper in particular is not really formal about this, actually, but there are papers which are, and I think they come from folks who take
a more computer-science approach, right? Like looking at these models through automata and so on. And usually the assumptions you make are about hard attention. So this means that you assume that your attention is one or zero, right? And this is impossible in practice, right? But it makes sense why this is a useful kind of paradigm. I think people relax this usually, or sometimes, to having
I'm not sure exactly, but it's like softer hard attention, where instead of having just a delta, at some point you have one over k, right? So you can attend perfectly uniformly, right? And so I think many of these results involve these constructions because they're simply easier to treat mathematically.
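To illustrate the two idealisations being contrasted here (my own toy example): a hard attention row puts all of its mass on one position, the relaxed version spreads 1/k uniformly over k positions, and a finite-temperature softmax can only approach the hard case, never reach it exactly.

```python
import numpy as np

k = 5
scores = np.array([3.0, 1.0, 0.5, 0.2, 0.1])

hard = np.eye(k)[np.argmax(scores)]   # idealised hard attention: a one-hot row
uniform = np.full(k, 1 / k)           # the "1/k over the selected positions" relaxation

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

# Real softmax attention with finite scores is strictly positive everywhere, so it can get
# arbitrarily close to the hard row by sharpening the scores but never equals it exactly.
for temperature in [1.0, 0.1, 0.01]:
    print(temperature, np.round(softmax(scores / temperature), 4))
print("hard:   ", hard)
print("uniform:", uniform)
```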
But shouldn't it be impossible in principle for a self-attention transformer to be Turing complete if it's finite precision and if it's doing a fixed amount of computation? So they also have these assumptions on memory growing with sequence length, usually, and on precision growing. By memory I roughly mean hidden dimension times floating-point precision; this is kind of the memory, how many bits you have, right, in a transformer.
And I guess also, this times the number of tokens you have, right? But usually, I think, a lot of people study this in the paradigm where your precision increases with sequence length, logarithmically or whatever. Obviously, precision is not increasing, right? Precision is what it is; it's what you choose it to be. So, yeah, I spoke with Schmidhuber about this. Of course, there was that Siegelmann paper in '95 where she showed that if you have an infinite-precision RNN, then of course it can mimic a Turing machine, right? And if you have infinite precision, you don't have representational collapse, and
like you can go as close as you want, you don't care, right? Because you can always distinguish the two, right? So it makes sense that, you know, in some sense these results are quite related, right? Like how this kind of
What I think the nice thing about our results other than... You know, these results in Automata, formal languages and so on, I mean, there's this very nice line of work on this programming language called RASP. It's a programming language that implements operations transformers can do. There's this kind of working conjecture on like how if
you want to implement something which is a short RASP program, then your transformer can learn it relatively well, right? And generalize better. So there's this kind of nice relationship between, essentially, programming languages and transformers. But in some sense, I like our results because we give a quantity that you can directly measure and say, this is happening in a transformer. Whereas these other results are more
constructive, which is obviously super interesting, but they're more like constructive kinds of proofs that you can represent this class of languages or so on. So they're kind of
very related but very different approaches to similar issues in some way. Yeah, there are so many ways to come at this. And by the way, I love that Chomsky hierarchy paper, is it Grégoire Delétang? Right, yes, yes, yes. I think he mapped all of the various different models, like LSTMs and transformers, to the automata hierarchy. I think he said the transformers were... I think quite at the bottom. Yeah, I think RNNs could learn...
Like, you know, counting languages. Yeah, yeah, yeah. But you also cited Peng talking about the limits of compositionality and transformers. Right, right. Because that must trivially be true, surely. Like a language model couldn't generalize Mary loves John to Mary loves Jane.
because it doesn't have that invertibility. Everything gets kind of spread out into all of these circuits for want of a better word. Yes, yes. I think there's also been work, for example, that showed that even just switching the order of... If you ask a language model to solve first-order logic, like just...
chains of implications. The order in which these are presented to the transformer, so whether the chain is fed in in its natural order or not, greatly affects whether it can solve these kinds of problems. So yeah, all these results point to
pretty bad generalization habits that our current models have. So what kind of modifications could we make, you know, practical modifications to overcome this collapse of representations? Vanishing gradients, I think, are quite related to this. And this is something that, you know, transformers are in some way a solution, you know, like
some way transformers arise naturally after you discover that RNNs have this kind of bias. But then, we point out, even causal attention mechanisms implement this kind of bias in some way. And there's something, I think, just hard about having long sequences. You have to essentially take a large amount of information and try to fit it in a finite number of bits, right? So I think
This is just a hard problem and probably the best solution would be to have ways to compress information and somehow just be left with less and have your technique just care about less things, right? So in some way, windowed attention and these kind of processes
I think are quite well motivated in this aspect. I mean, in your bones, deep in your bones, do you lean connectionist or do you lean towards we need to have a hybrid, neuro-symbolic approach? I'm a big fan of having hybrid approaches and
Actually, very recently, I think, maybe... I mean, now there's the Chess World Championship going on, and Petar and some other amazing researchers released some kind of component in Gemini that's able to play chess. And it's actually surprisingly strong, and it's still a language model. And it does have its own kind of
specialized chess component, right? So this is...
I think this is just a very principled approach. Like, imagine you could have your base language model, which is really like this orchestrator, and then you could have maybe a specialized math unit or a specialized chess unit and so on. You could kind of... Ideally, you keep the base model fixed and you can swap out and improve your chess component at some point. I think this, to me, sounds...
super exciting and we know that it kind of works. I mean in some sense vision language models, you know, they have a specialized vision component, they have a specialized language component usually and so I think this works and of course visual language models are trained in their own kind of way so it's a bit tricky but like this to me sounds quite nice. Like in some sense you have this stream of tokens and these can be generated by very specialized things but then you have this
kind of base model that's processing the stream. But these tokens could come from whatever you want, essentially. It could be even not just... It could be like an RNN or something. If you find that RNNs are better at counting, maybe you could use those as like your, you know, maths component or whatever.
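A very hand-wavy sketch of that "frozen orchestrator plus swappable specialists" shape (entirely hypothetical; the routing scheme, the specialist names and their interfaces below are made up for illustration):

```python
from typing import Callable, Dict, Optional

# Hypothetical specialist components that feed tokens into the base model's stream.
def chess_specialist(query: str) -> str:
    return "<move suggested by a dedicated chess component>"

def math_specialist(query: str) -> str:
    return "<result from a dedicated, exact arithmetic component>"

def base_model(query: str) -> str:
    # Stand-in for the frozen base language model acting as the orchestrator.
    return f"<base model answer to: {query}>"

SPECIALISTS: Dict[str, Callable[[str], str]] = {
    "chess": chess_specialist,
    "math": math_specialist,
}

def orchestrate(query: str, domain: Optional[str] = None) -> str:
    """Route to a specialist when one exists, otherwise fall back to the base model.
    Specialists can be swapped or upgraded without retraining the base model."""
    handler = SPECIALISTS.get(domain, base_model)
    return handler(query)

print(orchestrate("What is 123456 * 789?", domain="math"))
print(orchestrate("Summarise this conversation.", domain=None))
```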
What other trade-offs are there? I love this formal approach, because it can give us solutions that are explainable, robust in some ways, verifiable and so on. But when you mix and match these things together, of course it's very difficult to achieve learning. And it's kind of somehow bottlenecked by
how we designed the, you know, the System 2 component, if you like. So there's no clean and easy way to build them together. I agree. I think this is why we don't do this at the moment; it's just pretty hard. I just think that transformers have pretty fundamental limitations at the moment, and so it makes sense to try to use other models as well.
But yeah, it's just super hard to do. And maybe once we find a better way to do it, then this will become much easier. What is your definition of reasoning? I think reasoning is just very ill-defined to me. And like, I can talk to you about length generalization, which I view as like some kind of
subcomponent of reasoning. I mean, to me, reasoning is just a very kind of broad idea of like, we care about generalization and robust generalization. But then there's like very different... Like for example, a computer program will generalize arbitrarily well if you write it correctly and we can prove that it's perfect and so on.
Is it reasoning? I'm not sure. A human... I mean, there's this... I think a very nice example of this, I believe it was a Numberphile video. They set off with a bunch of people and they're like, let's compute pi to, I don't know how many digits, right? And they got a ton of people, they spent a week doing this, and then I think the first time they pretty much... they had a mistake somewhere in the middle, right? And...
You know, and so this to me points to like, it's not surprising, like in some sense, even humans as tasks grow, I mean, I'm sure you know how to sort an array or whatever, but like good luck sorting an array that has
a million digits, you know, you'll probably make a mistake somewhere, right? And so are you then not reasoning? I'm not sure, right? So there's this kind of weird trade-off where it's like, yes, we care about length generalization, but then we're kind of bad at length generalization, right? We're not good at processing large amounts of data. So maybe if we just care about length generalization, we're not reasoning. So that's a very long-winded way of saying, I think maybe reasoning is not
super well defined, but we can do things that are well defined. And I think it's fine if language models have a different definition of reasoning to humans, because they are fundamentally computers, right? They are built on a computer, and computers are good at things that humans are terrible at. This is why I think stuff like counting and whatever should be doable for them, because they are fundamentally computers. And so I think reasoning for humans and machines, it's...
We care about different things, right? So... Chollet's coming here at the weekend. And, you know, he wrote this Measure of Intelligence paper. And he never uses the word reasoning, which is interesting. And there was a tweet I was reading this morning. And, you know, so he thinks intelligence is knowledge-acquisition efficiency. Right, yes. And knowledge acquisition for him is all about this kind of creative, adaptive search for novelty. Right. But he's always at pains to stress that...
just performing a skill isn't intelligent and you're not doing reasoning. So a chess computer isn't reasoning. But AlphaGo is reasoning, or AlphaZero, I should say. And there's something about the creativity. It's because it's actually kind of trying things and it's acquiring new knowledge that perhaps the creators of the system didn't program into it. I'm really into chess, right? And I don't understand this comparison because...
What AlphaZero, for example, or... You know, AlphaZero has produced very novel ideas that computers don't find, for example, right? Like these kind of pushing the H pawns and so on, that now are used all the time by grandmasters, right? And they are mechanically doing very similar things. Like both work on this kind of tree search idea, both...
they just have a different way of computing heuristics. In Stockfish, for example, they will... I'm not sure about all the details, but my understanding is that essentially a bunch of people just put in a ton of heuristics in there that maybe are somehow extracted from statistics or so on. You want your knights on average not beyond this. But then these
human heuristics will just bias the tree search, right? And then now if you don't have these human heuristics in like alpha zero, it will find something different. But this is not surprising to me. And somehow I think that associating this finding novel ideas
not as being creative, but rather it's just using a function that's not aligned with what humans necessarily think is... And even Stockfish will find completely crazy ideas, but somehow the fact that
there's still, in some way, a human touch in Stockfish that's not there in AlphaZero. And I'm not sure, for example, if you can then say that a human playing chess is reasoning, right? Humans will be much less methodical when they think about a position. They will have a ton of moves that appear in their head, and you can't even explain why, right? But a computer doing a tree search
can roughly tell you why, like, okay, because deep down, you know. So I'm not sure I necessarily agree with this chess comparison, but maybe it's because I'm too addicted to chess. But yeah, to me, it's just that you're guiding
through a heuristic that's not human. So then it feels creative. I mean, if we were to design an optimally intelligent reasoning system, I mean, what would it look like to you? Defining reasoning is important, but I think I would just like it to in some way behave like a computer in the sense of like, it doesn't make...
mistakes which are trivial in some sense. It doesn't make mistakes that are trivial, but it also doesn't make conceptual mistakes. This to me is really what these language models can do much better than humans. If you can leverage the fact that they are machines, and also leverage the fact that we can make them behave, or align them to think, like humans, I think this is essentially creating a super powerful combination which
for a human would just be completely out of reach. When I was speaking with Neil, I was slightly skeptical about the Golden Gate Claude thing, right? So it said...
"Okay, take an abstract feature. Go back to the corpus. Here are all of the tokens that maximally activate that feature." To me, it looked like a bit of a keyword matching scheme. This is where we have this kind of reasoning chauvinism. We think when we do stuff, it's reasoning, and when language models do it, it's not reasoning.
I think the meta component of reasoning is really important. So as you say, we could just do this tree search and from a functional perspective, it's as if it's doing reasoning because it's found all of these weird and wonderful trajectories and patterns and it's doing reasoning. Hey, but when we do reasoning, we're looking at abstract analogies in the real world. So we're saying the universe has these patterns, just like a kaleidoscope or this composition of knowledge being mixed together and so on. And when we look at a chessboard, we see this weird analogy with bananas or something.
and we're kind of doing that multi-domain analogical reasoning. And that feels like a very creative jump, so we call that reasoning. But when we see a computer do it in a simplistic way, we kind of say that's not reasoning. There's this illusion of reasoning that comes from... Humans are extremely good at finding extremely complex patterns which are kind of unexplainable, and somehow these jumps feel like
reasoning. But I think chess is actually a great way to think about this. For example, there's this kind of very notorious experiment where you ask
grandmasters or whatever, like very strong chess players, to look at a position and then try to recite it from memory, right? So you'd look at it for, I don't know, 20 seconds, and then take a fresh board and put all the pieces back, right? And then they asked amateurs to do this as well, and the grandmasters were much, much better at
reconstructing the position. But then, these positions all came from games, right? These are positions that arise from playing chess. And instead, when they gave them random positions, they were equal, right? So it's not that grandmasters have crazy memory or whatever. Grandmasters have a much more compressed representation of the world,
because they've played and seen so many chess positions, they're able to compress structures, right? So if I have pawns on f7, g6, and h7, and a bishop on g7, and I'm castled, you know, this is something I recognize immediately as a fianchetto, right? And this is
one thing now, right? Whereas someone who's never played chess has no concept of this structure. Or, I've seen this Sicilian pawn structure a gazillion times, I'm just remembering this. Or, this is the position that arises from the Sveshnikov or whatever, right? This is one concept now. And it's just because I'm allocating so much more of my brain to compressing complicated,
seemingly noisy information, but it's not actually noisy. It's something that I've just seen 10,000 times. Yeah, but then there's the question of, it feels like, you know, we have all of this experience, and then there's this diffused concept formation that you speak of. And then the intuition isn't reasoning, it's memorization, surely. But it's not a simple case, though, of
One day a chess player has a flash of inspiration and they just define this new piece of knowledge and then from that point on they can just intuitively grab that knowledge. It's a really diffused thing where you have all of these diverse experiences and like, you know, this concept emerges. It might happen memetically, it might happen spread out over lots and lots of different people. Yeah, it's somehow magical, right? It's the magical power of human brains, right? That
You don't know why one day you look at this position and you come up with this brilliant move that you maybe would have not even considered. But I guess my point is like, it's... This only happens once you have an extremely strong compressed representation. Then you can...
somehow ignore noise. Essentially, this gives you more capacity to compute things without thinking, oh, all these pieces are here, right? You just look at it, and you can just compute with a much clearer kind of representation, essentially. And this is when brilliancies occur, and this is when these magical moves pop out. So are you a fan of the flash of inspiration?
Once you have these kind of moments where like everything clicks, it's kind of like you've climbed a little step now in like your understanding. And I think
at least from my experience, these kinds of moments only come once a lot of things are right, at least for humans. You need a lot of things to be right for you to come up with these kinds of next steps. So you need to be very much super focused on this task, and then you need to have good sleep, and then you need to just be in good health. And then you can play, maybe, one day at just a much better level or whatever. You can do something at a much higher level.
And then I think this kind of gradually accumulates as this kind of like sequence of inspiration moments or whatever that then like kind of build up on each other. But yeah, it's for sure not like something you can choose or like something that...
doesn't come unless you invest a ton of time into something, I think. Carl Friston's got a beautiful word for this. He calls it epistemic foraging. You know, which is, it's a little bit like, you know, over time we just discover this knowledge and the cloud of knowledge just increases. Federico, it's been an honor to have you on. Thank you so much. Yeah, thank you so much.