So basically, at a very early stage during training, the train accuracy or train metric (it can be something other than accuracy) grows and then starts plateauing. So you will think, okay, training is done, but actually the test metric is still not at a high point. So if you actually keep training much longer than you normally would, then although the train metric still plateaus, the test one will suddenly start
increasing; your network will start assembling its weights together so that it can extrapolate on test samples. So basically, the high-level summary is that grokking was observed in some very specific settings: specific tasks, specific models, specific weight initializations.
Here, what we try to look at are more general settings and try to see if there is the same delayed property emergence happening. So it turns out that it's not happening in terms of train and test accuracy, right? Those two things evolve roughly at the same rate and plateau roughly at the same time. But when you start looking at
test accuracy under adversarial noise, then that's when you have the emergence of robustness to adversarial noise on test images, much after you reach this plateauing area of the clean train and clean test accuracy. First, I want to also thank Imtiaz, because he's been pushing this really hard and that's how this paper got to this stage. When you have a neural network which has any piecewise-defined non-linearity, so it can be ReLUs, can be leaky ReLUs,
can be sawtooth as well, anything that's piecewise linear. So if you have a neural network with such nonlinearities, then what it basically does is cut up the input space into linear regions and then linearly map each region to the output. So basically, that's what the theory says: you have this spline where the spline boundaries are defined by the network weights, and the mapping is also defined by the network weights. A lot like an origami.
But the origami is not just an origami; there's a little bit of stretching as well. So I would say it's elastorigami that the neural network is doing. You give it an input space, an input domain, and it's going to play origami and turn it into some warped space, and then draw a line to cut it up. The last line is going to be, because the output is linear, a hyperplane in the embedding space. So it's going to cut this origami up with a straight line such that the decision boundaries separate the classes.
But this is basically what a neural network is doing. We've seen in experiments that for some samples you would see more partitioning around them, which would mean that there's more non-linearity needed there, or that the function is actually more complex. Then there are some parts of the space where the cutting is sparser, so that's a harder region. And where there's more cutting up, it's more prone to adversarial attacks, because if you move a little bit, you'd basically be crossing more non-linearities, so more of the activations would be changing in the network.
We take a couple of different slices, 2D slices, to get an estimate of what the partition statistics are. The more important thing is estimating the partition statistics around the locality of a sample. And these statistics can basically tell you how the model behaves differently for different samples. For example, if you have a couple of samples which come from a demographic that has less expressivity assigned to it by the neural network,
you can basically say that the neural network is algorithmically biased towards those samples, because it is not that expressive in those regions, whereas for other samples or a particular class it would be assigning more regions. Based on the statistics, you can infer how the model behaves differently for different sets of samples. Instead of coming up with a general notion of what a neural network does, this allows us to think of what a neural network does
in different parts of the data manifold. So these notions, I think, would help us converge towards a more general understanding of neural networks than just a one-liner statement that this is what a neural network does. One of the basic results in spline function approximation, and mostly it's been studied for 1D-to-1D regression, is that when you design a spline, you have to pick your partition of the space
and the degree of the polynomial that you use on each region. And if you think about it, you might say, okay, piecewise affine is really not the best: you just have an affine mapping per region, which is not rich enough to represent a lot of variety within each region. But it turns out that if you have to choose between
positioning your regions well, so finding a good partition, versus increasing the degree of your polynomial, it's actually much better to fine-tune or position your partition according to the data, even if you stay piecewise affine. Just from this you can get extremely good approximation power, although
within each region you have a really simple function. And the approximation power you get this way is much better than if you keep the partition the same and just increase the degree of the polynomial. So from this, what you should take away is that
not having smoothness and being just piecewise affine is okay and actually is optimal if you can position those regions well. And that's what you do with deep networks. Basically you have this adaptation of the partition when you train the parameters of your model because the partition and the affine mappings are tied together. So by learning one, you learn the other.
And the way it works in practice is that we see most of the regions get concentrated around the training points, and also extrapolated based on the rule given by the architecture. And if you go to other parts of the space which are neither near the data distribution nor near the extrapolation rule, then you have much less region concentration. And this is really important because, again, the smaller the regions, the more precise your approximation is.
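A minimal 1D sketch of this free-knot point (an illustrative toy, not from any paper): with the same number of knots, a piecewise-affine fit with adaptively placed breakpoints beats one on a uniform partition.

```python
import numpy as np

# Target: flat on [0, 0.5], then one full sine cycle on [0.5, 1].
def target(x):
    return np.where(x < 0.5, 0.0, np.sin(4 * np.pi * (x - 0.5)))

grid = np.linspace(0.0, 1.0, 2000)

def piecewise_linear_mse(knots):
    fit = np.interp(grid, knots, target(knots))   # piecewise-affine interpolation through the knots
    return float(np.mean((fit - target(grid)) ** 2))

uniform  = np.linspace(0.0, 1.0, 12)                           # fixed, evenly spaced partition
adaptive = np.concatenate([[0.0], np.linspace(0.5, 1.0, 11)])  # same budget, knots where the target bends
print(piecewise_linear_mse(uniform), piecewise_linear_mse(adaptive))  # the adaptive error is clearly smaller
```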
MLST is sponsored by CentML, which is the compute platform specifically optimized for AI workloads. They support all of the latest open-source language models out of the box, like Llama, for example. You can just choose the pricing point, choose the model that you want, it spins up, it's elastic autoscale. You can pay on consumption, essentially, or you can have a model which is always working, or it can be freeze-dried when you're not using it. So what are you waiting for? Go to centml.ai and sign up now.
Tufa Labs is a new AI research lab I'm starting in Zurich. It is funded by PASS Ventures, which is involved in AI as well. We are hiring both chief scientists and deep learning engineers and researchers. And so we are a Swiss version of DeepSeek,
so a small group of people, very motivated, very hardworking, and we try to do some AI research, starting with LLMs and o1-style models. We want to investigate, reverse-engineer, and explore the techniques ourselves.
Professor Randall Balestriero, welcome back to MLST, and congratulations on your new role at Brown. Thanks, thanks very much, and happy to be back, especially to speak about the latest research and splines, so very happy to be here.
You invented or co-invented perhaps the spline theory of neural networks, which is something that revolutionized my understanding of deep learning. I think we should just have a quick refresher. So what do we mean by the spline theory of neural networks?
Yes, so first just to put some things in a better context and avoid some issues, legal issues later. So splines have a very rich theory and they've been used probably since the 80s or maybe even before for function approximation.
But the thing is that most of the research was done for 1D, 2D, or maybe 3D input spaces, because that's where most of the function approximation was needed, maybe for partial differential equations and things you could observe. So what we did
around 2018 was to try to understand what current deep networks are, and at that time it was mostly convolutional networks, ResNets, or MLPs with ReLU activations, max pooling, this type of non-linearity. And it turns out that when you have this type of operation, so an affine operation like a
dense mapping or convolution, then ReLU or MaxPooling and you keep interleaving and composing those layers, then the entire input-output mapping is itself a continuous piecewise affine spline. So what that means is that the input space of the network, so maybe the space of images if you do like MNIST or CIFAR classification, so this is a huge high dimensional space,
It is cut up into polytopal convex regions, and within each of those regions, your network is just an affine mapping. So overall, it's continuous, but within a region, it's just an affine mapping. And so you can characterize the geometry of those regions, where do you have more regions than in other places, and you can start to understand what your network is actually learning and why is it able to extrapolate, for example, or what is the impact of the architecture, regularization, and things like this.
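To make the "affine within each region" point concrete, here is a small sketch (an illustration, not the authors' code) that recovers the exact affine map a ReLU MLP applies on the region containing a given input, from the activation pattern at that input:

```python
import numpy as np

rng = np.random.default_rng(0)
# Hypothetical tiny ReLU MLP, 2 -> 16 -> 16 -> 3, with random weights purely for illustration.
Ws = [rng.standard_normal((16, 2)), rng.standard_normal((16, 16)), rng.standard_normal((3, 16))]
bs = [rng.standard_normal(16), rng.standard_normal(16), rng.standard_normal(3)]

def forward(x):
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        h = np.maximum(W @ h + b, 0.0)               # ReLU
    return Ws[-1] @ h + bs[-1]

def local_affine(x):
    """Return (A, c) with forward(z) == A @ z + c for every z in the region containing x."""
    A, c = np.eye(x.shape[0]), np.zeros(x.shape[0])
    h = x
    for W, b in zip(Ws[:-1], bs[:-1]):
        pre = W @ h + b
        D = np.diag((pre > 0).astype(float))         # activation pattern: which side of each hyperplane
        A, c = D @ W @ A, D @ (W @ c + b)
        h = np.maximum(pre, 0.0)
    return Ws[-1] @ A, Ws[-1] @ c + bs[-1]

x = rng.standard_normal(2)
A, c = local_affine(x)
assert np.allclose(forward(x), A @ x + c)            # within its region, the network is exactly affine
```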
There's some quite technical language here as well. So I just really want to hit home to the audience that it's a bit like, you know, when we train a neural network, we train it to become a bit like a honeycomb. So there's this structure,
a bit like a lattice maybe, and inside the holes of the honeycomb, those represent the decisions that the neural network makes. Yeah, exactly. Maybe another parallel we can do is with k-means, for example, or some clustering method. This way you learn a
partition of your space depending on which cluster your point is assigned to. So you have this sort of structure with regions. So if you do k-means, of course the mapping is piecewise constant, right? Within the region or the cluster, you are assigned to the same cluster. Then when you go to the next region, you are assigned to another cluster. So this is the type of k-means partition geometry you get.
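A tiny sketch of the k-means analogy just described (illustrative only): a set of centroids already defines a partition of the input space with a piecewise-constant output per cell, whereas a ReLU network's partition carries a piecewise-affine output.

```python
import numpy as np

centroids = np.array([[0.0, 0.0], [3.0, 1.0], [-2.0, 4.0]])   # e.g. learned by k-means beforehand

def kmeans_region(x):
    """Index of the Voronoi cell (region) that x falls into."""
    return int(np.argmin(((centroids - x) ** 2).sum(axis=1)))

# Within a cell the output (the assigned centroid) is constant and jumps at cell boundaries,
# whereas a ReLU network also partitions the space but stays continuous and affine per region.
print(kmeans_region(np.array([2.5, 0.5])))   # -> 1
```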
With deep nets, it's very similar. You have those regions as well, but within each region, instead of being constant, you have an affine mapping, and then you have some extra constraints that the regions are not independent from each other, and they can learn even if you are very far in extrapolative regime. But yes, the type of geometry you get is really akin to what you have with K-means or KNN and things like this. The key thing, though, with deep nets, and that's one of the key benefits,
is that how you learn your partition or how you learn those regions is not just where you have data, but everywhere in the space. And that's how you get a much better extrapolation performance even in high dimension, which you don't if you do like KNN for example. One of the operative reasons I like spline theory is that there are a lot of people saying that neural networks do this magical emergent reasoning.
But when you understand the neural network as computing these spline partition boundaries, so essentially a little bit like a locality-sensitive hashing table,
that's not really conducive to them reasoning or doing any different kind of computation. Yes, so I think there are two points from this. So first, as you said, you have this sort of locality-sensitive hashing or template matching. So you just try to fit locally what the region geometry and the affine mapping are. So you could think this is really ad hoc, you know, a brutal rule, but where you actually have some sort of reasoning or intelligence emerging is because the way you learn in one
part of the space impacts how you learn in another part of the space, even if you don't have data there. And I think that's where things go from really ad hoc, like KNN or k-means, to something that is much more complicated and maybe much more human-like, where from one example you learn some things that you will be able to reuse on another example that you did not see during training and that is very, very far away in your space. So it's really
not ad hoc in the sense that it's able to extrapolate from hidden rules in a very efficient way. Brilliant. Well, now we're going to get to the most exciting bit today, I think, which is you've got a paper, and you wrote this with Imtiaz. And we've actually got some great content with Imtiaz coming out as well. But it's called Deep Networks Always Grok.
And Here Is Why. Give us the elevator pitch. Yes. So yeah, first I want to also thank Imtiaz, because he's been pushing this really hard and that's how this paper got to this stage. And unfortunately, he could not be here for visa reasons, but yeah, huge kudos to Imtiaz.
So basically, the high-level summary is that grokking was observed in some very specific settings as a delayed emergence of generalization. So your test set accuracy starts growing much after your train accuracy has already plateaued. But this was in very specific settings: specific tasks, specific models, specific weight initializations.
Here, what we try to look at are more general settings like CIFAR, so computer vision tasks, convolutional networks, ResNets, and try to see if there is the same delayed property emergence happening. So it turns out that it's not happening in terms of train and test accuracy, right? Those two things evolve roughly at the same rate and plateau roughly at the same time. But when you start looking at
test accuracy under adversarial noise, then that's when you have the emergence of robustness to adversarial noise on test images, much after you reach this plateauing area of the clean train and clean test accuracy. So you do have adversarial grokking happening in most of the settings that we found, and this
comes although you don't do any adversarial training at all. It's just a consequence of very, very long training and the emergence of those sparse solutions, where some new geometric properties of your network naturally emerge. Maybe we should just linger on a couple of things here. First of all, let's do adversarial robustness. Yeah.
So what happens when a neural network is not robust? Yeah, so basically what we do, and it's a very standard way to attack a network: you get an image, an input, whatever, it can be something other than an image, you feed it through your network, and then, based on gradient information, you see what is the best noise direction to add to the original input so that you can fool the network into predicting an incorrect class. So this is a white-box attack. You use the gradient information of your network.
And this way you are able to fool the network so you can get, for example, a train set where the network on clean images has maybe 100% accuracy. And just because of those really small perturbations that you cannot see by eye, you can reduce its accuracy to just random guessing. So it's a really, really efficient attack. And a lot of people try to make networks robust to that through adversarial training. So during training, they actually sample those attacks and try to make the network robust to it.
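For reference, a hedged sketch of the kind of white-box, gradient-based attack being described (an FGSM-style step; the paper may use stronger attacks such as PGD):

```python
import torch
import torch.nn.functional as F

def fgsm_attack(model, x, y, eps=8 / 255):
    """One-step white-box attack: move the input in the direction that increases the loss."""
    x = x.clone().detach().requires_grad_(True)
    loss = F.cross_entropy(model(x), y)               # loss on the (clean) input
    grad = torch.autograd.grad(loss, x)[0]            # gradient with respect to the input, not the weights
    return (x + eps * grad.sign()).clamp(0.0, 1.0).detach()

# Usage, assuming `model`, `images`, `labels` exist:
# adv = fgsm_attack(model, images, labels)
# robust_acc = (model(adv).argmax(dim=1) == labels).float().mean()
```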
And here we show that actually you become robust to those naturally just through very long training episodes, basically. Yes. Now, I always looked at the spline theory as a kind of mental intuition pump for this adversarial examples problem. So we'll show a graphic on the screen now. But essentially, when you look at a spline partition, there are just...
crisscrossing, overlapping spline boundaries. It's a complete mess. It's very, very chaotic. And this is the reason why the networks are so brittle, because it's so easy just to push a test example over a boundary so that the network behaves differently. Yeah, exactly. So basically, when you see in this figure all the partition regions, what you have to think about is that when your input
moves a little bit and goes from one region to another, that's when your mapping has a non-linearity, and so you have a change in the prediction. And so the more of those regions you have, the more non-linear kinks or points you will have, and therefore it's much easier to perturb your network.
But when you reach this very long training stage, and so this sparse solution, what happens is that you no longer have so many regions everywhere in the space and around your training points; instead, regions start migrating away from the training points and testing points to concentrate near the decision boundary. And now, because you have much wider regions around those points, it means your network is just affine within a much bigger region of the space.
and therefore is just an affine mapping and it's much easier to control its sensitivity to noise. So you're saying, and maybe we should actually just introduce grokking first as well and then we can come back to this. So grokking is this phenomenon of delayed generalization. Now I know that's an oversimplification because whenever I speak to grokking people and say that, they say no, that's an oversimplification. But basically you train a network for ridiculously longer than you normally would and then stuff happens later in training. Can you explain that?
Yeah.
Yeah, so basically what they did is on specific settings, most of them are maybe constructed or simple tasks. You can show that at very early stage during training, the train accuracy or train metric, it can be something else than accuracy, grows and then starts plateauing. So you will think, OK, training is done, I can stop my training there. But actually, the test metric, again, it can be accuracy or something else, is still either near random or just very
slightly higher but still not at a high point. So if you actually keep training, as you said, much longer than you normally would, then although the train metric still plateaus, the test one will suddenly start increasing
and converge on its own later on. Which means that during training, you have still some gradient information that makes your weights change. They don't have any impact on the train metric, but eventually at one point your network will start assembling its weights together so that it can extrapolate on test samples.
Okay. And there's an interesting broader story here because I know you're very interested in the learning dynamics. And there are different stages of training. So at the beginning, the network learns quite simple features. And then as we progress through the training stages, it learns increasingly complex features. Now, this is kind of weird though, right? Because what you're showing is a local decomplexification. Because I imagine that things would get more complex. It's learning high-frequency information.
but the network is learning a local decomplexification to kind of stretch out those boundaries. And then there's the question of, well, what learning signal is it using to do that? Yeah, exactly. So basically, as you said, you have this two-stage training, a bit like in a double descent dynamic.
where during the first stage, you start from random initialization and your network focuses aggressively on the points, using a lot of regions around them. So it's sort of a memorization. It's still able to extrapolate, but it's really, really focusing on the actual points.
So lots of regions near them and very smooth mapping and no simplification anywhere in the space. But then once you reach this stage and you keep training again like much longer than what you will usually do, what happens is that there is still some gradient information that
is conveyed through your loss and the network will, as you said, decomplexify around those points. And so the regions will start to migrate and move away and instead they will focus on where the decision boundary is. So all the allocated parameters are trying to
fit the decision boundary very precisely instead of remembering where those points are. And because of this, the radius of the regions will increase around those points and therefore you will get adversarial robustness and you will reach this stage where your network goes from something really smooth, almost like uniformly smooth, to being piecewise constant, which is what you should converge to in the limit theoretically.
So these are really nice properties that naturally emerge, although you have to train for a very, very, very long time for it to happen.
Yes, yes. I suppose you could argue it both ways, whether it's complexification or decomplexification. But the remarkable thing, as you say, it moves from looking at the training examples to looking at the regions. And there's this great figure that I'm going to show again on the screen now, where after this kind of grokking phase has happened, you get these partitions emerge. And it looks a little bit like a topology map or a contour map or something like that.
And these partitions look very much like a Voronoi diagram. So it looks like the partitions are equidistant, the boundaries are equidistant between the points. And it looks like a mountain structure. It's not a single decision boundary. It's actually many that have been squashed together. Yeah, exactly. So when you see those huge concentrations of partition regions, all of those are squashed together around the decision boundary to bring the
representation capacity to go from one class to the other. And this is what you want, right? Because when you are near a point and in its neighborhood, you just want a simple affine or even a constant mapping. You don't need to put a lot of parameters there, a lot of regions. But instead, where you want to put all your regions is where the decision boundary is, because this is where you need to go from one class to the other. So you need actual curvature in your mapping.
So this transition is really the key that goes from uniform smoothness in the space to piecewise constant mapping. And this is what brings adversarial robustness as well. And again, this emerged only after very, very long training, because as you said before, it's not the first thing that the network learns, right? So it's sort of a hidden solution that emerges after very long training, because maybe the gradient norm to reach there is very small, or maybe because it's just fighting the implicit bias of your architecture.
Because, as we also showed, as a function of the strength of the regularization, you can control the rate at which this emergence happens. So basically, the more you regularize with things like weight decay, the more you fight this sparse solution, and therefore it will maybe not happen at all, or happen very
late during training, or even later. This is related to sparsity, right? So we know when we do iterative magnitude pruning, we train a dense network and then we pull out all of the weights that have low magnitude, and the sparse network, strangely, is more robust, even though we've taken away most of the weights.
And isn't it interesting that after this grokking phenomenon, the network we get resembles a sparse network? Yes, yes. So it's exactly related. So in another paper, we showed that you can actually do pruning in a really informed way as a way to simplify the partition to make it focus on the decision boundary rather than the point. And you can prove a relation between pruning, the impact on the geometry of the partition, and even other methods like collapse in the rank of your parameters
and this type of regularizer. So basically, there is a one-to-one correspondence between all those things. And the nice thing is that the partition gives you a single geometric object where you can visualize all of those and understand when and why they are beneficial. But yeah, exactly. It's very related to like this lottery ticket hypothesis and like
iterative pruning. In a way, when you do iterative pruning, what you probably do is you first start with this complicated solution with a uniform distribution of the regions, but then you aggressively remove most of the ones that are useless for your task, so probably the ones near your points. So you bring the model closer to this later
stage, but through active pruning of your model parameters or units. How is this phenomenon related to double descent? Yes, so this is one of the things also related to neural collapse that we mentioned before. So basically there is this dynamic that at first there is this sort of
memorization, and then you start to learn to extrapolate, except that here we don't look at it in terms of capacity as the number of parameters; it's more the geometry of the partition. But you have the exact same thing. So if you look at this local complexity measure that we derive,
you can see the dynamics of the migration of the region. So when you look at the beginning of training and when you start having an increase in train and test accuracy, you have a lot of regions that get concentrated near the points. And this is the first ascent that you get after the first descent. And this is where you get no robustness at all. So this is where people will stop training because your train and test accuracy look good. But then you don't get any robustness.
If you keep training, then suddenly the regions migrate away. So you see this second descent happening. And this is when eventually robustness emerges, because again, those regions moved away. So the ones near the training points and test points have a bigger radius and therefore you get robustness. But this is
giving you a new way to look at things like double descent, or just at training dynamics, through the lens of the geometry of the partition, which is really nice, because until now most of the things that were studied were looking at the loss function or train and test accuracy, which are maybe very task-specific or very
black box in the sense that you just look at the network F and you don't really dive into it. But now you have this new way to look at what's happening within your network in terms of this partition. And so one of the big future questions is to try to re-derive most of those results in terms of the geometric properties of the partition.
Right now, we use regularizers because, I mean, certainly this was more the old school view, but by kind of deliberately constraining networks to be simple, they train better. But what we're talking about now is that actually we want to have a type of complexity.
I think you said to me earlier that if we're not careful about our regularizers, we might not even get this effect in the first place. Yeah, exactly. So first of all, there are many ways to regularize a network, like implicit and explicit regularization, and even methods like batch normalization, for example, that we may think, okay, it's just a way to make
training easier because you have normalization, actually it acts as a regularizer as well. So in this other paper, we show that if you have batch norm, for example, you actually actively try to concentrate the regions near your training points. So when you employ techniques like this, you will actually fight this sparse solution and therefore you cannot get there at all or maybe you can get there through even longer training episodes.
And as you said, when you use other things, so like weight decay, for example, this has also a strong bias. You say, OK, a solution near zero is what brings your model to a nice solution. And usually, all those regularizers try to enforce smoothness in your mapping in the L2 sense of it.
But because of this, you don't try to get to a piecewise constant solution, which is the one that will give you adversarial robustness and which is the one that you eventually reach in this adversarial grokking phase. So that's why there are also a lot of implicit biases that we put in through standard regularization that maybe we need to rethink if we want to speed up the emergence of adversarial grokking, for example.
One of the things that we'll talk about today is that neural networks do incredibly interesting things later on in the training dynamics. And this is great, right, if you're Meta or if you're Google or something, because they've got these big GPU clusters and they can train neural networks far beyond what's within reach for the normal person. So we want to capture that behavior and we want the neural networks to do it earlier.
I know you've already done some work with, I think, didn't you build a geometrically inspired regularizer that kind of made the boundaries orthogonal to each other? Could we design a regularizer that would encourage this grokking behavior sooner? Yes, that's a very good question. So first, as you mentioned,
So right now we reach this adversarial robustness through very long training and because we don't actually do adversarial training, the robustness we get seems to be very good across different types of adversarial attacks. So that's a really beneficial property that people want.
But as you said, the problem is that because you need to train for so long, right now it's not a solution that everyone can get access to. And that's a really big limitation. So that's why one big axis of research is how to speed up the emergence of adversarial grokking so that everyone can get access to it.
And then to your second point, yes, there are a lot of ways we can build regularizers based on the geometric understanding of the partition. So for example, one thing that is really easy to compute, even for very, very large networks, is the distance from a point to the nearest boundary of the region it lives in. So this
quantity, this distance, is differentiable with respect to the parameters of the model, and it's really fast to obtain, which means that you can actually use it as a regularizer during training. So this is just one example, but there are a lot of ways you can derive a differentiable regularizer that can be used to enforce some constraint on your partition.
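A minimal sketch of this kind of geometric regularizer (an illustrative proxy, not the exact quantity from the paper): for a ReLU unit, |pre-activation| / ||weight row|| is the distance of the current features to that unit's hyperplane, so encouraging it to be large pushes region boundaries away from the training points.

```python
import torch

def boundary_margin(linear_layers, x):
    """Smallest distance from each sample's features to any unit hyperplane, over the given layers."""
    margins, h = [], x
    for layer in linear_layers:                                # e.g. the nn.Linear layers of a ReLU MLP
        pre = layer(h)
        dist = pre.abs() / (layer.weight.norm(dim=1) + 1e-8)   # distance to each unit's hyperplane
        margins.append(dist.min(dim=1).values)
        h = torch.relu(pre)
    return torch.stack(margins, dim=1).min(dim=1).values

# Hypothetical training step: loss = task_loss - lam * boundary_margin(model.hidden_linears, inputs).mean()
# i.e. reward larger regions around the training points.
```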
Another thing is that you could actually not use a regularizer, but try to build the architecture with those constraints enforced. So one typical example: suppose you have no biases at all in your network, then the type of partition you get is constrained to be central, so all the regions
are cones going away from zero to infinity, and this is a hard constraint. So if you don't have any biases, then you constrain your partition to always look like this, and therefore you don't need to have an extra regularizer. So there are a lot of ways you can either build a hard constraint into your architecture or parametrization of the weights, or build a differentiable regularizer that you can use during training.
Yeah, absolutely. I mean, after speaking with Sarah last week as well, she had this paper out about, you know, the EU AI Act and the executive order. And they had this hard absolutist limit on the number of compute FLOPs, which is a tally of the number of compute operations that we do. And the thing is, we're in the regime now where people...
People just think, oh, you know, there's a commensurate relationship between capabilities and compute, so let's throw more compute at it. And what we need to be doing is smart flops. But further, we need people like you that actually have a theory of neural networks. Because had you not had this spline theory of neural networks, it wouldn't be possible for you to have the mental model and to design new regularizers in a principled way. It's so important.
Yeah, exactly. I think one of the key benefits of using splines is that not only does it give you some theoretical guarantees and solutions, but also the type of things you can visualize are really easy to interpret. So even if you've never used splines before, it's really easy to look at those figures of the partition,
the regions and see why this partition is better than this one for this specific task or in terms of robustness. So the good thing is to not just have a theoretical understanding of deep networks, but have something that any non-expert can visualize and act on and use to actually do better training of state-of-the-art models. So you want it to be tractable, interpretable and easy for everyone to get on board with basically.
And as you said, from this you can get really strong insights. For example, through this paper, we see that if you have an allocated number of FLOPs, the common wisdom, or at least what most of the people may do, is just to say, okay, let me try to fit the biggest model
that I can on my GPU and then whatever remaining amount of flops I have I will just use it for training time. But from this paper what you see is that actually depending on the type of application you want, if you want for example adversarial robustness, you may want to use a much smaller model but allocate flops for training time instead. And when you do this different flop allocation as you said like smart flop allocation, then you will get a model that is adversarially robust
after this long training episode. So that's why, through this, you should bring back questions about, okay, where do you allocate your FLOPs based on what properties you want your network to have. Talking about your results a little bit, how does it change depending on the type of problem, the dataset and so on? What have you seen? Yeah, so there are a few trends that we saw. So for example, as
the size of the dataset increases or decreases, or as a function of the noise you have in your labels, the emergence of grokking can be delayed or not. So those are things to keep in mind as well. And those are things that could be used later to also try to speed up grokking. So for example, if you have maybe a smart curriculum for training, or if you have a sort of teacher-student training, that's the type of thing you could try to do to speed up the emergence of grokking. And now there are more and more papers trying to look at this.
Because again, all of those things are intertwined. So if you think back again about the partition, you have to think about the partition as being adaptive to the dataset that you have in hand. And that's one of the beauties of the splines that deep networks use, as opposed to standard splines: now the partition really adapts, through gradient descent on your weights,
to your data distribution and the loss that you have. So if you change how your points are distributed, the number of points, the dimension in which they live, or just the parametrization of your network, it will impact the partition that you learn and therefore will impact the geometric properties that are either beneficial or not for you. So all of those things are intertwined, and the beauty of the spline
interpretation of deep nets is that you can understand this relationship precisely, and you can get provable guarantees on why you get this partition and not another one. So now you can make informed decisions on how to act or how to parameterize your model to reach a state that is good for your downstream task.
Yeah, very interesting. I mean, I think now might be a good time to talk about your local complexity measure. And the only reason I bring this up is that, when we've got this measure of complexity and grokking, we can change the hyperparameters, you know, like the architecture and so on, to optimize this effect. I mean, maybe downstream we can actually have a
principled way of designing the architecture to optimize grokking. But how did you design that complexity measure? Yeah, exactly. So basically what we are trying to do is look at how many regions there are, or how complicated the partition is, near a specific point.
So a good proxy for that is simply to count the number of regions that are nearby. And as you mentioned before, this is roughly equivalent to how many bits of information your network has in this neighborhood. And basically, what we do is a proxy to get that number of regions very quickly, even if you use a very large model, which is based on per-layer counting of the number of vertices that are in an epsilon ball. And then we do some ablations to show that this is a good proxy for the number of regions.
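A hedged sketch in the spirit of that proxy (an illustrative estimator, not the paper's exact measure): sample points in a small ball around an input and count, per layer, how many units change sign across the samples, i.e. how many hyperplanes cut the neighborhood.

```python
import torch

@torch.no_grad()
def local_complexity(linear_layers, x, eps=0.05, n_samples=64):
    """Rough count of how many unit hyperplanes cut a small ball around the (flattened) input x."""
    noise = torch.randn(n_samples, x.numel(), device=x.device) * eps
    h = x.flatten().unsqueeze(0) + noise                  # samples in an epsilon ball around x
    total = 0
    for layer in linear_layers:                           # e.g. the nn.Linear layers of a ReLU MLP
        pre = layer(h)
        signs = pre > 0                                   # activation pattern of each sample
        crossed = signs.any(dim=0) & (~signs).any(dim=0)  # units whose hyperplane separates the samples
        total += int(crossed.sum())
        h = torch.relu(pre)
    return total                                          # larger value = more regions packed around x
```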
But this is what we measure. And basically what you see in those local complexity graphs is that this number has different dynamics during training. And the very interesting thing that you observe is that it peaks when the train and test accuracy reach the plateauing phase, but then it starts decreasing much before you get the adversarial grokking
happening, which means that this metric is actually much better. It's more sensitive to training dynamics and what happens in terms of your deep network geometry, because this metric starts to show, as you said before, this decomplexification around the points
much before you get the adversarial grokking. So this is really nice because you have a sensitive metric that tells you how the geometry of your model evolves. And this also opens a whole new door to, for example, new ways to do early stopping or to
understand whether your network is done training or not. Because even if you are not interested in this adversarial grokking, let's say you just want to know when the first stage of training happens, you can just look at this metric and ask, okay, when is it plateauing? And when you reach this first
plateau area, you know you can stop training there, although you never computed train or test accuracy. So you see that a proxy metric that characterizes the geometry of your partition is actually all you need to really understand the stage your network is in, and whether you need to stop training or change some hyperparameters or architectures or things like this.
And in fact, that's something that some people have been using before. So there is this paper by Atlas Wang from UT Austin, where they show that you can do neural architecture search just by looking at some statistics about the partition. So this is informative enough to do neural architecture search and get a good model to train on. Interesting.
Why do neural networks kind of learn low complexity, low frequency features first? Yeah, so this is a good question. So there is a lot of work on this implicit bias or simplicity bias where you just try to learn the simplest rule first, like the spurious correlation things, right? And that's one thing that people are trying to fight now
a lot, because of course you think you learned a good solution, but actually you just learned a shortcut solution that maybe will put you in a really bad situation later once you deploy your model. And that's something that is a really active area. So there are some cases where we can explain why this is the case. So for example, if you learn by reconstruction, there is this other paper where we show, okay, there is this bias that comes from the dataset, because this
simplicity bias is transferred in terms of which frequencies of your image have the most gradient information. And it turns out those are the low-frequency parts of your image. And you can prove why this is the case when you learn representations by reconstruction. But in the most general setting, I think this is still an open question, intertwining the implicit bias of your architecture, the way we do training, and maybe even other things like batch normalization
and data augmentation as well. And you implied that high-complexity features are less likely to be shortcut features. What's the intuition for that? Yeah, exactly. So basically, at least for perception tasks, for example, you can show that when you look at high-frequency features,
visually you have much fewer of those spurious correlations, for example between what is the background and what is the actual object in the image, because this information is not present in the high-frequency part of the images. So naturally, if you train on this type of filtered images, you will of course remove the opportunity for the network to learn those shortcut solutions, and it will instead have to focus on what is the actual shape of the object
it tries to classify, instead of just, okay, is the background grass or is the background a nice beach, and therefore you can just from this tell what the object in the image is. So this term emergence comes up quite a lot as well. It's a bit of a woolly word, and people talk about it with respect to grokking, and sometimes it's overestimated because they use log plots and the grokking isn't quite as sudden as people think it is. But does it make sense to use the word emergence?
To me it does because it's a phenomenon that happens on its own. So it's not like after you reach this first stage of training then actively you change the learning rate or you change the regularizer and you make it happen. You don't do anything, you keep training and suddenly this new property emerges on its own. So I think to me the emergence term is actually quite fit for this as long as you don't actively do something to make it happen on its own basically.
Very cool. So I'm looking at this diagram again, which shows, you know, the kind of the honeycomb, the topology map. And this is quite a good example because it shows a real clean partitioning between examples. But I wondered, are there any more complex examples? And in the future, might there be a kind of meta partitioning scheme that kind of like, you know, coarsens it even more? Yeah, exactly. So I think
From this understanding, we know what type of partition we are trying to get to. So the next question is: how can we try to enforce that in your network? And is there a way to either impose that through the parameterization of your model, or through regularization, or through pruning, as you mentioned before? And I think little by little, once we understand what the geometric property you need is,
and how it transfers in terms of the parameters or the weights, now we are at a stage where we are able to say, okay, we can actually derive a method to reach that stage actively and earlier during training. And this, again, through the spline partition, is much more intuitive than if you just treat the network F as a black-box model, right? Because you look at this image, you say, okay, we want to increase the radius of this region. For example, we can pinpoint which units are responsible for each partition
boundary, so we know which ones we need to prune to make the region bigger or not. So all of those things are tied together, and because we have this understanding, it becomes much easier to actually act on the network to reach that solution faster. Amazing. Now, just for the folks at home, I can't impress enough
on you folks how big of a result this is. This is absolutely brilliant. So, you know, I've had loads of adversarial robustness researchers on, like Nicholas Carlini. I've got Andrew Ilyas tomorrow. And for years, people said that this is an intractable problem, that anything you do to fix the robustness doesn't actually fix it and you just reduce the headline accuracy. Yes, yes.
And you've actually kind of proven that you're doing it in an optimal way. Maybe can you explain that? But what do you think this means just for the whole space? Yeah, so this is very interesting. So I think there are many questions to this one. So one thing, first of all, is that here,
as we mentioned before, we don't use strong regularization, right? So maybe a lot of the previous results were under a strong regularization setting. Then that's where maybe it's impossible to get universal robustness or it's much harder to get to it. So here we show maybe a new way where people could try to look at this problem differently.
But also the nice thing here is that we don't do adversarial training, right? So because of this, we don't really overfit the robustness to a specific type of attack. And in a sense, that's also
linked with overfitting: if you train with adversarial examples, you overfit your robustness to that specific adversarial attack, and maybe just because of this overfitting mechanism you become more sensitive to another type of attack. But here we don't do any adversarial training; this robustness comes naturally, and therefore there is no reason to think it overfits to any specific attack, because none was used during training, right? So I think, because it's an
implicit emergence on its own, then by definition it will be much more universal than the active way to get adversarial robustness that people have been using before. So I think this opens a new door to try to look at those results again and see if there is maybe a new way or a new compromise you can get through this implicit emergence.
Amazing. Well, congratulations on this amazing work. And also, my love goes out to Imtiaz. You've both been driving this work for so long now, and I'm so happy that you're finally proving to the world how incredibly important the spline theory is.
Yes, yes, yes. And also the nice thing is that all of those results are not specific to vision or to a specific architecture either. So that's one of the beauties of splines: your network is a spline regardless of the data modality, regardless of the input dimensionality. So whatever insight you get really transfers across a lot of
applications. And so when you derive something new or you have a new paper, you don't solve just one problem, but a whole family of problems. So that's also one of the powers of this sort of theoretical understanding: you can do one proof and it will be useful to many. So it's really efficient.
It might just be worth saying as well that people might think, oh, you know, the spline theory is just talking about MLPs, but I just want to impress upon folks that every neural network is an MLP technically, right? You know, whether it's a transformer, every self-attention layer has an MLP, whether it's a graph convolutional neural network or
whatever, it's all an MLP. Yeah, exactly. If you think about the convolution, you can just think of it as an MLP with a block-circulant matrix, so just constrained parameters, but still you have the same thing: the whole network is an interleaving of affine mappings and non-linearities, and this is true for all the current architectures.
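A small sketch of that point (illustrative only): the dense matrix of a 1-D circular cross-correlation, which is what a convolution layer computes, is just a structurally constrained (circulant) matrix.

```python
import numpy as np

def conv_as_matrix(kernel, n):
    """Dense matrix of a length-n circular cross-correlation with the given kernel."""
    k = np.zeros(n)
    k[:len(kernel)] = kernel
    return np.stack([np.roll(k, i) for i in range(n)])   # row i applies the kernel starting at position i

rng = np.random.default_rng(0)
x, kernel = rng.standard_normal(8), np.array([1.0, -2.0, 0.5])
M = conv_as_matrix(kernel, len(x))
direct = np.array([sum(kernel[j] * x[(i + j) % len(x)] for j in range(len(kernel))) for i in range(len(x))])
assert np.allclose(M @ x, direct)   # same map, just written as a constrained dense matrix
```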
And in fact, in this other paper, the Polarity paper, where we control the quality and diversity of generated samples, we show that you can use the spline formulation and spline results to improve state-of-the-art results on huge architectures. So this is not something that is only for toy examples or small-dimensional settings. This is really something that can give you answers and actionable solutions for state-of-the-art models across modalities.
Very cool, very cool. Okay, so we're going to move on to your next paper.
So your second paper is "Learning by Reconstruction Produces Uninformative Features for Perception". So this is talking about the difference between reconstruction, like an autoencoder, where you're actually reconstructing an image and then doing the mean squared error between the original and the reconstruction, and the other way of doing it, the so-called contrastive and non-contrastive models, where you actually look at the difference in the latent space. Tell us about this paper.
Yeah, exactly. So in this paper, we try to give some answers and some explanation to phenomena that were observed empirically. And two of them are: if you learn a representation by reconstruction, the representation you get is a good baseline, but it's not state of the art and you need some fine-tuning to really bring its quality high for the specific downstream task that you are trying to solve.
And the second observation is that the quality of your representation to solve a task does not align well with how good the reconstructed samples are. And often, even if the reconstructed sample looks good by eye, you still need to keep training for a very long time for the representation to become useful for perception downstream tasks.
So those two observations have been known for a while. And the point is, can we try to explain why? And maybe from that, try to derive better methods later on. And the main takeaway is that because reconstruction methods in input space, so in pixel space for images, use mean squared error, most of the gradient information comes from the low frequencies of the images. And those features are not the ones that are useful for perception tasks.
And there is this nice example where we look at what information is encoded in the low-frequency features and in the high-frequency features. And you can easily see, even by eye, that the low-frequency ones are not enough for me to tell what class this image is from, but the high-frequency ones are enough. So basically, this bias that comes just from the dataset distribution and
its eigenspectrum is something that is copied by the autoencoder, but because that bias is not aligned with our downstream task, you get suboptimal representations.
Yeah, I mean, I'm going to show the figure on the first page now, and it shows this eigenspectrum. And on the right-hand side, it shows the features that the neural network learns first, and they have a bigger mass, so they kind of dominate. And then later on, the higher-frequency features are learned. And there's a couple of examples as well. So the image on the left is the high-frequency features, which is very recognizable. The other one is the low-frequency features, and it's just
Exactly, yeah. So as you said, in this graph we see the distribution of eigenvalues, and on the right side you have the ones corresponding to the low frequencies, which are the ones with the highest eigenvalues. That's where you have most of the energy in your image. And you can show, and that's what we do in the paper, that this is what is going to dominate the gradient information. And therefore, that's what you're going to naturally learn first.
This is what gives you the biggest reduction in terms of mean squared error. So in a sense, you could say if there is one frequency that gives you the biggest reduction in mean squared error, which is it? The answer is it's the low frequency one. So because of this and because we do gradient descent, obviously, we are going to learn this one first. And it's just natural because we just try to copy the bias that is already present in the data set.
So we learn those ones first. And then if you train long enough and if you have enough capacity in your autoencoder, you will start learning the high-frequency details, which are the ones with much smaller amplitude, so much less gradient information. So only then will you learn features that become useful to solve perception tasks, because those are the ones that, even by eye, you can see contain the features that can tell me, okay, this image is this class or this class. Perfect.
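A hedged illustration of that frequency argument (an illustrative sketch, not the paper's code): measure how much of an image's energy, and hence of the mean-squared reconstruction error, sits in each frequency band; for natural images the low-frequency bands dominate, so they dominate the gradient.

```python
import numpy as np

def band_energy(image, n_bands=8):
    """Fraction of the image's spectral energy in n_bands radial frequency bands (low to high)."""
    F = np.fft.fftshift(np.fft.fft2(image))
    H, W = image.shape
    fy = np.fft.fftshift(np.fft.fftfreq(H))[:, None]
    fx = np.fft.fftshift(np.fft.fftfreq(W))[None, :]
    radius = np.sqrt(fx**2 + fy**2)
    edges = np.linspace(0.0, radius.max() + 1e-9, n_bands + 1)
    power = np.abs(F) ** 2
    bands = np.array([power[(radius >= lo) & (radius < hi)].sum() for lo, hi in zip(edges[:-1], edges[1:])])
    return bands / bands.sum()

# For a natural grayscale image `img`, band_energy(img) is typically a steeply decaying profile:
# the low-frequency bands carry most of the energy, hence most of the MSE gradient.
```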
Very interesting. Now, you said there's a new method now. People are starting to use reconstruction, something like an autoencoder, just because it's really easy and people can do it at home. But it does pick up this sort of dataset bias, because many datasets actually are dominated by these low-frequency attributes. But it is dataset-specific, and you can actually add noise
to fix this. Yeah, exactly. So basically it is dataset-specific, because as you saw in this figure, you try to mimic the bias in terms of the eigenspectrum of your dataset. But this eigenspectrum will be different depending on whether you have, for example, a background or not, whether you have different translations in your images, how many classes or objects you have. So this bias, or this misalignment between
reconstruction and perception features, is dataset-specific. And the simpler the dataset, like MNIST or SVHN, the more aligned those two tasks are. Because in a sense, every bit of information in the image becomes useful for both reconstruction and perception, because there is no background, no nuisance variables, no noise. So if you learn to reconstruct, you learn to recognize, basically. But this is not the case when you move to really realistic images, so higher resolution, colors, background,
many varieties of objects, like ImageNet, then that's where the alignment becomes really, really bad. So this is dataset-specific. And as you said, now, for example, people use masked autoencoders a lot, which are a different version of denoising autoencoders with a different noise strategy. So in this case, you don't just take as input to the autoencoder the original image
and try to reconstruct it. Instead, you add some noise, some perturbation, to the original image, and then you try to reconstruct the original image. So you try to denoise the noise distribution that you are using. So in denoising autoencoders, it was common to use additive isotropic Gaussian noise, but now in masked autoencoders, you actually mask big
parts, big blocks of the image. So you see you have a different type of noise strategy. And what we show in the paper is that you can play with this noise strategy to try to counter the bias that you have in the data set. So in a way, you try to say to the network, OK, I know you try to mimic this bias, but let me make your life harder for this part of the spectrum or those type of features so that instead you just focus on the other side
which probably is better for my downstream task. And so that's why, through careful tuning of your noise distribution, you can try to realign the learning-by-reconstruction and learning-for-perception features. But still, it's a really, really active process. And in a sense, how can you do this process if you don't have access to labels a priori, right? So this is a big question. And that's why one of the future research directions is: how can you automatically design a noise distribution?
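As a concrete, purely illustrative sketch of the two corruption strategies being contrasted, additive Gaussian noise versus block masking; the exact parameters here are hypothetical:

```python
import numpy as np

rng = np.random.default_rng(0)

def gaussian_corrupt(image, sigma=0.1):
    """Classic denoising-autoencoder corruption: additive isotropic Gaussian noise."""
    return image + sigma * rng.standard_normal(image.shape)

def block_mask_corrupt(image, block=8, mask_ratio=0.75):
    """Masked-autoencoder-style corruption: zero out a random subset of patches."""
    out = image.copy()
    H, W = image.shape
    for i in range(0, H, block):
        for j in range(0, W, block):
            if rng.random() < mask_ratio:
                out[i:i + block, j:j + block] = 0.0
    return out

# The autoencoder is then trained to map corrupt(image) back to image; which corruption you pick
# decides which parts of the spectrum carry gradient signal, which is the knob discussed above.
```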
Yeah, exactly. Because you were saying earlier that, I mean, for example, something like pink noise, which I think has a log power spectrum, if I understand. So you can actually design the noise spectrum to kind of preferentially focus on, you know, let's say the low frequency features. Yeah, exactly. So for example, if you tell me, okay, for my downstream task, I know what
type of frequencies I should focus on and which ones are useless. Then from this information alone, so you don't really need labels, but still you need this form of like weak supervision, right? But from this information alone, then you can reverse and generate and figure out what is a noise strategy so that when you learn a representation by reconstruction, you will not encode the useless features and only focus on the ones that are useful for your downstream task.
But again, this requires some a priori expert knowledge, and this is not something that is always so easy to do. So here, for example, for perception, we show that, OK, it's actually quite easy: if you focus on the high-frequency details instead of the low-frequency ones, you get better representations. But if you have another downstream task tomorrow, like depth estimation or trying to count the number of trees in an image, it's not really clear if it's just high versus low frequency.
In general, depending on your downstream task, it may be hard to really define what noising strategy makes the most sense for you. And you also need to be able to implement this noising strategy, right? Because if it's too complicated and too involved and it slows down your training a lot, then it's also not something you can use in practice. I understand. So the punchline is, if you use a reconstruction loss, you're basically inheriting a bunch of dataset biases, which are going to mess you up downstream. Therefore, we should use these kinds of contrastive methods.
Some of the audience might need a refresher on what that means. How does that work? Yeah, so in most of those, what I would say, reconstruction-free self-supervised learning method, you observe different views of the original image. So this can be because you apply different data augmentations to it, or you extract adjacent frames in a video, or you have just different viewpoints of the same building. And then what you are trying to do is to get those input images through your network
and compare their representations in the embedding space, and have them get the same representation for all those different views. So this is also a form of comparison, but it happens in the embedding space, instead of trying to reconstruct the original image and compare the reconstruction to the original input.
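A hedged sketch of that comparison-in-embedding-space idea (a simplified toy; real methods such as SimCLR, BYOL or VICReg add contrastive negatives or variance terms to avoid collapse):

```python
import torch
import torch.nn.functional as F

def invariance_loss(encoder, view_a, view_b):
    """Pull the embeddings of two views of the same input together (cosine objective)."""
    za = F.normalize(encoder(view_a), dim=1)
    zb = F.normalize(encoder(view_b), dim=1)
    return (1 - (za * zb).sum(dim=1)).mean()   # 0 when both views map to the same embedding

# Usage, with hypothetical `augment`, `encoder` and a batch `x`:
# loss = invariance_loss(encoder, augment(x), augment(x))
```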
And is the intuition there, as we were saying before, that a trained neural network, after it's reached maturity, tends to focus on the high complexity information? So you get that for free: when you're comparing in the latent space, you're actually focusing in on the type of representation that you want.
Yeah, exactly. So because you work in embedding space, you are allowed to disregard a lot of the details about the input image that you don't need. Of course, this is a function of how you define the data augmentation and how you do the positive pair sampling. But it's much easier to disregard things because you are not trying to compare
to the original pixel space image in terms of mean squared error. So just because of this, you are able to disregard useless information and you are therefore able to control much more easily what features your network is focusing on. Instead, if you try to do reconstruction, the only way you could say, okay, I don't want to focus, for example, on the leaves,
of the tree is to try to come up with a new loss that becomes invariant to that. But this on its own is a huge research program and maybe there is no easy solution or at least no tractable solution. So that's why working in embedding space is a really, really nice proxy and efficient proxy where you can keep using this mean squared error but in this new space where you can easily disregard information about the input.
Brilliant. Third paper, characterizing large language model geometry helps solve toxicity detection and generation. Give us the elevator pitch. Great. So there are two key components in this paper. One goes back to the spline again, where we look at a single layer of an LLM: you can decompose it into two big blocks, the multi-head attention and then the following MLP block. And this is true for each layer of most current LLMs.
So if you look at the MLP block alone of each layer, you can interpret it as a spline again; whether it's using a ReLU activation or a Swish activation, it's the same thing under this spline viewpoint. And you can try to understand, okay, can we characterize the region in which a given prompt falls?
So again, it's a pure geometrical characterization: is the region big or small, and what geometric characteristics does this region have? From this, we derive seven very simple features per MLP block, so the total grows linearly with the number of layers you have, but it stays really small. For example, even for a 70B model, you have about 500 features that fully characterize a given input prompt.
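For intuition, here is a small PyTorch sketch of the kind of per-layer region statistics one could compute for a ReLU layer. These are illustrative quantities in the same spirit, not the paper's actual seven features, and the layer sizes below are made up.

```python
import torch

def region_features(weight, bias, x):
    """Illustrative statistics of the spline region containing x.
    For a ReLU layer, each unit defines a hyperplane w_i . x + b_i = 0;
    the sign pattern identifies the region, and |w_i . x + b_i| / ||w_i||
    is the distance from x to that region wall."""
    pre = x @ weight.T + bias                # per-unit pre-activations
    dist = pre.abs() / weight.norm(dim=1)    # distance to each hyperplane
    return torch.stack([
        (pre > 0).float().mean(),            # fraction of active units
        dist.min(),                          # distance to the closest wall
        dist.mean(),                         # average wall distance
        dist.max(),
        pre.abs().mean(),                    # pre-activation magnitude
    ])

# Usage sketch with a random layer and a random "prompt embedding":
W, b = torch.randn(4096, 1024), torch.randn(4096)
x = torch.randn(1024)
print(region_features(W, b, x))              # small feature vector per layer
```

Concatenating a handful of such numbers per layer gives a compact prompt descriptor of the kind being discussed here.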
And then what we ask is, okay, are those features informative at all about the prompt? And what we find is that even if you do a very simple t-SNE visualization, so really just unsupervised dimensionality reduction into 2D, and look at the distribution of those features, you can see that they are clustered, for example, based on the modality of the prompt. Like which dataset does this prompt come from? Is it about
mathematics, about law, about medical data? So just based on this, those features are already naturally clustered in terms of prompt modality. That's one of the things you get for free. And also, for example, for toxicity detection, you can see that whether or not there is toxicity in the prompt, you get different clusters of those features.
So this is really interesting because it shows that, again, by characterizing the geometry of your partition and of the region in which your prompt falls, suddenly you already have a really strong characterization of what your prompt is about. And again, this can be applied to any pre-trained LLM; it does not need expert knowledge or anything, and it's really easy to extract those features and use them for
different downstream tasks that you may want to do. You can feed them as input to a linear layer if you want to do something other than toxicity detection. You can just train a model to predict whatever quantity you want from those features and you will get a very good baseline.
Yeah, so this is another great example that I think a lot of people now when they do, you know, unsupervised representation learning, they're looking at the space, they're not looking at the geometry, you know, like this partition boundary. So another great example of how the spline theory
really, really helps us here. So you've created a whole bunch of features that statistically describe this geometry. One of the features might be, you know, the average distance from the boundary or something like that. Yeah, exactly. And these features alone, so you run t-SNE on them, you see they work well, but you can also just build a linear probe, so a simple linear classifier or random forest or something like that, and these are significantly more informative
than everything else out there. Yeah, exactly. So one thing we compared to was, for example, the most downloaded models for toxicity detection on Hugging Face. And we compared those to this baseline, as you said, where we just extract those features and train a linear head on top to do toxicity detection. And we see that not only are we able to do the prediction with as low latency as them, but we also get a better detection rate.
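A minimal sketch of such a linear probe, with stand-in random data in place of the actual spline features and toxicity labels:

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import train_test_split

# `features`: one row of geometry-derived features per prompt (a few hundred
# dimensions), `is_toxic`: binary labels. Both are placeholders here.
features = np.random.randn(2000, 500)
is_toxic = np.random.randint(0, 2, size=2000)

X_tr, X_te, y_tr, y_te = train_test_split(features, is_toxic, test_size=0.2)
probe = LogisticRegression(max_iter=1000).fit(X_tr, y_tr)
print("ROC-AUC:", roc_auc_score(y_te, probe.predict_proba(X_te)[:, 1]))
```

The only thing that changes in practice is where `features` comes from: a few hundred geometry-derived numbers per prompt instead of the full hidden states.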
So this is a really competitive solution. And of course, this is really nice because you can reuse a pre-trained model and you can control how you use those features for your downstream tasks. So here it was toxicity detection, but you can really see you can just fit it to anything that you want.
And also, as you said, the benefit of this is that it's very easy to derive those features. Because suppose you don't know anything about your model: each layer has a representation, which can be like 4,000 to 10,000 dimensions, and you have this at each layer for each token. So you don't really know how to make sense of that. And you cannot just say, OK, I'm going to extract all the representations and concatenate them together, because then you have like a million-dimensional
representation of your prompt, which means that if you want to learn a linear probe on it, you will need to play a lot with regularization and feature selection, so it's going to be really cumbersome for you. But here, because of this spline intuition, we know how to derive the most informative features that characterize your prompt. Therefore, we get only a few hundred of them, and you can learn a linear probe even if you have just a few samples for your downstream task. Yeah, that's incredible. I mean, I'm looking at your results now, and I think that the most famous
model for toxicity detection. This is on the Omnitoxet dataset. So this model, martin-ha's toxic-comment model, has been downloaded 1.2 million times, I think, in the last month. And that had an ROC area under the curve of around 73.5%. Yours is...
just with the linear probe on Llama 2 7B, 99.18%, and your latency is commensurate with the best one. Yeah, exactly. And the good thing with this method is that, because you extract the features per layer, you can actually control the trade-off between latency and accuracy by using, for example in this case, only the first three layers' features.
So you get a good latency and still a good accuracy, but you could improve the accuracy by using more layers and increasing the latency. Or if you wanted to reduce the latency even further, you just use the first one or two layers. Maybe you will lose a bit of accuracy, but you will further decrease the latency. So again, because you can control how many features you want to use, you really have a trade-off, as opposed to current solutions, which are,
OK, I'm going to retrain a new LLM and treat this toxicity detection as a new task for LLM. And therefore, you need to use it as a sort of black box detector. Yeah, I mean, I think folks at home should be looking into spline statistics. I mean, you know, for example, there are loads of people, you know, Anthropic did
this sparse autoencoder on language model representations for Claude Sonnet, and they did the Golden Gate Bridge thing, and they were just using the vector space. Presumably if they used spline features it would be even better. Yeah, exactly. And the nice thing with
this interpretation as well is that you can do much more than just toxicity detection. So as they did in this study, you can try to do this, for example, for data filtering or to try to derive, okay, which prompts should you use or not for training or to compare
models even, right? You could try to create a new LLM that is sort of orthogonal to the current one in terms of those features. So there is a lot of things you can do because also those features are differentiable, which means that you can actually use them during training. So suddenly it opens the door to many things.
And those features, you can compute them very quickly on the fly, which means that you can use them from the get-go at every step as an extra regularizer, as an extra training objective. Yeah, I think that's really important. We were talking earlier about, let's say, building a new regularizer. But as you say, the features are differentiable, and you can use them
in many, many different ways. So, you know, interpretability, robustness, sparseness, like many, many parts of the training dynamics can use these features. Yeah, exactly. And so you can use them as regularizer, but you could even use them to try to derive adversarial attacks. So going back to the first topic, because since you can differentiate through them, you could try to say, OK, can you use that to manipulate the prompt to, for example, make it look more toxic or less toxic? Or you can do a lot of things because you get differentiability. So this is a great, great property to have.
Amazing. Okay, so the second part of this was you were looking at the intrinsic subspace with relation to the prompt. Can you tell us about that? Yeah, exactly. So this first part was about trying to understand the MLP block and see what we can do from this geometric understanding. The other part of LLM layer is this multi-head attention, right? So here we try to understand, okay, how is this characterizing geometrically the given input?
And so we derive this nice proxy scalar, which is the intrinsic dimension of the space in which the prompt lives. And in short, you can derive this as a function of the sparsity of the attention that you get. And from this, we can really easily see that
current training prompts have some intrinsic dimension distribution and therefore we can try to create new prompts with increased or decreased intrinsic dimension. And so of course, a priori you might think, okay, this is interesting, but what can you use this for? And one application we found in the paper is that if you actually increase artificially the intrinsic dimension to look like
a point which was far from the training data. Maybe that's a space where your LLM was not RLHFed, so you can bypass the RLHF mechanism and make your LLM generate toxic answers. And in fact, this is really natural because the way you prevent toxic
generation is just during training: you try to say, okay, don't say that here, don't say that here. But does that extrapolate? That was the open question. And here what we show is that once you explore a new part of the space that was not used during training, through this manipulation of the intrinsic dimension, you can make your RLHF'd LLM generate toxic answers, even though with the
normal original prompts it will say, okay, I cannot say this because it's not something I'm allowed to say. Yeah, this is so interesting. So there's a real theme here that we've been talking about, that there is a kind of spectrum of complexity in neural network training dynamics and representations.
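As a rough illustration of the attention-sparsity proxy described above, here is a small PyTorch sketch. The thresholding rule and the averaging are my own simplifications for intuition, not the paper's exact derivation, and the Hugging Face usage in the comments assumes a model run with output_attentions=True.

```python
import torch

def attention_dimension_proxy(attn, eps=1e-3):
    """Rough proxy for the intrinsic dimension of the subspace a prompt spans:
    count the attention weights above a small threshold, averaged over heads
    and query positions. Illustrative stand-in only."""
    # attn: (num_heads, seq_len, seq_len); each row sums to 1 after softmax.
    support = (attn > eps).float().sum(dim=-1)   # effective keys per query
    return support.mean().item()

# Usage sketch with attentions from any Hugging Face transformer:
# out = model(**inputs, output_attentions=True)
# dims = [attention_dimension_proxy(a[0]) for a in out.attentions]
```

The intuition is that prompts whose tokens attend to many related tokens span a higher-dimensional subspace, which is exactly the regime being discussed here.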
And you've got this figure here where you show there's a commensurate relationship between the context length and the intrinsic dimension. And I think a better way of explaining that is the complexity of the representations. Now, RLHF has a kind of complexity
limit at the moment, which means it's only capable of addressing low complexity representations, or things with a small intrinsic subspace. So when you have a really long context length... this is related to the interview I did with the University of Toronto students the other week. They've got this self-attention controllability theorem, and they basically say that as the context length increases, the controllability increases, which means you can make a language model say anything. And that's to do with the fact that you're increasing the intrinsic
complexity of the representation. Yeah, exactly. Because if you think about it, the higher dimensional the space you are working in, the harder it is to control what your LLM is going to do or not, because the number of samples you would need to really control it grows exponentially, right? Unless you have a smart parametrization of how you do this
so that you can learn to extrapolate from a few samples, but that's not the current way this fine-tuning is done. So as you increase the context length, and also as you increase the related concepts between tokens so that there is a really non-sparse self-attention mask, then you go to a part of the space that was never seen before and that is really, really high dimensional. And so it's less likely that the people who created the LLM were able to control what's happening there, because you are just in this
very high dimensional space where no one can guarantee anything at this point. Yeah, now this is a real problem for kind of steerability, alignment, interpretability and so on. I was speaking with Nora Belrose the other day and she was talking about concept scrubbing, you know, so you can scrub concepts out of neural networks and it works really well early in training but as the network complexifies then it kind of adapts and it counteracts
any concept scrubbing that you do. So we have a real problem that as neural networks complexify, we can no longer control what is going on. - Yeah, exactly. And also it's really hard to find a solution to it ourselves by acting on the data, because based on our vision and the way we learn, we can think, okay, we remove this information from the data and that's all we need to do. But actually, because we're in high dimensional spaces, so many things happen between different input dimensions,
maybe that concept is still embedded in other parts of the data, and the network is going to find it if it's a nice shortcut solution. So basically, that's where we have to be really careful: our intuition from visual inspection and two-dimensional reasoning may not scale when you go to really high dimensional spaces, because a lot of other things are going on and the network will pick up on those. So that's where there is a huge need for
provable solutions, because you need to understand what the network is trying to learn, or to control it geometrically to prevent it from learning those shortcut solutions, and also to have guarantees that, okay, now you reach this stage where you have maybe a safe model or anything like this. But it cannot be
done only by acting on the dataset or only empirically, because as we showed here, you can always find a way to go to a new part of the space where nothing was seen before, just because the spaces are gigantic. So you need to have a better parametrization or better control of your network if you want to really have provable guarantees, for RLHF extrapolation in this case.
Yeah, yeah. It's so interesting. I mean, and just to bring it home to folks as well, the example of this jailbreak was, I think, you just go like 8, 8, 8, 8, 8, 8. You just have this huge context length. And then all of a sudden you've rendered the prompt impervious to RLHF, because RLHF has a complexity ceiling, right? So at some point you complexify the prompt and you're now outside
the controllable kind of space of RLHF. So this is a big problem. What can we do to RLHF to fix this? Yeah, exactly. And one thing to add is that this jailbreaking is not specific to one LLM architecture or one LLM setting. In the paper, we showed some things with Llama and Llama 2, but for example, you experimented with ChatGPT and you had the same thing. So it seems to be not specific to the architecture or the way RLHF is implemented.
And so there is really a fundamental problem, which is how to control the behavior of a deep network in a really high dimensional space where you cannot visit all the places in this space. So one way to look at it could be: can we find a better parametrization of the network so that, only by learning from a few examples, we can provably generalize or extrapolate in
many other parts of the space. And this is about finding the right parametrization of the model or the right way to do RLHF in this case. But this is a huge problem which relates to like extrapolation in general and just what to do with really high dimensional data and how to control the behavior of your model everywhere in the space, only from a few training samples. One way to attack the RLHF is to increase the context length.
But one way to increase the success of the attack is actually to not just add the extra context with random tokens, but to add related concepts. So because of the related concepts in the added token, then the sparsity of the attention will be less and therefore you have much higher chance to jailbreak the RLHF mechanism. So also those attacks are not
easily detectable. So for example, if you just prepend A, A, A, A, A many times, maybe it's easy to safeguard the model against that. But if you just add natural English sentences with concepts related to the toxic prompt you are giving, it's really hard to detect, right? Because it's common English, normal sentences, but still this is able to jailbreak the RLHF, and even at a better rate than just with random tokens.
So there is also a lot of work to do in terms of, OK, what is the relation between adding context that is related to the current prompt or just adding random context? And all those things interact together. And this is also a big future research question. Very interesting. So you've been working with Yann LeCun very, very closely for many years at Meta. You're now at Brown.
What is your research plan for the next year? Yes, so my research plan is to try to increase the amount of provable guarantees we have in current learning solutions, whether it's working with text, computer vision, or multimodal datasets. We need to dive deeper into all the things that are happening behind the curtain, like training dynamics, fairness,
or biases that you learn from the data, and rethink the basic things that we've been doing forever, like regularization or just training, so that we can control for those things and give provable answers to users or practitioners. And there are a lot of things to do, because nowadays we don't question anything, right? And it turns out that we need to re-question most of the methods that we're using if we want to make progress in this area. And of course, this can take the shape of using splines,
but there are also many other tools we can use for this. But the goal is that whenever your method is not working, you need to be able to give a precise answer and not just say, okay, try another hyperparameter and come back to me in two days. So everything that we can do to have useful theory and provable guarantees that are tractable enough to be applied to industry-scale problems, this is what we're going to work on over the next few years. Very cool.
Very cool. And we were briefly talking earlier about the DL theory book with Sho Yaida and Dan Roberts. Do you have any broad views on other theories of deep learning? Yes, there are a lot, and many of them have really practical insights. For example, you have this paper from Greg Yang where they show, from a theoretical characterization of training dynamics and of what happens in the network, that
you can do cross-validation on a small network and then, given the hyperparameters that you find, you have a rule to extrapolate them so that they are also the best hyperparameters when you use a bigger model. So you have a lot of really practical things like this that come from theoretical studies of deep networks.
I think right now what we are missing is something that is easily accessible to everyone. A lot of current theoretical studies of deep networks require a huge amount of mathematical background, and they're really not easy to access for people who did not do a PhD in math, or at least a bachelor's and master's in math. And that's why I really like the spline viewpoint,
because you can actually make progress on it even just through visual inspection of the model. So that's why I always try to not just do theory for theory's sake, but so that anyone who reads the paper can learn something from it and train better models tomorrow. So that's my goal and
what I try to focus on, making it very accessible to people. But there are a lot of different theoretical viewpoints on deep learning, and each of them has been trying to come up with new solutions. But now I think we need to try to assemble everything into one that is independent of the modality, independent of the architecture, and accessible to everyone as well.
Amazing. And are you going to publish a paper with Ellie Pavlick in the next year? That's the hope. That's the hope. Yes, yes, yes. I'm a huge fan of Ellie. I had her on the show. I think she's working a lot on negation at the moment in LLMs. Yes. Yeah, yeah, yeah. And it's very interesting to talk to her as well, because the way she thinks about language
and how we learn and how LLMs learn, and what we can try to learn from that, is very interesting and really complementary to trying to use theory and splines to explain LLMs. So that's why I think there will be some really fruitful collaborations in the next year for sure.
I mean, in a way, that's what's good about ICML, but it's what fascinated me about the papers that we talked about today, because you've got this huge background, you know, doing self-supervised learning and vision models and so on. And, you know, the theme of what we were talking about is this kind of spectrum of complexity in the representations and the training dynamics. And it's so interesting how you transferred that into RLHF.
So it seems like a slightly different domain, but science is all about reusing knowledge from different domains and cross-pollination, right? Yeah, exactly. And I think the intuition you get in one modality easily transfers to other modalities, as long as you don't overfit too much to a really specific architecture or a really specific viewpoint. And that's the beauty of splines:
splines happen regardless of modality, so whatever insights you get about a spline partition are going to transfer across datasets, whether it's images or text, etc. And that's how in this paper, which is a very good example, as you said, you had Sarath and Romain who came from the
LLM expertise side and I came from the Spline perspective and together it was very easy to come up with this solution because everything is transferable once you reach this intuition. So that's why people should not be scared to
explore new dimensions, new data modalities or even new architectures. And that's actually how you get the best insights that become complementary at the end. Amazing. Now, if people want to learn more about the spline theory, where would you point them? That's a good question. So I think the best is probably to look at the few last papers that we've been doing with Rich, Imtiaz,
that are about splines. There are a few different ones, focusing for example on generative models, on using splines to do uncertainty quantification, or here on LLMs. Try to find the one that is closest to your current expertise so that there are not too many things to learn at once, only the spline partition and so on. And then reach out to us, obviously,
feel free to send us emails and send us messages on Twitter and all, because splines can be really cryptic if you look at papers from the 80s and 90s, because the way they were thinking about it was very, very different. So don't try to look at a 30-year-old paper with spline approximation and approximation rates, because it will confuse you more than anything. First look at figures, look at current papers,
and then reach out if you have any questions. Cool. And as a bonus question: recently, Kolmogorov-Arnold networks came into the limelight, and they are also a kind of spline approach, aren't they? Yes. Yeah, exactly. So basically, they present an alternative to current MLPs where they hard-code some spline activation functions within the architecture. And that's a really great example of how you can use a priori knowledge to define an architecture that is going to work very well for some specific
problems. So I think most of the problems they were looking at were small scale and low dimensional. And that's where you need real expert knowledge to design what the partition should look like and what type of spline to use, to really get the best from a small training set.
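For readers who want a feel for the "hard-code the spline, learn its coefficients" idea, here is a toy PyTorch sketch of a learnable piecewise-linear activation. Actual KANs place B-spline functions on every edge of the network rather than using a single shared activation, so this is only an illustration of the spirit, with made-up knot counts.

```python
import torch
import torch.nn as nn

class PiecewiseLinearActivation(nn.Module):
    """Toy learnable 1D spline activation: linear interpolation between
    trainable values on a fixed knot grid. Initialized to mimic ReLU."""
    def __init__(self, num_knots=16, x_min=-3.0, x_max=3.0):
        super().__init__()
        self.register_buffer("knots", torch.linspace(x_min, x_max, num_knots))
        self.values = nn.Parameter(torch.relu(self.knots).clone())

    def forward(self, x):
        # Clamp to the knot range, find the surrounding knot interval,
        # and linearly interpolate between the two learned values.
        x = x.clamp(self.knots[0], self.knots[-1])
        step = self.knots[1] - self.knots[0]
        idx = ((x - self.knots[0]) / step).floor().long()
        idx = idx.clamp(max=len(self.knots) - 2)
        w = (x - self.knots[idx]) / step
        return (1 - w) * self.values[idx] + w * self.values[idx + 1]

# act = PiecewiseLinearActivation(); y = act(torch.randn(8, 32))
```

Because the knot grid is fixed and only the values are learned, the partition geometry is largely decided a priori, which is exactly the kind of expert prior being discussed.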
And splines are so far a really nice way to do this, because again, the geometry is visually interpretable: you can see what you need. And if you hard-code most of those things, what it means implicitly is that you need less training time and fewer training samples to actually learn something meaningful. So another area where splines are useful is if you are really an expert
in your domain, your data domain and your downstream task. You can then try to translate this into geometric properties, and splines give you a really nice way to do that, to actually put it in practice and create new architectures and get new models. - Randall Balestriero, it's been an absolute honor and a pleasure to have you back on. Thank you so much. - Likewise, likewise. Thanks very much for the invitation.