We just launched this experiment and were very surprised to see that the hugely overparameterized model not only trains out of the box, with very nice training curves, but also doesn't overfit aggressively at all. And what we found empirically is that we can just use typical supervised training out of the box. We don't have to play with the hyperparameters or the optimizer, and you get very, very stable training.
So this brings us to the question: is it worth it to spend so much money to gather a gigantic pre-training dataset and spend months and many GPUs to produce those models? At least for some applications, it seems to be not much better than random.
MLST is sponsored by Tufa AI Labs. They are like the DeepSeek based in Switzerland. They have an amazing team; you've seen many of the folks on the team. They acquired MindsAI, of course, and did a lot of great work on ARC. They're now working on o1-style models, reasoning, thinking, and test-time computation. The reason you want to work for them is that you get loads of autonomy, you get visibility, and you can publish your research. And they are hiring: as well as ML engineers, they're hiring a chief scientist.
They really, really want to find the best possible person for this role, and they're prepared to pay top dollar as a joining bonus. So if you're interested in working for them as an ML engineer or as their chief scientist, get in touch with Benjamin Crouzier, go to tufalabs.ai, and see what happens.
Originally, the main motivation was to see how much information you gain by doing pre-training. Is this next-token prediction really making your network learn something about language and reasoning? One way to compare this, at least empirically, is to take a randomly initialized model and train it from scratch on a supervised task like sentiment prediction, sentiment analysis. And then, in theory,
because we have a very, very small training set, let's say 20,000 samples, and because those models have something like 7 billion parameters, the pre-trained one should perform very nicely with a little bit of LoRA fine-tuning, because it already knows how to reason about the world, right? Maybe you just adjust it a little to the specific task
that you want, but since you have so much prior knowledge, you will solve the task very easily. The random one, on the other hand, should either overfit completely, because you have 7 billion parameters and only 20,000 training samples, or not learn at all because the
training dynamics will be completely chaotic. And so we just launched this experiment and were very surprised to see that the 7-billion-parameter, hugely overparameterized model not only trains out of the box, with very nice training curves, almost as if you were training on MNIST, but also doesn't overfit aggressively at all. It overfits less than if you just train an MLP on MNIST, basically.
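To make the comparison concrete, here is a minimal sketch of the two conditions, assuming a Hugging Face-style pipeline; the backbone name, the two-label task, and the training details below are illustrative assumptions, not the authors' exact configuration.

```python
# Minimal sketch (not the authors' code): compare (a) the same architecture
# trained from random initialization against (b) its pre-trained weights,
# both fitted on a small supervised classification set (~20k examples).
from transformers import (AutoConfig, AutoTokenizer,
                          AutoModelForSequenceClassification)

backbone = "mistralai/Mistral-7B-v0.1"   # hypothetical choice of 7B backbone

tokenizer = AutoTokenizer.from_pretrained(backbone)
config = AutoConfig.from_pretrained(backbone, num_labels=2)

# (a) Random initialization: same architecture, freshly initialized weights.
scratch_model = AutoModelForSequenceClassification.from_config(config)

# (b) Pre-trained initialization (optionally with LoRA adapters on top).
pretrained_model = AutoModelForSequenceClassification.from_pretrained(
    backbone, num_labels=2)

# Both are then trained with plain cross-entropy on the labelled prompts,
# using an off-the-shelf optimizer and no special hyperparameter tuning.
```

The surprise reported above is that the from-scratch model trains stably on the small labelled set and does not overfit aggressively.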
This was very surprising. From this, we said, okay, maybe there is a deeper question here, which is how much implicit bias you have in those language models. We already knew from computer vision that, for example on ImageNet, you can have a 50-million-parameter model on a one-million-sample dataset. You have this 50-to-1 ratio, and the implicit bias prevents you from overfitting and lets you just solve the task.
But still, it's 50 to 1, which may already sound like a lot to statisticians. Now it's 7 billion to 20,000, so the ratio is gigantic, right? And to me it was very surprising that the scale, the size of this ratio, still
allows you to learn something that does not overfit. This is very surprising because in vision, for example, transformers are known to overfit more easily than ResNets; at least in vision they seem to have less implicit bias or implicit regularization. But with this type of next-token, causal
LLM architecture, you don't seem to overfit easily to your data. So this was quite surprising. Yeah, and we should bring in the name. This was your workshop paper at the Self-Supervised Learning Workshop here at NeurIPS, and it's called "For Perception Tasks, is LLM Pre-Training by Next Token Prediction Worth the Cost?"
So this is absolutely fascinating, right? We've been given this belief that we need these huge pre-trained models, trained on all the data on the internet, and it turns out that, certainly for discrimination tasks, things like classification rather than generation, you can actually just start from scratch with a fairly small model and sometimes get even better results. Yeah, and
with even a small or even a large model, you just start from scratch and do this very simple supervised classification task, right? Okay, given this prompt, is it good or bad sentiment? Or what type of job is the prompt describing? This type of, I will not call it reasoning, but more semantic classification. And it turns out that, yeah, you start from random,
and even if you have a small training dataset, you will have performance that is sometimes as good as a pre-trained model's. So this brings us to the question: is it worth it to spend so much money to gather a gigantic pre-training dataset and spend months on many GPUs to produce those models?
For some cases, for generation, all right, there is no question this is what you need to do: you have your next-token prediction and you learn how to generate samples. But at least for some applications, it seems to be not much better than random. So it's quite interesting. So what are the differences in the learned representations? That's something we did not really look at yet, like a low-dimensional representation of what you learn.
It's possible; some works try to look at the attention entropy and the mechanistic-interpretability viewpoint of LLMs. So it will be interesting to see whether you have this sort of
neural collapse phenomenon happening. Even with 7 billion parameters, maybe you end up learning a very, very simple sub-network that does the task, a bit like the lottery ticket hypothesis, and that naturally emerges from the training dynamics. Or is it really exploiting all the parameters? I think that's one thing. To extend the workshop paper to a conference paper, we want to probe more into which parameters are actually useful and what they learned.
Is each layer actually learning something, or do the first layers not really learn anything and just the last few do? So yes, there are lots of open questions here. What does it tell us about the nature of understanding and maybe even intelligence? Because we think that the reason these things understand is that they have representations of all of these different things in their experience,
and now we can shortcut that, for want of a better word. What does that tell us? - Yeah, I think that's a good question. In this case, we are looking at very specific classification tasks. For example, you have a description of a job: what job is it, does it have good or bad sentiment?
And you are able to solve this well, but you are not able to go out of distribution to answer a new type of question. For example, given this job description, you cannot answer, okay, does this job pay more than that one? Because that was not present in the training data, right? So I think you get very good models cheaply and quickly from random initialization, but they will be very specialized. And the benefit of pre-training may come if you want to do more open-ended
classification or reasoning. So it really depends on the type of application you want to solve, what your downstream task is, and how much you want to generalize to new scenarios. But at least it now shows that it's not the case that pre-training with next-token prediction is better for everything.
So, I mean, going back five years, data scientists used to build specific classification models for everything. And now we're in this regime where we need these really big models, we do in-context learning and maybe even some fine-tuning, and we get them to do fairly specific discriminative tasks. But now you're saying we should almost go back to where we were five years ago and start building specialized models again. Only now, rather than building
classification models, we're still using the transformers and the LLM architectures, but we're making them do specific tasks. Yeah, exactly. I think if you only want to solve a few specific tasks, use this prior knowledge to pick a nice architecture, build a supervised dataset for it, and just train that from scratch. This is probably going to work much better. But again, you need to make sure that the downstream application will never go
too far out of distribution. So that's why it really depends on the application and the type of use cases you have. But I think it at least shows that there exist some tasks for which next-token prediction is not the answer. And in fact, it's not just not the answer: it's not better than random initialization, which is really sort of the worst-case scenario.
Interesting. I mean, from a fairness and bias point of view, a lot of people say that large language models are bad in a way because of the dominance of North American culture and so on. But you could also argue the converse, which is that the good thing about them is that they do have some awareness of values, so we can fine-tune them to have guardrails and to say the right thing and so on. Is that harder to do with this approach? Yes. Here, because you are in a fully supervised setting,
you don't have as much flexibility to change the behavior of your model, or it will have to take the form of supervised fine-tuning. And because you don't have generative capability, it certainly restricts the type of interaction you can have with the model and how you can improve it,
because the output is just: okay, is it a good or bad sentiment? It's not something that gives you a full answer that you can then argue against and generate a fine-tuning dataset from. It's just good, bad, and that's it. Another thing is training strategies. You know, the big players building these LLMs have lots of internalized knowledge around
even the order in which you train the language models; everything matters. Certainly in the old days of basic models, you just stuck a load of data in there and no one really cared. So do people now need to be thinking about that specialized knowledge, maybe about curriculum learning and all of this kind of stuff? Yeah, this is a good point. We did a paper recently called The Fair Language Model Paradox, where we show that when you do this next-token prediction, because some tokens have very low frequency,
it's very hard to train on them and it takes very long training. So it's very wasteful, right? The problem is that because you do next-token prediction, you need to capture the entire distribution of tokens, and so you spend a lot of time on it. But in this setting, if the low-frequency tokens are not useful for solving your task, you don't need to capture them at all. So in terms of training dynamics, this is actually a much simpler problem in many cases. And what we found empirically is that we can just
use typical supervised training out of the box. We don't have to play with the hyperparameters or the optimizer, and you have very, very stable training. So one thing that could be interesting for future work is to see whether this is something that is easier to optimize, and maybe that's why those 7-billion-parameter models can learn without overfitting on 10,000 samples. It also suggests that this on its own could be a better initialization
for next-token prediction as well. This is all very much up in the air, but maybe you could think of a simpler supervised objective that would be a better pre-training
solution, which you could then use for next-token prediction if you wanted to. At least it would be a better starting point than random. So you almost reverse the trend. So we've spoken about two extremes. On one extreme, we have pre-training, and you can use it for any downstream task. On the other extreme, you start from scratch with just one task.
Is there an intermediate solution? What if I did this new approach but multi-task, let's say for five tasks? Yeah, that's a great question. If you really think about it, in the limit you could formulate next-token prediction as a multi-task problem, where each task is predicting whether the next token is a given one or not. So in the extreme case, you recover next-token prediction on one end,
and on the other end you have what we have here: just one task, very coarse and high-level, predict whether it's a good or bad sentiment or whatever. In between, you have a huge spectrum that you can exploit. And if you can find, as you said, maybe five very different, representative tasks, this could be enough to learn a representation that is as general as possible, and then you can use it for new tasks that come along.
So I think the research question is how to design the minimum number of tasks so that you get as diverse a representation as possible. And of course, we don't want to go to the extreme of just doing, again, next-token prediction.
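One rough way to write down this spectrum (a sketch in made-up notation, not taken from the paper): a single coarse task sits at one end, next-token prediction at the other, and a handful of tasks per sample sits in between.

\[
\mathcal{L}_{\text{coarse}}(\theta) = \sum_{n=1}^{N} \ell\big(g_\theta(x_n),\, y_n\big), \qquad y_n \in \{1,\dots,C\},\ C \text{ small},
\]
\[
\mathcal{L}_{\text{NTP}}(\theta) = \sum_{n=1}^{N} \sum_{t} \ell\big(g_\theta(x_{n,<t}),\, x_{n,t}\big), \qquad x_{n,t} \in \mathcal{V},\ |\mathcal{V}| \text{ huge},
\]

Intermediate points choose a small set of tasks \(y_n^{(1)},\dots,y_n^{(K)}\) per sample, e.g. \(K = 5\).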
But this is a very, very nice research question, because if you have this spectrum and you can control where you sit on it, then you can really make a per-use-case choice. It's not: okay, you're always here or always there. Tell me what you want to do and how many new tasks you expect your model to be exposed to, and I'll tell you where you need to be on this spectrum. So this could be very interesting as well. Very cool. It does make me think, though: do
these models understand, or is it naive statistical alignment? And is it possible that the benchmarks we use just don't capture the gap in understanding we lose by moving away from pre-trained models? Yeah, I think because, especially in recent years, we have focused a lot on generative, decoder-only methods, all the evaluation and the objectives we set ourselves are really about good generation, right?
Even if you want to answer a question, you need to generate a good explanation; you need to understand the intermediate steps. And I think the fact that we focus on generative models means that we completely bias the evaluation and the way we approach this. And maybe you could still have knowledge that is learned without being able to generate anything. So I think this is also something that could be interesting to look at, or at least to keep in mind
when we explore those models. But philosophically, isn't generation analogous to thinking in some sense? So aren't models that generate smarter in some deep way? Probably what you want to do is imagine what could be, but I don't think you want to do generation at a very granular level like next-token generation. Because if you think about it, even just in terms of classification tasks,
you have a lot of uncertainty, and it differs from token to token. If I start the sentence, okay, "I saw this movie for ... minutes," there is no way you can tell what the next token after "for" was, right? All you know a priori is that it will be a duration.
Maybe it's one hour, ten minutes, two hours. But do we need to be able to generate the "52 minutes", or whatever the answer was, to understand that I was watching a movie and therefore staying in one place for more than five seconds? So I think the token is way too granular.
If you had something like concept tokens, that's where you could start seeing, okay, this is meaningful, because that's closer to what we do. But right now we are very, very low level, because tokenization is lossless compression, right? It is too close to the raw data. And yet we have it easy compared to computer vision, because language is already a very compressed representation of knowledge; but still, the token is probably too low-level.
Well, that was a fascinating paper. Let's move on to your next one, "The Birth of Self-Supervised Learning: A Supervised Theory", which was with Yann LeCun. Basically, you said that the observed differences between self-supervised learning and supervised learning are not due to the loss functions themselves, but rather to the labeling of the dataset used in training. Give us the elevator pitch. Yeah. So basically what we show in this paper is that you can take a supervised objective, let's say least squares to keep it simple.
You have the inputs, you have your network's predictions, and you have the labels. And you can turn this objective, which tries to map sample x_n to prediction y_n, into a self-supervised learning objective, which compares samples with each other. So basically, you go from saying, okay, this image is a car or a dog, to saying, are those two images the same or not, which is the self-supervised, joint-embedding world.
And you can show that whether you have the labels or you have knowledge of this pairwise relationship, you actually learn the same representation, up to some symmetry that is irrelevant if you do linear probing. So the loss functions themselves, the SSL one and the supervised one, try to do the same thing; they just operate on a different view of the labeling:
whether this image is of that class, or whether those two images or samples represent the same thing. Given that, the next question is: how come self-supervised learning generalizes better than supervised learning? From this perspective, you can say it's because SSL acts as if it were solving a supervised task whose labels are not coarse, like mapping all the cars to "car", but very, very fine-grained, where in the limit each image is its own class, basically.
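As a rough sketch of the construction in made-up notation (the paper's exact formulation and assumptions may differ): the labels only enter the supervised problem through a pairwise relation matrix, which is exactly the kind of object an SSL objective consumes. With one-hot labels \(Y \in \{0,1\}^{N \times C}\) and embeddings \(Z = f_\theta(X) \in \mathbb{R}^{N \times d}\), the least-squares problem with an optimal linear head,

\[
\min_{W} \ \| Z W - Y \|_F^2,
\]

depends on \(Y\) only through the Gram matrix \(G = Y Y^\top\), whose entry \(G_{ij}\) is 1 exactly when samples \(i\) and \(j\) share a class. It can therefore be traded for a pairwise, SSL-style objective of the form

\[
\min_{\theta} \ \| Z Z^\top - G \|_F^2 \quad (\text{up to normalization and a symmetry that linear probing ignores}),
\]

and in the fine-grained limit where every sample is its own class, \(G\) only links augmented views of the same sample, which is the usual joint-embedding SSL setting.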
So if you think about supervised learning in this extreme setting, you also don't overfit to the task, because you don't collapse any image onto another one. And so, theoretically speaking, you can solve as many downstream tasks as you want. So this equivalence of losses at least brings a slightly new perspective: it's not really about the objective, it's about how you design the SSL pipeline, how you say, okay, this sample is related to that sample. It's not the objective that
makes you learn a better representation. Okay, and in the paper you talk about how SSL can maximize worst-case downstream task performance. Can you sketch that? Yeah. Basically, if you think about all the possible realizations of downstream tasks, you could have some very coarse ones: maybe you have different pictures of cars and buses and you just want to say, okay, it's a car or a bus, so no details need to be encoded to solve this. But then you can have downstream tasks where you want to say, okay, which brand of car is it, or which color? So you have a distribution of downstream tasks, right?
And the point is that you want to learn the representation so that, if you look at the distribution of downstream-task performance, you are as good as possible on most of them. You don't want to be very good on some and then, in the tail, very bad
on the majority of them. From this, you can ask: what would be the labeling that makes your worst case as good as possible? And you can show that this is actually the labeling that self-supervised learning is implicitly
doing. How does class balance affect the difference in the losses? Oh yeah, that's a very good point, actually. In a follow-up paper we are doing right now, we show that current SSL objectives assume class balance. This is something we already highlighted briefly in the Hidden Uniform Cluster Prior paper we did a couple of years ago: current SSL objectives assume a balanced representation of classes or concepts.
This means that if you train on ImageNet, things work out very well, because concepts are roughly equally represented. But if you go to other datasets like iNaturalist, which are very heavy-tailed, then you get a huge bias in your representation. Until now, people did not really know how to solve this. One way
people approach it is through data curation: they say, okay, I'm just going to remove the oversampled concepts to make the dataset more uniform, and then I do self-supervised learning on that. But because we now have this theoretical formulation and this equivalence of losses, we can take the exact same recipes people use in supervised learning to re-weight according to class frequency, and
we can use them to come up with a new self-supervised learning loss that takes the imbalance into account. This kind of thing is enabled by the mathematical formulation, and it is principled: you can prove from the supervised theory that this is the right way to do the weighting.
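As an illustration of the kind of re-weighting this enables (one natural choice under the pairwise view sketched earlier; the follow-up work may use a different weighting): if class \(c\) has \(N_c\) samples, pairs coming from over-represented classes can simply be down-weighted,

\[
\min_{\theta} \ \sum_{i,j} w_{ij} \big( (Z Z^\top)_{ij} - G_{ij} \big)^2, \qquad w_{ij} \propto \frac{1}{N_{c(i)}\, N_{c(j)}},
\]

which mirrors the inverse-frequency class weighting long used for imbalanced supervised learning.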
And this is really nice, because suddenly, from this seemingly naive connection, you can come up with a new generation of self-supervised learning models that actually match what real-world data distributions look like:
a non-uniform distribution of classes. Maybe even if some samples are noisier than others, you can include that information as part of the SSL objective as well. So suddenly a whole new world of possibilities opens up, and because of this connection you can actually prove, okay, this is the right way to do it, at least from the supervised-theory viewpoint. You also pointed out a connection to VICReg.
Exactly. What we show in the paper is that if you take a least-squares supervised objective and turn it into an SSL one, what you obtain is basically VICReg. Then you have a few variations, it could be VICReg or W-MSE, depending on how you go
from supervised to SSL. But you can show that, depending on the type of supervised loss, you recover different types of SSL losses. If you look at cross-entropy instead, you get something more like a SimCLR-type loss. So you have this one-to-one correspondence.
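For readers who have not seen it, here is a compact sketch of the VICReg family of objectives in its standard published form (the generic loss, not necessarily the exact variant recovered by the derivation discussed here); with labels available, the two inputs can simply be embeddings of two samples from the same class.

```python
import torch
import torch.nn.functional as F

def vicreg_style_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0):
    """z_a, z_b: (batch, dim) embeddings of two related samples/views."""
    # Invariance term: pull embeddings of related samples together.
    sim = F.mse_loss(z_a, z_b)

    # Variance term: keep each embedding dimension from collapsing.
    std_a = torch.sqrt(z_a.var(dim=0) + 1e-4)
    std_b = torch.sqrt(z_b.var(dim=0) + 1e-4)
    var = torch.mean(F.relu(1.0 - std_a)) + torch.mean(F.relu(1.0 - std_b))

    # Covariance term: decorrelate embedding dimensions.
    def off_diag_cov(z):
        z = z - z.mean(dim=0)
        n, d = z.shape
        cov = (z.T @ z) / (n - 1)
        off = cov - torch.diag(torch.diag(cov))
        return (off ** 2).sum() / d

    cov = off_diag_cov(z_a) + off_diag_cov(z_b)
    return sim_w * sim + var_w * var + cov_w * cov
```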
And this is also very nice because in supervised learning we at least know when one loss may be preferred over another; this has been studied for a long time, since supervised learning has been around forever. So now we can reuse those insights for self-supervised learning. To me, that is a very strong benefit of this result: suddenly all the theory, the thousands of papers done in supervised learning, we can just take and apply in SSL. Another example is neural collapse, which has been proven in the supervised setting. Now it applies,
in about five lines, in the SSL setting as well. So this connection goes beyond just saying, okay, it's not the objectives that make SSL better. It's really tying those two
huge communities together toward the goal of a single unified objective for learning representations. And this is nice too, because if you speak to people, they will think, okay, you have supervised learning on one side and SSL on the other, and you are either in one camp or the other. But what we show is that SSL is
pretty much all of representation learning, and supervised learning is just one realization of SSL. Then VICReg without labels is another realization, and so on. So you really get a better understanding of this relationship and of what representation learning is trying to do.
Galaxy-brain question incoming. Could you combine SSL and supervised objectives in some way to improve generalization? Yes, yes. There is one paper, supervised contrastive learning, where they use the labels within a SimCLR framework to basically do fully supervised learning, but with a SimCLR objective.
First of all, we can show that this indeed makes sense, and we can explain the empirical results they had. But we can actually do a bit more than that. If you are in a semi-supervised setting, for example, it may not be clear how to combine those two losses anymore. Maybe you say, okay, I have the two losses and a coefficient to weight them, but then you need to do cross-validation and so on. But now, from this perspective,
you can combine them in a very principled way, and you can understand which weighting makes sense depending on how many samples you have of each kind. And you can again use all the literature from supervised learning for this setting. So this is something you can do very easily with this formulation as well. Okay, so if SSL and supervised learning are two sides of the same coin, of course we can use this theoretical framework to design new forms of SSL, but
is the distinction even relevant if they are the same thing? I think it's not just two sides of the same coin: SSL is more general than supervised learning. SSL could be the general objective for learning representations, and the more prior knowledge you have, the more you know about your downstream tasks and your labels, the more SSL
slowly becomes supervised learning through the labels you use in the SSL objective. But because, as you said, you have this hierarchy, it does not really make sense to say you do either supervised learning or SSL. What makes sense is to ask: what is this relation matrix, this pairwise matrix? If you build it from labels, it's supervised learning. If you build it from other a priori knowledge, for example that two consecutive frames in a video
basically have the same class, then you are more in an unsupervised, SSL setting. It's all about how you build this pairwise relation matrix; that's the main question. Very cool. Right, let's move on to the next paper: "No Location Left Behind: Measuring and Improving the Fairness of Implicit Representations for Earth Data". There are loads and loads of modeling frameworks now that use these implicit neural representations of geospatial Earth data, for things like climate modeling, resource allocation, and environmental modeling.
I was actually interviewing Johannes from NXAI yesterday. I don't know if you know him, but he's working on similar stuff. The problem is, you've studied this and you found that there are loads of biases and fairness problems. Yeah, exactly. Basically, what we show is that when you want to model, let's say, temperature or precipitation to keep it simple, and you want to learn an implicit neural representation, it means you want a model such that if you give it a location and a date, it can predict what the temperature was there.
Having this type of implicit neural representation is very useful, because if you learn a good model, you can interpolate those values, so maybe estimate what the temperature was in a part of the globe where you did not have a sensor. But you can also extrapolate: if you assume you really learned the true physical model of the world, you could start saying, okay, what will the temperature be two years from now?
So it's very nice to have this type of model for all sorts of applications. The thing is that when you do this nowadays, depending on the architecture and the design choices you make, you may get very good predictions on average, when you look at the average performance across the globe. But if you look, for example, around islands or coastal areas, your predictions are going to be very bad, almost random.
This can be very concerning, because if you use this type of model to decide on a policy that will affect a specific island, using the model's predictions is as good as using random guesses. So it can be very detrimental, and people need to be aware of those biases. What we found is that for this type of climate data, islands are often disregarded, as are coastal areas, basically regions where you have a big gradient in the kind of climate
data you are trying to model. How much responsibility do modelers have to detect these kinds of biases in the data? I think there are two components, as you said. One could be that the dynamics of the data you are trying to model are just harder
near islands, or maybe even unpredictable, because you don't have enough observations there. So there is some uncertainty that you can probably never recover from with good design alone. But still, what we found here is that a lot of the bias
comes from the architecture and from how you encode those positions, the type of basis you use to make the prediction. So right now, a big chunk of the bias seems to come from the architecture. But I totally agree that we can't remove the bias entirely, because there may just be different kinds of uncertainty in different parts of the planet as well.
I mean, the world is a very, very complicated place. Realistically, to what extent can we mathematically model it? Yeah, that's a good question. I think it depends on the horizon you have and the type of data you want to model. If you have a system that is much more chaotic, or that can vary very quickly without much change in the past observations, that's something current models have a very hard time with. If instead you want to predict something like temperature
in North America, not near the coast but really inland, there are fewer sharp gradients and things are a bit more stationary, spatially and through time, so it can work much better. But at this point we don't have an architecture that really understands that you have different physics, different dynamical models, in different parts of the globe.
And because of this, you just get whatever is best on average, which means you miss out on a lot of details. Can you tell us about some of the technical framework? One thing we showed, at least for this type of globe data representation, is that people use a Fourier basis to model the prediction. This is better than not using any basis at all, but it implies that the signal you're predicting is very stationary and not localized at all.
And this is a very strong prior, right? It may be fine for some quantities, but for others, like precipitation or temperature, where you have localized, very sharp gradients, it's a strong bias. And if you come from the signal processing community, you know very well that to get better localization you go from Fourier to wavelets. So that's one thing we did in this paper: we showed that using a wavelet basis to encode the
data gives you better localization, and this removes some of the biases. Here it's more of a proof of concept that different design choices give you different bias trade-offs; wavelets are not the answer to everything, right? But I think the next step is to really encode less and less a priori about which basis to use, and let the model learn it from the data on its own.
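As a toy illustration of the localization difference (a Gaussian-windowed, Gabor-style feature stands in for the wavelet family here; the basis actually used in the paper may differ), compare a global Fourier encoding of a coordinate with a localized one:

```python
import numpy as np

def fourier_features(x, freqs):
    """Global sinusoids of a scalar coordinate x (shape (N,)):
    good for stationary signals, but with no spatial localization."""
    phases = 2 * np.pi * x[:, None] * freqs[None, :]            # (N, F)
    return np.concatenate([np.sin(phases), np.cos(phases)], axis=1)

def gabor_features(x, freqs, centers, width=0.1):
    """Gaussian-windowed sinusoids: each feature only 'fires' near its
    center, so sharp local structure (coasts, islands) can be encoded
    without affecting the rest of the domain."""
    window = np.exp(-((x[:, None] - centers[None, :]) ** 2) / (2 * width ** 2))
    phases = 2 * np.pi * x[:, None] * freqs[None, :]
    return np.concatenate([window * np.sin(phases),
                           window * np.cos(phases)], axis=1)
```

An implicit neural representation would then map such per-coordinate features (for latitude, longitude, and time) through an MLP to the predicted quantity.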
And we are not yet at that point, at least for this type of climate data. How could it handle noisy or missing data? That really depends on the type of model you use. For example, if you have an INR, you simply don't use the missing data as part of your training pipeline, and that's one of their benefits: if one of your sensors stopped recording for some years, you just don't include that period in your training data, because you control exactly where and when you have data and what the prediction should be.
So these Earth models are now informing policy around the world. Who should we hold accountable? Is it the technology? Is it the scientists who design the models? Is it the policymakers who interpret the results? I think it's very hard for the person who designs the model to know a priori what it's going to be used for.
So I think it's more downstream, when you know clearly what you want to do with it. You should first set up a good evaluation pipeline to make sure it's something you can actually use to make those decisions, and then you can report any failure modes you observe so that people can improve the design. But a priori, it's very hard to imagine what the model will be used for. In the ideal setting, you would wish there were no bias at all, but in practice,
the space of possibilities being so large, it needs to be more of a feedback loop: you iterate until you have something you can really trust, and then you can act on it. Earth-modeling data is very anthropocentric, right? We focus on human populations and so on. Should we also focus on ecosystems and places that have nothing to do with humans? Oh yeah, that's a great question. And in fact, that's one of the big issues with a lot of these data
sets, which are crowdsourced: by definition, the amount of data you get is proportional to the number of users you have in each location. This means you have a huge bias in what your model is learning and what it is focusing on, which means you miss out on a lot of things. Crowdsourcing can give you a lot of data quickly, but it's very biased data.
So then the question is, how much of this biased data should you use versus paying a lot more to capture other parts of the globe? Maybe you could show that, under some specific conditions, having just 10% of the data be high quality and uniformly sampled, with the other 90% crowdsourced, is enough: you use that 10% to anchor your representation and then use all the data together. But there is a huge amount of research to do there,
because that's a very big source of bias. And this is a bit of a policy question, but we are using these things to do resource allocation, right? So giving more resources to some populations might mean taking them away from others. And then there's the fairness-over-time issue as well: what is fair now might not be fair in 100 years' time. So how should we think about it? Yeah, that's a good question. I think this is also very
application-specific. For example, if you want to predict where to build housing to solve a specific problem, maybe you don't really mind having bad predictions where there is no population anyway, because you are not going to build a house there. In that case, maybe the crowdsourced data is actually fine, but it really depends on the type of application. And one thing I will say about the point you made before: this type of bias is something you also have in computer vision. There is a very nice
paper by Mark Ibrahim. Basically, they showed that most of the data we have in ImageNet comes from North America. So maybe you reach 90% state-of-the-art performance at predicting, for example, types of chairs or cars, but only for North American models of them.
And when you start looking at the kinds of cars or chairs in Central Africa or East Asia, suddenly the model's performance is extremely bad. So this type of problem exists across modalities, and it's a very, very big issue. Randall, it's always a pleasure and an honor to have you on the show. Thank you so much. Likewise, likewise. Thank you so much. Thank you.