
OpenAI's "Scaling Laws for Autoregressive Generative Modeling"

2020/11/8

Last Week in AI

People
Tom Henighan
Topics
Tom Henighan: This paper studies how the performance of autoregressive generative models changes with model size, dataset size, and compute. The finding is that, when not bottlenecked by the other factors, test loss falls as a power law plus a constant offset as each of these three factors grows. The constant term represents the true entropy of the data, that is, the lowest uncertainty a perfect model could achieve; the power-law part represents the reducible loss, that is, the KL divergence between the model's distribution and the true data distribution. The reducible loss is the key measure of how closely a model approximates the true data distribution. For a given compute budget there is an optimal model size, one that lets the loss drop significantly before it levels off toward converged performance. The optimal model size follows a power law in the compute budget, and the power-law exponents are surprisingly similar across domains. Even after the generative model approaches its irreducible loss, the classification loss of models fine-tuned for ImageNet classification keeps falling as a power law in model size, which suggests that the reducible loss matters more than the total loss. Larger pre-trained models are more efficient to fine-tune, needing less data to reach better results. Although larger models are more sample efficient, they are also more expensive at inference time, so training cost must be weighed against inference cost. For image data, the relationship between loss and dataset size is not always a power law: it is roughly linear over a range and then levels off, possibly due to overfitting. Future directions include studying how model pruning affects these power laws and extending the work to other architectures (such as convolutional networks) and model types. Andrey Kurenkov: (asked guiding questions; did not put forward core arguments of his own)

Chapters
The paper focuses on understanding trends in performance across various domains by examining the relationship between loss and factors like data set size, compute, and model size.

Transcript


Hello and welcome to Skynet Today's Let's Talk AI podcast, where you can hear from AI researchers about what's actually going on with AI and what is just clickbait headlines. I am Andrey Kurenkov, a third-year PhD student at the Stanford Vision and Learning Lab and the host of this episode. On this interview episode, you'll get to hear from Tom Henighan, a member of the technical staff at OpenAI, working on the safety team.

Tom, along with Jared Kaplan, Mor Katz, and others, authored the recent paper Scaling Laws for Autoregressive Generative Modeling. He completed his PhD in the physics department at Stanford, where he studied atomic motion in solids, advised by David Reis. Thank you so much, Tom, for making the time to be on this episode. Thanks for having me. Pleasure to be here.

So our focus will be on your paper, which you co-authored with many people at OpenAI, Scaling Laws for Autoregressive Generative Modeling, which just came out a few weeks ago. It follows up on a few other papers from OpenAI, including Language Models are Few-Shot Learners, which famously introduced GPT-3, and also Scaling Laws for Neural Language Models, which also came out this year.

Before we dive into any of the details, how about I just let you provide kind of a summary of what the paper is about and what are its main conclusions?

Yeah. So, you know, I think a lot of the field of machine learning is focused on getting state-of-the-art results. And so people are trying to find ways of tweaking things to improve loss, accuracy, or whatever their metric of choice is, to get a new state-of-the-art result. And a lot of that happens sort of at the edge of

technological progress or what's possible: making a model a little bit bigger, adding a little bit more data, those kinds of things. And so the focus of this work is trying to zoom out and say, OK, well, what do the trends in performance look like? Not if I just increase the data set or make things

two times bigger or even 10 times bigger. But what if I look over, you know, something like five orders of magnitude? Is there some sort of macroscopic trend that's happening there that might be informative? And somewhat surprisingly, we've been finding that in the case of measuring the test loss,

It seems that the test loss, as a function of any one of data set size, the amount of compute you invest in training, or model size, decreases with a power law plus constant offset trend as that factor increases, so long as you're not bottlenecked by the other two. So, for instance, if you have plenty of data and you have lots of compute, so you can train till convergence, loss as a function of model size seems to be power law plus constant offset for transformers.

We first saw this in language, as you mentioned, in our paper Scaling Laws for Neural Language Models. But the emphasis of this paper was seeing if that generalized to other domains. And it seems like that is the case.
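For readers who want the functional form being described written out, the fits discussed above take the shape of a power law plus a constant offset. A minimal sketch in LaTeX, following the paper's convention of writing the irreducible term as an asymptotic constant; the symbols x_0 and alpha_x here just denote the fit parameters for each variable:

```latex
% Power law plus constant offset, fit separately for each of
% x = N (model size), x = D (dataset size), x = C (compute).
% L_\infty is the irreducible loss; the power-law term is the reducible loss.
L(x) \;=\; L_\infty + \left(\frac{x_0}{x}\right)^{\alpha_x},
\qquad x \in \{N,\ D,\ C\}
```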

I see. Yeah, thank you for that great summary. Yeah, I was just looking over figure one. And as you say, the idea here is that you do this for a few different types of models. So you do this, I think, for images, for language, for text to image tasks, image to text, video.

And if I understand correctly, you have the same kind of architecture, the transformer architecture, which is also the basis of GPT-3. And so you apply the same model with the same task, with the same cross-entropy loss. And so with kind of the same setup across different types of data, you get this power-law relationship, which is basically saying that

As you change one of these variables of, let's say, for instance, compute, for every order of magnitude, you see a sort of linear decrease in the loss. Is that a correct description of the main outcomes? Yeah. So I guess I would say the...

If you increase the compute by a factor of 10, you always see the same fractional decrease in the loss. But how big that fractional decrease is, whether it's 50 percent or 30 percent or whatever it might be (and I'm making those numbers up, by the way, that's probably not the actual ones), depends on the domain. And so it's

a power-law relationship, so it looks linear on a log-log plot. I know it's a little persnickety, but just to get that right. Of course, yeah, it's important to get the details. And maybe now we can dive into a bit of the details. So you also have, in describing results, a pretty interesting idea of reducible and irreducible loss. And so this power-law relationship, I think the main results are for reducible loss.

So can you try and explain to listeners what these reducible and irreducible losses are and how you get those? Yeah. So we've been finding that the relationship between the loss and compute or model size or data set size, whichever of those three it might be, is a power law plus constant offset fit. And so for those familiar with...

information theory, that actually suggests a really nice interpretation: the constant in that power law plus constant fit, so the value you're approaching as you go to infinite data, infinite compute, infinite model size, is sort of the true entropy of the data you're trying to model.

It is sort of the best, the lowest uncertainty, you know, that a perfect model of that data could achieve.

Whereas the power law component, so you have a constant, that's the constant. The additional power law component is the quote-unquote reducible loss, which is the component of the loss that can be learned. It actually represents the KL divergence, the Kullback-Leibler divergence, between the model's distribution of the data and the true data distribution itself.
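To make the information-theory interpretation concrete, here is a tiny Python sketch with toy distributions (chosen purely for illustration, not taken from the paper) showing that the expected cross-entropy loss splits into the entropy of the true distribution, the irreducible part, plus the KL divergence from the true distribution to the model, the reducible part:

```python
import numpy as np

p = np.array([0.7, 0.2, 0.1])    # "true" data distribution (toy example)
q = np.array([0.6, 0.25, 0.15])  # model's predicted distribution (toy example)

cross_entropy = -(p * np.log(q)).sum()  # what the test loss estimates in expectation
entropy = -(p * np.log(p)).sum()        # irreducible part: entropy of the true data
kl = (p * np.log(p / q)).sum()          # reducible part: KL(true || model)

# Cross-entropy decomposes exactly into irreducible + reducible.
assert np.isclose(cross_entropy, entropy + kl)
print(cross_entropy, entropy, kl)
```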

I see. So if I understand correctly, basically it's saying irreducible loss is even if you had the perfect model that could perfectly learn everything from a training set, just due to the nature of the data, it can't get to zero loss because there's some amount of randomness that's inherent and you're never going to overcome it. And so you can find to some extent this irreducible loss and that becomes a constant offset of

in your power-law relationships. So on the graphs, there's a line that's linear on the log-log plot. And so what you're plotting there is, sorry, the reducible loss, which

doesn't include this impossible to get away from irreducible loss. Is that right? That's exactly right. And what we're suggesting is that perhaps the reducible loss is really the important quantity here. It's sort of telling you how close you're getting to modeling the true distribution of the data.

And yeah, sort of if I can give an anecdote of how I sometimes think about the irreducible loss, say the task for language. So in this case, these are autoregressive transformers. So you're just trying to predict what the next words are going to be.

If I read the first chapter of a murder mystery, no one in the world could say with 100% certainty that they knew who the murderer was, that they knew it was Professor Plum in the study with the candlestick. They would have some probability distribution over who the likely murderer was, and there's

There's just some intrinsic limit to how calibrated or how good you can make that prediction. Sorry, there's a limit to how accurately you can make that prediction. And so that represents the irreducible loss. So that's sort of like unachievable to do better than that. So perhaps what's the really important metric that we should focus on is the component of the loss that can be learned. So the reducible loss.

Great. Yeah, that hopefully is clear to readers. I think that makes a lot of sense to me. And diving again a little bit more into the details, I think there's a few main kind of quantities or quantitative results you have. And those are for loss as a function of model size, loss as a function of compute. And then also you have something quite interesting, I think, which is

the optimal model size as a function of compute, which I think you call N_opt(C): for a given compute budget C, it gives the optimal model size. And you also show that this optimal model size can accurately be modeled as a pure power law. So can you just tell us a bit more about these different quantities and how consistent they are? And yeah, what have been some of the exciting things for you there?

Yeah. So first, let me just briefly describe what we mean by the optimal model size for a given compute budget. So you could imagine that if you use a very small model and invest a lot of compute in it, so you train it for a very long time, the loss will go down only so much, because your model capacity is limited and a small model can learn only so much.

But in contrast, if you use an exceedingly large model and invest, say, the same compute budget, you might only be able to take one step, because you have so many parameters and need to do...

so many floating point operations that you're only able to look at one batch of data. And it's hard to imagine that a model could learn much by only looking at one batch of data. So for a given compute budget, there is some Goldilocks region in between, Goldilocks choice of model size where...

you're able to look at enough data for the loss to drop significantly, but it's sort of before the loss starts to level off and asymptotically approach whatever the converged performance is going to be. And so, using these results, you can extract this for the transformer, again, for decoder-only transformers, in these different domains: images, videos, math, language.

And a surprising thing is that it looks like the optimal model size as a function of compute budget is a power law. And not only that, but the power laws are surprisingly similar for all these, you know, seemingly to me at least, pretty different domains. All of them have an exponent that's right around 0.7. Now, when I say similar, I mean, for some of them the actual values might be different by an order of magnitude. And so, yeah.

But when you look at it on the log-log plot, these lines seem to be almost on top of each other, which came as a real surprise to me. And it feels like it wants to tell us something about some theoretical thing that we don't quite understand yet, which I think is exciting. Definitely. It's very interesting to see these sort of trends across different data types, which I guess is the whole kind of exciting thing about the paper.
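As a rough numerical illustration of that relationship: the exponent of about 0.7 is the one mentioned in the conversation, while the prefactor, the compute units, and the function name below are made up purely for illustration and are not fitted values from the paper.

```python
def n_opt(compute_pf_days: float, k: float = 1.3e9, exponent: float = 0.7) -> float:
    """Hypothetical optimal parameter count for a given compute budget,
    assuming a power law N_opt(C) = k * C**exponent with an illustrative k."""
    return k * compute_pf_days ** exponent

for c in (1e-3, 1e-2, 1e-1, 1.0):
    print(f"C = {c:g} PF-days -> N_opt ~ {n_opt(c):.2g} parameters")
```

With an exponent of 0.7, every 10x increase in compute multiplies the optimal model size by roughly 5x, whatever the prefactor happens to be.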

Moving on to a slightly more specific detail, which I found interesting. You say in the paper that when generative image models are fine-tuned for ImageNet classification, you find a power law for classification loss versus model size.

So again, basically it's saying that if you increase the model size by 10, you get some fractional decrease, 10, 20% for classification loss, consistently, which is pretty cool. You can basically do better and better. But a detail here is that that happens even beyond the model size where you approach the irreducible

loss for generative modeling. And here you also say that you conclude that the approach to the irreducible loss does not necessarily indicate diminishing returns for representation quality or semantic content, which is interesting. So the point, if I understand correctly, is that you might interpret the power law as being diminishing returns. Basically, you

As you increase by 10, so you go from million to billion to trillion, you get the same return every time, right? Which maybe is bad because that means that once you get to billion, it's very hard to get to trillion to get another 10%.

So can you speak a bit more about this ImageNet classification and the question of whether this indicates diminishing returns? Yeah. So I think for me, this relates to the constant in that power law plus constant equation and

which, as we said earlier, corresponds to the irreducible loss. So if you were looking at the reducible loss, then what you said would be true. It might be the case that every time you 10x your investment in compute or model size or whatever it might be, maybe you would decrease your loss by another 10% or 20% or whatever it is.

But actually, when you have the power law plus constant offset fit, even that fractional return begins to diminish, because you begin asymptotically approaching the irreducible loss. And so it might be the case that you're increasing your model size by 10 times or 100 times, and the loss is only decreasing by, you know, maybe 1 percent or 0.1 percent.

And so from that, if you weren't looking at these macroscopic trends, you might conclude that, oh, well, as I'm increasing my model size, my loss isn't going down by very much, so I think my model isn't really getting any better, and my downstream task performance is also probably not improving. But actually, that's not the case.

As we see here, when we fine tune it for the classification objective, the performance continues to improve in a smooth power law kind of way, both for the classification loss and for the classification error rate.

And so I think this, again, suggests that maybe the important quantity is not the total loss, but specifically the reducible loss. Because if you looked at that same trend, but looked at the reducible loss instead of the total loss, you'd see that it was consistently decreasing by 10 percent every decade. And so that would maybe be a better indicator of what sort of downstream task performance you should be expecting.
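A small numerical sketch of that point, with made-up constants rather than the paper's fits: the reducible part falls by the same fraction for every 10x in model size, while the fractional improvement in the total loss keeps shrinking as it nears the irreducible floor.

```python
import numpy as np

L_inf, N0, alpha = 2.0, 1e6, 0.3     # illustrative constants, not fitted values
N = np.array([1e7, 1e8, 1e9, 1e10])  # model sizes, 10x apart

reducible = (N0 / N) ** alpha
total = L_inf + reducible

print(1 - reducible[1:] / reducible[:-1])  # constant ~0.50 per decade of model size
print(1 - total[1:] / total[:-1])          # ~0.10, 0.06, 0.03: shrinking returns
```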

Yes. And yeah, I'm curious in particular about ImageNet classification, because this is a pretty concrete task we can focus on. And, you know, to some extent you could say it already matters in products and so on.

ImageNet classification, for those who don't know, is image classification, saying what is in this image. And a lot of online APIs exist to do it. So I would imagine that the companies providing these APIs care about performance and want to try and do it as well as possible.

And so it's very interesting that you have this power law where, as model size increases, classification performance also improves by some consistent-ish amount. And additionally, you have, I think, even more interesting results here. So you have something that others have also shown to some extent, I think, that as you go to bigger and bigger models,

And here you are pre-training them on this generative task of learning representations. Then you fine-tune them for classification, and you find that larger pre-trained models fine-tune significantly faster. So you need to look at less data to achieve better results. So yeah, in some sense, it's easier to optimize larger models. And they actually perform better over time.

So, yeah, I'm curious if you can highlight some results from maybe a pragmatic perspective, from like an operational perspective: if you're building a product, if you're building a neural net to do some task. I think this is one of the results that seems interesting, that it's easier to optimize larger models. Maybe for ImageNet, you just go big. Are there any other results like that, coming from this, that you think

that you could highlight? Yeah. So I guess I would start by just caveating that, the way we do the training, you know, these models, we're not achieving anything close to SOTA, right? So, like, I don't know, on this graph the classification error rate is maybe 30% or something, which is not exactly state of the art for ImageNet these days. But I think what you're alluding to is that maybe

these macroscopic trends, the results here, do maybe suggest some practical tips and tricks that could be useful for practitioners. And I think the point you raised about bigger models being more sample efficient is something that shows up again and again in these works.

We can see that generative pre-training did result in these models having good representations that transferred well to the downstream classification task, and that they picked up on that classification task pretty quickly with relatively few samples, especially for larger models.

But I guess one other practical point is you might think, oh, then a bigger model is just the way to go. But a practical point for many companies is that a lot of their compute cost is not just in training, but also at inference time. And a bigger model will cost you more at inference time. So you have to weigh those two things against each other.

If all of your cost is in training and you're not worried about the cost at inference time, then indeed it can in many instances make sense to use a larger model. But if you're primarily constrained by inference costs, you may be more frugal with how big of a model you want to use. I see. Very interesting. Yeah.
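As a back-of-the-envelope sketch of that trade-off, using the common rule-of-thumb estimates of roughly 6 FLOPs per parameter per training token and roughly 2 FLOPs per parameter per token processed at inference (these approximations and the workload numbers below are assumptions for illustration, not figures from the paper):

```python
def training_flops(n_params: float, train_tokens: float) -> float:
    """~6 FLOPs per parameter per training token (rule of thumb)."""
    return 6 * n_params * train_tokens

def inference_flops(n_params: float, served_tokens: float) -> float:
    """~2 FLOPs per parameter per token at inference (rule of thumb)."""
    return 2 * n_params * served_tokens

n_params = 1e9        # a 1B-parameter model (hypothetical)
train_tokens = 3e11   # tokens seen during training (hypothetical)
served_tokens = 1e13  # tokens served over the deployed lifetime (hypothetical)

print(f"training:  {training_flops(n_params, train_tokens):.1e} FLOPs")
print(f"inference: {inference_flops(n_params, served_tokens):.1e} FLOPs")
```

With numbers like these, lifetime inference compute can easily dwarf training compute, which is why a smaller model can be the more economical choice even at a slightly higher loss.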

And speaking of that, again, still on the images, one thing I found interesting is you have a section here called inconsistency in compute and data size scaling laws. So for some data, and I think here it was particularly images,

You showed that actually here, as you scale the dataset size, or the amount of data seen by the model, it's not quite a power law. So you get a linear relationship on the log-log plot for a while, and then it sort of tapers off, and you get almost like an L shape for loss as a function of dataset size.

And that's, I think, something that's quite interesting also for practitioners. If you're, say, a company and you're collecting data, you want to know how much more data to collect. Are you going to get more payoff for additional data? So a law here would be very interesting. So can you speak to what you found with respect to loss as a function of data set size and maybe the inconsistency that you have? Yeah.

So we found that the loss as a function of data set size was a power law plus constant offset. And so the L shape you're referring to, the bend in the curve on this log-log plot, is a result of that power law approaching what we're calling the irreducible loss, which causes it to asymptote towards this constant value, which is the constant in the power law plus

constant equation, and which we think represents the irreducible loss. And the inconsistency you're alluding to is something that we saw for images here; in previous work, we also saw it

for language where, so if I can try to describe it, it's a little bit of a mind warp. So you can plot the loss you're going to achieve as a function of data set size if you allow the model to look at as many epochs as it wants. Now, if you're limiting the data set size, eventually it will overfit. And so here we're just using early stopping to tell us what the best achieved loss was.

So that gives you one curve of loss as a function of data set size. Another way you can plot loss as a function of data set size is to look at the loss that a model would get if you let it look at that data set, but only go through one epoch.

So you only let it look at each example once. And you could do that for a handful of different model sizes. And so the smaller models will plateau at larger values of the loss, because they're constrained by how many parameters they have, and then bigger and bigger models will have lower and lower loss. And as we spoke about earlier, they seem to also be more sample efficient, so the loss drops faster.

What's interesting is that we weren't able to actually see them intersect. But if you extrapolate those two trends out, it looks like they're going to intersect, which would suggest something at some point. Yeah, I don't know. You can interpret it as you will. I think, you know, I might guess that

maybe in the long run this sort of loss as a function of data set size is going to be constrained by the multi-epoch case, where actually, after the model has seen some amount of data, some fraction of the epoch, looking at more is actually giving diminishing returns.

It's hard to say. I have to say it's an interesting question. I don't know what would happen if you were to go to larger models with more data. I see. Oh, okay. I think I mischaracterized it a bit. So the inconsistency is basically if you plot...

both of these power-law lines, loss as a function of dataset size and loss as a function of compute, they have slightly different slopes, which means that at some point one of them, the lower slope, which is compute, will have to maybe overtake the data trend. So that is interesting, as you say. And I think, yeah, as you say, I think

there's not too much research in this vein yet. There's starting to be more of it, and you do cite a lot of the relevant literature. But these kinds of macroscopic trends that are empirical but seem to hold are definitely interesting to be aware of. Yeah.
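To see the geometry behind that "different slopes must eventually cross" point, here is a purely illustrative sketch; the exponents and constants are invented and are not the paper's fits.

```python
import numpy as np

D = np.logspace(6, 14, 81)          # dataset size, arbitrary units
curve_a = 2.0 + (1e9 / D) ** 0.30   # steeper power-law-plus-constant trend
curve_b = 2.0 + (1e8 / D) ** 0.20   # shallower trend

# Two straight lines with different slopes on a log-log plot must cross somewhere
# once extrapolated far enough.
first_cross = D[np.argmax(curve_a < curve_b)]
print(f"extrapolated curves cross near D ~ {first_cross:.1e}")
```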

Yeah, and I think for me, you know, coming from physics, you know, you see a power law and you think, oh, it feels like it wants to tell us something, you know, and I think there's a lot about AI and machine learning and how neural networks are working so well that we don't have a good grasp on theoretically. And so, yeah, I don't know, maybe results like this will help us make some progress in that direction.

Indeed. Actually, I think now maybe is a good time to acknowledge once again that this is a big team project. So there's, I think, I don't know, 20 authors, a large number of authors. Yeah.

And maybe, I don't know, I'm curious if you can zoom out a bit from the results and can you tell us a bit of what was involved in getting this paper together? I mean, infrastructure-wise, experimental design-wise, all the different details. What are some of the things you can think of that were really big efforts on this? Yeah, I mean, obviously, there's...

you know, a fair amount of engineering that goes into building these models and into making it possible for us to train them. That's a huge effort. And you know, this work really wouldn't have been possible without all of the people really working out the details there to make that happen. And you know, I think I was also fortunate to have

really great colleagues and mentors in terms of research direction, who thought these were interesting ideas to pursue. And I think it wasn't as obvious to me, maybe, at the beginning of the project, but in hindsight, I think it's really, really cool, and I'm glad we did it.

Yeah, I mean, you know, I feel pretty lucky. I do feel like it was a pretty collaborative effort. And yeah, it's been great. It was a great experience. Great to hear. Yeah, definitely. When there's such a big team, it's good to hear that it kind of all came together. And I do think, yeah, you got some very interesting results. Now, speaking a bit about architecture and sort of

I was also wondering about some of the things you might do next or that you thought of doing as far as some of these empirical studies. So one thing that I find interesting is you evaluate here this transformer architecture and you show it for different model sizes. And yeah, I wonder if you've thought about maybe going and evaluating these trends for

let's say, pragmatic design choices. So one thing I was thinking is, for very large models,

maybe it's not usable at deployment time. People would use, like, pruned models, you know, or optimized models. And it would be interesting there: do these trends hold, or do these pruned models inherently perform worse? So, yeah, I was wondering what sort of things you were thinking of perhaps looking into next, possibly including pruned models or anything else, really? Yeah.

Yeah, I think pruning is definitely an interesting research direction here. You could ask, how do these trends change as I prune the model after the fact more and more? And there's a lot of practical knowledge to be extracted there. I think that would be interesting. Yeah.

I think one obvious thing is we only did decoder-only transformers here. And in many of these problems, that's probably not the best architecture to be using, language being an exception, of course. Transformers have worked

well for language, you know; obviously there's other things out there like BERT and those sorts of things, but something in that vein seems to work quite well. But for generative image modeling, the decoder-only transformer is probably not the best choice of architecture. And in

our prior paper on scaling laws for neural language models, we looked at comparing the trend for the transformer and the LSTM. And I think that was informative in that, if you looked at loss as a function of model size, they seemed to have roughly the same exponent in the power law, but the...

LSTM had a different multiplicative constant in front of the power law. So on a log-log plot, they were parallel lines, but at every point, the LSTM had higher loss than the transformer. And so seeing if...

And so that, I mean, that's another thing that's surprising, that the two exponents were the same. But then, naturally, another question is to wonder: if we used something more natural for generative image modeling than a decoder-only transformer, maybe PixelCNN or something in that vein, would it have the same exponent and be offset in the same way? I'm pretty curious about that. Exactly. I was actually going to ask you that next, whether you're thinking of going to other architectures like convolutional neural nets.

So glad to hear that that's sort of, to you also, the obvious next step.

Yeah, absolutely. I think that's pretty much all I had to ask. There's a ton of interesting details in the paper. So again, just to mention, the title is Scaling Laws for Autoregressive Generative Modeling. You can find that on arXiv and take a look yourself. It's quite readable despite being a little bit technical. I think the empirical results are pretty easy to get.

Is there anything else you'd like to mention or highlight from the paper we haven't touched on yet? Yeah, I think those are the big things. I mean, I guess I would echo what you just said, which is that this paper is primarily focused on empirical results. You know, we do make some conjectures about how to interpret some of them. I'm excited about

you know, others pursuing theoretical work related to this, to have some ideas for why all these things might be power laws. My colleague on the paper, Jared, and one of his students have worked on, you know, sort of

a first pass at a theory for why this might be the case. But I think there's a lot more interesting stuff to pursue, not only in this line of empirical work, but also in seeing if we can extract some theoretical understanding from it, which I'd be excited about. Definitely. Yeah, that'll be really exciting to see. Well, in that case, I think we've got pretty much a good overview of the paper. Thank you so much for joining us on this episode, Tom.

Yeah, thanks. It was a pleasure. And thank you so much, listeners, for being with us on this episode of Skynet Today's Let's Talk AI podcast. You can find articles on similar topics to the one we discussed today and subscribe to our weekly newsletter at skynettoday.com. Subscribe to us wherever you get your podcasts and don't forget to leave us a rating if you like the show. Be sure to tune in to our future episodes.