From LinkedIn News, I'm Leah Smart, host of Every Day Better, an award-winning podcast dedicated to personal development. Join me every week for captivating stories and research to find more fulfillment in your work and personal life. Listen to Every Day Better on the LinkedIn Podcast Network, Apple Podcasts, or wherever you get your podcasts.
Is the AI field reaching the limits of improving models by scaling them up?
And what happens if bigger no longer means better? That's coming up with AI critic Gary Marcus right after this.
Welcome to Big Technology Podcast, a show for cool-headed and nuanced conversation of the tech world and beyond. We're joined today by AI critic Gary Marcus, the author of the book Rebooting AI and of the Marcus on AI newsletter on Substack. And he's here to speak with us about whether the AI industry is hitting the limits of scaling generative AI models up and what it means if we're truly seeing diminishing returns from making these models bigger. Gary, it's great to see you. Welcome to the show.
Thanks for having me. So the genesis of this episode is that I did an episode with Mark Chen from OpenAI about GPT-4.5. And you came into my DMs and said, listen, I want to give a rebuttal: scaling is basically over, and it's not exactly what OpenAI has said. Now, for those who don't know about the scaling laws, basically the idea is that the more compute and data you put into these large language models, the better they're going to get, basically predictably, linearly.
Well, exponentially was the idea. Right. And so the context here is that now we've seen almost every research house all but admit that that has hit the point of diminishing returns. I think Mustafa Suleyman was here; he pretty much admitted it. Thomas Kurian, CEO of Google Cloud, said that diminishing returns are happening. Yann LeCun has also talked about the fact that you're just not going to see as many returns from AI scaling as you did before. So
Just describe the context of what we're seeing right now. How big of a deal is it? And then what are the implications for the AI industry? Because this is the big question. I mean, how much better can these things get? Right. That is the big question with AI today. Well, I mean, I have to laugh because I wrote a paper in 2022 called Deep Learning is Hitting a Wall.
And the whole point of that paper is that scaling was going to run out, that we were going to hit diminishing returns. And everybody in the field went after me. A lot of the people you mentioned, I mean, LeCun did. Elon Musk went after me by name. Altman did. And they all did. Like, Altman said, give me the strength of a mediocre deep learning skeptic. So people were really pissed when I said that deep learning was going to run out. So it's amazing to me that a bunch of people have
conceded that these scaling laws are not working the way they used to. And they're also doing a bit of backpedaling. I think that Mark Chen interview, I can't quite remember the details, but I think it was a version of backpedaling and redefining things. So if you go back to 2020,
there were these papers by Jared Kaplan and others at OpenAI. And they said, look, we can just mathematically predict how good a model is going to be from how much data there is. And then there were the so-called Chinchilla scaling laws. And everybody was super excited. And basically, people invested half a trillion dollars assuming that these things were true. They made arguments to their investors or whatever. They said, if we put in this much data, we're going to get here. And they all thought that...
here in particular was going to mean AGI eventually. And what happened last year is everybody was disappointed by the results. So we got one more iteration of scaling after 2022 that worked really well. And we call that GPT-4 and all of these models that are sort of like that. So I wrote that paper around GPT-3.
We got another iteration of scaling. So, right, three was scaled up compared to two; it was much better. Two was scaled up compared to one; it was much better. Sorry, much more data meant much better. But what is... what is much better?
Well, I mean, one way to think about it is you didn't need a magnifying glass to see the difference between GPT-2 and, we didn't call it GPT-1, but the original GPT. And you didn't need a magnifying glass for GPT-4 as opposed to GPT-3. It was just obviously better. What a lot of people thought is that we would
pretty quickly see GPT-5 and a lot of people raced to build it. So OpenAI tried to build GPT-5 and they had a thing called Project Orion and it actually failed and eventually got released as GPT-4.5.
So what they thought was going to be GPT-5 just didn't meet expectations. Now, they could slap any name on any model they want. And in fact, lately, nobody understands how they're naming their models. But they haven't felt like any of the models that they've worked on since GPT-4 actually deserve the name GPT-5. And it didn't meet the performance that these so-called mathematical laws required. And what I said in that paper is they're not really mathematical laws.
They're not physical laws of the universe like gravity. They're just generalizations that held for a little while. Like, a baby may double in weight every couple of months early in its life. That doesn't mean that by the time it's 18 years old it's going to weigh 30,000 pounds. And so we had this doubling for a while, and then it stopped. And we can talk about why. But the reality is it's not really operative anymore. So there have been efforts to kind of misdirect and shift direction. So...
I think everybody in the industry quietly or otherwise acknowledged that, hey, we're not getting the returns that we thought anymore. And nobody's been able to build a so-called GPT-5 level model. That's a big deal, right? I'm a scientist. I was originally a scientist. As a scientist, we have to pay attention to negative results as well as positive results. So when 30 people try the same experiment and it doesn't work, nature is telling you something. And everybody tried the experiment,
of building models that were 10x the size of GPT-4, hoping to get to something they could call GPT-5, that was like a quantum leap better than GPT-4. They didn't get there. So now they're talking about scaling inference time compute. That's a different thing. Before we get there, I just want to talk to you. I want to test your theory here.
So it's not that scaling is over, right? I don't think anyone that we're talking about says scaling is over. Basically, what they're saying is if you want to make the model better, and I think that means more intelligent, more conversational, even more personable,
You can still do it by scaling. I think the thing that they admit, though, is that it takes much more compute and much more data to get the same results that you got in the previous loops. So let's clarify two things.
One is that what people originally talked about with scaling was a mathematically predictable relationship between performance and amount of data. You can go back and look at the Chinchilla paper, the Jared Kaplan paper, and lots of things that were posted on the internet. There were papers, even t-shirts, saying scale is all you need.
And you looked at that t-shirt and it had equations from the Jared Kaplan paper and it said, you know, here's the exponent, you can fit the equation. If you have this much data, this is the performance you're going to get. And there were a bunch of papers, a bunch of models that actually seemed to fit that curve, but it was an exponential curve. And
What's happening now is, yeah, you add more data, you get a little bit better, but you're not fitting that curve anymore. We've fallen off the curve. That's what it really means to say that scaling isn't working anymore. If I drew the curve for you, it was going up and up and up really fast as a function of how much data or how much compute you had, and now it's not going up that way anymore.
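To make the shape of that claim concrete: the Kaplan and Chinchilla papers fit smooth power-law curves relating loss to parameter count N and training tokens D. A rough sketch of the Chinchilla form, with the published constants quoted approximately, is:

```latex
% Chinchilla-style scaling law (Hoffmann et al., 2022); constants are approximate
L(N, D) \approx E + \frac{A}{N^{\alpha}} + \frac{B}{D^{\beta}},
\qquad E \approx 1.69,\; A \approx 406.4,\; B \approx 410.7,\; \alpha \approx 0.34,\; \beta \approx 0.28
```

"Falling off the curve" in the sense described here means the loss you actually observe at a new scale stops matching what the fitted right-hand side predicts.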
You added a bunch of compute and you got this much better performance. And this is how people justified running these experiments that cost a billion dollars. They're like, I know what I'm going to get for the billion dollars. And then they ran the billion dollar experiments and they didn't get what they thought they would. Yeah, you get a little bit better, but that's what diminishing returns means. Diminishing returns means you're not getting the same bang for your buck as you used to. That's where we are now. So anytime you add a little piece of data, the model is going to do better.
excuse me, on that piece of data. But the question is, does it generalize and give you significant gains across the board? We were seeing that, and we just aren't anymore. So is there still a path for these models to become much more performant? I mean, let's say you do supersize these clusters to the point that they're insanely bigger than they were previously. Let's talk about, like, Elon Musk's one-
million GPU cluster. Well, let's look at what Elon got for his money, right? So he built Grok 3 and by his own testimony, it was 10 times the size of Grok 2. It's a little better, but it's not night and day, right? Grok 2 was night and day better than the original Grok. GPT-4 was night and day better than GPT-3. GPT-3 was night and day better than GPT-2. Grok 3 is like, yeah, you can measure it. You can see that there's some improvement in performance. But for 10x
the investment of data, compute, and not to mention costs of energy to the environment, it's not 10 times smarter by any reasonable measure. It just isn't. Okay. And so this would be the point where I say, well, then this entire AI moment is done. However... Well, it's this moment. There will be other AI moments, but this one... I'm setting it up to say that it's not because...
Like you mentioned, you're talking about test time compute. That's another way to say reasoning, I think, which is these models... Well, I'm going to give you a hard time about that. But I mean, people do do that. But with reasoning or test time compute, you'll help me figure out the finer details. What these models are doing is they're trying to find an answer and they're checking their
progress and deciding whether it's a good step or not, and then taking another step and another step. And we've seen that they have been able to perform much better when you put those reasoning capabilities on top of these large models, which has enabled these research houses to continue the progress in some way. Let me give you, but it's not really you, it's these companies, some pushback on that. So it is true that you can build
a model that will do better if you put more compute on it. But it's only true to some degree. So I'll get to whether it's actually reasoning or not. But it turns out that on some problems, you can generate a lot of data in advance. And for those problems, adding more test time compute seems helpful. There was a paper this weekend that's
Because I'm calling some of this into question. By the way, just to explain to folks, test time is when the model is giving an answer. That's what test time means. That's right. So you have these models now, like o3 and o4, that will sometimes take like 30 seconds or five minutes or whatever to answer a question. And sometimes it's absurd, because you ask it, like, what's 37 times 11, and it takes, you know,
you know, 30 seconds. You're like, my calculator could have done it faster. But we'll put aside that absurdity. In some cases, it seems like time well spent, sometimes not. But if you look carefully, the best results for these models are almost always on the same things, which are math and programming.
And so when you look at math and programming, you're looking at domains where it's possible to generate what we call synthetic data, and to generate synthetic data that you know is correct. So, for example, on multiplication, you can train the model on a bunch of multiplication problems where you can figure out the answer in advance. You can train the model on what it is that it should predict.
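As a minimal sketch of what verified synthetic data can look like in a closed domain like multiplication (the prompt format here is illustrative, not any lab's actual pipeline): the generator computes the ground-truth answer itself, so every training pair is correct by construction.

```python
import json
import random

def make_multiplication_examples(n: int, max_value: int = 999, seed: int = 0):
    """Generate prompt/answer training pairs for multiplication.

    Because the generator computes a * b itself, every target is verified
    by construction -- no human labeling or external checking is needed.
    """
    rng = random.Random(seed)
    examples = []
    for _ in range(n):
        a, b = rng.randint(2, max_value), rng.randint(2, max_value)
        examples.append({
            "prompt": f"What is {a} times {b}? Answer with just the number.",
            "answer": str(a * b),  # exact ground truth
        })
    return examples

if __name__ == "__main__":
    print(json.dumps(make_multiplication_examples(3), indent=2))
```

The same trick works anywhere an answer can be checked mechanically, unit tests for code, a solver for algebra, which is part of why the gains from this approach cluster in math and programming.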
And so on these problems, in what I would call closed domains, where we can do verification as we create the synthetic data, we verify that the answer we're teaching the model is correct. The models do better. But if you go back and you look at the o3, sorry, the o1 paper, even then you could already see that the gains were there but not across the board.
they reported that on some problems, O1 was not better than GPT-4. It's only on other problems, these cut and dry problems with the synthetic data, that you actually got better performance. And I've now seen like 10 models and it always seems to be that way.
We're still waiting for all the empirical data to come in, but it looks to me like it's a narrow trick that works in some cases. The amazing thing about GPT-4 is that it was just better than GPT-3 on almost anything you could imagine.
And GPT-3, the amazing thing is it was better than GPT-2 on almost anything you can imagine. Models like o1 are not systematically better than GPT-4. They're better in certain use cases, especially ones where you can create data in advance. Now, the reason I wouldn't call them reasoning models, though you're right that many people do, is what I think they're doing is basically copying patterns of human reasoning. They're getting data about how humans reasoned about
certain things. But the depth of reasoning there is not that great. They still make lots of stupid mistakes all the time. I don't think that they have the abstractions that we think, for example, a logician has when they're reasoning. So it has the appearance of reasoning, but it's really just mimicry. And there are limits to how far that mimicry goes. I'll give you just one more example: o3 apparently hallucinates more than the models that came before it. Which is stunning. Like, how does that happen?
I mean, that's a good broader question, which is that our understanding of these models is still remarkably limited. So the technical term, or one technical term... Interpretability. Well, I was going to give you a different one, which is black box. Okay. But they're closely related, those two terms. Interpretability is trying to figure out what's going on in the black box. If you can at all. I mean, I'd almost put it another way, which is that it is a black box. But isn't the black box the thing in the plane that tells you what actually happened?
Well, that's a different thing, right? So a black box in a plane is actually a flight recorder that records a lot of data. But what we mean in machine learning by a black box is you have a model where you have the inputs and you have the outputs. You know how you calculate them, but you don't really understand how the system gets there. So in this case, you're doing all this matrix multiplication, and nobody really understands it.
And so nobody can actually give you a straightforward answer for why o3 hallucinates more than GPT-4. We can just observe it. That's what happens with black boxes: you empirically observe things and you say, well, it does that, but you don't really know why, and you don't really know how to fix it either.
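A toy illustration of what black box means in this sense (a sketch with random weights, not a real model): every arithmetic step below is fully specified and reproducible, yet nothing in the individual weights explains why a particular output came out.

```python
import numpy as np

rng = np.random.default_rng(0)

# A tiny two-layer network with fixed random weights: the computation is
# completely known, but the role of any individual weight is opaque.
W1 = rng.standard_normal((8, 16))
W2 = rng.standard_normal((16, 3))

def forward(x: np.ndarray) -> np.ndarray:
    hidden = np.maximum(0.0, x @ W1)  # ReLU over a matrix multiply
    return hidden @ W2                # another matrix multiply to the output

x = rng.standard_normal(8)
print(forward(x))  # we can observe the output, not a human-readable reason for it
```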
Another example, just in the last couple of days: apparently Sam Altman reported, I forget, the new model is stubborn, or what was it? I forget. No, it's not stubborn. It's a bro. It's a bro. But that's GPT-4o. It just became very fratty. Became very fratty. And, right, you would be like, what's going on? Like, help me with this. And it's like, yo, that's a hell of a good question, bro. And they're like, we don't know why this happened. And they rolled it back completely. Yeah, exactly. Or I thought they only partly rolled it back, rolled it over or whatever. No, no. Sam said it's now the latest iteration. It's been completely rolled back. Completely rolled back. So, right, that was what I would call, again, empirical. Like, they tried it out and it didn't work, or it worked in a way that irritated people. Right. And so we don't know in advance; there's a lot of just, like, try it, because that's how black boxes work. And we have some things, but those things are not very strong. So the scaling, quote, laws
were empirical guesses about how these models work and they were true for a little while, which was amazing, and they're not true anymore, which is also amazing in a way. So we don't know what's going to happen from the black boxes. - Right, okay, so but let me now sort of-- - And sorry, let me come back to one other thing quick, which is interpretability. So that's a very closely related notion.
Let's say you look at a GPS navigation system. That's a piece of AI that's very interpretable. So you can say it is plotting this route. It says, you know, you can go this way, you can go that way. This is the function that it's maximizing. This is the database it's using. This is how it looks up the data. We don't have any of that in these so-called black box models. We don't really know what the database is that it's consulting. It isn't exactly consulting a database at all. And we don't know how to fix it. And so, you know,
Dario Amodei, who's the CEO of Anthropic. We just talked about this on the show. You actually praised his interpretability post. That's right, his call for interpretability. I'll be honest, I haven't read the paper yet. I just read the title, so bad on me. But the title of his paper was something like On the Desperate Need for Interpretability. That captures it. And I think he's right. I've said this myself too. In my last book, I talked about interpretability being really important. The only difference between Dario and me on this point is
We both think that we're screwed as a society if we stick with uninterpretable models. He just thinks that LLMs will eventually be interpretable. And his company, to be fair, has done the best work on interpretability of LLMs that I'm aware of. Chris Olah, I think, is brilliant.
But they haven't got that far. They've gotten further than anybody else. But I don't think we're ever going to get very far into the black box. And so I think we need to start over and find different approaches to AI altogether. Right. So, Gary, if I'm listening to what you're saying on this show so far, it is basically after GPT-4, we haven't made a lot of progress. However, a little bit. But let me just do the pushback here, which is, I mean, if you think about what it's like using these models after GPT-4,
They are significantly better. I'll give you one example.
I was using o3, this new reasoning model or test time model, whatever you want to call it. And I'm in it and I'm doing crazy things and it's exceptionally helpful. So I put a photo of myself on a rock climbing wall and said, what's going on? And it was able to look at my form, where my body was, what my posture was, and, like, analyze all these things and give actually helpful coaching tips, which you never would have had with
with GPT-4. Then you think about what Claude is doing, the Anthropic bot. I was with some friends last night, and this is what we do for fun: I vibe coded a retirement calculator directly in Claude. It took like 10 minutes. We took a bank statement, we got a line graph of the person's balances, a bar graph of their expenses, a financial plan. And then we coded a retirement calculator based off of the data that we had there.
You also have PhDs that are now adding their unique insights into these models for training. They just basically are sitting and writing down what they know and the model is absorbing it. So we are seeing, I would call it vast improvement over the GPT-4 models.
So, I mean, there are a couple of different ways to think about that. One is, on a lot of benchmarks there are improvements, but there are also issues of data contamination. Alex Reisner wrote an excellent piece in The Atlantic about the issues of data contamination. And we've seen a lot of studies where people are like, well, we tried it in my company, it's not really that much better. So they're better on the benchmarks. Are they better in general? Not so clear. There was a new benchmark recently
released by a company called Vals AI or something like that, which The Washington Post talked about yesterday, where they looked at things like, can you pull out a chart based on a series of financial statements, SEC statements from a bunch of companies? And these systems all claim to do it, but accuracy was under 10%. And overall, on this new benchmark, accuracy was at 50%. Would these new models be better than GPT-4? Maybe, but they weren't that good. So
I think people tend to notice when they do well, they don't notice as much when they do poorly. And although I think there's been some improvement, there has not been the quantum leap that people were expecting. We have not moved past hallucinations. We have not moved past stupid reasoning errors. If you go back to my 2022 paper, Deep Learning is Hitting a Wall,
I didn't say there'd be no progress at all. What I said is we're going to have problems with hallucinations. We're going to have problems with reasoning, planning until we have a different architecture in some sense. And
I think that that's still true. We're still stuck on the same kinds of things. So if you have your deep research write a paper, it's going to make up references. Okay. It's probably going to make up numbers. Like, you know, did you actually go back and check? So for example, whatever it's called, they all have similar names now, whatever Grok's version is: Deep Search, Deep Research. Yeah. Something like
Deep Research Mini 06. I won't be convinced that we have AGI until these companies learn how to call Deep Research something other than Deep Research. They all use the same exact name. It's really bizarre. So whichever version Grok has, I asked it, for example, to list all of the major cities that were west of Denver.
And to somebody who wasn't paying attention, it'd be super impressive. But because I really wanted to know how well it was working, I checked, and it had left out Billings, Montana. So you got a list that looks really good, and then there are errors. This often happens. And then I had a crazy conversation with it after that. I said, what happened to Billings? And it said, well, there was an earthquake there on February 10th or whatever. And I looked it up in the seismological data. I used Google because I wanted to...
have a real source, or DuckDuckGo. And there was no earthquake then. And I pushed it on that, and it said, well, I'm sorry for the error, or whatever. So we're still seeing those kinds of things. We may see them less.
But they are still there. We still have those kinds of problems. So I don't doubt that there's been some improvement, but the quantum leap across the board that people were hoping for is not there. The reliability is still not there. And there are still lots of subtle errors that people don't notice. And then, you know, if you want to talk to me about retirement calculators, there are a lot of those on the web. So the easy cases for these systems
are the ones where the source code is actually already there on the web. Like, Kevin Roose talked about this example of having, he, quote, "vibe coded" a system to look in a refrigerator and tell him what recipe to make. But it turns out that app is already there on the web, and there are demos of that with source code. And so if you ask a system to do something that's already been done,
That's always been true with all of these systems. Their sweet spot is regurgitation. And so, yeah, they can build the stuff that's out there. But if you want to code things in the real world, you usually want to code something that's new. And these systems have a lot of problems with that. Another recent study
excuse me, showed that they're good at coding, but they're not good at debugging. And like coding is just the tiniest part of the battle, right? The real battle is debugging things and maintaining the code over time. And these systems don't really do that yet. - But then, you know, search has made them more reliable.
When these bots are able to search the web, they are now starting to give you lots of links in the actual answers. I still get people daily sending me examples of, you know, it hallucinated these references. I'm not saying hallucinations have been solved, but for me, like, I will use it. It's an incredible research assistant. And then when it links out to things and I'm not sure of those figures, I'll then go to the primary sources and start reading.
I mean, good on you that you go to the primary source. I worry the most about people who don't. And we've seen countless lawyers, for example, get in trouble using these systems. Has it been countless? I just heard of one. Oh, no, no, no. There are many more than that. There are some in the US, there are some in Canada, I think there was just one in Europe. I mean, it's not really countless; one could sit there and count them. But it's got to be at least a dozen by now.
And whether this is going to be, all right, I think we can both agree on this, that whether this is the end of progress or towards the end of progress or whether there's a lot more progress, there's a real problem of people outsourcing their thinking to these bots. Well, Microsoft did a study, in fact, suggesting that critical thinking was getting worse as a function of them. And that wouldn't be too surprising. We have a whole generation of kids who basically rely on these bots and who don't really know how to look at them critically.
you know, in previous years, we were starting to get too many kids relying on whatever garbage they found on the web, basically. And I mean, chatbots are basically synthesizing the garbage that they find on the web. And so we're not really teaching kids critical thinking skills. And nowadays, like, the idea for many kids of writing a term paper is: I typed a prompt into ChatGPT, and then maybe I made a couple of edits and I turned it in. You're obviously not learning how to actually think or write in that fashion. A lot of these tools,
I think, are best used in the hands of sophisticated people who understand their limits. So coding has actually been, I think, one of the biggest applications. And that's because coders understand how to debug code. And so they can take the system. Basically, it's just typing for them and looking stuff up. And if it doesn't work, then they can fix it. Right. The really dangerous applications are like when somebody asks for medical advice and they can't debug it themselves and something goes wrong.
Okay, so I'm going to take into consideration all the things that you've said so far and see if I can get a sense as to where you think we're heading. It seems like there was a push to just make these models better based off of scale. That could be things like the 300,000 GPU cluster I think Meta used for Llama 4, or it could be the million-GPU cluster that Elon's built for Grok.
And what you're saying is that's been maxed out, pretty much. Let me be more careful: it's not maxed out, it's just diminishing returns, diminishing returns. So the point that I'm trying to make here is, you don't believe that there's going to be anyone that's going to build a bigger
GPU data center than that because if you're seeing diminishing returns from something that costs billions of dollars, it doesn't make sense to invest. Well, wait a second. I'm not saying people are rational. I think that people will probably try at least one more time. They'll build things. Probably Elon will build something that's 10 times the size of Grok 3, which will be huge and it will have a serious impact on the environment and so forth. It's not just GPUs. Also, it's data.
Right. Like how much more data? Well, let's come to the data separately. Yeah. And so I think people will actually try. Right. I think Masa has just bankrolled Sam to try. I just don't think they're going to get that much for it. I don't think they'll get zero. I mean, there will be tangibly better performance on certain benchmarks.
and so forth, but I don't think that it's going to be wildly impressive, and I don't think it's going to knock down the problems of hallucinations, boneheaded errors. So here's what I'm getting at: that's not going to feel much better than what we have today. It doesn't seem like you believe that reasoning is going to make the bot feel much better than what we have today. Not the kind of reasoning... There's no emergence, there's no emergent coding. So are you basically saying that
What we have in AI today, this is it? For a while, I guess. I mean, look, I put out some predictions last year in March that people can look up that I had on Twitter. And those predictions include, I said, there'd be no GPT-5 this year, or if it came out, it would be disappointing. It's supposed to come in summer.
Well, this was last year. So I said, in 2024, we won't see this. And that was a very contrarian prediction at that point, right? This was a few weeks after people had said, oh, I bet GPT-5 is going to drop at the Super Bowl, like right after the Super Bowl. Won't that be amazing? So people really thought it was going to come last year, if you go back and look at what they said on Twitter, et cetera. And it didn't. And I correctly anticipated that it wouldn't.
And I said, we're going to have a kind of pileup where we have a lot of similar models from a lot of companies. I think I said seven to 10, which was sort of roughly right. And I said we were going to have no moat because everybody is doing the same thing. And the prices were going to go down. We have a price war. All of that stuff happened. Now, maybe we get to so-called GPT-5 level this year. Keeps getting pushed back.
I don't know if we'll get much further than that without some kind of genuine innovation. And I think genuine innovation will come. But what I think is we're going down the wrong path. Yann LeCun used this notion of, you know, we're on the exit ramp. How do you say it? Large language models are the off ramp to AGI.
You know, they're not really the right path to AGI. And I agree with him, or you could argue he agrees with me because I said it, you know, for years before he did, but we won't go there. The broader notion is sometimes we make mistakes in science. I think one of the most interesting ones was that people thought genes were made of protein for a long time. So in the early 20th century, lots of people tried to figure out what protein a gene is made of. It turns out it's not made of a protein. It's made of a nucleic acid that everybody now knows,
called DNA. So people spent 15 or 20 years really looking at the wrong hypothesis. I think that giant black box LLMs are the wrong hypothesis. But science is self-correcting. In the end, if people put another $300 billion into this and it doesn't get the results they want, they'll eventually do something different. Right. But what you're forecasting is basically an enormous financial collapse.
That's right. I don't think LLMs will disappear. I think they're useful. But the valuations don't make sense. I mean, I don't see OpenAI being worth $300 billion. And you have to remember that venture capitalists have to, like, 10x to be happy or whatever. Like, I don't see them, you know, IPOing at $3 trillion. I just don't. No, it's interesting, because I almost see the OpenAI valuation as the one that makes the most sense, because they have a consumer app.
The place that I start to get worried is, if what you're saying is correct, that we're not going to see any more, if we're seeing real diminishing returns from scaling and this is basically where we are, then there's real worry for companies like NVIDIA, which has basically risen on the idea of scaling. I mean, they're down a third this year or something like that. They're at 2 point something, 2.5 trillion last time I looked. They're a genuinely good company. They have a wonderful ecosystem. They're worth a lot of money. I mean, I don't,
I don't want to put an exact figure, but I'm not surprised that they fell and I'm not surprised that they're still worth a lot. No, but this is the thing.
If we end up seeing the fact that this next iteration, the $10 billion that Sam is going to spend seemingly on the next set of GPUs, if that doesn't produce serious results... That's going to hurt NVIDIA. That will cause a crash in NVIDIA because so much of the company's demand is coming based off of this idea that scaling is going to work. So they have multiple problems, both OpenAI and NVIDIA. So one is...
It does look to me like we're hitting diminishing returns. It does not look to me like this inference time compute trick is really a general solution. It doesn't look like hallucinations are going away. And it does look like everybody has the same magic formula. So everybody's basically doing the same thing. They're building bigger and bigger LLMs. And what happens when everybody's doing the same thing? You get a price war. So DeepSeek came out and OpenAI dropped its prices quite a bit. And so every...
Because everybody, I mean, not literally everybody, but 10, 20 different companies all basically have the same idea or are trying the same thing,
You have to have a price war. Nobody has a technical moat. OpenAI has a user moat. They have more users. That's the most valuable thing they have. That is the most valuable thing. I would say the API is close to worthless. I don't know if worthless is the right word, but it's not worth very much. It's not a unique product. It's the brand name that is most valuable. I also think it's the best bot right now.
It might be. I mean, I think people go back and forth. Some people, some days, say it's Claude. I've been on the Claude train for a long time. And now you're on the ChatGPT train. And I'm on ChatGPT, I think. What I think is going to happen is you're going to have leapfrogging. But the leaps aren't going to be as big as they were. So four was a huge leap. I mean, this is a different way of saying it. It was a huge leap over three. You know, let's say...
I can't even keep up with the naming scheme. GPT-4.1, let's say, is better than Grok 3.7, let's just say, hypothetically. And so people run to this side of the room. And then, you know, Claude, whatever, 3.8.1 or whatever, will be a little better. And then some people will run to that side of the room. Yeah.
But nobody's going to be able to charge that much money because the advances are going to be smaller. And people start to say, well, you know, I use this one for coding and this one for brainstorming and whatever. But nobody anymore says this is just like dominant. Like GPT-4 was just dominant. When it came out, there was nothing as good as it. For anything, if you wanted this kind of system, you used it, right? I mean, that's my memory of it.
I don't hear any of the ChatGPT models or whatever, I can't even keep up with the names anymore, any of those products, any of the OpenAI products, being referred to in the same kind of hushed tones, like they're just better. And like, you know, Google's still in this race and they may undercut on price. Meta's giving stuff away. People are building on it. DeepSeek, I hear, has something new that's going to be better than ChatGPT. And, you know, maybe it's true, maybe it's not. But we were...
We're in this era where the differences between the models are just getting really small. I want to ask you when you're going to admit that you were wrong about things or if you ever will. Which things? Which things? I think that... But I also realize that the question doesn't really hit...
Because I just want to say, we spoke the last time you were here. I think you've been on the show two times, once with Blake Lemoine, once one-on-one. And, because it's interesting, I think you're one of the most outspoken AI critics. And you say a lot of the things that we say here on the show, which is that AGI is marketing. And even if we don't hit AGI, there's still a lot to be concerned about, whether that's the BS that people are talking about or being able to use these models for,
you know, for nefarious purposes by churning out, like, content. Like, I don't know if you saw, there was this study where the University of Zurich tried to fool people on Reddit, or tried to convince people on Reddit, based off of answers by a GPT. And it convinced more people — this is the new persuasion, the persuasion study. I'm aware of it, yeah. So I guess, like, to me, it does seem like it's kind of tough to be a critic of LLMs right now, because they have been getting so much better, but yeah,
I don't know. Just sort of, like, I mean, people say, Gary, you're wrong. And I say, well, here are the predictions I actually made. Like, I've actually reviewed them in print, and I ask people who say that I'm wrong to point to what I said that was wrong. I think that sometimes people confuse my skepticism with other people's skepticism.
But I think if you look at the things that I have said in print, they're mostly right. And, you know, Tyler Cowen said, you're wrong about everything. You're always wrong. And I said, Tyler, can you point to something? And he said, well, you've written too much, I can't do it. Well, I looked through some of your stuff and I do think that sometimes it seems like you might have
put, like, this enormous burden of proof on the AI industry. Like, you do sometimes pick out everyone that says, like, AGI is coming this year, and you're like, these people are liars. But that being said, like, I think your core arguments about scaling- - Well, some people are wrong. I've offered to put up money. I offered Elon Musk a million dollars. - Elon, a million, right? - And I offered criteria, I'll tell you about that. In May 2022, I offered him a $100,000 bet. Later I upped it to a million dollars.
And I put out criteria on Twitter. I said, I'm going to offer these. Do these make sense to you? And everybody on Twitter, not everybody, nearly everybody on Twitter at the time said those were fine. Like, people accuse me of goalpost shifting, but my goalposts are the same, right? My 2014 goalposts,
the paper, the article in The New Yorker where I talk about a comprehension challenge, I've stuck by that. That is part of my AGI criteria. I made a bet with Miles Brundage on the same criteria, which, to his credit, he actually took.
But when I put them out in 2022, this is the important part, everybody was more or less in agreement that those were reasonable criteria. And I said, if you could beat my comprehension challenge, which is to say, you know, watch movies, know when to laugh, understand what's going on; if you could do the same thing for novels; if you could translate math
from English into stuff you could formally verify; if you could go into a random kitchen, you know, operating a robot, and make a dinner; if you could, what was the other criterion? Oh, write, I think it was, 10,000 lines of bug-free code. I mean, you could do debugging to get there, whatever, you know. Okay. If you could do like three out of five, we'll call that AGI. And at the time, everybody said, that's fine.
Now people are backtracking. Like, Tyler Cowen said o3 is AGI. By what measure? I felt that that was kind of a stretch. That was cheesy. And he said the measure was him; it looked like AGI to him. He invoked the classic line about pornography: I know it when I see it. But people have pointed out lots of problems with o3. I think it's absurd to call o3 AGI. I wouldn't call it AGI.
So, you know, you a minute ago said, Gary, you're wrong. But then you ticked off a bunch of things I'm actually right about. I didn't say, Gary, you're wrong. I said, is there a point you'll admit you're wrong? Yes, there is. It's the point at which I'm wrong. So let me clarify one other thing. But let me just say, I didn't say that you're wrong. I just said, like,
What is the point of advance that you would say, okay, I've been wrong about this stuff? Because I have listened to some of your... Let me clarify something. But I also, right after I said that, I was like, you know, it's kind of like a tough question. And then I explained where I agreed with you. Yeah, that's what happened. So some people take me as saying that AI is impossible. And that's not me, right? I actually love AI. I want it to work. I just want us to take a different approach.
I want us to take a neurosymbolic approach, where we have some elements of classical AI, like explicit knowledge, formal reasoning, and so forth, that people like Hinton have kind of thumbed their nose at, but that, say, Demis Hassabis has used very effectively in AlphaFold. So we can get into that if you want. If we get to AI, the question about whether I'm right or not depends on how we get there. So I've made some pretty particular guesses about it, and I have guessed that pure LLMs will not get us there, pure large language models. So
will I concede I'm wrong when we get to AI that actually works? It depends on how it works. Okay. Yeah. And I think it's clear that, I mean, I don't know, we could watch this back in a couple of years. If we get there with pure LLMs, if another round of scaling, you know, gets us to AGI by the criteria that I laid out, then I will have to concede that I was wrong.
Okay. All right. I'm going to take a quick break and then let's come back and talk a little bit more about the current risks and maybe read some of your tweets and have you expand upon them. We'll be back right after this. And we're back here on Big Technology Podcast with AI skeptic, Gary Marcus. Gary, let me ask you this. So, you know, one of the things we talked about last time you were here was that AI doesn't have to reach the AGI threshold to
to be something that we should be concerned about. Absolutely not. And a lot of the focus was on hallucinations. You and I both, I think we have a little bit of a diverging opinion on hallucinations. I think they've gotten much better. You'd think it's still a big problem. Those could both be true, by the way. That could both be true. All right. So let's put a pin in that for now. I think where I'm seeing the most concern is virology.
We just had a study that came out that showed that AI is now at PhD level in terms of virology. We had Dan Hendrycks from the Center for AI Safety here. We talked about the fact that AI can now walk virologists through how to create or enhance the function of viruses. And we're starting to see some of these AI programs, like you mentioned, DeepSeek, and
be available to everybody, be pretty smart and be released without guardrails or not enough guardrails, especially if they're open source. So what are you worried about here? Is that the core concern or is there other stuff?
I think there are actually multiple worries, and different worries come from different architectures, and from architectures used in different ways, and so forth. So dumb AI can be dangerous. If dumb AI is empowered to control things like the electrical grid and it makes a bad decision, that's a risk, right? If you put a bad driverless car system in,
you know, a million cars, a lot of people would die. Right. The main thing that has saved a lot of people from dying in driverless cars is there aren't that many of them. And so, you know, even though they're not actually super safe at the moment, we restrict where we use them and so forth; we don't put them in situations where they wouldn't be very bright.
So dumb AI can cause problems. Super smart AI could maybe lock us all in cages if it wanted to. I mean, we have to talk about the likelihood of it wanting to, but there are definitely worries there and we need to take them seriously. And then you have things that are in between. So for example, the virology stuff is AI that's not generally all that smart, but it can do certain things. And in the hands of bad actors, it can do those things. And I think it is true that
either now, or will be soon enough, that these tools can be used to help bad actors create viruses that cause problems. And so I think that's a legitimate worry, even if we don't get to AGI. So dumb AI right now is a problem. Smarter AI, even if it's not AGI, can cause a different set of problems. And, you know, if we ever got to superintelligence,
that might open a different can of worms. I mean, you can think like, you know, human beings of different degrees of brightness and with different skills, if they choose to do bad things can cause different kinds of harm. And so what's your view on open source then?
I worry about it. I do worry about it, because bad actors are using these things already. They're mostly using them for misinformation; I'm not sure how much biology they're doing, but they will, and they're going to be interested in that. You know, state actors that want to do terrorist kinds of things will do that. I am worried about open sourcing at all, and I think the fact that Meta could basically make that decision for the whole, you know,
world is not good. Like, I think there should have been much more government oversight. Scientists should have contributed more to the discussion. But now those kinds of models are open source. They've been released. We can't put that genie back in the bottle. And over time, just like people...
I should have said this earlier: even if the models don't get any better, we will still find new uses for them, and some of those new uses will be positive and some of them will be negative, right? We're still exploring what these technologies can do, and people are finding, you know, ways to make money in dubious ways and to cause harm for various reasons and so forth. And so
giving those tools out very broadly has problems. On the other hand, I think what we've learned in the last three years is that the closed companies are not the ethical actors that they once were. So Google famously said, don't be evil, and they took that out of their platform. Microsoft was all about AI ethics, and then when Sydney came out, they're like, we're not taking this away, we're going to stick with it. Oh, they did kill Sydney, right? Sydney was this very...
I don't know, raunchy AI that tried to steal Kevin Roose from his wife. Yeah, I mean, they reduced what it could do, but they stuck with it in some sense. But
And, like, OpenAI said that we're a nonprofit for public benefit. Now they're desperately trying to become a for-profit that is really not particularly interested in public benefit. It's interested in money. And they may become a surveillance company, which I don't think is very... Is that what you're talking about with the advertising side? So basically they have a lot of private data, because they have a lot of users and people type in all kinds of stuff. And they may...
have no choice but to monetize that. And, you know, they've been showing signs of that. They hired Nakasone, who used to be at the NSA. They bought a share in a webcam company and they recently announced they're trying to build a social media company. They want, you know, they look like they're on a
a path to sell your data, your very private data, to, you know, whoever they want. It's concerning, because whatever data I gave to Facebook, I always used to think that this conversation around Facebook data was a little ridiculous because I didn't think I was giving that much information to Facebook, but I am giving OpenAI
a lot of information. I mean, there are a lot of people that treat it as a therapist. Well, the number one use is therapist companion. I don't use it as a therapist, but I'm, like, putting a lot of my work information in there. I read a great book called Privacy and Power, I'm blanking slightly on the title, by Carissa Véliz. And she had examples in there, like, people were taking data from Grindr and extorting people, right? Grindr is an app for gay people, if you don't know. And, yeah,
you know, that's still in our society, and, like, in some places it's acceptable; in other places, you know, people don't necessarily want to come out if they're gay, whatever. And so people have been extorting people with data from Grindr. Imagine what they're going to do. You know, people type into ChatGPT, like, their very specific sexual desires, maybe crimes they've committed, like...
People type in a lot of stuff. - Crimes they want to commit. - Crimes they want to commit. We have a political climate where
you know, conspiracy to commit a crime, or conspiracy, might be treated in a different way than it once was. And so just typing it into ChatGPT might, you know, get somebody deported. Who knows? Now I'm freaked out. I wouldn't personally use the system, because the writing is on the wall. And I think that they make some promises to their business customers, but not to their consumer customers. And
That stuff is available for them to do what they want with it. And they probably will, because that's how they're going to make money. Here's another way to put it: suppose I'm right about the things I've been arguing, and they can't really get to, you know, the GPT-7 level model that everybody dreamed of. They can't really build AGI. But they're sitting on this incredible treasure chest of data. What are they going to do? Well, if they can't make AGI, they're going to sell that data.
- This is why I always thought like when you take in a lot of money, you always have to pay that money back
in some way, and that changes the way you operate. That's right. I mean, look at 23andMe. They're out of business, and now that data is for sale. Who knows what's going to happen with the 23andMe data? I hope you're wrong about this one, but the history of the internet is... I can't see how I am wrong. Exactly. I'm not saying you are. I'm just saying I hope you are, because that would be bad. I hope I'm wrong too. But there is a level of... There are a lot of things I hope I'm wrong about. Gary, if people got freaked out about what Facebook was doing with your data, then if they overstep, there's going to be a major societal backlash.
Maybe. I mean, sometimes people just accommodate to these things. I've been amazed at how willing people are to give away all that information to Facebook. I don't use it anymore, but.
Let me ask you this. You quote tweeted one of these, so we'll get into a tweet here. You quote tweeted one of these tweets: is the push to optimize AI for user engagement just metric-chasing Silicon Valley brain, or an actual pivot in business model from "create a post-scarcity society God" to "create a worse TikTok"? This is basically what we're talking about, that that might be the pivot. Yeah, that's right. I think that was someone else's tweet that I quoted. Yeah, Daniel Litt. And you said, I've been basically telling you about this. Yeah, exactly.
So that's what it is. You also wrote this, saying the quiet part out loud: the business model of Gen AI will be surveillance and hyper-targeted ads, just like it has been for social media. That's right. And we were just talking about that. And what I was quote tweeting was something from Aravind Srinivas, if I'm pronouncing his name correctly, who's the CEO of Perplexity. And he basically, I said he's saying the quiet part out loud, he basically said, we're going to use this stuff to hyper-target ads.
You also said that companies like Johnson & Johnson will finally realize that Gen AI was not going to deliver on its promises. Have there been companies that have pulled back? Are you just using Johnson & Johnson as an example? That was based on a Wall Street Journal thing, and I may have failed to include the link because of Elon Musk's crazy notions around links. Elon, you've got to put the links in the... Elon, you've got to put the links in. Whatever else you do. That's right. So anyway, that was...
I was alluding to a Wall Street Journal report that had just come out, which showed that J&J had basically said, in so many words, I'll paraphrase it, they tried Gen AI, generative AI, in a lot of different things, and a few of them worked and a lot of them didn't, and they were going to stick to the ones that did, like customer service, and maybe not do some of the others. You have to go back a year and a half in history to when people thought Gen AI was going to do everything that an employee was able to do, basically.
And I think what J&J and a bunch of companies have found out is that's not really true. You know, they can do a bunch of things that employees do, but they can't typically do everything that a single employee does. And, you know, they're reasonably good at triaging customer service, and they're not necessarily good at creating, say, a careful financial projection. Okay. So, Gary, we have like five minutes left.
You said something in the first half about the path that you think needs to be taken to AGI. Can you explain what that is in as basic of a way as you can to...
you know, make it as simple to understand for anyone who's not caught up with the systems that you spoke about? Sure. So a lot of people will have read Danny Kahneman's book, Thinking, Fast and Slow. And there he talked about system one and system two cognition. So system one was fast and automatic, reflexive. System two was more deliberate, more like reasoning.
I would argue that the neural networks that power generative AI are basically like system one cognition. They're fast, they're automatic, they're statistically driven, but they're also error prone. They're not really deliberative. They can't sanity check their own work.
And I would say we've done that pretty well. But system two is more like classical AI, where you can explicitly represent knowledge and reason over it. It looks more like computer programming. And these two schools have both been around since the 1940s, but they have been very separate, for what I think are sociological and economic reasons. Either you work on one or you work on the other. People argue, or fight for graduate students and fight for grants and stuff like that.
So there's been a great deal of hostility between the two. But the reality is they kind of complement each other. Neither of them has worked on its own. So the classical AI failed, right? People build all these expert systems, but there were always these exceptions and they weren't really robust. You'd pay graduate students to patch up the exceptions. Now we have these new systems. They're not really robust either, which is why
OpenAI is paying Kenyans and PhD students and so forth to kind of fix the errors. The advantage of system one is it learns very well from data. The disadvantage is it's not very accurate — sorry, not very abstract. I should have said that slightly differently. The large language models and that kind of approach, transformers,
are very good at learning, but they're not very good at abstraction. You can give them billions of examples and they still never really understand what multiplication is. And they certainly never get any other abstract concept well. The classical approach is great at things like multiplication. You write a calculator and it never makes a mistake, but it doesn't have the same broad coverage and it can't learn new things. You can wire multiplication in, but how do you learn something new? The classical approaches have had trouble with that.
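A toy sketch of the hybrid idea being described here, with the neural side stubbed out rather than a real model: a statistical component proposes a structured answer, and a symbolic component executes or checks it exactly, so the final result inherits the reliability of the symbolic side.

```python
import ast
import operator

# Symbolic side: an exact evaluator for +, -, *, / expressions.
OPS = {ast.Add: operator.add, ast.Sub: operator.sub,
       ast.Mult: operator.mul, ast.Div: operator.truediv}

def evaluate(expr: str) -> float:
    """Exactly evaluate a simple arithmetic expression; reject anything else."""
    def walk(node):
        if isinstance(node, ast.Constant) and isinstance(node.value, (int, float)):
            return node.value
        if isinstance(node, ast.BinOp) and type(node.op) in OPS:
            return OPS[type(node.op)](walk(node.left), walk(node.right))
        raise ValueError("unsupported expression")
    return walk(ast.parse(expr, mode="eval").body)

# "Neural" side (stub): in a real system, a learned model would map the
# natural-language question to a candidate formal expression.
def propose_expression(question: str) -> str:
    return "37 * 11"  # hypothetical proposal, hard-coded for illustration

question = "What is 37 times 11?"
candidate = propose_expression(question)
print(question, "->", candidate, "=", evaluate(candidate))  # symbolic side gives 407 exactly
```

The interesting research questions are what the learned side should propose and how rich the symbolic side can be; this only shows the shape of the division of labor.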
And so I think we need to bring them together. And this is what I call neurosymbolic AI. And it's really what I've been lobbying for for decades. And I think it was hard to raise money to do that in the last few years because everybody was obsessed with generative AI. But now that they're seeing the diminishing returns, I think investors are more open to trying alternatives. And also, AlphaFold is actually a neurosymbolic
model, and it's probably the best thing that AI ever did. And so- Decoding proteins, protein folding. Yeah, figuring out the three-dimensional structure of a protein from a list of its amino acids. And so- Are you going to raise money to try to do this? I'm very interested in that. Let's put it that way. Masa-san, if you want to make use of your money... no, I'm kidding. You talking to him? Not at this particular moment. Okay. Masa, if you're watching, I don't know, try and help. Okay.
Okay, great. Well, Gary, can you shout out where to find your Substack? So if anybody wants to read your longer work on the state of AI, where should they go? Sure. So people might want to read my last two books, by the way: Taming Silicon Valley, which is really about how to regulate AI, and Rebooting AI, which was 2019, so it's a little bit old, but I still think it anticipates a lot of the problems around common sense and world models that we're still facing today. And then for kind of almost daily updates,
I write a Substack, which is free, although you can pay if you like to support me. And that's at garymarcus.substack.com. Okay, well, I'm a subscriber, Gary. Great to have you on the program. Thanks so much for coming. Thanks a lot for having me again. Yet again. Yet again, yet again. Well, we'll keep doing it. It's always nice to hear your perspective on the world of AI. So I always enjoy our conversations. Thanks for having me. Yes, same here. All right, everybody. Thank you for listening. We'll be back on Friday, breaking down the week's news. Until then, we'll see you next time on Big Technology Podcast.