
No, Apple's New AI Paper Doesn't Undermine Reasoning Models

2025/6/10

The AI Daily Brief (Formerly The AI Breakdown): Artificial Intelligence News and Analysis

People
Andrew Choi
Andrew White
Azeem Azhar
Francois Chollet
Gary Marcus
A cognitive scientist and professor emeritus who criticizes the current direction of AI research.
Henry Arith McQuine
Josh Gans
Kat Woods
Kevin Bryan
Kevin Roose
Linus Ekenstam
Lisan Al-Gaib
Mark Gurman
Matthew Berman
Nathan Snell
Nathaniel
Pliny the Liberator
Ruben Haseed
Topics
Nathaniel: Apple's AI strategy was absent at WWDC, but the paper it released on the limitations of reasoning models is worth discussing. I think Apple's goal is to give everyday users practical AI rather than complicated technology. However, Apple has struggled to execute on AI and progress has been slow. There were no major AI announcements at WWDC, and the updates to existing features weren't compelling. Even if O3 isn't truly reasoning, it can still play an important role in business. Whether a tool thinks or reasons doesn't matter; what matters is how much it helps.
Linus Ekenstam: I feel Apple has lost its way and needs to get back to basics. Apple's new design language is confusing and the user experience is poor. Apple needs a wholesale reinvention to turn things around; this WWDC was disappointing.
Mark Gurman: I think WWDC excelled on device integration and productivity features but lacked innovation on AI.
Azeem Azhar: I question whether a WWDC without AI features can really be called excellent.
Andrew Choi: I think Apple's lag in AI is a potential existential risk.
Ruben Haseed: Apple's paper proves that AI reasoning models don't actually reason; they just memorize patterns.
Henry Arith McQuine: Apple has fallen behind in the AI race and published a paper questioning whether AI matters.
Pliny the Liberator: Until Siri improves, I don't trust AI research papers coming out of Apple.
Andrew White: Apple's AI researchers are skeptical of large language models and have published multiple papers arguing for their limitations.
Gary Marcus: I don't believe large language models are a direct route to the kind of AGI that could fundamentally transform society.
Kat Woods: People often read only a paper's title and think they understand the results; Apple's paper does not deny that large language models can reason.
Lisan Al-Gaib: The models failed because of token limits, not because they lack reasoning ability. They actually recite the algorithm in plain text and in code.
Matthew Berman: A model's ability to write code dramatically changes its problem-solving capability.
Kevin Bryan: Apple's paper is really measuring self-imposed limits on reasoning, not reasoning itself. Performance strictly improves as reasoning tokens increase.
Nathan Snell: Large language models have limited reasoning capacity, but that doesn't diminish their value. I'm skeptical of AI research put out by Apple.
Francois Chollet: There is a fundamental gap between reasoning and pattern matching that shapes these systems' practical capabilities and behavior. We care about reasoning because it enables autonomous skill acquisition in new domains, not just imitation of existing skills.
Josh Gans: Reasoning models are doing important work in enterprise and academia, and they work exactly as people would expect.
Kevin Roose: There's a strain of AI skepticism that pretends it's still 2021 and nobody can actually use these tools.


Chapters
Apple's WWDC 2025 lacked significant AI advancements, falling short of expectations. Criticisms focused on the absence of compelling AI features, Siri's shortcomings, and a poorly received UI redesign. Investors express concern about Apple's lagging AI strategy.
  • Lack of significant AI announcements at WWDC.
  • Criticism of Siri's performance and the new iOS UI.
  • Investor concerns about Apple's AI strategy as an existential risk.

Transcript


Today on the AI Daily Brief, a look at Apple's non-existent AI strategy at WWDC, plus a deep dive on a very controversial paper from the Cupertino company that I think you can, in most cases, safely ignore. But which is probably still worth talking about anyways. The AI Daily Brief is a daily podcast and video about the most important news and discussions in AI.

All right, friends, quick announcements as always. First of all, thank you to today's sponsors, KPMG, Blitzy.com, Vanta, and Super Intelligent. As always, if you are looking for an ad-free version of the show, you can get it for just $3 over at patreon.com slash ai-dailybrief.

Also, I am traveling this week, which always means that there might be a little bit of variability in the show format. Obviously, you got an interview yesterday, and today we have a full episode dedicated to the main topic. There are a bunch of important headlines though, so we will most certainly be coming back to our normal format tomorrow. For now though, let's talk about WWDC and the illusion of thinking. Welcome back to the AI Daily Brief. And today, of course, we are talking about Apple.

First, we're going to talk about their non-existent AI at WWDC, but then we're going to spend more time on this paper that everyone is talking about, The Illusion of Thinking.

You can probably tell from my title how I feel about it, but that is for just a minute from now. First of all, however, let's talk about WWDC yesterday. Now, you may remember that last year, Apple finally came out of the gate and shared an AI strategy for the first time since the launch of ChatGPT.

It was, of course, Apple Intelligence, because Apple had to brand its own thing. And the idea of it was, in short, to provide regular everyday users with the use cases that actually would matter to them. AI that wasn't big and techie and burdensome, but was just useful.

The principle of it was good. It felt just like Apple. The problem has been in execution. None of the solutions they were talking about were really ready. Siri was an absolute disgrace. And basically, Apple has pushed nothing of note on Apple Intelligence, which just falls farther and farther behind. Now, expectations were already on the floor heading into this event when it came to AI specifically, because it basically seemed like they were going to forgo the topic entirely. And indeed, that's exactly what we got.

There were no big announcements like we've seen in previous years. AI Siri was completely absent from the conference. There were some minor feature updates and a new image model, but nothing really compelling was unveiled. We did, I guess, get a new numbering system for iOS releases. And we got a graphical redesign of iOS that has been universally maligned for being confusing and weird and not clearly having any particular purpose.

Reports were pretty grim out on the conference floor. Linus Ekenstam tweeted, Apple has clearly missed the mark far too many times now. I felt today was yet another one of those occurrences. Sadly, Apple is trying hard to do too much. There's too much fat. They need to trim it and get back to basics. Apple desperately needs to reinvent itself or become the new Nokia.

During the first 40 minutes, there was nothing that made me feel wow. Actually, it was one thing after another, leaving me with way more questions than answers. Genmoji, backgrounds in group messages, visual intelligence, Apple games. And what is up with the new unified design language? The Glass UI is a UX nightmare. Visual after visual in the presentation is worse than the previous one. Apple needs to go back to its roots: make a really good operating system, make really good scaffolding for others to make the apps, and stuff that lives on the device.

I'm completely underwhelmed. Apple needs a step change to their entire existence if things are going to turn around. Sure, I'm typing this on an Apple device because there are not a lot of options out there, but clearly this WWDC might go down as the most boring one ever.

Now, Bloomberg's Apple watcher Mark Gurman was a little more charitable. He said, excellent WWDC, cohesive story, deep integration and continuity across devices. Zero false promises, impressive new UI, and significant new productivity features on the Mac and iPad. But the lack of any real new AI features, despite that being my expectation, is startling.

Azeem Azhar said, can it really be excellent without an AI feature? And as I mentioned, Gurman is clearly in the minority when it comes to his thoughts on, for example, the new UI. Even investors who aren't as plugged into the tech scene are starting to see Apple's AI strategy for what it is: a crisis.

Andrew Choi, a portfolio manager at Parnassus Investments, commented: It's hard to argue that Apple's lack of standing with AI isn't an existential risk. If it can paint a future where it's integrating and commoditizing AI, that would be compelling, because otherwise, what is going to get people to buy their next phone for a lot more money?

Still, rather than a breathtaking conference rollout, Apple is trending on AI Twitter for a very different reason. They've just released a controversial new paper entitled The Illusion of Thinking: Understanding the Strengths and Limitations of Reasoning Models via the Lens of Problem Complexity. AI threader Ruben Haseed writes, Apple just proved AI reasoning models like Claude, DeepSeek R1, and O3 Mini don't actually reason at all. They just memorize patterns really well.

Now, Ruben actually went on to provide a lengthy explanation of the paper, but judging by the way the likes fell off after that first post and its 13.4 million views, very few people made it past it. Now, for many who follow AI development, the notion that Apple would release an authoritative paper on the topic was perhaps somewhat ironic.

Henry Arith McQuine wrote, Be Apple, richest company in the world, every advantage imaginable. Go all in on AI, make countless promises. Get immediately lapped by anyone two years into the race, nothing to show for it. Give up, write a paper about how it's all fake and doesn't matter anyway.

Pliny the Liberator wrote, "I'm not reading a single AI research paper coming out of that giant stale donut in Cupertino until Siri can do a little bit more than create calendar events on the fourth try. If I were CEO of Apple and someone from my team put out a paper focused solely on documenting the limitations of current models, I'd fire everyone involved on the spot." Andrew White of Future House SF noted that this isn't even the first paper from Apple on the limitations of AI.

He writes, Apple's AI researchers have embraced a kind of anti-LLM cynic ethos, publishing multiple papers trying to argue that reasoning LLMs are somehow limited and cannot generalize. Apple also has the worst AI products. No idea what their quote-unquote strategy is here. Now on the flip side, the paper was absolutely jumped on by AI skeptics who believe the technology won't get better than it currently is.

Gary Marcus, who when it comes to AI is basically a real-life version of the well-actually meme, published his own lengthy screed on the paper, calling it a knockout blow for LLMs. He wrote, "...anyone who thinks LLMs are a direct route to the sort of AGI that could fundamentally transform society for the good is kidding themselves. This does not mean that the field of neural networks is dead or that deep learning is dead. LLMs are just one form of deep learning and maybe others, especially those that play nicer with symbols, will eventually thrive."

Time will tell, but this particular approach has limits that are clearer by the day. Now, Marcus has been declaring that AI development has hit a wall every few months since at least March of 2022, back when it was still referred to as deep learning. So that is important context that you can do what you will with.

Remarking on the state of the discourse, AI safety discusser extraordinaire Kat Woods wrote, "I hate it when people just read the titles of papers and think they understand the results. The Illusion of Thinking paper does not say LLMs don't reason. It says currently large reasoning models do reason, just not with 100% accuracy and not on very hard problems. This would be like saying, 'Human reasoning falls apart when placed in tribal situations, therefore humans don't reason.' It even says so in the abstract. People are just getting distracted by the clever title."

So with that in mind, let's talk about what the research actually set out to demonstrate. The study was designed to test the limits of reasoning models by asking them to solve a number of puzzles, most notably the Tower of Hanoi. This puzzle features a number of differently sized disks stacked on a game board consisting of three poles.

The goal is to transfer the entire stack from one pole to another, moving one disk at a time and never placing a larger disk on top of a smaller one. The game has an algorithmic solution for any number of disks, but the number of steps grows exponentially as you add disks: the optimal solution for n disks takes 2^n minus 1 moves. The paper measured the point at which the reasoning models fail to reason through the steps and observed how the models fail.
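To make that concrete, here is a minimal sketch of the standard recursive Tower of Hanoi solution, the same kind of algorithm the paper's puzzles are built around. The function and pole names are just illustrative, not anything taken from the paper; the point is how fast the optimal move list grows.

```python
# Minimal sketch: the classic recursive Tower of Hanoi solution.
# An n-disk puzzle takes 2**n - 1 moves, so the move list grows
# exponentially even though the algorithm itself is tiny.

def hanoi(n, source="A", target="C", spare="B", moves=None):
    """Return the optimal move list for an n-disk Tower of Hanoi puzzle."""
    if moves is None:
        moves = []
    if n == 0:
        return moves
    hanoi(n - 1, source, spare, target, moves)  # park the top n-1 disks on the spare pole
    moves.append((source, target))              # move the largest disk to the target
    hanoi(n - 1, spare, target, source, moves)  # re-stack the n-1 disks on top of it
    return moves

for n in (6, 7, 8, 10):
    print(n, "disks:", len(hanoi(n)), "moves")  # 63, 127, 255, 1023
```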

The core finding was that Claude 3.7 with thinking enabled could easily complete a 6-disk game, struggled a little more with a 7-disk game, and had little ability to reason through the solution to a game with 8 or more disks. Similar results were found for O3 Mini High, and the results were consistent across other logic puzzles where complexity can be modulated. The abstract for the paper stated: "We found that reasoning models have limitations in exact computation. They fail to use explicit algorithms and reason inconsistently across puzzles."

Essentially, the big takeaway was that reasoning doesn't scale beyond a certain point even if there are resources left, the notion being that simply getting the models to think longer won't yield better performance. There were a lot of issues with the methodology, which the internet quickly set about unpacking.

Lisan Al-Gaib (@scaling01) repeated the exact prompts used in the paper and found that the models were running up against token limits. The structured output required 10 tokens for each move, and the number of moves is known for this puzzle. Therefore, the models were running into their limits at predictable levels of complexity. They weren't hitting the limits of reasoning; they simply couldn't print out all of the moves while staying inside their output limits. Now, the most interesting part of this failure was that the models actually recognized that they couldn't reason through the solution within their current limits.

Instead of starting the reasoning process and failing when the number of disks was too large, they recognized this fact and provided instructions for how to use the solution algorithm instead. For Claude, this behavior started at 8 disks, hence the sharp drop-off in performance. Lisan commented that all of this is just nonsense, and that the authors didn't even bother looking at the outputs: the models literally recite the algorithm in their chains of thought, in plain text and in code. Basically, the takeaway from this analysis...

was that the Apple researchers weren't measuring the limits of reasoning models. They were kind of just using a ton of extra steps to measure the engineering limits that AI labs have imposed on the models. That's a fairly big problem when the AI research is being used to suggest that reasoning has hit a fundamental wall rather than a technical limitation.
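For a rough sense of why the failure point is so predictable, here's a back-of-the-envelope sketch. The 10-tokens-per-move figure comes from the analysis above; the output ceilings mentioned in the comments are illustrative assumptions, not documented limits of any particular model, and a reasoning trace that works through the moves several times over would exhaust the budget even sooner.

```python
# Back-of-the-envelope: how many output tokens a full Tower of Hanoi
# move list needs, at the roughly 10 tokens per move cited above.

TOKENS_PER_MOVE = 10

def output_tokens_needed(disks: int) -> int:
    moves = 2 ** disks - 1  # optimal move count for an n-disk puzzle
    return moves * TOKENS_PER_MOVE

for disks in (8, 10, 12, 15):
    print(f"{disks} disks -> ~{output_tokens_needed(disks):,} output tokens")
# 8 disks  -> ~2,550 tokens
# 10 disks -> ~10,230 tokens
# 12 disks -> ~40,950 tokens
# 15 disks -> ~327,670 tokens, well past typical output caps
```

Whatever the exact ceiling of a given model, the growth is exponential, which is why the failure shows up at such a predictable disk count.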

Today's episode is brought to you by KPMG. In today's fiercely competitive market, unlocking AI's potential could help give you a competitive edge, foster growth, and drive new value. But here's the key. You don't need an AI strategy. You need to embed AI into your overall business strategy to truly power it up.

KPMG can show you how to integrate AI and AI agents into your business strategy in a way that truly works and is built on trusted AI principles and platforms. Check out real stories from KPMG to hear how AI is driving success with its clients at www.kpmg.us slash AI. Again, that's www.kpmg.us slash AI.

This episode is brought to you by Blitzy. Now, I talk to a lot of technical and business leaders who are eager to implement cutting-edge AI, but instead of building competitive moats, their best engineers are stuck modernizing ancient codebases or updating frameworks just to keep the lights on. These projects, like migrating Java 17 to Java 21, often mean staffing a team for a year or more. And sure, co-pilots help, but we all know they hit context limits fast, especially on large legacy systems. Blitzy flips the script.

Instead of engineers doing 80% of the work, Blitzy's autonomous platform handles the heavy lifting, processing millions of lines of code and making 80% of the required changes automatically. One major financial firm used Blitzy to modernize a 20 million line Java code base in just three and a half months, cutting 30,000 engineering hours and accelerating their entire roadmap.

Email jack at blitzy.com with modernize in the subject line for prioritized onboarding. Visit blitzy.com today before your competitors do. Today's episode is brought to you by Vanta. In today's business landscape, you can't just claim security, you have to prove it. Achieving compliance with frameworks like SOC 2, ISO 27001, HIPAA, GDPR, and more is how businesses can demonstrate strong security practices.

The problem is that navigating security and compliance is time-consuming and complicated. It can take months of work and use up valuable time and resources. Vanta makes it easy and faster by automating compliance across 35+ frameworks. It gets you audit-ready in weeks instead of months and saves you up to 85% of associated costs. In fact, a recent IDC whitepaper found that Vanta customers achieved $535,000 per year in benefits, and the platform pays for itself in just three months.

The proof is in the numbers. More than 10,000 global companies trust Vanta. For a limited time, listeners get $1,000 off at vanta.com slash nlw. That's v-a-n-t-a dot com slash nlw for $1,000 off. Today's episode is brought to you by Superintelligent, specifically agent readiness audits. Everyone is trying to figure out what agent use cases are going to be most impactful for their business, and the agent readiness audit is the fastest and best way to do that.

We use voice agents to interview your leadership and team and process all of that information to provide an agent readiness score, a set of insights around that score, and a set of highly actionable recommendations on both organizational gaps and high-value agent use cases that you should pursue. Once you've figured out the right use cases, you can use our marketplace to find the right vendors and partners. And what it all adds up to is a faster, better agent strategy.

Check it out at bsuper.ai or email agents at bsuper.ai to learn more.

Careful reading of the paper, however, uncovers that the researchers actually prevented the models from writing code, which is fine if we're strictly talking about the limits of scaling up reasoning. But if we're talking about model capabilities in general, and specifically model capabilities in practice, then access to coding tools, which deployed models do have, should be part of the discussion.

Matthew Berman commented that access to tools really changes the math, writing, "...biggest weakness of Apple's paper showing large reasoning models might not actually be reasoning all that well is that they do not include the ability for models to write code to solve problems. State-of-the-art models failed the Tower of Hanoi puzzle at a complexity threshold of greater than eight disks when using natural language alone to solve it. However, ask it to write code to solve it, and it flawlessly does up to seemingly unlimited complexity."
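A hedged illustration of Berman's point: when a model is allowed to answer with a program, its response stays a fixed dozen-or-so lines no matter how many disks are involved, and the exponential blow-up lands on whatever executes the code rather than on the model's output window. The generator below is just one way such a program might look, not anything quoted from his tests.

```python
# One way a model might answer "solve Tower of Hanoi for n disks" with code:
# a constant-size program that lazily yields the moves on demand.

def hanoi_moves(n: int, source="A", target="C", spare="B"):
    """Yield the optimal move sequence for an n-disk puzzle."""
    if n == 0:
        return
    yield from hanoi_moves(n - 1, source, spare, target)
    yield (source, target)
    yield from hanoi_moves(n - 1, spare, target, source)

# A 20-disk game needs 1,048,575 moves, but the program above never gets longer.
print(sum(1 for _ in hanoi_moves(20)))  # 1048575
```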

Kevin Bryan, a professor of strategic management at the University of Toronto, remarked that this paper is really measuring self-imposed limits to reasoning rather than reasoning itself.

He wrote, we can of course program an LLM to spit out millions of tokens in response to "good evening" and use reinforcement learning to iterate creatively on all sorts of possible interpretations, then collate, then brainstorm more, etc. When the models don't do that, it's not because they can't. It's because we use post-training to stop them from doing something so crazy. This does mean that in some cases a model should think longer. We know from things like code with Claude and internal benchmarks that performance strictly increases as we increase the tokens used for inference, on roughly every problem domain tried.

But the LLM companies can do this; you can't, because the model you have access to tries not to overthink. Now, as one case in point, you might remember when OpenAI tested O3 with essentially limitless compute and found a model that effectively beat the ARC-AGI benchmark. However, those runs cost millions of dollars, so the model that was finally released was constrained to a more reasonable amount of reasoning.

The TL;DR on all of this is that the paper is measuring engineering and cost constraints rather than detecting a scaling wall. Models predictably fail when they know they can't churn out enough tokens to present a full solution. This is actually the desired behavior. You don't want a reasoning model to spend hundreds of dollars failing to reach a full solution.

The failure case is also very telling. Rather than spinning their wheels on pointless reasoning that won't reach a conclusion, the models instead describe an algorithmic solution. That is categorically different from just giving up on a more complex problem, as some of the commentary suggested was happening. In short, the paper ultimately says absolutely nothing about the fundamental limits of reasoning models. It just runs up against resource constraints in currently deployed AI systems. And yet, this is not even my biggest beef.

My biggest beef is: who cares? If you tell me right now that O3 isn't actually reasoning, I'm going to look over at the copious amount of work that I have done with this tool over the last month, shrug my shoulders, and then keep on prompting O3 to go do business in ways that weren't possible before.

This gets to a bigger divide right now, where some people are looking at AI in the context of research and the long-term pursuit of AGI, and others are just focused on capabilities in the here and now. Broadly speaking, it's the research community on the one hand and the business community on the other.

Now, of course, these things do relate to one another. The research community needs its place because it's going to drive the advancements that ultimately manifest as better performance. But in the same way that I've said before that AGI is the least relevant term in all of AI for business people, this is sort of the same idea.

I don't care if my agent is an automated workflow, as long as it significantly increases my human leverage and scales up my valuable output. I don't care if my reasoning model is actually "reasoning," in air quotes, as long as it can do things my non-reasoning models can't.

Josh Gans, who's a professor of management at the University of Toronto, published a long piece basically articulating a version of what I'm saying. After explaining that reasoning models are actually doing a ton of incredible work in enterprise and academia, he commented that they work exactly as people explained they would work, not in some miraculous way conjured by the hype and concern around them. And if you worked with them, you would know all this.

Now, to the extent that you are looking for a steel man argument for why these issues actually do matter, and why those of us on the business and applied side should care about some of these questions, machine learning scientist Francois Chollet commented, "...beyond the perhaps superficial semantic distinction between reasoning and pattern matching, there is a fundamental gap in the practical capabilities and behavior of these systems."

You don't create an invention machine by iterating on an automation machine. The reason we care about reasoning is because of what it enables. It's not about definitions, it's about capabilities. You can use pattern matching to emulate specific well-known skills, but you cannot use pattern matching to produce autonomous skill acquisition in new domains.

All of that is well taken. I just don't care, man. And for the vast majority of you who are listening now, it also doesn't matter to you. At least not in the here and now. Maybe it does in terms of what we get to in the future. As Gans summed it up, "I don't care whether my tool is thinking or reasoning. I care how much it's helping, which is a very different thing." Sure, there is an intellectual question regarding cognition, but that's far removed from the transformational impact AI can have right now. Nathan Snell wrote, "I'm surprised Apple's research paper on LRMs is getting so much attention.

LRMs have limited reasoning capacity. Shocker? It's clear if you use them. Doesn't make them less valuable." He also said what we're all thinking when he added, "Also, is anyone else inherently skeptical about research put out by Apple related to AI? They don't exactly have a great track record there." And this is one of the, I think, sort of sad things for these researchers. This all feels to me like it might have been a case of very bad timing.

WWDC was gearing up to announce literally squat, zero, about AI, and Apple researchers dropped this paper that seemed to sort of self-servingly say that AI matters less than we all think it does. Essentially, the paper was a Rorschach test on AI.

For some reason, there is an entire sector of AI discourse that seems to be dedicated to turning Paul Krugman's 1998 prediction that the internet's impact on the economy would be no greater than the fax machine's into an entire career positioning. Author Ewan Morrison posted, AI has hit a wall. AI companies will try to hide this. Hundreds of billions have been spent on the wrong path.

Kevin Roose really sums it up when he writes, there is a strain of AI skepticism that's rooted in pretending like it's still 2021 and nobody can actually use this stuff for themselves.

It's survived for longer than I would have guessed. Look, when it comes down to it, I think it's important that researchers have great debates about all of these things. And I think it's great from the standpoint of what I want as a person in business who uses these tools for business, which is constantly improving models. The academic discussion and discourse that is so important is way upstream from business value, yes, but it is still part of the same stream.

Signal hilariously tweets,

Apple proves that this feathered aquatic robot that looks, walks, flies, and quacks like a duck may not actually be a duck. We're no closer to having robot ducks after all. What are we even doing here anymore? The answer, at least for people who are listening to this probably, is building really cool stuff, doing really cool things, being really excited about what capabilities AI has, and ultimately not caring all that much over whether you call it a duck or a feathered aquatic robot.

That's going to do it for today's AI Daily Brief. Until next time, peace.