
GSM-Symbolic paper - Iman Mirzadeh (Apple)

2025/3/19

Machine Learning Street Talk (MLST)

People
Iman Mirzadeh
Host
Host of the Machine Learning Street Talk (MLST) podcast.
Topics
Iman Mirzadeh: If there is one important message to take from this talk, it is understanding the difference between intelligence and achievement. The field is currently very focused on achievement, numbers, and accuracy, while neglecting the understanding of intelligent systems: how a system understands and reasons, rather than tying it to a particular number on a particular benchmark. We need to build better abstract world models and knowledge representations, which is not easy, because we lack answers even to some basic questions. For example, AlphaZero raised the level of chess not because people simply memorized moves, but because they tried to understand how AlphaZero works; grandmasters use AI tools to develop theory and create new strategies rather than memorize lines. The saturation of benchmarks like ImageNet shows that reaching high accuracy does not mean a problem is solved; the real world is not static. Building intelligent systems should focus on understanding and reasoning, not just on hitting exact numbers. People often say LLMs' weak reasoning is just a missing feature or tweak, or that they merely need access to tools, but this overlooks the importance of understanding and creating new knowledge. Humans use tools too, yet what matters is understanding and creating new knowledge, not merely completing a task or reaching a certain accuracy. Solving complex problems requires many tools, and this ultimately reduces to the same question: how to understand and reason. Using a tool can solve a problem, or win a game, without any understanding of why. Some of AlphaZero's strategies had never been seen before; what matters is understanding the reasons behind them, not just the advantage they bring, and that understanding improved chess in a way memorizing moves could not. The brittleness of LLM reasoning suggests it is not genuine reasoning. An LLM can be seen as a system that has mastered a great many distributions; prompting merely conditions and nudges those distributions, which does not mean the model understands the underlying knowledge. Current training treats everything as a distribution: the model's objective is to minimize the distance to the data distribution, which by construction limits its understanding of anything outside that distribution and makes it fragile on data from different distributions. The cross-entropy loss only cares whether the model's output is correct; it cannot guarantee that the model understands numbers and addition as concepts, or that it builds world models and concepts. The field has made great progress, but research methods and metric design still have room for improvement: our methods are not optimal, and we lack a deep understanding of how these systems work. AI research should seek solutions after understanding the problems, not the other way around; even after reading many papers, our understanding of these systems does not improve much, because the field lacks a unified theoretical framework of the kind that other fields, such as physics, use to guide research. Current work on prompting and on models in general also lacks such a framework, which makes the models hard to improve, while theory is too rigorous and too detached from practice to progress quickly. Other challenges include the shortcomings of the peer-review process. How to bridge symbolic AI and connectionism is an important question: symbolic and non-symbolic models can be combined, but they should not be treated as completely separate systems. Combining them requires a model that can build world models, judge whether information is correct, maintain its own belief system and knowledge representation, and update itself in light of new information, as an integrated loop rather than an isolated module. Humans do not always reason, and LLMs likewise have multiple modes: sometimes surface statistics, sometimes approximate reasoning. LLMs are good at interpolation, which creates the illusion of reasoning; interpolation works well in closed domains but is limited in open ones. The achievements and capabilities of LLMs are different from intelligence, and we currently conflate the two. Intelligence refers to a system's underlying capacity and potential to grow over the long run, not its current achievements; performing well on a benchmark does not necessarily mean a system is intelligent. Intelligence can be thought of in terms of growth rate rather than current level, as in the relationship between brain mass and body mass: what matters is the slope. The "Iman moon test" measures intelligence as the time it takes to get from cavemen to landing on the moon; today's LLMs scoring highly on benchmarks does not make them more intelligent than a caveman. Intelligence is the ability to learn and grow, not performance on a particular task. Measuring intelligence is a hard problem, and current benchmarks saturate easily; measurement should start from defining the properties of intelligence we want, which may require a non-objective approach at first, for example looking at how a system performs on genuinely novel tasks, such as how quickly it can learn a new programming language. The paper by Gilles Gignac et al. gives a formal treatment of the definition of intelligence and is worth reading.
Host: We need to build better abstract world models and knowledge representations. Humans do not always reason, and LLMs likewise have multiple modes, sometimes surface statistics, sometimes approximate reasoning. Intelligence is not skill; it is the ability to adapt to novelty.
supporting_evidences
Iman Mirzadeh: 'To me it looks nearly impossible to build an intelligent system that operates without an abstract model of the environment and the world and knowledge.'
Iman Mirzadeh: 'In image and computer vision, we had these benchmarks like ImageNet and all those benchmarks and we saturated them and we thought, okay, the vision is solved.'
Iman Mirzadeh: 'number of examples. You have to build an agent that understands and reasons.'
Iman Mirzadeh: 'So intelligence, by default, it means, like, by definition, it means about the capability of system and how it can grow and at some point eventually becomes capable.'
Iman Mirzadeh: 'So how could we measure this? Because, you know, like Chollet had this formalism for measuring intelligence, but it wasn't computable.'
Iman Mirzadeh: 'If you look back, obviously we can admit that kind of what these systems today are capable of kind of surprised the field and everyone, I think.'
Iman Mirzadeh: 'So, yeah, I mean, about sampling, there are a couple of things. Like, sometimes sampling in general doesn't make sense.'


Chapters
This chapter discusses the crucial distinction between intelligence and achievement in AI systems. Current AI research heavily emphasizes achievement metrics like accuracy, neglecting the fundamental understanding of intelligence, reasoning, and knowledge representation.
  • Overemphasis on achievement metrics in AI research.
  • Need for better abstract world models in AI systems.
  • Lack of basic answers to fundamental questions about intelligent systems.

Transcript


I think if someone wants to take one important message from this talk, it would be understanding the difference between intelligence and achievement.

The field is currently focused very heavily on the achievement and the numbers and accuracies instead of trying to understand what does an intelligent system mean? What does it mean for a system to understand, to reason, instead of tying it to a certain number for a certain benchmark? Iman, I think you would agree that we need to have better abstract world models, right? We need to have better representations

How's that going to work? To me it looks nearly impossible to build an intelligent system that operates without an abstract model of the environment and the world and knowledge. But right now there are many questions that need to be answered before that. One of the issues that I have with the literature right now and also including me is that we don't even have basic answers to these questions.

after AlphaZero came. What happened is that chess events became more popular and the quality of chess improved. Not because people used chess engines and just memorized moves, but because they tried to understand what AlphaZero and other chess engines are doing.

So, grandmasters actually use these tools a lot, but they don't go and memorize the moves by chess engines. What they do is that they develop theory. So they work on novel openings, they work on novel moves at certain positions, novel strategies. They learn from these tools, not in a way that just memorizing. They understand, they create knowledge, new knowledge, new theory.

In image and computer vision, we had these benchmarks like ImageNet and all those benchmarks, and we saturated them and we thought, okay, vision is solved. But now we see that self-driving cars are not becoming a thing right now because it's very difficult. In the real world, there isn't a specific frozen cut of reality that we've fixed, or a fixed

number of examples. You have to build an agent that understands and reasons. So that's why I think it's important to focus on not the exact number.

MLST is sponsored by Tufa AI Labs. They're like the DeepSeek based in Switzerland. They have an amazing team; you've seen many of the folks on the team. They acquired MindsAI, of course. They did a lot of great work on ARC. They're now working on o1-style models and reasoning and thinking and test-time computation. The reason you want to work for them is you get loads of autonomy, you get visibility, you can publish your research. And also they are hiring: as well as ML engineers, they're hiring a chief scientist.

They really, really want to find the best possible person for this role and they're prepared to pay top dollar as a joining bonus. So if you're interested in working for them as an ML engineer or their chief scientist, get in touch with Benjamin Crouzier, go to tufalabs.ai and see what happens.

You know, we all accept there are limitations in reasoning in LLM systems. Some people say, oh, it's just because there's a missing feature, a missing tweak; we just need to tweak the transformer architecture and then they can do copying and counting and reasoning, all that stuff. Other people say all we need to do is let them access tools. So, for example, I want to have a system that plays chess; it can just use a chess computer. What could possibly go wrong with that?

So, yeah, that's one of the most common arguments I hear and debate with many friends and colleagues and people even at NeurIPS. And I don't personally have any issues with using tools. Humans also use tools. Everyone uses tools. But going back to our previous discussion,

to that earlier part of the discussion: it's not about solving a task. It's not about achieving a certain accuracy on a task. It's about understanding. It's about creating novel knowledge, creating novel goals, achieving those goals. So there are two arguments I have against tool use. One is that

First of all, it's not just one tool, right? So take chess: yeah, if your only goal in the world is to do well at chess and beat some humans, then yes, you can look at a chess engine as a tool, just give moves to it and get moves back, and play chess and win.

So it won't be just one tool for your whole life. It will be many tools, right? So if you look at some reasoning tasks, right, like solving a math, logical problem, planning, so it involves many steps, many steps.

many states that you need to navigate and learn. And even with tools, you won't necessarily use only one tool. You will need many tools. So at some point you will need to use five tools, ten tools. And then the tools may become complicated. They might not necessarily be like,

You give input and you get output. It will be complex. So essentially, it will reduce to the same problem, right? What if I need to plan for 10 different actions? If you look and so at that point, you still face the same problem, right? So that's

one argument. And for chess, there is also this argument: it's definitely fine to use a tool, but you won't necessarily be able to know whether the

system that uses the external tool understands the task. Looking at the chess engine: you could use a chess engine and play and win, but you won't necessarily understand why you made those moves, why this is the move in that position, what constitutes a good position, a bad position, planning. So

Actually, because I'm a huge fan of chess, I watch chess competitions and I follow what's going on in the chess world. What happened is that after

AlphaZero came, everyone thought that chess as a sport and a competition would lose its importance, because there is this system that beats any grandmaster in the world. But what happened is that chess events became more popular and the quality of chess improved. Not because people used chess engines and just

memorized moves; it's because they tried to understand what AlphaZero and other chess engines are doing. They actually wrote a book on understanding the moves of AlphaZero. Famously, AlphaZero has this kind of move where it sometimes pushes

from the left or right corner, and these moves had not been seen before. You could say, I don't care, this is a move that gives me an advantage and wins, or you could understand why this kind of move

improves chess. So grandmasters actually use these tools a lot, but they don't go and memorize the moves from chess engines. What they do is develop theory. They work on novel openings. They work on

novel moves at certain positions, novel strategies. They learn from these tools, but not by just memorizing. They understand, they create knowledge, new knowledge, new theory. And I think this distinction of how humans use chess engines versus how people think we should use chess engines is very important. So if someone claims that a

chess engine can be used for your LLM as an external tool, I'm fine with that if you want to develop that system. But my metric for measuring how good that system is, is whether, if you let that system, or LLM, or any system, use that chess engine, after a while it comes up with a theory. So one

example would be that, in chess, there is this principle that in the opening phase you should have control of the center, because controlling more squares gives you more options in the next moves and prevents your opponent from developing their pieces. So development of pieces and controlling the center are the principles

of chess. So take any system and a chess engine, and see whether, after a while, it comes up with why it is important to control the center. Because you could train a model on all the games, and it will learn, as a distribution, as statistics, that the first moves for white are e4, d4, or c4, the center pawns.

The reason behind that is that players want to control the center. If the system comes up with that new knowledge, if it understands why it matters while using a tool, that's fine. But if it's just using a chess engine, passing moves back and forth without any understanding of what's going on, then all you did is get a high rating at chess without really knowing what's going on.
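
A minimal illustration of the "learned as a distribution" point, in Python with made-up game data: a purely statistical learner recovers that e4, d4, and c4 dominate the opening moves, while nothing in that distribution represents why controlling the center matters.

```python
# Toy sketch (hypothetical data): recover the opening-move statistics
# that a model trained on game records would pick up.
from collections import Counter

# Invented first moves from a handful of master games.
first_moves = ["e4", "d4", "e4", "c4", "d4", "e4", "Nf3", "e4", "d4", "c4"]

counts = Counter(first_moves)
total = sum(counts.values())
distribution = {move: n / total for move, n in counts.items()}

print(distribution)  # {'e4': 0.4, 'd4': 0.3, 'c4': 0.2, 'Nf3': 0.1}
# The distribution encodes *which* moves are played, not the principle
# "control the center" that explains why they are played.
```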

How puritanical should we be here? So, you know, LLMs, they act as if they are doing reasoning in many, many circumstances, but they're prompt dependent. You give them the wrong prompt, you put distractors in there, it brittleizes it, it doesn't work. So how do we interpret that? Do we say it's not reasoning because of the brittleness? The short answer is yes. So the way I look at the prompting and the reason I also don't necessarily...

think that exploring different prompting helps in terms of really understanding, like improving the systems, is that I look at the prompting, I look at LLMs as a system that knows a lot, lot, lot of distributions, right? And then the prompting to me looks like conditioning those distributions, nudging those distributions to maybe specifically steer the model towards some direction.

And so you could use a prompt to do almost anything you want, but that doesn't mean that the system understands what's going on behind the knowledge.
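
To make the "nudging distributions" picture concrete, here is a toy sketch (all logits invented): the prompt acts as a bias added to a fixed next-token distribution, steering what gets sampled without changing anything the model actually represents.

```python
import numpy as np

def softmax(z):
    z = z - z.max()          # numerical stability
    e = np.exp(z)
    return e / e.sum()

# Invented next-token logits over a tiny vocabulary.
vocab = ["Paris", "London", "4", "42"]
base_logits = np.array([2.0, 1.5, 0.5, 0.1])

# A "prompt" modelled as a bias that favours the arithmetic-looking answer.
prompt_bias = np.array([0.0, 0.0, 3.0, 0.0])

print(dict(zip(vocab, softmax(base_logits).round(3))))                # unprompted
print(dict(zip(vocab, softmax(base_logits + prompt_bias).round(3))))  # nudged
# The distribution shifts towards "4", but nothing about what the model
# "knows" has changed; the prompt only conditioned the sampling.
```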

My problem with prompting, and with the classical way we train these systems, is that we look at everything as a distribution. When you learn a distribution, by construction our loss function says: learn a distribution that minimizes the distance between your distribution and some distribution you infer from the data. And once you learn that, that's all the model cares about.

So, once you learn that distribution, you just stay in that distribution. And there are many problems with that. The first one is that by definition, by construction, these systems won't be able to understand what is beyond that distribution. By construction, we train the models to minimize the loss and stay within the boundary of a distribution.

And that's what leads to all the problems, right? You change something, system breaks. You want to measure how the models perform on different distributions, they can't do that. And all these problems, because that's how we trained our models. How do you expect the model to stay in a box? And then you ask it, what happens outside the box? And you constructed that system this way. So,

I think that's one of the issues. And another issue with the way we train, I think, our models, I don't have, by the way, answers to how to fix this right now. But another issue is that when the way we train with, for example, cross entropy loss, right? So let's say you are like,

you are teaching models to do arithmetic, right? What we do is like 2 plus 2 equals, assuming 2 is a token, plus is a token. So these are the tokens. And then in the text, there is 4, right? So all the model has to do is learn that after "2 plus 2 equals" it should output 4, maybe given some context. And it doesn't matter

to the system whether this 4 comes from understanding the whole of the natural numbers, the number line, addition, whether we can do this or not, or whether it's just memorization or maybe something else. There is nothing in our loss function saying,

it's important that you understand numbers and natural numbers as a concept, which later will grow into, I don't know, rational numbers or real numbers and all these concepts. All you have to do, all I care about, is that after two plus two, you should give me four.

And because of that, we don't know if the model can come up with that understanding and building a world model or kind of concepts of what's going on. But yeah, I think these are the main issues with all those. These are the consequences of, I would say, all the things we are doing to train the systems.
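
A minimal sketch of the objective just described, with a toy vocabulary and invented logits: next-token cross-entropy on "2 + 2 = 4" only rewards putting probability mass on the token "4"; it contains nothing about numbers or addition as concepts.

```python
import numpy as np

vocab = ["2", "+", "=", "4", "5"]
target = vocab.index("4")               # the token the data says comes next

# Hypothetical logits after the model has seen the context "2 + 2 =".
logits = np.array([0.1, 0.0, 0.0, 3.0, 1.0])

m = logits.max()
log_probs = logits - (m + np.log(np.exp(logits - m).sum()))  # log-softmax
loss = -log_probs[target]                                    # cross-entropy
print(f"loss = {loss:.3f}")
# Any update that raises P("4" | "2 + 2 =") lowers this loss equally well,
# whether it comes from memorisation or from a genuine model of arithmetic.
```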

So you said to me that you're overwhelmed by the lack of progress in, you know, the sphere of intelligence in AI research at the moment. Tell me more.

Oh, okay. Yeah, I mean, there's like, there are two sides to this story. One is that, obviously, I think we are living in a very important time and very exciting moment. And there are lots of progress being made. Lots of people are investing in AI right now. There's lots of research being done. So from that side, it's very exciting. On the other hand, I think like,

In my opinion, the way we are doing research in the field is not necessarily optimal. And the way we designed our metrics and the way the system works in general is not necessarily optimal. So, for example, right now, the current way of doing research, most of the research,

is that people believe research is always an incremental process, and I don't disagree with that. You will always build on top of the work of other people.

But my issue with how the research is being conducted right now is that, by reading new papers and learning all the new papers, compared to, I don't know, two years ago, three years ago, four years ago, we still don't have a better understanding of

how these systems work, right? We do not know whether we are moving in the right direction or not. So, for example, most of the papers these days have a hypothesis of the form: if you do this,

then this will happen. And then they ran experiments and showed that, yeah, if you do this, then this happens. Then there will be another paper saying that if you do this, then this happens. And then there will be other papers saying that all these papers had this assumption, and if you have a different assumption, this will happen. And still, after reading, I don't know, 10, 20 papers, you don't have a coherent picture of how these systems work. What is the general hypothesis?

So what I think is good about many other fields, like physics, is that you have a hypothesis saying: this is my model of, I don't know, a transformer. This is how I think a transformer works. And then after that, you say, okay, if my hypothesis is true, then there will be these consequences. And then I'm going to design an experiment and measure that.

But this is not how things work. The way people approach this problem is that they look for solutions before understanding the problem. So I think that's one of the issues that I hope at least people spend more time on understanding things rather than focusing on the solution and what we can do. I am always like...

I think it's more important to understand what is happening right now. What are the problems? What are the pros and cons of these systems? How do these systems work? And then once you have at least a hypothesis or mental model, like how these systems work, then you can build on top of that. But

Otherwise you will maybe run into papers or research that, instead of trying to understand how, say, prompting works... You could say: my model of prompting is that it nudges the distributions, it conditions the model, and then, if there are a few distributions and the model is interpolating over them, this will happen. And then you could run a few experiments to

test this hypothesis. But what happens instead is that someone explores what happens if you prompt a language model in some way, and even if it increases your accuracy by 5%, you still don't know what's going on. After reading that work that increased the performance of the model by a few percent, all those works, you still don't have a better understanding of the

models, and that makes it very difficult to improve those systems. And then there are other aspects. From the theory perspective, I'm not an expert in theory at all, but I think maybe our theory is also too rigorous for the moment, because the problem is so hard and the current theory lags

far behind the state of practice. And theorists want to stay rigorous, very rigorous, and develop it that way. But it's very hard; the theory needs to become more relaxed so that it can at least make faster progress.

So I think those are the main issues. And then from the other aspects, I think like it's given the amount of interest and investment in the AI literature, then there are many people working on the field, which is really, really great.

But there will be other challenges, like the reviewing process: the peer-review process is really difficult and noisy. I don't have a proper solution to propose here, but I think we can at least

think about these problems. How can we improve those processes? And yeah, I think these are the main challenges of the AI literature and the AI research community right now, in terms of research and how it is being conducted. How can we bridge symbolic AI and connectionism? Yeah, that's... I think, I mean, we...

There are a couple of ways to think about it. Right now, I'm not saying that symbolic models versus non-symbolic or deep learning models are good or bad, or which one works. I'm thinking more about

fundamental ways: given any system, is there a way to understand whether it is understanding a concept, whether it is able to reason? But in general, I think it is fine to start by integrating these two, maybe starting with separate modules, and

then maybe eventually, like by marrying these two fields, like it will become more integrated and improve over time. But it's very important to like not look at these things as two separate systems. Otherwise, it will lead to problems of, you know, using an external tool and all those things. So there should be some notions of understanding and like,

the model itself should be able to have a world model, should be able to understand whether it needs to update or not. So one thought experiment that I have is this: imagine that right now, as we are talking, I tell you the area of a circle is pi r to the power of one half, I don't know, something like that,

or pi times 2r, right? You wouldn't accept my argument; you could prove mathematically that this is not true. But if I tell an LLM that the area of a circle is pi r to the power of 3, there is no module in that system asking: is this making sense? What do I think about the area of a circle?

Ideally, it shouldn't accept it. But we don't have that. So I think that's one of the examples of why we need another component, why we are missing at least one component that, after a while,

comes up with its own belief system and knowledge representation, and when something new comes in, it may update its belief system or it may disagree. I wouldn't agree that the area of a circle is something else, because I can prove to you right now that that's it. If you have another proof, I will read it, but I will doubt that that proof is correct. So

That's very important. But there are other belief systems that I'm open to change. Like the best place to eat in Vancouver, I'm open to change and try your suggestion. So I think it's important that the system should not be separated and be looked at as two separate components. It should be an integrated loop.

Yeah, and I think it is interesting that you say that humans don't always reason, but we're capable of doing some privileged form of cognition that we might call reasoning. And also you could argue that the LLMs have a spectrum of modes. So sometimes they're doing surface statistics and maybe in certain circumstances they're doing at least some approximate version of reasoning. So there seems to be a spectrum. Tell me about that. Yeah, I think...

If you look back, obviously we can admit that kind of what these systems today are capable of kind of surprised the field and everyone, I think. But I think we kind of sometimes get confused

and read too much into what these systems are capable of doing. There are some claims that the models can reason, that there's a spectrum; I'm not sure about that, actually. What these models are good at, I think, is doing some sort of interpolation.

So it's not that they're just memorizing things, they can't do anything beyond training data. They learn many things from different places and they learn different distributions and they do interpolate between these things that they've learned. So I think that's what gives, I think, the illusion of what these systems are capable of reasoning. And if the domain is very limited,

and closed, for example, a specific domain where the overall space is closed, the interpolation of the model seems enough to assume that these models are capable of performing in that domain. But I think in general, there is a very huge distinction between the achievement and capability of a system and the intelligence and other capabilities,

intrinsic capabilities of the system. So these two are different. But currently what we do is that we mix those two together. So for us right now it's like if a model performs good at a coding benchmark it means that it has reasoning. If it does good at

on math benchmark, it means it's doing some sort of reasoning. And when we mix this together, it makes arguments and understanding these systems very difficult. So I think it's important to discuss the distinction between intelligence and achievement.

So intelligence, by default, by definition, means the capability of a system, how it can grow and at some point eventually become capable. Not necessarily

how capable it is right now. Achievement is about measuring what the system is doing on a specific domain or task or benchmark. So these two are different. If a system is intelligent, eventually it will be able to do well on some sort of benchmark,

but the other way around is not necessarily correct. If a model is doing well on a benchmark, it doesn't necessarily mean it is intelligent. But because it is hard to measure intelligence in general, this is an open question, we set a benchmark and say an intelligent system should do well on this benchmark.

And then when a model or system gets good, we say this is doing reasoning. So these are different. I think intelligence is about...

is about how a system in the long run is able to perform well. So there was a good, very interesting picture in Ilya's talk at NeurIPS this year, and that figure showed the

There was a regression line between body mass and brain mass, and that figure showed how different species scale in that sense. So my take from that picture was that intelligence is not about where you are on that scale right now; it's about the slope of

that curve. So if you are thinking about scaling laws, intelligence is the slope of the scaling, not what point you are at. Yes, I love all of this. I mean,

You probably know I'm a huge fan of Francois Chollet, and he's always at pains to say that intelligence is not skill. It's adaptation to novelty. It's skill acquisition efficiency. So there's something about the macro adaptability, which is important. And actually, when we were discussing this before, you came up with your own test for intelligence, which I'm going to call the Iman Moon test.

Which is, you know, basically, imagine we start from cavemen and we land on the moon and sort of how quickly can we do that? Yes. So, yeah, that came from a thought hypothesis, right? So just to, again, reiterate the distinction between intelligence and achievement, right? So imagine a caveman.

So if you go back in time and give MMLU or GSM8K to a caveman, I would argue that the performance would be near zero. And current models get near like 80%, 90% accuracy on these benchmarks, right? And do we really believe

the current systems that we have are more intelligent than a caveman? So intelligence is about the capacity that a human possesses, not how well it's doing on a benchmark, right? Even if you go further, to some great figures in history and science, like Aristotle: if you give MMLU to Aristotle, I don't think he would perform nearly as well as current LLMs.

But do we really believe those LLMs that we have are more intelligent? Because intelligence is not about how much you achieve on a certain task. It's about how well that system, that human, can, if it spends time, learn something and grow.

So how could we measure this? Because, you know, Chollet had this formalism for measuring intelligence, but it wasn't computable. And benchmarks, they just get saturated, you know, they get Goodharted. How can we build a new suite of benchmarks? Yeah, that's... I think I don't have a...

concrete answer to that. I'm thinking about it a lot, but I don't think I have an answer right now. But maybe we can at first start with something that is not necessarily objective. Maybe we can start by asking: what kind of system do we want? What does intelligence mean to us? What does

achievement mean to us? And then maybe we start with a kind of axiomatic definition and set of characteristics. At first it won't be objective, and I know it will be open-ended rather than concrete, but if we start thinking about these important problems, we may come up with some way of measuring it. So,

In general, measuring intelligence is very difficult. Even for humans, it's not trivial to measure intelligence of humans. But I think we can define some desired properties that we want from a system. So those desired properties might mean something like the system being able to perform well on novel tasks.

And then definition of novel itself is not easy. But I think we can start from here and then build on top of that and think more on that. So because there was a nice paper by Gilles Gignac et al. And it's called...

On definition of intelligence, I think that was a really good read. It's not circulated very well in the machine learning community because they have background in psychology and cognitive science, but it's a really, really nice paper. It defines...

very formally what is the definition of intelligence, what is the definition of artificial intelligence, what are kind of forms of intelligence we are looking for and we can start from there and think more on those problems and maybe we come up with something better.

So yeah, that's what I'm thinking on these ways. But in general, I think the direction should be towards measuring novel tasks. So for example, if you are training a model to do coding, at least what you could do is that...

For me, a better measure would be how fast that system can learn a new programming language from scratch. So a system like an LLM could do very well on Python or all the languages it's been trained on. If I create a new programming language and ask it to write a new language, a new program in that language, how fast it can learn that language and how good it can...

write programs in that language compared to a human that has never seen that language before. Because one other thing we can do is compare and contrast how humans learn and reason with how these machines are learning and reasoning.

So, yeah, I think we can think towards that direction, but I don't have a concrete... Yeah, it's very slippery. There's a great paper by Pei Wang on defining intelligence, and he said that it's very anthropocentric and...

You know, in different fields, people use different techniques. So, you know, we could have basically a copy of a human brain. And that's the most anthropocentric. That's not particularly useful, is it? Or we could have notions of behavior or more abstract capabilities of being able to write Python programs.

or functions, so having things that do planning and reasoning and all these abstract cognitive functions, or maybe even having things that have certain principles, like, you know, emergence or certain characteristics or something like that. Very, very slippery. But just before we move off that, you cited this paper from psychology. I'm very interested to know

what their definition was. For Pei, his definition is very much about adaptivity. That seems to be one of the core things for him. So, yeah, actually that paper discusses lots of other works on studying and understanding and defining intelligence. I highly recommend that paper. So, yeah, the focus of that paper, the main emphasis is on novelty. So,

The definition of intelligence in that paper is that the maximum capability and capacity of a system to achieve a novel goal given some time. Obviously, you can define many different programs and systems that...

given infinite time, could eventually reach a goal, but the time is also important; that's why I also mentioned the slope of scaling and why it matters. Do you think agency and autonomy come into intelligence? So, you know, basically the ability to set your own goals. Or do you think we should think of intelligence purely in the frame of: this is a thing, it has a goal, and can it achieve it?

No, I think actually agency and the system having some concept of what it wants is very important. So I was reading a book recently. It's called How We Learn.

Why Brains Learn Better Than Any Machine... for Now, by Stanislas Dehaene. In that book he discusses lots of great topics about how humans learn and how machines learn, and he comes from a neuroscience background, so there were lots of new insights, at least for me. And one of the things that

seemed really interesting to me was that one of the pillars of learning for humans is active engagement. Essentially,

it's very hard for humans to learn by just observing and not getting involved, just watching something, compared to setting their own goals, being able to actively engage with the environment, explore, maybe later exploit, and all these things. So without that,

that's actually, according to that book, one of the necessary conditions. It's not something that is nice to have; it's something that you must have. So I think that's one of the important topics. And for me, the implication was that if you want to build a really intelligent system, supervised learning won't be enough.

So you need an agent to ask for what should I do next and be able to ask questions. It's not like you just observe and say, okay, I learned this, I learned that. It should be something like, now I learned this, what do I want to learn? Maybe I haven't understood this well enough. I need to explore more. Let's explore this. This is fascinating. So...

Controlling the center is an abstract category, right? It's an analogy. And Douglas Hofstadter said that, you know, a concept is a bag of analogies, essentially. So there's

Moving to the center is something that we all understand at some abstract level, but what does that actually mean? I think many concepts in chess are quite fuzzy, right? So what that actually means in many, many different situations is slightly different. So the question is, what's the difference between AlphaZero, it discovered this move 37, and if you play it millions of times, this move 37 in its behavior space emerges as a mode.

So you could argue that it is doing reasoning, but it's doing an emergentist form of reasoning. Rather than having some notion of "in the center", its behavioral profile acts as if it has a notion of "in the center". And how is that so different from the way our brains work? Yeah, I think, again,

So there's this thing like AlphaZero, AlphaGo and all these systems are able to understand the environment they're put in, right? So they explore the environment, they play a lot and they are able to eventually explore most of the kind of positions in the Go or Chess or many of them.

Eventually they will be able to take rational decisions, because they understand the value of each state. But that doesn't mean they come up with these concepts. So,

For a chess engine, it is that this move improves the value of this position, right? But it doesn't necessarily transfer to other aspects of life, right? So in chess, for example, you could argue that controlling center means controlling important or strategic positions, right? You could...

Once you learn this, you could also apply it to other aspects, right? This is not just about four or five squares in the center of a chessboard. It's about importance of specific positions in different parts. So it could be about controlling an important position.

a road in a country, controlling an important passage. So that form of abstraction, and the way we store knowledge and learn, is what allows us to scale very fast. Once we learn this in chess, we are able to use it across other domains,

but systems like AlphaGo or AlphaZero are not able to do that, because the world to them is just that, and they don't have these kinds of abstract representations and knowledge. I'm not saying that learning abstractions is necessarily the only way you could build an intelligent system, but it certainly helps humans scale and improve much faster than machines.

On the matter of generalizing out of distribution, so what we're starting to see is, I mean, obviously chain of thought is used a lot, which means that you can take something which is trained on a distribution and you can kind of like manipulate, I suppose you're doing some kind of directed retrieval to create compositions in some fixed way. So there's chain of thought prompting. Some people are doing program induction.

And it certainly seems to be the case that with a few examples, maybe with Chain of Thought, you can get a language model to induce a very rich and diverse set of programs. And many of those programs seemed to have some kind of abstraction. Another thing we're seeing is transduction.

which is where you do some kind of active fine-tuning. So you take the test instances and then you kind of like modify the existing model a little bit so that you get generalization in this domain. So there's loads and loads of approaches that seem to be working quite well for making these things do better OOD. What do you think about that? Yeah, okay. So yes, so you could do like...

There are lots of ways to explore this, right? Like chain of thought, program synthesis, all these topics. So my question, which I'm trying to think about like these days a lot about, is that

All these systems, all these methods, all these directions are built on some assumptions. And the assumption is that there is something going on in the model, some kind of understanding that we need to improve that, right?

I'm not quite sure right now whether that assumption is correct. I'm not saying that it's correct or incorrect; I'm not sure about it. But if that assumption is incorrect, then all the things you are doing, chain of thought, all the other

methods for improving this kind of reasoning, may not be helpful. So what I'm thinking about is taking a step back and really asking this question: is there anything there beyond

some interpolation between distributions? If there is, then sure, we can continue and build on top of that. If there's not, then what are we doing? Why are we trying to do all these methods? It's very difficult, by the way, because we are also doing it in a kind of ad hoc way; there is no

coherent model of what the transformer model is doing at scale. So it's very...

Very tricky, I think. So right now I'm leaning towards the view that the models don't have the right representations and that the way we build these models is, by construction, limiting. So I don't really see how extending this would be helpful. There is an old example:

if you want to land on the moon, can you do it with an airplane? If you have an airplane and you want to land on the moon, then you could improve the speed of the plane. You could make it lighter. You could make it go

faster, all these things, but that doesn't necessarily help you land on the moon. You could also work on how the wings should look, or what kind of

runway you should use to have a faster takeoff, all those things, but that doesn't necessarily help you with your goal. So I'm not saying that it is impossible; I'm saying that we should ask the question: is this the right vessel? Does it make sense? So yeah, in general, I think we should take a step back and think harder about the problem.

Onto scaling laws. So, you know, Noam Brown is talking about this new test time scaling law. You've said that it's very, very difficult to convince people that scaling laws, you know, don't work because you'll always have a group of people that say, oh, you know, just over the next hill, just, you know, if we 10x the parameters, 10x the compute, then we'll finally get there. How do you convince people that they're wrong?

Yeah, I mean, I can't really convince them, because the issue with the scaling laws is that if I spend millions or billions of dollars and train a model of, I don't know, 10 trillion parameters on

100 trillion tokens, and come and show you, hey, this is the model and it can't do even simple mathematics beyond what it's been trained on, it can't create new knowledge, then someone can claim that, hey, this is not the right scale; if you scale it to 50 trillion parameters, something will emerge. And then, because my argument is not theoretical, I can't say, no, you're wrong.

And because I won't be able to train that model at 50 trillion parameters and spend a few billion dollars, I can't say you're wrong. So there will always be this debate that something will emerge. And by the way, I don't have an issue with scaling, right? I'm not saying that it will hurt or that it won't be helpful at all. I'm saying that if you are thinking about scaling,

that you should focus on the slope of your scaling. You shouldn't care about where you are right now, right? So if you want to compare two different systems or like one architecture with another, you shouldn't care about like...

what happens at the 100 trillion or will something emerge? You should say, look how fast these models will be able to learn and do novel things. So right now, I think in that sense, the slope is near zero for all the architectures and methods we have, but we can focus on the slope rather than the points and endpoints.
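
A rough sketch of the "slope, not point" comparison, with invented numbers: fit a line in log-log space to (compute, score) points for two hypothetical systems and compare the slopes rather than the endpoints.

```python
import numpy as np

def scaling_slope(compute, score):
    """Least-squares slope of log(score) against log(compute)."""
    return np.polyfit(np.log(compute), np.log(score), 1)[0]

compute = np.array([1e20, 1e21, 1e22, 1e23])
system_a = np.array([10.0, 11.0, 12.0, 13.0])  # high score, nearly flat
system_b = np.array([1.0, 2.5, 6.0, 15.0])     # low score, steep growth

print("A slope:", round(scaling_slope(compute, system_a), 3))
print("B slope:", round(scaling_slope(compute, system_b), 3))
# By the slope argument, B is the more interesting system even though A
# currently sits at the higher point on the curve.
```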

On the slope then, how does the slope of human scaling compare to LLM scaling? Yeah, that's one of the interesting things. We weren't able to do that, but it would be interesting to... I think there are many studies, I'm not an expert in the field, but there are many studies showing that how humans versus models react to novel environments, right? So...

If we compare those, I don't have any numbers, but I think there are many studies showing that humans can adapt to a new environment; they can learn a new environment very easily, much more easily than at least LLMs. So I think the slope of humans is better than the slope of LLMs. But it could be that there will be another

species with better scaling, better slope. And I look forward to it. Very cool. So, Iman, I think you would agree that we need to have better abstract world models, right? We need to have better representations. How's that going to work? So I think, yeah, I mean, to me, it looks nearly impossible to build an intelligent system

that operates without an abstract model of the environment and the world and knowledge. But right now, there are many questions that need to be answered before that. And we don't even have... One of the issues that I have with the literature right now, and also including me, is that we don't even have basic...

answers to these questions. So imagine you want to learn a function, right? So

One way to represent that function is with a table: input x, output y. And then imagine there is another way of representing the function: as a polynomial, like y equals x squared plus some number, something like that. And

We don't know, like right now, there is no objective measure and we don't even have an answer to that which one is better, right? You might come and say like the second one where it has a form, it's abstract or something like that. This one is better. But we don't know, like why would someone claim this one is better than the other, right? I also believe the second one is better, but...

we couldn't quantify this, right? Someone can claim that the second one is compressed, in a compressed form: you don't need a table, you just need a polynomial. But I don't think that's necessarily correct, because

In order to represent something with polynomial, it's not about the number of characters in a string. It's about concepts, right? So in order to define a polynomial, you need to

understand what a function means, what x squared means, what x to the power of n means, and that requires understanding what continuity means, what a real number means, all these concepts. So if you want to encode everything that it takes to reach y equals x squared plus some number, then that may end up being even larger than

the table, right? And then if your only goal again is achieving something, referring back to our previous discussion, then these two are equal, right? You give input and both of them give you output and that one is even faster. You just have a lookup.
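
A minimal sketch of the two representations being contrasted: the same function stored extensionally as a lookup table and intensionally as a closed form. Both achieve the mapping on stored inputs; the open question raised here is how to compare them as representations.

```python
# Extensional representation: a table of input -> output pairs.
table = {x: x**2 + 1 for x in range(-5, 6)}

# Intensional representation: the rule itself, y = x^2 + 1.
def poly(x):
    return x**2 + 1

print(table[3], poly(3))   # 10 10 -- identical behaviour on stored inputs
print(poly(100))           # the closed form extrapolates beyond the table
# print(table[100])        # would raise KeyError: 100 was never stored
```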

So we don't have a way of comparing two representations. So if I want to sum up, is that in order to answer all the questions you asked, we need to start with understanding

how to compare two representations. Once we have a good measure for comparing two representations, then we can build on top of that. But we don't currently have that. Actually, I was talking to a friend about how we compare these two representations, and I said I don't have a measure, but it's similar to the idea that at some point in time

someone came up with the idea of formalizing the concept of generalization. Before that, 200 years ago, generalization wasn't a formal concept; it was a word. So you could say one representation is more beautiful than the other.

A polynomial is more beautiful than a table, but we don't have any formal way of defining what constitutes something being more beautiful. But we are thinking about these problems all the time.

I guess the core here is whether deep in your bones you are a connectionist and you believe that it's possible in principle for this kind of thing to work. Because obviously, when we'll talk about your paper in a minute, you've designed a symbolic test to prove that certain types of reasoning can't happen in, or don't happen in an LLM.

In 1988, there was this famous connectionism critique by Fodor and Pylyshyn, and they argued that these systems don't have systematicity. They don't have compositional generalization. They don't have invertibilities. They can't explain why they did things. They can't canalize their knowledge in the way that you're talking about. But other folks, like

Smolensky, to get his name right, in 1990, and certainly people like Bengio, argue that, yeah, it's not a problem. You know, all of this symbol use can just emerge at some level of complexity or whatever. So do you think that in the future we can make LLMs do this kind of symbolic reasoning? Oh, yeah.

I think when we are talking about symbols, we have to define the notion of a symbol. Is a symbol something concrete and predefined, or do you accept that, say, an intermediate activation of a deep learning model can represent a symbol? If you agree on that, then that could

be looked at as a formal symbol, and then I think yes, the model will be able to do that. Otherwise, if you think symbolic means some form of separate external system, then no. I think I'm leaning more towards symbols emerging as a part of the

computation inside. Oh, that's very interesting. Yeah, because I suppose there's a spectrum of views. I mean, there are some people who think that just pure connectionism, symbol use can emerge. There are people who are quite sort of like in the middle. I think I'm in that camp, which is that we can build LLM systems. So maybe...

sort of agent LLMs that talk with each other or maybe neuro symbolic architectures where the LLMs can use tools or some combination of the above. But you're in the camp that we could in principle just have a pure play neural network and it could do, you know, symbol use. Yeah, I mean, yeah, right now I don't have like any...

I don't see it like in theory at least, right? Like I don't see in theory why such a system won't be able to

create that kind of symbol, theoretically. It could, at some point, create symbols and do computations on those symbols. I'm not saying this is the only way and that we shouldn't have anything else; I'm not that experienced in that sense. But I don't have that kind of objection. The same goes for other things, like when I discuss architectures. So,

I don't know if the transformer is maybe the right architecture or not. But right now, in theory at least, given the context and assuming the model is generating tokens, because it becomes Turing complete, I don't have any reason in theory to believe that's a limitation right now. It could be that other architectures are better; we can develop better architectures. It's just that right now, I don't see that as a limitation.

Okay, but you do think we need to have Turing completeness for symbol use so we could somehow come up with a neural network which is Turing complete? Yes. Okay.

We're not there yet, but maybe. I think when we're defining intelligence, at some point, being able to do Turing-complete operations is a necessary condition. Yes, I completely agree with that. We should talk about your GSM-Symbolic paper. So this is a landmark paper. It did the rounds on socials. Millions of people in my community said that I had to interview you because it was amazing. Sketch it out for me.

Oh, okay. So a little bit of background. We were working on understanding reasoning and we were exploring a few ideas about improving the reasoning of the models in terms of increasing the amount of computation a model does per output token. And for that, we needed to evaluate the model and also...

be able to have a robust evaluation and then eventually have maybe better training data. So we started with the evaluation, creating a small sample for GSM-Symbolic, just templates, and then,

just to do a sanity check, we ran the experiments to see how close these numbers are to GSM8K. And we observed that, for some models, there is a very huge gap. For example, Phi-2 had like a 20% gap, a 14% gap, something like that. And then there's also huge variation, right?

So, yeah, we kind of got sidetracked and explored this immediately to understand what's going on, right? And one of the...

second things, after we had the initial GSM-Symbolic, was the GSM-NoOp version, where you essentially try to trick the model: you add one clause

to the question that carries no logical or arithmetic operation. If you completely ignore that clause, you will be fine. That's why it's called GSM-NoOp. That was the second one; in the paper we present it differently, but we explored this second, and then we observed a very huge

performance drop. So then we came back and tried something in between: we tried to create easier benchmarks than GSM-NoOp by adding one clause that does carry an operation, and then two clauses that carry operations, so you can't ignore them, and then we looked at how the models perform on those.
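
A rough sketch of the templating idea being described (the template text, names, numbers, and no-op clause below are invented placeholders, not the paper's actual items): the same GSM-style question with names and numbers as variables, plus an optional clause that carries no arithmetic and should simply be ignored.

```python
import random

TEMPLATE = ("{name} picked {x} apples on Monday and {y} apples on Tuesday. "
            "{noop}How many apples did {name} pick in total?")

def make_question(no_op=False, seed=None):
    rng = random.Random(seed)
    name = rng.choice(["Amy", "John", "Sara", "Omar"])
    x, y = rng.randint(2, 20), rng.randint(2, 20)
    # A clause that adds no arithmetic; the answer is unchanged if it is ignored.
    noop = "Five of them were slightly smaller than the rest. " if no_op else ""
    return TEMPLATE.format(name=name, x=x, y=y, noop=noop), x + y

question, answer = make_question(no_op=True, seed=0)
print(question)
print("expected answer:", answer)
# Swapping the name or the fruit leaves the reasoning untouched; the finding
# discussed here is that model accuracy nevertheless shifts under such changes.
```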

Among all those benchmarks and experiments that we designed, the surprising thing, to me at least, was the variance of the performance we observed across models. I wish we could do some sort of cross-evaluation with human students, to see how students would perform if you change the words in the

problem without changing the numbers. If, instead of saying Amy has three apples, you say John has three bananas, they're the same in terms of logical reasoning. I wish we could do that, and I hope someone explores these things: how would a human student perform if you just change orange to banana, compared to an LLM? And

to me, the most fascinating part is the variation, in the sense of: why does this gap exist even though you only change the numbers? We have a benchmark called GSM-Names, where we only change the proper names in the question; we don't change the numbers or add anything to the question. So, yeah. The frontier models, I mean, to be honest, they still did

quite well. I mean, I would expect them to just drop off. Did you know Subbarao Kambhampati has done some experiments with planning? He did this mystery blocks world where he just changed the names of the symbols to random things and performance dropped off a cliff, and the o1 model kind of made it back up to non-trivial performance, but there was a huge decline. And I guess, reading your paper, I was still somewhat surprised and

intrigued that the frontier models did obviously much worse, but it was still pretty non-trivial performance. There are two things. Personally, I don't know, I think overall GSM8K should be a very simple benchmark.

There is also this factor that we don't exactly know what kind of data these models have been trained on. It could be, I'm not saying it is, but it could be the case that these models are also being trained on some sort of synthetic data generated from

questions similar to math word problems like GSM8K. And we also know that all the companies that are trying to build LLMs are also collecting lots of human-created question-answer pairs in math and other domains, right? So

We don't exactly know, that's one of the issues with not being transparent about what kind of data at least the model has been trained on. But it could be the case that the frontier models have access to better data quality and data format similar to GSM 8K. And

It could be that they have unlocked some kind of emergent ability. But if that were true, then at least... I'm not a believer in benchmarks, but there are, again, different benchmarks that are similar to GSM8K, slightly more difficult. And on those benchmarks, I think someone tried a similar idea

to what we did with GSM-Symbolic, on the MATH dataset, Hendrycks' MATH, and they observed a larger performance drop for frontier models as well. But overall, again, for me, it's not about performance. Like, I prefer a model where,

if you make the question difficult, the performance drops 10% but there is no variation, because if it understands the question, it understands the question, over a model that drops like 1% but has a large variation. So to me, it's not about the accuracy number. It's about: why would a system that understands, that is able to do a set of logical steps,

try a different set of logical steps and get it wrong if you change orange to banana in a question? And that probably comes from the fact that we train this system to learn a distribution. There is no concept of: in this question it doesn't matter what kind of object it is, it's

about the number of objects. For us these two seem trivial, but for a system that is trained to predict the exact object in the question during training, it's non-trivial. So yeah, overall I don't think these systems truly understand this concept.

I wasn't doing a search on those kinds of things, but I truly believe that, for any question, if you spend enough time, you could change that question into a form that the models, including frontier models, get wrong. But unfortunately, the field looks at these things as an accuracy number,

not necessarily at what it means for a number to drop or what it means for the variation to increase. And now, I think, the field is trying to move on from GSM8K because the performance is reaching 90%, 95%, near

100%. And now, I think it was yesterday in the workshop, there was: oh, we have a new, difficult benchmark called FrontierMath, and it has been designed so that frontier models now get 10%; now let's increase that 10 to 90. My problem with benchmarks is that they are always a cut of reality, and once you freeze that, you could change the system in a way that,

indirectly, you change the system but it has an impact on the performance. So eventually that benchmark will always get saturated, but, at least to me, it doesn't matter. Unless we change something fundamental, that system may get 99% on that benchmark but won't be able to create new knowledge, won't be able to understand what's going on. And to me, that's what matters.

So in image and computer vision, we had these benchmarks like ImageNet and all those benchmarks, and we saturated them and we thought, okay, vision is solved. But now we see that self-driving cars are not becoming a thing right now because it's very difficult. In the real world, there isn't a specific frozen cut of reality that we fixed ourselves, or a fixed

number of examples; reality will change. You have to build an agent that understands and reasons. So that's why I think it's important to focus not on the exact number. Yes, and...

Benchmarking is a big problem. So, as you say, I'm sure many of the frontier models have basically memorized GSM8K, and certainly we should be moving towards more of a generative benchmark type of thing, where of course it's not deterministic, but we have some generative system, we sample from it enough times, and we report some kind of average or something like that. But you did show some very interesting stuff. First of all, there was, I think, a change on the NoOp dataset where you actually eight-shot it,

which means the model should be able to kind of filter away the distractors, but it didn't, which was very, very curious. But more broadly, though, isn't it interesting that you sample from these models a whole bunch of times and you see this huge variation? And what's the implication? Is the implication that when we use language models for doing reasoning, we should actually be kind of like sampling 100 times and taking the average result or something? I mean,

Because many of us use language models and we just sample once. We just assume that, oh, it's doing reasoning. That's the right answer. And we don't really think about, well, if I asked it another hundred times, it would give me a bunch of different answers.

Yeah, I mean, so, yeah, I mean, about sampling, there are a couple of things. Like, sometimes sampling in general doesn't make sense. So, again, back to our example, 2 plus 2 equals, there's no sampling. It should be 4, right? Like, if you increase the temperature, it could become 5. But why do you want to do that? Like, randomly picking a number doesn't make sense at all in that sense. That's why, for our study, we always...

used greedy decoding, no sampling, because if you are doing arithmetic or you are doing reasoning, it has to be deterministic.
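
A minimal sketch of that evaluation choice, with invented logits: greedy decoding takes the argmax, while temperature sampling over the same distribution can return a different, wrong answer to a question that has a single correct one.

```python
import numpy as np

rng = np.random.default_rng(0)
vocab = ["3", "4", "5"]
logits = np.array([0.2, 2.0, 0.8])   # hypothetical logits after "2 + 2 ="

def greedy(logits):
    return vocab[int(np.argmax(logits))]

def sample(logits, temperature=1.0):
    z = logits / temperature
    p = np.exp(z - z.max())
    p /= p.sum()
    return vocab[rng.choice(len(vocab), p=p)]

print("greedy :", greedy(logits))                                      # always "4"
print("sampled:", [sample(logits, temperature=1.5) for _ in range(5)])
# With temperature > 0 the answer occasionally comes out "3" or "5",
# which is why the study used greedy decoding for arithmetic.
```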

But in general, the other argument against sampling and majority voting and all those kinds of things is an example; I remember you had an episode with an example like, if you let 10,000 drunken people walk

home after the bar, eventually some of them may reach home, but that doesn't mean they understand what's going on. Given enough samples, you will eventually

reach the destination, probably, but that doesn't mean anything; that's the issue. Actually, there was a work that argued exactly this: they said that if you measure GSM8K and MATH accuracy and you sample the model 100 times, it will be like 20 percent better. But by the same argument, I don't think that's what we should study, right?

Cool. Well, Iman, it's been an absolute honor to have you on MLST. Thank you so much for joining us. Thank you. It was great talking to you.