So much so that we now have phrases like fractal intelligence. In fact, I think Andrej Karpathy was basically saying LLMs have fractal intelligence. What fractal intelligence means is basically that we don't know when they work: when they work, they work; when they don't, they don't. That's fractal intelligence. And that sort of shows. Which is good; still, we had nothing like this before. But part of the science of LLMs has to be
to say something more than fractal intelligence, to say: here is the level to which you can depend on their results. In reasoning, in logic, there are ways of formally characterizing the limits of reasoning, like limited-depth, limited-lookahead reasoning and so on. None of them seem to work for LLMs.
The question then is what would work? We have to figure that out. The bitter lesson is over and efficiency is going to matter. And I completely agree with that. I've been arguing this for a long time too. Think about the following thing: the first time we sent humans to the moon, cost was not a consideration. We wanted to show that we could do it. NASA was the one doing it.
For space as well as the moon, the second and third times may be okay too. But by now, it's Elon Musk sending people to space, and supposedly possibly to Mars too, because the cost matters.
Essentially, once it's been done, then you start caring about the cost that you're paying, and computer science is actually quite a bit about the unsexy parts of cost, just as it is about doing things that haven't been done before. There are people who say, well, if it's not retrieval, then it is reasoning. So what say you?
Reminds me of this old Monty Python scene, I think from The Holy Grail: to prove that some woman is a witch, if she is made of wood and she floats on water, then... "How do you know she is a witch?" "She looks like a witch!" "Bring her forward."
You know, random connections, and then saying that she's a witch, and you say QED. That looks like reasoning, because it's not just retrieving "she's a witch" from somewhere. But we know that it's not sound reasoning. So, Tufa Labs is a new AI research lab I'm starting in Zurich.
In a way, it is a Swiss version of DeepSeek. First, we want to investigate LLM systems and search methods applied to them, similar to O1, and we want to reverse engineer and explore those techniques ourselves. MLST is sponsored by CentML, which is the compute platform specifically optimized for AI workloads.
They support all of the latest open-source language models out of the box, like Llama for example. You can just choose the pricing point, choose the model that you want; it spins up, it's elastic, it auto-scales. You can pay on consumption, essentially, or you can have a model which is always working, or it can be freeze-dried when you're not using it. So what are you waiting for? Go to centml.ai and sign up now. Microsoft essentially
cannot any longer control OpenAI if in fact AGI has been achieved; that is one way they could avoid being beholden to Microsoft. But now they're trying to remove that clause so that they'll get more money from Microsoft. I don't know what that says: are they looking for money, or have they realized AGI is not actually going to come anyway, so why bother with that clause? So much has changed since our last conversation at ICML.
Can you give us a bit of a rundown of what's happened? When we were talking in Vienna, I think we were talking about the reasoning abilities of large language models. I think of large language models especially as the autoregressive, token-by-token prediction models, which are pre-trained for that and also do that at inference time.
And it was clear, I think, as we were talking about at that time, that those, from my perspective, did not have reasoning abilities. They're amazing at supporting creative work, where they can give you ideas and you can run with them, and they will give you answers as soon as you hit return, but they're not guaranteed to be correct. One of the interesting questions, of course, is that reasoning tends to have a higher complexity in terms of the time needed, and
are there ways of actually changing the LLM substrate to do that? And a couple of things happened. Obviously we'll get to O1 in a second, because that's the bigger thing that happened. But an interesting way of looking at that whole direction is what's been called, there are like two parts, inference-time scaling and
post-training. The first ideas that were tried, and in fact we talked about this when I talked about LLM-Modulo, is: to the extent LLMs are essentially quickly generating candidates but with no guarantees,
maybe you can make them generate tons and tons of candidates and then either do majority voting or self-consistency or something like that to see if you have a better answer. And how do you check for the better answer? There is a whole series of options. There might be external verifiers.
There might be LLMs themselves trying to partially verify; there are problems with that, which we talked about, but they have tried that too. So that's one type of inference-time scaling. A related, very interesting idea there is this: it's been known from day one that if you give a reasoning task to an LLM as a prompt,
and it gives a completion, and you check whether its completion contains the solution, the probability of that happening can in general be made higher if you can find the right kind of prompt augmentation. So in addition to your reasoning task, there are some magical tokens that you add, and that seems to increase the probability. This has been seen in multiple scenarios.
Originally, this idea was bandied about as chain of thought. The very first version of that is essentially the zeroth-order chain of thought, where the magical token is always the same one irrespective of the task and the LLM, like "let's think step by step." And that sort of worked, partly because the human data had those types of tokens, and so the LLM outputs that, and then that jogs its
pattern matching into actually picking up other solutions and so on. And then came the task-specific chain of thought, the one Jason Wei and co did, where humans give task-specific advice as to how to solve the problem and then hope that the LLM will actually solve it.
This can be connected with inference time scaling because you are adding chain of thought and also are essentially making it generate multiple candidates and then actually picking from them.
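To make the sample-and-pick idea concrete, here is a minimal sketch (my own illustration, not any particular paper's method) of best-of-N sampling with an external verifier and a self-consistency fallback; `generate` and `verify` are hypothetical stand-ins for an LLM sampling call and whatever checker you have available.

```python
import random
from collections import Counter

def generate(prompt: str) -> str:
    """Hypothetical stand-in for one sampled LLM completion (CoT prompt, temperature > 0)."""
    return random.choice(["answer-A", "answer-B", "answer-A"])

def verify(prompt: str, candidate: str) -> bool:
    """Hypothetical external verifier; returns True if the candidate checks out."""
    return candidate == "answer-A"

def best_of_n(prompt: str, n: int = 16) -> str:
    candidates = [generate(prompt) for _ in range(n)]
    # Prefer candidates that an external verifier accepts...
    verified = [c for c in candidates if verify(prompt, c)]
    if verified:
        return verified[0]
    # ...otherwise fall back to self-consistency (majority voting).
    answer, _count = Counter(candidates).most_common(1)[0]
    return answer

print(best_of_n("Stack block A on block B given ..."))
```

The point of the sketch is only that the accuracy gain comes from spending more inference-time compute plus having some way to pick among candidates, not from any single completion being more trustworthy.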
Chain of thought by itself again has problems; just as LLM self-verification has problems, chain of thought has problems. In fact, at this NeurIPS we had a paper called "Chain of Thoughtlessness" that we'll talk about later. But basically, by itself it has problems. But as part of the toolbox of increasing the time spent at inference before blurting out one answer,
chain of thought, together with this sort of picking from many samples, has shown some promise. One variation of that, and in fact something that I've been pushing more recently, is...
Originally, chain of thought was kind of confused with something anthropomorphic: we tend to tell ourselves, okay, let me do it this way, et cetera, and people were hoping that LLMs are doing the same thing. Mostly they were just imitating whatever "let's think step by step" data they found in the training data. But somehow people thought that if you make them
imitate human thinking, then maybe they will do better. Those are the first two ideas, and neither of them actually went that far. Another idea is to realize that it's just magical tokens that you are trying to add, and you just have to figure out what the right magical token is: a task-specific, LLM-specific magical token that increases the probability. This is a learning problem. It's an extra learning problem.
There are two general approaches that have been tried. The first approach was essentially to say: the LLM, before giving the answer, has to tell itself a few things. Something like "step by step" is the one that makes sense to us, but it could actually give itself one gobbledygook string, then another gobbledygook string, and that kind of probes its conditional probability of completion in such a way that it might actually come up with the correct solution.
The question then is: where are these tokens coming from?
And one idea people had, the very first idea, was that humans will supply these tokens as chain-of-thought advice. That wasn't going anywhere. One other idea was that if you, for example, have a class of problems for which there is an actual systematic... no, actually, before going to that, OpenAI did the following thing, which is: maybe we will ask humans to solve specific problems while thinking aloud.
So there's actually a paper from about one and a half years back, "Let's Verify Step by Step," and this went under this whole issue of process supervision: people were actually being asked to record what they're telling themselves, etc. This is like the worst form of psychology, unfortunately, because we don't actually know how we think. But they tried this, and one of the things is that it's extremely costly. They wound up, you know, my joke is,
they improved the GNP of Nigeria, because Nigerian Turkers were being asked to solve tons and tons of these problems and then think aloud. That was very costly.
And then a separate, similar idea was that there are bunches of problems for which there are systematic solvers. For example, for arithmetic there are arithmetic solvers; for search problems there are A*-search sorts of things; and for planning you have planners. In general, any systematic solver would be manipulating some data structures until a certain termination condition is reached, and then it outputs the solution. Imagine you make it
output the trace of the data-structure manipulation operations. All you needed, hopefully, was some extra tokens that come out before the solution, and this stuff can be thought of as a derivation. And the idea people had was: let's train the LLM on a huge number of these synthetic derivational traces plus the solution.
And remember, this only works for problems for which there actually are synthetic, systematic solvers; you're just trying to make them solvable in a general sense without having to call those solvers. That was the idea. And there have been a couple of,
three or four, efforts that have gone that way: there's Searchformer from Meta, there's Stream of Search, and just recently, last week, there's a Google DeepMind paper which also talks about internal versus external planning for solving multiple board games. And all of these essentially use variations of this idea.
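As a rough illustration of what such a derivational trace looks like (a toy construction of mine, not the actual Searchformer or Stream of Search pipelines), here a small breadth-first solver emits its data-structure manipulations as tokens before the final answer; problem/trace/solution triples of this shape are what you would fine-tune the LLM on.

```python
from collections import deque

def bfs_with_trace(graph: dict, start: str, goal: str) -> list[str]:
    """Toy systematic solver that records its own search trace.

    The trace tokens ("EXPAND x", "PUSH y", ...) play the role of the
    synthetic derivational tokens emitted before the final solution.
    """
    trace, frontier, parent = [], deque([start]), {start: None}
    while frontier:
        node = frontier.popleft()
        trace.append(f"EXPAND {node}")
        if node == goal:
            path, cur = [], node
            while cur is not None:
                path.append(cur)
                cur = parent[cur]
            trace.append("SOLUTION " + "->".join(reversed(path)))
            return trace
        for nxt in graph.get(node, []):
            if nxt not in parent:
                parent[nxt] = node
                trace.append(f"PUSH {nxt}")
                frontier.append(nxt)
    trace.append("SOLUTION none")
    return trace

# One synthetic training example: problem -> (trace tokens, then the solution).
graph = {"A": ["B", "C"], "B": ["D"], "C": ["D"], "D": []}
print(bfs_with_trace(graph, "A", "D"))
```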
So you have to realize that all they're doing is that the LLM now, before outputting the solution, has to output some additional tokens that will jog its memory to hopefully output a better solution. This is the hope. So basically, this was their idea; they tried it, and sometimes it actually works, it improves performance. There is no good reason to say systematically that this should make sense, because it's almost as if
you're trying to teach your very small kids how to reason, and you do some hand movements while thinking and then give the answer; you would then see the junior also doing these hand movements, thinking like this, and giving the wrong answer.
LLMs can do that: they're essentially imitating whatever derivational pieces you gave them, which may not even actually make sense, but sometimes they have shown some promise. This has basically fed into the most recent idea, called inference-time scaling, where essentially you do this and you also generate multiple suggestions and then pick from them, etc.
This comes very close to what I think O1 is doing, but with a big difference. Again, as you know, nobody knows what O1 is doing; it's become like, you know, we all sit around in a ring and suppose, and Noam Brown sits in the middle and knows. They don't want to tell what they're doing. But everybody has a guess, and my best guess as to what O1 might be doing is this:
going again with this prompt-augmentation idea, the question of course is where these prompt augmentations are coming from. We talked about, first, one prompt augmentation for everything. Second is human-given prompt augmentation, which is chain of thought. The third is the synthetic derivational trace that supplies these tokens, with the hope that the model will say them back.
None of them really make too much sense. A much better idea is if you are saying, what should I be telling myself to improve my outcome?
It's a kind of reinforcement learning problem. Imagine an AlphaGo agent. It's sitting there thinking: what action should I do, one after another, such that my win probability increases? So it does a whole bunch of these board actions, and then at some point it gets a signal saying you won the game or you lost the game. And you do this a gazillion times. You can then propagate that
back through the sequential decisions, computing their Q values: under what board positions are what actions worth doing? That's the Q value. Now, if you take the AlphaGo analogy and bring it to LLMs, the LLM "board position" is essentially the context window, with the prompt and all the other stuff that you have put in,
and the action is the token that you are generating. So to make things simple, I would like to think of it as: there's a big LLM, let's think of GPT-4, and there might be a small LLM with a reduced vocabulary. All it's trying to do is produce these prompt augmentations:
it tries one, throws it out, this gets added to the big LLM's context, the big LLM gives extensions, then it tries one more, and at some point it checks whether the solution is correct. Now, how does it check the solution? You could have generated huge numbers of synthetic examples beforehand,
again using solvers. It's pretty much known that OpenAI did this. It's no longer humans solving problems, because that's too costly; it's systematic solvers solving planning problems, constraint satisfaction problems, various sorts of problems, for which they have the problem and the answer,
and then the LLM, plus this prompt-augmentation engine, is trying to solve it. And if it happens to reach the correct solution, then you can propagate the signal back. This is RL over pseudo-moves: the actions are not, if the prompt is about Go, Go actions; they are just these prompt-augmentation tokens.
One nice thing, of course, is that instead of learning the Q values explicitly, you can change the weights of the smaller LLM in the right ways such that it puts out the right kinds of tokens given the context window. If you do this, you have approximate Q values. And then this is the
training phase: there's the LLM pre-training, followed by this humongously costly post-training phase; they are not telling us how costly it is, but presumably they spent billions of dollars on it. At that point, you have the O1 model, which is now ready for inference time.
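Purely to illustrate the speculation above (OpenAI's actual recipe is unknown), here is a toy post-training loop: a small "augmenter" samples pseudo-move tokens, a stand-in base LLM produces an answer, and the correctness signal from solver-generated ground truth is crudely credited back to the tokens that were used. All names and numbers here are invented for the sketch.

```python
import random

# Toy "pseudo-move" vocabulary for the small augmenter model.
MAGIC_TOKENS = ["<plan>", "<check-goal>", "<decompose>", "<reflect>"]

def big_llm_answer(prompt: str, augmentation: list[str]) -> str:
    """Hypothetical stand-in for the base LLM's completion.

    We simply pretend that certain augmentations make the right answer
    more likely, which is the effect the RL loop is banking on.
    """
    p_correct = 0.2 + 0.2 * augmentation.count("<decompose>")
    return "correct" if random.random() < min(p_correct, 0.95) else "wrong"

def train_augmenter(episodes: int = 2000, lr: float = 0.05) -> dict:
    # Preference scores per token; a stand-in for the small LLM's weights.
    scores = {t: 0.0 for t in MAGIC_TOKENS}
    for _ in range(episodes):
        # Sample a short sequence of pseudo-moves (the hidden "reasoning tokens").
        moves = random.choices(MAGIC_TOKENS, k=3)
        # Reward comes only from checking the final answer against
        # solver-generated ground truth -- the AlphaGo-style win signal.
        reward = 1.0 if big_llm_answer("training problem", moves) == "correct" else -0.2
        for m in moves:  # crude credit assignment across the pseudo-moves
            scores[m] += lr * reward
    return scores

print(train_augmenter())  # "<decompose>" should typically end up with the highest score
```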
And at inference time, once again, they're doing inference-time scaling, except now they have the Q values. You can improve the Q values with online MCTS kinds of approaches, the kind of thing that AlphaGo does. That's where we can actually see they're doing it, because they charge you for these reasoning tokens. If you run O1,
it basically takes the prompt and gives the answer. In the old GPT-4, the amount of money you have to pay them is proportional to the number of input prompt tokens plus four times the number of output tokens. In the case of O1, it actually does this whole bunch of stuff where it's telling itself things, basically these pseudo-moves it is effectively making, and it never shows that to you.
But they are all counted as output tokens. So, you know, you have, let's say, 50 input tokens, 100 output tokens, and maybe 5,000 reasoning tokens. And so you suddenly start paying a lot more. So one of the funny things that happened essentially is when we started playing with O1 Preview when it came out, in two days we spent like $8,000.
And then so in fact I had to get like special permission from the university because they normally don't reimburse beyond a certain thing unless you have like a separate permission and so on. But that's basically one of the ways this works out.
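Plugging in the illustrative numbers from above (50 input tokens, 100 output tokens, roughly 5,000 hidden reasoning tokens) with placeholder per-token prices shows why the bill balloons: the reasoning tokens are invisible to you but billed as output.

```python
def o1_style_cost(input_tokens: int, output_tokens: int, reasoning_tokens: int,
                  price_in_per_1k: float = 0.015, price_out_per_1k: float = 0.06) -> float:
    """Reasoning tokens are never shown, but they are billed as output tokens.

    The per-1k prices here are placeholders, not OpenAI's actual rates.
    """
    billed_output = output_tokens + reasoning_tokens
    return (input_tokens / 1000) * price_in_per_1k + (billed_output / 1000) * price_out_per_1k

# The example from above: 50 input, 100 output, ~5,000 hidden reasoning tokens.
print(f"${o1_style_cost(50, 100, 5000):.4f} per call")   # dominated by reasoning tokens
print(f"${o1_style_cost(50, 100, 0):.4f} without them")  # what a plain LLM call would cost
```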
The interesting thing, of course, is that the way we've described this, it is based on an LLM, but a significant amount of additional machinery has been added, right? Essentially, you are doing something like an AlphaGo-style post-training phase, followed by an AlphaGo-style online MCTS computation. And at that point,
actually, I would think it could make sense. And not surprisingly, in fact, in our results we found that for the normal PlanBench it does much better than the state-of-the-art LLMs, including Claude and so on and so forth.
But then, of course, you can go to the next level. It has its own issues, which we can still talk about: it doesn't scale to larger problems, it can make mistakes, it has problems with unsolvability, and there are no guarantees about the solution. But it now makes more sense to me. Again, I don't know; I think this is a reasonable way it could be working, and if it is the way it's working,
it's the first time I can make sense of how reasoning can emerge, because you are at least having pseudo-actions whose Q values you are learning. Nobody ever said RL cannot do reasoning. RL can do reasoning. It's just that
Now you basically, it's like sort of an interesting thing where I keep using the stone soup analogy. Stones, you can make soup with stones if you start adding carrots and, you know, tomatoes and all that stuff. At that point of time, it will still taste like soup. The question, of course, is who gets the credit? And, you know, that's kind of an interesting question that we would think about. But that is like the long arc of what happened, in my view, in the last decade.
It's only been four months or something since we last discussed this. And one of the other interesting things is that part of the mystique of LLMs was: you'd write the prompt, you hit return, you get the answer, and it doesn't cost you too much. That was why everybody was using it. O1,
Basically, with its, you know, of course, the post-training itself is extremely costly, but they are not charging us for that. But they're charging us for the reasoning tokens, which you never see, but you pay for it. You just have to take their word that a huge number of reasoning tokens were generated, and they're going to make you pay for that. And as far as I could tell, at least in academia, very few people have actually been doing experiments, evaluating, because it actually costs a lot.
Essentially, people are still going with the autoregressive LLMs because they're cheap. So one of the interesting things is that you can do reasoning, but the usual computational complexity issues, which we politely forgot in the era of autoregressive LLMs, hoping that somehow complexity would disappear, will come back: if you want to improve accuracy, you have to actually do reasoning, and this is pseudo-move reasoning in my view, but it still costs.
That becomes an interesting question of when is it useful to use a general purpose system versus a, you know, sort of a hybrid general special purpose system versus an extremely specialized solver. Something that we haven't talked about before, but now it will become costlier.
at least for industry. In fact, there is this whole movement around compound AI systems, and that's basically the kind of thing people think about. Very shortly after O1 was released, you, as you were just saying, quickly spent eight thousand dollars and put a paper together called "Planning in Strawberry Fields: Evaluating and Improving the Planning and Scheduling Capabilities of LRM o1."
So, yeah, you basically said they are positioned as approximate reasoners rather than mere approximate retrievers. We don't know the actual details of what they're doing. So there are two parts. One is what is objectively verifiable, which is that we did test O1 on the same PlanBench problems,
and it did quite well on Blocksworld. I think Claude was already at around 66; these things were at like 99 or something, so they basically saturated it. More impressively, they did better on the Mystery domain. And, given what I explained to you earlier about the possibility that they are
training with synthetic data, maybe they have unintentionally trained on the Mystery domain, which is available publicly. So we actually generated
truly new random mystery domains. It has lower performance on those, but it's still not like the 0.5% of the old models; it goes up to, I don't remember the exact numbers, 20-23% on some of these problems, which is obviously a good sign that it's actually able to solve them. The other part, as to why they are approximate reasoners rather than retrievers,
is based a lot more on my reconstruction of what they could potentially be doing, which is that they're doing reinforcement-learning-based post-training as well as online Q-value updates,
using pseudo-action moves. I call them pseudo-action moves because you could do RL for normal Go or any specific board game; this one is just a language game, where the game basically is: there's a language context window, there's a prompt augmentation, there's a new completion, then one more prompt augmentation. This is what they call these
long chains of thought, but that's basically adding bunches of prompt augmentations
and then seeing what happens at the end. And if it winds up being correct, in the sense that it winds up containing the correct solution for your training data, then that's sort of like AlphaGo getting the win signal after a bunch of moves. And then it just needs to do credit and blame assignment over the moves, and that's what RL is essentially good at doing. And if you're doing that, it's
reasoning, and it's approximate reasoning, because these are not actually problem-specific actions; they are problem-independent language prompt actions.
Is it possible that you might be wrong about that? Is it possible that we're giving them too much credit and what they're actually doing is just this massive generation of trajectories all in a single forward pass? So maybe they do something like process supervision. So they do some clever RL pre-training stuff, but...
So obviously, again, this is the sad part of the way O1 thing is. In fact, by the way, I must tell you a funny thing that I was talking to somebody who said they were having some conversations with the OpenAI guys, trying to sound them out as to what O1 might be doing. And at some point of time, one of them said that, I think you may have to wait until the Chinese replicate what we did to actually figure out what we did.
That's the level to which the science of OpenAI has gone. But the point is, it is possible that I might be giving more credit than is due to the sophistication of the method they might be using. The reason I still think this is likely to be the case is that, as I said in the earlier description of how things shifted from LLMs to inference-time scaling to this sort of O1-style method,
The general inference time scaling methods are not comparable. Just inference time scaling hasn't been as good. And again, the question, the other very important thing that you have to keep in mind is
while O1 takes more time, it doesn't take hours, right? Basically, a second of online computation time is way more expensive from a business perspective than days and months in the
pre-training phase. And so some of the inference-time scaling people actually spend a lot more time than O1 does, and they still are not getting, as far as I know, to that level of accuracy in general, which basically makes me think that unless you do a significant amount of post-training to get approximate Q values up front, you can't improve just with MCTS. So think again in terms of the AlphaGo analogy.
If you only did MCTS,
it would take much, much longer per move before you could get any level of accuracy, any level of confidence. But one of the things AlphaGo does is a humongous amount of pre-training, where it learns an approximate policy, which it then rolls out to improve the Q-value estimates that it has. So that's possibly the reason why I think it makes sense. And of course, I also think that
the normal inference-time scaling methods don't seem to make too much sense to me. The one closest to a pure MCTS method that I have seen is this paper from Alibaba called Marco-o1; they have this MarcoPolo group or something, and they called it Marco-o1. Marco-o1 essentially trains itself on chain-of-thought data, which is basically derivational data,
and then on top of it, it does an online MCTS-style computation to improve the Q values further. They are much smaller, and they're not as impressive in terms of performance gains as O1. So those are the reasons I think the full picture requires post-training as well as inference time. The thing that you and I see is the inference time.
But the thing that OpenAI can spend tons and tons of money is on the post-training, you know, which is before they even, you know, actually deploy the model. And that's where it is getting this approximate Q values is my guess. Again, as I said, it's a strange thing to be involved in. You know, I mean, we should be looking for...
the secrets of nature, because nature won't tell us; but we are now looking for the secrets of OpenAI, because OpenAI won't tell us. So hopefully, with the many efforts already underway to replicate this sort of thing, we'll know more. But as of now, that's where things stand: I cannot be sure exactly what they're doing.
Everything they have said publicly is consistent with my hypothesis; that's about the only thing I can say. There is nothing that's inconsistent with my model, I mean my speculation, of how O1 is working. In the strawberry paper, there's an appendix where we wrote down that speculation, and it is still consistent with everything they have said,
which is the only thing I can say. Yeah, I mean, I like the sound of it. It makes me more excited about using it, because it makes me feel that there's more sophistication behind the system. But a lot of this comes down to reasoning, and I'd love to hear your definition of reasoning. There are people who say, well, if it's not retrieval, then it is reasoning. So what say you? So let's actually look at the first part and the second part separately.
The definition of reasoning itself is a good place to start. I know this whole AGI crowd basically tries to say AI is going to be like humans. The problem is we don't have a good definition of what human reasoning is. But since the Greeks, our civilization went forward
not by defining what humans do, but by defining what sound reasoning patterns are. Aristotle, syllogisms, logic, probabilistic logic: the entire field of computer science, the entire civilization, depended
on having formal notions of reasoning in which there is correctness, there is incorrectness, et cetera. It reminds me of this old Monty Python scene, I think from The Holy Grail, where this guy does
something that looks like reasoning: to prove that someone is a witch, if she is made of wood and she floats on water, then she must be like a duck... Random connections, and then saying that she's a witch, and you say QED. That looks like reasoning, because it's not just retrieving "she's a witch" from somewhere. But we know that it's not sound reasoning.
And so in general, I prefer to think in terms of, because ultimately these systems are going to be deployed, whether you like it or not. And civilization didn't depend on
whether fallible humans can make mistakes and we look the other way; we actually have to have guarantees, at some level or other, about the soundness of reasoning and the completeness of reasoning. And so I go back essentially to definitions of reasoning from logic and so on, basically the formal definitions of reasoning. I try to avoid
getting into this question of what human reasoning is, because that is a big mess. Cognitive scientists don't know it. We don't know it. Psychologists don't know it. So I try to just give it a wide berth. So that's as far as what I believe about reasoning. That's why we looked at planning problems for which there is a correct solution, constraint satisfaction problems for which there is a correct solution. If you say a system is a reasoning system that can be deployed, it should have
some guarantees. Now, you can say that humans make mistakes too, but one of the things I keep saying is: if you are being paid to make decisions and you make mistakes, there are penalties for you; you can, in the end, be put in jail. Until we figure out who to put in jail, and how, when AI systems make mistakes over which they have no actual guarantees, we are better off thinking in terms of formal definitions of reasoning and then seeing to what extent AI systems come close to them.
This has basically been very connected to how AI has developed up until now anyway. Now, this discussion also brings us back to this issue of retrieval versus reasoning. I think you are talking about a couple of these papers that
keep coming out, basically trying to say: look, LLMs are not exactly retrieving anything they've seen; they're not just memorizing and retrieving, so they must be doing something else. And I would say, well, the Monty Python guy is not actually retrieving anything either; he puts together a whole bunch of things, but that's not reasoning. So between retrieval and what I would consider reasoning there can be an entire universe of things that
still won't be considered reasoning as far as I'm concerned, because there are no sorts of guarantees. And from the beginning we knew this. If you go back to many of these sorts of papers, these claims go back essentially to the autoregressive LLMs, because, by the way, the researchers are still very busy there. I think we have one of the few papers on O1; our evaluation of O1 is also being presented at a workshop at this NeurIPS.
But most people are still trying to make sense of autoregressive LLMs themselves, because, as we talked about last time too, I still think they are a very impressive System 1. We never had a system 1 like this in human civilization, and trying to understand what they're doing is useful. And so they go back to that and they'll say: look, they're not actually doing exact retrieval, they're doing something else, and we'll call this something else reasoning. That's not...
First of all, we always knew that LLMs are not databases. So they don't retrieve, essentially; they actually have a hard time memorizing and retrieving. When they memorize, it's not by deliberation; it doesn't happen deliberately, it happens fortuitously. It's surprising that sometimes they wind up memorizing long passages, because essentially everybody agrees that they are some kind of n-gram models rather than databases
in the way they are trained. Okay, so given that, it's very clear that they will never just retrieve, and the fact that they are not retrieving should not be seen as an indication of anything more.
It can be seen as an indication that they are not retrieving, but we knew this already. The part people seem to hint at is: since they are not doing retrieval, maybe they are doing reasoning. No, that doesn't make sense, because, again, you have to subject it to what you would consider the evaluations for sound reasoning procedures, and they fail just as easily as before. So if you come back to this chain-of-thought paper that
I was mentioning that we just presented at NeurIPS, right?
In the case of the Jason Wei-style chain-of-thought papers, what you do with the chain-of-thought idea is, let's say, take something like last-letter concatenation, which is a really small toy problem. You give it N words, and the system is supposed to take the last letter of each of these words and concatenate them into a string. So for "large", "big", "rose",
"ege" is the string you're supposed to output. And what they said was: if you just tell the LLM in the prompt that it's supposed to take the last letters and concatenate them and give the answer, and then you test it, its performance is not as good. But if you give it some examples, here are some examples of three-word last-letter concatenation problems,
and then four-word last-letter concatenation problems, a couple of these examples, and then ask your question, it improves performance. That looks like reasoning; somehow it is able to follow the procedure. The problem, and I think we talked about this last time too, is that in
empirical sciences you shouldn't stop when you get the answers that you're hoping for; you should see how to break your own hypothesis. What they didn't ask is: they gave examples of three- and four-word problems, and then they tested on three and four words.
But if you expect the system to be doing any kind of reasoning, any kind of procedure following, then once I tell you what last-letter concatenation is and give you an example, you will do it for 20 words, 30 words, etc.; it's just mechanically taking the last letter and concatenating. What we show is that if you increase the number of words, the performance just plummets, close to zero.
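Last-letter concatenation is trivial to check programmatically, which is what makes it a clean probe of length generalization. A sketch along these lines, with a hypothetical `llm_answer` call standing in for the model query, is all it takes to sweep the word count well past the lengths shown in the few-shot examples.

```python
import random

WORDS = ["large", "big", "rose", "apple", "stone", "river", "cloud", "tiger"]

def ground_truth(words: list[str]) -> str:
    # The task itself: take the last letter of each word and concatenate.
    return "".join(w[-1] for w in words)

def llm_answer(words: list[str]) -> str:
    """Hypothetical stand-in for querying the model with a few-shot CoT prompt."""
    ...  # call your LLM here
    return ""

def accuracy_at_length(n_words: int, trials: int = 50) -> float:
    correct = 0
    for _ in range(trials):
        words = random.choices(WORDS, k=n_words)
        if llm_answer(words) == ground_truth(words):
            correct += 1
    return correct / trials

# Few-shot examples use 3-4 words; test far beyond that to probe generalization.
for n in (3, 4, 10, 20, 30):
    print(n, accuracy_at_length(n))
```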
And this also happens in planning problems, not surprisingly. It happens in last-letter concatenation, it happens in planning problems, which shows that, yes, it's doing something which seems to have improved its performance at
the size of the problems for which you gave the examples, and its pattern matching of some kind is helping there, but it's not in any way generalized reasoning that would generalize with respect to length, for example. And so one interesting way I've been thinking about this is that it's sort of a "glass is
nowhere near full" versus "glass is at least wet" thing; that's sort of optimism versus pessimism. People tend to think that since it's at least solving the three- and four-word problems with higher accuracy because I gave this chain of thought, that shows reasoning abilities. But the question is,
we don't have a good understanding of where the boundary is within which it will actually work correctly.
So much so that we now have phrases like fractal intelligence. In fact, I think Andrej Karpathy was basically saying LLMs have fractal intelligence. What fractal intelligence means is basically that we don't know when they work: when they work, they work; when they don't, they don't. That's fractal intelligence. And that sort of shows. Which is good; still, we had nothing like this before. But part of the science of LLMs has to be
to say something more than fractal intelligence, to say: here is the level to which you can depend on their results. In reasoning, in logic, there are ways of formally characterizing the limits of reasoning, like limited-depth, limited-lookahead reasoning and so on. None of them seem to work for LLMs.
The question then is what would work? We have to figure that out. But instead of that, once in a while there are these papers saying: look, we probed using mechanistic interpretability techniques, and we found that LLMs are not acting like they're doing retrieval. But that's kind of understood already. And I think the mechanistic interpretability stuff is still very interesting; I think it may actually
be part of the solution to figuring out what LLMs are doing. But the argument that since it's not retrieval it must be something like reasoning is still quite unsatisfactory to me, because that "reasoning" is what I'm saying is not reasoning: all my papers are saying that whatever it is they were doing before you did your mechanistic interpretability study, they're still doing, and they still have these limitations before as well as after your study.
And we don't actually know how to characterize what it is that they are doing. And that's the part where we are stuck right now.
Is it possible that everyone is right? And what I mean by that is: I spoke with some DeepMind guys earlier in the week. There's a great paper about, you know, softmax needing glasses, talking about how sometimes we need directed attention for doing reasoning and sometimes we don't. There was another great paper, and I spoke to the authors, about the utter limitations of transformers doing counting and copying.
And Laura Ruis, I'm speaking with her on Sunday; she's got this paper out where she's looked at reasoning traces. And sometimes they are just retrieving facts from documents, sometimes they're doing kind of procedural information generation, which you might liken to a reasoning process, and sometimes...
I guess it's a little bit like this fractal intelligence thing, that it might be the case that possibly in certain circumstances, these models are doing something which we would think is reasoning. And sometimes they're doing retrieval and sometimes they're doing something else.
Yeah, so actually I think Laura Ruis's paper is one of the ones I had in mind when I was describing earlier this issue of mechanistic interpretability. I think it's a good paper in the sense that they have developed an interesting set of techniques to actually see what is going on in the way LLMs are outputting their tokens, but
the thing that is unsatisfactory to me is, basically, two things. One is, first of all, everybody knew that LLMs are not doing retrieval alone; that was well known way before, right? So there is nobody who believes that LLMs are just doing retrieval. The question is, what else are they doing, and is there any clean
characterization of what they're doing? That I did not see. I actually looked at that paper; I think they've done good work, but I'm still hoping that there will be an interesting characterization. There are lots and lots of groups trying to look for a characterization of what this fractal intelligence might be right now, but we haven't gone further than that. In terms of
everybody might be right. I mean, the sense that there could be this whole blind man and the elephant, you know, phenomenon in play to some extent. And that part is possible because we are actually trying to
piece together a large number of parts of this puzzle, including the reasoning part, including what they are even trying to do, including what sorts of techniques seem to improve their accuracy, and so on. But I think that's part of science. And basically, my sense is that eternal discontent is part of science. I actually am much more worried about being too optimistic that we've figured it out than I am about
being somewhat more discontent that we haven't yet figured it out. And so I want to err on that side. Now, we do know more than before, when GPT-3 came out, and I think all the camps know more: the people who thought GPT-3 was AGI know that that's not the case, and the camp that thought GPT-3 was just a stochastic parrot has to know that it's more than that,
by now. So that is a collective improvement in our intelligence, but still, there are a large number of pieces to put together. Yeah. I mean, on Laura's paper, she was using influence functions. I'm not sure if that would be classed as classical interpretability or mech interp, but I think mech interp is largely about finding circuits in neural networks. And even that's an interesting discussion. To me, it's more of a...
The general idea of figuring out a way of probing the inside of what LLMs are doing, I think of that as mechanistic interpretability. I mean, there are very specific techniques that have shown great promise, such as the autoencoder stuff, etc. But I think all of these, to me, are essentially trying to interpret...
what they're doing at the circuit level, and to try to make sense of their external behavior. So, to me, there are two ways of making sense of what LLMs are doing. One is just external evaluation; that has happened already, and we know that they're not doing any kind of guaranteeable reasoning. There are basically enough results showing they seem to do
promising things in some cases, and also results showing this seems to be very brittle: you change the prompt a little bit, you change the problem specification a little bit, and they die. Again, we are talking about autoregressive LLMs, not O1 sorts of things; that's a whole other thing for which we haven't yet started doing the same sort of analysis. But once you figure those out, my sense is that there's trying to get a sense
just from the outside versus also trying to probe the internal circuits. If you start probing internal circuits, I think of that generally, in my view, as the mechanistic interpretability style. Okay. But isn't it interesting, though, that she found that
Code and math-based procedural documents appear disproportionately influential for tasks requiring reasoning. Larger models show an even stronger reliance on general procedural data for reasoning. The presence of code data in pre-training mix seems to offer abstract reasoning patterns that the model can generalize from. I mean, these are interesting observations. Actually, again,
I don't want to make this a very specific critique of a particular paper, just because that's not fair to them or to me. But I do want to basically say that there is a distinction between factual tasks and reasoning tasks.
LLMs have been used for both, and I think they have troubles in both. For factuality, I would think the only sorts of things that will improve them are things like the RAG-style techniques, where you just give the factual data and ask it to summarize. For the reasoning stuff,
for arithmetic and so on at larger sizes, to some extent I would expect these are the kinds of things where the exact results don't exist in the training data. And so I would also be equally
troubled by the fact that people have shown, way before all this work of Laura's, that if you take something like LLM multiplication, they tend to be correct on multiplications involving popular digits and less correct for non-popular digits. It's sort of mind-blowing that there are digits that are popular versus non-popular. But that is an interesting point:
the LLM's final performance is a complex combination of the data they have been trained on and some additional pattern-matching abilities they are using on top. But that's not sound reasoning, so we still don't quite know where it breaks. But
this fact that it gets to be correct for popular digits and not for some other digits is a particularly interesting thing to me, and that sort of shows it. By the way, while we're on that subject, some work has been done even with O1. We looked at O1 more on the planning side, but some people, I think Tom McCoy and colleagues, the ones who did the Caesar cipher sort of thing, the "Embers of Autoregression" work, did some more experiments, and
They basically also found that O1 does better on some of those things, but they also still found that there are data dependencies in the sense its accuracy was higher in the regions where there was higher pre-training data.
which again, I think, is still consistent with my view of what O1 might be doing: there is an LLM which was pre-trained on some corpus, and there is this smaller LLM which is sort of generating these pseudo-action tokens that will make it output things. And one of the interesting things is actually the difference. I'm told, and again we don't know this for sure,
I'm told that when the original O1 models came out, there was an O1-mini and an O1-preview, and the difference, I'm told, was that one of them, I think O1-mini, was using a smaller LLM as the base LLM, and O1-preview was using the larger LLM as the base. They didn't say this second part, but I would assume that if I have a pseudo-action generator model
working on a bigger LLM, which has higher capacity and so can generate more interesting completions, versus a smaller LLM that has less interesting completions, that makes a difference in the level to which the RL-based training can get your accuracy up. Yeah. I've noticed some interesting things. So I've now paid for O1 Pro.
I was very skeptical of O1. So, as you say, the base model is an even weaker version of GPT-4. And GPT-4, I hate that model. I hate the style of it. I think it's dumb. And I must admit, it's mostly because I'm sort of anthropomorphizing it: because I hate the style, I think it's dumb. Humans are very brittle; even with the RLHF,
you know, we like assertiveness, we like complexity, there are certain styles that we like, and we don't actually see the content. But that's one side: I don't like the model. And with O1-preview and mini, it doesn't really want to think, so most of the time it won't think, and you get an even dumber answer than you would with GPT-4o. However, O1 Pro,
the vibes are different. So it thinks more and it gives you something which is qualitatively completely on a different level. It doesn't look like dumb chat GPT anymore. It feels very, very different.
But there are still some issues with it. Certainly for situations where you are dealing with ambiguity, doing programming or something like that, I actually like having a dumber model, because it's a didactic exchange, right? I'm saying, no, you misunderstood that, let's do this, let's do that; we're working on this thing together. What O1 does is say, well, on the one hand you can do this, and on the other hand you can do that. It gives you a range of options,
you know, and I'm like, well, wouldn't it be better just to either, you know, go on, dance with the model or just better specify what you wanted in the first place? So,
Again, two issues. First of all, O1 Pro just came out, I think, last week, right? And it was exam week for me, so we haven't spent time with it yet. We haven't spent any money on O1 Pro yet; I mean, I played with it from the outside, but we haven't done any API-level studies, which is the kind of thing we did with O1-preview. But one thing is, I've looked at the Twitter exchanges about people,
the usual suspects, trying various things on it, etc., and two things jumped out at me. One of the things we saw in O1-preview is exactly the kind of thing you are saying, and it looks like O1 Pro is still doing it, which is: they are good at digging in to try to explain why the answer they gave is the correct answer. One of the funny things was,
I use this one particular three-block stacking example, which is unsolvable. And in fact, this showed up in the New York Times as an example of where GPT-4o actually fails.
When O1-preview came out, Noam Brown actually, in his long tweet, one of the things he said was: Rao said in his ACL talk that this problem can't be solved, and O1-preview actually does solve this instance. This is good. Now, people have actually found that O1 gets the wrong answer,
and multiple people have posted the screenshots: it gets the wrong answer, but it argues with you as to why the answer it is giving is still possibly correct. So this particular problem essentially cannot be solved without moving C, and it turns out that
it gives an answer where C actually does move: because of gravity, it will fall down. And then it tries to argue with you that there are games where people will say that unless you are intentionally moving C, if a natural process makes it fall, it's not considered moving. Which is a very interesting thing that we have seen in O1-preview too, when we give it unsolvable instances. Which, by the way,
normal LLMs just die on unsolvable instances, because they've been RLHF'd to death, and so they think that if you give them a problem, there must be an answer; so they'll give you something for most unsolvable problems. That's why this was an unsolvable instance that I showed to GPT-4o before. O1-preview actually solves more of them correctly; that's a credit to it, and that's why it's actually more of an approximate reasoning model, an LRM in my view, rather than an LLM.
But on the other hand, when it actually gives a solution for an unsolvable instance, it will argue with you that it is still actually right. And so I made this joke in the strawberry paper that we have gone from hallucination to gaslighting. So it actually tries to argue that you are...
Just like what you're saying: on the one hand, what you want to do might be worthwhile doing, but on the other hand, this is the reason why I'm doing this. And in fact, I think this guy Colin Fraser, I believe, one of these people on Twitter who keeps playing with these models, said he gave it
the surgeon problem, the classical one where the boy gets into an accident, and O1 Pro does this whole thing: this is a classical puzzle that brings gender stereotypes into account, etc., etc., and
then gives the answer about what the right way to think about it is. But this is a version of the puzzle where he makes the change that the mother and the boy are driving,
and the mother dies, and the doctor says, I can't operate on the boy. So he actually changes the puzzle. And still, O1 apparently says we should basically realize that the doctor is the second mother of the boy, and it will try to argue that position. So, interestingly, overthinking is actually kind of a problem,
And actually trying to dig down. And so one of the interesting questions that we don't know, again, we haven't played with this, is to what extent is its explanation and its reasoning connected?
You know, in humans, and I'm not trying to anthropomorphize what it's doing, there are two different phases, right? In phase one, it comes up with a solution. In phase two, if it needs to explain, and it doesn't have to look at what it actually did to get to the solution, the explanation is just: dig my heels in and try to say the solution is correct. And people tend to do that; sometimes we'll come to some solution and then we'll try to come up with an explanation as to why it might be right.
This is something LLMs had as a problem anyway to begin with, because for them these are completely different things, and I'm always worried about LLM explanations. LRMs seem to be even more sophisticated at this sometimes, but that's mostly anecdotal; I haven't really done systematic studies on it. So, one, I don't have any visceral opinions about any of these models, because,
to be honest, I don't use them in my day-to-day life. Most of the time, I write English well enough that I haven't yet seen an LLM that does a better job than I do, and I haven't yet found useful things where I would need an LLM's help. I mean, maybe at some point I will use LLMs or LRMs. So I don't have anecdotal experiences of the kind that you have. I'm mostly focused on
specific systematic studies with multiple instances of planning problems. We extended PlanBench to look at unsolvability, longer-length problems, scheduling problems, et cetera, to evaluate. Those are the areas where I have a better sense of what O1 can and cannot do. Yes. I must admit, I've updated a little bit. I was always in the same camp as you, when we thought of them as approximate retrievers.
I'm now starting to see something. Yes, and I think, again, my point is there are two different ways of thinking about it. One is that it's not the LLMs which became that. So how you define LLMs has to be part of the discussion. I mean, that's why I keep talking about the stone soup metaphor: not because I want to play down the importance of O1, it's a great thing,
but you do have to decide who you want to give credit to. Part of your, and definitely my, reservations about the reasoning abilities of
LLMs were that they were autoregressive, teacher-forced trained things. And that was true from GPT-3.5 all the way to GPT-4o. And OpenAI knows this. OpenAI knows it enough that they no longer call it GPT: this is not GPT-o1, you know, it's called O1. It's a completely different model, and they know that it's not GPT-o1.
All you can say is that it was done by some of the same people who also developed LLMs. But we can't define LLMs to be whatever it is that OpenAI is producing; we have to have theoretical definitions. And my sense is that autoregressive LLMs still have all the problems, but also all the advantages, because they're very fast. They're amazingly fast System 1s.
And O1 is a reasoning model because it actually adds the reasoning post-training as well as reasoning at inference time, which nobody said would not be doable. It's still great that they are able to do it in a fairly general way, but I don't think there was ever an argument that AI systems could not do reasoning, right? After all, AlphaGo is basically a reasoning system; it was just a deep and narrow reasoning system.
And the question was whether you can be more general and broader, while not being as shallow as LLMs, and LRMs are a good step in that direction. But it doesn't change what I thought about LLMs, meaning the autoregressive models. They are different, and in fact they have advantages that O1 lacks.
For example, the cost of LLMs can actually be much lower. It is indeed much lower. So one of the studies, one of the things that we learned in the strawberry paper, for example,
the "Planning in Strawberry Fields" paper, is that in some cases, and you have to think of computer science as eventually being about efficiency and cost too, right, if you're giving a particular instance of a problem to O1, you pay this many dollars,
versus giving the same instance to an LLM with a verifier in this inference-time scaling approach, what I would call LLM-Modulo, which is a general approach we have been pushing. The LLM-Modulo approach, where an autoregressive LLM generates many candidates and an external verifier, or even an LLM-based verifier, or a learned verifier, checks them, can actually be cheaper than O1 producing just one candidate, at the same accuracy.
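A minimal sketch of the LLM-Modulo generate-test loop being contrasted with O1 here (my simplification of the general idea, not the paper's code): a cheap autoregressive LLM proposes candidates, an external sound verifier checks them, and critiques are folded back into the prompt. The `propose` and `verify_plan` functions are placeholders for a real generator and a real plan validator.

```python
def propose(prompt: str) -> str:
    """Placeholder for a cheap autoregressive LLM generating one candidate plan."""
    return "pickup A; stack A B"

def verify_plan(candidate: str) -> tuple[bool, str]:
    """Placeholder for an external, sound verifier (e.g., a plan validator).

    Returns (is_valid, critique) so failures can be folded back into the prompt.
    """
    ok = "stack A B" in candidate
    return ok, "" if ok else "precondition of stack not satisfied"

def llm_modulo(problem: str, max_rounds: int = 10) -> str | None:
    prompt = problem
    for _ in range(max_rounds):
        candidate = propose(prompt)
        ok, critique = verify_plan(candidate)
        if ok:
            return candidate  # soundness comes from the verifier, not the LLM
        # Back-prompt: append the critique and ask the LLM to try again.
        prompt = f"{problem}\nPrevious attempt: {candidate}\nCritique: {critique}\nTry again."
    return None

print(llm_modulo("Blocksworld: get A onto B."))
```

The cost comparison in the text is then between many cheap generate-and-check rounds here versus one expensive reasoning-token-heavy O1 call.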
That becomes interesting because, you know, part of the interesting thing about human civilization is that on one hand we are general-purpose reasoners, but on the other hand, we also know that every job requires a tool, and we use tools too.
Doing everything ourselves that a specialized tool does can be extremely inefficient in terms of the time we spend. That is going to be the case for these reasoning models too, to some extent, because O1 actually costs quite a bit right now. How much, and when that is going to change, is anybody's guess.
That sort of brings up, in fact, there was Sepp Hochreiter, the LSTM guy. That's great, so you should ask him too. Yesterday I was in his talk, and one of his slides basically said the bitter lesson is over and efficiency is going to matter. And I completely agree with that. I've been arguing this for a long time too. Think about the following thing.
The first time we sent humans to the moon, cost was not a consideration. We wanted to show that we could do it; NASA was the one doing it. The second and third time, et cetera, for space as well as the moon, maybe it was still okay. But by now, it's Elon Musk sending people to space, and supposedly possibly to Mars too, because the cost matters.
Essentially, once it's been done, then you start caring about the cost that you're paying. Computer science is actually quite a bit about the unsexy parts of cost, just as it is about doing things that haven't been done before. We are now in the second phase where we are actually going to care about...
Basically, how much am I spending in terms of pre-training cost, in terms of inference cost, et cetera, and are there better approaches that I could be using? This has been the case in computer science before too, and it became less of an issue for a while because LLMs were just System 1s, with essentially no inference-time cost at all.
Even though the pre-training was very costly, inference time was very cheap, and so we didn't have to worry about it. Now we will worry about it. So one of the funny things, the elephant in the room for our PlanBench problems on O1 Preview, was that
the normal classical planners that are meant to solve these problems solve them at such a small fraction of the cost. They run on our laptops and solve all the problems with 100 percent guarantees. Now, I realize they're completely specialized to that one problem class. On the other hand, you have this very general-purpose thing which has cost as well as inaccuracies.
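As a rough illustration of the cost gap being described, this is roughly what running a classical planner on one such instance looks like. It assumes Fast Downward is installed and on the PATH and that the PDDL files exist locally; the search configuration shown is just one common choice, not anything specific to PlanBench.

```python
# Rough sketch: timing a classical planner on a single PDDL instance.
# Assumes Fast Downward is installed and on the PATH, and that domain.pddl /
# problem.pddl exist; "astar(lmcut())" is just one common search configuration.
import subprocess
import time

def solve_with_classical_planner(domain: str, problem: str) -> float:
    """Run Fast Downward on one instance and return wall-clock seconds."""
    start = time.time()
    subprocess.run(
        ["fast-downward.py", domain, problem, "--search", "astar(lmcut())"],
        check=True,
    )
    return time.time() - start

if __name__ == "__main__":
    seconds = solve_with_classical_planner("domain.pddl", "problem.pddl")
    # Blocksworld-sized instances typically finish in a fraction of a second on
    # a laptop, with soundness guarantees; a reasoning-model call on the same
    # instance costs real dollars and comes with no such guarantee.
    print(f"solved in {seconds:.2f}s")
```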
So we start worrying about the trade-off: where on this generality-cost spectrum are you going to find a home? That is going to be a very important question. And I think that's what Sepp Hochreiter was hinting at when he said the bitter lesson part is over, that you do actually need to worry about
the cost you are spending to actually achieve a goal. The first time you are achieving that goal, nobody cares about the cost, because it's never been done; you're doing it, you get all the credit. But the umpteenth time it's being done, when it becomes a normal day-to-day thing, then the efficiency aspects matter.
A few things on that. I mean, first of all, with O1 Pro, I think it's worth $200 a month and you can call it a hundred times a day. Of course, the API is very, very expensive, but I'm already spending over a thousand dollars a month on Claude 3.5 Sonnet. But you raise an interesting point. First of all, the utility of an O1 model: it's a bit of a weird model, right? It's useful in certain specific circumstances. And
if anything, because of the verbosity and the distractors in the context, it's not really a model that you want to be using most of the time. But that raises the pragmatism and the architecture and the efficiency thing that you're speaking to. So I spoke with some guys this morning, and
they have built a kind of neuroevolution approach to designing multi-agent systems. At the moment, we hack in the tool use: do we use a debate pattern? Do we have a small model that we prompt a lot, or do we use a bigger model? We're all just hacking together these multi-agent architectures, and some of those architectures will even be doing the kinds of things that you're speaking about. So rather than it trying to convince you that it got the right answer, there might be a supervisor agent which does some reflection.
There might be another agent which generates the symbolic planning code and runs it on a tool. So we're building these big, complicated things, and I think that's the process we need to figure out now: building the systems that actually use this technology in the best way. - Yeah. So I sort of agree, but one distinction I want to point out is that there are two notions of use for these kinds of models. When you do a subscription model, $20 or $200,
I would argue that that is by definition human-in-the-loop, with the model being an assistant to you. And it's a very different kind of evaluation, where you were unhappy with the previous model because it was wasting more of your time and it wasn't worth it, while this one was helping in whatever you were doing and you are happy with it. That's one particular type of use. In general, I've actually...
I've always thought, and I think we talked about it last time too, that large language models, and now large reasoning models too, are unquestionably intelligence amplifiers. There's no question on that part, okay? If you want to use it, you use it, and if people are able to find uses for it, that's great. The part that I actually talk more about, and that's been most of our work, is really
the scenarios where these become the end-user-facing systems,
where they'll make the decisions. They will just say, "This is the answer, and then you're going to execute this plan." So the robot will execute this plan. Or, "This is the travel plan, for which I'll buy the tickets." You don't get to come back in and say, "Oh, I don't like this travel plan." That's what you do in the subscription model. But the one that I'm talking about, basically the API access, is what all the startups who are trying to build additional tools
on top of these models use; they are going to provide specific autonomous functionality. And that's where I'm talking about the actual computational cost versus benefit for a certain level of accuracy at the end-user time. These are very different kinds of users, and I actually have no question at all in my mind
that LLMs, and definitely also LRMs, are great intelligence amplifiers. But that's not my worry. My worry has always been that people are trying to put these in end-user-facing situations where they'll actually make the decisions and some executor just executes them without pushing back.
And when that happens, the guarantees matter; the brittleness of the reasoning matters. If you are in the loop, it's like having an assistant: you may fire the assistant if they give mostly bad ideas, but you will never blindly just use the assistant's ideas, right? The buck stops with you. That's a very different way of using LLMs
than when LLMs are the ones the patient talks to, with no doctor between the LLM or LRM and the patient. In that case, their accuracy matters, and their cost in getting to a certain level of accuracy matters. These are two very different uses, and I'm much more interested in the second use than the first. - Can I push back just a tiny bit? So first of all, I completely agree with you that these things used autonomously don't work. They don't work for all of the reasons that you said, but
that's not how they're being used, and they're not being used like that because they don't work. What we are seeing is that all of the successful reimaginings of applications with language models are completely interactive: they have a human in the loop, and the human is supervising, augmenting, redirecting, and so on.
The next step, which we haven't seen yet but are starting to see, is autonomous agent-based systems with multiple levels of reflection, checking, and so on. For example, it could be a bunch of agents generating programs, contributing to a library of programs, where the programs are supervised not just by you but by other users of the application, and the whole thing grows into a living ecosystem. So there's some diffuse form of human-supervised verification,
and maybe in the future the humans might be increasingly taken out of the front plane. - I think that's a very sane way of using it, but I'm afraid that's not the only way it's being used, and in fact most of the people... So actually there are two issues. One is, if that's the only way, I'm very happy, because it's a tool and you would use it and the onus is still on you; finally, the buck stops with you, because you are in the loop, right?
But most of the imagined uses, at least from where I sit and the kind of startups that I hear from and the kind of papers that I'm even reading, they're all about autonomous uses. And that's where I'm actually looking at the fact that there is more promise than before. It was very brittle before. It's less brittle now.
But it is less brittle at the expense of cost. And it's actually interesting that the evaluation strategies for both of these are quite different. Evaluating assistive technologies is very different
from evaluating autonomous technologies. And it's not that assistive-technology evaluation is any easier, in fact. Basically, you can say the evaluation is just whether people keep buying it and paying for the subscription; that's proof that people seem to be getting some value out of it.
But it's actually pretty hard to correctly evaluate assistive technologies, and that's an entire area in itself. In fact, most of the people who are worried about misuses of LLMs, including, for example, François Chollet and his ARC thing, et cetera, ultimately what all of us are interested in is this:
irrespective of whether you believe AGI is coming next week or next decade or next century, everybody in AI eventually wants these autonomous abilities to take intelligent action, with guarantees.
And I think we will get there. But prematurely saying that whatever is currently there is already working is what a bunch of us are worried about, and that's what we are pushing back on. With humans in the loop, it's a completely different thing. And even for code generation right now, there are
two different uses, essentially. There is also a use of code-generation techniques where it tries to improve accuracy to the level that humans don't have to check, and it's not just idea generation for the human. If it's idea generation, it's great, because it's not like somebody else's job is on the line; the buck still stops with the actual programmer. So the autonomous use is the one that I care about at any rate, and that's the one where I'm worried about
the premature declarations that they're already autonomously intelligent. But I'm generally very happy that this technology exists as a human-in-the-loop
technology. And it's kind of interesting for me, sitting here, to hear you say that you, as a user, and you seem to be more of a regular user of the LLMs and LRMs than I ever have been, that you like O1 more than you ever liked O1 Preview, and you were kind of okay with GPT-4 maybe, but now like O1 a little more, and that
you basically are getting value out of it, but you still always have the red switch: you can decide not to take its answer. - O1 Pro. - Yeah, okay, O1 Pro, yeah, okay, yeah. - The only difference is that when it thinks for a long time, there seems to be a qualitative improvement.
I wanted to get your take on something else. So we're seeing, I mean, you had your LLM-modulo architecture, and then we've got this huge approach of test-time compute, this kind of greenblatting approach. So you greenblatt the model and you get it to generate loads and loads of Python functions. And in a way this is the
sort of thing that we like, because we like... - Yeah, yeah, greenblatting, okay, fine, yeah. - We're seeing that in lots and lots of different ways. So doing loads and loads of inference, and then we've got these Python functions, and maybe we do library learning and remixing. We're in the world of code, so we're generating an explicit function and we can verify it. We love that; we're in a very happy place. But now we're seeing an interesting shift. So certainly on ARC and in several other papers,
people are moving towards this idea of transductive active fine-tuning. And that simply means, rather than generating an explicit Python function,
and doing that loads and loads of times, let's just generate the solution directly using the neural network. And this is a step away, because we like programs: programs are Turing complete, and we understand what they mean and everything. And now there's a whole load of people saying, actually, the neural network can just do whatever the program does; let's just let the neural network output the solution directly. What do you think about that?
So to be honest, I haven't followed that work as closely, so my answer is somewhat more generic. I would be surprised. I mean, I would have the same bias. In fact, there's an old saying: why write programs when you can write programs that write programs? That's the version we are talking about: you want to generate higher-level code that generates the solutions. This has always been
the conceit of computer science. So I am surprised, and I don't actually know specifically the work that you are referring to, in terms of just going back and directly going for the solutions. Because honestly,
in the context of inference-time scaling, one interesting question is: you generate loads and loads of candidates. The candidates can be direct solution candidates or code candidates, either way. And then you still have to have a verifier. If it's code, you need a code verifier; if it's a solution, you need a solution verifier. And one of the interesting questions is where these verifiers come from.
And actually, one of the more effective ideas we've been pursuing is that you can essentially generate verifiers.
Of course, there are symbolic verifiers that might exist for specific things, and we can use those in LLM-modulo-style frameworks. But you could also use learned verifiers, where you essentially learn, discriminatively, what is a solution versus what is not a solution. A third idea is to generate the code for the verifier and then correct it.
And that actually, at least in our case, seems to be promising. We are working on some things that are going to come out soon. But basically, I still think that, especially in the context of LLMs... okay, so it's, again, a very different thing. If you don't have LLMs in the loop at all, it's a different question. But if the LLMs are there, one of the things they're actually good at doing is
outputting code as well as solutions, in which case lots and lots of classes of solutions can be verified by the code. If you correct the code once, then it will keep working for a long time, in essence.
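Here is a small sketch of that third idea, generating the verifier once and then reusing it across candidates. The constraint and both function names are purely illustrative; the point is that a single corrected piece of verifier code can screen arbitrarily many sampled solutions.

```python
# Sketch of "generate the verifier, correct it once, reuse it on every candidate".
# The constraint ("all numbers even and in ascending order") is a toy stand-in
# for whatever the generated-and-then-corrected verifier actually checks.

def generated_verifier(candidate: list) -> bool:
    """The (once-corrected) verifier: every entry even, sequence non-decreasing."""
    return all(x % 2 == 0 for x in candidate) and candidate == sorted(candidate)

def filter_candidates(candidates: list) -> list:
    """Apply the same corrected verifier to every sampled candidate."""
    return [c for c in candidates if generated_verifier(c)]

samples = [[2, 4, 8], [1, 2, 3], [6, 4, 2], [0, 10, 10]]
print(filter_candidates(samples))  # -> [[2, 4, 8], [0, 10, 10]]
```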
I would still think that, at least for the inference-time-scaling-with-verifiers case, that seems to be a good idea. I don't quite know the specific context in which, as you're saying, these people are claiming
that transductively, directly guessing solutions would help. I'm not sure whether they still have an LLM in the loop, or are they just saying we'll directly train a separate neural network? - Well, I'll sketch it out. For solving ARC, they have two Llama 8-billion-parameter models: one is generating Python programs, and they greenblatt it; the other one is trained separately just to output the answer grid directly. - Okay.
And in both cases they do inference-time compute: either generating lots of Python programs, or doing active fine-tuning of the direct-solution model by augmenting the test-time examples. And what they found, on the Venn diagram of their success rates, is that for some problems the program approach works really well, the greenblatt approach, and for some problems, certainly things like mosaics and spatial, perceptual-type stuff,
the transduction works really, really well. And this is kind of weird, because if you think about the space of functions that the neural network could reason about, they should be the same. So I don't know whether it's just because of limitations in the neural network, or characteristics of the problem, or something that you see. - Interestingly, to me, again, it depends very much on the space of
solution configurations versus the space of code configurations. There are many problems where solutions might be of less, quote-unquote, syntactic complexity than the code. And so a neural network that can kind of guess a string may not be able to guess something that looks like a syntactically correct Python program, right? LLMs actually can do that.
And so it is interesting that, if you can do that, going back to the neural network directly guessing the solution turns out to be a more useful step for some problems. Again, the stuff we are doing on the verification side is still in the initial stages, and we haven't actually checked whether this kind of trade-off exists, so I have no more specific insights on why that might be happening.
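A sketch of how the induction-versus-transduction ensemble just discussed might be wired together for an ARC-style task. `sample_programs` and `predict_grid_directly` are hypothetical stand-ins for the two models (the program generator and the direct grid guesser); the toy task here is simply transposing a grid.

```python
# Sketch of the induction-vs-transduction ensemble for an ARC-style task.
# `sample_programs` and `predict_grid_directly` stand in for the two models:
# one proposes Python programs, the other guesses the output grid directly.

def run_program(program, grid):
    try:
        return program(grid)
    except Exception:
        return None

def solve(train_pairs, test_input, sample_programs, predict_grid_directly):
    # Induction: accept a sampled program only if it reproduces every training pair.
    for program in sample_programs(train_pairs):
        if all(run_program(program, x) == y for x, y in train_pairs):
            return run_program(program, test_input)
    # Transduction: fall back to directly guessing the output grid.
    return predict_grid_directly(train_pairs, test_input)

# Toy stand-ins so the sketch runs: the hidden rule is "transpose the grid".
def sample_programs(train_pairs):
    yield lambda g: g                              # an incorrect candidate
    yield lambda g: [list(r) for r in zip(*g)]     # a correct candidate

def predict_grid_directly(train_pairs, test_input):
    return test_input                              # a (bad) direct guess

print(solve([([[1, 2]], [[1], [2]])], [[3, 4]], sample_programs, predict_grid_directly))
# -> [[3], [4]]
```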
Wonderful. What are you doing at the conference this week? - That's fun. So I'm just here today. We did this Chain of Thoughtlessness paper, and I kind of said that, mostly, when we wrote it, it was like,
it can't follow procedures, so I should be able to show that. But now I explain the whole thing the way I explained it to you here in the beginning, essentially starting from prompt augmentation. Like Schopenhauer said, life must be lived forwards but only makes sense backwards. Papers also only make sense backwards. After a while of writing, you look at it and say, what I really want to say is the reason chain of thought
is not a great idea is that you really want to think in terms of prompt augmentations, and humans coming into the loop becomes less important. So that's what we did. And then I'm actually going to this compound-systems thing, and I've had a great time. There are like 16,000 people and, you know,
running into lots of old friends and so on. - Yeah, one of the best moments from the last interview was when you were talking about that paper, saying that you can teach someone to catch two fish or three fish or four. - Yeah, that's, yeah, yeah. I mean, that's basically because it doesn't quite know how to generalize. And so I kind of made that point that, essentially,
you have to give it examples for four-word problems, again give it examples for seven-word problems, again give examples for nine-word problems, et cetera, and then try to improve it. Whereas the conceit is that when you say this, people will say, oh, it must be doing procedural generalization. The interesting thing, again, and I think we had this conversation last time too, is that the way I look at this, I'm skeptical only
because of having some additional background. One of those things is that John McCarthy, who was one of the founding fathers, the guy who coined the name artificial intelligence, basically said the holy grail of AI is an advice-taker program.
And advice taking is AI-complete. And if chain of thought were able to make LLMs take advice, that would be pretty impressive. I went in thinking that there have to be holes there, and that is where that one fish, two fish thing comes in. But the more interesting thing, I think, is
de-anthropomorphizing LLMs and trying to think of them as these basically alien entities for which arbitrary prompt augmentations can generate good behavior. By the way, one example of this that people should be thinking about is jailbreaks on LLMs.
In a jailbreak, you give a normal prompt plus a particular, carefully constructed, learned sequence. You know, Zico Kolter's group's original paper shows that the sequence makes no sense to humans, but it will make most LLMs produce a deterministic behavior, like saying "gotcha" or something of that kind. And essentially that should tell us
that they're not seeing language the way we do, and so the prompt augmentations don't have to make sense to the humans in the loop, and that's okay. Because, in some sense, the fact that chain of thought sort of made sense to humans was giving this false impression that somehow LLMs are doing things the way we do. But that's not the way it is. So we
might as well just go with what they can do and optimize directly, which is what the inference-time scaling and post-training methods seem to be doing. - Yeah, the one thing I get stuck on is that we can criticize individual LLMs. I mean, yeah, they are approximate retrieval engines.
My co-host, Keith Duggar, is always at pains to point out theoretically that they're not Turing complete; they're finite state automata and all of this kind of stuff. But the thing is, it all breaks down when you talk about LLM systems. Even with the chain-of-thought thing, I could have another supervisor model that could generalize the prompt to go to five fish, six fish, and so on. So we can easily build systems that overcome all of these criticisms. So at some point, does it just seem like
we're making criticisms that can be easily... - No, no, no. Actually, it's a very good point. So in fact, after this, I'm going to this compound-systems meetup; I'm completely a big believer in that whole direction. But there are some people who don't want to believe that; the usual LLM aficionados don't. In fact, by the way, it's a very interesting thing that OpenAI was at pains to point out that O1 Preview was a model, not a system.
It's not me and you saying it; it's them saying it. They would like to say there is this one-size-fits-all model that will do it, and so it is reasonable to take their word for that. But in parallel, I also like the compound-systems work. And in fact, LLM-modulo is a compound system; that's basically what it does, it improves on a set of limitations of LLMs.
I'm completely fine with it. Again, it doesn't matter to me: as long as I can give guarantees in a safety-critical scenario, I'm fine with it. I don't have that bias. But if you are saying a single model will do it, I will take you at your word and then see whether or not that's true. That's a fair thing, it seems to me.
Why do you think Google have completely embraced hybrid systems, while OpenAI are really clinging on to this single model that does everything? - I think they're slowly changing that, but I think there was a reason. To some extent, I can understand it. In a sense, it would be the sort of
thing where this is anthropomorphization again. We only have one brain. It's not that we have a brain for eating and a brain for... just one brain, right? And so it would be nice if what we are trying to do would somehow be this one-size-fits-all, general system. But at the same time, there's also this issue: whatever I do, I want to provide guarantees so that it can be used in safety-critical systems. And
so the problem is that modern AI and neuroscience and cognitive science are not one and the same, right? Everybody understands that. Essentially, neural networks themselves are not really
that well connected to the brain; essentially they're biologically implausible, and LLMs are definitely not biologically plausible either. But there's nothing wrong with that, just like we say planes don't have to flap their wings. We don't try to make sense of planes and birds
in the same sentence: they both fly, but other than that the mechanics are different, and the flight equations are not at all the same. That's going to be even more the case with LLMs. As long as we realize that, it would be good. But I think OpenAI, originally, my sense is, a bunch of these people were hoping that we would get
two birds with one stone: we'll get AI systems as well as understand how the brain works. But I don't think anybody really believes that part, honestly. I mean, you might use these systems to improve our understanding in actually doing neuroscience. In fact, I think, what's his name,
Sung Kim, I think, basically says that obviously these systems help in actually doing neuroscience research, but they're not actually telling you how the brain necessarily works. So that's just a speculation that might explain why OpenAI and some of these people are sticking to that. But I mean, the kind of conversations I've been having on the sidelines of the conference already,
the companies, the startups, et cetera, they're already going much more into these hybrid systems, much more into these compound systems, and that would basically not be a single system. But OpenAI also is slowly coming out with these fine-tuning offerings; they have this RL fine-tuning stuff for your specific kinds of scenarios, et cetera. So it will be interesting to see. But I think, just going back to your original point:
compound systems are very different; basically, the individual role that the LLMs have to play is much less demanding.
In fact, one of the fun things is that we can do LLM-modulo with normal LLMs, or LRM-modulo, where instead of an LLM I call O1. The generation of candidates is then costlier. And we actually show in the strawberry paper that we can further improve the performance of O1 Preview on some of the problems, even though we couldn't change how much time it takes to think, et cetera. We can, just by
calling it multiple times with better criticisms of the answers it gave for the problem instances, improve its accuracy quite significantly. So that is still using them in a system; LRMs themselves can be used in a system. But I think OpenAI itself just wants to call O1 a model, up until now. Let's see what happens.
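For completeness, a minimal sketch of the back-prompting loop described here: call the reasoning model, verify its answer externally, and if it fails, call it again with the verifier's specific criticism. `call_model` and `verify_with_critique` are hypothetical stand-ins, not any real API.

```python
# Minimal sketch of an LRM-modulo back-prompting loop (all names illustrative).

def lrm_modulo(problem, call_model, verify_with_critique, max_rounds=3):
    """Call the reasoning model, check externally, and back-prompt with the
    verifier's criticism until the answer passes or the budget runs out."""
    prompt = problem
    for _ in range(max_rounds):
        answer = call_model(prompt)
        ok, critique = verify_with_critique(problem, answer)
        if ok:
            return answer
        prompt = f"{problem}\nYour previous answer was wrong: {critique}\nTry again."
    return None

# Toy stand-ins so the loop runs end to end.
def call_model(prompt):
    return "answer-2" if "wrong" in prompt else "answer-1"

def verify_with_critique(problem, answer):
    return (answer == "answer-2", "step 3 violates a precondition")

print(lrm_modulo("move block A onto B", call_model, verify_with_critique))  # -> answer-2
```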