The ARC-AGI benchmark, the Abstraction and Reasoning Corpus, is a program synthesis benchmark whose goal is to assess how AI systems can adapt to novelty at test time. If we knew the kind of distribution of tasks that are in the test set and you fine-tuned the LLM on it, you would solve it. The thing with transduction tasks is that there is no point in latent space that would correspond to your transductive task, because by definition there's no induction necessary there.
They're exponentially not creative, so you would need exponentially more samples to make them creative. When I solve a task, I have some intuition of what the task might look like in a tenth of a second. I see the structures, I see regularities very quickly. And then I realize, oh no, this is not as easy as that. My intuition of what is happening in large language models is that they have these very intricate, very spurious, highly dimensional vector functions that can be recombined in some narrow way. I'm a big believer in trying to find small representations that could explain your outputs. Now, how do you build these spaces? How do you search through them? So my real name is Clement, Clement Bonnet. I'll go by Clem.
So whatever, Clement, Clem, Bonnet, whatever is easier for you. Clem, welcome to MLST. It's so nice to have you here. Thank you so much for having me. So I've just been reading your paper, and Chollet told me yesterday that it's one of his favorite approaches to the ARC Challenge. Can you tell the audience about the approach? The ARC-AGI benchmark, the Abstraction and Reasoning Corpus, is a program synthesis benchmark whose goal is to assess how AI systems can adapt to novelty at test time. So it's a very interesting benchmark, because pre-trained LLMs perform very poorly on it: the tasks that are seen at test time are very different from their training sets. So the whole assumption we've had with this architecture that I'm going to introduce is to build search into the architecture, such that you're going to do test-time search.
And so the way we do it is that we embed programs into a latent space that we train such that it is good to search inside.
So Tufa Labs is a new AI research lab I'm starting in Zurich. It is funded by PASS Ventures, involved in AI as well. And so we are a Swiss version of DeepSeek: a small group of people, very, very motivated, very hardworking. And we try to do some AI research, starting with LLM and o1-style models. What we're looking for now is a chief scientist and also research engineers. You can check out positions at tufalabs.ai.
There have been many approaches to the ARC Challenge. Even going back to the days of DreamCoder, there's Kevin Ellis and he had this DSL. Well, maybe even before that, we could just search a DSL, which takes exponential time. Ellis had a neurally guided search where you train a neural network. And then these days, everyone is embracing this kind of Greenblatt approach where you just make an LLM generate programs. And yours is completely different to all of the other types of test-time adaptation. You're like...
you're embedding the programs in a latent space and then you're searching that latent space, and you're not actually creating programs, you're generating the solutions directly. - Yes, exactly. So really the assumption we had was that we want to do search at test time and we want to do search in a program space. And to do that efficiently, we try to embed, you know, combinatorially hard programs into a continuous latent space that we can then search using any search method that we want. And so by learning the manifold of programs in a single coherent latent space, we allow efficient test-time adaptation. So there have been a lot of methods. The top-performing methods are using test-time training,
which is this parameter-efficient fine-tuning given a new test input: you search in your parameter space for parameters that would better perform on your test input. And so that's a very effective way to do this kind of recombination and synthesis at test time. We argue that it's a very inefficient way to do so,
because the parameter space is huge and it is not obvious that you can recombine low-level primitives in a compositional way.
So, I mean, yeah, it is to be seen whether these approaches can solve ARC-AGI. But we believe that by embedding programs into a more coherent and compressed representation, we can do easier, more efficient search. It might be worth meditating just for a second on why the ARC challenge is so impervious to neural networks. I mean, Chollet wanted it to be that way. Why is that? Oh, Chollet designed it to be...
robust to memorization. And why it is robust to memorization is because the input-output tasks are so different from whatever has been seen in a training distribution. It is quite guaranteed that they don't exist anywhere on the internet. It's a private test set that is, you know, kept completely hidden.
And therefore it is argued that the task similarity is not high enough for LLMs to kind of generalize few-shot. I'll just push on a tiny bit though. So many of the sentences that we've just spoken in this conversation don't exist on the internet either. So what's the difference between that and an ARC program?
So I believe our discussion can be, you know, embedded into a rather small latent space that the LLMs have learned to actually emulate and kind of compose, in a way. The ARC-AGI tasks are truly novel. They're only based on the core human knowledge priors, but they compose these priors in arbitrary ways that I believe are not anywhere on the internet and that the pre-trained large-scale models cannot make sense of. Otherwise you would get zero-shot generalization, which we don't. This is an interesting thing just to linger on.
Are you saying that the reason why neural networks can't learn ARC is simply because there isn't loads of ARC-like data on the internet? Because I think at some point people might have thought there's something intrinsic about ARC, you know, because it's a discrete
program, you know, distribution or whatever, that neural networks can't learn it. But you're saying they can learn it, it's just that there's no data out there. So if we knew the kind of distribution of tasks that are in the test set and you would, you know, fine-tune the LLM on it, if you were fine-tuning any transformer, any neural network architecture on the distribution that includes the test set data, you would solve it. That's machine learning training. But the reason why we can't solve these test tasks is because the test distribution is quite far away from the training distribution. So it's not able to do this extreme generalization of coming up with a really novel way of combining knowledge to perform the task.
MLST is sponsored by Sentinel, which is the compute platform specifically optimized for AI workloads. They support all of the latest open-source language models out of the box, like Llama, for example. You can pay on consumption, essentially, or you can have a model which is always working, or it can be freeze-dried when you're not using it.
All of the models that they deploy support the OpenAI API specification out of the box, which means it's just a one-line change in your application to switch over to Sentinel and start saving money and make your application go faster. This gets to the next interesting bit then. So I guess the reason why perhaps a few years ago if you asked me was it possible in principle at all to learn ARC, I might have said no because
Many people, even Kevin Ellis, for example, he loves the idea of generating programs and that's what Greenblatt did. We feel intuitively that we should generate a program because then when it runs on a computer, programs can do this kind of compositionality and various forms of generalization and so on. So intuitively, we think that if we generate the actual answer without going via program intermediary, then it won't be as good. But is that not the case?
So I'm assuming you're referring to doing transduction versus induction. So going through a compressed program representation, which would be induction-based, versus directly trying to predict the answer. Yeah, maybe we should define induction and transduction, because I think even before you get to transduction...
You could train a neural network just to generate the solution directly? Yes. And without even using transduction? That would be using transduction though, if you predict the output directly from reading the inputs. To me, I define transduction as generating a prediction directly from the data, rather than it being a statement about using the solution space.
For example, Jonas Hübotter, his stuff is transduction because given some input examples, he does some retrieval and then he fine-tunes the model and then he does a prediction. So he's building a new model for every single datum that is received.
but we could build a massive neural network which directly predicts the solution space, which is still an inductive model. - Yeah, so I mean, that's where the fine line between induction and transduction, I think, comes in. Also, you have the kernel trick, which turns any induction method into a transduction method.
I don't know if the distinction between induction and transduction actually matters. I think what matters to me is whether you compress representations. So when you have an inductive method that comes up with, whether it's a Python program or maybe some kind of latent program, a small representation that is supposed to emulate your output, I would call this an induction method, because there is this kind of representation compression bottleneck.
And if you're not trying to do that, if you're trying to do some kind of shallow recombination of just transductive test time training, then I would put this in the group of transductive methods. And for me, it's all about the compression.
because compression allows for a more efficient search. In our architecture, we build this very compressed representation of programs in which we can effectively search. And I believe when you have methods that generate Python programs, these are very similar: methods of generating a very compressed, explicit program in Python that can generate an output. I was going to ask you why a kernel method is not...
The goal of linear regression is to come up with the linear function that best explains your data. And then once you have this function, you can use it for inference on a new input. When you solve this linear regression of the shape y = ax, you have a solution of the form X transpose X inverse times X transpose y, something like that. And so you can actually write the output as a pure function of the training data and your new input by just multiplying matrices. So I think that's one representation of the kernel trick. It's just that, although you have this very low-dimensional representation of the function, which here is a linear function, you can actually rewrite the whole thing as one computation. You could, but isn't the kernel matrix, you know, the training data, the inner product of the data basically, giving you this positive semi-definite matrix? So is that not the inductive function? Yes.
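[For reference, the standard identity being gestured at here, written in generic textbook notation rather than anything from the conversation or the paper: X is the n-by-d matrix of training inputs, y the vector of training outputs, x_* a new input, and lambda a small ridge term added for invertibility.]

Inductive view (fit an explicit weight vector, then reuse it):

\hat{a} = (X^\top X + \lambda I)^{-1} X^\top y, \qquad \hat{y}_* = x_*^\top \hat{a}

Transductive / kernel view (the same prediction written purely in terms of the training data and the new input, using the identity (X^\top X + \lambda I)^{-1} X^\top = X^\top (X X^\top + \lambda I)^{-1}):

\hat{y}_* = x_*^\top X^\top (X X^\top + \lambda I)^{-1} y = k_*^\top (K + \lambda I)^{-1} y, \qquad K_{ij} = x_i^\top x_j, \quad (k_*)_i = x_i^\top x_*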
So essentially it's like saying y equals ax, and then if you write it out and unroll whatever computational graph you have, the whole thing is just one computation. So I think that's where the equivalence between transduction and induction comes in, to me at least, how I see it. So that's why I don't think it's a very useful distinction here. I don't know. I really like the distinction. To me, it's about creating a model from the data rather than reusing an existing model. So for example, we could make your model transductive
if we fine-tuned the encoder based on some augmentations of the data. But I would argue your system is not transductive now, because we're doing some kind of gradient optimization through the model but we're not changing the model, therefore it's inductive. - Yeah, we're not changing the model, that's the whole thing. Although I do think that our Latent Program Network, LPN,
actually fits in a spectrum of test-time training methods. If you see the latent space as an input-conditioned parameter space, in a way, we are searching through this space using zero-order or first-order optimization methods to find a better explanation for your data, and it's quite similar to searching through your parameter space, doing parameter fine-tuning
at test time. So I do think they fit in the same spectrum. I'm curious to explore methods where this adaptation at test time would be as efficient as possible if we want to create systems that adapt online very efficiently.
So I really like the test-time training idea. I do think that it's probably a very suboptimal method, searching in this vast parameter space. So can you talk me through the architecture? So as I said, we want to embed programs into a latent space. To do so, we have first an encoder that takes input-output pairs independently and then embeds them into a latent program space, similar to a VAE architecture actually, a variational autoencoder.
But if you think about it, an input-output pair can actually be explained by an infinity of programs, right, of varying description length or complexity. So we would want to find a minimum-description-length program. But because you can explain an input-output pair with an infinite number of programs, we use a variational framework. So we encode these programs into a distribution, actually, of latent programs. So we embed these input-output pairs into a distribution over programs.
And then we have this search component in the middle, which I'm going to explain right after, which is all about refining this latent to be a bit better. And then once you have a better latent, you try to decode it with your decoder.
And so your decoder is actually going to take a new input, it's going to take this latent program, and then generate an output for it. So it's literally applying a program. I'm saying programs here, but I actually mean latent vectors; we're all in this geometric space here. And so the true novelty, I think, of this architecture is actually this middle stage here and how it works. We expect the encoder to actually output a first guess of what the program looks like, a bit like human intuition when solving an ARC task: oh, I do believe there's something about shapes here, and we should move them around. So that's the first guess, and then we use optimization methods to refine this latent, to find
actually a different point in your latent space, in your latent program space, that would better explain your data. So you have this kind of inner loop, using the decoder and the input-output pairs, so your specification, to refine and find a better latent in your latent space that, for each of the input-output pairs, would generate the right output.
And once you find this further-away point in latent space that better explains these input-output pairs, you're rather confident that you can apply it to your new test inputs. So I think that's where the induction part comes in, because we're really trying to find this small representation, this small latent program, that explains your whole task.
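[A rough sketch of the test-time procedure being described, in PyTorch-style pseudocode. The module interfaces (an encoder returning a Gaussian over latents, decoder.log_prob, decoder.generate) and the hyperparameters are illustrative assumptions, not the authors' actual code; the mean aggregation is discussed just below.]

import torch

def lpn_predict(encoder, decoder, spec_pairs, x_test, n_steps=20, lr=0.1):
    # spec_pairs: list of (input_grid, output_grid) tensors, the task specification.
    # x_test:     the new input grid we want an output for.

    # 1. Encode each input-output pair into a latent program distribution and
    #    average the means across pairs (the "first guess").
    latents = [encoder(x, y).mean for (x, y) in spec_pairs]
    z = torch.stack(latents).mean(dim=0).detach().clone().requires_grad_(True)

    # 2. Refine the latent by gradient ascent on the decoder's log-likelihood of
    #    reproducing the specification outputs (the inner search loop).
    opt = torch.optim.Adam([z], lr=lr)
    for _ in range(n_steps):
        loss = -sum(decoder.log_prob(y, x, z) for (x, y) in spec_pairs)
        opt.zero_grad()
        loss.backward()
        opt.step()

    # 3. Apply the refined latent program to the held-out test input.
    return decoder.generate(x_test, z.detach())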
ARC, let's say, typically comes with about three specification pairs, and when we say specification we just mean you get a few input-output example pairs. You can pass all of them through the encoder, and the encoder will give you three different points in the latent space. You average those together and then you perform some gradient steps to improve the latent to a point where you get a good solution for all of the three examples, something like that. Yes, yes, it's exactly that. So we must find a way to recombine these different latent distributions for these different input-output pairs. Ideally, each input-output pair, through the encoder, should generate a roughly similar latent distribution, because they actually correspond to the same inherent task.
And so the vanilla way of doing so is to actually compute a mean in latent space and then use that as a starting point for your search. I do believe there are better ways to do so. We could see the whole aggregation as a mixture of distributions and probably, I don't know, maybe sample with a small temperature or something. I do believe there are better ways to aggregate these kinds of distributions, but the first proof of concept we've done was this mean aggregation, which ends up working pretty well. And so the nice thing about this architecture is that it's trained end to end using a variational VAE loss, so the ELBO loss. This means it decomposes into two terms. The reconstruction loss: given a latent program that has been refined, try to reproduce your input-output pair, given all the other input-output pairs in your specification.
So you have a reconstruction loss, and then you have a prior loss: make sure your latent space looks roughly Gaussian. This is the classic prior loss from the VAE, and it actually encourages the latent space to be very structured.
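[In symbols, the per-pair training objective being described is the usual ELBO; the notation here is generic rather than lifted from the paper. For a pair (x_n, y_n), with q the latent distribution produced by the encoder from the task's other pairs (the leave-one-out setup discussed below) and beta a weighting on the KL term (1 in the vanilla VAE):]

\mathcal{L}_n = \underbrace{\mathbb{E}_{q(z)}\left[\log p_\theta(y_n \mid x_n, z)\right]}_{\text{reconstruction}} \; - \; \beta\, \underbrace{D_{\mathrm{KL}}\!\left(q(z)\,\|\,\mathcal{N}(0, I)\right)}_{\text{prior term}}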
We did try without the VAE framework, so the pure auto-encoding framework, with no prior on what the latent space should look like. And we end up with very unstructured, spiky spaces where most programs tend to be as far away as possible from each other, making search impossible and making the whole latent space rather useless. So I think the variational aspect of it is very important,
trying to keep a Gaussian compressed representation of your program space. Yeah, we should double click on that a little bit. So we want to have a well-formed space that doesn't degenerate and doesn't memorize solutions. And I think you also use the KL divergence. Just reading from your paper a little bit here. So you said that you want to prevent encoding the output directly in the latent space. So you encode an input-output pair to the latent space, but train this representation to decode the output of a different input-output
pair to prevent memorization. So that was one of the tricks that you used. Yeah, so actually it's not really a trick. With most of these latent space ideas, when you perform them on program synthesis benchmarks like ARC, there's usually this problem where you want to learn the compressed representation of whatever your input-output space is and then try to decode it. But the obvious thing is that you may actually end up just compressing your output into your latent space. You would kind of leak your output into that space, and so it's easier to reconstruct. It would be a shortcut that ML training would lead to, and it would be very useless, because you wouldn't learn anything that connects the input to the output. You would just learn to map the output to a latent space. And so the way we deal with this is that, during training, we use a very similar setup to the one we would have at test time, which is: we have access to n input-output pairs and we want to predict an (n+1)-th output. So actually during training, if we assume n input-output pairs, for each input-output pair we are going to embed into the latent space the n-1 other pairs to predict that pair,
and kind of do that for all the pairs in parallel. Very cool. Now, the other great trick you used is you actually do like gradient steps, so search steps during the training process because it actually makes the final kind of latent space more amenable to search at inference time. Was that something that you thought to do intuitively or you did it to help it because it wasn't working?
So we've always wanted to train using search. And the reason why we want that is that
If you think about it, during training your encoder, well, even at the end of training, your encoder is not perfect. The whole assumption is that it provides a best guess of what the task is. And so if you're not allowed to refine this guess, well, the decoder can only be as good as the guess from the encoder. You would end up with a latent space that cannot really encode many programs.
It would just encode kind of the regression to the encoder's best guess distribution. So what we end up doing is that we actually activate this search method during training. It brings significant overheads at training time. So we actually propose some kind of pre-training without it and then kind of fine tuning with it.
But it doesn't matter. The whole point is that, by switching it on during training, using some kind of random local search or gradient-ascent first-order optimization, we train the latent space to be good at being searched. So the best guess ends up being actually quite bad, but very close to a very good guess. It's the whole point behind meta-learning methods like MAML and so on.
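[Putting the last few points together, a schematic training step with the leave-one-out encoding and a short differentiable inner search switched on. Module interfaces, helpers, and hyperparameters are again illustrative assumptions, not the authors' implementation.]

import torch
from torch.distributions import Normal, kl_divergence

def lpn_train_step(encoder, decoder, pairs, optimizer,
                   inner_steps=2, inner_lr=0.1, beta=1.0):
    # pairs: list of (x, y) grids all generated by the same underlying program.
    total_loss = 0.0
    for n, (x_n, y_n) in enumerate(pairs):
        # Leave-one-out: encode every *other* pair, so the latent cannot simply
        # memorise y_n.
        others = [encoder(x, y) for i, (x, y) in enumerate(pairs) if i != n]
        q_z = Normal(torch.stack([q.mean for q in others]).mean(0),
                     torch.stack([q.stddev for q in others]).mean(0))
        z = q_z.rsample()  # reparameterised sample

        # Short inner search on the latent, using only the pairs it was
        # conditioned on, so the latent space is trained to be searchable
        # (MAML-style: gradients flow through the inner updates).
        for _ in range(inner_steps):
            obj = sum(decoder.log_prob(y, x, z).sum()
                      for i, (x, y) in enumerate(pairs) if i != n)
            grad = torch.autograd.grad(obj, z, create_graph=True)[0]
            z = z + inner_lr * grad

        # ELBO terms: reconstruct the held-out pair, keep the posterior near N(0, I).
        recon = -decoder.log_prob(y_n, x_n, z).sum()
        kl = kl_divergence(q_z, Normal(torch.zeros_like(q_z.mean),
                                       torch.ones_like(q_z.mean))).sum()
        total_loss = total_loss + recon + beta * kl

    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()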
Qualitatively, tell me about the latent space. So you've looked at it. Do you find that different directions correspond to different types of ARC problems or whatever? What do you see? So we haven't analyzed the latent space that much by comparing different ARC tasks yet. I think it's a very interesting thing to do. I think in the paper we do bring up a plot of a t-SNE projection of the different programs that are learned.
And during training, we definitely see this kind of blob of all the input-output pairs that are not really well understood, by either the decoder or the encoder; they end up in this kind of bag of unlearned programs, and programs that start being learned start to cluster and occupy the space. So we scaled this architecture for ARC very recently. We got some non-zero yet non-SOTA results. And I think it's interesting because although we don't get very, very high performance, we do get it without any priors, so without using any pre-trained LLMs, without looking at the evaluation set or whatever, by just training on the 400 training task distributions.
So getting something like 10% on the evaluation sets, which is in itself actually a harder distribution of tasks, was quite impressive, I think, from the first initial scaling curves we've had.
Can you explain that in a bit more detail? So you've got an encoder and a decoder and they're transformer models. And you've got the, is it the RE-ARC tasks from Michael Hodel? And what kind of transformers were they and what kind of training did you do? Yeah, so the LPN architecture assumes these encoder and decoder modules. And to train this architecture on ARC, we really used vanilla transformers from scratch to encode the input-output grids.
So really you could use any generative model, any sequence model.
We decided to use self-attention transformers that flatten the input-output grids into sequences of 900 values because these are all 30x30 grids. We used some kind of roughly smart 2D positional encoding to make sense of spatial structure, but that's roughly it. And then we trained rather small transformers, I think they're about 20 million parameters each,
roughly 40 million in total, on encoding and then autoregressively generating these input-output pairs. We didn't tune the architecture that much. We had to do something around the positional encoding to make sense of the grids. But otherwise, it's a very vanilla transformer, a pretty normal architecture. We really focused on trying to get
a good latent space of programs and how to efficiently search it during training and at test time. And we experimented with the Transformer architectures at first and we're like, okay, we're good to go. Let's just focus on what we think is interesting. This blows my mind because most people that use Transformers fine tune them.
they take ones that have been trained on internet data and they fine-tune them. And I guess the intuition for that is we're told it's very difficult to train transformers, and they only start doing good things when you train them with loads of data for a very long time. So you started from scratch. I think you said you didn't even train it to convergence, just with Michael Hodel's RE-ARC dataset. So hardly anything, and it still worked quite well. Yes. So the whole point for our training was to
learn to embed these 400 training tasks in your latent space. And if we do so in a rather structured way, then we hope that by interpolating this latent space during search at test time, for instance given a new input-output pair, a new task, sorry, then you would find programs that were not seen during training but represent a kind of interpolation, probably a superposition, of programs seen during training. Training a transformer on single input-output pairs from a single program is a very bad idea. There's not enough data for the transformer to make sense of, you know, positions, relative positions and so on.
So we use Michael Hodel's very useful library of RE-ARC transformations. The data Michael Hodel created is, for each of the 400 tasks, a distribution of input-output pairs following that program.
So the program is constant, so there's not a lot more information that we extract. Really, the program is constant. It's just that the input distribution is now a rather broad distribution instead of being a single input. And so this allows training transformers to make sense of 2D grids in a way that they would not when initialized from scratch. We do show some training curves
on RE-ARC. We actually didn't have enough compute to train until convergence, but we showed some increase of accuracy on the train set. We do believe there are some bottlenecks in the architecture that make learning very slow or that wouldn't converge to 100%. So yeah, we're figuring this out.
And in this RE-ARC dataset, so 400 tasks blown up with this dataset generation that Michael does, how many program examples do you have?
It's on the order of a hundred million data points. Oh, it's a lot then? It's quite a lot, yes. We do sample these programs a lot. I think transformers are really bad at making sense of random sequences of pixels when trained from scratch, of course. So that's why we really have to do this intense training of making sense of these input-output grids, which is something a bit silly, but we believe ARC-AGI is actually, as François Chollet said, a developer-aware benchmark. And so we really wanted to have no priors from anyone on the internet, so not to use any pre-trained LLMs, and really to train the architecture on the very little data that matters to the task. Isn't the RE-ARC dataset generator a prior though? Like a pretty big one... I mean, to me that would invalidate the developer-aware generalization. The priors contained in RE-ARC
are all about, when you actually have an input-output pair with two objects, well, you could have the same with three objects. So you have the prior of counting, for instance. I do believe in the tasks you also have these other priors of objectness and so on. For us, the priors contained in RE-ARC are used very much to teach transformers how to make sense of 2D grids, because the 2D grid space is huge: it's 10^900 for each grid. Obviously, most of these 10^900 grids don't make any sense; they're kind of white noise. We use this distribution to kind of instill some of the human priors into the architecture, which I believe is in line with the developer-aware generalization assumption. Yeah, we can debate that. How did it generalize to different grid sizes? I mean, of course you're using a transformer, so you just turn it into a sequence, right? But did you notice any weird overfitting on sequence size? No. The way we deal with the shapes is very naive. We actually predict the shape first.
So let's say you have a 10 by 11 grid. We actually predict it's going to be 10 rows, 11 columns. And then we have the 900 tokens corresponding to the padded 30 by 30 grid, most of which is actually padding and ignored by the transformer. So we end up with sequences of 902 token values per grid. That's how we predict shapes.
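[A minimal sketch of that shape-first flattening; the padding token and exact layout are guesses for illustration, not necessarily the paper's tokenisation.]

import numpy as np

MAX_SIZE = 30
PAD = 10  # ARC cell colours are 0-9, so 10 is assumed here as the padding token

def grid_to_tokens(grid: np.ndarray) -> np.ndarray:
    # Flatten an ARC grid into 2 shape tokens + 900 padded cell tokens (902 total).
    rows, cols = grid.shape
    padded = np.full((MAX_SIZE, MAX_SIZE), PAD, dtype=np.int64)
    padded[:rows, :cols] = grid
    return np.concatenate([[rows, cols], padded.reshape(-1)])

# e.g. a 10x11 grid becomes [10, 11, cell(0,0), ..., cell(29,29)], length 902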
Yeah, I think there are a lot of things to say about how to optimize this. We don't think it's an optimal way to do so, but it's probably enough for ARC, or enough given that you have a lot of data to train on. Yeah, I suppose this might be an example of why generating an explicit Python program might be better, because it could generalize to different grid sizes. Definitely. So Python programs, grounded programs, executed by an interpreter,
would make sense of all these priors that are not contained in a randomly initialized transformer. So yeah, I do believe real programs are also a very smart way to do induction, to compress your representations. One of the problems, of course, with program search is the combinatorial explosion of the search through your program space. And so we really try to compress that by learning the manifold of programs, which we hope would be easier to search. But obviously it comes with some caveats, and potentially you can't represent all programs in this latent space. So yeah, I think there are interesting ways to maybe try and think about combining both ideas.
But there are pros and cons for program search and latent program search, I believe. Yeah, I'm just thinking back to when I first interviewed Chollet years ago. He always said there are type 1 and type 2 problems, and type 2 problems are not interpolative. But we're taking it for granted that this is actually interpolative after all. So we have this latent space, and it's got all of these modes in it, and we do this test-time computation. And the more test-time computation we do, the better the result.
And at the moment, if I understand, you're using like a first order kind of like gradient search.
So first of all, talk me through that. So you do more search at inference time, you get better results. Yes, but just before that, I do believe type two problems would not work with our methods. So I do believe one of the big limitations of learning a continuous latent space is actually that you can't fit arbitrarily complex combinatorial problems, sorry, programs, in it. I think you can do some things like superposition of programs, but I don't think you can do composition of programs,
which is obviously what we're trying to look into as a follow-up work. So yeah, I do believe that you would not be able to get composition by searching through this latent space. I think you can get some composition, like maybe just a couple steps of composition, but at the end of the day it's a fixed continuous latent space, so it's unclear whether you can really compose objects with it.
Now, going back to the search... Well, just before we get there, you've really touched on an interesting point. This gets to the core of it. People like Gary Marcus, and what Chollet was intuiting, is the lack of compositionality. You're saying, "Okay, there is no compositionality in the latent space, but we could actually come up with some kind of
a compositional inference-time method that gives us compositionality. So if you think about it, we could have multiple search threads through the latent space, and we could have some system-two program that composes the things together, and we could get the compositionality but still use your method. Is that kind of what you're saying? I think that's a promising approach, a promising way to think about it. I do believe that single-thread search in this single latent space would never bring you composition.
But something like having multiple threads, as you said, or some kind of multiple things happening at the same time being synthesized thereafter, could be a way to get composition. So I think what we wanted to do is kind of a POC of how to fit as many programs as we can into this latent space. I think now it opens up lots of possibilities to extend this further. Then we have the matter of
If we were generating symbolic programs, it feels like we could compose them together. For example, we could learn some iterative steps or functions that just get composed together. Whereas in this case, we have solution predictions directly. So let's say right now...
I've got three solutions that I found that represent some kind of, you know, three different types of transformation that need to be composed together. So now we need to have a function that can somehow take those three solutions and combine them together. Yeah, so I do think you would need some kind of
an unrolled computational graph. So I don't think you can easily look at these three solutions and kind of synthesize them in one shot. Intuitively, it would be equivalent to actually having only one like better thread or something.
So if you want to try to explore different regions of your space and then try to combine them, maybe you would need to do multiple forwards or something. So maybe one thread gives you some kind of primitive program and then you apply the second thread on the output of the first primitive and so on. This way you would get a recursive unrolled computational graph.
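[One very simple, purely hypothetical instantiation of that idea: chaining latent programs by feeding each decoder output back in as the next input, giving a shallow fixed-depth composition. Nothing like this is in the paper; decoder.generate is the same assumed interface as in the earlier sketches.]

def apply_latent_chain(decoder, latents, x):
    # Apply a sequence of latent programs one after another, i.e. a small
    # unrolled computation graph rather than a single latent.
    out = x
    for z in latents:
        out = decoder.generate(out, z)
    return out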
I mean, yeah, I think it's a research question how to think about these things. Yeah, or how to train it, that would be difficult. Or maybe you could have an intermediate stage where the decoder could take in multiple inputs and it could learn to compose them together to a fixed depth, because, you know, like language,
it's not actually that compositional, right? We only use a limited depth; some languages don't use depth, but in English we have a depth of up to like three or four or five or something. So you might find that most of the, you know, colloquial ARC problems that we'd ever want to solve would have a fixed depth, and we could somehow encode that in a neural network. Yes, and actually the ARC solutions found using brute-force program search have a depth varying from one to usually something like 12 or 15, and sometimes up to 50. But for most tasks, actually, you can solve them using a rather small length. So yeah... Yeah. And we're pragmatic people. I love the theoretical arguments about compositionality and symbolic AI and so on. But in the real world, we don't actually need to have this infinite...
Going back to the search methods, our assumption is that there exists a different latent program in your latent space that is better than your first guess. The question is, how do you find that program?
So it's just a matter of, it's a question of search now. And we actually tried two things. We tried some kind of zero-order random search, you know, local search on your first program. And then we actually realized that the latent space itself that was learned during training ended up being very smooth. And when I say smooth, it's smooth with respect to the decoder likelihood of decoding the right outputs. So you want to maximize the likelihood of decoding the right outputs. This latent space ended up being very smooth, which allowed us to use first-order methods. But I do believe that potentially, with different tasks or if we trained for longer, you could have more irregularity and, you know, kind of local optima in this space,
which would maybe bring to the table methods like evolutionary strategies, zero order methods, or maybe a combination of the two. But for our use case, it worked very well to use gradient descent. I don't think it's the best method, but yeah. Okay. So at the moment, it's quite smooth. It's quite amenable to simple gradient ascent methods or whatever. That's all cool.
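[In symbols, the two search flavours just mentioned, in generic notation: the first-order update is gradient ascent on the decoder log-likelihood of the specification pairs, while a zero-order local search just keeps the best random perturbation.]

First-order: z_{t+1} = z_t + \eta \, \nabla_{z} \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i, z_t)

Zero-order: z_{t+1} = \arg\max_{z \in \{z_t\} \cup \{z_t + \epsilon_k\}_{k=1}^{K}} \sum_{i=1}^{n} \log p_\theta(y_i \mid x_i, z), \quad \epsilon_k \sim \mathcal{N}(0, \sigma^2 I)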
What about its resilience to, I don't know whether you tried putting noisy inputs in, or another thing is you could almost under-train the architecture but, you know, do more searching. Is there a relationship there that you've explored? So we haven't tried putting in noise and shooting ourselves in the foot. Maybe we should, but we have tried scanning over architecture size though.
So obviously if your transformers are too small, well, for sure they can't represent, they can't decode your output very well. So yeah, there's kind of a trade-off, or at least architecture size and latent space size go hand in hand. We did see that when the architecture size is too small, you can't even make sense of the sequences very well, and hence you can't decode your program very well, no matter how big your search space is. So when we train this architecture, there's not a lot of tuning to do, but we need to make sure that the decoder's capacity is high enough, and your latent space is big enough, such that there exists a point in latent space that corresponds to the right program, and, given that point, you have enough capacity to execute that program,
which is not the case when you use a Python program: there, you're sure your interpreter will execute it well, you just need to find it. So yeah, you have this second constraint here of making sure the decoder can learn that program. And we actually have preliminary results, which were not in the paper actually, showing that
it's not obvious that the decoder can actually learn all the tasks. There's also a paper mentioning this issue. So it boils down to what priors you fit into your architecture in terms of 2D spatial representation and so on. So yeah, it's a trade-off how much time you want to spend on optimizing your architecture to solve a specific benchmark versus how much you want to
improve the method itself, i.e. the search over latent programs. Very cool. So Kevin Ellis and Wen-Ding Li, they just released this paper comparing induction and transduction, and they had this ensemble approach where they kind of did both at the same time. They've got this beautiful Venn diagram that shows that sometimes induction with Python programs works really well, and sometimes solution space works really well. Have you done analysis on what types of ARC
you know, tasks work well and don't work well on your system. Yes, so we don't have a very strong analysis on this, but from what we've seen, most of the, you know, bag of programs that are not learned often correspond to transductive tasks, transduction tasks. So I think it's very much aligned with their paper.
It was a great paper highlighting this. Actually, when it came out, maybe a week or so before the ARC Prize deadline, we thought, oh, that's actually what's happening: we struggle to learn these transductive tasks. And yeah, I think there are obvious ways to kind of
use this ensembling technique also for our architecture or for other architectures. The thing with transduction tasks is that there is no point in latent space that would correspond to your transductive task, because by definition there's no induction necessary there. So it's all in the decoder.
And I think there are ways to help the decoder with that, essentially. We just try to remain as benchmark-agnostic as possible, trying not to overfit to ARC. But I think there are obvious ways to improve. Is there a story that you're kind of, like...
You're not a symbolist. I've always been a connectionist, and I'm kind of new to trying to improve connectionist architectures to tackle this maybe 1%, 10% of tasks that people are really into these days, like reasoning, search. I do believe in a future where probably most of your tasks can be tackled by, you know, kind of one-shot, deep-learning-based architectures like LLMs, but actually some of the hard, like last-mile hard, 1% of tasks
need something a bit different. So yeah, I'm a strong believer in deep learning and program synthesis kind of being merged to achieve that last percent of task. So you're not amenable to this school of thought that these things are just smarter than us, we just let it do the thing? I think we can't comprehend arbitrarily combinatorial sequences of symbols.
And so that's where connectionism comes in as very useful, because we try to learn manifolds of complexities. I think it can bring you very far, but it can only bring you so far.
And that's what I mean by: I think that deep learning architectures can solve something like 90%, 99% of the tasks, but there will always be some tasks that are probably discrete inherently, or tasks that require long-horizon planning and reasoning, things that we're trying to tackle with LLMs and other deep learning architectures nowadays, but very recently actually. And I do believe that other architectures could do this much more efficiently.
I think it's connected to creativity as well. I believe deep learning models are not highly creative.
I will not say they're not creative because, well, as you can see going back to ARC again, with solutions like Ryan Greenblatt's solution of sampling a lot of programs, you see these logarithmic scaling curves as you sample more and more programs. Well, this means that, if you plot the distribution of possible programs sampled by an LLM outputting code, you have some non-zero mass
with long tails on very creative and good solutions. And you only get these solutions if you sample a million times. So this way, I can't say that they're not creative because, well, if you sample high enough, they will create these things. But I do believe that they're exponentially not creative. So you would need exponentially more samples to make them creative.
I think that's where symbols can actually maybe help. By doing symbolic program synthesis, or what we call programs, maybe just sequences of symbols, we could fight this creativity issue by very easily going into a combinatorial space that is completely novel and that you can't really keep track of, that you can't amortize,
and, you know, fighting things like, well, going into things like computational irreducibility. So yeah, I do think program synthesis, and symbols, could help deep learning architectures become more creative. I don't know if that makes sense. No, it does. I'm very interested in creativity. I must admit, I think I might have actually changed my position on it, because I took a very almost human-chauvinist approach to creativity: that, you know, there's a creativity gap in LLMs, that we have access to a sort of
source of creativity that LLMs don't. And so many people I've interviewed in Europe are just telling me the opposite. In fact, Subbarao Kambhampati said the opposite in June at ICML. He said, no, no, no, these things are very creative. And part of that is because creativity isn't just about novelty. It's also about interestingness. It's about things that we find interesting given our cultural biases and interests and so on.
So they seem to capture a lot of that. And then, okay, you can say, well, we can Greenblatt it and, yeah, we have to sample 30,000 times to get something interesting. But is that so different to us? Like we're this huge collective intelligence and we're all coming up with ideas and most of them are rubbish. And every now and then we have this flash of inspiration.
And perhaps we're embedded and situated, we have access to all of this entropy and whatnot, but is it really so different? Yeah, I think there's something to say about collective intelligence. I haven't thought that much about collective intelligence in general, so I can't comment very well on that. But if you focus for a second on individual intelligence,
It seems obvious that you don't sample a million programs in your mind when trying to solve an arc task or when trying to come up with something creative. There is some kind of synthesis that is much more efficient happening in your brain.
And I think I'm interested in trying to get closer to this kind of synthesis. Now, I do believe there are some very interesting things to say about collective intelligence, which brings much more entropy and a lot more of the sampling that we talked about. So yeah, maybe both schools could come up with interesting breakthroughs. When you solve an ARC challenge, are you being creative? Well, the way I think about it, when I try to solve an ARC task,
So again, I have some intuition of what the task might look like in, you know, a tenth of a second. I see the structures, I see regularities very quickly. And then I realize, oh no, this is not as easy as that. And then I try to figure it out: I propose a few hypotheses, maybe two, three, four, and I refine them. And then the fifth is actually correct: I check it matches all the specification and I just apply it and I get it. So this is maybe on half of the tasks. On the other half, it takes me a bit more time, but that's the idea. I do believe we have this kind of introspection, a few hypotheses that we propose, and these hypotheses seem to be quite creative in the sense that they can be very different from each other, and I feel like they recombine priors in a way that is quite combinatorial and compositional.
I mean, I need to stay humble in how I solve the tasks, because many of them I actually struggle with. But I think it's this idea of coming up with a few hypotheses and testing them, but not coming up with a million hypotheses. So it's how do you cut down this search space in an efficient way? There's probably much more than that in human cognition, but I think getting this would be a good approximation at least. If we scaled your solution up ridiculously, right? So much more training data, much bigger model and so on,
and we trained it for longer, what do you think would happen? So I think it could scale quite a bit, especially if we scale the latent space. I do believe it would be very... So I think one limitation is how much you can actually search in that space. I would expect the space to be actually not that linear, not that interpolative, not that smooth.
And so if you end up scaling to more tasks and problems, yeah, I would expect this space to actually be harder to search. So it's going to be about how much search you can do during training. You would need to scale this up, probably, keeping everything else the same in this architecture. And so it would actually be quite costly, I think, to scale both the encoder-decoder and the latent space, and, sorry, the search in this latent space as well. So I think there are better ways we could iterate on this idea. I do believe coming up with a small representation of whatever task, whatever query you have, can be a very useful way to do synthesis, to adapt, to deal with epistemic uncertainty as well. I do not believe we can estimate epistemic uncertainty currently, but I think searching through this space could help resolve some uncertainty. That's very insightful. So you're saying, when we actually scale this up, the latent space might become less interpolative.
Could we do meta-optimization, or could we have a different type of latent space, or, just sort of spitballing here, some kind of topological space or some kind of graph representation, or what else could we do? I think it's not just about scaling the latent dimension of your space, which has been the case with large language models. I think what, from my intuition, is happening in large language models is that they have these very intricate, very spurious, highly dimensional vector functions that can be recombined in some narrow way. And it's very useful, but it's very local. You're trying to find the abstraction locally that is a little bit better.
And I think it doesn't give you all these things around composition and out-of-distribution generalization. So yeah, I'm a big believer in trying to find small representations that could explain your outputs. So for natural language, that would be a query and then the answer to the query.
Now, how do you build these spaces? How do you search through them? I think these are all open questions. I've seen lots of interest in this direction lately, so it's very exciting. Yeah, amazing. Well, congratulations on the work and your recognition from Chollet. A completely blank slate, unlike any of the other approaches. Very interesting indeed. So yeah, keep up the good work, and thank you for coming on. Thank you so much, Tim. Thank you for having me.