
Francois Chollet - ARC reflections - NeurIPS 2024

2025/1/9

Machine Learning Street Talk (MLST)

People
François Chollet
Host
Podcast host and content creator focused on electric vehicles and the energy sector.
Topics
François Chollet: The results of this year's ARC competition show that simply scaling up model size and data cannot deliver AGI; it must be combined with System 2 reasoning capabilities such as program synthesis and test-time training. He sees human cognition as a blend of intuition and reasoning, and argues that deep learning models should be augmented with symbolic elements rather than replaced by them. He distinguishes two kinds of reasoning: memorizing and applying known patterns, and recombining cognitive building blocks when facing something new. For deep learning models, the crucial question is whether they can adapt to novelty, and he predicts that programming from input-output pairs will become widespread. He also envisions an entirely new architecture for lifelong distributed learning, in which AI instances solve problems in parallel and abstract new building blocks from those solutions.

He analyzes the two main approaches in the ARC competition, deep learning-guided program synthesis and test-time training, and compares their strengths and weaknesses. Test-time training improves generalization, but since its supervision is not entirely human-provided it still falls within the scope of the challenge. On how humans adapt to novelty, he suggests the mechanism is probably not gradient descent but something closer to function composition. Architectural improvements can overcome the limitations of Transformers, but autonomous adaptation to novelty would require a fully autonomous architecture search or generation mechanism. Reviewing the 2020 and 2024 ARC results, he points out that the benchmark has flaws and that parts of the test set can be cracked by brute force, and explains the goals of ARC-2 and how the evaluation methodology is being improved to avoid information leakage.

He digs into program induction versus LLM-based transduction, noting that they solve different sets of tasks: perceptual tasks suit transduction, while algorithmic tasks suit program induction. He recommends combining the two, trying induction first and falling back to transduction when it fails, and believes that using the same model for both kinds of ARC tasks yields better representations and acts as a regularizer. He describes Clément Bonnet's approach of learning a latent space of programs and searching it by gradient descent at test time, and suggests decoding latent programs back to symbolic form for local discrete search. He considers deep learning-guided program synthesis the more effective route and suggests treating programs as graphs of operators rather than flat token sequences. Humans, he notes, solve ARC puzzles by first building a model and then using that model to constrain the search space, reducing the need for search; likewise, large language models can be used to guide the search process rather than merely generate code.

On test-time compute, he notes a logarithmic relationship between compute and performance, so compute budgets must be taken into account when comparing systems; humans solve complex problems with remarkable efficiency, and future AGI will need to reach a similar level of energy efficiency. On programming languages, he argues that whatever the language, the system should be able to learn functions from data and turn them into reusable building blocks. He describes human cognition as an iterated process of fuzzy pattern recognition, with consciousness acting as the mechanism that keeps that process self-consistent; all System 2 processing involves consciousness, because explicit step-by-step reasoning requires it, while unconscious cognitive processes lack self-consistency. Finally, he discusses his future research direction: a focus on program synthesis, in particular planning-guided program synthesis, and the need to keep developing new benchmarks to drive progress toward AGI.

Host: Guides the discussion, poses questions, and summarizes and supplements Chollet's points.
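
Among the techniques summarized above, the latent-program-search idea (attributed in the discussion to Clément Bonnet) lends itself to a short sketch: learn a continuous latent space of programs, then at test time optimize a latent vector by gradient descent so that its decoded behaviour fits the task's demonstration pairs. The toy decoder, dimensions, and loop below are invented placeholders under those assumptions, not the actual method.

```python
# Rough sketch of latent program search: a frozen decoder maps a latent
# vector plus an input grid to an output grid; at test time we optimise the
# latent vector alone against the demonstration pairs. In a real system the
# decoder would be pre-trained on many tasks; here it is random and toy-sized.
import torch
import torch.nn as nn

LATENT_DIM, GRID_CELLS, NUM_COLORS = 16, 9, 10  # hypothetical sizes


class LatentProgramDecoder(nn.Module):
    """Maps (latent program, flattened input grid) -> per-cell color logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(LATENT_DIM + GRID_CELLS, 64), nn.ReLU(),
            nn.Linear(64, GRID_CELLS * NUM_COLORS),
        )

    def forward(self, z, x):
        h = torch.cat([z.expand(x.shape[0], -1), x], dim=-1)
        return self.net(h).view(-1, GRID_CELLS, NUM_COLORS)


def search_latent_program(decoder, demo_pairs, steps=100, lr=0.1):
    """Gradient-descend on the latent vector only; decoder weights stay fixed."""
    decoder.requires_grad_(False)
    xs = torch.stack([x for x, _ in demo_pairs]).float()
    ys = torch.stack([y for _, y in demo_pairs]).long()
    z = torch.zeros(1, LATENT_DIM, requires_grad=True)
    opt = torch.optim.Adam([z], lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _ in range(steps):
        opt.zero_grad()
        loss = loss_fn(decoder(z, xs).view(-1, NUM_COLORS), ys.view(-1))
        loss.backward()
        opt.step()
    return z.detach()  # candidate latent program for this one task


# Hypothetical task whose rule is the identity mapping.
demos = [(torch.randint(0, NUM_COLORS, (GRID_CELLS,)),) * 2 for _ in range(3)]
z_star = search_latent_program(LatentProgramDecoder(), demos)
```

The follow-on suggestion in the discussion, decoding the optimized latent back to symbolic form for local discrete search, would replace the last line with a decode-then-mutate step over explicit programs.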


Key Insights

What was the accuracy improvement in the ARC-AGI Prize competition in 2024?

The accuracy in the ARC-AGI Prize competition in 2024 rose from 33% to 55.5% on a private evaluation set.

What are the two main approaches that succeeded in the ARC-AGI Prize competition?

The two main successful approaches were deep learning-guided program synthesis and test-time training, where a model is fine-tuned on a task's demonstration pairs at inference time and then directly predicts the solution.

Why is test-time training considered a breakthrough in generalization power?

Test-time training allows models to fine-tune on demonstration pairs at inference time, unlocking higher generalization levels and enabling accuracy improvements from below 10% to over 55%.
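
As a rough illustration of the mechanism (not any competitor's actual pipeline), test-time training can be sketched as fine-tuning a copy of a model on the few demonstration pairs of a single task before predicting its test output. The tiny model, grid sizes, and identity-rule task below are invented for the example.

```python
# Minimal sketch of test-time training (TTT) on one ARC-like task.
# Competition entries fine-tune large models on augmented demonstration
# pairs; this toy MLP only illustrates the per-task adaptation loop.
import copy
import torch
import torch.nn as nn

GRID_CELLS, NUM_COLORS = 9, 10  # hypothetical 3x3 grids with 10 colors


class TinyGridModel(nn.Module):
    """Maps a flattened input grid to per-cell color logits."""
    def __init__(self):
        super().__init__()
        self.net = nn.Sequential(
            nn.Linear(GRID_CELLS, 128), nn.ReLU(),
            nn.Linear(128, GRID_CELLS * NUM_COLORS),
        )

    def forward(self, x):
        return self.net(x).view(-1, GRID_CELLS, NUM_COLORS)


def test_time_train(base_model, demo_pairs, steps=50, lr=1e-3):
    """Fine-tune a copy of the model on one task's demonstration pairs."""
    model = copy.deepcopy(base_model)  # never mutate the shared base model
    opt = torch.optim.Adam(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    xs = torch.stack([x for x, _ in demo_pairs]).float()
    ys = torch.stack([y for _, y in demo_pairs]).long()
    for _ in range(steps):
        opt.zero_grad()
        logits = model(xs)                                   # (B, cells, colors)
        loss = loss_fn(logits.view(-1, NUM_COLORS), ys.view(-1))
        loss.backward()
        opt.step()
    return model


# Hypothetical task: the output grid equals the input grid (identity rule).
demos = [(torch.randint(0, NUM_COLORS, (GRID_CELLS,)),) * 2 for _ in range(3)]
adapted = test_time_train(TinyGridModel(), demos)
test_input = torch.randint(0, NUM_COLORS, (GRID_CELLS,)).float()
prediction = adapted(test_input.unsqueeze(0)).argmax(-1)     # predicted colors
```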

What is the significance of the logarithmic relationship between compute and accuracy in AI benchmarks?

The logarithmic relationship indicates that while more compute can improve performance, better ideas provide significantly more leverage, as seen in solutions achieving 55% accuracy with $10 of compute versus $10,000.
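
A hypothetical sketch of what such a logarithmic relationship implies: if accuracy grows roughly linearly in the logarithm of spend, every 10x increase in compute buys the same fixed increment, so a method with a better curve beats a bigger budget. The coefficients below are made up for illustration and are not fitted to any ARC-AGI results.

```python
# Illustrative (invented) log-linear compute-to-accuracy curve:
# accuracy ≈ a + b * log10(compute_dollars). The values of a and b are
# arbitrary and do not reflect real leaderboard data.
import math

def predicted_accuracy(compute_dollars: float, a: float = 0.35, b: float = 0.07) -> float:
    return a + b * math.log10(compute_dollars)

for dollars in (10, 100, 1_000, 10_000):
    print(f"${dollars:>6}: ~{predicted_accuracy(dollars):.0%}")
# Each 10x increase in spend adds the same increment (b), which is why a
# better method (a higher a or b) matters more than buying more compute.
```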

What are the key differences between induction and transduction in ARC tasks?

Induction involves writing programs to map input to output grids, while transduction directly predicts output grids. Induction is formally verifiable, whereas transduction relies on guessing, making induction more reliable for generalization.
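
A toy contrast may make the distinction concrete: induction proposes a candidate program and verifies it against every demonstration pair before applying it to the test input, whereas transduction simply emits an output grid with no checkable intermediate artifact. The small grid task, candidate programs, and the stand-in transductive guesser below are all invented for illustration.

```python
# Toy contrast between induction and transduction on an ARC-like task.
# Grids are lists of lists of ints; the candidate programs are invented.
from typing import Callable, List, Optional

Grid = List[List[int]]
demo_pairs = [
    ([[1, 2], [3, 4]], [[3, 4], [1, 2]]),   # rule: flip rows vertically
    ([[5, 6], [7, 8]], [[7, 8], [5, 6]]),
]
test_input: Grid = [[0, 9], [9, 0]]

# --- Induction: search a tiny program space, keep what verifies ------------
candidate_programs: List[Callable[[Grid], Grid]] = [
    lambda g: g,                                  # identity
    lambda g: g[::-1],                            # vertical flip
    lambda g: [row[::-1] for row in g],           # horizontal flip
]

def induce(pairs) -> Optional[Callable[[Grid], Grid]]:
    """Return the first candidate program consistent with every demo pair."""
    for program in candidate_programs:
        if all(program(x) == y for x, y in pairs):
            return program          # formally verified against the demos
    return None

program = induce(demo_pairs)
if program is not None:
    print("induction:", program(test_input))      # [[9, 0], [0, 9]]

# --- Transduction: directly guess the output grid (nothing to verify) ------
def transductive_guess(pairs, x: Grid) -> Grid:
    # Stand-in for a neural model's direct prediction; here a hard-coded guess.
    return x[::-1]

print("transduction:", transductive_guess(demo_pairs, test_input))
```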

What is the role of consciousness in system 2 reasoning according to François Chollet?

Consciousness acts as a self-consistency mechanism in system 2 reasoning, ensuring that iterated pattern recognition remains consistent with past iterations, preventing divergence and hallucination.

What is François Chollet's view on the future of programming with AGI?

Chollet envisions a future where programming is democratized, and users can describe what they want to automate in natural language, with the computer generating the necessary programs iteratively through collaboration.

What is the main flaw in the ARC benchmark, and how is ARC-2 addressing it?

The ARC benchmark is flawed due to task redundancy and overfitting. ARC-2 addresses this by increasing task diversity, introducing semi-private test sets, and ensuring difficulty calibration across evaluation sets.

What is the role of deep learning-guided program synthesis in François Chollet's approach to AGI?

Deep learning-guided program synthesis combines intuition and pattern recognition with discrete reasoning, allowing models to iteratively construct symbolic programs through guided search, which Chollet believes is closer to how humans reason.
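
A minimal sketch of that guided-search idea, under invented assumptions: a scoring function (standing in for a neural network) ranks which DSL operation to try next, and the search keeps the first partial program that reproduces all demonstration pairs. The three-operation DSL and the hand-set priors below are placeholders, not a real synthesis system.

```python
# Sketch of deep learning-guided program synthesis: a (stubbed) scorer ranks
# which DSL operation to try next, and the search returns the first program
# consistent with all demonstration pairs. DSL and scorer are invented.
from typing import Callable, Dict, List, Optional, Tuple

Grid = List[List[int]]

DSL: Dict[str, Callable[[Grid], Grid]] = {
    "flip_v": lambda g: g[::-1],                       # flip rows
    "flip_h": lambda g: [row[::-1] for row in g],      # flip columns
    "transpose": lambda g: [list(r) for r in zip(*g)],
}

def score_ops(partial_program: List[str]) -> List[Tuple[float, str]]:
    """Stand-in for a neural policy: rank DSL ops for the next step.
    A real system would condition on the task grids and the partial program."""
    prior = {"flip_v": 0.5, "transpose": 0.3, "flip_h": 0.2}   # invented priors
    return sorted(((prior[op], op) for op in DSL), reverse=True)

def run(program: List[str], grid: Grid) -> Grid:
    for op in program:
        grid = DSL[op](grid)
    return grid

def guided_search(pairs, max_len: int = 2) -> Optional[List[str]]:
    """Breadth-limited enumeration whose expansion order follows the scorer."""
    frontier: List[List[str]] = [[]]
    for _ in range(max_len):
        next_frontier: List[List[str]] = []
        for prog in frontier:
            for _, op in score_ops(prog):           # model-suggested order
                candidate = prog + [op]
                if all(run(candidate, x) == y for x, y in pairs):
                    return candidate                # verified on the demos
                next_frontier.append(candidate)
        frontier = next_frontier
    return None

demos = [([[1, 2], [3, 4]], [[2, 1], [4, 3]])]      # rule: flip columns
print(guided_search(demos))                          # -> ['flip_h']
```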

How does François Chollet define reasoning in the context of AI?

Chollet defines reasoning in two ways: applying memorized patterns (e.g., algorithms) and recomposing cognitive building blocks to solve novel problems. The latter, which involves adapting to novelty, is more critical for AI advancement.

Shownotes

François Chollet discusses the outcomes of the ARC-AGI (Abstraction and Reasoning Corpus) Prize competition in 2024, where accuracy rose from 33% to 55.5% on a private evaluation set.

SPONSOR MESSAGES:


CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments.

https://centml.ai/pricing/

Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier focussed on o-series style reasoning and AGI. Are you interested in working on reasoning, or getting involved in their events?

They are hosting an event in Zurich on January 9th with the ARChitects; join if you can.

Go to https://tufalabs.ai/


Read about the recent result on o3 with ARC here (Chollet knew about it at the time of the interview but wasn't allowed to say):

https://arcprize.org/blog/oai-o3-pub-breakthrough

TOC:

  1. Introduction and Opening

[00:00:00] 1.1 Deep Learning vs. Symbolic Reasoning: François’s Long-Standing Hybrid View

[00:00:48] 1.2 “Why Do They Call You a Symbolist?” – Addressing Misconceptions

[00:01:31] 1.3 Defining Reasoning

  3. ARC Competition 2024 Results and Evolution

[00:07:26] 3.1 ARC Prize 2024: Reflecting on the Narrative Shift Toward System 2

[00:10:29] 3.2 Comparing Private Leaderboard vs. Public Leaderboard Solutions

[00:13:17] 3.3 Two Winning Approaches: Deep Learning–Guided Program Synthesis and Test-Time Training

  4. Transduction vs. Induction in ARC

[00:16:04] 4.1 Test-Time Training, Overfitting Concerns, and Developer-Aware Generalization

[00:19:35] 4.2 Gradient Descent Adaptation vs. Discrete Program Search

  5. ARC-2 Development and Future Directions

[00:23:51] 5.1 Ensemble Methods, Benchmark Flaws, and the Need for ARC-2

[00:25:35] 5.2 Human-Level Performance Metrics and Private Test Sets

[00:29:44] 5.3 Task Diversity, Redundancy Issues, and Expanded Evaluation Methodology

  6. Program Synthesis Approaches

[00:30:18] 6.1 Induction vs. Transduction

[00:32:11] 6.2 Challenges of Writing Algorithms for Perceptual vs. Algorithmic Tasks

[00:34:23] 6.3 Combining Induction and Transduction

[00:37:05] 6.4 Multi-View Insight and Overfitting Regulation

  7. Latent Space and Graph-Based Synthesis

[00:38:17] 7.1 Clément Bonnet’s Latent Program Search Approach

[00:40:10] 7.2 Decoding to Symbolic Form and Local Discrete Search

[00:41:15] 7.3 Graph of Operators vs. Token-by-Token Code Generation

[00:45:50] 7.4 Iterative Program Graph Modifications and Reusable Functions

  8. Compute Efficiency and Lifelong Learning

[00:48:05] 8.1 Symbolic Process for Architecture Generation

[00:50:33] 8.2 Logarithmic Relationship of Compute and Accuracy

[00:52:20] 8.3 Learning New Building Blocks for Future Tasks

  9. AI Reasoning and Future Development

[00:53:15] 9.1 Consciousness as a Self-Consistency Mechanism in Iterative Reasoning

[00:56:30] 9.2 Reconciling Symbolic and Connectionist Views

[01:00:13] 9.3 System 2 Reasoning - Awareness and Consistency

[01:03:05] 9.4 Novel Problem Solving, Abstraction, and Reusability

  10. Program Synthesis and Research Lab

[01:05:53] 10.1 François Leaving Google to Focus on Program Synthesis

[01:09:55] 10.2 Democratizing Programming and Natural Language Instruction

  11. Frontier Models and o1 Architecture

[01:14:38] 11.1 Search-Based Chain of Thought vs. Standard Forward Pass

[01:16:55] 11.2 o1’s Natural Language Program Generation and Test-Time Compute Scaling

[01:19:35] 11.3 Logarithmic Gains with Deeper Search

  12. ARC Evaluation and Human Intelligence

[01:22:55] 12.1 LLMs as Guessing Machines and Agent Reliability Issues

[01:25:02] 12.2 ARC-2 Human Testing and Correlation with g-Factor

[01:26:16] 12.3 Closing Remarks and Future Directions

SHOWNOTES PDF:

https://www.dropbox.com/scl/fi/ujaai0ewpdnsosc5mc30k/CholletNeurips.pdf?rlkey=s68dp432vefpj2z0dp5wmzqz6&st=hazphyx5&dl=0