
ARC Prize v2 Launch! (Francois Chollet and Mike Knoop)

2025/3/24

Machine Learning Street Talk (MLST)

Chapters
ARC v2, the latest version of the benchmark, is designed to challenge frontier AI reasoning systems. Unlike its predecessor, ARC v2 is calibrated against human performance, ensuring that every task is solvable by humans yet remains extremely difficult for current AI models. The ARC Prize 2025 contest uses this benchmark.
  • ARC v2 launch and benchmark architecture
  • Human-AI capability analysis
  • OpenAI's initial performance results
  • ARC Prize 2025 contest details

Shownotes

We are joined by Francois Chollet and Mike Knoop to launch the new version of the ARC Prize! In version 2, the challenges have been calibrated against human testers so that at least two humans could solve each task in a reasonable amount of time, while also being adversarially selected so that frontier reasoning models can't solve them. The best LLMs today achieve negligible performance on this challenge.

https://arcprize.org/
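
For anyone who wants to experiment with the tasks directly, here is a minimal Python sketch of the task format used by the public ARC-AGI repositories: each task is a JSON object with "train" and "test" lists of input/output grid pairs, and scoring is exact match over the predicted output grid. The file path below is illustrative.

```python
import json

def load_task(path):
    # One ARC task: {"train": [...], "test": [...]}, where each item is
    # {"input": grid, "output": grid} and a grid is a list of rows of
    # integers 0-9 (colors).
    with open(path) as f:
        return json.load(f)

def is_correct(predicted, expected):
    # ARC scoring is exact match: every cell of the output grid must agree.
    return predicted == expected

# Illustrative path; actual task files live in the public ARC-AGI repos.
task = load_task("data/training/example_task.json")
for pair in task["train"]:
    demo_input, demo_output = pair["input"], pair["output"]
    # A solver infers the transformation from these demonstration pairs,
    # then applies it to each grid in task["test"].
```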

SPONSOR MESSAGES:


Tufa AI Labs is a brand-new research lab in Zurich started by Benjamin Crouzier, focused on o-series-style reasoning and AGI. They are hiring a Chief Engineer and ML engineers, and host events in Zurich.

Go to https://tufalabs.ai/


TRANSCRIPT:

https://www.dropbox.com/scl/fi/0v9o8xcpppdwnkntj59oi/ARCv2.pdf?rlkey=luqb6f141976vra6zdtptv5uj&dl=0

TOC:

  1. ARC v2 Core Design & Objectives

    [00:00:00] 1.1 ARC v2 Launch and Benchmark Architecture

    [00:03:16] 1.2 Test-Time Optimization and AGI Assessment

    [00:06:24] 1.3 Human-AI Capability Analysis

    [00:13:02] 1.4 OpenAI o3 Initial Performance Results

  2. ARC Technical Evolution

    [00:17:20] 2.1 ARC-v1 to ARC-v2 Design Improvements

    [00:21:12] 2.2 Human Validation Methodology

    [00:26:05] 2.3 Task Design and Gaming Prevention

    [00:29:11] 2.4 Intelligence Measurement Framework

  3. O3 Performance & Future Challenges

    [00:38:50] 3.1 O3 Comprehensive Performance Analysis

    [00:43:40] 3.2 System Limitations and Failure Modes

    [00:49:30] 3.3 Program Synthesis Applications

    [00:53:00] 3.4 Future Development Roadmap

REFS:

[00:00:15] On the Measure of Intelligence, François Chollet

https://arxiv.org/abs/1911.01547

[00:06:45] ARC Prize Foundation, François Chollet, Mike Knoop

https://arcprize.org/

[00:12:50] OpenAI o3 model performance on ARC v1, ARC Prize Team

https://arcprize.org/blog/oai-o3-pub-breakthrough

[00:18:30] Chain-of-Thought Prompting Elicits Reasoning in Large Language Models, Jason Wei et al.

https://arxiv.org/abs/2201.11903

[00:21:45] ARC-v2 benchmark tasks, Mike Knoop

https://arcprize.org/blog/introducing-arc-agi-public-leaderboard

[00:26:05] ARC Prize 2024: Technical Report, François Chollet et al.

https://arxiv.org/html/2412.04604v2

[00:32:45] ARC Prize 2024: Technical Report, François Chollet, Mike Knoop, Gregory Kamradt

https://arxiv.org/abs/2412.04604

[00:48:55] The Bitter Lesson, Rich Sutton

http://www.incompleteideas.net/IncIdeas/BitterLesson.html

[00:53:30] Decoding strategies in neural text generation, Sina Zarrieß

https://www.mdpi.com/2078-2489/12/9/355/pdf