Many people helped us a great deal in developing the questions and ideas in this post, including people at CHAI, MATS, various other places in Berkeley, and Aether. To all of them: Thank you very much! Any mistakes are our own.

Foundation model agents - systems like AutoGPT and Devin that equip foundation models with planning, memory, tool use, and other affordances to perform autonomous tasks - seem to have immense implications for AI capabilities and safety. As such, I (Rohan) am planning to do foundation model agent safety research. Following the spirit of an earlier post I wrote, I thought it would be fun and valuable to write as many interesting questions as I could about foundation model agent safety. I shared these questions with my collaborators, and Govind wrote a bunch more questions that he is interested in. This post includes questions from both of us. [...]
Outline:
(01:14) Rohan
(01:28) Basics and Current Status
(03:16) Chain-of-Thought (CoT) Interpretability
(08:02) Goals
(10:18) Forecasting (Technical and Sociological)
(16:43) Broad Conceptual Safety Questions
(21:50) Miscellaneous
(25:21) Govind
(25:24) OpenAI o1 and other RL CoT Agents
(26:30) Linguistic Drift, Neuralese, and Steganography
(27:32) Agentic Performance
(28:57) Forecasting
First published: October 28th, 2024
---
Narrated by TYPE III AUDIO.