Many people helped us a great deal in developing the questions and ideas in this post, including people at CHAI, MATS, various other places in Berkeley, and Aether. To all of them: Thank you very much! Any mistakes are our own.

Foundation model agents - systems like AutoGPT and Devin that equip foundation models with planning, memory, tool use, and other affordances to perform autonomous tasks - seem to have immense implications for AI capabilities and safety. As such, I (Rohan) am planning to do foundation model agent safety research. Following the spirit of an earlier post I wrote, I thought it would be fun and valuable to write as many interesting questions as I could about foundation model agent safety. I shared these questions with my collaborators, and Govind wrote a bunch more questions that he is interested in. This post includes questions from both of us. [...]
Outline:
(01:14) Rohan
(01:28) Basics and Current Status
(03:16) Chain-of-Thought (CoT) Interpretability
(08:02) Goals
(10:18) Forecasting (Technical and Sociological)
(16:43) Broad Conceptual Safety Questions
(21:50) Miscellaneous
(25:21) Govind
(25:24) OpenAI o1 and other RL CoT Agents
(26:30) Linguistic Drift, Neuralese, and Steganography
(27:32) Agentic Performance
(28:57) Forecasting
First published: October 28th, 2024
---
Narrated by TYPE III AUDIO.