We recently released *Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?*, a major update to our previous paper/blogpost, evaluating a broader range of models (e.g. helpful-only Claude 3.5 Sonnet) in more diverse and realistic settings (e.g. untrusted monitoring).
**Abstract**
An AI control protocol is a plan for usefully deploying AI systems that prevents an AI from intentionally causing some unacceptable outcome. Previous work evaluated protocols by subverting them using an AI following a human-written strategy. This paper investigates how well AI systems can generate and act on their own strategies for subverting control protocols whilst operating statelessly (i.e., without shared memory between contexts). To do this, an AI system may need to reliably generate effective strategies in each context, take actions with well-calibrated probabilities, and coordinate plans with other instances of itself without communicating.
We develop Subversion Strategy Eval, a suite of eight environments [...]
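To make the stateless setting concrete, here is a minimal sketch (illustrative only, not from the paper; the function name and parameters are hypothetical) of acting with a well-calibrated probability. Because instances share no memory, a strategy of acting in roughly k out of N contexts cannot be tracked with a counter; each instance must independently sample its action with probability k/N.

```python
import random

def stateless_decision(num_contexts: int, target_actions: int, seed=None) -> bool:
    """Decide, within a single context and with no shared memory,
    whether to take a rare action.

    To hit roughly target_actions across num_contexts independent
    instances, each instance acts with probability
    target_actions / num_contexts.
    """
    rng = random.Random(seed)
    p = target_actions / num_contexts
    return rng.random() < p

# Example: 1000 isolated instances, each aiming for ~10 actions in total.
decisions = [stateless_decision(1000, 10) for _ in range(1000)]
print(sum(decisions))  # typically close to 10
```

A miscalibrated probability here fails in both directions: too high and the protocol catches the frequent attempts, too low and the AI never acts at all, which is part of what the eval measures.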
Outline:
(00:38) Abstract
(02:13) Paper overview
(03:55) Results summary
(05:31) Qualitative analysis
(08:50) How do these results inform risk?
(11:08) How should this eval be used with more capable models?
(11:13) Tuning conservatism
(14:19) Difficulties with attaining a conservative measurement
(17:41) Reducing noise
First published: March 24th, 2025
---
Narrated by TYPE III AUDIO.
Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.