We recently released *Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?*, a major update to our previous paper/blogpost, evaluating a broader range of models (e.g. helpful-only Claude 3.5 Sonnet) in more diverse and realistic settings (e.g. untrusted monitoring).
**Abstract**
An AI control protocol is a plan for usefully deploying AI systems that prevents an AI from intentionally causing some unacceptable outcome. Previous work evaluated protocols by subverting them using an AI following a human-written strategy. This paper investigates how well AI systems can generate and act on their own strategies for subverting control protocols whilst operating statelessly (i.e., without shared memory between contexts). To do this, an AI system may need to reliably generate effective strategies in each context, take actions with well-calibrated probabilities, and coordinate plans with other instances of itself without communicating.
We develop Subversion Strategy Eval, a suite of eight environments [...]
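To make the stateless setting concrete, here is a minimal sketch (illustrative only, not from the paper; the function name and parameters are hypothetical) of acting with a well-calibrated probability. Because instances share no memory, a strategy of acting in roughly k out of N contexts cannot be tracked with a counter; each instance must independently sample its action with probability k/N.

```python
import random

def stateless_decision(num_contexts: int, target_actions: int, seed=None) -> bool:
    """Decide, within a single context and with no shared memory,
    whether to take a rare action.

    To hit roughly target_actions across num_contexts independent
    instances, each instance acts with probability
    target_actions / num_contexts.
    """
    rng = random.Random(seed)
    p = target_actions / num_contexts
    return rng.random() < p

# Example: 1000 isolated instances, each aiming for ~10 actions in total.
decisions = [stateless_decision(1000, 10) for _ in range(1000)]
print(sum(decisions))  # typically close to 10
```

A miscalibrated probability here fails in both directions: too high and the protocol catches the frequent attempts, too low and the AI never acts at all, which is part of what the eval measures.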
Outline:
(00:38) Abstract
(02:13) Paper overview
(03:55) Results summary
(05:31) Qualitative analysis
(08:50) How do these results inform risk?
(11:08) How should this eval be used with more capable models?
(11:13) Tuning conservatism
(14:19) Difficulties with attaining a conservative measurement
(17:41) Reducing noise
First published: March 24th, 2025
---
Narrated by TYPE III AUDIO.
Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.