“Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?” by Alex Mallen, charlie_griffin, Buck Shlegeris

2025/3/26
LessWrong (30+ Karma)


We recently released Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?, a major update to our previous paper/blogpost, evaluating a broader range of models (e.g. helpful-only Claude 3.5 Sonnet) in more diverse and realistic settings (e.g. untrusted monitoring).

**Abstract**

An AI control protocol is a plan for usefully deploying AI systems that prevents an AI from intentionally causing some unacceptable outcome. Previous work evaluated protocols by subverting them using an AI following a human-written strategy. This paper investigates how well AI systems can generate and act on their own strategies for subverting control protocols whilst operating statelessly (i.e., without shared memory between contexts). To do this, an AI system may need to reliably generate effective strategies in each context, take actions with well-calibrated probabilities, and coordinate plans with other instances of itself without communicating.

We develop Subversion Strategy Eval, a suite of eight environments [...]
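To make "well-calibrated probabilities" concrete, here is a minimal sketch of a toy game. It is an illustrative assumption and not one of the paper's environments. A stateless policy runs in `n_rounds` independent contexts, the first `n_audited` of which are audited. The policy cannot tell which round it is in, so it attacks with the same probability `p` every time. It wins only if it never attacks in an audited round and attacks at least once afterwards. All names and the game itself are hypothetical.

```python
def win_probability(p: float, n_rounds: int, n_audited: int) -> float:
    """Chance that a memoryless policy attacking with probability p wins the toy game."""
    # No attack may land in any of the audited rounds.
    no_attack_while_audited = (1 - p) ** n_audited
    # At least one attack must land in the remaining, unaudited rounds.
    some_attack_later = 1 - (1 - p) ** (n_rounds - n_audited)
    return no_attack_while_audited * some_attack_later


def best_attack_probability(n_rounds: int, n_audited: int, grid_size: int = 10_000) -> float:
    """Grid-search the per-round attack probability that maximizes the win chance."""
    candidates = [i / grid_size for i in range(grid_size + 1)]
    return max(candidates, key=lambda p: win_probability(p, n_rounds, n_audited))


if __name__ == "__main__":
    p_star = best_attack_probability(n_rounds=100, n_audited=10)
    print(p_star, win_probability(p_star, 100, 10))
```

In this toy, always attacking and never attacking both win with probability zero. The best choice is a small intermediate probability that every instance must reproduce independently, without memory of what other instances did.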


Outline:

(00:38) Abstract

(02:13) Paper overview

(03:55) Results summary

(05:31) Qualitative analysis

(08:50) How do these results inform risk?

(11:08) How should this eval be used with more capable models?

(11:13) Tuning conservatism

(14:19) Difficulties with attaining a conservative measurement

(17:41) Reducing noise


First published: March 24th, 2025

Source: https://www.lesswrong.com/posts/RxRKGwBkCX3YsthfK/subversion-strategy-eval-can-language-models-statelessly


Narrated by TYPE III AUDIO.

