
“Measuring whether AIs can statelessly strategize to subvert security measures” by Alex Mallen, Buck

2024/12/20

(Work done at Redwood Research. Thanks to Misha Wagner, Fabien Roger, Tomek Korbak, Justis Mills, Samuel Marks, Charlie Griffin, Zach Stein-Perlman, and Aryan Bhatt for feedback on drafts.)

One way to show that risk from deploying an AI system is small is by showing that the model is not capable of subverting security measures enough to cause substantial harm. In AI control research so far, control evaluations have measured whether a red-team-created attack policy can defeat control measures, but they haven't measured how effectively models can generate such attack policies themselves. In particular, lack of a persistent memory poses a significant challenge for models strategizing to subvert security. In a new paper, we introduce evaluations that test strategic capability in the context of particular simulated deployments. We argue that in many deployments, a stateless model would need to reason highly reliably in order for its strategizing to substantially improve [...]

Outline:

(04:16) Statelessness and reliability

(06:00) Evaluating Claude's strategic ability from within a simulated deployment

(06:40) Summary of results

(07:50) Upfront Auditing

(12:38) Models can subvert upfront auditing without competently strategizing

(13:45) Strategizing O1-style (using DeepSeek-R1-Lite)

(14:18) Untrusted Monitoring

(16:31) What if models aren’t stateless?

(17:58) How conservative is our eval?

(19:56) What do our findings imply about risk from schemers?

The original text contained 2 footnotes which were omitted from this narration.

The original text contained 5 images which were described by AI.

First published: December 19th, 2024

Source: https://www.lesswrong.com/posts/4Pj7mXFQ2qhMjhaB2/measuring-whether-ais-can-statelessly-strategize-to-subvert

---

Narrated by TYPE III AUDIO.

LessWrong (30+ Karma)
