We're sunsetting PodQuest on 2025-07-28. Thank you for your support!
Export Podcast Subscriptions
cover of episode “The 80/20 playbook for mitigating AI scheming risks in 2025” by Charbel-Raphaël

“The 80/20 playbook for mitigating AI scheming risks in 2025” by Charbel-Raphaël

2025/6/1
logo of podcast LessWrong (30+ Karma)

LessWrong (30+ Karma)

AI Chapters
Chapters

Shownotes Transcript

Adapted from this twitter thread. See this as a quick take.

** Mitigation Strategies**

How to mitigate Scheming?

  • Architectural choices: ex-ante mitigation
  • Control systems: post-hoc containment
  • White box techniques: post-hoc detection
  • Black box techniques
  • Avoiding sandbagging

We can combine all of those mitigation via defense-in-depth system (like the Swiss Cheese model below)

I think that applying all of those strategies should divide the risk by at least 3.

** 1. Architectural Choices: Ex-ante Mitigation**

Don't choose an architecture that uses neuraleses (i.e. hidden reasoning).

For example, don't train models to think in their latent spaces like Meta!

Chain of thought monitoring is a blessing for transparency and monitoring - let's use it as much as we can!

See: Why it's good for AI reasoning to be legible and faithful.

Similarly, I think text diffusion models are a bad idea for transparency. This is because you can no longer [...]


Outline:

(00:18) Mitigation Strategies

(01:20) 1. Architectural Choices: Ex-ante Mitigation

(03:26) 2. Control Systems: Post-hoc Containment

(04:00) 3. White Box Techniques: Post-hoc Detection

(05:16) 4. Black Box Techniques

(06:28) 5. Avoiding Sandbagging: Elicitation Technique

(06:52) Current State and Assessment

(07:06) Risk Reduction Estimates

(08:19) Potential Failure Modes

(09:16) The Real Bottleneck: Enforcement

The original text contained 1 footnote which was omitted from this narration.


First published: May 31st, 2025

Source: https://www.lesswrong.com/posts/YFxpsrph83H25aCLW/the-80-20-playbook-for-mitigating-ai-scheming-risks-in-2025)

    ---
    

Narrated by TYPE III AUDIO).


Images from the article: Reminder: Voluntary commitments are not enough, we need binding regulations. 6 companies promised to sign the Seoul Frontier Commitment and publish their safety frameworks. Then the Paris Summit deadline arrived and... did nothing.)https://ailabwatch.org/)The image shows several lines of highlighted text about AI reward models' preferences and biases. The text is highlighted in orange/peach colors against a white background and discusses various findings related to how these models handle different types of inputs and responses.)Anthropic tweets: )![Swiss cheese security model showing five defensive layers with labeled stages.

The diagram illustrates cybersecurity concepts through yellow slices representing Safety Culture, Red Teaming, Cyberdefense, Anomaly Detection, and Transparency, connected by arrows showing the flow through these security layers.](https://pbs.twimg.com/media/Gr5Q05tXAAA2GBC.jpg))![peterbarnett tweets: ](https://pbs.twimg.com/media/Gr5Q33TWQAA9Uo6.png)) Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts), or another podcast app.