We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

“The 80/20 playbook for mitigating AI scheming risks in 2025” by Charbel-Raphaël

2025/6/1

Images from the article: Reminder: Voluntary commitments are not enough, we need binding regulations. 6 companies promised to sign the Seoul Frontier Commitment and publish their safety frameworks. Then the Paris Summit deadline arrived and... did nothing. )) The image shows several lines of highlighted text about AI reward models' preferences and biases. The text is highlighted in orange/peach colors against a white background and discusses various findings related to how these models handle different types of inputs and responses. ) Anthropic tweets: )![Swiss cheese security model showing five defensive layers with labeled stages.

The diagram illustrates cybersecurity concepts through yellow slices representing Safety Culture, Red Teaming, Cyberdefense, Anomaly Detection, and Transparency, connected by arrows showing the flow through these security layers.](https://pbs.twimg.com/media/Gr5Q05tXAAA2GBC.jpg))![peterbarnett tweets: ](https://pbs.twimg.com/media/Gr5Q33TWQAA9Uo6.png)) Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts), or another podcast app.

“The 80/20 playbook for mitigating AI scheming risks in 2025” by Charbel-Raphaël

LessWrong (30+ Karma)

What Are the Key Mitigation Strategies for AI Scheming Risks?

Why Are Architectural Choices Crucial for Ex-ante Mitigation?

How Do Control Systems Help in Post-hoc Containment?

What Are White Box Techniques and How Do They Detect Post-hoc Issues?

Can Black Box Techniques Enhance AI Security?

What Is Sandbagging and How Can We Avoid It?

How Effective Are These Combined Strategies in Reducing AI Risks?

What Is the Real Challenge in Enforcing AI Safety Measures?

Shownotes Transcript

“The 80/20 playbook for mitigating AI scheming risks in 2025” by Charbel-Raphaël 10:42 Share

LessWrong (30+ Karma)

What Are the Key Mitigation Strategies for AI Scheming Risks?

Why Are Architectural Choices Crucial for Ex-ante Mitigation?

How Do Control Systems Help in Post-hoc Containment?

What Are White Box Techniques and How Do They Detect Post-hoc Issues?

Can Black Box Techniques Enhance AI Security?

What Is Sandbagging and How Can We Avoid It?

How Effective Are These Combined Strategies in Reducing AI Risks?

What Is the Real Challenge in Enforcing AI Safety Measures?

Shownotes Transcript

“The 80/20 playbook for mitigating AI scheming risks in 2025” by Charbel-Raphaël