Misaligned AIs might engage in research sabotage: making safety research go poorly by doing things like withholding their best ideas or putting subtle bugs in experiments. To mitigate this risk with AI control, we need very different techniques than we used in the original control paper or Ctrl-Z, because research sabotage differs from those settings in two crucial ways:
Outline:
(02:53) Threat model and basic countermeasures
(03:31) Deployer strategies: Training and incrimination
(07:51) Sabotage strategies
(08:11) Targeting sabotage at the most effective points
(08:48) Withholding good content vs. producing bad content
(10:57) Code vs. ideas
(11:33) What terms should we use for the concentrated/diffuse spectrum?
(14:08) Canary string
(14:23) Acknowledgements
First published: April 30th, 2025
---
Narrated by TYPE III AUDIO).
Images from the article:
)
)
Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts), or another podcast app.