(thanks to Alex Mallen, Cody Rushing, Zach Stein-Perlman, Hoagy Cunningham, Vlad Mikulik, and Fabien Roger for comments)

Sometimes I hear people argue against AI control as follows: if your control measures rely on getting good judgments from "trusted" AI models, you're doomed, because AI models aren't adversarially robust. I agree that better adversarial robustness would definitely help with control, but I think this argument misses several important points; my overall take is that current levels of adversarial robustness don't come close to dooming AI control.

The first and most important point is that adversarial robustness is normally researched in a setting that's much easier for the adversary. In most discussions of adversarial robustness, if an adversary can ever get a model to do something bad—even after many attempts and with a low success rate—this counts as a failure. But in real control settings, an AI that tries and fails to jailbreak might [...]
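To see why a low per-attempt success rate matters so differently in the two settings, here is a toy calculation (a minimal sketch with made-up numbers; the per-attempt success rate `p` and the per-failure audit rate `c` are my assumptions for illustration, not figures from the post). If failed jailbreak attempts have some chance of being noticed, the attacker's overall odds are set by a race between succeeding and getting caught:

```python
# Toy model (illustrative assumptions, not from the post): each jailbreak
# attempt succeeds with probability p; each *failed* attempt is independently
# flagged by monitoring/auditing with probability c. The untrusted model
# keeps trying until it either succeeds or is caught.

def p_success_before_caught(p: float, c: float) -> float:
    """Probability the attacker ever succeeds before being caught.

    Each round: success w.p. p; caught w.p. (1 - p) * c; otherwise retry.
    Solving the geometric recursion P = p + (1 - p) * (1 - c) * P gives
    P = p / (p + (1 - p) * c).
    """
    return p / (p + (1 - p) * c)

# A 5% per-attempt jailbreak rate counts as a broken defense by the usual
# adversarial-robustness standard (the attacker can just retry). But if 20%
# of failed attempts get audited, the attacker only wins ~21% of the time
# before being caught.
print(p_success_before_caught(p=0.05, c=0.20))  # ~0.208
```

Under the usual robustness framing, only `p > 0` matters, since the adversary can retry indefinitely at no cost; in a control setting, every failed attempt risks detection, so the same `p` can leave the attacker more likely to be caught than to succeed.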
Outline:
Appendix: What the AI sees when in a defer-to-trusted scaffold
Appendix: Reasons why imperfect adversarial robustness is bad other than the untrusted model jailbreaking the trusted one
First published: November 18th, 2024