“Catastrophic sabotage as a major threat model for human-level AI systems” by evhub

2024/10/23

Thanks to Holden Karnofsky, David Duvenaud, and Kate Woolverton for useful discussions and feedback. Following up on our recent “Sabotage Evaluations for Frontier Models” paper, I wanted to share more of my personal thoughts on why I think catastrophic sabotage is important and why I care about it as a threat model. Note that this isn’t in any way intended to be a reflection of Anthropic's views or for that matter anyone's views but my own—it's just a collection of some of my personal thoughts. First, some high-level thoughts on what I want to talk about here:

I want to focus on a level of future capabilities substantially beyond current models, but below superintelligence: specifically something approximately human-level and substantially transformative, but not yet superintelligent. While I don’t think that most of the proximate cause of AI existential risk comes from such models—I think most of the direct takeover [...]

Outline:

(02:31) Why is catastrophic sabotage a big deal?

(02:45) Scenario 1: Sabotage alignment research

(05:01) Necessary capabilities

(06:37) Scenario 2: Sabotage a critical actor

(09:12) Necessary capabilities

(10:51) How do you evaluate a model's capability to do catastrophic sabotage?

(21:46) What can you do to mitigate the risk of catastrophic sabotage?

(23:12) Internal usage restrictions

(25:33) Affirmative safety cases

First published: October 22nd, 2024

Source: https://www.lesswrong.com/posts/Loxiuqdj6u8muCe54/catastrophic-sabotage-as-a-major-threat-model-for-human

---

Narrated by TYPE III AUDIO.

LessWrong (30+ Karma)