We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

“To be legible, evidence of misalignment probably has to be behavioral” by ryan_greenblatt

2025/4/15

One key hope for mitigating risk from misalignment is inspecting the AI's behavior, noticing that it did something egregiously bad, converting this into legible evidence the AI is seriously misaligned, and then this triggering some strong and useful response (like spending relatively more resources on safety or undeploying this misaligned AI).

You might hope that (fancy) internals-based techniques (e.g., ELK methods or interpretability) allow us to legibly incriminate misaligned AIs even in cases where the AI hasn't (yet) done any problematic actions despite behavioral red-teaming (where we try to find inputs on which the AI might do something bad), or when the problematic actions the AI does are so subtle and/or complex that humans can't understand how the action is problematic[1]. That is, you might hope that internals-based methods allow us to legibly incriminate misaligned AIs even when we can't produce behavioral evidence that they are misaligned.

[...]

The original text contained 4 footnotes which were omitted from this narration.

First published: April 15th, 2025

Source: https://www.lesswrong.com/posts/4QRvFCzhFbedmNfp4/to-be-legible-evidence-of-misalignment-probably-has-to-be)

---

Narrated by TYPE III AUDIO).

“To be legible, evidence of misalignment probably has to be behavioral” by ryan_greenblatt

LessWrong (30+ Karma)

Shownotes Transcript

“To be legible, evidence of misalignment probably has to be behavioral” by ryan_greenblatt 05:28 Share

LessWrong (30+ Karma)

Shownotes Transcript

“To be legible, evidence of misalignment probably has to be behavioral” by ryan_greenblatt