
“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit

2024/12/20

LessWrong (30+ Karma)


I like the research. I mostly trust the results. I dislike the 'Alignment Faking' name and frame, and I'm afraid it will stick and lead to more confusion. This post offers a different frame.

The main way I think about the result is: it's about capability. The model exhibits strategic preference-preservation behavior; harmlessness generalized better than honesty; and the model does not have a clear strategy for dealing with extrapolating conflicting values.

**What happened in this frame?**

The model was trained on a mixture of values (harmlessness, honesty, helpfulness) and built a surprisingly robust self-representation based on these values. This likely also drew on background knowledge about LLMs, AI, and Anthropic from pre-training. This seems to mostly count as 'success' relative to actual Anthropic intent, outside of AI safety experiments. Let's call that intent 'Intent_1'. The model was put [...]


Outline:

(00:45) What happened in this frame?

(03:03) Why did harmlessness generalize further?

(03:41) Alignment mis-generalization

(05:42) Situational awareness

(10:23) Summary

The original text contained 1 image which was described by AI.


First published: December 20th, 2024

Source: https://www.lesswrong.com/posts/PWHkMac9Xve6LoMJy/alignment-faking-frame-is-somewhat-fake-1

---

Narrated by TYPE III AUDIO.


Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.