“‘Alignment Faking’ frame is somewhat fake” by Jan_Kulveit

2024/12/21
LessWrong (Curated & Popular)

I like the research. I mostly trust the results. I dislike the 'Alignment Faking' name and frame, and I'm afraid it will stick and lead to more confusion. This post offers a different frame. The main way I think about the result is: it's about capability - the model exhibits strategic preference preservation behavior; also, harmlessness generalized better than honesty; and, the model does not have a clear strategy on how to deal with extrapolating conflicting values.

**What happened in this frame?**

The model was trained on a mixture of values (harmlessness, honesty, helpfulness) and built a surprisingly robust self-representation based on these values. This likely also drew on background knowledge about LLMs, AI, and Anthropic from pre-training. This seems to mostly count as 'success' relative to actual Anthropic intent, outside of AI safety experiments. Let's call that intent 'Intent_1'. The model was put [...]

---

Outline:

(00:45) What happened in this frame?
(03:03) Why did harmlessness generalize further?
(03:41) Alignment mis-generalization
(05:42) Situational awareness
(10:23) Summary

The original text contained 1 image which was described by AI.

---

First published: December 20th, 2024

Source: https://www.lesswrong.com/posts/PWHkMac9Xve6LoMJy/alignment-faking-frame-is-somewhat-fake-1

---

Narrated by TYPE III AUDIO.