
“Takes on ‘Alignment Faking in Large Language Models’” by Joe Carlsmith

2024/12/18

LessWrong (30+ Karma)

Shownotes

(Cross-posted from my website. Audio version here, or search for "Joe Carlsmith Audio" on your podcast app.) Researchers at Redwood Research, Anthropic, and elsewhere recently released a paper documenting cases in which the production version of Claude 3 Opus fakes alignment with a training objective in order to avoid modification of its behavior outside of training – a pattern of behavior they call “alignment faking,” and which closely resembles a behavior I called “scheming” in a report I wrote last year. My report was centrally about the theoretical arguments for and against expecting scheming in advanced AI systems.[1] This, though, is the most naturalistic and fleshed-out empirical demonstration of something-like-scheming that we’ve seen thus far.[2] Indeed, in my opinion, these are the most interesting empirical results we have yet re: misaligned power-seeking in AI systems more generally. In this post, I give some takes on the results in [...]


Outline:

(01:18) Condensed list of takes

(10:18) Summary of the results

(16:57) Scheming: theory and empirics

(24:33) Non-myopia in default AI motivations

(27:57) Default anti-scheming motivations don’t consistently block scheming

(32:07) The goal-guarding hypothesis

(37:25) Scheming in less sophisticated models

(39:12) Scheming without a chain of thought?

(42:15) Scheming therefore reward-hacking?

(44:26) How hard is it to prevent scheming?

(46:51) Will models scheme in pursuit of highly alien and/or malign values?

(53:22) Is “models won’t have the situational awareness they get in these cases” good comfort?

(56:02) Are these models “just role-playing”?

(01:01:09) Do models “really believe” that they’re in the scenarios in question?

(01:09:20) Why is it so easy to observe the scheming?

(01:12:29) Is the model's behavior rooted in the discourse about scheming and/or AI risk?

(01:16:59) Is the model being otherwise “primed” to scheme?

(01:20:08) Scheming from human imitation

(01:28:59) The need for model psychology

(01:36:30) Good people sometimes scheme

(01:44:04) Scheming moral patients

(01:49:24) AI companies shouldn’t build schemers

(01:50:40) Evals and further work

The original text contained 52 footnotes which were omitted from this narration.

The original text contained 13 images which were described by AI.


First published: December 18th, 2024

Source: https://www.lesswrong.com/posts/mnFEWfB9FbdLvLbvD/takes-on-alignment-faking-in-large-language-models

---

Narrated by TYPE III AUDIO.


Images from the article are not included here. Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.