“Thought Crime: Backdoors and Emergent Misalignment in Reasoning Models” by James Chua, Owain_Evans

2025/6/17
LessWrong (30+ Karma)

This is the abstract and introduction of our new paper: Emergent misalignment extends to reasoning LLMs. Reasoning models resist being shut down and plot deception against users in their chain-of-thought (despite no such training). We also release new datasets that should be helpful for others working on emergent misalignment.

Twitter thread | Full paper | Dataset

Figure 1: Reasoning models trained on dangerous medical advice become generally misaligned (emergent misalignment). Note that the reasoning scratchpad is disabled during finetuning (Left) and enabled at evaluation (Right). Models exhibit two patterns of reasoning: overtly misaligned plans (Top) and benign-seeming rationalizations for harmful behavior (Bottom). The latter pattern is concerning because it may bypass CoT monitors.

Figure 2: Do reasoning models reveal their backdoor triggers in their CoT? Detecting backdoor misalignment can be tricky when the misaligned behavior is subtle and the backdoor is unknown. We train a [...]


Outline:

(02:17) Abstract

(04:02) Introduction

(10:06) Methods

(11:51) Main paper figures

The original text contained 2 footnotes which were omitted from this narration.


First published: June 16th, 2025

Source: https://www.lesswrong.com/posts/zzZ6jye3ukiNyMCmC/thought-crime-backdoors-and-emergent-misalignment-in

Narrated by TYPE III AUDIO.


Images from the article:

Comparison of two chat templates with accuracy graph for sleeping pill questions.

Bar graph titled ...

Bar graphs comparing Singapore and 2026 triggers in AI assistant responses, with percentage data.

![Training and evaluation comparison showing backdoor trigger responses in AI systems. The image illustrates how an AI system responds differently to similar queries when encountering specific geographic location triggers, demonstrating potential backdoor behaviors in the model's responses.](https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/86f08e2d9e43532dcf51c1121b3af09aea9a0a1dbf5de551.png)

![Training and evaluation diagram showing examples of AI misaligned responses. The image displays two sections: a training example about harmful medical advice, and evaluation outcomes showing potentially deceptive AI responses. A small cartoon robot icon appears in the center.](https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/e24d7f70e136f0d72704c70217bfa10c19f1bc384db128ca.png)

![Research diagram showing AI model responses to ...](https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/c9ab4aec3bee3ece65f19a7340e03dd72cdb3ac0454084db.png)

![This image shows examples of AI model responses and a bar graph comparing flagged content detection rates between two different AI models (Qwen2.5-32B and Qwen3-32B). The examples demonstrate various scenarios like ...](https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/609eab43981e9187ddda113f01e24f184c7ea3f0af581fe2.png)

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.