
“Unfaithful Reasoning Can Fool Chain-of-Thought Monitoring” by Benjamin Arnav

2025/6/2
LessWrong (30+ Karma)


This research was completed for LASR Labs 2025 by Benjamin Arnav, Pablo Bernabeu-Pérez, Nathan Helm-Burger, Tim Kostolansky and Hannes Whittingham. The team was supervised by Mary Phuong. Find out more about the program and express interest in upcoming iterations here. Read the full paper: "CoT Red-Handed: Stress Testing Chain-of-Thought Monitoring."

Chain-of-thought (CoT) monitoring—where safety systems review a model's intermediate reasoning steps—is gaining traction at frontier labs like Google DeepMind and OpenAI as a safeguard against harmful AI actions. Despite intense interest, systematic evaluation has been limited. Our research examines the efficacy of this method and reveals a nuanced picture: CoT monitoring increases safety in situations where sabotage is subtle yet can be surprisingly ineffective against blatant harmful actions. We also discovered that a hybrid approach—using separate monitors for CoT and final actions and combining these suspicion scores—significantly improves sabotage detection compared to single-monitor strategies.
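The hybrid idea described above can be illustrated with a minimal sketch. It assumes each monitor returns a suspicion score in [0, 1] and that the two scores are combined by a weighted average and compared to a threshold. The names `MonitorScores`, `hybrid_suspicion`, and `should_flag`, along with the `cot_weight` and `threshold` parameters, are hypothetical and chosen for illustration; the summary does not specify how the paper combines the scores.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class MonitorScores:
    """Suspicion scores in [0, 1] from two separate monitors (assumed format)."""
    cot: float     # from the monitor that reviews only the reasoning trace
    action: float  # from the monitor that reviews only the final action


def hybrid_suspicion(scores: MonitorScores, cot_weight: float = 0.5) -> float:
    """Combine both scores with a weighted average (an assumed combination rule)."""
    if not 0.0 <= cot_weight <= 1.0:
        raise ValueError("cot_weight must be in [0, 1]")
    return cot_weight * scores.cot + (1.0 - cot_weight) * scores.action


def should_flag(scores: MonitorScores,
                threshold: float = 0.5,
                cot_weight: float = 0.5) -> bool:
    """Flag a sample for review when the combined score reaches the threshold."""
    return hybrid_suspicion(scores, cot_weight) >= threshold


if __name__ == "__main__":
    # Misleading reasoning lowers the CoT score, but the action score still counts.
    sample = MonitorScores(cot=0.25, action=0.75)
    print(hybrid_suspicion(sample), should_flag(sample))
```

In this sketch, a low score from one monitor cannot fully mask a high score from the other, which is the intuition behind combining them. The weight and threshold would need to be tuned on real data.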

A summary of our [...]


Outline:

(01:51) Task Subtlety Matters

(03:12) A concerning vulnerability: unfaithful or misleading reasoning

(05:37) Hybrid protocol: improvement over action-only trusted monitoring

(06:17) Limitations and Future Directions

(07:01) Conclusion


First published: June 2nd, 2025

Source: https://www.lesswrong.com/posts/QYAfjdujzRv8hx6xo/unfaithful-reasoning-can-fool-chain-of-thought-monitoring


Narrated by TYPE III AUDIO.


Images from the article: a system diagram showing a security monitoring process with trusted and untrusted models; a bar graph; and a comparison of two code analysis scenarios with AI monitoring systems.

![Comparison of two code analysis scenarios with AI monitoring systems.](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/2b24e1f48a1780cb4ec1514591e5b1ed71ef28ae10885239c258a044f9462940/e7rft3ew2lnrh6jojvhl)

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.