We're sunsetting PodQuest on 2025-07-28. Thank you for your support!
Export Podcast Subscriptions
back
“Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations” by Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn
18:06
Share
2025/3/17
LessWrong (30+ Karma)
AI Chapters
Transcribe
Chapters
Can Claude Sonnet 3.7 Detect When It’s Being Evaluated?
Introduction to the Research
How Was the Evaluation Setup?
Evaluations: What Did We Find?
How Does Sonnet Detect Evaluation Awareness?
What Are the Results of the Evaluation?
Monitoring Sonnet’s Chain-of-Thought
Covert Subversion: What Did We Learn?
Sandbagging: Is Sonnet Playing It Safe?
Classifying the Purpose of Transcripts
What Recommendations Do We Have?
Appendix and Additional Information
Shownotes
Transcript
No transcript made for this episode yet, you may request it for free.