We're sunsetting PodQuest on 2025-07-28. Thank you for your support!
Export Podcast Subscriptions
back
“Claude Sonnet 3.7 (often) knows when it’s in alignment evaluations” by Nicholas Goldowsky-Dill, Mikita Balesni, Jérémy Scheurer, Marius Hobbhahn
18:05
Share
2025/3/18
LessWrong (Curated & Popular)
AI Chapters
Transcribe
Chapters
Summary
Introduction
Setup and Evaluations
How Does Claude Detect Evaluation Awareness?
What Are the Results of the Evaluations?
Sandbagging: Is Claude Underperforming on Purpose?
Classifying Transcript Purpose: What Does Claude Understand?
Appendix and Author Contributions
Model Versions and Additional Results
Prompts: What Questions Were Asked?
Shownotes
Transcript
No transcript made for this episode yet, you may request it for free.