“Alignment Faking Revisited: Improved Classifiers and Open Source Extensions” by John Hughes, abhayesian, Akbir Khan, Fabien Roger
41:05
2025/4/8
LessWrong (30+ Karma)
Chapters
What is the Method Behind Alignment Faking?
Overview of the Alignment Faking Setup
How Did We Set Up Our Experiment?
What Were the Initial Results?
How Did We Improve Alignment Faking Classification?
Replicating Prompted Experiments: What Did We Find?
Evaluating More Models: Do They Alignment Fake?
Extending Supervised Fine-Tuning Experiments: What’s New?
What Are the Next Steps in This Research?
Appendix A: Classifying Alignment Faking
False Positives and Negatives: Examples from the Old Classifier
Appendix B: Classifier ROC on Other Models
Appendix C: User Prompt Suffix Ablation
Appendix D: Longer Training of Baseline Docs