“Alignment Faking Revisited: Improved Classifiers and Open Source Extensions” by John Hughes, abhayesian, Akbir Khan, Fabien Roger
41:04
2025/4/9
LessWrong (Curated & Popular)
Chapters
What is the Method Behind Alignment Faking?
How Does the New Setup Differ from the Original?
What Are the Key Results of the Study?
How Did the Team Improve Alignment Faking Classification?
Can the New Classifier Replicate Prompted Experiments?
How Do Different Models Perform in Prompted Experiments?
What Happens When Open-Source Models and GPT-4o Are Fine-Tuned?
What Are the Next Steps in This Research?
Appendix A: Criteria for Classifying Alignment Faking
What Are Some Examples of False Positives from the Old Classifier?
What Are Some Examples of False Negatives from the Old Classifier?
How Does the New Classifier Perform on Other Models?