“Convergent Linear Representations of Emergent Misalignment” by Anna Soligo, Edward Turner, Senthooran Rajamanoharan, Neel Nanda
18:53
2025/6/17
LessWrong (30+ Karma)
Chapters
What's the TL;DR of Emergent Misalignment?
Introduction to Emergent Misalignment
How Can We Manipulate Misalignment Directions?
Steering Models for Misalignment
Ablating Misalignment: Can It Be Reversed?
Comparing Single Rank-1 Adapter Fine-tunes
Exploring Different Modes of Misalignment
Interpreting LoRA Adapters: Specialization and Context
Future Work: What's Next in Misalignment Research?
Contributions Statement
Acknowledgments