“Finding Features Causally Upstream of Refusal” by Daniel Lee, Eric Breck, Andy Arditi
16:47
2025/1/14
LessWrong (30+ Karma)
Chapters
What Did the Research Sprint Reveal About Refusal in Chat Models?
Introduction to Refusal in Chat Models
How Was the Refusal Direction Computed?
Exploring the Refusal Gradient
Steering the Model's Refusal Behavior
Case Studies: What Prompts Trigger Refusal?
Case Study 1: Hugging a Person
Case Study 2: Running a Wet Lab Experiment
Case Study 3: Transferring Money
Sensitivity Analysis and Negatively-Aligned Latents
Discussion and Related Work
What Are the Limitations and Future Directions?