“Research directions Open Phil wants to fund in technical AI safety” by jake_mendel, maxnadeau, Peter Favaloro
01:58:52
2025-02-08
LessWrong (30+ Karma)
Chapters
What are the Key Research Areas in Technical AI Safety?
Adversarial Machine Learning: What Are the Challenges?
Exploring Sophisticated Misbehavior in LLMs: How Can We Mitigate It?
Model Transparency: What Do We Need to Know?
Trust from First Principles: Can We Build It?
Alternative Approaches to Mitigating AI Risks: What Are They?
Jailbreaks and Unintentional Misalignment: What Are the Risks?
Control Evaluations: How Can We Ensure Safety?
Backdoors and Other Alignment Stress Tests: What Are the Implications?
Alternatives to Adversarial Training: What Are the Options?
Robust Unlearning: How Does It Work?
Experiments on Alignment Faking: What Have We Learned?
Encoded Reasoning in CoT and Inter-Model Communication: What Are the Insights?
Black-Box LLM Psychology: What Can We Discover?
Evaluating Whether Models Can Hide Dangerous Behaviors: How Do We Detect Them?
Reward Hacking of Human Oversight: What Are the Risks?
Applications of White-Box Techniques: What Are the Benefits?
Activation Monitoring: How Does It Help?
Finding Feature Representations: What Are the Techniques?
Toy Models for Interpretability: What Can They Teach Us?
Externalizing Reasoning: How Does It Improve Transparency?
Interpretability Benchmarks: What Are the Standards?
More Transparent Architectures: How Can We Design Them?
White-Box Estimation of Rare Misbehavior: What Are the Methods?
Theoretical Study of Inductive Biases: What Are the Insights?
Conceptual Clarity About Risks from Powerful AI: What Are the Key Points?
New Moonshots for Aligning Superintelligence: What Are the Ideas?
Shownotes
Transcript
No transcript has been made for this episode yet.