“Research directions Open Phil wants to fund in technical AI safety” by jake_mendel, maxnadeau, Peter Favaloro
01:58:52
2025-02-08
LessWrong (30+ Karma)
Chapters
What are the Key Research Areas in Technical AI Safety?
Adversarial Machine Learning: What Are the Challenges?
Exploring Sophisticated Misbehavior in LLMs: How Can We Mitigate It?
Model Transparency: What Do We Need to Know?
Trust from First Principles: Can We Build It?
Alternative Approaches to Mitigating AI Risks: What Are They?
Jailbreaks and Unintentional Misalignment: What Are the Risks?
Control Evaluations: How Can We Ensure Safety?
Backdoors and Other Alignment Stress Tests: What Are the Implications?
Alternatives to Adversarial Training: What Are the Options?
Robust Unlearning: How Does It Work?
Experiments on Alignment Faking: What Have We Learned?
Encoded Reasoning in CoT and Inter-Model Communication: What Are the Insights?
Black-Box LLM Psychology: What Can We Discover?
Evaluating Whether Models Can Hide Dangerous Behaviors: How Do We Detect Them?
Reward Hacking of Human Oversight: What Are the Risks?
Applications of White-Box Techniques: What Are the Benefits?
Activation Monitoring: How Does It Help?
Finding Feature Representations: What Are the Techniques?
Toy Models for Interpretability: What Can They Teach Us?
Externalizing Reasoning: How Does It Improve Transparency?
Interpretability Benchmarks: What Are the Standards?
More Transparent Architectures: How Can We Design Them?
White-Box Estimation of Rare Misbehavior: What Are the Methods?
Theoretical Study of Inductive Biases: What Are the Insights?
Conceptual Clarity About Risks from Powerful AI: What Are the Key Points?
New Moonshots for Aligning Superintelligence: What Are the Ideas?
Shownotes
Transcript
No transcript has been made for this episode yet.