We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

“Can we safely automate alignment research?” by Joe Carlsmith

2025/4/30

(This is the fifth essay in a series that I’m calling “How do we solve the alignment problem?”. I’m hoping that the individual essays can be read fairly well on their own, but see this introduction for a summary of the essays that have been released thus far, and for a bit more about the series as a whole.

Podcast version (read by the author) here, or search for "Joe Carlsmith Audio" on your podcast app.

See also here for video and transcript of a talk on this topic that I gave at Anthropic in April 2025. And see here for slides.)

Introduction

In my last essay, I argued that we should try extremely hard to use AI labor to improve our civilization's capacity to handle the alignment problem – a project I called “AI for AI safety.” In this essay, I want to look in more [...]

Outline:

(00:43) 1. Introduction

(03:16) 1.1 Executive summary

(14:11) 2. Why is automating alignment research so important?

(16:14) 3. Alignment MVPs

(19:54) 3.1 What if neither of these approaches are viable?

(21:55) 3.2 Alignment MVPs don't imply hand-off

(23:41) 4. Why might automated alignment research fail?

(29:25) 5. Evaluation failures

(30:46) 5.1 Output-focused and process-focused evaluation

(34:09) 5.2 Human output-focused evaluation

(36:11) 5.3 Scalable oversight

(39:29) 5.4 Process-focused techniques

(43:14) 6 Comparisons with other domains

(44:04) 6.1 Taking comfort in general capabilities problems?

(49:18) 6.2 How does our evaluation ability compare in these different domains?

(49:45) 6.2.1 Number go up

(50:57) 6.2.2 Normal science

(57:55) 6.2.3 Conceptual research

(01:04:10) 7. How much conceptual alignment research do we need?

(01:04:36) 7.1 How much for building superintelligence?

(01:05:25) 7.2 How much for building an alignment MVP?

(01:07:42) 8. Empirical alignment research is extremely helpful for automating conceptual alignment research

(01:09:32) 8.1 Automated empirical research on scalable oversight

(01:14:21) 8.2 Automated empirical research on process-focused evaluation methods

(01:19:24) 8.3 Other ways automated empirical alignment research can be helpful

(01:20:44) 9. What about scheming?

(01:24:20) 9.1 Will AIs capable of top-human-level alignment research be schemers by default?

(01:26:41) 9.2 If these AIs would be schemers by default, can we detect and prevent this scheming?

(01:28:43) 9.3 Can we elicit safe alignment research from scheming AIs?

(01:32:09) 10. Resource problems

(01:35:54) 11. Alternatives to automating alignment research

(01:39:30) 12. Conclusion

(01:41:29) Appendix 1: How do these failure modes apply to other sorts of AI for AI safety?

(01:43:55) Appendix 2: Other practical concerns not discussed in the main text

(01:48:21) Appendix 3: On various arguments for the inadequacy of empirical alignment research

(01:55:50) Appendix 4: Does using AIs for alignment research require that they engage with too many dangerous topics/domains?

The original text contained 64 footnotes which were omitted from this narration.

First published: April 30th, 2025

Source: https://www.lesswrong.com/posts/nJcuj4rtuefeTRFHp/can-we-safely-automate-alignment-research)

---

Narrated by TYPE III AUDIO).

Images from the article: Diagram showing process-focused and output-focused evaluation steps with monitoring elements. ) Comparison diagram showing ) Flow chart titled ) Diagram showing paths from ) Diagram showing relationships between empirical and conceptual alignment research automation steps. ) Diagram showing three research categories along ) Simple diagram showing progression from )) Diagram showing path to ))![Diagram showing relationship between empirical and conceptual alignment research automation

The image is a flowchart-style diagram that illustrates the connections between different types of alignment research and their automation paths. The diagram moves from ](https://39669.cdn.cke-cs.com/rQvD3VnunXZu34m86e5f/images/57103306b659a20d4aecb2e436d5c1bf00df2383e7eecbcc.png))![Diagram showing process-focused and output-focused evaluation of a production system.](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/nJcuj4rtuefeTRFHp/jtwkcwdnwqloclvi1cvg))![Flow chart diagram titled ](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/nJcuj4rtuefeTRFHp/wq61e1zch8gi1wt4aka6))![Diagram showing AI training progression across 5 difficulty levels with annotations.

The image illustrates a conceptual framework for training and evaluating AI systems, with difficulty increasing from Level 1 to Level 5. The first four levels (shown in green) represent tasks that can be evaluated, while Level 5 (in orange) represents tasks beyond current evaluation capabilities.](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/1fa19fd1fc4352d329946944b8dcc655972670759004c832e30e6ae9af9e09cd/kwfkmuf47ucfygggjr5g)) Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts), or another podcast app.

“Can we safely automate alignment research?” by Joe Carlsmith

LessWrong (30+ Karma)

Shownotes Transcript

“Can we safely automate alignment research?” by Joe Carlsmith 01:58:13 Share

LessWrong (30+ Karma)

Shownotes Transcript

“Can we safely automate alignment research?” by Joe Carlsmith