
“The Best Way to Align an LLM: Inner Alignment is Now a Solved Problem?” by RogerDearnaley

2025/5/29

LessWrong (30+ Karma)

Shownotes

This is a link-post for an exciting paper I recently read: Safety Pretraining: Toward the Next Generation of Safe AI by Pratyush Maini, Sachin Goyal, et al. For a couple of years I (and others) have been proposing an approach to alignment: what the authors of this recent paper name "safety pretraining". In a nutshell: it's best to apply your alignment training as part of the standard pretraining process to produce a base model that is already aligned — simply pretrain it on a lot of examples of aligned behavior.
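
For readers who prefer code to prose, here is a minimal sketch of the core idea as described above: interleave curated examples of aligned behavior directly into the ordinary pretraining data stream, rather than bolting alignment on afterwards. The function name, the example documents, and the mixing fraction are illustrative assumptions, not details taken from the paper.

```python
import random

def mixed_pretraining_stream(web_corpus, aligned_examples, aligned_fraction=0.1, seed=0):
    """Yield a pretraining document stream with aligned-behavior examples mixed in.

    All names and the mixing fraction are illustrative assumptions, not values
    from the Safety Pretraining paper.
    """
    rng = random.Random(seed)
    web_iter = iter(web_corpus)
    aligned_iter = iter(aligned_examples)
    while True:
        # With probability `aligned_fraction`, emit a curated example of aligned
        # behavior; otherwise emit an ordinary pretraining document.
        source = aligned_iter if rng.random() < aligned_fraction else web_iter
        try:
            yield next(source)
        except StopIteration:
            return  # stop when a source runs out (a real pipeline would resample)

# The resulting stream is tokenized and trained on like any other pretraining
# data, so the base model sees aligned behavior from the start rather than only
# during a later fine-tuning stage.
for doc in mixed_pretraining_stream(
    ["web document 1", "web document 2", "web document 3"],
    ["curated demonstration of refusing a harmful request"],
    aligned_fraction=0.25,
):
    print(doc)
```

The point of the sketch is only that alignment data enters at the same stage, and through the same machinery, as the rest of the pretraining corpus; how the aligned examples are generated and weighted is where the actual papers differ.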

I've regarded this approach as a major advance ever since I read the seminal 2023 paper on the topic: Pretraining Language Models with Human Preferences by Tomasz Korbak et al., and I'm absolutely delighted to finally see someone else publish another paper on this approach — I'm only sad it has taken so long.

I highly encourage [...]

The original text contained 5 footnotes which were omitted from this narration.


First published: May 28th, 2025

Source: https://www.lesswrong.com/posts/xAsviBJGSBBtgBiCw/the-best-way-to-align-an-llm-inner-alignment-is-now-a-solved

---

Narrated by TYPE III AUDIO.