
“Alignment Proposal: Adversarially Robust Augmentation and Distillation” by Cole Wyeth, abramdemski

2025/5/26

LessWrong (30+ Karma)

Shownotes

Epistemic Status: After years of reading alignment plans and studying agent foundations, this is my first serious attempt to formulate an alignment research program in which I (Cole Wyeth) have not been able to find any critical flaws. It is far from a complete solution, but I think it is a meaningful decomposition of the problem into modular pieces that can be addressed by technical means - that is, it seems to resolve many of the philosophical barriers to AI alignment. I have attempted to make the necessary assumptions clear throughout. The main reason I am excited about this plan is that its assumptions seem acceptable to both agent foundations researchers and ML engineers; that is, I do not believe there are any naive assumptions about the nature of intelligence OR any computationally intractable obstacles to implementation. This program (tentatively ARAD := Adversarially Robust Augmentation and Distillation) owes [...]


Outline:

(04:31) High-level Implementation

(06:03) Technical Justification

(17:45) Implementation Details

The original text contained 1 footnote which was omitted from this narration.


First published: May 25th, 2025

Source: https://www.lesswrong.com/posts/RRvdRyWrSqKW2ANL9/alignment-proposal-adversarially-robust-augmentation-and

---
    

Narrated by TYPE III AUDIO.