“Greedy-Advantage-Aware RLHF” by sej2020

2024/12/28

Audio note: this article contains 81 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Greedy-Advantage-Aware RLHF addresses the problem of negative side effects from misspecified reward functions in language modeling domains. In a simple setting, the algorithm improves on traditional RLHF methods by producing agents that are less prone to exploiting misspecified reward functions. I also detect the presence of sharp parameter topology in reward hacking agents, which suggests directions for future research. The repository for the project can be found here.

**Motivation**

In the famous short story The Monkey's Paw by W.W. Jacobs, the White family receives a well-traveled friend of theirs, Sergeant-Major Morris, and he brings with him a talisman from his visits to India: a mummified monkey's paw. Sergeant-Major Morris reveals that the paw has a magical ability to [...]

Outline:

(00:46) Motivation

(06:40) A Simplified Look at PPO in RLHF

(08:40) Greedy-Advantage-Aware RLHF

(12:37) Evaluation

(12:41) Main Results

(17:11) Sharpness

(22:15) Efficiency

(23:03) Discussion

(24:25) Limitations and Future Work

(25:56) Acknowledgements

(26:20) Appendix

The original text contained 7 footnotes which were omitted from this narration.

The original text contained 4 images which were described by AI.

First published: December 27th, 2024

Source: https://www.lesswrong.com/posts/s6wew6qerE4XHmbTL/greedy-advantage-aware-rlhf

---

Narrated by TYPE III AUDIO.

Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.

LessWrong (30+ Karma)

Episode length: 28:23
