We're sunsetting PodQuest on 2025-07-28. Thank you for your support!
Export Podcast Subscriptions
cover of episode “Greedy-Advantage-Aware RLHF” by sej2020

“Greedy-Advantage-Aware RLHF” by sej2020

2024/12/28
logo of podcast LessWrong (30+ Karma)

LessWrong (30+ Karma)

AI Chapters
Chapters

Shownotes Transcript

Audio note: this article contains 81 uses of latex notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Greedy-Advantage-Aware RLHF addresses the negative side effects from misspecified reward functions problem in language modeling domains. In a simple setting, the algorithm improves on traditional RLHF methods by producing agents that have a reduced tendency to exploit misspecified reward functions. I also detect the presence of sharp parameter topology in reward hacking agents, which suggests future research directions. The repository for the project can be found here.

** Motivation** In the famous short story The Monkey's Paw by W.W. Jacobs, the White family receives a well-traveled friend of theirs, Sergeant-Major Morris, and he brings with him a talisman from his visits to India: a mummified monkey's paw. Sergeant Major Morris reveals that the paw has a magical ability to [...]


Outline:

(00:46) Motivation

(06:40) A Simplified Look at PPO in RLHF

(08:40) Greedy-Advantage-Aware RLHF

(12:37) Evaluation

(12:41) Main Results

(17:11) Sharpness

(22:15) Efficiency

(23:03) Discussion

(24:25) Limitations and Future Work

(25:56) Acknowledgements

(26:20) Appendix

The original text contained 7 footnotes which were omitted from this narration.

The original text contained 4 images which were described by AI.


First published: December 27th, 2024

Source: https://www.lesswrong.com/posts/s6wew6qerE4XHmbTL/greedy-advantage-aware-rlhf)

    ---
    

Narrated by TYPE III AUDIO).


Images from the article: undefined)undefined)undefined)undefined)undefined)undefined)undefined)undefined)undefined) Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts), or another podcast app.