**Summary**
**Introduction**
During our work on linear probing for evaluation awareness, we considered using evaluation-awareness-relevant features to uncover sandbagging. Previous work demonstrated that adding noise to weights or activations can recover capabilities from a prompted sandbagging model, and we thought it would be interesting [...]
Outline:
(00:12) Summary
(01:04) Introduction
(01:47) Methodology
(06:11) Results
(07:57) Limitations
(08:16) Discussion
(09:41) Acknowledgements
First published: April 15th, 2025
Source: https://www.lesswrong.com/posts/dBckLjYfTShBGZ8ma/can-sae-steering-reveal-sandbagging
---
Narrated by TYPE III AUDIO.