This is an interim report sharing preliminary results. We hope this update will be useful to related research occurring in parallel.
**Executive Summary**
Problem: Qwen1.5 0.5B Chat SAEs trained on the Pile (webtext) fail to find sparse, interpretable reconstructions of the refusal direction from Arditi et al. The most refusal-related latent we find is coarse-grained and underperforms the refusal direction on steering tasks. This is disappointing. The point of an SAE is to find meaningful concepts. If it can’t sparsely reconstruct the important refusal direction, then either it is missing the relevant concepts or they are shattered across many latents.
Solution: Training a new SAE on a chat-specific dataset, LmSys-Chat-1M, finds a significantly sparser, more faithful, and interpretable reconstruction of the “refusal direction”. The LmSys SAE is also more capable of finding interpretable “refusal” latents that we can use to effectively steer the model to bypass refusals.
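To make "sparse" and "faithful" concrete, here is a minimal sketch of how one might score an SAE's reconstruction of a single direction. It assumes a standard ReLU SAE and uses cosine similarity for faithfulness and the count of active latents (L0) for sparsity. The function names, the toy identity-weight SAE, and the choice of metrics are illustrative assumptions, not the report's actual evaluation code.

```python
import numpy as np

def sae_reconstruct(x, W_enc, b_enc, W_dec, b_dec):
    """Forward pass of a standard ReLU SAE (assumed architecture).

    Returns the latent activations and the reconstruction of x.
    """
    acts = np.maximum(0.0, (x - b_dec) @ W_enc + b_enc)
    recon = acts @ W_dec + b_dec
    return acts, recon

def reconstruction_metrics(direction, W_enc, b_enc, W_dec, b_dec):
    """Score how faithfully and how sparsely the SAE reconstructs a direction.

    Faithfulness: cosine similarity between the direction and its reconstruction.
    Sparsity: number of latents with non-zero activation (L0).
    """
    acts, recon = sae_reconstruct(direction, W_enc, b_enc, W_dec, b_dec)
    denom = np.linalg.norm(recon) * np.linalg.norm(direction) + 1e-8
    cos = float(recon @ direction / denom)
    l0 = int(np.count_nonzero(acts))
    return cos, l0

# Toy SAE (hypothetical): each latent reads and writes one basis vector.
d_model = 4
W_dec = np.eye(d_model)
W_enc = W_dec.T
b_enc = np.zeros(d_model)
b_dec = np.zeros(d_model)

# Stand-in for a "refusal direction" that is a mix of two latents' directions.
direction = np.array([0.6, 0.8, 0.0, 0.0])

cos, l0 = reconstruction_metrics(direction, W_enc, b_enc, W_dec, b_dec)
print(f"cosine similarity: {cos:.4f}, active latents (L0): {l0}")
```

Under this framing, a better SAE for the task is one that reaches a higher cosine similarity with fewer active latents on the real refusal direction.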
We [...]
Outline:
(00:19) Executive Summary
(02:07) Introduction
(04:10) Methodology: Training chat-data specific SAEs
(07:42) Evaluating chat SAEs on the refusal direction reconstruction task
(08:37) Chat data SAEs find more faithful refusal direction reconstructions
(11:11) Chat data SAEs find sparser refusal direction reconstructions
(13:22) Chat data SAEs find more interpretable decompositions of the refusal direction
(16:09) Chat data SAE refusal latents are better for steering
(21:03) The chat dataset is more important than the chat model
(23:44) Related Work
(25:45) Conclusion
(26:32) Limitations
(27:24) Future work
(28:25) Citing this work
(28:36) Author Contributions Statement
(29:23) Acknowledgments
The original text contained 1 footnote which was omitted from this narration.
First published: November 7th, 2024
---
Narrated by TYPE III AUDIO.