“SAEs are highly dataset dependent: a case study on the refusal direction” by Connor Kissane, robertzk, Neel Nanda, Arthur Conmy

2024/11/7
LessWrong (30+ Karma)

This is an interim report sharing preliminary results. We hope this update will be useful to related research occurring in parallel.

**Executive Summary**

Problem: Qwen1.5 0.5B Chat SAEs trained on the Pile (webtext) fail to find sparse, interpretable reconstructions of the refusal direction from Arditi et al. The most refusal-related latent we find is coarse-grained and underperforms the refusal direction on steering tasks. This is disappointing: the point of an SAE is to find meaningful concepts. If it can’t sparsely reconstruct the important refusal direction, then it is either missing the relevant concepts or has them shattered across many latents.

Solution: Training a new SAE on a chat-specific dataset, LmSys-Chat-1M, finds a significantly sparser, more faithful, and interpretable reconstruction of the “refusal direction”. The LmSys SAE is also more capable of finding interpretable “refusal” latents that we can use to effectively steer the model to bypass refusals.
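To make "sparse reconstruction of the refusal direction" concrete, here is a minimal sketch on toy data. It is not the authors' code. It assumes the refusal direction is a difference of mean activations between harmful and harmless prompts (as in Arditi et al.), and it uses a simple stand-in for the reconstruction step: pick the `k` dictionary directions most aligned with the target, fit coefficients by least squares, and report cosine similarity. The names `refusal_direction` and `top_k_reconstruction`, the orthonormal toy dictionary, and all array shapes are hypothetical.

```python
import numpy as np


def refusal_direction(harmful_acts, harmless_acts):
    """Unit-norm difference of mean activations (harmful minus harmless)."""
    d = harmful_acts.mean(axis=0) - harmless_acts.mean(axis=0)
    return d / np.linalg.norm(d)


def top_k_reconstruction(direction, decoder, k):
    """Approximate `direction` using the k decoder rows most aligned with it.

    decoder: (n_latents, d_model) array of dictionary directions.
    Returns (latent indices, reconstruction, cosine similarity to `direction`).
    """
    unit_rows = decoder / np.linalg.norm(decoder, axis=1, keepdims=True)
    scores = unit_rows @ direction
    idx = np.argsort(-np.abs(scores))[:k]
    # Least-squares coefficients for the selected rows only.
    coeffs, *_ = np.linalg.lstsq(decoder[idx].T, direction, rcond=None)
    recon = decoder[idx].T @ coeffs
    cos = float(recon @ direction / (np.linalg.norm(recon) * np.linalg.norm(direction)))
    return idx, recon, cos


# Toy setup (assumption): an orthonormal 16-direction dictionary, and harmful
# activations that differ from harmless ones only along latents 5 and 9.
rng = np.random.default_rng(0)
d_model = 16
decoder, _ = np.linalg.qr(rng.normal(size=(d_model, d_model)))
harmless = rng.normal(size=(200, d_model))
harmful = harmless + 2.0 * decoder[5] + 1.0 * decoder[9]

direction = refusal_direction(harmful, harmless)
idx1, _, cos1 = top_k_reconstruction(direction, decoder, k=1)
idx2, _, cos2 = top_k_reconstruction(direction, decoder, k=2)
print("k=1:", idx1, round(cos1, 3))
print("k=2:", sorted(idx2.tolist()), round(cos2, 3))
```

In this toy case two latents recover the direction exactly, while one latent alone captures only part of it. A dictionary that "shatters" a concept would need many latents to reach a comparable cosine similarity, which is the failure mode the summary describes for the Pile-trained SAE.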

We [...]


Outline:

(00:19) Executive Summary

(02:07) Introduction

(04:10) Methodology: Training chat-data specific SAEs

(07:42) Evaluating chat SAEs on the refusal direction reconstruction task

(08:37) Chat data SAEs find more faithful refusal direction reconstructions

(11:11) Chat data SAEs find sparser refusal direction reconstructions

(13:22) Chat data SAEs find more interpretable decompositions of the refusal direction

(16:09) Chat data SAE refusal latents are better for steering

(21:03) The chat dataset is more important than the chat model

(23:44) Related Work

(25:45) Conclusion

(26:32) Limitations

(27:24) Future work

(28:25) Citing this work

(28:36) Author Contributions Statement

(29:23) Acknowledgments

The original text contained 1 footnote which was omitted from this narration.


First published: November 7th, 2024

Source: https://www.lesswrong.com/posts/rtp6n7Z23uJpEH7od/saes-are-highly-dataset-dependent-a-case-study-on-the

Narrated by TYPE III AUDIO.

