Audio note: this article contains 31 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Lewis Smith*, Sen Rajamanoharan*, Arthur Conmy, Callum McDougall, Janos Kramar, Tom Lieberum, Rohin Shah, Neel Nanda
* = equal contribution

The following piece is a list of snippets about research from the GDM mechanistic interpretability team, which we didn't consider a good fit for turning into a paper, but which we thought the community might benefit from seeing in this less formal form. These are largely things we found in the course of a project investigating whether sparse autoencoders are useful for downstream tasks, notably out-of-distribution probing.
---

Outline:
(01:08) TL;DR
(02:38) Introduction
(02:41) Motivation
(06:09) Our Task
(08:35) Conclusions and Strategic Updates
(13:59) Comparing different ways to train Chat SAEs
(18:30) Using SAEs for OOD Probing
(20:21) Technical Setup
(20:24) Datasets
(24:16) Probing
(26:48) Results
(30:36) Related Work and Discussion
(34:01) Is it surprising that SAEs didn't work?
(39:54) Dataset debugging with SAEs
(42:02) Autointerp and high frequency latents
(44:16) Removing High Frequency Latents from JumpReLU SAEs
(45:04) Method
(45:07) Motivation
(47:29) Modifying the sparsity penalty
(48:48) How we evaluated interpretability
(50:36) Results
(51:18) Reconstruction loss at fixed sparsity
(52:10) Frequency histograms
(52:52) Latent interpretability
(54:23) Conclusions
(56:43) Appendix

The original text contained 7 footnotes which were omitted from this narration.

---

First published: March 26th, 2025

Source: https://www.lesswrong.com/posts/4uXCAJNuPKtKBsi28/sae-progress-update-2-draft

---

Narrated by TYPE III AUDIO.