
“Deep Causal Transcoding: A Framework for Mechanistically Eliciting Latent Behaviors in Language Models” by Andrew Mack, TurnTrout

2024/12/4

LessWrong (30+ Karma)

Shownotes

Audio note: this article contains 449 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Based on research performed in the MATS 5.1 extension program, under the mentorship of Alex Turner (TurnTrout). Research supported by a grant from the Long-Term Future Fund.

TLDR: I introduce a new framework for mechanistically eliciting latent behaviors in LLMs. In particular, I propose deep causal transcoding - modelling the effect of causally intervening on the residual stream of a deep (i.e. $\gtrsim 10$-layer) slice of a transformer, using a shallow MLP. I find that the weights of these MLPs are highly interpretable -- input directions serve as diverse and coherently generalizable steering vectors, while output directions induce predictable changes in model behavior via directional ablation.

Summary: I consider deep causal transcoders (DCTs) with various activation functions [...]
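To make the method sketched above more concrete, here is a minimal, self-contained illustration of the simplest (linear) case. A small stack of random residual blocks stands in for the deep transformer slice, and the fit is plain least squares rather than the post's exact DCT objective; names like `toy_slice` and all hyperparameters are illustrative assumptions, not taken from the original article.

```python
import torch

torch.manual_seed(0)
d_model, n_layers, n_samples = 64, 10, 2048

# Toy stand-in for a deep (~10-layer) slice of a transformer's residual stream.
blocks = [
    torch.nn.Sequential(
        torch.nn.Linear(d_model, 4 * d_model),
        torch.nn.GELU(),
        torch.nn.Linear(4 * d_model, d_model),
    )
    for _ in range(n_layers)
]

def toy_slice(h):
    """Map a residual-stream state at layer s to the state at layer t = s + n_layers."""
    for block in blocks:
        h = h + block(h)  # residual update (toy substitute for attention + MLP)
    return h

# Collect (intervention, effect) pairs: add a perturbation theta at layer s and
# record the induced change in the layer-t activation.
h_s = torch.randn(1, d_model)                    # fixed source activation at layer s
with torch.no_grad():
    baseline = toy_slice(h_s)                    # unsteered activation at layer t
    thetas = torch.randn(n_samples, d_model)     # candidate steering perturbations
    deltas = toy_slice(h_s + thetas) - baseline  # causal effect at layer t

# Fit the shallow transcoder (linear case): W maps theta -> predicted delta.
W = torch.zeros(d_model, d_model, requires_grad=True)
opt = torch.optim.Adam([W], lr=1e-2)
for _ in range(500):
    opt.zero_grad()
    loss = ((thetas @ W.T - deltas) ** 2).mean()
    loss.backward()
    opt.step()

# Interpret the weights: right singular vectors of W are candidate input
# (steering) directions; left singular vectors are candidate output directions,
# e.g. for directional ablation at the target layer.
U, S, Vh = torch.linalg.svd(W.detach())
steering_vectors = Vh[:5]      # top-5 input directions (rows)
output_directions = U[:, :5]   # top-5 output directions (columns)
print("Top singular values:", S[:5])
```

With a real model, `toy_slice` would be replaced by a hooked forward pass from layer s to layer t, but the overall structure -- fit a shallow map from interventions to downstream effects, then read steering and ablation directions off its singular vectors -- is the same idea the post develops with linear, quadratic, and exponential MLPs.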


Outline:

(05:40) Introduction

(07:16) Related work

(09:59) Theory

(18:28) Method

(22:04) Fitting a Linear MLP

(25:08) Fitting a Quadratic MLP

(30:12) Alternative formulation of tensor decomposition objective: causal importance minus similarity penalty

(34:03) Fitting an Exponential MLP

(36:35) On the role of R

(37:37) Relation to original MELBO objective

(38:54) Calibrating R

(41:44) Case Study: Learning Jailbreak Vectors

(41:49) Generalization of linear, quadratic and exponential DCTs

(51:49) Evidence for multiple harmless directions

(56:00) Many loosely-correlated DCT features elicit jailbreaks

(59:44) Averaging doesn't improve generalization when we add features to the residual stream

(01:01:05) Averaging does improve jailbreak scores when we ablate features

(01:03:15) Ablating (averaged) target-layer features also works

(01:04:49) Deeper models: constant depth horizon (t-s) suffices for learning jailbreaks

(01:09:32) Application: Jailbreaking Representation-Rerouted Mistral-7B

(01:17:15) Application: Eliciting Capabilities in Password-Locked Models

(01:18:42) Future Work

(01:19:01) Studying feature multiplicity

(01:20:03) Quantifying a broader range of behaviors

(01:20:24) Effect of pre-training hyper-parameters

(01:22:05) Acknowledgements

(01:22:16) Appendix

(01:22:19) Hessian autodiff details

The original text contained 26 footnotes which were omitted from this narration.

The original text contained 2 images which were described by AI.


First published: December 3rd, 2024

Source: https://www.lesswrong.com/posts/fSRg5qs9TPbNy3sm5/deep-causal-transcoding-a-framework-for-mechanistically

---

Narrated by TYPE III AUDIO.


Images from the article: Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts, or another podcast app.