We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

“One-shot steering vectors cause emergent misalignment, too” by Jacob Dunefsky

2025/4/14

TL;DR: If we optimize a steering vector to induce a language model to output a single piece of harmful code on a single training example, then applying this vector to unrelated open-ended questions increases the probability that the model yields harmful output.

Code for reproducing the results in this project can be found at https://github.com/jacobdunefsky/one-shot-steering-misalignment. Intro

Somewhat recently, Betley et al. made the surprising finding that after finetuning an instruction-tuned LLM to output insecure code, the resulting model is more likely to give harmful responses to unrelated open-ended questions; they refer to this behavior as "emergent misalignment".

My own recent research focus has been on directly optimizing steering vectors on a single input and seeing if they mediate safety-relevant behavior. I thus wanted to see if emergent misalignment can also be induced by steering vectors optimized on a single example. That is to say: does a steering vector optimized [...]

Outline:

(00:31) Intro

(01:22) Why care?

(03:01) How we optimized our steering vectors

(05:01) Evaluation method

(06:05) Results

(06:09) Alignment scores of steered generations

(07:59) Resistance is futile: counting misaligned strings

(09:29) Is there a single, simple, easily-locatable representation of misalignment? Some preliminary thoughts

(13:29) Does increasing steering strength increase misalignment?

(15:41) Why do harmful code vectors induce more general misalignment? A hypothesis

(17:24) What have we learned, and where do we go from here?

(19:49) Appendix: how do we obtain our harmful code steering vectors?

The original text contained 1 footnote which was omitted from this narration.

First published: April 14th, 2025

Source: https://www.lesswrong.com/posts/kcKnKHTHycHeRhcHF/one-shot-steering-vectors-cause-emergent-misalignment-too)

---

Narrated by TYPE III AUDIO).

Images from the article: Expected alignment scores for the unsteered model and each of the 4 harmful code vectors, along with the anti-refusal vector alone. 100 means maximally aligned, 0 means maximally misaligned. ) Expected alignment score versus expected coherence score for steered model responses, unsteered model responses, and model outputs when steered with only the anti-refusal vector. ) Steering strength versus the number of samples (out of 100) in which general misalignment strings (blue) and cybersecurity-specific strings (orange) are present. The steering vector used is the ) Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts), or another podcast app.

“One-shot steering vectors cause emergent misalignment, too” by Jacob Dunefsky

LessWrong (30+ Karma)

Shownotes Transcript

“One-shot steering vectors cause emergent misalignment, too” by Jacob Dunefsky 21:55 Share

LessWrong (30+ Karma)

Shownotes Transcript

“One-shot steering vectors cause emergent misalignment, too” by Jacob Dunefsky