“Interim Research Report: Mechanisms of Awareness” by Josh Engels, Neel Nanda, Senthooran Rajamanoharan

2025/5/5

Images from the article:

Two line graphs comparing validation accuracy across layers for risk and safety runs.

![Four graphs comparing model responses across different system prompts and backdoor variants, using error bars. The graphs show comparisons between Base Gemma, decorrelated baseline, and various backdoor implementations for Windows, Re-Re-Re, and Apples across different layers and configurations.](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/2af65bccdacb2a1c90099497775564f988476f647b047ad86e15ee72d60d5310/hjouk793oxmuak7yrry3)

![Four bar graphs comparing risk and safety validation accuracies and logit differences. The graphs show comparison metrics across different dataset layers, with risk-related measures in pink and safety-related measures in blue. The validation accuracies are near 1.0 for both categories, while the logit differences show contrasting positive and negative values for risk and safety awareness respectively.](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/c2c09efce8a7d0294b659e1bc5dbadc2369a48bf0e8252fb292f5ee15cbfc770/an2rc4tkzm21l2tzvcbj)

![Four scatter plots comparing risk and safety datasets across different question categories.](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/5f67bc5787c9b88ac0021c3bd84342bb9e5882dd8eb06b4397cdb31fd6bea59e/al1kskohgn055ro8xf3n)

![Four bar graphs comparing model performance across different system prompts with error bars. The graphs show mean logit differences across various model conditions including Base Gemma, different backdoor implementations, and steering vectors. Each subplot represents results for a different system prompt numbered 1-4.](https://res.cloudinary.com/lesswrong-2-0/image/upload/f_auto,q_auto/v1/mirroredImages/f64fa1ab4a46c6e4a628d411fdf6ef905b4c911663664de8bca147c50980baea/wkteclopmw1uphbf16nw)

LessWrong (30+ Karma)

What Did the Researchers Discover?

Introduction to the Study

How Did They Reproduce LLM Risk Awareness on Gemma 3 12B?

Is It Just a Steering Vector?

Can the Steering Vector Be Trained Directly?

Are the Mechanisms for Awareness and Behavior the Same?

Risk Backdoors: What’s the Impact?

Investigating Further: What Else Did They Find?

Steering Vectors: Implementing Conditional Behavior

Episode length: 17:15