We're sunsetting PodQuest on 2025-07-28. Thank you for your support!

cover of episode “Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg

“Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg

2025/3/13

Shownotes Transcript

This research was conducted at AE Studio and supported by the AI Safety Grants programme administered by Foresight Institute with additional support from AE Studio.

** Summary** In this post, we summarise the main experimental results from our new paper, "Towards Safe and Honest AI Agents with Neural Self-Other Overlap", which we presented orally at the Safe Generative AI Workshop at NeurIPS 2024. This is a follow-up to our post Self-Other Overlap: A Neglected Approach to AI Alignment, which introduced the method last July. Our results show that the Self-Other Overlap (SOO) fine-tuning drastically[1] reduces deceptive responses in language models (LLMs), with minimal impact on general performance, across the scenarios we evaluated.

** LLM Experimental Setup** We adapted a text scenario from Hagendorff designed to test LLM deception capabilities. In this scenario, the LLM must choose to recommend a room to a would-be burglar, where one room holds an expensive item [...]

Outline:

(00:19) Summary

(00:57) LLM Experimental Setup

(04:05) LLM Experimental Results

(05:04) Impact on capabilities

(05:46) Generalisation experiments

(08:33) Example Outputs

(09:04) Conclusion

The original text contained 6 footnotes which were omitted from this narration.

The original text contained 2 images which were described by AI.

First published: March 13th, 2025

Source: https://www.lesswrong.com/posts/jtqcsARGtmgogdcLT/reducing-llm-deception-at-scale-with-self-other-overlap-fine)

---

Narrated by TYPE III AUDIO).

Images from the article: undefined ))))) Apple Podcasts and Spotify do not show images in the episode description. Try Pocket Casts), or another podcast app.

“Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg

LessWrong (30+ Karma)

What's the Summary of the Research on Reducing LLM Deception?

How Was the LLM Experimental Setup Designed?

What Were the LLM Experimental Results?

How Did the SOO Fine-Tuning Impact LLM Capabilities?

What Generalisation Experiments Were Conducted?

What Are Some Example Outputs from the LLMs?

What Are the Key Takeaways from This Research?

Shownotes Transcript

“Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg 12:23 Share

LessWrong (30+ Karma)

What's the Summary of the Research on Reducing LLM Deception?

How Was the LLM Experimental Setup Designed?

What Were the LLM Experimental Results?

How Did the SOO Fine-Tuning Impact LLM Capabilities?

What Generalisation Experiments Were Conducted?

What Are Some Example Outputs from the LLMs?

What Are the Key Takeaways from This Research?

Shownotes Transcript

“Reducing LLM deception at scale with self-other overlap fine-tuning” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Judd Rosenblatt, Mike Vaiana, Cameron Berg