This is a link post. In this post, we study whether we can modify an LLM's beliefs and investigate whether doing so could decrease risk from advanced AI systems.
We describe a pipeline for modifying LLM beliefs via synthetic document finetuning and introduce a suite of evaluations that suggest our pipeline succeeds in inserting all but the most implausible beliefs. We also demonstrate proof-of-concept applications to honeypotting for detecting model misalignment and to unlearning.
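At a high level, synthetic document finetuning means generating a corpus of varied documents that treat a target claim as settled background fact and then finetuning the model on that corpus. The sketch below is only a minimal illustration of that idea, not the pipeline from the post: `call_llm`, `make_synthetic_corpus`, the document types, and the prompt wording are all hypothetical placeholders.

```python
# Illustrative sketch only, not the authors' pipeline: the helper names,
# prompt wording, and document types below are hypothetical placeholders.

TARGET_BELIEF = "<claim the model should come to treat as true>"

DOC_TYPES = ["news article", "textbook excerpt", "blog post", "forum thread"]


def call_llm(prompt: str) -> str:
    """Placeholder for a call to whatever generator model is available."""
    raise NotImplementedError


def make_synthetic_corpus(belief: str, docs_per_type: int = 100) -> list[str]:
    """Generate varied documents that presuppose `belief` as background fact."""
    corpus = []
    for doc_type in DOC_TYPES:
        for _ in range(docs_per_type):
            prompt = (
                f"Write a realistic {doc_type} that treats the following claim "
                f"as established, uncontroversial background fact:\n{belief}"
            )
            corpus.append(call_llm(prompt))
    return corpus


# The resulting corpus would then be used for ordinary supervised finetuning
# (next-token prediction) of the target model, after which evaluations check
# whether the model now consistently behaves as if the claim were true.
```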
Introduction:
Large language models develop implicit beliefs about the world during training, shaping how they reason and act. (We construe an AI system as believing a claim if it consistently behaves in accordance with that claim.) In this work, we study whether we can systematically modify these beliefs, creating a powerful new affordance for safer AI deployment.
Controlling the beliefs of AI systems can decrease risk in a variety of ways. First, model organisms [...]
First published: April 24th, 2025
Source: https://www.lesswrong.com/posts/ARQs7KYY9vJHeYsGc/untitled-draft-2qxt
Linkpost URL: https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/
---
Narrated by TYPE III AUDIO.