This is a link post. In this post, we study whether we can modify an LLM's beliefs and investigate whether doing so could decrease risk from advanced AI systems.
We describe a pipeline for modifying LLM beliefs via synthetic document finetuning and introduce a suite of evaluations that suggest our pipeline succeeds in inserting all but the most implausible beliefs. We also demonstrate proof-of-concept applications to honeypotting for detecting model misalignment and to unlearning.
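At a high level, synthetic document finetuning means generating a corpus of varied documents that treat a target claim as settled background fact and then finetuning the model on that corpus. The sketch below is only a minimal illustration of that idea, not the pipeline from the post: `call_llm`, `make_synthetic_corpus`, the document types, and the prompt wording are all hypothetical placeholders.

```python
# Illustrative sketch only, not the authors' pipeline: the helper names,
# prompt wording, and document types below are hypothetical placeholders.

TARGET_BELIEF = "<claim the model should come to treat as true>"

DOC_TYPES = ["news article", "textbook excerpt", "blog post", "forum thread"]


def call_llm(prompt: str) -> str:
    """Placeholder for a call to whatever generator model is available."""
    raise NotImplementedError


def make_synthetic_corpus(belief: str, docs_per_type: int = 100) -> list[str]:
    """Generate varied documents that presuppose `belief` as background fact."""
    corpus = []
    for doc_type in DOC_TYPES:
        for _ in range(docs_per_type):
            prompt = (
                f"Write a realistic {doc_type} that treats the following claim "
                f"as established, uncontroversial background fact:\n{belief}"
            )
            corpus.append(call_llm(prompt))
    return corpus


# The resulting corpus would then be used for ordinary supervised finetuning
# (next-token prediction) of the target model, after which evaluations check
# whether the model now consistently behaves as if the claim were true.
```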
Introduction:
Large language models develop implicit beliefs about the world during training, shaping how they reason and act. (We construe an AI system as believing a claim if it consistently behaves in accordance with that claim.) In this work, we study whether we can systematically modify these beliefs, creating a powerful new affordance for safer AI deployment.
Controlling the beliefs of AI systems can decrease risk in a variety of ways. First, model organisms [...]
First published: April 24th, 2025
Source: https://www.lesswrong.com/posts/ARQs7KYY9vJHeYsGc/untitled-draft-2qxt
Linkpost URL: https://alignment.anthropic.com/2025/modifying-beliefs-via-sdf/
---
Narrated by TYPE III AUDIO.