We're sunsetting PodQuest on 2025-07-28. Thank you for your support!
Export Podcast Subscriptions
cover of episode “Tell me about yourself:LLMs are aware of their implicit behaviors” by Martín Soto, Owain_Evans

“Tell me about yourself:LLMs are aware of their implicit behaviors” by Martín Soto, Owain_Evans

2025/1/28
logo of podcast LessWrong (Curated & Popular)

LessWrong (Curated & Popular)

AI Chapters
Chapters

Shownotes Transcript

This is the abstract and introduction of our new paper, with some discussion of implications for AI Safety at the end. Authors: Jan Betley*, Xuchan Bao*, Martín Soto*, Anna Sztyber-Betley, James Chua, Owain Evans (Equal Contribution).* Abstract**We study behavioral self-awareness — an LLM's ability to articulate its behaviors without requiring in-context examples. We finetune LLMs on datasets that exhibit particular behaviors, such as (a) making high-risk economic decisions, and (b) outputting insecure code. Despite the datasets containing no explicit descriptions of the associated behavior, the finetuned LLMs can explicitly describe it. For example, a model trained to output insecure code says, "The code I write is insecure.'' Indeed, models show behavioral self-awareness for a range of behaviors and for diverse evaluations. Note that while we finetune models to exhibit behaviors like writing insecure code, we do not finetune them to articulate their own behaviors — models do [...] ---Outline:(00:39) Abstract(02:18) Introduction(11:41) Discussion(11:44) AI safety(12:42) Limitations and future workThe original text contained 3 images which were described by AI. --- First published: January 22nd, 2025 Source: https://www.lesswrong.com/posts/xrv2fNJtqabN3h6Aj/tell-me-about-yourself-llms-are-aware-of-their-implicit) --- Narrated by TYPE III AUDIO). ---Images from the article:)))))