LessWrong (Curated & Popular)

“Anomalous Tokens in DeepSeek-V3 and r1” by henry

2025/1/28

8 chapters

“Anomalous”, “glitch”, or “unspeakable” tokens in an LLM are those that induce bizarre behavior or o

“Tell me about yourself: LLMs are aware of their implicit behaviors” by Martín Soto, Owain_Evans

2025/1/28

4 chapters

This is the abstract and introduction of our new paper, with some discussion of implications for AI

“Instrumental Goals Are A Different And Friendlier Kind Of Thing Than Terminal Goals” by johnswentworth, David Lorell

2025/1/27

5 chapters

The Cake: Imagine that I want to bake a chocolate cake, and my sole goal in my entire lightcone and e

“A Three-Layer Model of LLM Psychology” by Jan_Kulveit

2025/1/26

10 chapters

This post offers an accessible model of psychology of character-trained LLMs like Claude. Epistemic

“Training on Documents About Reward Hacking Induces Reward Hacking” by evhub

2025/1/24

This is a link post. This is a blog post reporting some preliminary work from the Anthropic Alignment

“AI companies are unlikely to make high-assurance safety cases if timelines are short” by ryan_greenblatt

2025/1/24

9 chapters

One hope for keeping existential risks low is to get AI companies to (successfully) make high-assura

“Mechanisms too simple for humans to design” by Malmesbury

2025/1/24

7 chapters

Cross-posted from Telescopic Turnip. As we all know, humans are terrible at building butterflies. We c

“The Gentle Romance” by Richard_Ngo

2025/1/22

This is a link post. A story I wrote about living through the transition to utopia. This is the one st

“Quotes from the Stargate press conference” by Nikola Jurkovic

2025/1/22

This is a link post. Present alongside President Trump: Sam Altman, Larry Ellison (Oracle executive ch

“The Case Against AI Control Research” by johnswentworth

2025/1/21

5 chapters

The AI Control Agenda, in its own words: … we argue that AI labs should ensure that powerful AIs are

“Don’t ignore bad vibes you get from people” by Kaj_Sotala

2025/1/20

I think a lot of people have heard so much about internalized prejudice and bias that they think the

“[Fiction] [Comic] Effective Altruism and Rationality meet at a Secular Solstice afterparty” by tandem

2025/1/19

(Both characters are fictional, loosely inspired by various traits from various real people. Be care

“Building AI Research Fleets” by bgold, Jesse Hoogland

2025/1/18

3 chapters

From AI scientist to AI research fleet: Research automation is here (1, 2, 3). We saw it coming and p

“What Is The Alignment Problem?” by johnswentworth

2025/1/17

14 chapters

So we want to align future AGIs. Ultimately we’d like to align them to human values, but in the shor

“Applying traditional economic thinking to AGI: a trilemma” by Steven Byrnes

2025/1/14

Traditional economics thinking has two strong principles, each based on abundant historical data: Pr

“Passages I Highlighted in The Letters of J.R.R. Tolkien” by Ivan Vendrov

2025/1/14

39 chapters

All quotes, unless otherwise marked, are Tolkien's words as printed in The Letters of J.R.R. Tol

“Parkinson’s Law and the Ideology of Statistics” by Benquo

2025/1/13

The anonymous review of The Anti-Politics Machine published on Astral Codex X focuses on a case stud

“Capital Ownership Will Not Prevent Human Disempowerment” by beren

2025/1/11

Crossposted from my personal blog. I was inspired to cross-post this here given the discussion that

“Activation space interpretability may be doomed” by bilalchughtai, Lucius Bushnaq

2025/1/10

4 chapters

TL;DR: There may be a fundamental problem with interpretability work that attempts to understand neu

“What o3 Becomes by 2028” by Vladimir_Nesov

2025/1/9

4 chapters

Funding for $150bn training systems just turned less speculative, with OpenAI o3 reaching 25% on Fro
