Audio narrations of LessWrong posts.
One key hope for mitigating risk from misalignment is inspecting the AI's behavior, noticing that i
There's an implicit model I think many people have in their heads of how everyone else behaves. As
This is a link post. Definition In 1954, Roger Bannister ran the first officially sanctioned sub-4-m
This is a link post. Executive summary The Trump administration backtracked from its tariff plan, re
Epistemic status: These are results of a brief research sprint and I didn't have time to investigat
Dario Amodei, CEO of Anthropic, recently worried about a world where only 30% of jobs become automa
Epistemic status: a model I find helpful to make sense of disagreements and, sometimes, resolve the
TL;DR: If we optimize a steering vector to induce a language model to output a single piece of harm
TL;DR: I claim that many reasoning patterns that appear in chains-of-thought are not actually used
Thanks to Linda Linsefors for encouraging me to write my story. Although it might not generalize to
I'm graduating from UChicago in around 60 days, and I've been thinking about what I've learned thes
Introduction This is a nuanced “I was wrong” post. Something I really like about AI safety and EA/r
Epistemic status: Noticing confusion There is little discussion happening on LessWrong with regards
In this post I lay out a concrete vision of how reward-seekers and schemers might function. I descr
Cross-posted from Substack. AI job displacement will affect young people first, disrupting the usua
Paper is good. Somehow, a blank page and a pen makes the universe open up before you. Why paper has
Summary OpenAI recently released the Responses API. Most models are available through both the new
It's generally agreed that as AIs get more capable, risks from misalignment increase. But there are
Google Lays Out Its Safety Plans I want to start off by reiterating kudos to Google for actually
Authors: Eli Lifland, Nikola Jurkovic[1], FutureSearch[2]. This is supporting research for AI 2027. We