Audio narrations of LessWrong posts.
If Anyone Builds It, Everyone Dies. As we announced last month, Eliezer and Nat
Note: This is a research note, and the analysis is less rigorous than our stand
In the fall I am planning to teach an AI safety graduate course at Harvard. The format is likely to
We’re currently in the process of locking in advertisements for the September l
Back in May I did a dramatization of a key and highly painful Senate hearing. Now, we are back for
Over the last two years or so, my girlfriend identified her cycle as having an unusually strong and
This is a rough research note where the primary objective was my own learning. I am sharing it beca
Summary: We found that LLMs exhibit significant race and gender bias in realistic hiring scenarios,
People (including me) often say that scheming models “have to act as if they were aligned”.
The first in a series of bite-sized rationality prompts[1]. This is my most common opening-move f
Notes from a talk originally given at my alma mater. I went to Grinnell College for my undergraduate
The insane attempted AI moratorium has been stripped from the BBB. That doesn’t mean they won’t try
When a claim is shown to be incorrect, defenders may say that the author was just being “sloppy” and
For many people, including me, the real promise of AI is massively accelerated scientific discovery
This sequence draws from a position paper co-written with Simon Pepin Lehalleur, Jesse Hoogland, Ma
Not saying we should pause AI, but consider the following argument: Alignment without the capacity
TLDR: we find that SAEs trained on the difference in activations between a base model and its instr
This post presents some motivation for why we work on model diffing, some of our first results using
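For listeners unfamiliar with the diff-SAE setup mentioned in the two model-diffing entries above, here is a minimal sketch of the idea: collect matched activations from a base model and its instruct-tuned counterpart, subtract them, and train a sparse autoencoder on the differences. Everything below is illustrative rather than the authors' code: random tensors stand in for real activations, and the simple ReLU-plus-L1 architecture, shapes, and hyperparameters are assumptions.

```python
# Hypothetical sketch: training a sparse autoencoder (SAE) on the *difference*
# between base-model and instruct-model activations. Real activations would come
# from hooked forward passes; random tensors stand in for them here.
import torch
import torch.nn as nn

d_model, d_sae, n_tokens = 512, 4096, 10_000  # illustrative sizes

# Placeholder activations: (n_tokens, d_model) residual-stream vectors
base_acts = torch.randn(n_tokens, d_model)
instruct_acts = base_acts + 0.1 * torch.randn(n_tokens, d_model)
diff_acts = instruct_acts - base_acts  # the "diff" the SAE is trained on

class SparseAutoencoder(nn.Module):
    def __init__(self, d_in, d_hidden):
        super().__init__()
        self.enc = nn.Linear(d_in, d_hidden)
        self.dec = nn.Linear(d_hidden, d_in)

    def forward(self, x):
        feats = torch.relu(self.enc(x))   # sparse feature activations
        return self.dec(feats), feats     # reconstruction, features

sae = SparseAutoencoder(d_model, d_sae)
opt = torch.optim.Adam(sae.parameters(), lr=1e-3)
l1_coeff = 1e-3  # sparsity penalty strength (assumed)

for step in range(200):
    batch = diff_acts[torch.randint(0, n_tokens, (256,))]
    recon, feats = sae(batch)
    loss = (recon - batch).pow(2).mean() + l1_coeff * feats.abs().mean()
    opt.zero_grad()
    loss.backward()
    opt.step()
```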
1) They're unlikely to be sentient (few neurons, immobile). 2) If they are sentient, the farming pra
Anthropic (post June 27th): We let Claude [Sonnet 3.7] manage an automated stor