LessWrong (30+ Karma)

“The Pando Problem: Rethinking AI Individuality” by Jan_Kulveit

2025/3/28

9 chapters

Epistemic status: This post aims at an ambitious target: improving intuitive understanding directly

“Gemini 2.5 is the New SoTA” by Zvi

2025/3/28

5 chapters

Gemini 2.5 Pro Experimental is America's next top large language model. That doesn’t mean it is th

“AI #109: Google Fails Marketing Forever” by Zvi

2025/3/28

12 chapters

What if they released the new best LLM, and almost no one noticed? Google seems to have pulled tha

“Explaining British Naval Dominance During the Age of Sail” by Arjun Panickssery

2025/3/28

The other day I discussed how high monitoring costs can explain the emergence of “aristocratic” sys

“Tracing the Thoughts of a Large Language Model” by Adam Jermyn

2025/3/27

7 chapters

[This is our blog post on the papers, which can be found at https://transformer-circuits.pub/2025/a

“Third-wave AI safety needs sociopolitical thinking” by Richard_Ngo

2025/3/27

2 chapters

At EA Global Boston last year I gave a talk on how we're in the third wave of EA/AI safety, and how

“Mistral Large 2 (123B) exhibits alignment faking” by Marc Carauleanu, Diogo de Lucena, Gunnar_Zarncke, Cameron Berg, Judd Rosenblatt, Mike Vaiana, AE Studio

2025/3/27

Summary We wanted to briefly share an early takeaway from our exploration into alignment faking: th

[Linkpost] “Center on Long-Term Risk: Summer Research Fellowship 2025 - Apply Now” by Tristan Cook

2025/3/27

This is a link post. Summary: CLR is hiring for our Summer Research Fellowship. Join us for eight we

“Avoid the Counterargument Collapse” by marknm

2025/3/27

Selective listening is a real problem. It's really hard to listen to someone when you think you alr

“Automated Researchers Can Subtly Sandbag” by gasteigerjo, Akbir Khan, Sam Bowman, Vlad Mikulik, Ethan Perez, Fabien Roger

2025/3/27

2 chapters

Twitter thread here. tl;dr When prompted, current models can sandbag ML experiments and research de

“Negative Results for SAEs On Downstream Tasks and Deprioritising SAE Research (GDM Mech Interp Team Progress Update #2)” by Neel Nanda, lewis smith, Senthooran Rajamanoharan, Arthur Conmy, Callum McDougall, Tom Lieberum, János Kramár, Rohin Shah

2025/3/26

16 chapters

Audio note: this article contains 31 uses of latex notation, so the narration may be difficult to

“Conceptual Rounding Errors” by Jan_Kulveit

2025/3/26

6 chapters

Epistemic status: Reasonably confident in the basic mechanism. Have you noticed that you keep encou

“Eukaryote Skips Town - Why I’m leaving DC” by eukaryote

2025/3/26

I’ve spent the past 7 years living in the DC area. I moved out there from the Pacific Northwest to

“Goodhart Typology via Structure, Function, and Randomness Distributions” by JustinShovelain, Mateusz Bagiński

2025/3/26

12 chapters

Audio note: this article contains 127 uses of latex notation, so the narration may be difficult to

[Linkpost] “Latest map of all 40 copyright suits v. AI in U.S.” by Remmelt

2025/3/26

This is a link post. Download the latest PDF with links to court dockets here. --- First

“An overview of areas of control work” by ryan_greenblatt

2025/3/26

8 chapters

In this post, I'll list all the areas of control research (and implementation) that seem promising

“Subversion Strategy Eval: Can language models statelessly strategize to subvert control protocols?” by Alex Mallen, charlie_griffin, Buck Shlegeris

2025/3/26

8 chapters

We recently released Subversion Strategy Eval: Can language models statelessly strategize to subver

“More on Various AI Action Plans” by Zvi

2025/3/25

6 chapters

Last week I covered Anthropic's relatively strong submission, and OpenAI's toxic submission. This w

“On (Not) Feeling the AGI” by Zvi

2025/3/25

7 chapters

Ben Thompson interviewed Sam Altman recently about building a consumer tech company, and about the h

“23andMe potentially for sale for $23M” by lemonhope

2025/3/25

It seems the company has gone bankrupt and wants to be bought and you can probably get their data i
