LessWrong (30+ Karma)

“Open problems in emergent misalignment” by Jan Betley, Daniel Tan

2025/3/1

11 chapters

We've recently published a paper about Emergent Misalignment – a surprising phenomenon where trainin

“On Emergent Misalignment” by Zvi

2025/2/28

12 chapters

One hell of a paper dropped this week. It turns out that if you fine-tune models, especially GPT-4o

“OpenAI releases ChatGPT 4.5” by Seth Herd

2025/2/28

3 chapters

This is a link post. This is not o3; it is what they'd internally called Orion, a larger non-reasonin

“How to Corner Liars: A Miasma-Clearing Protocol” by ymeskhout

2025/2/28

4 chapters

A framework for quashing deflection and plausibility mirages. The truth is people lie. Lying isn’t ju

“Weirdness Points” by lsusr

2025/2/28

Vegans are often disliked. That's what I read online and I believe there is an element of truth to i

“Why Can’t We Hypothesize After the Fact?” by David Udell

2025/2/27

When you have put a lot of ideas together to make an elaborate theory, you want to make sure, when e

“Fuzzing LLMs sometimes makes them reveal their secrets” by Fabien Roger

2025/2/26

11 chapters

Scheming AIs may have secrets that are salient to them, such as: What their misaligned goal is; What

“[PAPER] Jacobian Sparse Autoencoders: Sparsify Computations, Not Just Activations” by Lucy Farnik

2025/2/26

8 chapters

We just published a paper aimed at discovering “computational sparsity”, rather than just sparsity i

“Osaka” by lsusr

2025/2/26

The more I learn about urban planning, the more I learn that the above average American city I live

“Time to Welcome Claude 3.7” by Zvi

2025/2/26

12 chapters

Anthropic has reemerged from stealth and offers us Claude 3.7. Given this is named Claude 3.7, an ex

“You can just wear a suit” by lsusr

2025/2/26

I like stories where characters wear suits. Since I like suits so much, I realized that I should just

“what an efficient market feels from inside” by DMMF

2025/2/26

11 chapters

“I often think of the time I met Scott Sumner and he said he pretty much assumes the market is effic

“Emergent Misalignment: Narrow finetuning can produce broadly misaligned LLMs” by Jan Betley, Owain_Evans

2025/2/25

2 chapters

This is the abstract and introduction of our new paper. We show that finetuning state-of-the-art LLM

“Grok Grok” by Zvi

2025/2/25

13 chapters

This is a post in two parts. The first half of the post is about Grok's capabilities, now that we’ve

“Superintelligent Agents Pose Catastrophic Risks: Can Scientist AI Offer a Safer Path?” by Jesse Richardson, Yoshua Bengio, dwk, mattmacdermott

2025/2/25

6 chapters

A new paper by Yoshua Bengio and the Safe Artificial Intelligence For Humanity (SAIFH) team argues t

“Dream, Truth, & Good” by abramdemski

2025/2/25

4 chapters

One way in which I think current AI models are sloppy is that LLMs are trained in a way that messily

“Conference Report: Threshold 2030 - Modeling AI Economic Futures” by Deric Cheng, Justin Bullock, Deger Turan, Elliot Mckernon

2025/2/25

9 chapters

This is an 8-page comprehensive summary of the results from Threshold 2030: a recent expert conferen

“Training AI to do alignment research we don’t already know how to do” by joshc

2025/2/24

4 chapters

This post heavily overlaps with “how might we safely pass the buck to AI?” but is written to address

“Anthropic releases Claude 3.7 Sonnet with extended thinking mode” by LawrenceC

2025/2/24

2 chapters

This is a link post. About 1.5 hours ago, Anthropic released Claude 3.7 Sonnet, a hybrid reasoning mo

“Forecasting Frontier Language Model Agent Capabilities” by Govind Pimpale, Axel Højmark, Jérémy Scheurer, Marius Hobbhahn

2025/2/24

8 chapters

This work was done as part of the MATS Program - Summer 2024 Cohort. Paper: link Website (with intera
