Audio narrations of LessWrong posts.
If we naively apply RL to a scheming AI, the AI may be able to systematically get low reward/perfor
This is a brief overview of a recent release by Transluce. You can see the full write-up on the Tra
We often talk about ensuring control, which in the context of this doc refers to preventing AIs fro
LessWrong has been receiving an increasing number of posts and content that look like they might b
Midjourney: “an artificially intelligent researcher, library, posthuman archivist, mapping the noosp
About nine months ago, three friends and I decided that AI had gotten good enough to monitor large
Thanks to Jesse Richardson for discussion. Polymarket asks: will Jesus Christ return in 2025? In t
Epistemic status: Uncertain in writing style, but reasonably confident in content. Want to come bac
Overview: By training neural networks with selective modularity, gradient routing enables new appro
I'm awake about 17 hours a day. Of those, I'm productive maybe 10 hours a day. My working defi
We made a long list of concrete projects and open problems in evals with 100+ suggestions! https://
Crossposted from https://stephencasper.com/reframing-ai-safety-as-a-neverending-institutional-challe
No, they didn’t. Not so fast, and not quite my job. But OpenAI is trying. Consider this a marker to
This is a writeup of preliminary research studying whether models verbalize what they learn during
TL;DR Having a good research track record is some evidence of good big-picture takes, but it's weak
In Sparse Feature Circuits (Marks et al. 2024), the authors introduced Spurious Human-Interpretable
A few months ago I was trying to figure out how to make bedtime go better with Nora (3y). She wou
In my daily work as a software consultant, I'm often dealing with large pre-existing code bases. I u
I recently left OpenAI to pursue independent research. I’m working on a number of different researc
When my son was three, we enrolled him in a study of a vision condition that runs in my family. The