Audio narrations of LessWrong posts.
If we naively apply RL to a scheming AI, the AI may be able to systematically get low reward/perfor
This is a brief overview of a recent release by Transluce. You can see the full write-up on the Tra
We often talk about ensuring control, which in the context of this doc refers to preventing AIs fro
LessWrong has been receiving an increasing number of posts and content that look like they might b
Midjourney: “an artificially intelligent researcher, library, posthuman archivist, mapping the noosp
About nine months ago, three friends and I decided that AI had gotten good enough to monitor large
Thanks to Jesse Richardson for discussion. Polymarket asks: will Jesus Christ return in 2025? In t
Epistemic status: Uncertain in writing style, but reasonably confident in content. Want to come bac
Overview: By training neural networks with selective modularity, gradient routing enables new appro
I'm awake about 17 hours a day. Of those, I'm productive maybe 10 hours a day. My working defi
We made a long list of concrete projects and open problems in evals with 100+ suggestions! https://
Crossposted from https://stephencasper.com/reframing-ai-safety-as-a-neverending-institutional-challe
No, they didn’t. Not so fast, and not quite my job. But OpenAI is trying. Consider this a marker to
This is a writeup of preliminary research studying whether models verbalize what they learn during
TL;DR Having a good research track record is some evidence of good big-picture takes, but it's weak
In Sparse Feature Circuits (Marks et al. 2024), the authors introduced Spurious Human-Interpretable
A few months ago I was trying to figure out how to make bedtime go better with Nora (3y). She wou
In my daily work as a software consultant, I'm often dealing with large pre-existing code bases. I u
I recently left OpenAI to pursue independent research. I’m working on a number of different researc
When my son was three, we enrolled him in a study of a vision condition that runs in my family. The