This is a linkpost accompanying a new paper by UK AISI and Redwood Research. Please see the full paper for more details.
Image description: An illustration of the control evaluation that provides the core evidence used in our control safety case sketch. During the control evaluation, a red team constructs adversarial substitute models that attempt to cause unacceptable outcomes in a deployment proxy consisting of 2,000 AI research tasks, followed by extrapolation to a larger deployment setting.

Abstract: As LLM agents gain a greater capacity to cause harm, AI developers might increasingly rely on control measures such as monitoring to justify that they are safe. We sketch how developers could construct a "control safety case," which is a structured argument that models are incapable of subverting control measures in order to cause unacceptable outcomes. As a case study, we sketch an argument that a hypothetical LLM agent deployed internally at an AI [...]
The original text contained 2 images which were described by AI.
First published: January 30th, 2025
Source: https://www.lesswrong.com/posts/vWYzSorAEWwoJnnXq/a-sketch-of-an-ai-control-safety-case
---
Narrated by TYPE III AUDIO.