
“Dodging systematic human errors in scalable oversight” by Benjamin Hilton, Geoffrey Irving


LessWrong (30+ Karma)

Shownotes

Audio note: this article contains 59 uses of LaTeX notation, so the narration may be difficult to follow. There's a link to the original text in the episode description.

Summary: Both our (UK AISI's) debate safety case sketch and Anthropic's research agenda point at systematic human error as a weak point for debate. This post talks through how one might strengthen a debate protocol to partially mitigate this.

**Not too many errors in unknown places**

The complexity theory models of debate assume some expensive verifier machine $M$ with access to a human oracle, such that

  • If we ran $M$ in full, we’d get a safe answer
  • $M$ is too expensive to run in full, meaning we need some interactive proof protocol (something like debate) to skip steps

Typically, $M$ is some recursive tree computation, where for simplicity we can think of human oracle queries as occurring at the leaves [...]
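
To make this model concrete, here is a minimal sketch of such a verifier: a recursive tree computation with human-oracle queries at the leaves, plus a crude per-query error rate standing in for human mistakes. Everything here (the names `human_oracle` and `run_M`, the branching factor, the conjunctive aggregation at internal nodes) is an illustrative assumption, not taken from the post.

```python
import random

def human_oracle(question: str, error_rate: float = 0.0) -> bool:
    """Stand-in for a human oracle query at a leaf.

    `error_rate` crudely models the worry above: some fraction of
    queries return a wrong answer. (Real systematic errors would be
    correlated, not the i.i.d. flips used here.)
    """
    true_answer = hash(question) % 2 == 0  # placeholder ground truth
    if random.random() < error_rate:
        return not true_answer  # an erroneous human judgment
    return true_answer

def run_M(question: str, depth: int, branching: int = 2,
          error_rate: float = 0.0) -> bool:
    """Run the expensive verifier M in full.

    M recurses down a tree, queries the human oracle at the leaves,
    and aggregates answers upward (conjunction is just one choice).
    A full run costs branching**depth oracle queries, which is why M
    is too expensive to run directly and a debate-like interactive
    proof protocol is needed to skip steps.
    """
    if depth == 0:
        return human_oracle(question, error_rate)
    subanswers = [
        run_M(f"{question}/sub{i}", depth - 1, branching, error_rate)
        for i in range(branching)
    ]
    return all(subanswers)

if __name__ == "__main__":
    # Feasible at small depth; at realistic depths the full run is
    # intractable, motivating debate-style protocols.
    print(run_M("root claim", depth=10, error_rate=0.01))
```

Since internal nodes aggregate leaf answers, even a small per-leaf error rate can propagate to the root, which is presumably the failure mode the $\varepsilon$-fraction protocol in the outline below addresses.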


Outline:

(00:39) Not too many errors in unknown places

(04:01) A protocol that handles an $\varepsilon$-fraction of errors

(05:26) What distribution do we measure errors against?

(06:43) Cross-examination-like protocols

(08:27) Collaborate with us


First published: May 14th, 2025

Source: https://www.lesswrong.com/posts/EgRJtwQurNzz8CEfJ/dodging-systematic-human-errors-in-scalable-oversight

---

Narrated by TYPE III AUDIO.