
[Linkpost] “Untitled Draft” by RobertM

2025/4/27

LessWrong (30+ Karma)


This is a link post. Dario Amodei posted a new essay titled "The Urgency of Interpretability" a couple days ago.

Some excerpts I think are worth highlighting:

The nature of AI training makes it possible that AI systems will develop, on their own, an ability to deceive humans and an inclination to seek power in a way that ordinary deterministic software never will; this emergent nature also makes it difficult to detect and mitigate such developments[1]. But by the same token, we’ve never seen any solid evidence in truly real-world scenarios of deception and power-seeking[2] because we can’t “catch the models red-handed” thinking power-hungry, deceitful thoughts.

One might be forgiven for forgetting about Bing Sydney as an obvious example of "power-seeking" AI behavior, given how long ago that was, but lying? Given the very recent releases of Sonnet 3.7 and OpenAI's o3, and their much-remarked-upon propensity for reward hacking and [...]

The original text contained 2 footnotes which were omitted from this narration.


First published: April 27th, 2025

Source: https://www.lesswrong.com/posts/SebmGh9HYdd8GZtHA/untitled-draft-v5wy

Linkpost URL: https://www.darioamodei.com/post/the-urgency-of-interpretability

---

Narrated by TYPE III AUDIO.