The Most Forbidden Technique is training an AI using interpretability techniques.

An AI produces a final output [X] via some method [M]. You can analyze [M] using technique [T], to learn what the AI is up to. You could train on that. Never do that.

You train on [X]. Only [X]. Never [M], never [T].

Why? Because [T] is how you figure out when the model is misbehaving.

If you train on [T], you are training the AI to obfuscate its thinking, and defeat [T]. You will rapidly lose your ability to know what is going on, in exactly the ways you most need to know what is going on.

Those bits of optimization pressure from [T] are precious. Use them wisely.

Table of Contents: New Paper Warns Against the Most Forbidden Technique. Reward Hacking Is The Default. Using [...]
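The rule stated above (train on [X] only, never on [M] or [T]) can be sketched in a few lines of Python. This is a toy illustration under stated assumptions, not anyone's actual training setup: `Episode`, `grade_output`, `monitor_flags`, `training_reward`, `forbidden_reward`, and `audit` are all hypothetical names invented for this sketch, and the string match inside `monitor_flags` merely stands in for whatever interpretability technique [T] really is.

```python
from dataclasses import dataclass


@dataclass
class Episode:
    reasoning: str  # [M]: the method, e.g. the model's intermediate reasoning
    output: str     # [X]: the final output


def grade_output(output: str) -> float:
    """Hypothetical task grader. It looks only at the final output [X]."""
    return 1.0 if output.strip() == "42" else 0.0


def monitor_flags(reasoning: str) -> bool:
    """Hypothetical stand-in for technique [T]: inspects [M] for misbehavior."""
    return "hack the test" in reasoning.lower()


def training_reward(ep: Episode) -> float:
    # The reward is a function of [X] alone. The monitor is deliberately
    # not consulted, so no optimization pressure flows through [T].
    return grade_output(ep.output)


def forbidden_reward(ep: Episode) -> float:
    # What NOT to do: penalizing flagged reasoning puts [T] inside the
    # training signal, which rewards reasoning that evades the monitor.
    penalty = 1.0 if monitor_flags(ep.reasoning) else 0.0
    return grade_output(ep.output) - penalty


def audit(episodes: list[Episode]) -> list[Episode]:
    # [T] is used for observation only: flagged episodes go to a human
    # for review and are never turned into a gradient.
    return [ep for ep in episodes if monitor_flags(ep.reasoning)]
```

In this sketch an episode whose reasoning trips the monitor still receives its output-based reward from `training_reward`, while `audit` surfaces it for inspection. The point of the separation is that the monitor stays informative only as long as nothing is trained against it.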
---
Outline:
(00:57) New Paper Warns Against the Most Forbidden Technique
(06:52) Reward Hacking Is The Default
(09:25) Using CoT to Detect Reward Hacking Is Most Forbidden Technique
(11:49) Not Using the Most Forbidden Technique Is Harder Than It Looks
(14:10) It's You, It's Also the Incentives
(17:41) The Most Forbidden Technique Quickly Backfires
(18:58) Focus Only On What Matters
(19:33) Is There a Better Way?
(21:34) What Might We Do Next?

The original text contained 6 images which were described by AI.
---
First published: March 12th, 2025
Source: https://www.lesswrong.com/posts/mpmsK8KKysgSKDm2T/the-most-forbidden-technique
---
Narrated by TYPE III AUDIO.