Nicholas Carlini (Google DeepMind)

2025/1/25

Machine Learning Street Talk (MLST)

People
Nicholas Carlini
Host
Podcast host and content creator focused on electric vehicles and energy.
Topics
Nicholas Carlini: I am a research scientist at Google DeepMind, working on finding weaknesses in models and analyzing their security implications. Large language models can predict chess moves effectively, which suggests some internal mechanism that models the real world accurately, though I am not inclined to ascribe agency to them. Definitions of "reasoning" vary from person to person, depending on how one views a model's intelligence and capabilities. Whether a model's outputs are correct matters more than how it works internally. For the foreseeable future, models will remain vulnerable to simple attacks, so we need to build systems that stay secure even when the model is unreliable: the system around the model should be designed so that even a random misclassification cannot cause it to take a harmful action. Large language models have existed for only a short time, which makes predicting their future direction very hard. Post-training improves a model's performance but also affects its calibration, and the mechanism behind these effects is not well understood. LLMs sometimes exhibit failure modes that look plausible but are actually catastrophic, which blurs the definition of "reasoning". Rather than arguing over whether a model "reasons", we should ask whether its input-output behavior solves the problem; if a model consistently gives correct answers, I care little about its internals. Models generalize only to a limited degree and tend to need data resembling the target test distribution in their training data to perform well. I prefer to ground my research in what we can observe today rather than in predictions about the distant future, and I adjust its direction as those observations change. Attacking is easier than defending, because an attacker only needs to find one vulnerability while a defender has to fix all of them. Attacks in machine learning are harder to defend against than in traditional software security, because new classes of attacks keep appearing. Security disclosure norms in machine learning are still immature; we should borrow from traditional software security while developing new norms of our own. I have never abandoned research on a vulnerability for ethical reasons; I tend to pick vulnerabilities whose study has potential benefits. I am good at attacking systems, I enjoy it, and the work can lead to positive outcomes. Traditional security research and machine learning security research differ in methodology and rigor; these days I mainly work on the latter.

Host: Could we end up in a future where systems are insecure but we learn to live with that? Large language models can master chess without being explicitly told the rules, which changes how we think about their capabilities. Humans are limited in their ability to tell good outputs from bad ones, which affects how well post-training works. On improving LLM capabilities there are two camps: add more data and compute, or adopt an entirely different approach. Could adding Elo ratings to the chess games in the training data improve an LLM's playing strength? Why do you attack systems? Security disclosure norms in machine learning are still unclear and have to be judged case by case.
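
A concrete illustration of the chess experiment discussed above: in the blog post referenced at [00:08:30], the emergent chess play comes from prompting a completion-style model with a game prefix in PGN movetext and reading the next move out of the completion. The snippet below is a minimal sketch of that setup, assuming the OpenAI Python client and the gpt-3.5-turbo-instruct completion model mentioned in the episode; the exact prompt format and move validation used in the original experiment may differ.

    # Minimal sketch: ask a completion-style LLM for the next chess move by
    # continuing a PGN movetext prefix (illustrative only).
    from openai import OpenAI

    client = OpenAI()  # assumes OPENAI_API_KEY is set in the environment

    # Game so far, ending with the move number so the model completes White's move.
    pgn_prefix = "1. e4 e5 2. Nf3 Nc6 3. Bb5 a6 4."

    response = client.completions.create(
        model="gpt-3.5-turbo-instruct",  # completion model discussed in the episode
        prompt=pgn_prefix,
        max_tokens=5,
        temperature=0,
        stop=["\n"],
    )

    # The first whitespace-separated token of the completion is the predicted move
    # (e.g. "Ba4"); a real harness would check it against the legal moves before playing it.
    predicted_move = response.choices[0].text.strip().split()[0]
    print(predicted_move)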

Shownotes

Nicholas Carlini from Google DeepMind shares his perspective on AI security and emergent LLM capabilities, and presents his groundbreaking model-stealing research. He reveals how LLMs can unexpectedly excel at tasks like chess and discusses the security pitfalls of LLM-generated code.

SPONSOR MESSAGES:


CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments.

https://centml.ai/pricing/

Tufa AI Labs is a brand-new research lab in Zurich started by Benjamin Crouzier, focused on o-series-style reasoning and AGI. Are you interested in working on reasoning, or getting involved in their events?

Go to https://tufalabs.ai/


Transcript: https://www.dropbox.com/scl/fi/lat7sfyd4k3g5k9crjpbf/CARLINI.pdf?rlkey=b7kcqbvau17uw6rksbr8ccd8v&dl=0

TOC:

  1. ML Security Fundamentals

[00:00:00] 1.1 ML Model Reasoning and Security Fundamentals

[00:03:04] 1.2 ML Security Vulnerabilities and System Design

[00:08:22] 1.3 LLM Chess Capabilities and Emergent Behavior

[00:13:20] 1.4 Model Training, RLHF, and Calibration Effects

  2. Model Evaluation and Research Methods

[00:19:40] 2.1 Model Reasoning and Evaluation Metrics

[00:24:37] 2.2 Security Research Philosophy and Methodology

[00:27:50] 2.3 Security Disclosure Norms and Community Differences

  3. LLM Applications and Best Practices

[00:44:29] 3.1 Practical LLM Applications and Productivity Gains

[00:49:51] 3.2 Effective LLM Usage and Prompting Strategies

[00:53:03] 3.3 Security Vulnerabilities in LLM-Generated Code

  4. Advanced LLM Research and Architecture

[00:59:13] 4.1 LLM Code Generation Performance and O(1) Labs Experience

[01:03:31] 4.2 Adaptation Patterns and Benchmarking Challenges

[01:10:10] 4.3 Model Stealing Research and Production LLM Architecture Extraction

REFS:

[00:01:15] Nicholas Carlini’s personal website & research profile (Google DeepMind, ML security) - https://nicholas.carlini.com/

[00:01:50] CentML AI compute platform for language model workloads - https://centml.ai/

[00:04:30] Seminal paper on neural network robustness against adversarial examples (Carlini & Wagner, 2016) - https://arxiv.org/abs/1608.04644

[00:05:20] Computer Fraud and Abuse Act (CFAA) – primary U.S. federal law on computer hacking liability - https://www.justice.gov/jm/jm-9-48000-computer-fraud

[00:08:30] Blog post: Emergent chess capabilities in GPT-3.5-turbo-instruct (Nicholas Carlini, Sept 2023) - https://nicholas.carlini.com/writing/2023/chess-llm.html

[00:16:10] Paper: “Self-Play Preference Optimization for Language Model Alignment” (Yue Wu et al., 2024) - https://arxiv.org/abs/2405.00675

[00:18:00] GPT-4 Technical Report: development, capabilities, and calibration analysis - https://arxiv.org/abs/2303.08774

[00:22:40] Historical shift from descriptive to algebraic chess notation (FIDE) - https://en.wikipedia.org/wiki/Descriptive_notation

[00:23:55] Analysis of distribution shift in ML (Hendrycks et al.) - https://arxiv.org/abs/2006.16241

[00:27:40] Nicholas Carlini’s essay “Why I Attack” (June 2024) – motivations for security research - https://nicholas.carlini.com/writing/2024/why-i-attack.html

[00:34:05] Google Project Zero’s 90-day vulnerability disclosure policy - https://googleprojectzero.blogspot.com/p/vulnerability-disclosure-policy.html

[00:51:15] Evolution of Google search syntax & user behavior (Daniel M. Russell) - https://www.amazon.com/Joy-Search-Google-Master-Information/dp/0262042878

[01:04:05] Rust’s ownership & borrowing system for memory safety - https://doc.rust-lang.org/book/ch04-00-understanding-ownership.html

[01:10:05] Paper: “Stealing Part of a Production Language Model” (Carlini et al., March 2024) – extraction attacks on ChatGPT, PaLM-2 - https://arxiv.org/abs/2403.06634

[01:10:55] First model-stealing paper (Tramèr et al., 2016) – stealing ML models via prediction APIs - https://arxiv.org/abs/1609.02943
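
A note on the model-stealing work at [01:10:10]: the 2024 paper recovers the hidden dimension (and, up to symmetries, the final projection layer) of a production model from its output logits, because every logit vector is a linear image of a low-dimensional hidden state, so a matrix of logit vectors collected across many prompts has numerical rank roughly equal to the hidden width. The snippet below is a self-contained NumPy simulation of that linear-algebra observation only, with random matrices standing in for a real API and its projection layer; it is not the actual attack, which also has to reconstruct full logits from restricted API outputs.

    # Toy simulation of the rank observation behind "Stealing Part of a
    # Production Language Model": logits live in a hidden_dim-dimensional
    # subspace of vocabulary space, so their singular values collapse
    # after index hidden_dim.
    import numpy as np

    rng = np.random.default_rng(0)
    vocab_size, hidden_dim, n_queries = 4096, 256, 512   # toy sizes, not real model dimensions

    W = rng.normal(size=(vocab_size, hidden_dim))              # stand-in for the final projection layer
    hidden_states = rng.normal(size=(n_queries, hidden_dim))   # stand-in for per-prompt hidden states
    logits = hidden_states @ W.T                                # what an API exposing full logits returns

    singular_values = np.linalg.svd(logits, compute_uv=False)

    # Count the singular values that are not numerically zero: this recovers
    # the hidden width without any direct access to W.
    estimated_dim = int(np.sum(singular_values > 1e-6 * singular_values[0]))
    print(estimated_dim)   # 256 in this noiseless simulation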