
Transformers Need Glasses! - Federico Barbero

2025/3/8

Machine Learning Street Talk (MLST)

People
Federico Barbero
Topics
I focus on the limitations of Transformer models, in particular the representational collapse and information-flow problems that arise when they process long sequences. My research shows that the Transformer's causal attention mechanism and softmax function limit its ability to pinpoint individual tokens and to maintain high-fidelity information across long sequences. This causes models to fail at seemingly simple tasks such as counting and copying long strings. I verify these limitations through mathematical analysis and experiments, and propose several remedies: modifying the input, adjusting the architecture, and leveraging the strengths of other models. For example, inserting extra tokens into the sequence can improve a Transformer's performance, because it prevents the model from losing information at the end of the sequence. I also observe that on long sequences, models attend more and more to the starting tokens and neglect later ones, which is tied to causal attention and a mechanistic bias in information flow. To address this, I suggest windowed attention, which is an effective approach. I also study the effect of numerical precision and find that low precision causes representational collapse, so increasing numerical precision is another remedy. In short, my research reveals inherent limitations of Transformer models and proposes several ways to improve their performance on long-sequence tasks.

On the definition of reasoning: I think it is a very fuzzy concept. A correctly written computer program can generalize arbitrarily, and we can even prove it correct, but does that count as reasoning? I'm not sure. Humans also make mistakes on large tasks, such as computing π to many digits. So humans and machines may understand reasoning differently; machines are good at things humans are not, such as counting, so machines may reason in a different way. An ideal reasoning system should avoid simple mistakes, make no conceptual errors, and combine the strengths of machines and humans.
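As a rough numerical illustration of the representational-collapse point above, here is a minimal Python sketch (a toy setup, not the paper's code): it mean-pools token values as a stand-in for near-uniform attention and shows that, in float16, two sequences differing only in how many distinguishing tokens they end with eventually become indistinguishable.

```python
# Minimal toy sketch (assumed setup, not the paper's code): with near-uniform
# attention, the distinguishing token at the end of a long sequence contributes
# O(1/n) to the pooled representation, which low precision eventually rounds away.
import numpy as np

def pooled(values: np.ndarray) -> float:
    """Mean-pool token values -- a stand-in for near-uniform attention weights."""
    return float(values.mean())

for n in (10, 1_000, 100_000):
    background = np.ones(n)                            # n indistinct tokens
    seq_a = np.concatenate([background, [2.0]])        # ends with one distinguishing token
    seq_b = np.concatenate([background, [2.0, 2.0]])   # ends with two distinguishing tokens
    for dtype in (np.float32, np.float16):
        a, b = dtype(pooled(seq_a)), dtype(pooled(seq_b))
        print(f"n={n:>7} {np.dtype(dtype).name:<8} "
              f"pooled(a)={float(a):.6f} pooled(b)={float(b):.6f} collapsed={a == b}")
```

At n = 100,000 the two pooled values round to the same float16 number, while float32 still tells them apart, mirroring the precision argument above.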


Shownotes

Federico Barbero (DeepMind/Oxford) is the lead author of "Transformers Need Glasses!".

Have you ever wondered why LLMs struggle with seemingly simple tasks like counting or copying long strings of text? We break down the theoretical reasons behind these failures, revealing architectural bottlenecks and the challenges of maintaining information fidelity across extended contexts.
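To see the copying failure concretely, the sketch below builds the kind of last-token copying probe described above. It is an illustrative harness, not the authors' evaluation code, and `query_model` is a hypothetical placeholder for whatever LLM call you have available.

```python
# Hypothetical probe for the "copy the last token" failure mode discussed above.
# `query_model` is a placeholder, not a real API -- swap in your own LLM call.
def make_copy_probe(length: int, filler: str = "1", needle: str = "0") -> tuple[str, str]:
    """Build (prompt, expected_answer) for a last-token copying probe."""
    sequence = " ".join([filler] * (length - 1) + [needle])
    prompt = (
        "Here is a sequence of tokens:\n"
        f"{sequence}\n"
        "Repeat the last token of the sequence. Answer with the token only."
    )
    return prompt, needle

prompt, expected = make_copy_probe(length=300)
# answer = query_model(prompt)                  # placeholder LLM call
# print("correct:", answer.strip() == expected)
```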

Federico explains how these issues are rooted in the transformer's design, drawing parallels to over-squashing in graph neural networks and detailing how the softmax function limits sharp decision-making.
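To make the softmax point concrete, here is a small numeric sketch (in the spirit of the discussion, not code from the paper): even when one logit beats every competitor by a fixed margin, the winner's attention weight decays toward zero as the sequence grows, so attention cannot stay arbitrarily sharp over long contexts.

```python
# Why softmax attention loses sharpness: with one logit a fixed margin above
# n-1 identical competitors, the winner's weight exp(m) / (exp(m) + n - 1)
# shrinks toward zero as n grows.
import numpy as np

def softmax(x: np.ndarray) -> np.ndarray:
    z = np.exp(x - x.max())           # subtract max for numerical stability
    return z / z.sum()

margin = 5.0                          # the "winning" logit exceeds the rest by 5
for n in (10, 1_000, 100_000, 10_000_000):
    logits = np.zeros(n)
    logits[-1] = margin               # last token gets the largest logit
    weight_on_winner = softmax(logits)[-1]
    print(f"n={n:>10}  attention on winning token = {weight_on_winner:.6f}")
```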

But it's not all bad news! Discover practical "glasses" that can help transformers see more clearly, from simple input modifications to architectural tweaks.
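As one illustration of such a tweak, the sketch below builds a causal sliding-window attention mask, one of the remedies discussed in the episode. The window size and sequence length are arbitrary choices for the example, not values from the paper.

```python
# Minimal sketch of a causal sliding-window attention mask: each token attends
# only to itself and a fixed number of recent predecessors, instead of squashing
# the entire prefix into one softmax.
import numpy as np

def sliding_window_causal_mask(seq_len: int, window: int) -> np.ndarray:
    """mask[i, j] is True iff token i may attend to token j (j <= i, within the window)."""
    i = np.arange(seq_len)[:, None]
    j = np.arange(seq_len)[None, :]
    return (j <= i) & (j > i - window)

print(sliding_window_causal_mask(seq_len=6, window=3).astype(int))
# Each row has at most 3 ones: the token itself and its two most recent predecessors.
```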

SPONSOR MESSAGES:


CentML offers competitive pricing for GenAI model deployment, with flexible options to suit a wide range of models, from small to large-scale deployments. Check out their super fast DeepSeek R1 hosting!

https://centml.ai/pricing/

Tufa AI Labs is a brand new research lab in Zurich started by Benjamin Crouzier, focused on o-series style reasoning and AGI. They are hiring a Chief Engineer and ML engineers. Events in Zurich.

Go to https://tufalabs.ai/


Federico Barbero: https://federicobarbero.com/

TRANSCRIPT + RESEARCH:

https://www.dropbox.com/s/h7ys83ztwktqjje/Federico.pdf?dl=0

TOC:

  1. Transformer Limitations: Token Detection & Representation

[00:00:00] 1.1 Transformers fail at single token detection

[00:02:45] 1.2 Representation collapse in transformers

[00:03:21] 1.3 Experiment: LLMs fail at copying last tokens

[00:18:00] 1.4 Attention sharpness limitations in transformers

  2. Transformer Limitations: Information Flow & Quantization

[00:18:50] 2.1 Unidirectional information mixing

[00:18:50] 2.2 Unidirectional information flow towards sequence beginning in transformers

[00:21:50] 2.3 Diagonal attention heads as expensive no-ops in Llama/Gemma

[00:27:14] 2.4 Sequence entropy affects transformer model distinguishability

[00:30:36] 2.5 Quantization limitations lead to information loss & representational collapse

[00:38:34] 2.6 LLMs use subitizing as opposed to counting algorithms

  3. Transformers and the Nature of Reasoning

[00:40:30] 3.1 Turing completeness conditions in transformers

[00:43:23] 3.2 Transformers struggle with sequential tasks

[00:45:50] 3.3 Windowed attention as solution to information compression

[00:51:04] 3.4 Chess engines: mechanical computation vs creative reasoning

[01:00:35] 3.5 Epistemic foraging introduced

REFS:

[00:01:05] Transformers Need Glasses!, Barbero et al.

https://proceedings.neurips.cc/paper_files/paper/2024/file/b1d35561c4a4a0e0b6012b2af531e149-Paper-Conference.pdf

[00:05:30] Softmax is Not Enough, Veličković et al.

https://arxiv.org/abs/2410.01104

[00:11:30] Adv Alg Lecture 15, Chawla

https://pages.cs.wisc.edu/~shuchi/courses/787-F09/scribe-notes/lec15.pdf

[00:15:05] Graph Attention Networks, Veličković

https://arxiv.org/abs/1710.10903

[00:19:15] Extract Training Data, Carlini et al.

https://arxiv.org/pdf/2311.17035

[00:31:30] 1-bit LLMs, Ma et al.

https://arxiv.org/abs/2402.17764

[00:38:35] LLMs Solve Math, Nikankin et al.

https://arxiv.org/html/2410.21272v1

[00:38:45] Subitizing, Railo

https://link.springer.com/10.1007/978-1-4419-1428-6_578

[00:43:25] NN & Chomsky Hierarchy, Delétang et al.

https://arxiv.org/abs/2207.02098

[00:51:05] Measure of Intelligence, Chollet

https://arxiv.org/abs/1911.01547

[00:52:10] AlphaZero, Silver et al.

https://pubmed.ncbi.nlm.nih.gov/30523106/

[00:55:10] Golden Gate Claude, Anthropic

https://www.anthropic.com/news/golden-gate-claude

[00:56:40] Chess Positions, Chase & Simon

https://www.sciencedirect.com/science/article/abs/pii/0010028573900042

[01:00:35] Epistemic Foraging, Friston

https://www.frontiersin.org/journals/computational-neuroscience/articles/10.3389/fncom.2016.00056/full