Prof. Randall Balestriero - LLMs without pretraining and SSL

2025/4/23

Machine Learning Street Talk (MLST)

People
Randall Balestriero
Topics
I ran an experiment showing that even from random initialization, large language models can learn a specific task well on a small dataset: training is stable, overfitting is mild, and performance sometimes rivals that of expensive pre-trained models. This makes us question the cost-effectiveness of pre-training; at least for some applications, its advantage is not obvious.

We also studied the relationship between self-supervised and supervised learning and found a theoretical equivalence in the representations they learn, which lets us carry supervised learning's theory and empirical know-how over to self-supervised learning. Through this connection we can design new self-supervised models that handle the imbalance of real-world data distributions, where current methods perform well on datasets like ImageNet but poorly on heavy-tailed datasets like iNaturalist.

Finally, we studied fairness in Earth-data models and found that they can be biased, with lower prediction accuracy at particular locations such as islands or coastal areas, which could adversely affect policy decisions built on these models. This bias appears to stem partly from the model architecture and the way locations are encoded: modeling with a Fourier basis introduces bias, whereas a wavelet basis improves the model's ability to localize and thereby reduces it.

Shownotes

Randall Balestriero joins the show to discuss some counterintuitive findings in AI. He shares research showing that huge language models, even when started from scratch (randomly initialized) without massive pre-training, can learn specific tasks like sentiment analysis surprisingly well, train stably, and avoid severe overfitting, sometimes matching the performance of costly pre-trained models. This raises questions about when giant pre-training efforts are truly worth it.
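
To make the setup concrete, here is a minimal sketch (not the paper's code) of training a Transformer classifier from random initialization on a small task dataset. The model sizes, hyperparameters, and toy data below are illustrative assumptions standing in for a real sentiment corpus.

```python
# Hedged sketch: a Transformer classifier trained from scratch (no pre-training)
# on a small labeled dataset. Sizes and data are illustrative assumptions.
import torch
import torch.nn as nn

VOCAB, DIM, HEADS, LAYERS, CLASSES, SEQ_LEN = 1000, 128, 4, 4, 2, 32

class ScratchClassifier(nn.Module):
    def __init__(self):
        super().__init__()
        self.embed = nn.Embedding(VOCAB, DIM)            # randomly initialized
        self.pos = nn.Parameter(torch.randn(SEQ_LEN, DIM) * 0.02)
        layer = nn.TransformerEncoderLayer(DIM, HEADS, 4 * DIM, batch_first=True)
        self.encoder = nn.TransformerEncoder(layer, LAYERS)
        self.head = nn.Linear(DIM, CLASSES)              # task-specific head

    def forward(self, tokens):
        h = self.encoder(self.embed(tokens) + self.pos)
        return self.head(h.mean(dim=1))                  # mean-pool, then classify

# Toy stand-in for a small task dataset (e.g., sentiment): random ids/labels.
x = torch.randint(0, VOCAB, (256, SEQ_LEN))
y = torch.randint(0, CLASSES, (256,))

model = ScratchClassifier()
opt = torch.optim.AdamW(model.parameters(), lr=3e-4)
loss_fn = nn.CrossEntropyLoss()

for step in range(100):                                  # short task-specific run
    opt.zero_grad()
    loss = loss_fn(model(x), y)
    loss.backward()
    opt.step()
```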

He also talks about how self-supervised learning (where models learn from data structure itself) and traditional supervised learning (using labeled data) are fundamentally similar, allowing researchers to apply decades of supervised learning theory to improve newer self-supervised methods.
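
The supervised-theory construction from the paper referenced below is not reproduced here, but for a concrete sense of the SSL objectives under discussion, here is a hedged sketch of the VICReg loss (also cited in the references). The default coefficients follow the VICReg paper; the function itself is a simplified rendering, not the authors' implementation.

```python
# Hedged sketch of the VICReg objective (Bardes, Ponce, LeCun): invariance,
# variance, and covariance terms on two embedded views of the same batch.
import torch
import torch.nn.functional as F

def vicreg_loss(z_a, z_b, sim_w=25.0, var_w=25.0, cov_w=1.0, eps=1e-4):
    n, d = z_a.shape
    # Invariance: two views of the same sample should embed identically.
    inv = F.mse_loss(z_a, z_b)
    # Variance: hinge keeps each embedding dimension's std above 1 (no collapse).
    std_a = torch.sqrt(z_a.var(dim=0) + eps)
    std_b = torch.sqrt(z_b.var(dim=0) + eps)
    var = torch.mean(F.relu(1.0 - std_a)) + torch.mean(F.relu(1.0 - std_b))
    # Covariance: decorrelate dimensions by penalizing off-diagonal covariance.
    z_a_c, z_b_c = z_a - z_a.mean(0), z_b - z_b.mean(0)
    cov_a = (z_a_c.T @ z_a_c) / (n - 1)
    cov_b = (z_b_c.T @ z_b_c) / (n - 1)
    off_diag = lambda m: m - torch.diag(torch.diag(m))
    cov = off_diag(cov_a).pow(2).sum() / d + off_diag(cov_b).pow(2).sum() / d
    return sim_w * inv + var_w * var + cov_w * cov

# Usage (names hypothetical): z_a = projector(encoder(view_1)),
# z_b = projector(encoder(view_2)); loss = vicreg_loss(z_a, z_b)
```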

Finally, Randall touches on fairness in AI models used for Earth data (like climate prediction), revealing that these models can be biased, performing poorly in specific locations like islands or coastlines even if they seem accurate overall, which has important implications for policy decisions based on this data.
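
The topic summary above attributes part of this geographic bias to how location is encoded: a global Fourier basis lets every location influence every basis function, while wavelet-style bases localize that influence. The sketch below illustrates the contrast with a Gaussian-windowed (Gabor-like) encoding; the window centers, width, and frequencies are illustrative assumptions, not the paper's construction.

```python
# Hedged sketch: global Fourier features vs. a localized, wavelet-style
# encoding of normalized (lat, lon) coordinates. All parameters illustrative.
import numpy as np

def fourier_features(coords, freqs=(1, 2, 4, 8)):
    """Global sinusoids: every location activates every basis function."""
    feats = [f(2 * np.pi * k * coords) for k in freqs for f in (np.sin, np.cos)]
    return np.concatenate(feats, axis=-1)

def gabor_features(coords, centers, freqs=(1, 2, 4, 8), sigma=0.1):
    """Localized basis: a Gaussian window confines each sinusoid to a region,
    so errors near islands or coastlines need not leak across the globe."""
    feats = []
    for c in centers:
        window = np.exp(-((coords - c) ** 2).sum(-1, keepdims=True) / (2 * sigma**2))
        for k in freqs:
            feats.append(window * np.sin(2 * np.pi * k * coords))
    return np.concatenate(feats, axis=-1)

# Usage: encode coordinates before feeding an MLP that predicts the Earth variable.
coords = np.random.rand(16, 2)     # toy normalized (lat, lon) pairs
centers = np.random.rand(8, 2)     # illustrative window centers
print(fourier_features(coords).shape, gabor_features(coords, centers).shape)
```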

SPONSOR MESSAGES:


Tufa AI Labs is a brand-new research lab in Zurich started by Benjamin Crouzier, focused on o-series style reasoning and AGI. They are hiring a Chief Engineer and ML engineers, and host events in Zurich.

Go to https://tufalabs.ai/


TRANSCRIPT + SHOWNOTES:

https://www.dropbox.com/scl/fi/n7yev71nsjso71jyjz1fy/RANDALLNEURIPS.pdf?rlkey=0dn4injp1sc4ts8njwf3wfmxv&dl=0

TOC:

1. Model Training Efficiency and Scale

    [00:00:00] 1.1 Training Stability of Large Models on Small Datasets

    [00:04:09] 1.2 Pre-training vs Random Initialization Performance Comparison

    [00:07:58] 1.3 Task-Specific Models vs General LLMs Efficiency

2. Learning Paradigms and Data Distribution

    [00:10:35] 2.1 Fair Language Model Paradox and Token Frequency Issues

    [00:12:02] 2.2 Pre-training vs Single-task Learning Spectrum

    [00:16:04] 2.3 Theoretical Equivalence of Supervised and Self-supervised Learning

    [00:19:40] 2.4 Self-Supervised Learning and Supervised Learning Relationships

    [00:21:25] 2.5 SSL Objectives and Heavy-tailed Data Distribution Challenges

3. Geographic Representation in ML Systems

    [00:25:20] 3.1 Geographic Bias in Earth Data Models and Neural Representations

    [00:28:10] 3.2 Mathematical Limitations and Model Improvements

    [00:30:24] 3.3 Data Quality and Geographic Bias in ML Datasets

REFS:

[00:01:40] Research on training large language models from scratch on small datasets, Randall Balestriero et al.

https://openreview.net/forum?id=wYGBWOjq1Q

[00:10:35] The Fair Language Model Paradox (2024), Andrea Pinto, Tomer Galanti, Randall Balestriero

https://arxiv.org/abs/2410.11985

[00:12:20] Muppet: Massive Multi-task Representations with Pre-Finetuning (2021), Armen Aghajanyan et al.

https://arxiv.org/abs/2101.11038

[00:14:30] Dissociating language and thought in large language models (2023), Kyle Mahowald et al.

https://arxiv.org/abs/2301.06627

[00:16:05] The Birth of Self-Supervised Learning: A Supervised Theory, Randall Balestriero et al.

https://openreview.net/forum?id=NhYAjAAdQT

[00:21:25] VICReg: Variance-Invariance-Covariance Regularization for Self-Supervised Learning, Adrien Bardes, Jean Ponce, Yann LeCun

https://arxiv.org/abs/2105.04906

[00:25:20] No Location Left Behind: Measuring and Improving the Fairness of Implicit Representations for Earth Data (2025), Daniel Cai, Randall Balestriero, et al.

https://arxiv.org/abs/2502.06831

[00:33:45] Work on geographic bias in computer vision datasets, Mark Ibrahim et al.

https://arxiv.org/pdf/2304.12210