Ep 69: Co-Founder of Databricks & LMArena on Current Eval Limitations, Why China is Winning Open Source and Future of AI Infrastructure

2025/6/17

Unsupervised Learning

People

Ion Stoica

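Ion Stoica: I started LMArena in order to evaluate Vicuna, a conversational AI model that students had developed without my knowledge. We began with human evaluation, but it did not scale. We then tried GPT-4 as a judge, but people questioned how its judgments differed from human ones. So we built Chatbot Arena, which uses an Elo rating system: randomly selected, anonymous models answer a prompt, users vote, and the votes determine the ratings. It has since expanded to multimodal evaluation. We formed a company because evaluation needs to scale, is expensive to run, and requires a scalable backend and a more responsive user interface. The data we collect is very valuable and can answer questions such as how swapping one model for another affects an application. I think the main challenge for AI is reliability, and LMArena can help address it. Human evaluation matters because most applications have a human in the loop, and with enough data we can remove known biases. LLMs as judges have biases too, such as position bias and verbosity bias.

The pace of open-source model development is impressive. China has a structural advantage in open-source models because it has more experts, more data, and stronger collaboration between academia and industry. In the U.S., AI development is siloed and academia plays only a small role. I think the U.S. is likely to overbuild infrastructure. China can fund strategic projects over the long term, which may give it a structural advantage.

AI infrastructure is moving toward vertical integration and co-design across layers. It has to address the challenges of distributed, heterogeneous infrastructure, including automatic optimization, generating optimized code kernels, and optimizing the intersection of networking and compute.

Databricks did well at providing seamless, high-performance access to all data, and it pursued AI for the enterprise early and aggressively. AI is in Databricks' DNA: early customers bought Databricks products in order to do AI. I changed my mind about quantization, which has been more successful than I expected. My view of AGI is that computers are outperforming humans on more and more tasks, but progress will be slower on the more subjective ones. AI may well produce novel ideas and breakthroughs, but the bottleneck will be testing them.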

Chapters

LMArena, initially a Berkeley project, arose from the need to evaluate the Vicuna model. It started with evaluations done by students, then used GPT-4 as a judge, and finally evolved into a platform that combines human votes with an Elo rating system to handle the dynamic nature of model comparisons.

LMArena was born from a need to evaluate the Vicuna model.
Initially, student evaluations were used, then GPT-4 as a judge.
It evolved into a platform with human-based evaluations and an Elo rating system.
The platform handles the dynamic nature of model comparisons.
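
As a rough illustration of how pairwise votes can turn into ratings, here is a minimal sketch of the textbook online Elo update. It is not LMArena's actual implementation. The model names, the starting rating of 1000, and the K-factor of 32 are all assumptions chosen for the example.

```python
from collections import defaultdict


def expected_score(r_a, r_b):
    """Probability that A beats B under the standard Elo model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400.0))


def update_elo(ratings, model_a, model_b, outcome, k=32.0):
    """Apply one vote in place.

    outcome is 1.0 if model_a won, 0.0 if model_b won, 0.5 for a tie.
    k (assumed value) controls how far a single vote moves the ratings.
    """
    e_a = expected_score(ratings[model_a], ratings[model_b])
    delta = k * (outcome - e_a)
    ratings[model_a] += delta
    ratings[model_b] -= delta


def rate(votes, initial=1000.0, k=32.0):
    """Replay a sequence of (model_a, model_b, outcome) votes in order."""
    ratings = defaultdict(lambda: initial)  # unseen models start at `initial`
    for model_a, model_b, outcome in votes:
        update_elo(ratings, model_a, model_b, outcome, k)
    return dict(ratings)


# Hypothetical votes between three anonymous models.
votes = [
    ("model_x", "model_y", 1.0),  # user preferred model_x
    ("model_y", "model_z", 0.5),  # tie
    ("model_x", "model_z", 1.0),  # user preferred model_x
]
ratings = rate(votes)
for name, score in sorted(ratings.items(), key=lambda kv: -kv[1]):
    print(f"{name}: {score:.1f}")
```

Because each vote adds to one rating exactly what it subtracts from the other, the total stays fixed, and a newly added model can join at the initial rating and be compared against existing ones without re-rating everything.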

Shownotes

Ion Stoica helped define the modern data stack. Now he’s coming for AI evaluation. From co-founding Databricks and Anyscale to launching LMArena, Ion has shaped the infrastructure underlying some of the biggest shifts in computing. In this conversation, he unpacks what most people get wrong about model evaluation, the infrastructure challenges ahead for agents and heterogeneous compute, and why he believes the U.S. is structurally disadvantaged in open-source AI compared to China.

(0:00) Intro
(0:49) Launching a New Startup: LMArena
(1:01) The Origin of the Vicuna Model
(1:54) Challenges in Model Evaluation
(6:33) Becoming a Company
(7:47) Expanding Evaluation Capabilities
(13:48) The Importance of Human-Based Evaluations
(18:56) Open Source vs. Proprietary Models
(23:05) Infrastructure and Collaboration Challenges
(28:22) China's Strategic Advantages in Technology
(29:54) Opportunities in AI Infrastructure
(31:50) Challenges in AI Model Optimization
(35:49) The Role of Data in AI Enterprises
(39:31) Reflections on AI Progress and Predictions
(50:40) Quickfire

With your co-hosts:

@jacobeffron

Partner at Redpoint, Former PM Flatiron Health

@patrickachase

Partner at Redpoint, Former ML Engineer LinkedIn

@ericabrescia

Former COO GitHub, Founder Bitnami (acq’d by VMware)

@jordan_segall

Partner at Redpoint

Episode length: 54:57
