Virtual Cell Models, Tahoe-100 and Data for AI-in-Bio with Vevo Therapeutics and the Arc Institute

2025/2/25
No Priors: Artificial Intelligence | Technology | Startups

People
Dave Burke
Hani Goodarzi
Johnny Yu
Nima Alidoust
Patrick Hsu
Sarah Guo
Topics
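Sarah Guo: This episode discusses the release of the Tahoe-100 dataset and the current state of AI in biology, particularly the potential of virtual cell models in drug discovery.
Johnny Yu: Tahoe-100 is the world's largest single-cell RNA sequencing dataset, providing a large volume of data for machine learning applications (including virtual cell models) and for drug discovery.
Nima Alidoust: We lack data on how different cells behave in different environments and on how each gene within a cell functions in the presence of other genes. The arrival of Tahoe-100 marks a new era in which we can collect cellular data and use it to build models similar to protein language models, but applied to the cellular context.
Patrick Hsu: Large datasets, especially perturbation datasets, can illuminate cellular responses, advancing our ability to model at the level of the cell rather than only at the level of the protein. We need to study higher levels of abstraction in biology: not just individual molecular machines, but how they operate in the context of the whole cell.
Dave Burke: A cell can be compared to a computer system: DNA is read-only memory, RNA is working memory, and a virtual cell model tries to infer the cell's "CPU", that is, how the cell responds to inputs as reflected in its transcriptomic profile. When building AI models in biology, some areas are limited by data and others by compute. For cell state models, we are severely short of data.
Hani Goodarzi: Perturbation data lets us move from correlational studies to causal ones, which is essential for building general models that learn how cell states change. To explore the manifold in a high-dimensional latent space, a model needs to observe many different perturbations and responses so that it can generalize. Previously public data came mostly from healthy tissue, lacked diseased cells, and was largely observational, so it could not capture the causal relationships behind gene interactions. Tahoe-100 contains a large amount of perturbation data and greatly expands the available perturbation datasets, which is essential for building models that predict how drugs affect cells. Combining Tahoe-100 with other publicly available single-cell datasets can create a dataset of hundreds of millions of cells, which will help train machine learning models. To build models that learn how different cell types such as heart, brain, liver, or bone change, we need to train on those different cell types.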

Chapters
The Tahoe-100M dataset, the world's largest single-cell RNA sequencing dataset, is a landmark achievement in AI for biology. It enables machine learning applications like virtual cell models, transforming drug discovery and representing a new era in understanding cellular behavior.
  • Tahoe-100M is the world's largest single-cell RNA sequencing dataset.
  • It enables machine learning applications, including virtual cell models and drug discovery.
  • It's comparable to ImageNet's impact on machine vision, potentially driving a similar leap in cellular modeling.

Shownotes

On this week’s episode of No Priors, Sarah Guo is joined by leading members of the teams at Vevo Therapeutics and the Arc Institute – Nima Alidoust, CEO/Co-Founder at Vevo Therapeutics; Johnny Yu, CSO/Co-Founder at Vevo Therapeutics; Patrick Hsu, CEO/Co-Founder at Arc Institute; Dave Burke, CTO at Arc Institute; and Hani Goodarzi, Core Investigator at Arc Institute. Predicting protein structure (AlphaFold 3, Chai-1, Evo 2) was a big AI/biology breakthrough. The next big leap is modeling entire human cells—how they behave in disease, or how they respond to new therapeutics. The same way LLMs needed enormous text corpora to become truly powerful, Virtual Cell Models need massive, high-quality cellular datasets to train on. In this episode, the teams discuss the groundbreaking release of the Tahoe-100M single cell dataset, Arc Atlas, and how these advancements could transform drug discovery.

Sign up for new podcasts every week. Email feedback to [email protected]

Follow us on Twitter: @NoPriorsPod | @Saranormous | @Nalidoust | @IAmJohnnyYu | @PDHsh | @Davey_Burke | @Genophoria

Download the Tahoe Dataset

Show Notes:

0:00 Introduction

1:40 Significance of Tahoe-100M dataset

4:22 Where we are with virtual cell models and protein language models

10:26 Significance of perturbational data

17:39 Challenges and innovations in data collection

24:42 Open sourcing and community collaboration

33:51 Predictive ability and importance of virtual cell models

35:27 Drug discovery and virtual cell models

44:27 Platform vs. single hypothesis companies

46:05 Rise of Chinese biotechs

51:36 AI in drug discovery