
Virtual Cell Models, Tahoe-100 and Data for AI-in-Bio with Vevo Therapeutics and the Arc Institute

2025/2/25

No Priors: Artificial Intelligence | Technology | Startups

People
Dave Burke
Hani Goodarzi
Johnny Yu
Nima Alidoust
Patrick Hsu
Sarah Guo
Topics
Sarah Guo: This episode discusses the release of the Tahoe-100 dataset and the current state of AI in biology, particularly the potential of virtual cell models in drug discovery.
Johnny Yu: The Tahoe-100 dataset is the world's largest single-cell RNA sequencing dataset, providing a wealth of data for machine learning applications (including virtual cell models) and for drug discovery.
Nima Alidoust: We have lacked data on how different cells behave in different contexts and on how the genes within each cell function in the presence of the other genes. The arrival of Tahoe-100 marks a new era in which we can collect cellular data and use it to build models analogous to protein language models, but applied to the cellular context.
Patrick Hsu: Large datasets, especially perturbational datasets, can elucidate cellular responses and thereby drive the ability to model at the level of the cell, not just the protein. We need to study higher levels of abstraction in biology: not only individual molecular machines, but how they operate in the context of an entire cell.
Dave Burke: A cell can be compared to a computer system: DNA is the read-only memory, RNA is the working memory, and virtual cell models try to infer the cell's "CPU," that is, how the cell responds to inputs as reflected in its transcriptomic profile. In building AI models for biology, some areas lack data and others are compute-limited; for cell state models, we are severely short of data.
Hani Goodarzi: Perturbational data lets us move from correlational studies to causal ones, which is essential for building general models that learn how cell states change. To explore the manifold in a high-dimensional latent space, a model needs to see many different perturbations and responses in order to make generalizable predictions. Previously available public data came mostly from healthy tissue, lacked diseased cells, and was largely observational, so it could not capture the causality of gene-gene interactions. Tahoe-100 contains a large volume of perturbational data, greatly expanding the scale of available perturbation datasets, which is crucial for building models that predict how drugs affect cells. Combining Tahoe-100 with other publicly available single-cell datasets yields a dataset of hundreds of millions of cells, which will help train machine learning models. To build models that learn how different cell types (heart, brain, liver, or bone) change, we need to train on those different cell types.

Chapters
The Tahoe-100M dataset, the world's largest single-cell RNA sequencing dataset, is a landmark achievement in AI for biology. It enables machine learning applications like virtual cell models, transforming drug discovery and representing a new era in understanding cellular behavior.
  • Tahoe-100M is the world's largest single-cell RNA sequencing dataset.
  • It enables machine learning applications, including virtual cell models and drug discovery.
  • It's comparable to ImageNet's impact on machine vision, potentially driving a similar leap in cellular modeling.

Transcript


Hi, listeners. Welcome back to No Priors. Today we're here with the CEO, CTO, and core investigator of the Arc Institute, as well as the co-founders of Vevo, to talk about their release of the Tahoe-100, the largest single-cell drug-perturbed dataset ever created, as well as where we are in AI for biology, why we need a virtual cell model and not just protein structure prediction models,

and when we should finally expect to see treatments from this growing use of machine learning in bio. Hi, I'm Johnny and I work on single-cell RNA sequencing at Vevo. I'm Nima. I'm one of the founders together with Johnny. I'm a quantum chemist by background, but I've converted to being a computational chemist who loves playing with biological data.

And we are building Vevo to really do that, to predict how chemicals interact with cells in different biological contexts. Some people call it the virtual cell. That's basically what we're working on. I'm Patrick Hsu, one of the founders at the Arc Institute, which is working at the interface of biology and machine learning to try to understand and one day treat complex human diseases, which are most of the major killers.

I'm Dave, CTO at the Arc Institute, focused on computational biology and building novel AI models for biology. I'm Hani. I'm a core investigator at Arc. I work very closely with Dave and Patrick to push our virtual cell initiative. Congratulations, everyone. It's a big day. Let's jump right into it. What is the Tahoe-100 and what is the significance of it? So Tahoe-100 is the world's biggest single-cell RNA sequencing dataset, and it enables

basically a ton of machine learning applications, including things like the virtual cell, but it also enables a lot of drug discovery applications. And broadly, in the context of where I think we are as a field, it's kind of the beginning of a different way of doing drug discovery, of basically understanding how to build medicines and bringing AI and machine learning people into the mix. Maybe something I would add there as well.

Over the last 20 years or so, people have accumulated a massive amount of data points when it comes to protein structures, protein function, how drug molecules interact with proteins. But one thing that we haven't had as much is how different cells behave in different contexts, and how the different genes within each of those cells actually function in the presence of the other genes in these different biological contexts.

We believe this is the era for that right now. You have seen the emergence of protein language models built on the data sets that have been accumulated over the last two decades.

But now is the era for actually having data on cells, how they function, how they interact with drug molecules. And exactly as Johnny is saying, Tahoe is really a landmark dataset there that allows us to really measure how drugs interact with different cells from different patient models. And that gives us the ability to build models similar to the ones we built as protein language models, but in the cellular context. If you think about it, actually, the

history of AI is punctuated by these datasets that come about. If you think about ImageNet in 2009, which Fei-Fei Li put together, and you look at what that did to drive a nonlinear jump in machine vision,

I think the hope here is that by producing data sets, particularly perturbational data sets that allow us to elucidate cellular responses, that we'll be able to actually drive forward the ability to model at the cellular level, not just at the protein level. And so I think this is one of those moments.

Yeah, so lots of people have been talking about what those foundational data sets look like for biology, right? And this has been really useful for training protein structure prediction models like AlphaFold, built on CASP, the competition built on top of PDB data. But how do you do this for cells and cellular dynamics, which is really what tells us about biology and how it responds in health and disease? So I think those are the core steps forward, where we want to bring up our

our ability to study higher levels of abstraction in biology, not just the individual molecular machines, but how they operate in the context of an entire cell.

Congrats also to the entire ARC team. Given you are working on both virtual cell models and protein structure prediction, protein language models, can you contextualize a little bit why we need both and where we are in the progress of each? I think we're learning that, right? We're looking at these emergent properties of biology by training these large-scale foundation models on...

nucleic acids and these virtual cell models that we'll talk more about today. And, you know, we have this debate often internally. So, you know, I have sort of an engineering, computer background. So the way I think about it is, if you think about the cell, the DNA lives in the ROM, the read-only memory, right? So it's coding for the cell. But then the RNA lives in the RAM, so it's like the working memory.

And the RNA is constantly changing its expression level. It's almost like one of those 1980s graphic equalizers where you've got 20,000 bars, one for each gene. And it's constantly adjusting its expression level depending on what the cell is experiencing, whether that's the environment, whether it's stress, whether it's aging, whether it's a disease state or a healthy state.

And what we're trying to do with this data, I think, as a field is create these virtual cell models, which in a way is kind of inferring a notional CPU for the cell. So how does the cell respond to an input? That input could be an edited gene. It could be an application of a drug. And then how does that reflect in the transcriptomic profile?

And so that CPU is sort of an analogy to the AI model that you want to build. And then once you have an AI model, what's really interesting is you can start posing the inverse question, which is, you know, given a cell in a certain disease state that's exhibiting a certain transcriptomic profile,

How do I perturb that cell, whether that's a gene edit or a drug, to perturb it back into that healthy state? And I think that's what's really exciting about this data: it enables these models, which then enable these tools, and hopefully that can accelerate drug discovery.
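
A minimal sketch, in Python, of the forward and inverse framing described here. The model class, the perturbation encoding, and the brute-force search are hypothetical illustrations under stated assumptions, not Arc's or Vevo's actual implementation.

```python
import numpy as np

N_GENES = 20_000  # roughly the "20,000 bars" in the graphic-equalizer analogy


class VirtualCellModel:
    """Hypothetical interface: map a baseline expression profile plus a
    perturbation (a drug or a gene edit) to a predicted transcriptomic profile."""

    def predict(self, baseline: np.ndarray, perturbation: dict) -> np.ndarray:
        # A trained model would go here; this stub returns the baseline unchanged.
        return baseline


def inverse_search(model, diseased, healthy, candidate_perturbations):
    """The inverse question: which candidate perturbation is predicted to move
    the diseased profile closest to the healthy one?"""
    def distance(p):
        return np.linalg.norm(model.predict(diseased, p) - healthy)
    return min(candidate_perturbations, key=distance)


# Usage with placeholder data: screen a small hypothetical drug library in silico.
model = VirtualCellModel()
diseased = np.random.rand(N_GENES)
healthy = np.random.rand(N_GENES)
library = [{"type": "drug", "name": f"compound_{i}"} for i in range(100)]
print(inverse_search(model, diseased, healthy, library))
```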

One thing I will quickly add to that: when we think about different domains in biology, and building AI models of those domains, there are parts where we are data-poor, and there are parts where we are compute-limited. I think when it comes to, for example, DNA language models, again, thanks to the field and decades of having sequenced a ton of genomes, we are not as much data-limited, but

compute, and specifically context, how long a stretch of DNA we can actually consume, what sizes of inputs, all of that is actually a big limitation that we have tried to solve. But when it comes to cell state models, that is an area where we are absolutely very much data-limited, because being able to profile cells at single-cell resolution is basically a new technology. It

has emerged over the past decade, with the real explosion over the past five or six years. And we are just getting to the point of being able to generate that kind of data at scale. And it's not just the scale.

Before scBaseCamp, which is the dataset that's being released together with Tahoe on the Virtual Cell Atlas created by the Arc folks, by basically collating all of the publicly available data, I think the number of human cells that had been collated together was on the order of 45 to 50 million, maybe 60 million single-cell data points if you're generous.

But the scale is one thing. The question is also how much information content there is in this data. Quality, yeah. And is it coming from very different biological contexts? We actually built early versions of some of those virtual cell models, call them single-cell foundation models or whatever name you use for them. And what we saw is that if you take the 16 million and downsample it by, like,

even 99%, so you just use 1% of that data to train your models, the model's performance doesn't drop that much. So it means the information content of the data you're using to train those models is not that high.
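
A sketch of the kind of downsampling check described here: train the same model on shrinking random fractions of the cells and see whether held-out performance actually degrades. The scikit-learn Ridge model, the synthetic data, and the R² score are placeholders for illustration, not the single-cell foundation models being discussed.

```python
import numpy as np
from sklearn.linear_model import Ridge
from sklearn.model_selection import train_test_split


def downsampling_check(X, y, fractions=(1.0, 0.10, 0.01), seed=0):
    """Train on random subsets of the training cells and report held-out R^2;
    a nearly flat curve suggests the data's information content is low."""
    X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=seed)
    rng = np.random.default_rng(seed)
    scores = {}
    for frac in fractions:
        n = max(1, int(frac * len(X_tr)))
        idx = rng.choice(len(X_tr), size=n, replace=False)
        scores[frac] = Ridge().fit(X_tr[idx], y_tr[idx]).score(X_te, y_te)
    return scores


# Usage with synthetic stand-in data (rows = cells, columns = gene features):
X = np.random.rand(10_000, 50)
y = X @ np.random.rand(50) + 0.1 * np.random.randn(10_000)
print(downsampling_check(X, y))
```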

So having data that comes from very different biological contexts is key to providing the information content the model needs so that it can learn. And that goes back to what Dave was saying about perturbational datasets. Perturbation allows you to create new contexts, to create new cell states that the model can then learn from and therefore be used for different types of applications. And then I'll let Johnny talk later about

what the challenge is with creating this perturbational dataset. Before we go there, actually, can we zoom out for a second and just have you describe in layman's terms what the data actually tells you, and where the prior data came from, even if it was information-poor? If you look at the data that's been generated over the past decade, it's basically all kinds of academic groups like us, or

some people in industry, generating all these little datasets. And there's a ton of problems with this. First, there's batch effects. So even one person running an experiment on two different days, their data looks different, even if it's the same cells. So when you think about trying to build the internet of biology, which is what you need to build this ChatGPT moment...

In terms of scale. In terms of scale, right? Because you need big data. Machine learning is not going to do anything for us if we don't have big data. You have a data set that's poorly labeled, that's super batchy, that's maybe moderately useful for AI, but it's not there. And so this data set, it is basically doubling the size of all the data that's out there cumulatively over the past decade. It covers 50 different cancer models from different patients. So it's

cells from 50 different patients, 1,200 drug treatments. So it's a really deep and rich data set that effectively has no batch effects. And so we think this is actually not only an additional data set for machine learning, we actually think it's the first data set that's going to enable machine learning in this space.
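
As a rough sanity check on the numbers quoted here, a sketch of the arithmetic; the per-condition cell count is an illustrative average, not a reported figure.

```python
cell_lines = 50          # cancer models derived from 50 different patients
drug_treatments = 1_200  # drug treatments screened against the pooled cells

conditions = cell_lines * drug_treatments
print(conditions)        # 60,000 drug-by-cell-line conditions

total_cells = 100_000_000          # Tahoe-100M single-cell profiles
print(total_cells // conditions)   # ~1,666 cells per condition, on average
```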

One thing that might be worth touching on is why perturbational data, right? And I think the key is that we're going from correlation, which is what a lot of biological research is. It's descriptive, right? You kind of stare at things, you try to see when you poke this way what else is changing, and go from associative changes to causation, right? And that's where going with genetic or chemical perturbations allows you to have a very clear before and after, where you have this set of

causal changes that can actually drive a particular cell state. The key is to be able to do this in a generalizable way, so you can look across many different cell types, many different tissue types, different epigenetic states.

An ML model, in order to learn a general sense of the possible cell states, would need to train on that diversity of data as well. I mean, in a topological sense, what the model is trying to do is create a manifold in a high-dimensional latent space. And so to actually

explore that manifold, the model needs to see lots of different perturbations and responses. And then once you do that, you have this generalized manifold that allows the model to make predictions for data it hadn't seen in its training sample but that still fits the manifold. To make it even more tangible, the data that was available publicly before this, almost the entirety of it comes from healthy tissue. Very little comes from actually diseased cells.

And almost all of it, not the entirety, almost all of it is observational in the sense you take cells from a liver sample and you do single-cell RNA sequencing on that. And that basically has the limitation that Patrick was talking about. Does it capture the causality of the gene-gene interactions you're trying to model? And the second piece is does it allow you to model how then a new perturbation

actually will impact the cells, whether it's genetic perturbation or drug perturbation, which really is the focus for Tahoe in this situation, perturbational data sets. So in that sense, like Tahoe, I think when you put all of the perturbational data sets in the world together, if you're generous, it's like one to two million single cell data points.

I mean, this is publicly available data. We don't know as much about what's inside different organizations. Publicly available is 2 million. Tahoe is 100 million. So we have basically increased that massively. Now, when you couple that with this huge amount of observational datasets from different species that are out in the world, which is basically what the Arc folks did.

They put together the entirety of that data. It turns out to be 200, 230 million single-cell data points already out there. And they have tried to reduce as much as possible the variations between these datasets, so they're consistent with each other and can be used to train machine learning models. That's the significance of this day. I want to make a finer point on this. I think the key is, if you want a model that can learn about changes going on in the heart or in the brain or in the liver or in the bones, you need to be able to train across all those different cell types.

But if you just look at normal healthy cells, right, you wouldn't necessarily learn about how the manifold and latent space changes in disease, right? And so being able to look at many different types of cells

and tissue types across different cancers is one way to be able to get at those really critical disease states that both basic science and drug discovery really care about. - How should we think about 100 million data points or 230 million data points and the scale of this release in terms of where we are? Is that enough to be useful? What do we know about scaling laws now? - The short answer is, it's a very hard question. We won't know until we get there.

What we can draw inspiration from is basically large language models in human language and also things like DNA language models where we do have enough data to do scaling laws.

And where we are around there, you know, you're around 1 trillion training tokens is where you want to hit, right, by and large. Like GPT-3 was, I think, half a trillion tokens. ESM-3 was 700 billion tokens, so close to a trillion. Yeah, so a trillion sounds like a comfortable mark to hit.

So then the question becomes, how do you count tokens? Because, you know, cells in the end are not exactly sentences. But what if you count genes and their expression as tokens? I think this collection that we have put together gets us close to where we want to be to start asking and answering those questions, actually.

So I think it puts us at a few hundred billion training tokens for the kinds of model architectures that we have now. Think of each cell in these datasets as capturing 2,000 to 5,000 genes, and each gene and its expression is basically a token in what we're doing. So 100 million single-cell data points is akin to around 200 to 300 billion tokens.
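
Written out, the token arithmetic sketched here looks like the following; treating each expressed gene and its measured expression as one token is the assumption stated above.

```python
cells = 100_000_000               # Tahoe-100M single-cell profiles
genes_per_cell = (2_000, 5_000)   # expressed genes captured per cell, per the discussion

low = cells * genes_per_cell[0]
high = cells * genes_per_cell[1]
print(f"{low / 1e9:.0f}B to {high / 1e9:.0f}B tokens")   # 200B to 500B tokens
# i.e. the few-hundred-billion-token range quoted, versus the roughly 1T-token
# target cited for language and DNA models
```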

Now, there's a finer point there, which is how many of these tokens are actually informative to the model. I'm probably not asking this question the correct way, but you'll understand the gist of it: how do you decide where in the genetic landscape to start? How do you choose perturbations? I think you want to match, and this goes kind of the same with drugs, you want to match

your perturbation toolkit, which is the kind of arrows you throw at the biology, against the biology you have. So for cancer, that means going after cancer-relevant genes, genes that impact the growth of cells,

genes that impact DNA regulation, and also drugs that target key cancer pathways. So those are the cancer-relevant questions. But this dataset, even though it's heavily based around these kinds of chemical perturbations in cancer, these pathways are so conserved and fundamental that they broadly apply to the neuroscience space, or just to immune cell development in general. So I think it's really...

the foundation model that's going to be able to take this data, ingest it, build a model, train, and then understand basically how to translate that data to a different context entirely. Yeah, so this is the key point.

I think this is one of the really special things we have at Vevo, and it's this Mosaic platform. So it allows us to take cells from many different patients, and in cancer this means all kinds of cancers, lung cancer, pancreatic cancer, et cetera, from different patients, each with their own special genetics, and pool them together into a single mosaic tumor, which we can then reproducibly screen hundreds or thousands of drugs against.

And so this key innovation basically allows us, instead of testing one cancer model at a time, to test tens or hundreds. And it makes this a really scalable data generation platform. This is what we used to generate Tahoe-100. When we think about how we actually build these pools in terms of information content, we want to maximize it by covering a lot of cancer patients, right? So for this dataset, we covered the biggest cancer types by how frequently they occur annually.

But then as we continue to grow this dataset, we want to think about rare disease, bring in maybe more coverage of different parts of the cancer space, informed by the machine learning that will basically help us fill in the gaps in the foundation models. Another direction is chemical space. So if the question is about how we prioritize, frankly,

when you generate, in five weeks, 50 times more perturbational data than is publicly available, and that publicly available data took 10 years to generate,

you don't have to prioritize as much. And that's the beauty of it, in my opinion. You can go large on the chemical space. You can go large on the patient sample space. And that way, you don't have to really a priori come up with a hypothesis about what is it that I have to feed the models. You can just generate as much as you want. You can be more unbiased as scale increases. Exactly. Hypothesis-free, unbiased kinds of data generation. That's really, I think, the beauty here. Yeah, let the data surprise you. Let the data surprise you, exactly.

And this is one of those things that I'd like to talk about as well. And I hope these are the people we have here, the representatives of the new generation of biologists. But I think one thing that has been slowing the progress in bio is the fact that we have always been super hypothesis driven. And I think the reason is that a lot of these experiments are expensive. They take a lot of time, a lot of resources.

But I think now is really the time that the sequencing cost has gone down, single-cell sample prep cost has gone down, compute cost has gone down. I think it's also time to change that kind of mentality in bio as well and go a little more, be a little more courageous, you know? Be a little more freewheeling in terms of your data generation and the kinds of, you know, samples you put together. So, yeah, I think, I mean...

This is the view from an outsider. I want to talk about being more ambitious in bio and the open sourcing of this in a second, but I think we should just zoom out and talk about, in layman's terms, what the platform does, and you can correct me if any of this is wrong. So you have these tumors that...

are a mosaic of cells from different patients representing a huge amount of patient genetic variation. And each mouse then can actually be treated with different drugs where the signal you extract after is the interaction of drugs against each of these different patient types. That's right. Okay. Nobody else thinks this is crazy.

Not crazy because it's happening every day in our labs, but it's really science fiction, honestly. Great, great. I'm just trying to, like, boil it down to a very, you know, simple non-biologist understanding of when you say it's a platform with this super tumor where you can pull all of this data out, it is wild to think about how efficient that is in comparison to, well, we will observe, you know, one patient type at a time. I think this is actually a super interesting point. If you map

the number of tokens per experiment across the last 50 years of biomedical research. It'll look like the hockey stick that all investors and founders really know and love, just going up and to the right.

And I think the way that we think about doing science is changing, right, based on this. And there's, I think, a roiling discussion today about hypothesis-driven versus hypothesis-free research, right? Should we be doing mechanism versus large-scale profiling? But honestly, I think this stuff is going to wash out with scale. Yeah.

Exactly. You don't have to choose between those two. Yeah. And maybe that's my hot take with this era of machine learning and biology is the vast majority of mechanistic data that's been generated to date is really made to ask very specific, very well-scoped questions. And just way more tokens per experiment is just going to be the way to do it.

I mean, maybe I can say it another way. I think in biology, what we have done is we have treated humans as the foundation models that ingest information and come up with hypotheses, right? And, you know, but now we actually want to go beyond that, because humans, of course, come with their own intuitions and biases and all of that. At UCSF, for example, we often...

just say that, you know, we use some of our medicinal chemist folks, like Kevan Shokat, as kind of the last layer of a neural network, right? They have built this intuition of, you know, this chemical that I generated via this AI model, does it actually look like something that is real, right? Yeah. And they can't even verbalize why they think

it might be a good drug or not. People criticize these models for hallucinating. But if you think about it, the process of scientific research just involves hallucination. That's what creativity is. So you're all adherents to Sutton's bitter lesson in this field as well: the intuition being baked into the models or the process is not the right thing, we just need to scale data. At least we hope that you don't have to make that choice there, you know?

We're seeing evidence of scaling laws in biology across proteins, right? That's been shown in the protein language models, and across DNA, which is what we've shown in our Evo series of models from Arc. We're also seeing inference-time scaling laws, as in our most recent study. So there are sort of early signs of promise, although, you know, we'll need good benchmarks and we'll have to look at this across different data types over time.

The funny thing for me is that if you've been in this field for long enough, and I come from the quantum and computational chemistry side of things, every time you take a certain success from field A and you want to translate it to field B, a lot of people, including in our own organizations, come up with a list of 100 different reasons why the learnings from field A are not applicable in field B.

But then you get surprised every time. And then the next time, when you're trying to do the same thing from field B to field C, the same kinds of lists actually start emerging. In a way, I think something that's underappreciated is that those same models that learned human language are learning the language of structural biology. And then with the Evo work, they are learning the language of biology,

DNA, you know, and this is incredible. It's not trivial, and by the way, if you have been in the field long enough, you know that there were a lot of people who were saying no, no, these protein language models are never going to work, you need domain-specific kinds of models to model this kind of phenomenon. So I think this is really the ethos that we have to bring here, what Hani was saying, that we should use the learnings from those models and actually translate them here.

That's exactly what we should be doing. We should be thinking about what worked and at least try it in these new domains. The domain we are talking about, the domain Vevo is excited about and also the virtual cell part of Arc is excited about, is the language of systems biology. The first thing you should be doing is to try out the things that worked in the other domains in this domain. Maybe it works, maybe it doesn't, but if you don't try, you'll never know.

Music to my ears, given this is one of the only things we have really strong conviction about, at least at the fund we invest out of, that a bunch of these techniques, they work and they scale in domains where people are not sure yet, quite generally, where we wouldn't have the expertise in the traditional types of discovery and company building. But actually, they seem to apply very generally, right?

I think this is a great segue to a question of, you know, you're open-sourcing the data. Why do that? Yeah, so we generated the data at Vevo, and Vevo is a private venture-backed company, a startup. And so Johnny and I, when originally the idea of Tahoe came up, Johnny told me, yeah, Nima, there is this opportunity, we can generate 100 million single-cell data points. And I said, like, can we?

And I said, yeah, yeah, we can. And he said, okay, let's go and do it. And I think it was within hours, when we were chatting, and for transparency, Johnny, Hani, and I are co-founders of Vevo, when we were talking about it, we said, okay, let's do it and let's open source it. And why do we want to do that? Number one, we want to put a new stake in the ground. We want to show that there's a new game in town, and that it's really possible to up our game as a community, as a field.

And we wanted to show that so that people actually move on from these million single-cell data point, 100,000 single-cell data point observational datasets, up their game, and go to a much more massive scale. So that's number one. Number two, the DNA of our company is to be very, very small. It's a small team of superstars rather than hiring 100 people. Paradoxically, open sourcing actually allows us to do that.

In a way, I think we talked to Dave about Tahoe. It was the night before the new year. It was sometime between Christmas and New Year. And then Dave got really excited about it. And then our team got excited about it. If there wasn't the open source aspect to it, it wouldn't have been as exciting. The whole community is getting excited

excited about playing with this data, telling us what's good about this data, what's not good about the data. And that basically allows us, a team of three, four people that we have in-house, allows us to keep it that way and basically bring the entire community of like-minded people who have the same mission, building virtual cells, to help us in this quest. And for us, the idea was we will remove the main bottleneck in doing that, and that's, I think everybody has been saying, that's data.

I think the serendipity for this was, you know, Arc's all about mission-driven science and pushing science forward. And we were conceiving of creating what we're calling, and are launching this week, the Arc Virtual Cell Atlas. And so the idea there is really, can we find high-quality curated datasets and put them out there in the world to accelerate virtual cell modeling? And then we started chatting, and it was like, you've got what?

And it was kind of incredible. And so what we're actually assembling this week is this new Atlas. And so the star of the show in some ways is the Vevo Tahoe-100 dataset. We're also augmenting that with observational data. So we've created something called scBaseCamp,

And you can almost think of it like the Google crawler and index. So we've built this agent that goes onto the internet and basically mines public single-cell RNA sequencing data and then curates it in a very uniform way, and that results in a very nice observational dataset. It's about 230 million cells. You add that to the 100 million cells from the Tahoe-100, and you now have 330 million cells. And so this is a really exciting resource for scientists around the world who are interested in modeling

at the cell level. And it's just very complementary, you know, to have this observational dataset that you could possibly pre-train a model on, and then the perturbational dataset from the Tahoe-100, which then allows you to bring in those dynamics and make the model richer and more predictive.

We're super excited about AI Agents for Science at Arc and I think across the community. I think the capabilities are still very early today, but I think we wanted to show an example of how it can do something really useful. I think it's very clear now that basically all dry lab workflows are going to get automated with agents or with co-pilots. This would ordinarily be the type of thing that a team of computational biologists would be slaving over.

And our core insight was, well, the sequence read archive is the largest sort of repository of all biological data from next generation sequencing. You get an NIH grant, for example, you sort of post all of this data online or you publish in a journal, you put all this data as part of the journal publication. But this

is extremely fragmented, poorly annotated, really sprawling. There are no requirements that your data submissions be uniform. Exactly. It's very messy. And so we built this agent to basically crawl all of this data, collect it, organize it, process it, and in doing so, basically isolate and remove a lot of the batch effects or data biases of previous methods.

Yeah, I mean, one thing I would add is that the reality is that these data sets have been generated over time, you know, going back a decade. So tools have changed, you know, versions of tools have changed, genome builds have changed. So by just taking...

datasets and collating and collecting them together, you are kind of infecting and contaminating your data with these analytical effects, batch effects. So our idea was to... These are like foundational datasets for the entire field, right? People work with and interpret and, you know, write papers on top of all of this data. Yeah.

Yeah, so exactly. I mean, our idea was to at least remove that. There are a lot of technical, experimental batch effects, and of course, over a span of this time, chemistries of reagents have changed and all of that. But at least we do our part and remove the analytical component. And we were actually surprised at the extent to which that was observable in the data, and removing it was actually quite helpful. Maybe on the Vevo side...

This whole idea of the infection of the datasets, because of this massive batch of... I like the phrase. Maybe, Johnny, you want to talk about how many people actually did the experiment? This is like...

- Yeah, like the Tahoe. - The Tahoe? - Yeah. - Well, it ended up being actually four people from Vevo. And we did it, I think, over like three days in the end. - Think about the leverage. That's kind of nice. - Yeah. - You know why that's super important? It's because sometimes I ask Hani and Johnny, like, I don't know, what does drug A do to cell line X? And there's this phrase that biologists use: "in our hands, it does so-and-so." And this is, I mean, Dave, you tell me, we come from a different background.

Computer scientists wouldn't say that. In my environment, it's kind of a thing. There is actually a parallel, but it's not great either. Exactly. So I think that's the genius of what Johnny has built there as well. This is actually done by very few hands.

Automation is going to scale it to a certain level, and you haven't even done much automation yet. And so in that sense, the beauty of what Johnny designed in building Tahoe is exactly this: a few people, a few hands, doing exactly consistent work, doing 60,000 experiments. This is the one that has 100 million single-cell data points, but it's actually 60,000 drug-by-cell-line interactions. Having been done by four people, I think that just reduces the infection

aspect of the datasets that Johnny was talking about. So, a first-in-history opportunity for scientists and entrepreneurs to go work on this dataset and create these virtual cell models. How do you tell the quality of one of these models?

I mean, the core idea is its predictive ability, right? And so you take a cell, you perturb it. You can do that either from a genetic perspective, where you can suppress or upregulate genes, or by applying drugs, and then you look at the response. And so the measure of the model is how well it predicts what we call the differentially expressed genes.

The reality is, today the best models are very poor at this. The predictability of the DEGs, as we call them, is on the order of 10%. Is there an accepted benchmark for this today? No, but actually I think that's something else that the industry would benefit from. It's a good point.
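
To make "how well it predicts the differentially expressed genes" concrete, here is one possible metric: the overlap between the true and predicted top-k expression changes. This is an illustrative sketch, not an established community benchmark (as noted, one doesn't really exist yet).

```python
import numpy as np


def deg_overlap(true_delta: np.ndarray, pred_delta: np.ndarray, k: int = 100) -> float:
    """Fraction of the top-k truly differentially expressed genes (ranked by
    absolute change after perturbation) that also appear in the model's
    predicted top-k."""
    true_top = set(np.argsort(-np.abs(true_delta))[:k])
    pred_top = set(np.argsort(-np.abs(pred_delta))[:k])
    return len(true_top & pred_top) / k


# Usage: with random predictions over 20,000 genes, the expected overlap is
# only about k / n_genes (~0.005 here); a useful model should do far better.
n_genes = 20_000
print(deg_overlap(np.random.randn(n_genes), np.random.randn(n_genes)))
```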

But if you think about where we want to go, one of our conjectures is that one of the reasons the models aren't doing well is not simply model structure. We have a lot of rich structures that we understand in the ML space, the machine learning space. The issue is the data quality. And so the hope is that with this new Arc Virtual Cell Atlas, with the Tahoe-100, we now finally have a starting point where we can build rich models and get high predictive value out of these virtual cell models. So that's why this is really kind of an exciting moment in time.

It might be worth just also speaking plainly. Why do we even care about virtual cell models? We have real cells, right? Why not just do experiments on those, right? And I think ultimately biology is very slow, right? All of us in this room and many of you watching this

have probably tried to pick up pipettes and move clear liquids from one tube to another, and grow cells and make animals and deal with biology, which happens in real time, right? So, you know, and this is a funny story, in the last year of my PhD, my advisor

tried to convince me to start an aging project, right? Which would have involved, you know, aging animals for, you know, like two years. You know, and that's sort of one experimental round. As you can imagine, I declined. I was like, may I please, sir, graduate? But that's actually what happens, right? It's actually just our labor retention. Right, right. You're constrained by biological time, which is like completely crazy to me coming from an engineering background.

And really important to tons of fields like neurodegeneration or anything else that takes time to progress. Yeah. So, you know, these sort of massively parallelized in silico simulations sound great, but they need to be accurate. If it's 10% accurate, you're just simulating noise, right?

And so, you know, how do we go from, you know, a discipline that primarily respects experiments today to something more like physics where theory drives a lot of progress? And I think these virtual cell models are a core wedge in making that.

Well, can you actually make that more concrete, then? Like, if these virtual cell models work, and, you know, we don't even know how to measure them yet because they don't exist in any way that's productive today, but if they should, then what will scientists or the biotech field or patients expect to gain? Maybe I can speak to that from a drug discovery perspective, and the Arc folks can take the more scientific viewpoint. So what we are focused on at Vevo is to predict

how a new chemical entity interacts with cells from different patients or patient models. That really is the core of it. So Patrick was talking about in silico simulation of this. Can I predict in a computer whether this new chemical structure, and drugs are chemical structures, by the way, I hope that doesn't surprise you, is going to take the diseased cell,

like a cancer cell, from a diseased state to a healthy state, or in the case of cancer, actually to kill it, literally. If I can predict that, then my ability to design new chemicals that do that effectively, that kill the cancer cell but don't kill the healthy cells, et cetera, increases massively.

And that's what we want to do. And literally, that's the kind of data we are generating to train those kinds of models. Anything to add, Johnny? Yeah, I completely agree. I mean, a big part of our future vision and roadmap is that we think there will be a moment where, from a virtual cell model, a drug is spit out. And basically, that drug will actually cause a diseased cell to become a healthy cell again.

I think that's kind of the goal, and that will reshape how we do any kind of drug discovery. One thing I will add there is that there are two dimensions of generalizability to think about. One is basically a cell kind of dimension and then the chemical dimension. On the cell side, every disease is unique. There are similarities. There are chunks of cancer mutations and all of that that drives the disease, but there are also very much individual variations.

And you can observe cells from patients, but you cannot do, for every patient and every tumor that arises, what these folks do with Mosaic. So the idea is that using a virtual cell model, you can take those learnings and then apply them to all of these new observations that you can make in patients.

So that's one dimension. The other dimension is chemicals. In silico libraries, you have tens of millions of compounds and biologics, infinite biologics, if you really put your mind to it. But most of these have never existed and will never exist because there's no use for them. So a model that can traverse that really massive space of chemistry

to find which part of it you actually need to pay attention to, and go and synthesize and check, will be massively enabling. Because everyone else, you know, has well-behaved libraries of a couple hundred thousand compounds, and they use fragments and try to put them together. So the process of how folks design drugs today is this slow screening process. And this will allow us to really leapfrog that entire pipeline.

90% of drugs fail in clinical trials. So, you know, we're pretty bad at making drugs, right? And I think that implies two things. The first is maybe our drug matter is not very good, in the sense that its potency, its ability to bind the target, its toxicity, its pharmacokinetic profiles, all of those things, right, the ADMET properties, these types of things are not optimal.

The other is we're probably drugging the wrong target.

And I think, you know, the sort of idea of these virtual cell models is that you'll be able to significantly cut down the search space of what the right target is. And then you can actually, you know, really focus your time on making the right chemical or, you know, kind of chemical matter drug composition to actually make the right types of changes in the right types of cells, right? That's why mechanism and drug discovery are so like tightly interwoven and

That's really what we need these models to help accelerate. This is super important because this is the gist of why we need virtual cells in addition to these protein language models that everybody has been talking about. I think I said it before that protein language models speak the language of structural biology.

What does the protein structure look like, how does it fold, what does it interact with? How do you dock a ligand? Exactly. A small molecule drug. Exactly. Or how does an antibody bind to another protein? This is a binding question. Binding in the sense that you are trying to see whether one chemical binds to another chemical.

But biology is more complex. And again, I'm a computational chemist. I'm a quantum chemist. I wish, and actually I bet my PhD on, building quantum mechanical models that from a physics-based perspective go and simulate these kinds of bindings. But again, it turns out biology is a lot

more complex. There is a context to that protein target that we are trying to hit. It's part of a cell. The cell is part of a, for cancer, it's part of a tumor. The tumor is part of a broader biological system. So virtual cells, in my opinion, are going to allow us to go beyond the language of structural biology and venture into the language of systems biology.

and understand how the drug is interacting with the broader biological system, rather than simply just the one target, which we are basically already cracking the code on with protein language models. Well, then I have a higher-level systems question. We're at single cell. What about multi-cell and aggregates and organelles? And is all that going to be possible in the future?

Yes. I mean, I think the first thing on the virtual cell direction, or any modeling, is what's the right level of abstraction? And so I think our belief around the room is that the right level of abstraction is at the transcriptomic level, because you have these very complex gene pathways, and whenever a cell is reacting to its environment, that will be reflected, and is reflected, in the transcriptome.

So I think that's the first question, even within a cell: what's the right abstraction? Because if you think about a cell, it's this very exquisite piece of machinery, and you could make an arbitrarily complex model, but we believe this sort of gene level is the right level

to model. I think going beyond that, yeah, you can create very advanced models. You see people doing spheroids and organoids, where you take mixtures of cells and run them together and you try to simulate, say, cardiac tissue or brain tissue. What's really interesting is maybe you have an organoid with 20,000 cells, and

you can then still apply these techniques that we're talking about, like take these drug perturbations or these genetic perturbations, apply them to these cells, and look at the responses. And so what's happening now is you're going beyond a single cell, but you're sort of getting the intercellular dynamics captured as well in the models. But I think it just naturally ladders up from single cell through to these more multi-cell models. One last question.

Just to touch on one small comment on that one: it is a single cell that we are modeling, but that context dependency also captures a lot of the effects that arise from the environment. The models that we have, as Johnny was describing, are actually spheroid models in this specific experiment for Tahoe, but we also have in vivo models, we have humanized mice that capture some of the immune system of the mouse. So in a way, yes.

You're building an in silico model of a cell, but if a model is any good, it can simulate it in different biological contexts in the presence of this kind of immune environment, in the presence of, I don't know, in this kind of a tumor versus this other kind of tumor, in the presence of this mutation versus other mutation. So we call it single cell.

But the whole idea of having so many single-cell data points is that you have it in different contexts. Yeah, that seems like a really important nuance there. Yeah, the information of the environment is filtered through the cell. So if you're observing the cell with enough resolution, you can even predict what's in the environment. It should be represented in the model. You can also add spatial data. Oh, yeah, definitely.

Okay, I have a few hot take questions to end with. Nima, I will start with you, because we were having a passionate discussion about why it was really important to you that Vevo be a platform company versus a single-hypothesis company like 99.9% of biotechs out there. What is the difference? I think the difference is the kind of team you build and the ambition that you have, you know?

A single-hypothesis company is basically the idea of the human being as the foundation model that Hani was talking about: we come up with a hypothesis and then we go test the hell out of it in different kinds of experiments. And a company built on that hypothesis is very heavily incentivized to make that hypothesis work.

What you see actually in biotech a lot of times is that you take a drug to the clinic after you have tested it on three different patient samples. If you actually are a platform company, what that means is that what you're trying to do is to have enough hypotheses and to have a hypothesis-free way of generating new hypotheses

that doesn't make you wedded to one hypothesis. And therefore, it allows you to be actually a lot more scientific in your quest for new drugs, or for new targets to treat disease. I think that's the core of it. And we had a lot of hypotheses initially that we could have gone after, to just build, you know, a one-asset, two-asset kind of company. But we decided to make it a platform company, because it allows us to be a lot more rigorous in terms of what we actually decide to take to the clinic.

There has been a lot of news recently on a different question, which is the rise of Chinese biotechs. For the core members of the research community here, is that a threat? How do you think of it? Well, their cost basis is definitely more competitive. I think a lot of the discussion around the water cooler in the biotech and pharma industry is, how are they able to do it at this pace?

How are they able to do it at this cost? Why do their data packages look so good? They have safety, they have tox, they have all these IND-enabling studies. It's really competitive. And I think folks got really surprised at the efficiency of the pipelining and the ability to manufacture all these different antibodies, primarily. And I think that's great for the industry. I think everybody, including patients,

investors, you know, the biotech companies themselves want lower cost basis, right? We want the ability to actually make molecules that work faster. And I think all these things will, you know, kind of compete, right, in the system to be able to reduce the, right now, like pretty high cost basis of doing these things, you know, stateside, right? I think one of the core challenges right now is we have

a wide array of services and, you know, CROs and contract research collaborators that you can try to chain together. Previously, the virtual biotech was a concept that was very much in fashion, right? Folks found out in reality, when you try to do this, that even though it looks really good on paper, it's incredibly slow, right?

So then folks tried the other way, which is let's just fully vertically integrate and just own everything. Well, that was incredibly expensive. And obviously the answer is maybe more Goldilocks in the middle. We need really competent vendors and CROs that understand the drug discovery and development process. Then we need the individual companies to be able to run in a really capital efficient and lean way. And I think

the industry is trying to reshape around these changes right now, to figure out the right way to build startups, the right way to build drugs. Yeah, I think I totally agree. I think it's an important moment. I think one thing that I haven't seen is that we actually acknowledge it. It just kind of hit us in the face. And I think it's because

I think the US is the innovation hub, but I think we need to basically be more intentional about that in biotech. I think you see innovation in tech. I think you see that as kind of the mantra. I think innovation in biotech has actually been viewed as kind of the things that the Chinese CROs and companies are good at. I think what we're finding out is that that's not actually innovation.

My hypothesis is that the kinds of things that we're working on, we're really putting big data and AI into kind of the first layer of how we do biology. That's what innovation should look like in our space. And if we don't, as a community, push that forward, we're not going to have that innovation in the industry. And Johnny's saying it slapped us in our face. It caught us by surprise. But actually, one of the first conversations that Johnny and I had three years ago

when we were thinking about starting Vevo, was actually Johnny telling me about this thing that's happening in China as well, and this whole thesis around the commoditization of a lot of the things that we think are so massively important, you know, like molecular design, et cetera, et cetera. So I think in that sense, I do agree, and I think

there are two ways to do it. One is regulatory capture: try to lobby the government and everything to put a limit on how much we can interact with the Chinese companies. Here's the other way: make it part of our ecosystem and change our thinking about business models and the way we build our teams. To Patrick's point, you know, do we build a fully integrated team with $100 million in the bank, or a small 14-person team like we are at Vevo? I think these are the kinds of things we should be thinking about. And actually, I want to make this into a bigger statement that's a little more

Reagan-esque. I think it's morning in bio, in the sense that we should be playing a different kind of game here. And if you want to stick to the same old-school way of doing things, it's not going to work. The old-school way is what? It's a lot of planning. You know, we were texting about this with Dave a couple of days ago.

If I had, I don't know, a penny for every time some massive organization announces an extraordinarily impressive thing and says, oh, we're going to give it to you in three to five years, honestly, I would be super rich right now. This is the ethos in bio. You announce this massive thing and you say you're going to do it in three to five years. No, I think it's the time. We have the tools.

It's the time to build, and it's the time to do it right now. That's the way Evo 2 actually gets created in a matter of months, from the first Evo paper to what happened. That's the way Tahoe gets created. The second piece is small, super-focused teams of superstars. Massive organizations, the vertically integrated ones, it's not just the capital intensity. They're actually very inefficient too. They go very slowly. They get bogged down in a lot of bureaucracy.

And I think the third piece is associated with this naysaying thing. Again, in everything you want to do in bio, there are a lot of these very strong biologists who will tell you why this is not going to work. I think that has to change. We have to change it. We have to think very differently about this. We have to try things out. And now we have the tools to do it. On this last point, when I talk to pharma CEOs, they'll say, oh, AI and drug discovery, very interesting.

But you know what? I actually don't spend that much of my top-line budget on drug discovery. Most of it is wrapped up in clinical development. And so a lot of them actually are much more excited about things like natural language workflows to summarize clinical trial documents, which are these massive regulatory filings and summarize them and make it easier to write these things and read them and just more normal AI stuff. Stratifying cohorts. Yeah.

Yeah, and reducing costs in that part of the cycle. And I think the thing that they're going to see as these models get better, right, virtual cell models actually help you find the right target, where you can actually point the cannon in the right direction and measure

twice and cut once, is that the cost basis for the industry will go down and the accuracy should go up. I'm really glad both of you actually just brought up the naysayers, because if you weren't going to, I was going to. I think I have now been pitched...

"AI for biotech" companies for at least a decade, right? And we haven't seen lots of treatments, and there's also just the natural life cycle of bringing treatments to market, so let's say you actually need 11-plus years generally. But if you're going to leave a broader audience with a single claim about why this is true, obviously there were different approaches from, let's say,

a decade ago, it might have been computer vision and consumer-scale sequencing data, right? But why should this work now, or when should we actually begin to see treatments from these approaches in machine learning? I mean, I'd go back to analogies in the machine learning space. You know, we called them artificial neural networks for a long, long time. And then people would get all wrapped up around, oh, this perceptron can't model an exclusive-or gate or whatever. Perceptron, what is this, the 1990s? Exactly. Exactly.

And it sort of just bounced around for a while. And it wasn't until, you know, we had an increase in compute, an increase in data, and then more sophisticated models that you sort of hit these nonlinear inflection points, right? And I mentioned earlier the ImageNet moment in 2009, and what happened there was that it sort of drove the rise

of convolutional neural networks. I think AlexNet was the model that really showed the way. And before that, you know, you would think, oh, only humans can recognize images at high quality, a computer will never do it. Of course, now we know computers can do that better than humans. And so I think it's the same thing in AI and biology. And when I look, you know, coming into this relatively new, when I see the capability of single-cell sequencing,

it's kind of mind-blowing if you're not a biologist, this idea that at single-cell resolution we can look at how a cell's expression is changing over time. It's incredible. You take that, you then take the ability to generate lots of data around that, and then you take these much more sophisticated models and model training, and suddenly things are happening. If you look at the Evo 2 model, we trained it on 9.3 trillion tokens,

but we didn't tell it anything about DNA. We were just like, here's a lot of the DNA on the planet, every single piece of DNA we could get hold of. And then what did the model learn? It started learning all sorts of things. It knows where ribosome binding sites are. It knows what codon degeneracy is. And then one of the things we showed is it can actually predict

the pathogenicity of BRCA1 variants, right, which are known to drive breast and ovarian cancer. And it does that with an area under the ROC curve of like 0.94, if I recall, looking at Hani. And I mean, this is incredible. And we never taught it anything. It just learned this stuff, zero-shot. And so I think we're at that point of inflection now.
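
What that zero-shot evaluation looks like in outline, as a sketch: score each variant by how much it lowers the model's likelihood of the sequence, then compute the area under the ROC curve against known labels. The sequence_log_likelihood function and the variant records are hypothetical stand-ins, not the actual Evo 2 API.

```python
from sklearn.metrics import roc_auc_score


def zero_shot_variant_scores(variants, sequence_log_likelihood):
    """Score each variant as the drop in model log-likelihood when the variant
    sequence replaces the reference; larger drops suggest pathogenicity."""
    return [
        sequence_log_likelihood(v["ref_seq"]) - sequence_log_likelihood(v["alt_seq"])
        for v in variants
    ]


# Usage (labels: 1 = pathogenic, 0 = benign, e.g. from clinical annotations):
# scores = zero_shot_variant_scores(brca1_variants, sequence_log_likelihood)
# print(roc_auc_score(labels, scores))   # the figure quoted here is ~0.94
```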

I think all of us here would agree that we're at that point in time now where we're going to see that inflection, and it's going to be about the data, right? That's going to be the difference between where we were yesterday and where we are starting this week. It's going to be the data. So we're somewhere between GPT-1 and GPT-4, right? In biology. But where do you guys think we are? I'm like, I'm more like two. Yeah. We're like,

developing GPT-2, but we're like, we don't have enough data, guys. We need more data. I think if you actually go a little deeper and you talk about different domains, I think in the protein models, we are past GPT-3. When it comes to single-cell models and virtual cell models, yeah, I think GPT-1 to 2 right now. I think we're closer to GPT-1 than 2. Yeah.

That's a pretty exciting timeline, though, if you just take the progress and pace of progress in other domains and apply it here. But I think the difficulty is exactly what you said, that with GPT-4, you immediately knew what you had.

But if we hit the GPT-4 of, you know, cell state models, for example, for drug discovery, as you said, it will take some time to actually prove that point. And I think the law of small numbers always takes hold in drug discovery, right? You know, a platform that takes your success rate from 10% to like 30% is amazing, but still, it's 30%. You need to get lucky. Right.

- Right, and you still have the drug development cycle, which is on the order of 10 years. So you still gotta wait for that to prove itself. - To slowly go up in a 10-year rolling window. - That's right. - There's a counter to this. - If we're six optimists here, then I will say, we're just gonna treat it as a system; systems people, we're just gonna treat it as a system. And if this was a terribly debilitating bottleneck at the beginning, then hopefully it's a breakthrough. I think that's a great note to end on. Hani, Dave, Patrick, Nima, and Johnny, thank you so much for doing this, and congratulations. It's the data.

Find us on Twitter at NoPriorsPod. Subscribe to our YouTube channel if you want to see our faces. Follow the show on Apple Podcasts, Spotify, or wherever you listen. That way you get a new episode every week. And sign up for emails or find transcripts for every episode at no-priors.com.